Vicomtech @ MEDDOCAN
This page contains the documentation on how to obtain and use our scripts and models.
How to
Getting set up
- Get the scripts and models by following the instructions in the download area.
- Uncompress the archive into a directory of your choosing, henceforth ROOT:

    tar -xzvf vmeddocan.tar.gz --one-top-level=ROOT
    cd ROOT

- (Optional) Create a Python 3 virtual environment:

    virtualenv --python=python3 venv3
    source venv3/bin/activate

- Install the dependencies:

    pip3 install -r requirements.txt

- Download and unzip the Biomedical Word Embeddings for Spanish:

    cd resources
    wget https://zenodo.org/record/2542722/files/Embeddings_2019-01-01.zip
    unzip Embeddings_2019-01-01.zip

  The only files needed are those under Embeddings/Embeddings_ES/Scielo_Wikipedia/300. You may safely remove the rest of the files if you have no other use for them.

  If you modify the default folder structure or want to place the embeddings at a location other than the folder resources, change the path in the file ROOT/src/features/embeddings.py accordingly:

    from gensim.models import Word2Vec


    def load_embeddings():
        # make sure that filepath points to your scielo_wiki_w10_c5_300_15epoch.w2vmodel
        filepath = 'resources/Embeddings/Embeddings_ES/Scielo_Wikipedia/300/scielo_wiki_w10_c5_300_15epoch.w2vmodel'
        return Word2Vec.load(filepath)

- Run the tests to ensure that you managed to set up the working environment:

    cd ..
    python3 -m unittest test.test.Test01ResourcesExist
    python3 -m unittest test.test.Test02ModelsExist
    python3 -m unittest test.test.Test03FeatureExtractionGoldenTest
    python3 -m unittest test.test.Test04DecoGoldenTest.test_01_spacy_deco
    python3 -m unittest test.test.Test04DecoGoldenTest.test_02_crf_deco
    python3 -m unittest test.test.Test04DecoGoldenTest.test_03_crfxgb_deco
    python3 -m unittest test.test.Test04DecoGoldenTest.test_04_ncrfpp_deco
The test suite includes 4 test cases:

- Case 01: check that the required resources can be loaded.
- Case 02: check that all the models exist.
- Case 03: check that feature extraction yields the expected results.
- Case 04: check that the decoding yields the expected results.

The last decoding test, ncrfpp, might fail if you do not have an adequate PyTorch version installed and/or your machine does not have a GPU. In that case, you will not be able to use the ncrfpp and voting classifiers, but you will still be able to use the other classifiers.
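If the ncrfpp test fails, a quick way to diagnose it is to check the installed PyTorch version and whether a GPU is visible. A minimal check from the Python interpreter, assuming PyTorch is installed at all:

    import torch

    # Report the installed PyTorch version and whether CUDA can see a GPU.
    print("torch version:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())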
Usage
The main executable is vmeddocan.py, which is located at ROOT. It must receive as input, at least, the path to a directory with JSON files and the type of model to be used for decoding:
usage: vmeddocan.py [-h] -i INPUT -m {spacy,crf,crfxgb,ncrfpp,voting}
[-o OUTPUT] [-p PROCESSES] [--save-features]
[--no-features]
required arguments:
-i, --input Path to folder where the input JSON files are stored
-m, --model Type of model to be used: {spacy,crf,crfxgb,ncrfpp,voting}
optional arguments:
-h, --help Show this help message and exit
-o, --output Path to the output folder (default: <input>-<model>)
-p, --processes Number of parallel processes to extract features (default: <cpu count//2>)
--save-features Output JSONs with extracted features in addition to the predictions (default: False)
--no-features Do not extract features (default: False)
Read more about the input vmeddocan.py expects in the section How to: Input.

As for the optional arguments:

- --output is the path to the folder where the predictions will be written. If not given, a path is computed from the input path and the type of model chosen (e.g., given the input path data/my-input-jsons and the model voting, the computed output path would be data/my-input-jsons-voting). The program checks whether the output path exists; if it does, it asks the user for permission to overwrite its contents.
- Feature extraction can be time-consuming. --processes configures the number of parallel processes to be run to do the extraction. By default, the number of parallel processes is half the CPUs in your machine.
- By default, output files contain just the predicted labels for each input token. Activate the flag --save-features in order to obtain the extracted features as well. Read more about the generated output in the section How to: Output.
- Use the flag --no-features if the input files already contain the required features, in order to avoid unnecessary processing.
Here is an example that should work for you "as is":
python3 vmeddocan.py -i ./test/test-sample -o ./usage-ex-1 -m crf --save-features
In the example above:

- we ask the program to process the directory ROOT/test/test-sample,
- we want the results saved at ROOT/usage-ex-1,
- we use the classifier crf, and
- we want the output JSONs to contain all the features extracted.
Yet another example:
python3 vmeddocan.py -i ./test/test-sample-features -o ./usage-ex-2 -m crf --no-features
The crucial difference from the previous example is that the input JSON files already contain the required features; thus, we indicate with the flag --no-features that feature extraction is not necessary. You should notice that this example takes less time to execute. Furthermore, because we do not use the flag --save-features, the output JSON files are much smaller than those produced in the first example.
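If you want to compare several classifiers on the same input, a small driver script can invoke vmeddocan.py once per model type. The following is a minimal sketch, not part of the distribution; it assumes it is run from ROOT and reuses the test sample from the examples above:

    import subprocess

    # Run vmeddocan.py once per classifier over the same input directory.
    # Skipping ncrfpp and voting may be necessary on machines without a GPU.
    models = ["spacy", "crf", "crfxgb", "ncrfpp", "voting"]

    for model in models:
        subprocess.run(
            ["python3", "vmeddocan.py",
             "-i", "./test/test-sample",
             "-o", f"./usage-ex-{model}",   # one output folder per model
             "-m", model],
            check=True,
        )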
Input
The input to vmeddocan.py is a path to a directory with JSON files. The JSON files should comply with the following model:
{
    "type": "object",
    "properties": {
        "sentence_<index>": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "token": {
                        "type": "string"
                    },
                    "lemma": {
                        "type": "string"
                    },
                    "pos": {
                        "type": "string"
                    }
                },
                "required": ["token", "lemma", "pos"],
                "additionalProperties": false
            }
        }
    }
}
For example:

{
    "sentence_0": [
        {"token": "Nombre", "lemma": "nombre", "pos": "NCMS000"},
        {"token": ":", "lemma": ":", "pos": "Fd"},
        {"token": "Antonio", "lemma": "antonio", "pos": "NP00000"},
        {"token": ".", "lemma": ".", "pos": "Fp"}
    ],
    "sentence_1": [
        {"token": "Apellidos", "lemma": "apellido", "pos": "NCMP000"},
        {"token": ":", "lemma": ":", "pos": "Fd"},
        {"token": "García", "lemma": "garcía", "pos": "NP00000"},
        {"token": "García", "lemma": "garcía", "pos": "NP00000"},
        {"token": ".", "lemma": ".", "pos": "Fp"}
    ]
}
That is, the input JSONs should contain the lemma and part of speech of each token in addition to the wordform. Our models have been trained on lemmas and part-of-speech tags obtained with the SPACC part-of-speech tagger, which uses the EAGLES tagset.
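If you produce such files programmatically, the structure is straightforward to emit with the standard library. A minimal sketch, reusing the tokens, lemmas, and EAGLES tags from the example above (the output location is hypothetical):

    import json
    import os

    # A document maps "sentence_<index>" to a list of token objects, each
    # carrying the required "token", "lemma", and "pos" properties.
    doc = {
        "sentence_0": [
            {"token": "Nombre", "lemma": "nombre", "pos": "NCMS000"},
            {"token": ":", "lemma": ":", "pos": "Fd"},
            {"token": "Antonio", "lemma": "antonio", "pos": "NP00000"},
            {"token": ".", "lemma": ".", "pos": "Fp"},
        ],
    }

    # vmeddocan.py takes the containing directory via -i, not the file itself.
    os.makedirs("data/my-input-jsons", exist_ok=True)
    with open("data/my-input-jsons/example.json", "w", encoding="utf-8") as f:
        json.dump(doc, f, ensure_ascii=False, indent=2)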
Output
The JSON files resulting from the decoding contain at least the property prediction:
{
    "type": "object",
    "properties": {
        "sentence_<index>": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "token": {
                        "type": "string"
                    },
                    "prediction": {
                        "type": "string"
                    }
                },
                "required": ["token", "prediction"],
                "additionalProperties": true
            }
        }
    }
}
For example:

{
    "sentence_0": [
        {"token": "Nombre", "prediction": "O"},
        {"token": ":", "prediction": "O"},
        {"token": "Antonio", "prediction": "U-NOMBRE_SUJETO_ASISTENCIA"},
        {"token": ".", "prediction": "O"}
    ],
    "sentence_1": [
        {"token": "Apellidos", "prediction": "O"},
        {"token": ":", "prediction": "O"},
        {"token": "García", "prediction": "B-NOMBRE_SUJETO_ASISTENCIA"},
        {"token": "García", "prediction": "L-NOMBRE_SUJETO_ASISTENCIA"},
        {"token": ".", "prediction": "O"}
    ]
}
JSON files saved with features (i.e., using the flag --save-features) contain many more properties per token. Their size is around 350 times that of their respective input files (e.g., an input file of ~50 kB will produce an output file of ~17 MB).
As evidenced in the example above, our classifiers use the BILOU tagging scheme. A complete list of tags can be consulted in this article (Table 1).
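Because the predictions follow the BILOU scheme, multi-token entities can be reassembled from an output file by grouping B-/I-/L- runs and treating U- tags as single-token entities. A minimal decoding sketch under the output schema above (the file path is hypothetical):

    import json

    def bilou_to_entities(tokens, tags):
        """Group BILOU-tagged tokens into (label, tokens) entities."""
        entities, label, span = [], None, []
        for token, tag in zip(tokens, tags):
            if tag == "O":
                label, span = None, []
            elif tag.startswith("U-"):              # single-token entity
                entities.append((tag[2:], [token]))
            elif tag.startswith("B-"):              # entity starts
                label, span = tag[2:], [token]
            elif tag.startswith("I-") and label == tag[2:]:
                span.append(token)                  # entity continues
            elif tag.startswith("L-") and label == tag[2:]:
                span.append(token)                  # entity ends
                entities.append((label, span))
                label, span = None, []
            else:                                   # malformed run: reset
                label, span = None, []
        return entities

    with open("usage-ex-1/example.json", encoding="utf-8") as f:
        doc = json.load(f)

    for sentence in doc.values():
        tokens = [t["token"] for t in sentence]
        tags = [t["prediction"] for t in sentence]
        print(bilou_to_entities(tokens, tags))

Run on the example output above, this yields one single-token entity for "Antonio" and one two-token entity for "García García", both labeled NOMBRE_SUJETO_ASISTENCIA.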
License
All code modules included in the distributed package (except those listed below, under Exceptions) are distributed under the GNU Affero General Public License (AGPL) and copyrighted by Vicomtech. You can find a copy of the AGPL at https://www.gnu.org/licenses/agpl-3.0.
All models and resources included in the distributed package are distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license and copyrighted by Vicomtech. You can find a copy of the CC BY-NC-ND 4.0 license at https://creativecommons.org/licenses/by-nc-nd/4.0/.
Exceptions
Some contents of the distributed package are copyrighted by a third party and distributed under different open licenses. If you want to redistribute Vicomtech @ MEDDOCAN in whole or in part, or use it or any part of it in derived works, make sure you are doing so under the terms stated in the license applying to each of the involved modules.
Code components copyrighted by a third party with licenses other than AGPL are:
- NCRF++ is distributed under its original Apache 2.0 license. See https://github.com/jiesutd/NCRFpp for details. You'll find the license at http://www.apache.org/licenses/LICENSE-2.0.
Download
You must register to be able to download the scripts and models.