Vicomtech @ MEDDOCAN
This page contains the documentation on how to obtain and use our scripts and models.
How to
Getting set up
- Get the scripts and models by following the instructions in the download area.
- Uncompress the archive into a directory of your choosing, henceforth ROOT:

    tar -xzvf vmeddocan.tar.gz --one-top-level=ROOT
    cd ROOT

- (Optional) Create a Python 3 virtual environment:

    virtualenv --python=python3 venv3
    source venv3/bin/activate

- Install the dependencies:

    pip3 install -r requirements.txt

- Download and unzip the Biomedical Word Embeddings for Spanish:

    cd resources
    wget https://zenodo.org/record/2542722/files/Embeddings_2019-01-01.zip
    unzip Embeddings_2019-01-01.zip

  The only files needed are those under Embeddings/Embeddings_ES/Scielo_Wikipedia/300. You may safely remove the rest of the files if you have no other use for them.

  If you modify the default folder structure or want to place the embeddings at a location other than the folder resources, change the path in the file ROOT/src/features/embeddings.py accordingly:

    from gensim.models import Word2Vec


    def load_embeddings():
        # make sure that filepath points to your scielo_wiki_w10_c5_300_15epoch.w2vmodel
        filepath = 'resources/Embeddings/Embeddings_ES/Scielo_Wikipedia/300/scielo_wiki_w10_c5_300_15epoch.w2vmodel'
        return Word2Vec.load(filepath)

- Run the tests to ensure that you managed to set up the working environment:

    cd ..
    python3 -m unittest test.test.Test01ResourcesExist
    python3 -m unittest test.test.Test02ModelsExist
    python3 -m unittest test.test.Test03FeatureExtractionGoldenTest
    python3 -m unittest test.test.Test04DecoGoldenTest.test_01_spacy_deco
    python3 -m unittest test.test.Test04DecoGoldenTest.test_02_crf_deco
    python3 -m unittest test.test.Test04DecoGoldenTest.test_03_crfxgb_deco
    python3 -m unittest test.test.Test04DecoGoldenTest.test_04_ncrfpp_deco
The test suite includes 4 test cases:

- Case 01: check that the required resources can be loaded.
- Case 02: check that all the models exist.
- Case 03: check that feature extraction yields the expected results.
- Case 04: check that the decoding yields the expected results.

The last decoding test, ncrfpp, might fail if you do not have an adequate PyTorch version installed and/or your machine does not have a GPU. In that case, you will not be able to use the ncrfpp and voting classifiers, but you will still be able to use the other classifiers.
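If the ncrfpp test fails, a quick way to diagnose it is to check the installed PyTorch version and whether a GPU is visible. A minimal check from the Python interpreter, assuming PyTorch is installed at all:

    import torch

    # Report the installed PyTorch version and whether CUDA can see a GPU.
    print("torch version:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())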
Usage
The main executable is vmeddocan.py, which is located at ROOT. It must receive as input, at least, the path to a directory with JSON files and the type of model to be used for decoding:
usage: vmeddocan.py [-h] -i INPUT -m {spacy,crf,crfxgb,ncrfpp,voting}
[-o OUTPUT] [-p PROCESSES] [--save-features]
[--no-features]
required arguments:
-i, --input Path to folder where the input JSON files are stored
-m, --model Type of model to be used: {spacy,crf,crfxgb,ncrfpp,voting}
optional arguments:
-h, --help Show this help message and exit
-o, --output Path to the output folder (default: <input>-<model>)
-p, --processes Number of parallel processes to extract features (default: <cpu count//2>)
--save-features Output JSONs with extracted features in addition to the predictions (default: False)
--no-features Do not extract features (default: False)
Read more about the input vmeddocan.py expects in the section How to: Input.

As for the optional arguments:

- --output is the path to the folder where the predictions will be written. If not given, a path is computed from the input path and the type of model chosen (e.g., given the input path data/my-input-jsons and the model voting, the computed output path would be data/my-input-jsons-voting). The program checks whether the output path exists; if it does, it asks the user for permission to overwrite its contents.
- Feature extraction can be time-consuming. --processes configures the number of parallel processes to be run to do the extraction. By default, the number of parallel processes is half the CPUs in your machine.
- By default, output files contain just the predicted labels for each input token. Activate the flag --save-features in order to obtain the extracted features as well. Read more about the generated output in the section How to: Output.
- Use the flag --no-features if the input files already contain the required features, in order to avoid unnecessary processing.
Here is an example that should work for you "as is":
python3 vmeddocan.py -i ./test/test-sample -o ./usage-ex-1 -m crf --save-features
In the example above:

- we ask the program to process the directory ROOT/test/test-sample,
- we want the results saved at ROOT/usage-ex-1,
- we use the classifier crf, and
- we want the output JSONs to contain all the features extracted.
Yet another example:
python3 vmeddocan.py -i ./test/test-sample-features -o ./usage-ex-2 -m crf --no-features
The crucial difference from the previous example is that the input JSON files already contain the required features; thus, we indicate with the flag --no-features that feature extraction is not necessary. You should notice that this example takes less time to execute. Furthermore, because we do not use the flag --save-features, the output JSON files are much smaller than those produced in the first example.
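If you want to compare several classifiers on the same input, a small driver script can invoke vmeddocan.py once per model type. The following is a minimal sketch, not part of the distribution; it assumes it is run from ROOT and reuses the test sample from the examples above:

    import subprocess

    # Run vmeddocan.py once per classifier over the same input directory.
    # Skipping ncrfpp and voting may be necessary on machines without a GPU.
    models = ["spacy", "crf", "crfxgb", "ncrfpp", "voting"]

    for model in models:
        subprocess.run(
            ["python3", "vmeddocan.py",
             "-i", "./test/test-sample",
             "-o", f"./usage-ex-{model}",   # one output folder per model
             "-m", model],
            check=True,
        )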
Input
The input to vmeddocan.py is a path to a directory with JSON files. The JSON files should comply with the following model:
{
    "type": "object",
    "properties": {
        "sentence_<index>": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "token": {
                        "type": "string"
                    },
                    "lemma": {
                        "type": "string"
                    },
                    "pos": {
                        "type": "string"
                    }
                },
                "required": ["token", "lemma", "pos"],
                "additionalProperties": false
            }
        }
    }
}
For example:

{
    "sentence_0": [
        {"token": "Nombre", "lemma": "nombre", "pos": "NCMS000"},
        {"token": ":", "lemma": ":", "pos": "Fd"},
        {"token": "Antonio", "lemma": "antonio", "pos": "NP00000"},
        {"token": ".", "lemma": ".", "pos": "Fp"}
    ],
    "sentence_1": [
        {"token": "Apellidos", "lemma": "apellido", "pos": "NCMP000"},
        {"token": ":", "lemma": ":", "pos": "Fd"},
        {"token": "García", "lemma": "garcía", "pos": "NP00000"},
        {"token": "García", "lemma": "garcía", "pos": "NP00000"},
        {"token": ".", "lemma": ".", "pos": "Fp"}
    ]
}
That is, the input JSONs should contain the lemma and part of speech of each token in addition to the wordform. Our models have been trained on lemmas and part-of-speech tags obtained with the SPACC part-of-speech tagger, which uses the EAGLES tagset.
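If you produce such files programmatically, the structure is straightforward to emit with the standard library. A minimal sketch, reusing the tokens, lemmas, and EAGLES tags from the example above (the output location is hypothetical):

    import json
    import os

    # A document maps "sentence_<index>" to a list of token objects, each
    # carrying the required "token", "lemma", and "pos" properties.
    doc = {
        "sentence_0": [
            {"token": "Nombre", "lemma": "nombre", "pos": "NCMS000"},
            {"token": ":", "lemma": ":", "pos": "Fd"},
            {"token": "Antonio", "lemma": "antonio", "pos": "NP00000"},
            {"token": ".", "lemma": ".", "pos": "Fp"},
        ],
    }

    # vmeddocan.py takes the containing directory via -i, not the file itself.
    os.makedirs("data/my-input-jsons", exist_ok=True)
    with open("data/my-input-jsons/example.json", "w", encoding="utf-8") as f:
        json.dump(doc, f, ensure_ascii=False, indent=2)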
Output
The JSON files resulting from the decoding contain at least the property prediction:
{
    "type": "object",
    "properties": {
        "sentence_<index>": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "token": {
                        "type": "string"
                    },
                    "prediction": {
                        "type": "string"
                    }
                },
                "required": ["token", "prediction"],
                "additionalProperties": true
            }
        }
    }
}
For example:

{
    "sentence_0": [
        {"token": "Nombre", "prediction": "O"},
        {"token": ":", "prediction": "O"},
        {"token": "Antonio", "prediction": "U-NOMBRE_SUJETO_ASISTENCIA"},
        {"token": ".", "prediction": "O"}
    ],
    "sentence_1": [
        {"token": "Apellidos", "prediction": "O"},
        {"token": ":", "prediction": "O"},
        {"token": "García", "prediction": "B-NOMBRE_SUJETO_ASISTENCIA"},
        {"token": "García", "prediction": "L-NOMBRE_SUJETO_ASISTENCIA"},
        {"token": ".", "prediction": "O"}
    ]
}
JSON files saved with features (i.e., using the flag --save-features) contain many more properties per token. Their size is around 350 times that of their respective input files (e.g., an input file of ~50 kB will produce an output file of ~17 MB).
As evidenced in the example above, our classifiers use the BILOU tagging scheme. A complete list of tags can be consulted in this article (Table 1).
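Because the predictions follow the BILOU scheme, multi-token entities can be reassembled from an output file by grouping B-/I-/L- runs and treating U- tags as single-token entities. A minimal decoding sketch under the output schema above (the file path is hypothetical):

    import json

    def bilou_to_entities(tokens, tags):
        """Group BILOU-tagged tokens into (label, tokens) entities."""
        entities, label, span = [], None, []
        for token, tag in zip(tokens, tags):
            if tag == "O":
                label, span = None, []
            elif tag.startswith("U-"):              # single-token entity
                entities.append((tag[2:], [token]))
            elif tag.startswith("B-"):              # entity starts
                label, span = tag[2:], [token]
            elif tag.startswith("I-") and label == tag[2:]:
                span.append(token)                  # entity continues
            elif tag.startswith("L-") and label == tag[2:]:
                span.append(token)                  # entity ends
                entities.append((label, span))
                label, span = None, []
            else:                                   # malformed run: reset
                label, span = None, []
        return entities

    with open("usage-ex-1/example.json", encoding="utf-8") as f:
        doc = json.load(f)

    for sentence in doc.values():
        tokens = [t["token"] for t in sentence]
        tags = [t["prediction"] for t in sentence]
        print(bilou_to_entities(tokens, tags))

Run on the example output above, this yields one single-token entity for "Antonio" and one two-token entity for "García García", both labeled NOMBRE_SUJETO_ASISTENCIA.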
License
All code modules included in the distributed package (except those listed below, under Exceptions) are distributed under the GNU Affero General Public License (AGPL) and copyrighted by Vicomtech. You can find a copy of the AGPL at https://www.gnu.org/licenses/agpl-3.0.
All models and resources included in the distributed package are distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license and copyrighted by Vicomtech. You can find a copy of the CC BY-NC-ND 4.0 license at https://creativecommons.org/licenses/by-nc-nd/4.0/.
Exceptions
Some contents of the distributed package are copyrighted by a third party and distributed under different open licenses. If you want to redistribute Vicomtech @ MEDDOCAN in whole or in part, or use it or any part of it in derived works, make sure you are doing so under the terms stated in the license applying to each of the involved modules.
Code components copyrighted by a third party with licenses other than AGPL are:
- NCRF++ is distributed under its original Apache 2.0 license. See https://github.com/jiesutd/NCRFpp for details. You'll find the license at http://www.apache.org/licenses/LICENSE-2.0.
Download
You must register to be able to download the scripts and models.