Vicomtech @ MEDDOCAN

This page documents how to obtain and use our scripts and models.

How to

Getting set up
  1. Get the scripts and models by following the instructions in the download area.
  2. Uncompress the archive into a directory of your choosing, henceforth ROOT:

     tar -xzvf vmeddocan.tar.gz --one-top-level=ROOT
     cd ROOT

  3. (Optional) Create a Python 3 virtual environment:

     virtualenv --python=python3 venv3
     source venv3/bin/activate

  4. Install the dependencies:

     pip3 install -r requirements.txt

  5. Download and unzip the Biomedical Word Embeddings for Spanish:

     cd resources
     wget https://zenodo.org/record/2542722/files/Embeddings_2019-01-01.zip
     unzip Embeddings_2019-01-01.zip

     If you modify the default folder structure or want to place the embeddings at a location other than the resources folder, update the path in the file ROOT/src/features/embeddings.py accordingly:

     from gensim.models import Word2Vec


     def load_embeddings():
         # Make sure that filepath points to your scielo_wiki_w10_c5_300_15epoch.w2vmodel.
         filepath = 'resources/Embeddings/Embeddings_ES/Scielo_Wikipedia/300/scielo_wiki_w10_c5_300_15epoch.w2vmodel'
         return Word2Vec.load(filepath)
  6. Run the tests to ensure that the working environment is set up correctly:

     cd ..
     python3 -m unittest test.test.Test01ResourcesExist
     python3 -m unittest test.test.Test02ModelsExist
     python3 -m unittest test.test.Test03FeatureExtractionGoldenTest
     python3 -m unittest test.test.Test04DecoGoldenTest.test_01_spacy_deco
     python3 -m unittest test.test.Test04DecoGoldenTest.test_02_crf_deco
     python3 -m unittest test.test.Test04DecoGoldenTest.test_03_crfxgb_deco
     python3 -m unittest test.test.Test04DecoGoldenTest.test_04_ncrfpp_deco

    The test suite comprises four test cases:

    1. Case 01: check that the required resources can be loaded.
    2. Case 02: check that all the models exist.
    3. Case 03: check that feature extraction yields the expected results.
    4. Case 04: check that the decoding yields the expected results.
Usage

The main executable is vmeddocan.py, located in ROOT. At a minimum, it must receive as input the path to a directory with JSON files and the type of model to use for decoding:

usage: vmeddocan.py [-h] -i INPUT -m {spacy,crf,crfxgb,ncrfpp,voting}
                    [-o OUTPUT] [-p PROCESSES] [--save-features]
                    [--no-features]

required arguments:
-i, --input           Path to folder where the input JSON files are stored
-m, --model           Type of model to be used: {spacy,crf,crfxgb,ncrfpp,voting}

optional arguments:
-h, --help            Show this help message and exit
-o, --output          Path to the output folder (default: <input>-<model>)
-p, --processes       Number of parallel processes to extract features (default: <cpu count//2>)
--save-features       Output JSONs with extracted features in addition to the predictions (default: False)
--no-features         Do not extract features (default: False)

Read more about the input vmeddocan.py expects in the section How to: Input.

As for the optional arguments:

  • --output is the path to the folder where the predictions will be written. If not given, a path is computed from the input path and the type of model chosen (e.g., given the input path data/my-input-jsons and the model voting, the computed output path would be data/my-input-jsons-voting); this default is sketched after this list. The program checks whether the output path exists; if it does, it asks the user for permission to overwrite its contents.
  • Feature extraction can be time-consuming. --processes configures the number of parallel processes run to do the extraction. By default, the number of parallel processes is half the number of CPUs in your machine.
  • By default, output files contain just the predicted labels for each input token. Activate the flag --save-features in order to obtain the extracted features as well. Read more about the generated output in the section How to: Output.
  • Use the flag --no-features if the input files already contain the required features, in order to avoid unnecessary processing.
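
For illustration, here is a minimal sketch of how the defaults for --output and --processes could be computed. The helper names are hypothetical and not part of the distributed package:

import os


def default_output_path(input_path, model):
    # Hypothetical helper: input 'data/my-input-jsons' with model 'voting'
    # yields 'data/my-input-jsons-voting'.
    return input_path.rstrip('/') + '-' + model


def default_process_count():
    # Hypothetical helper: half the CPUs in the machine, at least one
    # (os.cpu_count() may return None, hence the fallback).
    return max(1, (os.cpu_count() or 2) // 2)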

Here is an example that should work for you "as is":

python3 vmeddocan.py -i ./test/test-sample -o ./usage-ex-1 -m crf --save-features

In the example above,

  1. we ask the program to process the directory ROOT/test/test-sample,
  2. we want the results saved at ROOT/usage-ex-1,
  3. we use the classifier crf, and
  4. we want the output JSONs to contain all the features extracted.

Yet another example:

python3 vmeddocan.py -i ./test/test-sample-features -o ./usage-ex-2 -m crf --no-features

This example differs crucially from the previous one in that the input JSON files already contain the required features; thus, we indicate with the flag --no-features that feature extraction is not necessary (you should notice that this example takes less time to execute). Furthermore, because we do not use the flag --save-features, the output JSON files are much smaller than those produced in the first example.

Input

The input to vmeddocan.py is a path to a directory with JSON files. The JSON files should comply with the following schema, which is followed by an example document:

{
    "type": "object",
    "properties": {
        "sentence_<index>": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "token": {
                        "type": "string"
                    },
                    "lemma": {
                        "type": "string"
                    },
                    "pos": {
                        "type": "string"
                    }
                },
                "required": ["token", "lemma", "pos"],
                "additionalProperties": false
            }
        }
    }
}

{
    "sentence_0": [
        {"token": "Nombre", "lemma": "nombre", "pos": "NCMS000"},
        {"token": ":", "lemma": ":", "pos": "Fd"},
        {"token": "Antonio", "lemma": "antonio", "pos": "NP00000"}
        {"token": ".", "lemma": ".", "pos": "Fp"}
    ],
    "sentence_1": [
        {"token": "Apellidos:", "lemma": "apellido", "pos": "NCMP000"},
        {"token": ":", "lemma": ":", "pos": "Fd"},
        {"token": "García", "lemma": "garcía", "pos": "NP00000"},
        {"token": "García", "lemma": "garcía", "pos": "NP00000"},
        {"token": ".", "lemma": ".", "pos": "Fp"}
    ]
}

That is, in addition to the word form, the input JSONs should contain the lemma and part of speech of each token. Our models have been trained on lemmas and part-of-speech tags obtained with the SPACC part-of-speech tagger, which uses the EAGLES tagset.
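
To make the expected structure concrete, here is a minimal sketch that writes a compliant input file. It is an illustration only: the lemmas and tags are written by hand, standing in for the tagger's output, and the destination path is hypothetical:

import json
import os

# Hand-crafted single-sentence document; in practice the lemma and pos
# values would come from the part-of-speech tagger.
document = {
    "sentence_0": [
        {"token": "Nombre", "lemma": "nombre", "pos": "NCMS000"},
        {"token": ":", "lemma": ":", "pos": "Fd"},
        {"token": "Antonio", "lemma": "antonio", "pos": "NP00000"},
        {"token": ".", "lemma": ".", "pos": "Fp"},
    ]
}

# Hypothetical destination folder; any folder of such JSON files works
# as input to vmeddocan.py.
os.makedirs("my-input-jsons", exist_ok=True)
with open("my-input-jsons/example.json", "w", encoding="utf-8") as f:
    json.dump(document, f, ensure_ascii=False, indent=4)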

Output

The JSON files resulting from the decoding contain at least the property prediction. They comply with the following schema, which is followed by an example document:

{
    "type": "object",
    "properties": {
        "sentence_<index>": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "token": {
                        "type": "string"
                    },
                    "prediction": {
                        "type": "string"
                    }
                },
                "required": ["token", "prediction"],
                "additionalProperties": true
            }
        }
    }
}

{
    "sentence_0": [
        {"token": "Nombre", "prediction": "O"},
        {"token": ":", "prediction": "O"},
        {"token": "Antonio", "prediction": "U-NOMBRE_SUJETO_ASISTENCIA"}
        {"token": ".", "prediction": "O"}
    ],
    "sentence_1": [
        {"token": "Apellidos:", "prediction": "O"},
        {"token": ":", "prediction": "O"},
        {"token": "García", "prediction": "B-NOMBRE_SUJETO_ASISTENCIA"},
        {"token": "García", "prediction": "L-NOMBRE_SUJETO_ASISTENCIA"},
        {"token": ".", "prediction": "O"}
    ]
}

JSON files saved with features (i.e., using the flag --save-features) contain many more properties per token. They are around 350 times larger than their respective input files (e.g., an input file of ~50 kB will produce an output file of ~17 MB).

As evidenced in the example above, our classifiers use the BILOU tagging scheme. The complete list of tags can be consulted in this article (Table 1).
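
For readers who want to post-process the predictions, the following minimal sketch groups BILOU-tagged tokens into entity mentions. It assumes well-formed tag sequences (a single U-, or B- optionally followed by I- tags and closed by L-) and is not part of the distributed package:

import json


def bilou_to_entities(tokens):
    """Group a sentence's BILOU-tagged tokens into (label, text) pairs."""
    entities, current, label = [], [], None
    for item in tokens:
        tag = item["prediction"]
        if tag.startswith("U-"):                      # single-token entity
            entities.append((tag[2:], item["token"]))
        elif tag.startswith("B-"):                    # entity begins
            current, label = [item["token"]], tag[2:]
        elif tag.startswith(("I-", "L-")) and current:
            current.append(item["token"])
            if tag.startswith("L-"):                  # entity ends
                entities.append((label, " ".join(current)))
                current, label = [], None
    return entities


# Hypothetical output file produced by the first usage example above.
with open("usage-ex-1/example.json", encoding="utf-8") as f:
    decoded = json.load(f)

for sentence in decoded.values():
    print(bilou_to_entities(sentence))
# e.g., [('NOMBRE_SUJETO_ASISTENCIA', 'Antonio')]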

License

All code modules included in the distributed package (except those listed below, under Exceptions) are distributed under the GNU Affero General Public License (AGPL) and copyrighted by Vicomtech. You can find a copy of the AGPL at https://www.gnu.org/licenses/agpl-3.0.

All models and resources included in the distributed package are distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license and copyrighted by Vicomtech. You can find a copy of the CC BY-NC-ND 4.0 license at https://creativecommons.org/licenses/by-nc-nd/4.0/.

Exceptions

Some contents of the distributed package are copyrighted by third parties and distributed under different open licenses. If you want to redistribute Vicomtech @ MEDDOCAN in whole or in part, or use it or any part of it in derived works, make sure you do so under the terms stated in the license applying to each of the modules involved.

Code components copyrighted by a third party with licenses other than AGPL are:

  1. NCRF++ is distributed under its original Apache 2.0 license. See https://github.com/jiesutd/NCRFpp for details. You'll find the license at http://www.apache.org/licenses/LICENSE-2.0.

Download

You must register in order to download the scripts and models.