Submission details
The challenge will be graded on Codalab.
A fully working evaluation script is provided that exactly matches the evaluation formulas used in Codalab. This way participants can evaluate their systems offline and perform hyper-parameter tuning against the same evaluation metrics used in the competition.
Baseline implementation
A baseline system is provided so participants can compare their results against it. Feel free to use the baseline as a starting point for developing your own solution, since it already covers parsing the input and generating output in the correct format.
The baseline implementation is an extremely basic strategy: it simply memorizes all the training annotations and, at test time, outputs keyphrases and relations only if they exactly match something seen in training.
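For intuition, here is a minimal conceptual sketch of that memorization strategy in Python. It is illustrative only: the real code lives in scripts/baseline.py and uses its own data structures; the attribute and function names below are made up.

# Conceptual sketch of the memorization baseline (not the actual scripts/baseline.py).
def train(training_sentences):
    # Memorize every keyphrase text/label and every relation triple seen in training.
    keyphrases = {}    # keyphrase text -> label
    relations = set()  # (origin text, destination text, label) triples
    for sentence in training_sentences:
        for kp in sentence.keyphrases:
            keyphrases[kp.text] = kp.label
        for rel in sentence.relations:
            relations.add((rel.origin_text, rel.destination_text, rel.label))
    return keyphrases, relations

def predict(sentence_text, keyphrases, relations):
    # Output only keyphrases and relations whose text exactly matches something memorized.
    predicted_keyphrases = [(text, label) for text, label in keyphrases.items()
                            if text in sentence_text]
    predicted_relations = [(a, b, label) for (a, b, label) in relations
                           if a in sentence_text and b in sentence_text]
    return predicted_keyphrases, predicted_relations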
We recommend that all participants first run the baseline implementation (on the training and development sets) and upload the results to Codalab, to get acquainted with the submission process. The following instructions detail this process.
Running the baseline implementation on the development set
The first step is to download the project and run the baseline implementation.
Clone the ehealthkd-2019 project from GitHub:
$ git clone https://github.com/knowledge-learning/ehealthkd-2019.git
Run the baseline implementation for the main scenario. The baseline is in the scripts/baseline.py file. Its arguments are:
- The path to the training file (input_training.txt).
- The path to the test file (in this case, the development file input_develop.txt).
- The path to the desired output file (submit/scenario1-main/output_scenario1.txt).
In this case we will train on the training set (600 sentences) and evaluate on the development set (100 sentences), using the same sentences for the 3 evaluation scenarios. However, in the final TEST phase you will train on both the training and development sets and evaluate on the corresponding test sets (different for each scenario).
Here is a baseline execution example:
$ cd ehealthkd-2019
# Inside the root folder ehealthkd-2019
$ python3 -m scripts.baseline data/training/input_training.txt data/development/input_develop.txt data/submit/scenario1-main/output_scenario1.txt
Then go to data/submit/scenario1-main and check that the corresponding files were generated:
$ ls -l data/submit/scenario1-main
-rw-r--r-- 1 user user 21604 abr 17 19:01 output_a_scenario1.txt
-rw-r--r-- 1 user user 1488 abr 17 19:01 output_b_scenario1.txt
-rw-r--r-- 1 user user 8756 abr 17 19:01 output_scenario1.txt
(!!!) Make sure that your files are named exactly as the files above, since the evaluation script in Codalab will expect these filenames.
(!!!) Also make sure that your submission folder contains the file output_scenario1.txt with the input sentences. This is exactly the same file you processed as input, so you can simply copy it, but remember to rename it. The baseline script already handles this. The evaluation script needs this file to verify that you processed the right sentences.
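If your own pipeline writes only the annotation files, the copy-and-rename step can be as simple as the following sketch (using the paths from this walkthrough):

# Copy the input sentences into the submission folder under the expected name.
import shutil

shutil.copy("data/development/input_develop.txt",
            "data/submit/scenario1-main/output_scenario1.txt")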
Evaluating the main scenario
Now you can run the evaluation script offline to check your results. The evaluation script is in the file scripts/score.py and its arguments are:
- The gold annotations (in this case, data/development/input_develop.txt).
- Your system’s annotations (data/submit/scenario1-main/output_scenario1.txt).
The evaluation script outputs the total number of correct, incorrect, partial, missing and spurious matches for each subtask, and the final score as defined in the Evaluation section.
$ python3 -m scripts.score data/development/input_develop.txt data/submit/scenario1-main/output_scenario1.txt
correct_A: 368
incorrect_A: 42
partial_A: 32
spurious_A: 267
missing_A: 162
correct_B: 44
spurious_B: 95
missing_B: 493
--------------------
recall: 0.3751
precision: 0.5047
f1: 0.4304
NOTE: The exact numbers you see with the baseline may vary, as the evaluation script and/or the baseline implementation may change as we discover bugs or mistakes. These numbers are for illustrative purposes only. The actual scores are the ones published in Codalab.
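For reference, the counts above combine into the final score roughly as follows (a sketch in which a partial match counts as half a correct match; the Evaluation section and scripts/score.py remain the authoritative definitions). It reproduces the example numbers above:

# Recompute the main-scenario score from the counts printed above.
correct = 368 + 44      # correct_A + correct_B
partial = 32            # partial_A (no partial_B is reported above)
incorrect = 42          # incorrect_A (no incorrect_B is reported above)
missing = 162 + 493     # missing_A + missing_B
spurious = 267 + 95     # spurious_A + spurious_B

recall = (correct + 0.5 * partial) / (correct + partial + incorrect + missing)
precision = (correct + 0.5 * partial) / (correct + partial + incorrect + spurious)
f1 = 2 * precision * recall / (precision + recall)

print(round(recall, 4), round(precision, 4), round(f1, 4))  # 0.3751 0.5047 0.4304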
Additionally, you can pass --verbose if you want to see detailed information about which keyphrases and relations were correct, missing, etc.
$ python3 -m scripts.score --verbose data/development/input_develop.txt data/submit/scenario1-main/output_scenario1.txt
=================== MISSING_A ===================
Keyphrase(text='enfrentar', label='Action', id=3)
Keyphrase(text='tubos', label='Concept', id=7)
Keyphrase(text='filtran', label='Action', id=10)
Keyphrase(text='limpian', label='Action', id=11)
Keyphrase(text='eliminando', label='Action', id=13)
... LOTS OF OUTPUT
=================== MISSING_B ===================
Relation(from='producen', to='genes', label='subject')
Relation(from='producen', to='proteínas', label='target')
Relation(from='producen', to='correctamente', label='in-context')
Relation(from='trastorno', to='niño', label='target')
Relation(from='trastorno', to='genético', label='in-context')
Relation(from='producen', to='trastorno', label='causes')
Relation(from='producen', to='trastorno', label='causes')
--------------------
recall: 0.3751
precision: 0.5047
f1: 0.4304
Evaluating the optional scenarios
You can also (optionally) perform just subtask A or subtask B, writing the results to the corresponding folders.
For subtask A, the output folder is submit/scenario2-taskA and the output file is output_scenario2.txt. Pass --skip-B to the baseline script to skip the output for subtask B. Again, we will be reusing the development set in this case, but in the TEST phase there will be an additional 100 sentences just for this scenario.
$ python3 -m scripts.baseline --skip-B data/training/input_training.txt data/development/input_develop.txt data/submit/scenario2-taskA/output_scenario2.txt
(!!!) When submitting to subtask A, please make sure to write an empty output_b_scenario2.txt file. This file will not be taken into consideration when evaluating this scenario, but it is necessary for the evaluation script when it parses the results. The baseline implementation already takes care of this detail.
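If you build the submission with your own code, a minimal sketch for creating that empty file:

# Create the required (empty) subtask B output for scenario 2.
from pathlib import Path

Path("data/submit/scenario2-taskA/output_b_scenario2.txt").touch()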
You can evaluate just scenario 2 with the evaluation script by passing --skip-B:
$ python3 -m scripts.score --skip-B data/development/input_develop.txt data/submit/scenario2-taskA/output_scenario2.txt
correct_A: 368
incorrect_A: 42
partial_A: 32
spurious_A: 267
missing_A: 162
--------------------
recall: 0.6358
precision: 0.5416
f1: 0.5849
For subtask B, the output folder is submit/scenario3-taskB and the output file is output_scenario3.txt. Pass --skip-A to the baseline script to skip the output for subtask A. Again, we will be reusing the development set in this case, but in the TEST phase there will be an additional 100 sentences just for this scenario.
$ python3 -m scripts.baseline --skip-A data/training/input_training.txt data/development/input_develop.txt data/submit/scenario3-taskB/output_scenario3.txt
(!!!) When submitting to subtask B, please make sure to copy the output_a_scenario3.txt file from the gold annotations. In this case, this is the same as the output_a_develop.txt file. This file will not be taken into consideration when evaluating this scenario, but it is necessary for the evaluation script when it parses the results. The baseline implementation already takes care of this detail.
(!!!) When submitting to subtask B, make sure to reuse the keyphrase IDs provided in the output_a_develop.txt file (or the corresponding TEST file) from the gold annotations. The baseline implementation already takes care of this detail.
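If you generate the scenario 3 submission yourself, the copy step can be sketched as follows (this assumes output_a_develop.txt sits under data/development/, next to input_develop.txt; adjust the source path to your local layout):

# Copy the gold subtask A annotations into the scenario 3 submission folder.
import shutil

shutil.copy("data/development/output_a_develop.txt",  # assumed location of the gold file
            "data/submit/scenario3-taskB/output_a_scenario3.txt")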
You can evaluate just scenario 3 with the evaluation script by passing --skip-A:
$ python3 -m scripts.score --skip-A data/development/input_develop.txt data/submit/scenario3-taskB/output_scenario3.txt
correct_B: 50
spurious_B: 32
missing_B: 487
--------------------
recall: 0.09311
precision: 0.6098
f1: 0.1616
If you have successfully generated the output files for all the scenarios, you should have the following structure in the data/submit folder:
$ ls -lR data/submit/*/
data/submit/scenario1-main/:
total 40
-rw-rw-r-- 1 user user 21604 abr 18 16:50 output_a_scenario1.txt
-rw-rw-r-- 1 user user 1488 abr 18 16:50 output_b_scenario1.txt
-rw-rw-r-- 1 user user 8756 abr 18 16:50 output_scenario1.txt
data/submit/scenario2-taskA/:
total 36
-rw-rw-r-- 1 user user 21604 abr 18 16:51 output_a_scenario2.txt
-rw-rw-r-- 1 user user 0 abr 18 16:51 output_b_scenario2.txt
-rw-rw-r-- 1 user user 8756 abr 18 16:51 output_scenario2.txt
data/submit/scenario3-taskB/:
total 36
-rw-rw-r-- 1 user user 19744 abr 18 16:51 output_a_scenario3.txt
-rw-rw-r-- 1 user user 764 abr 18 16:51 output_b_scenario3.txt
-rw-rw-r-- 1 user user 8756 abr 18 16:51 output_scenario3.txt
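Before zipping, you may optionally run a small sanity check (a hypothetical helper, not part of the provided scripts) to confirm that all nine expected files are in place:

# Verify that every expected submission file exists under data/submit.
from pathlib import Path

EXPECTED = {
    "scenario1-main": ["output_a_scenario1.txt", "output_b_scenario1.txt", "output_scenario1.txt"],
    "scenario2-taskA": ["output_a_scenario2.txt", "output_b_scenario2.txt", "output_scenario2.txt"],
    "scenario3-taskB": ["output_a_scenario3.txt", "output_b_scenario3.txt", "output_scenario3.txt"],
}

for folder, filenames in EXPECTED.items():
    for name in filenames:
        path = Path("data/submit") / folder / name
        if not path.exists():
            print("MISSING:", path)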
Running the baseline on the test set
Once the test set input files are released, you will be able to test the baseline implementation on the test set as well. Please read the details about the test set structure.
These are the necessary steps:
Run the baseline on the test scenario 1 (this may take a couple minutes):
$ python3 -m scripts.baseline data/training/input_training.txt data/testing/scenario1-main/input_scenario1.txt data/submit/scenario1-main/output_scenario1.txt
Run the baseline on the test scenario 2, skipping subtask B:
$ python3 -m scripts.baseline --skip-B data/training/input_training.txt data/testing/scenario2-taskA/input_scenario2.txt data/submit/scenario2-taskA/output_scenario2.txt
(!!!) Remember that for scenario 2 the file output_b_scenario2.txt must exist and be empty. The baseline already does this, but make sure your own implementation takes it into account.
Run the baseline on the test scenario 3, skipping subtask A:
$ python3 -m scripts.baseline --skip-A data/training/input_training.txt data/testing/scenario3-taskB/input_scenario3.txt data/submit/scenario3-taskB/output_scenario3.txt
(!!!) Remember that for scenario 3 the file output_a_scenario3.txt must exist and be an exact copy of the provided file data/testing/scenario3-taskB/output_a_scenario3.txt. The baseline already does this, but make sure your own implementation takes it into account.
Once finished, you can submit your results to Codalab.
Remember that for the duration of the challenge the results for the test set will be hidden and only shown after the competition ends.
However, you will receive error notifications if your upload is invalid. You have up to 100 different submissions.
Submitting your results to Codalab
The file submit_baseline_dev.zip contains an example submission for the TRAINING phase.
The file submit_baseline_test.zip contains an example submission for the TEST phase.
Once you have all the corresponding outputs, please bundle the content of the submit folder into a .zip file:
$ cd data/submit
$ zip -r submit.zip *
(!!!) Make sure you zip the content of the submit folder, not the submit folder itself. When in doubt, cd into data/submit and run zip there. The root of your submit.zip file should directly contain the three folders scenario1-main, etc., and not a submit folder.
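To double-check the archive layout, here is a quick optional check (again a hypothetical helper, not part of the provided scripts) that lists the top level of submit.zip:

# The three scenario folders should appear directly at the zip root, with no leading submit/ component.
import zipfile

with zipfile.ZipFile("data/submit/submit.zip") as zf:
    top_level = sorted({name.split("/")[0] for name in zf.namelist()})
    print(top_level)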
Structure of the submit folder
To recap, here is the expected structure of the submit.zip file:
- Folder scenario1-main:
  - File output_a_scenario1.txt: Your output for subtask A.
  - File output_b_scenario1.txt: Your output for subtask B.
  - File output_scenario1.txt: Sentences, copied verbatim from input.
- Folder scenario2-taskA:
  - File output_a_scenario2.txt: Your output for subtask A.
  - File output_b_scenario2.txt: Empty file, but must exist.
  - File output_scenario2.txt: Sentences, copied verbatim from input.
- Folder scenario3-taskB:
  - File output_a_scenario3.txt: Output for subtask A, copied verbatim from input.
  - File output_b_scenario3.txt: Your output for subtask B.
  - File output_scenario3.txt: Sentences, copied verbatim from input.
NOTE: For reference purposes, inside the data/submit folder you will find a submit_baseline_dev.zip file that corresponds to running the baseline following these instructions. Follow the structure of that file if in doubt.
(!!!) Please double-check the files for all three scenarios, including the output_scenario*.txt files, the empty output_b_scenario2.txt and the copied output_a_scenario3.txt. If you do not plan to participate in a given scenario, kindly reuse the baseline output for it, so that the evaluation script does not raise errors about missing files.
Uploading your results to the competition server
Please also make sure to fill in this Google Form to accept the license terms for the corpus.
Go to the Codalab competition page and register if you have not done so already. In Codalab, go to the Participate section and enter the details of your submission:
- Team name.
- Method name: a short, memorable name for the technique you are presenting.
- Method description: refers to the type of techniques used. Please write a summary (~200 words) of the techniques, algorithms or approaches used. Also specify whether you use external sources (other corpora, knowledge bases, etc.). Finally, attach one or more of the following tags describing the techniques and/or resources used in your approach. These tags will help us better understand which approaches are more popular or perform better in this task.
- K: knowledge-bases
- S: Shallow supervised methods (i.e., logistic regression, SVM, Markov models, CRF, …)
- D: Deep supervised methods (e.g, CNNs, LSTMs, …)
- U: Unsupervised methods (e.g. clustering or dimensionality reduction techniques, …)
- E: Embeddings (e.g., word2vec, BERT, ELMo, …)
- N: Standard NLP techniques (pos-tagging, AMR parsing, dependency parsing, NER, …)
- R: Hand-crafted rules
Finally, hit the Submit button and attach your zip file. If everything is OK, after a few seconds hit the Submit to leaderboard button at the bottom of the page to see your results.
Final words
DISCLAIMER: The score you achieve during the training phase is only for your own reference and should not be taken as an indication that you will achieve a similar score in the test phase. In particular, participants who achieve the highest scores in the training phase are not guaranteed to win in the TEST phase, since participating in the training phase is completely optional. Likewise, at any point we may decide to change the evaluation script, including during the blind TEST phase, if we discover any kind of bug or error. We will inform you if that is the case and provide an updated evaluation script.
Finally, if you discover a mistake in the evaluation script, please let us know at ehealth-kd@googlegroups.com or post an issue on our Issues Page on GitHub.