Evaluation measures

This challenge proposes a main evaluation scenario (Scenario 1) in which both subtasks described above are performed in sequence. The submission that obtains the highest F1 score in Scenario 1 will be considered the best overall performing system of the challenge. Additionally, participants will have the opportunity to address the specific subtasks by submitting to two optional scenarios, one for each subtask. Scoring tables will also be published for each optional scenario.

Main Evaluation (Scenario 1)

This scenario evaluates both subtasks together as a pipeline. The input consists only of plain text, and the expected output is the two output files for Subtasks A and B, as described before. The measures are precision, recall and F1, computed as follows:
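
Here C, I, P, M and S denote the number of correct, incorrect, partial, missing and spurious matches, with subscripts A and B indicating the subtask they belong to; these are exactly the counts reported by the evaluation script in the sample output at the end of this section. The Scenario 1 scores combine them as shown below (note that partial and incorrect matches only exist for Subtask A):

$$ \text{Prec}_{AB} = \frac{C_A + C_B + 0.5 \cdot P_A}{C_A + I_A + P_A + S_A + C_B + S_B} \qquad \text{Rec}_{AB} = \frac{C_A + C_B + 0.5 \cdot P_A}{C_A + I_A + P_A + M_A + C_B + M_B} \qquad F_1 = \frac{2 \cdot \text{Prec}_{AB} \cdot \text{Rec}_{AB}}{\text{Prec}_{AB} + \text{Rec}_{AB}} $$

These formulas reproduce the baseline scores shown at the end of this section (for instance, Prec_AB = 446/857 ≈ 0.520 and Rec_AB = 446/1213 ≈ 0.368).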

F1 will determine the ranking of Scenario 1 and consequently of the eHealth-KD challenge.

The exact definitions of Correct, Missing, Spurious, Partial and Incorrect are presented in the following sections for each subtask.

Optional Subtask A (Scenario 2)

This scenario only evaluates Subtask A. The input is a plain text with several sentences, and the output is as described in Subtask A. To compute the scores we define correct, partial, missing, incorrect and spurious matches. The expected and actual output files do not need to agree on the ID of each phrase, nor on their order; the evaluator matches phrases based on their START and END values and their LABEL. A brief description of each type of match follows:

From these definitions, we compute precision, recall, and a standard F1 measure as follows:
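
With the same notation as above, restricted to Subtask A:

$$ \text{Prec}_A = \frac{C_A + 0.5 \cdot P_A}{C_A + I_A + P_A + S_A} \qquad \text{Rec}_A = \frac{C_A + 0.5 \cdot P_A}{C_A + I_A + P_A + M_A} \qquad F_1 = \frac{2 \cdot \text{Prec}_A \cdot \text{Rec}_A}{\text{Prec}_A + \text{Rec}_A} $$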

A precision higher than the recall means that the number of spurious identifications is smaller than the number of missing identifications, and a higher recall means the opposite. Partial matches are given half the score of correct matches, while missing and spurious identifications are given no score.

F1 will determine the ranking of Scenario 2.

Optional Subtask B (Scenario 3)

This scenario only evaluates Subtask B. The input is the plain text and the correct output from Subtask A. The expected output is as described in Subtask B. Similarly to the previous scenarios, we define correct, missing and spurious matches as follows:

We define standard precision, recall and F1 metrics as follows:
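
Since Subtask B has no partial or incorrect matches, the formulas reduce to:

$$ \text{Prec}_B = \frac{C_B}{C_B + S_B} \qquad \text{Rec}_B = \frac{C_B}{C_B + M_B} \qquad F_1 = \frac{2 \cdot \text{Prec}_B \cdot \text{Rec}_B}{\text{Prec}_B + \text{Rec}_B} $$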

F1 will determine the ranking of Scenario 3.

NOTE: Scenario 1 is more complex than solving each optional scenario separately, since errors in Subtask A will necessarily translate into errors in Subtask B. For this reason it is considered the main evaluation scenario. Additionally, it also opens the possibility for integrated end-to-end solutions that solve both subtasks simultaneously.

Running the final evaluation script

Now that the challenge is finished, we have published the official evaluation script along with gold files for the test set. This script performs the full evaluation of all scenarios for all submissions and outputs a pretty-printed representation (either JSON or CSV) that you can use to verify your results or test different submissions.

To use it, you need to place each of your submissions in its own folder with a descriptive name (such as using-cnn, using-crf, etc.), and all of them inside the data/submissions folder. As an example, when you clone (or pull from) the eHealth-KD repository, you will find inside data/submissions a subfolder named baseline, and inside it the three scenario* subfolders with the corresponding submission files. Feel free to add all your additional submissions alongside the baseline subfolder.
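
For example, assuming a single additional submission named using-cnn (the name is only illustrative), the layout would look as follows, where scenario*/ stands for the scenario subfolders exactly as they appear in the baseline:

data/
└── submissions/
    ├── baseline/
    │   ├── scenario1*/
    │   ├── scenario2*/
    │   └── scenario3*/
    └── using-cnn/
        ├── scenario1*/
        ├── scenario2*/
        └── scenario3*/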

Then run the following line from the root of the project (i.e., not inside the scripts folder). With only the baseline submission in place, it produces the following output:

$ python -m scripts.evaltest data/submissions data/testing --single --pretty

{
  "submissions": [
    {
      "scenario1": {
        "correct_A": 388,
        "correct_B": 37,
        "f1": 0.43091787439613516,
        "incorrect_A": 32,
        "missing_A": 175,
        "missing_B": 539,
        "partial_A": 42,
        "precision": 0.5204200700116686,
        "recall": 0.3676834295136026,
        "spurious_A": 268,
        "spurious_B": 90,
        "submit": "baseline"
      },
      "scenario2": {
        "correct_A": 355,
        "correct_B": 0,
        "f1": 0.5466377440347071,
        "incorrect_A": 35,
        "missing_A": 210,
        "missing_B": 592,
        "partial_A": 46,
        "precision": 0.5128900949796472,
        "recall": 0.5851393188854489,
        "spurious_A": 301,
        "spurious_B": 0,
        "submit": "baseline"
      },
      "scenario3": {
        "correct_A": 615,
        "correct_B": 40,
        "f1": 0.12307692307692308,
        "incorrect_A": 0,
        "missing_A": 0,
        "missing_B": 528,
        "partial_A": 0,
        "precision": 0.4878048780487805,
        "recall": 0.07042253521126761,
        "spurious_A": 0,
        "spurious_B": 42,
        "submit": "baseline"
      },
      "submit": "baseline"
    }
  ]
}

As output you will get all the submissions (i.e., all the subfolders inside data/submissions) evaluated on each scenario, with detailed counts of correct, incorrect, partial, missing and spurious matches, as well as precision, recall and F1. Feel free to use these metrics in your system description paper as you see fit.
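
If you prefer to post-process these results programmatically, a minimal sketch along the following lines works on the JSON shown above; it assumes you redirected the script's output to a file named results.json (the file name is only an example):

import json

# Load the evaluation results previously saved with, e.g.:
#   python -m scripts.evaltest data/submissions data/testing --single --pretty > results.json
with open("results.json", encoding="utf-8") as fp:
    results = json.load(fp)

# Print precision, recall and F1 for every submission on every scenario.
for submission in results["submissions"]:
    name = submission["submit"]
    for key, scores in sorted(submission.items()):
        if key.startswith("scenario"):
            print(f"{name} / {key}: "
                  f"P = {scores['precision']:.4f}, "
                  f"R = {scores['recall']:.4f}, "
                  f"F1 = {scores['f1']:.4f}")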

Finally, run python -m scripts.evaltest -h to see all the options of the evaluation script; it also provides CSV output and can automatically generate a table with the best submission for each scenario.