Subtask A: Identification and classification of key phrases

Given a list of eHealth documents written in Spanish, the goal of this subtask is to identify all the key phrases per document and their classes. These key phrases are all the relevant terms (single word or multiple words) that represent semantically important elements in a sentence. The following figure shows the relevant key phrases that appear in the example sentences shown in the previous section.

Note that some key phrases (“vías respiratorias” and “60 años”) span more than one word. Key phrases will always consist of one or more complete words (i.e., not a prefix or a suffix of a word), and will never include any surrounding punctuation symbols. There are four categories or classes for key phrases: Concept, Action, Predicate, and Reference.

The input for Subtask A is a text document with one sentence per line. All sentences have been tokenized at the word level (i.e., punctuation marks, parentheses, etc. are separated from the surrounding text). The output consists of a plain text file, where each line represents a key phrase. Each line has the following format:

ID \tab START END ; START END \tab LABEL \tab TEXT

The ID is a numerical identifier that will be used in Subtask B to link key phrases with their relations. The START and END indicate the starting and ending character of the text span. Multi-word phrases in which all the words are contiguous, such as vías respiratorias, can be indicated either by a single START / END pair or by several START / END pairs (one for each word) separated by semicolons (;). Multi-word phrases in which the words are not contiguous must use semicolons to separate the different portions of the phrase. In the training documents we will always represent the words of a multi-word phrase as separate pairs, for consistency.

LABEL is one of the four categories defined above. The TEXT portion simply reproduces the full text of the key phrase. This portion will be ignored in the evaluation, so participants are free not to produce it. However, it will be provided in all training documents, and we recommend that participants also produce it, since it simplifies manual inspection during development.

In this example, a possible output file is the following:

NOTE: Column headers are optional, and only shown here for illustrative purposes.

Recap: Columns are separated by one or more TAB characters. The two numbers inside each START/END pair are separated by one SPACE character. The different START/END pairs for each multi-word are separated by one SEMICOLON ( ; ) character.
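The format rules above can be sketched in code. The following is a minimal, unofficial Python sketch for reading and writing Subtask A lines. The names `KeyPhrase`, `parse_keyphrase_line`, `format_keyphrase_line`, and `read_keyphrases` are hypothetical and are not part of any tooling provided by the organizers, and the offsets in the usage example are illustrative, not taken from the corpus. The sketch only stores the START and END numbers and makes no assumption about how they map to character positions.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class KeyPhrase:
    id: int
    spans: List[Tuple[int, int]]  # one (START, END) pair per portion of the phrase
    label: str
    text: str = ""  # optional column; ignored in the evaluation


def parse_keyphrase_line(line: str) -> KeyPhrase:
    # Columns are separated by one or more TAB characters.
    columns = re.split(r"\t+", line.rstrip("\r\n"))
    if len(columns) < 3:
        raise ValueError(f"expected at least 3 columns, got {len(columns)}")
    spans = []
    # START/END pairs are separated by semicolons; the two numbers
    # inside a pair are separated by a space.
    for pair in columns[1].split(";"):
        start, end = pair.split()
        spans.append((int(start), int(end)))
    text = columns[3] if len(columns) > 3 else ""
    return KeyPhrase(int(columns[0]), spans, columns[2], text)


def format_keyphrase_line(phrase: KeyPhrase) -> str:
    spans = ";".join(f"{start} {end}" for start, end in phrase.spans)
    fields = [str(phrase.id), spans, phrase.label]
    if phrase.text:
        fields.append(phrase.text)  # TEXT is optional
    return "\t".join(fields)


def read_keyphrases(lines: Iterable[str]) -> List[KeyPhrase]:
    phrases = []
    for line in lines:
        if not line.strip():
            continue
        first_column = re.split(r"\t+", line.strip())[0]
        if not first_column.isdigit():
            continue  # skip lines without a numeric ID, e.g. the optional header
        phrases.append(parse_keyphrase_line(line))
    return phrases


if __name__ == "__main__":
    # Illustrative offsets only.
    example = "3\t10 14;15 28\tConcept\tvías respiratorias"
    phrase = parse_keyphrase_line(example)
    print(phrase.spans, phrase.label)
    print(format_keyphrase_line(phrase))
```

The parser tolerates optional spaces around the semicolon, while the writer emits exactly one semicolon between pairs, as described in the recap.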

Subtask B: Detection of semantic relations

Subtask B continues from the output of Subtask A, by linking the key phrases detected and labelled in each document. The purpose of this subtask is to recognize all relevant semantic relationships between the entities recognized. Eight of the thirteen semantic relations defined for this challenge can be identified in the following example:

The semantic relations are divided into the following categories:

General relations (6): general-purpose relations between two concepts (involving Concept, Action, Predicate, and Reference) that have specific semantics. When any of these relations applies, it is preferred over a domain relation (tagging a key phrase as a link between two information units), since its semantics are independent of any textual label:

Contextual relations (3): allow a concept (involving Concept, Action, Predicate, and Reference) to be refined by attaching modifiers. These are:

Action roles (2): indicate which role the concepts related to an Action play:

Predicate roles (2): indicate which role the concepts related to a Predicate play:

The output for Subtask B is a plain text file where each line corresponds to a semantic relation between two key phrases, in the format:

LABEL \tab SOURCE-ID \tab DEST-ID

The LABEL (i.e., column 1) is one of the relations defined above, and the IDs identify the key phrases that participate in the relation. Note that every relation is directed, so the SOURCE-ID (i.e., column 2) and the DEST-ID (i.e., column 3) must be given in the correct direction. The exception is same-as, which is symmetric, so both directions are equivalent. For the previous example the output is:

NOTE: Column headers are optional, and only shown here for illustrative purposes.

Recap: Columns are separated by one or more TAB characters.
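As a rough illustration of these rules, the following Python sketch parses a Subtask B line and builds a comparison key that treats same-as as symmetric and every other relation as directed. The function names `parse_relation_line` and `relation_key` are hypothetical, the label `target` used in the usage example is only an illustrative label string, and the IDs are made up. How the official evaluation compares relations is not specified here, so this is an assumption about one reasonable way to do it.

```python
import re
from typing import Tuple

Relation = Tuple[str, int, int]  # (LABEL, SOURCE-ID, DEST-ID)


def parse_relation_line(line: str) -> Relation:
    # Columns are separated by one or more TAB characters.
    columns = re.split(r"\t+", line.strip())
    if len(columns) != 3:
        raise ValueError(f"expected 3 columns, got {len(columns)}")
    label, source_id, dest_id = columns
    return label, int(source_id), int(dest_id)


def relation_key(relation: Relation) -> Relation:
    """Return a key under which equivalent relations compare equal."""
    label, source_id, dest_id = relation
    if label == "same-as":
        # same-as is symmetric, so the order of the two IDs is normalized.
        source_id, dest_id = sorted((source_id, dest_id))
    return label, source_id, dest_id


if __name__ == "__main__":
    a = parse_relation_line("same-as\t7\t2")
    b = parse_relation_line("same-as\t2\t7")
    print(relation_key(a) == relation_key(b))  # both directions are equivalent

    # "target" is an illustrative label; direction matters for directed relations.
    c = parse_relation_line("target\t1\t4")
    d = parse_relation_line("target\t4\t1")
    print(relation_key(c) == relation_key(d))
```

Storing the keys in a set makes it straightforward to compare a system's output against a gold file without penalizing the direction chosen for same-as.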

Important: Note about negated concepts

The eHealth-KD corpus considers negated actions, which are manually annotated in the corresponding Brat files (which will be released after the challenge is completed). However, for competition purposes, we are not considering the annotation of negation as part of the challenge.

This means that, in the corpus, you will find sentences with negated concepts, such as: “No existe un tratamiento que restablezca la función ovárica normal.” In this and similar sentences, we still expect your system to recognize existe as Action and tratamiento as Target, as if the negation did not exist.

If in doubt, please contact the organizers at ehealth-kd@googlegroups.com.