Discontinuous mentions in the coref model #1468
Can you shed some light on how the mentions work / what they would translate to in English? I'm thinking that in English, for example, it would be wrong either to treat some kinds of mentions as two separate mentions, or even to just split off the first half of a discontinuous mention as we appear to be doing here. An example:
where we ultimately want to figure out from context if Is that the kind of thing that is annotated as "discontinuous" in the NO dataset? In a case like this, I think the best thing would be to treat it as one big mention, if possible. A limitation of that approach would be if other mentions show up in the middle of the discontinuous mention, though.
Usually, a discontinuous mention is a mention that has been split into several parts by other words that are not part of that mention. In the Norwegian example above, a discontinuous mention is "de alle sammen" ("all of them") and it was split into two parts by "er ganske nye" ("are quite new"). I don't speak Norwegian, but the word-for-word example should be like this.
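To make the idea concrete, here is a minimal Python sketch of one way to represent such a mention: as a list of inclusive token spans. The token order is reconstructed from the description above and is an assumption, and the helper names are made up for illustration; none of this is taken from the conversion script.

```python
# Token order reconstructed from the description above (an assumption).
tokens = ["de", "er", "ganske", "nye", "alle", "sammen"]

# One mention stored as a list of inclusive (start, end) token spans.
mention_parts = [(0, 0), (4, 5)]


def mention_text(tokens, parts):
    """Join the tokens of every part, skipping whatever lies in between."""
    return " ".join(" ".join(tokens[s:e + 1]) for s, e in sorted(parts))


def gap_tokens(tokens, parts):
    """Tokens between consecutive parts, i.e. words outside the mention."""
    parts = sorted(parts)
    gaps = []
    for (_, prev_end), (next_start, _) in zip(parts, parts[1:]):
        gaps.extend(tokens[prev_end + 1:next_start])
    return gaps


def is_discontinuous(parts):
    """True if at least one token separates two consecutive parts."""
    parts = sorted(parts)
    return any(b[0] - a[1] > 1 for a, b in zip(parts, parts[1:]))


print(mention_text(tokens, mention_parts))  # de alle sammen
print(gap_tokens(tokens, mention_parts))    # ['er', 'ganske', 'nye']
```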
There are also a limited number of discontinuous mentions in the English-ParCorFull corpus, but they are mostly limited to skipping the punctuation in a mention. For example:
In CorefUD, these mentions have the format
I think your example demonstrates it better :)
I think that is one way to do it. Another way is to just ignore these kinds of mentions completely. Currently, it is converted only partially, when a part of a mention consists of a single token, e.g.
The model supports nested mentions, so this should not be a problem, at least from a technical point of view. In addition, here is the list of current CorefUD v1.2 corpora that have discontinuous mentions:
Thanks for the explanation. Based on those examples, I'd definitely prefer to combine into one annotation rather than separate into two. Honestly, I'm not sure why they drew the line at "and" is okay but "," is not okay in the English example... basically the same word, innit? Anyway, this primarily looks like a change in the conversion code. The model itself is span based and wouldn't digest two separate parts of a mention in any way.
Maybe the conversion should be language specific. I looked at other corpora in languages I know and for example Russian-RoCor has many cases like this (I did a word by word translation with entity information in the last column):
Or even like this, which seems completely nonsensical to me:
I am not sure why this is the case, probably due to a bug in the data conversion from the original format to CorefUD. However, for your coref model, it might be better to ignore discontinuous mentions completely in some languages, to avoid marking the whole sentence as a mention, and to combine them into one mention in other languages, where it is less of a problem.
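A language-specific choice like that could be sketched roughly as follows. The `POLICY` table, its keys, and the `resolve` helper are hypothetical names for illustration only; which language gets which treatment is an assumption here, not a tested recommendation.

```python
# Hypothetical policy table: the keys and the choice per language are
# assumptions made for illustration.
POLICY = {"nb": "merge", "en": "merge", "ru": "ignore"}


def resolve(parts, lang, default="ignore"):
    """Return one inclusive (start, end) span, or None if the mention is dropped.

    `parts` is a list of inclusive (start, end) token spans of one mention.
    """
    parts = sorted(parts)
    if len(parts) == 1:
        # Ordinary continuous mention: nothing to decide.
        return parts[0]
    if POLICY.get(lang, default) == "merge":
        # Covering span: also swallows the tokens between the parts.
        return (parts[0][0], max(end for _, end in parts))
    return None


print(resolve([(0, 0), (4, 5)], "nb"))  # (0, 5)
print(resolve([(0, 0), (4, 5)], "ru"))  # None
```

Note that merging into a covering span includes the intervening words in the mention, which is exactly what becomes a problem when the parts are far apart.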
Hello!
Is your feature request related to a problem? Please describe.
Currently, a closing index of a discontinuous mention is not captured by the regex in the convert_udcoref.py script.
For example, a conllu file like this (taken from CorefUD_Norwegian-BokmaalNARC):
will be converted into the following json:
As you can see, the second part of the mention e44523 is completely missing from the converted json.
Describe the solution you'd like
It is not clear what the best solution is here, since to my knowledge the coref model does not support discontinuous mention spans.
One potential workaround would be to treat discontinuous parts of the same mention as separate mentions under a single entity.
For example, if we change this part of the conversion script:
stanza/stanza/utils/datasets/coref/convert_udcoref.py
Lines 69 to 109 in af3d42b
like this (changed the regex on line 71 and added a condition on line 94):
then the discontinuous mention is included in the converted json:
Describe alternatives you've considered
Another alternative would be to completely ignore all the discontinuous mentions since they are not very frequent.
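For reference, both options from this issue (keeping each part as its own mention under one entity, or dropping discontinuous mentions) can be sketched in a few lines of standalone Python. This is not the code from convert_udcoref.py: the `BRACKET` regex and `collect_spans` are hypothetical, the input is a toy list of `Entity=` values (one per token), and the sketch assumes that entity ids look like `e123` and that a part marker such as `[1/2]` follows the id on both the opening and the closing bracket.

```python
import re
from collections import defaultdict

# Hypothetical pattern (an assumption, not the regex used by the script).
# First branch: opening bracket, optionally closed on the same token.
# Second branch: closing bracket, with an optional "[i/n]" part marker.
BRACKET = re.compile(
    r"\((?P<open>e\d+)(?:\[(?P<opart>\d+)/\d+\])?[^()]*(?P<single>\))?"
    r"|(?P<close>e\d+)(?:\[(?P<cpart>\d+)/\d+\])?\)"
)


def collect_spans(entity_values, keep_discontinuous=True):
    """Map entity id -> list of inclusive [start, end] token spans.

    With keep_discontinuous=True every part becomes its own mention under
    the same entity; with False, all parts of split mentions are dropped.
    """
    open_starts = defaultdict(list)  # (eid, part) -> stack of start indices
    spans = defaultdict(list)        # eid -> [(start, end, part), ...]
    for idx, value in enumerate(entity_values):
        for m in BRACKET.finditer(value):
            eid = m.group("open") or m.group("close")
            part = m.group("opart") or m.group("cpart")
            if m.group("open") and m.group("single"):
                spans[eid].append((idx, idx, part))
            elif m.group("open"):
                open_starts[(eid, part)].append(idx)
            else:
                start = open_starts[(eid, part)].pop()
                spans[eid].append((start, idx, part))
    result = {}
    for eid, found in spans.items():
        kept = [[s, e] for s, e, part in found
                if keep_discontinuous or part is None]
        if kept:
            result[eid] = kept
    return result


# Toy input: e1 is split into two parts, e2 is an ordinary mention.
values = ["(e1[1/2])", "(e2", "", "e2)", "(e1[2/2]", "e1[2/2])"]
print(collect_spans(values))
# {'e1': [[0, 0], [4, 5]], 'e2': [[1, 3]]}
print(collect_spans(values, keep_discontinuous=False))
# {'e2': [[1, 3]]}
```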