Discontinuous mentions in the coref model #1468

Open
501Good opened this issue Mar 3, 2025 · 4 comments

@501Good

501Good commented Mar 3, 2025

Hello!

Is your feature request related to a problem? Please describe.
Currently, a closing index of a discontinuous mention is not captured by the regex in the convert_udcoref.py script.

For example, a conllu file like this (taken from CorefUD_Norwegian-BokmaalNARC):

# newdoc id = ap~20081210-1411542
# global.Entity = eid-etype-head-other
# newpar
# sent_id = 016148
# text = Jeg synes de er ganske nye, alle sammen, tidligst fra 1500-tallet.
1	Jeg	jeg	PRON	_	Animacy=Hum|Case=Nom|Number=Sing|Person=1|PronType=Prs	2	nsubj	_	Entity=(e44528--1)
2	synes	synes	VERB	_	Mood=Ind|Tense=Pres|VerbForm=Fin	0	root	_	_
3	de	de	PRON	_	Case=Nom|Number=Plur|Person=3|PronType=Prs	6	nsubj	_	Entity=(e44523[1/2]--1)
4	er	være	AUX	_	Mood=Ind|Tense=Pres|VerbForm=Fin	6	cop	_	_
5	ganske	ganske	ADV	_	_	6	advmod	_	_
6	nye	ny	ADJ	_	Degree=Pos|Number=Plur	2	ccomp	_	SpaceAfter=No
7	,	$,	PUNCT	_	_	8	punct	_	_
8	alle	all	DET	_	Number=Plur|PronType=Tot	3	det	_	Entity=(e44523[2/2]--1
9	sammen	sammen	ADV	_	_	8	advmod	_	Entity=e44523[2/2])|SpaceAfter=No
10	,	$,	PUNCT	_	_	8	punct	_	_
11	tidligst	tidlig	ADJ	_	Definite=Ind|Degree=Sup	13	advmod	_	_
12	fra	fra	ADP	_	_	13	case	_	_
13	1500-tallet	1500-tall	NOUN	_	Definite=Def|Gender=Neut|Number=Sing	6	conj	_	Entity=(e44532--1)|SpaceAfter=No
14	.	$.	PUNCT	_	_	2	punct	_	_

will be converted into the following json:

[
  {
    "document_id": "ap~20081210-1411542",
    "cased_words": [
      "jeg",
      "synes",
      "de",
      "er",
      "ganske",
      "nye",
      ",",
      "alle",
      "sammen",
      ",",
      "tidligst",
      "fra",
      "1500-tallet",
      "."
    ],
    "sent_id": [
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0
    ],
    "part_id": 0,
    "deprel": [
      "nsubj",
      "root",
      "nsubj",
      "cop",
      "advmod",
      "ccomp",
      "punct",
      "det",
      "advmod",
      "punct",
      "advmod",
      "case",
      "conj",
      "punct"
    ],
    "head": [
      1,
      "null",
      5,
      5,
      5,
      1,
      7,
      2,
      7,
      7,
      12,
      12,
      5,
      1
    ],
    "span_clusters": [
      [
        [
          0,
          1
        ]
      ],
      [
        [
          2,
          3
        ]
      ],
      [
        [
          12,
          13
        ]
      ]
    ],
    "word_clusters": [
      [
        0
      ],
      [
        2
      ],
      [
        12
      ]
    ],
    "head2span": [
      [
        0,
        0,
        1
      ],
      [
        2,
        2,
        3
      ],
      [
        12,
        12,
        13
      ]
    ],
    "lang": "no"
  }
]

As you can see, the second part of the mention e44523 is completely missing from the converted json.
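
The cause can be reproduced in isolation. The following is a minimal sketch, not part of the script: it runs the pattern the script currently uses against a variant that also accepts a bracketed part marker before the closing parenthesis. The names OLD_SPANS and NEW_SPANS are made up for illustration.

```python
import re

# Pattern currently used by the conversion script.
OLD_SPANS = re.compile(r"(\(\w+|[%\w]+\))")
# Variant that also accepts a "[part/total]" marker before the closing ")".
NEW_SPANS = re.compile(r"(\(\w+|[%\w]+(?:\[[\d/]+\])?\))")

entity_values = [
    "(e44523[1/2]--1)",  # single-token part: opens and closes on one word
    "(e44523[2/2]--1",   # opening of the second part
    "e44523[2/2])",      # closing of the second part
]

for value in entity_values:
    old = OLD_SPANS.findall(value)
    new = NEW_SPANS.findall(value)
    print(f"{value!r}: old={old} new={new}")
```

With the current pattern the closing element "e44523[2/2])" yields no match at all, because "]" sits between the word characters and ")". The opening index therefore stays in refdict and no span is ever emitted for that part.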

Describe the solution you'd like
It is not clear what the best solution is here, since to my knowledge the coref model does not support discontinuous mention spans.

One potential workaround would be to treat discontinuous parts of the same mention as separate mentions under a single entity.

For example, if we change this part of the conversion script:

        head2span = []
        word_total = 0
        SPANS = re.compile(r"(\(\w+|[%\w]+\))")
        for parsed_sentence in doc.sentences:
            # spans regex
            # parse the misc column, leaving on "Entity" entries
            misc = [[k.split("=")
                    for k in j
                    if k.split("=")[0] == "Entity"]
                    for i in parsed_sentence.words
                    for j in [i.misc.split("|") if i.misc else []]]
            # and extract the Entity entry values
            entities = [i[0][1] if len(i) > 0 else None for i in misc]
            # extract reference information
            refs = [SPANS.findall(i) if i else [] for i in entities]
            # and calculate spans: the basic rule is (e... begins a reference
            # and ) without e before ends the most recent reference
            # every single time we get a closing element, we pop it off
            # the refdict and insert the pair to final_refs
            refdict = defaultdict(list)
            final_refs = defaultdict(list)
            last_ref = None
            for indx, i in enumerate(refs):
                for j in i:
                    # this is the beginning of a reference
                    if j[0] == "(":
                        refdict[j[1+UDCOREF_ADDN:]].append(indx)
                        last_ref = j[1+UDCOREF_ADDN:]
                    # at the end of a reference, if we got exxxxx, that ends
                    # a particular refereenc; otherwise, it ends the last reference
                    elif j[-1] == ")" and j[UDCOREF_ADDN:-1].isnumeric():
                        if (not UDCOREF_ADDN) or j[0] == "e":
                            try:
                                final_refs[j[UDCOREF_ADDN:-1]].append((refdict[j[UDCOREF_ADDN:-1]].pop(-1), indx))
                            except IndexError:
                                # this is probably zero anaphora
                                continue
                    elif j[-1] == ")":
                        final_refs[last_ref].append((refdict[last_ref].pop(-1), indx))
                        last_ref = None
            final_refs = dict(final_refs)

like this (changed the regex on line 71 and added a condition on line 94):

        head2span = []
        word_total = 0
        SPANS = re.compile(r"(\(\w+|[%\w]+(?:\[[\d/]+\])?\))")
        for parsed_sentence in doc.sentences:
            # spans regex
            # parse the misc column, leaving on "Entity" entries
            misc = [[k.split("=")
                    for k in j
                    if k.split("=")[0] == "Entity"]
                    for i in parsed_sentence.words
                    for j in [i.misc.split("|") if i.misc else []]]
            # and extract the Entity entry values
            entities = [i[0][1] if len(i) > 0 else None for i in misc]
            # extract reference information
            refs = [SPANS.findall(i) if i else [] for i in entities]
            # and calculate spans: the basic rule is (e... begins a reference
            # and ) without e before ends the most recent reference
            # every single time we get a closing element, we pop it off
            # the refdict and insert the pair to final_refs
            refdict = defaultdict(list)
            final_refs = defaultdict(list)
            last_ref = None
            for indx, i in enumerate(refs):
                for j in i:
                    # remove the discontinuous part from a closing index, e.g. "e1[1/2])" -> "e1)"
                    if j[-1] == ")" and j[-2] == "]":
                        j = re.sub(r"\[[\d/]+\]", "", j)
                    # this is the beginning of a reference
                    if j[0] == "(":
                        refdict[j[1+UDCOREF_ADDN:]].append(indx)
                        last_ref = j[1+UDCOREF_ADDN:]
                    # at the end of a reference, if we got exxxxx, that ends
                    # a particular refereenc; otherwise, it ends the last reference
                    elif j[-1] == ")" and j[UDCOREF_ADDN:-1].isnumeric():
                        if (not UDCOREF_ADDN) or j[0] == "e":
                            try:
                                final_refs[j[UDCOREF_ADDN:-1]].append((refdict[j[UDCOREF_ADDN:-1]].pop(-1), indx))
                            except IndexError:
                                # this is probably zero anaphora
                                continue
                    elif j[-1] == ")":
                        final_refs[last_ref].append((refdict[last_ref].pop(-1), indx))
                        last_ref = None
            final_refs = dict(final_refs)

then the discontinuous mention is included in the converted json:

[
  {
    "document_id": "ap~20081210-1411542",
    "cased_words": [
      "jeg",
      "synes",
      "de",
      "er",
      "ganske",
      "nye",
      ",",
      "alle",
      "sammen",
      ",",
      "tidligst",
      "fra",
      "1500-tallet",
      "."
    ],
    "sent_id": [
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0
    ],
    "part_id": 0,
    "deprel": [
      "nsubj",
      "root",
      "nsubj",
      "cop",
      "advmod",
      "ccomp",
      "punct",
      "det",
      "advmod",
      "punct",
      "advmod",
      "case",
      "conj",
      "punct"
    ],
    "head": [
      1,
      "null",
      5,
      5,
      5,
      1,
      7,
      2,
      7,
      7,
      12,
      12,
      5,
      1
    ],
    "span_clusters": [
      [
        [
          0,
          1
        ]
      ],
      [
        [
          2,
          3
        ],
        [
          7,
          9
        ]
      ],
      [
        [
          12,
          13
        ]
      ]
    ],
    "word_clusters": [
      [
        0
      ],
      [
        2,
        7
      ],
      [
        12
      ]
    ],
    "head2span": [
      [
        0,
        0,
        1
      ],
      [
        2,
        2,
        3
      ],
      [
        7,
        7,
        9
      ],
      [
        12,
        12,
        13
      ]
    ],
    "lang": "no"
  }
]

Describe alternatives you've considered
Another alternative would be to completely ignore all the discontinuous mentions since they are not very frequent.

@AngledLuffa
Collaborator

Can you shed some light on how the mentions work / what they would translate to in English?

I'm thinking that in English, for example, it would be wrong either to treat some kinds of mentions as two separate mentions, or even to just split off the first half of a discontinuous mention as we appear to be doing here. An example:

Houjun, and very occasionally, John, have built coref models.  They ...

where we ultimately want to figure out from context whether "they" refers to the coref models or to Houjun and John.

Is that the kind of thing that is annotated as "discontinuous" in the NO dataset? In a case like this, I think the best thing would be to treat it as one big mention, if possible. A limitation of that approach would be if other mentions show up in the middle of the discontinuous mention, though.

@501Good
Author

501Good commented Mar 4, 2025

Can you shed some light on how the mentions work / what they would translate to in English?

Usually, a discontinuous mention is a mention that has been split into several parts by other words that are not part of that mention. In the Norwegian example above, the discontinuous mention is "de alle sammen" ("all of them"), and it is split into two parts by "er ganske nye" ("are quite new").

I don't speak Norwegian, but a word-for-word gloss should look like this:

Jeg synes de   er  ganske nye, alle sammen,  tidligst fra  1500-tallet.
|   |     |    |   |      |    |    |        |        |    |
I   think they are quite  new  all  together earliest from the 16th century

There is also a small number of discontinuous mentions in the English-ParCorFull corpus, but most of them only skip punctuation inside a mention. For example:

# sent_id = 481
# text = And second : This team features three gymnasts , Simone Biles , Gabby Douglas and Lauren `` Laurie '' Hernandez , who have been inspiring so many young girls of color .
1	And	and	CCONJ	CC	_	6	cc	_	_
2	second	second	ADV	RB	Degree=Pos|NumType=Ord	6	advmod	_	_
3	:	:	PUNCT	:	_	2	punct	_	_
4	This	this	DET	DT	Number=Sing|PronType=Dem	5	det	_	Entity=(e154--2-anacata:anaphoric,antetype:entity,mention:np,npmod:demonstrative,nptype:np,split:simple+antecedent
5	team	team	NOUN	NN	Number=Sing	6	nsubj	_	Entity=e154)
6	features	feature	VERB	VBZ	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	0	root	_	_
7	three	three	NUM	CD	NumForm=Word|NumType=Card	8	nummod	_	_
8	gymnasts	gymnast	NOUN	NNS	Number=Plur	6	obj	_	_
9	,	,	PUNCT	,	_	10	punct	_	_
10	Simone	Simone	PROPN	NNP	Number=Sing	8	appos	_	Entity=(e153[1/2]--1-anacata:anaphoric,mention:np,npmod:none,nptype:np,split:split+reference
11	Biles	Biles	PROPN	NNP	Number=Sing	10	flat	_	Entity=e153[1/2])
12	,	,	PUNCT	,	_	13	punct	_	_
13	Gabby	Gabby	PROPN	NNP	Number=Sing	10	conj	_	Entity=(e153[2/2]--1-anacata:anaphoric,mention:np,npmod:none,nptype:np,split:split+reference
14	Douglas	Douglas	PROPN	NNP	Number=Sing	13	flat	_	_
15	and	and	CCONJ	CC	_	16	cc	_	_
16	Lauren	Lauren	PROPN	NNP	Number=Sing	10	conj	_	_
17	``	''	PUNCT	``	_	18	punct	_	_
18	Laurie	Laurie	PROPN	NNP	Number=Sing	16	flat	_	_
19	''	''	PUNCT	''	_	18	punct	_	_
20	Hernandez	Hernandez	PROPN	NNP	Number=Sing	16	flat	_	Entity=e153[2/2])
21	,	,	PUNCT	,	_	25	punct	_	_
22	who	who	PRON	WP	PronType=Rel	25	nsubj	_	Entity=(e153--1-agreement:plural,mention:pronoun,position:subject,split:split+reference,type:anaphoric,type_of_pronoun:relative)
23	have	have	AUX	VBP	Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin	25	aux	_	_
24	been	be	AUX	VBN	Tense=Past|VerbForm=Part	25	aux	_	_
25	inspiring	inspire	VERB	VBG	Tense=Pres|VerbForm=Part	8	acl:relcl	_	_
26	so	so	ADV	RB	_	27	advmod	_	_
27	many	many	ADJ	JJ	Degree=Pos	29	amod	_	_
28	young	young	ADJ	JJ	Degree=Pos	29	amod	_	_
29	girls	girl	NOUN	NNS	Number=Plur	25	obj	_	_
30	of	of	ADP	IN	_	31	case	_	_
31	color	color	NOUN	NN	Number=Sing	29	nmod	_	_
32	.	.	PUNCT	.	_	6	punct	_	_

In CorefUD, these mentions have the format (e1[1/2] ... e1[2/2]). It is the same as a normal continuous mention, but it has additional information in square brackets, where the first number is the current part of a discontinuous mention, and the second number is the total number of parts.
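
To make the bracket format concrete, here is a small sketch that reads the entity id and the optional part marker from a single opening or closing bracket. MentionPart, PART_RE and parse_part are hypothetical names, not part of the conversion script, and the pattern assumes that entity ids contain no hyphens, brackets, or parentheses.

```python
import re
from typing import NamedTuple, Optional


class MentionPart(NamedTuple):
    entity_id: str
    part: Optional[int]   # 1-based index of this part, None if continuous
    total: Optional[int]  # total number of parts, None if continuous


# Hypothetical helper pattern: an optional "(", the entity id, then an
# optional "[part/total]" marker. Assumes ids contain no "-", "[", "]", "(", ")".
PART_RE = re.compile(
    r"\(?(?P<eid>[^\[\]()\-]+)(?:\[(?P<part>\d+)/(?P<total>\d+)\])?"
)


def parse_part(bracket: str) -> MentionPart:
    """Read the entity id and part marker from one opening or closing bracket."""
    match = PART_RE.match(bracket)
    if match is None:
        raise ValueError(f"not an entity bracket: {bracket!r}")
    part, total = match.group("part"), match.group("total")
    return MentionPart(
        match.group("eid"),
        int(part) if part is not None else None,
        int(total) if total is not None else None,
    )


print(parse_part("(e153[1/2]--1-anacata:anaphoric"))  # opening of part 1 of 2
print(parse_part("e153[2/2])"))                       # closing of part 2 of 2
print(parse_part("(e154--2-anacata:anaphoric"))       # continuous mention
```

A continuous mention simply comes back with part and total set to None, so a converter could branch on that to decide how to handle it.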

I'm thinking that in English, for example, it would be wrong either to treat some kinds of mentions as two separate mentions, or even to just split off the first half of a discontinuous mention as we appear to be doing here. An example:

Houjun, and very occasionally, John, have built coref models.  They ...

where we ultimately want to figure out from context whether "they" refers to the coref models or to Houjun and John.

I think your example demonstrates it better :)

In a case like this, I think the best thing would be to treat it as one big mention, if possible.

I think that is one way to do it. Another way is to just ignore these kinds of mentions completely.

Currently, a discontinuous mention is converted only partially: a part is captured only when it consists of a single token, e.g. (e1[1/2]--1).

A limitation of that approach would be if other mentions show up in the middle of the discontinuous mention, though.

The model supports nested mentions, so this should not be a problem, at least from a technical point of view.

In addition, here is the list of current CorefUD v1.2 corpora that have discontinuous mentions:

  • Ancient_Greek-PROIEL
  • Czech-PCEDT
  • Czech-PDT
  • English-ParCorFull
  • German-ParCorFull
  • German-PotsdamCC
  • Hungarian-KorKor
  • Hungarian-SzegedKoref
  • Norwegian-BokmaalNARC
  • Norwegian-NynorskNARC
  • Old_Church_Slavonic-PROIEL
  • Polish-PCC
  • Russian-RuCor
  • Turkish-ITCC

@AngledLuffa
Collaborator

Thanks for the explanation. Based on those examples, I'd definitely prefer to combine into one annotation rather than separate into two. Honestly, I'm not sure why they drew the line at "and" is okay but "," is not okay in the English example... basically the same word, innit?

Anyway, this primarily looks like a change in the conversion code. The model itself is span based and wouldn't digest two separate parts of a mention in any way.
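
A rough sketch of the "combine into one annotation" option, assuming the parts of one mention have already been collected as (start, end) pairs with an exclusive end, as in span_clusters. The function name covering_span is hypothetical and nothing like it exists in the conversion script today.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) with an exclusive end


def covering_span(parts: List[Span]) -> Span:
    """Collapse the parts of one discontinuous mention into a single span.

    The result also covers every word between the parts, which is the
    price of feeding a span-based model one contiguous mention.
    """
    if not parts:
        raise ValueError("a mention needs at least one part")
    start = min(s for s, _ in parts)
    end = max(e for _, e in parts)
    return (start, end)


# Parts of e44523 in the Norwegian example: "de" and "alle sammen".
print(covering_span([(2, 3), (7, 9)]))
```

For the Norwegian sentence this gives one span over "de er ganske nye , alle sammen", so the words that split the mention become part of it.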

@501Good
Author

501Good commented Mar 5, 2025

Based on those examples, I'd definitely prefer to combine into one annotation rather than separate into two.

Maybe the conversion should be language specific. I looked at other corpora in languages I know, and Russian-RuCor, for example, has many cases like this (I did a word-by-word translation, with the entity information in the last column):

В   In  Entity=(e3144[1/2]--3-ref:def,str:noun,type:coref
области the field Entity=e3144[1/2])
изучения    of study    _
индийского  of Indian   _
фольклора   folklore    _
,   ,   _
особенно    in particular _
индийской   of Indian   _
сказки  fairytale   _
и   and _
джатак  of jataka   _
(   (   _
jataka  jataka  _ 
)   )   _
,   ,   _
а   and _
также   also    _
индийского  of Indian   _
,   ,   _
главным mainly  _ 
образом _   _
буддийского of buddhist _
,   ,   _
искусства   art _ 
с   with    _
большим a big   _
успехом success _
работал worked  _
акад    academic    Entity=(e3144[2/2]--3-ref:def,str:noun,type:coref
.   .   Entity=e3144[2/2])

Or even like this, which seems completely nonsensical to me:

»   »   Entity=(e3248[1/2]--3-ref:def,str:noun,type:coref)
,   ,   _
«   «   _
Gotta   Gotta   _
Be  Be  _
Crazy   Crazy   _
»   »   _
и   and _
«   «   Entity=(e3248[2/2]--3-ref:def,str:noun,type:coref
Shine   Shine   _
On  On  _
You You _
Crazy   Crazy   _
Diamond Diamond Entity=e3248[2/2])

I am not sure why this is the case; it is probably due to a bug in the data conversion from the original format to CorefUD.

However, for your coref model, it might be better to ignore discontinuous mentions completely in some languages, to avoid marking the whole sentence as a mention, and to combine them into one mention in other languages, where it is less of a problem.
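
One way such a language-specific rule could look, as a sketch only. The POLICY table, its language codes, and the function resolve_mention are all hypothetical, and which languages should merge or drop is an open question rather than something settled here.

```python
from typing import Dict, List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) with an exclusive end

# Hypothetical per-language choice; the entries are illustrative only.
POLICY: Dict[str, str] = {"no": "merge", "en": "merge", "ru": "drop"}


def resolve_mention(parts: List[Span], lang: str,
                    default: str = "drop") -> Optional[Span]:
    """Return one span for a mention, or None if it should be ignored."""
    if not parts:
        return None
    if len(parts) == 1:
        # Continuous mentions are never affected by the policy.
        return parts[0]
    if POLICY.get(lang, default) == "merge":
        return (min(s for s, _ in parts), max(e for _, e in parts))
    return None


print(resolve_mention([(2, 3), (7, 9)], "no"))     # merged into one span
print(resolve_mention([(0, 2), (28, 30)], "ru"))   # ignored (illustrative offsets)
```

Defaulting unknown languages to "drop" is the conservative choice, since a wrongly merged mention can swallow most of a sentence.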
