Discontinuous mentions in the coref model

Hello!

**Is your feature request related to a problem? Please describe.**
Currently, the closing tag of a discontinuous mention part (for example `e44523[2/2])`) is not captured by the regex in the [convert_udcoref.py](https://github.com/stanfordnlp/stanza/blob/af3d42b70ef2d82d96f410214f98dd17dd983f51/stanza/utils/datasets/coref/convert_udcoref.py) script.

For example, a CoNLL-U file like this (taken from CorefUD_Norwegian-BokmaalNARC):

```
# newdoc id = ap~20081210-1411542
# global.Entity = eid-etype-head-other
# newpar
# sent_id = 016148
# text = Jeg synes de er ganske nye, alle sammen, tidligst fra 1500-tallet.
1	Jeg	jeg	PRON	_	Animacy=Hum|Case=Nom|Number=Sing|Person=1|PronType=Prs	2	nsubj	_	Entity=(e44528--1)
2	synes	synes	VERB	_	Mood=Ind|Tense=Pres|VerbForm=Fin	0	root	_	_
3	de	de	PRON	_	Case=Nom|Number=Plur|Person=3|PronType=Prs	6	nsubj	_	Entity=(e44523[1/2]--1)
4	er	være	AUX	_	Mood=Ind|Tense=Pres|VerbForm=Fin	6	cop	_	_
5	ganske	ganske	ADV	_	_	6	advmod	_	_
6	nye	ny	ADJ	_	Degree=Pos|Number=Plur	2	ccomp	_	SpaceAfter=No
7	,	$,	PUNCT	_	_	8	punct	_	_
8	alle	all	DET	_	Number=Plur|PronType=Tot	3	det	_	Entity=(e44523[2/2]--1
9	sammen	sammen	ADV	_	_	8	advmod	_	Entity=e44523[2/2])|SpaceAfter=No
10	,	$,	PUNCT	_	_	8	punct	_	_
11	tidligst	tidlig	ADJ	_	Definite=Ind|Degree=Sup	13	advmod	_	_
12	fra	fra	ADP	_	_	13	case	_	_
13	1500-tallet	1500-tall	NOUN	_	Definite=Def|Gender=Neut|Number=Sing	6	conj	_	Entity=(e44532--1)|SpaceAfter=No
14	.	$.	PUNCT	_	_	2	punct	_	_
```

is converted into the following JSON:

```json
[
  {
    "document_id": "ap~20081210-1411542",
    "cased_words": [
      "jeg",
      "synes",
      "de",
      "er",
      "ganske",
      "nye",
      ",",
      "alle",
      "sammen",
      ",",
      "tidligst",
      "fra",
      "1500-tallet",
      "."
    ],
    "sent_id": [
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0
    ],
    "part_id": 0,
    "deprel": [
      "nsubj",
      "root",
      "nsubj",
      "cop",
      "advmod",
      "ccomp",
      "punct",
      "det",
      "advmod",
      "punct",
      "advmod",
      "case",
      "conj",
      "punct"
    ],
    "head": [
      1,
      "null",
      5,
      5,
      5,
      1,
      7,
      2,
      7,
      7,
      12,
      12,
      5,
      1
    ],
    "span_clusters": [
      [
        [
          0,
          1
        ]
      ],
      [
        [
          2,
          3
        ]
      ],
      [
        [
          12,
          13
        ]
      ]
    ],
    "word_clusters": [
      [
        0
      ],
      [
        2
      ],
      [
        12
      ]
    ],
    "head2span": [
      [
        0,
        0,
        1
      ],
      [
        2,
        2,
        3
      ],
      [
        12,
        12,
        13
      ]
    ],
    "lang": "no"
  }
]
```

As you can see, the second part of the mention `e44523` (`alle sammen`, tokens 8 and 9) is completely missing from the converted JSON.
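To make the cause concrete, here is a minimal sketch that runs the span regex on the three `Entity` values of `e44523` from the example. `ORIGINAL` is the pattern as I read it from the linked revision of the script, and `EXTENDED` is the variant proposed in this issue; the variable names are mine and only for illustration.

```python
import re

# Pattern as read from the linked revision of convert_udcoref.py
ORIGINAL = re.compile(r"(\(\w+|[%\w]+\))")
# Proposed variant: optionally allow a "[k/n]" part marker right before ")"
EXTENDED = re.compile(r"(\(\w+|[%\w]+(?:\[[\d/]+\])?\))")

entity_values = [
    "(e44523[1/2]--1)",  # token 3: first part, opened and closed on one token
    "(e44523[2/2]--1",   # token 8: second part opens
    "e44523[2/2])",      # token 9: second part closes
]

original_matches = [ORIGINAL.findall(v) for v in entity_values]
extended_matches = [EXTENDED.findall(v) for v in entity_values]

for value, before, after in zip(entity_values, original_matches, extended_matches):
    print(f"{value!r}: original={before} extended={after}")
```

With the original pattern the last value yields no match at all, because `]` sits between the entity id and `)`, so the span opened on token 8 is never closed.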

**Describe the solution you'd like**
It is not clear what the best solution is here since, to my knowledge, the coref model does not support discontinuous mention spans.

One potential workaround would be to treat discontinuous parts of the same mention as separate mentions under a single entity.

For example, if we change this part of the conversion script: 
https://github.com/stanfordnlp/stanza/blob/af3d42b70ef2d82d96f410214f98dd17dd983f51/stanza/utils/datasets/coref/convert_udcoref.py#L69-L109

like this (the regex on line 71 is changed and a condition is added on line 94):
```python
        head2span = []
        word_total = 0
        SPANS = re.compile(r"(\(\w+|[%\w]+(?:\[[\d/]+\])?\))")
        for parsed_sentence in doc.sentences:
            # spans regex
            # parse the misc column, leaving only "Entity" entries
            misc = [[k.split("=")
                    for k in j
                    if k.split("=")[0] == "Entity"]
                    for i in parsed_sentence.words
                    for j in [i.misc.split("|") if i.misc else []]]
            # and extract the Entity entry values
            entities = [i[0][1] if len(i) > 0 else None for i in misc]
            # extract reference information
            refs = [SPANS.findall(i) if i else [] for i in entities]
            # and calculate spans: the basic rule is (e... begins a reference
            # and ) without e before ends the most recent reference
            # every single time we get a closing element, we pop it off
            # the refdict and insert the pair to final_refs
            refdict = defaultdict(list)
            final_refs = defaultdict(list)
            last_ref = None
            for indx, i in enumerate(refs):
                for j in i:
                    # remove the discontinuous part from a closing index, e.g. "e1[1/2])" -> "e1)"
                    if j[-1] == ")" and j[-2] == "]":
                        j = re.sub(r"\[[\d/]+\]", "", j)
                    # this is the beginning of a reference
                    if j[0] == "(":
                        refdict[j[1+UDCOREF_ADDN:]].append(indx)
                        last_ref = j[1+UDCOREF_ADDN:]
                    # at the end of a reference, if we got exxxxx, that ends
                    # a particular reference; otherwise, it ends the last reference
                    elif j[-1] == ")" and j[UDCOREF_ADDN:-1].isnumeric():
                        if (not UDCOREF_ADDN) or j[0] == "e":
                            try:
                                final_refs[j[UDCOREF_ADDN:-1]].append((refdict[j[UDCOREF_ADDN:-1]].pop(-1), indx))
                            except IndexError:
                                # this is probably zero anaphora
                                continue
                    elif j[-1] == ")":
                        final_refs[last_ref].append((refdict[last_ref].pop(-1), indx))
                        last_ref = None
            final_refs = dict(final_refs)
```

then the discontinuous mention is included in the converted JSON:

```json
[
  {
    "document_id": "ap~20081210-1411542",
    "cased_words": [
      "jeg",
      "synes",
      "de",
      "er",
      "ganske",
      "nye",
      ",",
      "alle",
      "sammen",
      ",",
      "tidligst",
      "fra",
      "1500-tallet",
      "."
    ],
    "sent_id": [
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0
    ],
    "part_id": 0,
    "deprel": [
      "nsubj",
      "root",
      "nsubj",
      "cop",
      "advmod",
      "ccomp",
      "punct",
      "det",
      "advmod",
      "punct",
      "advmod",
      "case",
      "conj",
      "punct"
    ],
    "head": [
      1,
      "null",
      5,
      5,
      5,
      1,
      7,
      2,
      7,
      7,
      12,
      12,
      5,
      1
    ],
    "span_clusters": [
      [
        [
          0,
          1
        ]
      ],
      [
        [
          2,
          3
        ],
        [
          7,
          9
        ]
      ],
      [
        [
          12,
          13
        ]
      ]
    ],
    "word_clusters": [
      [
        0
      ],
      [
        2,
        7
      ],
      [
        12
      ]
    ],
    "head2span": [
      [
        0,
        0,
        1
      ],
      [
        2,
        2,
        3
      ],
      [
        7,
        7,
        9
      ],
      [
        12,
        12,
        13
      ]
    ],
    "lang": "no"
  }
]
```
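The effect of the change can also be checked in isolation. The following is a simplified, standalone sketch of the span extraction, not the actual code of the script: `extract_spans`, `PART_MARKER` and `ENTITY_ID` are hypothetical names, and it assumes entity ids of the form `e<digits>` and end-exclusive spans, as in `span_clusters` above.

```python
import re
from collections import defaultdict

SPANS = re.compile(r"(\(\w+|[%\w]+(?:\[[\d/]+\])?\))")
PART_MARKER = re.compile(r"\[[\d/]+\]")
ENTITY_ID = re.compile(r"e\d+")  # assumption: ids look like "e44523"

def extract_spans(entity_values):
    """Return {entity id: [(start, end_exclusive), ...]} for one sentence."""
    open_starts = defaultdict(list)  # entity id -> stack of start indices
    spans = defaultdict(list)
    last_opened = None
    for idx, value in enumerate(entity_values):
        for tag in SPANS.findall(value or ""):
            if tag.startswith("("):
                last_opened = tag[1:]
                open_starts[last_opened].append(idx)
                continue
            # Closing tag: strip the part marker, "e1[2/2])" -> "e1"
            name = PART_MARKER.sub("", tag)[:-1]
            if ENTITY_ID.fullmatch(name):
                target = name  # a named close ends that entity
            else:
                target, last_opened = last_opened, None  # ends the last opened one
            if target is not None and open_starts[target]:
                spans[target].append((open_starts[target].pop(), idx + 1))
    return dict(spans)

# One Entity value (or None) per token of the example sentence
sentence = ["(e44528--1)", None, "(e44523[1/2]--1)", None, None, None, None,
            "(e44523[2/2]--1", "e44523[2/2])", None, None, None,
            "(e44532--1)", None]
spans = extract_spans(sentence)
print(spans)
```

Both parts of `e44523` end up as separate spans under the same entity, which is the workaround described above.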

**Describe alternatives you've considered**
Another alternative would be to completely ignore all the discontinuous mentions since they are not very frequent.
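To judge how rare they are in a given dataset, a rough count could look like the sketch below. It is only an illustration under assumptions: `count_mentions` is a hypothetical helper, it counts a mention once at its first part (`[1/n]` or no marker), and it assumes entity ids consist of word characters.

```python
import re

# Opening bracket of a mention part: "(" + entity id + optional "[k/n]" marker
OPENING = re.compile(r"\((\w+)(?:\[(\d+)/(\d+)\])?")

def count_mentions(conllu_text):
    """Return (total_mentions, discontinuous_mentions) for a CoNLL-U string."""
    total = discontinuous = 0
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 10:
            continue  # skip comments, blank lines and malformed rows
        for item in cols[9].split("|"):
            key, _, value = item.partition("=")
            if key != "Entity":
                continue
            for _eid, part, _n_parts in OPENING.findall(value):
                if part in ("", "1"):  # count each mention at its first part
                    total += 1
                if part == "1":
                    discontinuous += 1
    return total, discontinuous

# Abbreviated rows from the example above (FEATS replaced by "_")
sample = "\n".join([
    "# sent_id = 016148",
    "3\tde\tde\tPRON\t_\t_\t6\tnsubj\t_\tEntity=(e44523[1/2]--1)",
    "8\talle\tall\tDET\t_\t_\t3\tdet\t_\tEntity=(e44523[2/2]--1",
    "9\tsammen\tsammen\tADV\t_\t_\t8\tadvmod\t_\tEntity=e44523[2/2])|SpaceAfter=No",
    "13\t1500-tallet\t1500-tall\tNOUN\t_\t_\t6\tconj\t_\tEntity=(e44532--1)|SpaceAfter=No",
])
print(count_mentions(sample))
```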

