Commit 384d65c

If a token is completely erased by the MWT seq2seq, restore it as a single word. #1401
1 parent 657f467 commit 384d65c

1 file changed: +4 -0 lines changed

stanza/models/mwt/trainer.py

@@ -87,6 +87,10 @@ def predict(self, batch, unsort=True):
                 pred_tokens.append("".join(pred_seq))
         else:
             pred_tokens = ["".join(seq) for seq in pred_seqs] # join chars to be tokens
+        # if any tokens are predicted to expand to blank,
+        # that is likely an error. use the original text
+        # this originally came up with the Spanish model turning 's' into a blank
+        pred_tokens = [x if x else y for x, y in zip(pred_tokens, orig_text)]
         if unsort:
             pred_tokens = utils.unsort(pred_tokens, orig_idx)
         return pred_tokens
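
The fallback added in this commit can be tried in isolation. The following is a minimal sketch, not code from the repository: the helper name `restore_blank_tokens` and the sample tokens are hypothetical, and it assumes that `pred_tokens` and `orig_text` are parallel lists of strings in the same order, as the `zip` in the diff implies.

```python
def restore_blank_tokens(pred_tokens, orig_text):
    """Return pred_tokens with every empty prediction replaced by the original token.

    Hypothetical helper for illustration; the commit inlines this logic
    as a list comprehension inside predict().
    """
    # An empty string is falsy, so `pred if pred else orig` keeps any
    # non-empty expansion and falls back to the original text otherwise.
    return [pred if pred else orig for pred, orig in zip(pred_tokens, orig_text)]


# Illustrative data only: the second token was "erased" by the model.
orig_text = ["del", "s", "casa"]
pred_tokens = ["de el", "", "casa"]

restored = restore_blank_tokens(pred_tokens, orig_text)
print(restored)
```

Because the substitution happens before `utils.unsort`, it relies on `pred_tokens` and `orig_text` still sharing the sorted batch order at that point. Note also that `zip` stops at the shorter input, so a length mismatch would silently truncate the result rather than raise an error.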
