English mwt #1378

Merged
AngledLuffa merged 9 commits into dev from english_mwt
Apr 4, 2024
Conversation

AngledLuffa
Collaborator

For languages where the MWT words exactly make up the text of the token, build the pieces of the MWT using the text from the original token we are splitting. This should fix many of the errors observed in #1371.
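
A minimal sketch of the idea, not the code from this PR: when the predicted pieces have the same total length as the original token, each piece can be replaced by the matching slice of the original text. The function name `rebuild_from_original` and the length-only check are assumptions made for illustration.

```python
from typing import List


def rebuild_from_original(token: str, predicted: List[str]) -> List[str]:
    """Hypothetical helper: swap predicted pieces for slices of the original token.

    Assumes the pieces, concatenated, are meant to spell out the token exactly.
    If the total lengths differ, the predicted pieces are returned unchanged.
    """
    if sum(len(piece) for piece in predicted) != len(token):
        # The words do not exactly make up the token text, so leave them alone.
        return list(predicted)

    pieces = []
    start = 0
    for piece in predicted:
        end = start + len(piece)
        # Take the characters from the original text rather than the prediction,
        # so unknown or miscopied characters are recovered from the input.
        pieces.append(token[start:end])
        start = end
    return pieces
```

With this sketch, `rebuild_from_original("SHE'S", ["she", "'s"])` yields `["SHE", "'S"]`, while a case such as French `au` predicted as `à` + `le` is left untouched because the lengths do not line up.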

…ssible to use the original text directly, even in the face of unknown characters (which currently trips up the copy mechanism)
…f the MWTs. Could potentially check this for other languages
…rds, then whenever possible, replace all the characters with characters from the original text. Should greatly help unknown characters or previously unseen words in general.
…original word matches one of a couple expected casing formats, in which case we can recreate those formats after using the dictionary lookup. Otherwise, you get unexpected tokenizations such as She's -> she 's. #1371
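
The casing step in the last commit message could look roughly like the following. This is a sketch under assumptions: the commit message does not say which casing formats are checked, so title case and all caps are guesses, and `restore_casing` is a hypothetical name rather than a function from the codebase.

```python
from typing import List


def restore_casing(original: str, pieces: List[str]) -> List[str]:
    """Hypothetical helper: reapply the original word's casing to looked-up pieces.

    Assumes `pieces` came from a lowercase dictionary lookup. Only two casing
    formats are handled here (ALL CAPS and Title case); both are assumptions.
    """
    if not pieces:
        return []
    if original.isupper():
        # ALL CAPS original: uppercase every piece.
        return [piece.upper() for piece in pieces]
    if original[:1].isupper() and original[1:].islower():
        # Title-case original: capitalize only the first piece.
        first = pieces[0][:1].upper() + pieces[0][1:]
        return [first] + list(pieces[1:])
    # Lowercase or mixed casing: keep the dictionary result as-is.
    return list(pieces)
```

Under these assumptions, `She's` looked up as `she` + `'s` comes back as `She` + `'s` instead of the unexpected `she 's`.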
@AngledLuffa AngledLuffa merged commit c27b5b9 into dev Apr 4, 2024
@AngledLuffa AngledLuffa deleted the english_mwt branch April 4, 2024 22:04