feat: condense loan phrases from foreign languages into single tokens #1833
base: master
Conversation
```rust
fn uncached_loan_phrases_expr() -> Lrc<FirstMatchOf> {
    Lrc::new(FirstMatchOf::new(
```
If you like Latin, here it is:
ad hoc
ad hominem
ad infinitum
ad lib
ad valorem
alter ego
ante bellum
a posteriori
a priori
carpe diem
caveat emptor
caeteris paribus
cogito, ergo sum
cum laude
curriculum vitae
deus ex machina
dramatis personae
e.g. (exempli gratia)
et al. (et alii)
et cetera
ex officio
ex post facto
id est (i.e.)
in absentia
in memoriam
in toto
in vino veritas
in vitro
lapsus linguae
mea culpa
mea maxima culpa
modus operandi
non sequitur
nota bene (N.B.)
opus magnum
persona non grata
post mortem
pro bono
pro forma
quo vadis?
rigor mortis
sine qua non
status quo
tabula rasa
tempus fugit
veni, vidi, vici
vice versa
These two are also Latin, but not multi words:
veto
sic
My gut doesn't love the idea of merging words just because they're part of a larger foreign phrase. Is there a better solution that allows us to keep them as their own tokens?
Yes, but it would probably be down to you to code it (-: make the spellchecker aware of multi-word terms using something like a sliding window that can check two-word and possibly three-word terms. We could probably add a flag to
It would need to work both with compounds that have spaces between the words and with compounds that have hyphens between them, and accept any kind of whitespace as a space. The good part is that the dictionary already works with multi-word terms.
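The sliding-window idea suggested above could look roughly like this. This is a minimal sketch, not Harper's actual API: `merge_phrases` and the plain `HashSet` standing in for the dictionary are hypothetical names for illustration.

```rust
use std::collections::HashSet;

/// Sketch: slide a window over already-split tokens and merge a run of
/// two or three words into one token when the joined phrase is a
/// dictionary entry. Prefers the longest match at each position.
fn merge_phrases(tokens: &[&str], dictionary: &HashSet<&str>) -> Vec<String> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < tokens.len() {
        let mut merged = None;
        // Try the three-word window first, then the two-word window.
        for len in (2..=3).rev() {
            if i + len <= tokens.len() {
                let candidate = tokens[i..i + len].join(" ");
                if dictionary.contains(candidate.as_str()) {
                    merged = Some((candidate, len));
                    break;
                }
            }
        }
        match merged {
            Some((phrase, len)) => {
                out.push(phrase);
                i += len;
            }
            None => {
                out.push(tokens[i].to_string());
                i += 1;
            }
        }
    }
    out
}

fn main() {
    let dict: HashSet<&str> = ["ad hoc", "sine qua non"].into_iter().collect();
    let tokens = ["an", "ad", "hoc", "sine", "qua", "non", "fix"];
    println!("{:?}", merge_phrases(&tokens, &dict));
}
```

A real implementation would also need to normalize hyphens and arbitrary whitespace before the lookup, as noted above.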
Issues
Should close #1683 for the Latin terms.
Another part was prompted by @elijah-potter's question in Discord: https://discord.com/channels/1335035237213671495/1335036485765828668/1411099860807057418
Though I'd been thinking of implementing something like this in recent weeks as well.
Description
Condenses all foreign loan phrases that are in our dictionary into single tokens, including `en masse`, `kung fu`, etc.

I renamed the `condense_latin` method, since there's more Latin in the new method now.

I rearranged the order of the condensing functions so that the helper for the `SequenceExp` ones comes after all its uses. I put a comment before each of those to make it clearer that each consists of a `fn uncached_*_pattern()`, a `thread_local!` `*_EXPR`, and a `condense_*` function.

I experimented with building the list of multi-word terms to condense by checking all multi-word entries in the dictionary to see which of them included at least one word that is not in the dictionary in its own right. This revealed a few surprising English entries that fit that pattern too, but since we want to initialize the condensing pattern statically, we can't use the `FstDictionary`.

I discovered two typos in the dictionary while working on this.

As a bonus, one false positive in the snapshots is resolved by this.
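The dictionary experiment described above can be sketched as follows. This is illustrative only: a real run would iterate Harper's word list rather than a toy `HashSet`, and `phrases_with_unknown_words` is a hypothetical name.

```rust
use std::collections::HashSet;

/// Sketch of the experiment: scan the multi-word dictionary entries and
/// keep the ones containing at least one word that is not a dictionary
/// entry in its own right.
fn phrases_with_unknown_words<'a>(entries: &'a HashSet<&'a str>) -> Vec<&'a str> {
    entries
        .iter()
        .filter(|e| e.contains(' '))
        .filter(|e| e.split_whitespace().any(|w| !entries.contains(w)))
        .copied()
        .collect()
}

fn main() {
    let entries: HashSet<&str> =
        ["status quo", "status", "ice cream", "ice", "cream"].into_iter().collect();
    // "quo" is not an entry on its own, so "status quo" qualifies;
    // "ice cream" does not, since both of its words are entries.
    println!("{:?}", phrases_with_unknown_words(&entries));
}
```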
How Has This Been Tested?
I've added a unit test that checks that a two-word and a three-word loan phrase both get tokenized.
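A toy version of that check might look like this. The `condense` function here is a stand-in that splices over a fixed phrase list, not the PR's implementation, which derives its phrases from the dictionary.

```rust
/// Stand-in condenser: replaces each run of tokens matching a known
/// phrase with a single joined token.
fn condense(text: &str, phrases: &[&str]) -> Vec<String> {
    let mut tokens: Vec<String> = text.split_whitespace().map(str::to_string).collect();
    for phrase in phrases {
        let words: Vec<&str> = phrase.split(' ').collect();
        let mut i = 0;
        while i + words.len() <= tokens.len() {
            if tokens[i..i + words.len()] == words[..] {
                // Replace the matched run with the single joined token.
                tokens.splice(i..i + words.len(), [phrase.to_string()]);
            }
            i += 1;
        }
    }
    tokens
}

fn main() {
    // A two-word and a three-word loan phrase each become one token.
    let toks = condense("he knows kung fu and sine qua non", &["kung fu", "sine qua non"]);
    assert_eq!(toks.len(), 5);
    println!("{toks:?}");
}
```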
Checklist