Skip to content

Conversation

hippietrail
Copy link
Collaborator

@hippietrail hippietrail commented Aug 30, 2025

Issues

Should close #1683 for the Latin terms.

Another part prompted by @elijah-potter 's question in Discord: https://discord.com/channels/1335035237213671495/1335036485765828668/1411099860807057418

Though I'd been thinking of implementing something like this in recent weeks as well.

Description

Condenses all foreign loan phrases that are in our dictionary into single tokens, including en masse, kung fu, etc.

I renamed the condense_latin method since there's more Latin in the new method now.
I rearranged the order of the condensing functions with the helper for SequenceExp ones after all its uses. I put a comment before each of those to make it clearer that each consisted of a fn_uncached_*_pattern(), a thread_local! *_EXPR, and a condense_* function.

I experimented with building the list of multi-word terms to condense by checking all multi-word entries in the dictionary to see which of them included at least one word not in the dictionary in its own right. This revealed a few surprising English entries that fit that pattern too, but since we want to initialize the condensing pattern statically we can't use the FstDictionary.
I discovered two typos in the dictionary while working on this.
One false positive in the snapshots is resolved by this as a bonus.

How Has This Been Tested?

I've added a unit test that checks that a two-word and a three-word loan phrase both get tokenized.

Checklist

  • I have performed a self-review of my own code
  • I have added tests to cover my changes

}

fn uncached_loan_phrases_expr() -> Lrc<FirstMatchOf> {
Lrc::new(FirstMatchOf::new(
Copy link
Contributor

@ccoVeille ccoVeille Sep 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you like Latin, here it is

ad hoc
​ad hominem
​ad infinitum
​ad lib
​ad valorem
​alter ego
​ante bellum
​a posteriori
​a priori
​carpe diem
​caveat emptor
​caeteris paribus
​cogito, ergo sum
​cum laude
​curriculum vitae
​deus ex machina
​dramatis personae
​e.g. (exempli gratia)
​et al. (et alii)
​et cetera
​ex officio
​ex post facto
​id est (i.e.)
​in absentia
​in memoriam
​in toto
​in vino veritas
​in vitro
​lapsus linguae
​mea culpa
​mea maxima culpa
​modus operandi
​non sequitur
​nota bene (N.B.)
​opus magnum
​persona non grata
​post mortem
​pro bono
​pro forma
​quo vadis?
​rigor mortis
​sine qua non
​status quo
​tabula rasa
​tempus fugit
​veni, vidi, vici
​vice versa

These two are also Latin, but not multi words:

veto
sic

@elijah-potter
Copy link
Collaborator

My gut doesn't love the idea of merging words just because they're part of a larger foreign phrase. Is there a better solution that allows us to keep them as their own tokens?

@hippietrail
Copy link
Collaborator Author

My gut doesn't love the idea of merging words just because they're part of a larger foreign phrase. Is there a better solution that allows us to keep them as their own tokens?

Yes but it would probably be down to you to code it (-: make the spellchecker aware of multi-word terms using something like a sliding window that can check two-word and possibly three-word terms.

We could probably add a flag to WordMetadata meaning "is used in multi-word terms" so we can prune a lot of more expensive checks.

It would need to work with both compounds with spaces between words and with hyphens between words, and accept any kind of whitespace as a space.

The good part is the dictionary already works with multi-word terms.

@hippietrail hippietrail marked this pull request as draft September 16, 2025 19:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

False positives for various legal terms
3 participants