feat: condense loan phrases from foreign languages into single tokens #1833

hippietrail · 2025-08-30T10:06:43Z

Issues

Should close #1683 for the Latin terms.

Another part was prompted by @elijah-potter's question in Discord: https://discord.com/channels/1335035237213671495/1335036485765828668/1411099860807057418

Though I'd been thinking of implementing something like this in recent weeks as well.

Description

Condenses all foreign loan phrases that are in our dictionary into single tokens, including en masse, kung fu, etc.
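
To illustrate what "condensing" means here, the following is a minimal, self-contained sketch. It is not the code in this PR: `condense_loan_phrases` is a hypothetical helper that works on whitespace-split strings instead of Harper's tokens and expressions, and it ignores punctuation entirely. It only shows the idea of greedily merging the longest run of words that forms a known phrase.

```rust
/// Toy stand-in for a condensing pass: merges runs of whitespace-separated
/// words that form a known loan phrase into a single token.
/// Hypothetical helper for illustration only, not part of Harper.
fn condense_loan_phrases(text: &str, phrases: &[&str]) -> Vec<String> {
    let words: Vec<&str> = text.split_whitespace().collect();

    // Longest phrase length in words; bounds the window we need to try.
    let max_len = phrases
        .iter()
        .map(|p| p.split_whitespace().count())
        .max()
        .unwrap_or(1);

    let mut out = Vec::new();
    let mut i = 0;
    while i < words.len() {
        let mut matched = 1;
        // Try the longest window first so "deus ex machina" wins over
        // any shorter phrase starting at the same word.
        for len in (2..=max_len.min(words.len() - i)).rev() {
            let candidate = words[i..i + len].join(" ");
            if phrases.iter().any(|p| p.eq_ignore_ascii_case(&candidate)) {
                matched = len;
                break;
            }
        }
        out.push(words[i..i + matched].join(" "));
        i += matched;
    }
    out
}
```

Called with the text "They arrived en masse" and the phrase list `["en masse", "kung fu"]`, this yields three tokens, the last of which is the whole phrase "en masse".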

I renamed the condense_latin method since there's more Latin in the new method now.
I rearranged the order of the condensing functions with the helper for SequenceExp ones after all its uses. I put a comment before each of those to make it clearer that each consisted of a fn_uncached_*_pattern(), a thread_local! *_EXPR, and a condense_* function.

I experimented with building the list of multi-word terms to condense by checking all multi-word entries in the dictionary to see which of them included at least one word not in the dictionary in its own right. This revealed a few surprising English entries that fit that pattern too, but since we want to initialize the condensing pattern statically we can't use the FstDictionary.
I discovered two typos in the dictionary while working on this.
One false positive in the snapshots is resolved by this as a bonus.
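
The dictionary experiment described above can be sketched roughly as follows. This is an assumption-laden illustration: `phrases_with_unknown_words` is a hypothetical function, and a plain slice of strings stands in for the dictionary rather than Harper's `FstDictionary`.

```rust
use std::collections::HashSet;

/// Returns the multi-word entries that contain at least one word which is
/// not an entry in its own right.
/// `entries` is a hypothetical stand-in for the dictionary's word list.
fn phrases_with_unknown_words<'a>(entries: &[&'a str]) -> Vec<&'a str> {
    // Every entry without a space counts as a word "in its own right".
    let singles: HashSet<&str> = entries
        .iter()
        .copied()
        .filter(|e| !e.contains(' '))
        .collect();

    entries
        .iter()
        .copied()
        .filter(|e| e.contains(' '))
        // Keep the phrase if any component word is missing from `singles`.
        .filter(|e| e.split(' ').any(|w| !singles.contains(w)))
        .collect()
}
```

With entries such as "ice", "cream", "ice cream", "en" and "en masse", only "en masse" is returned, because "masse" has no entry of its own.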

How Has This Been Tested?

I've added a unit test that checks that a two-word and a three-word loan phrase both get tokenized.

Checklist

I have performed a self-review of my own code
I have added tests to cover my changes

ccoVeille · 2025-09-09T19:23:49Z

harper-core/src/document.rs

+    }
+
+    fn uncached_loan_phrases_expr() -> Lrc<FirstMatchOf> {
+        Lrc::new(FirstMatchOf::new(


If you like Latin, here it is

ad hoc
ad hominem
ad infinitum
ad lib
ad valorem
alter ego
ante bellum
a posteriori
a priori
carpe diem
caveat emptor
caeteris paribus
cogito, ergo sum
cum laude
curriculum vitae
deus ex machina
dramatis personae
e.g. (exempli gratia)
et al. (et alii)
et cetera
ex officio
ex post facto
id est (i.e.)
in absentia
in memoriam
in toto
in vino veritas
in vitro
lapsus linguae
mea culpa
mea maxima culpa
modus operandi
non sequitur
nota bene (N.B.)
opus magnum
persona non grata
post mortem
pro bono
pro forma
quo vadis?
rigor mortis
sine qua non
status quo
tabula rasa
tempus fugit
veni, vidi, vici
vice versa

These two are also Latin, but they are single words rather than multi-word terms:

veto
sic

elijah-potter · 2025-09-12T17:54:38Z

My gut doesn't love the idea of merging words just because they're part of a larger foreign phrase. Is there a better solution that allows us to keep them as their own tokens?

hippietrail · 2025-09-12T18:48:01Z

> My gut doesn't love the idea of merging words just because they're part of a larger foreign phrase. Is there a better solution that allows us to keep them as their own tokens?

Yes, but it would probably be down to you to code it (-: We could make the spellchecker aware of multi-word terms, using something like a sliding window that can check two-word and possibly three-word terms.

We could probably add a flag to WordMetadata meaning "is used in multi-word terms" so we can prune a lot of more expensive checks.

It would need to work with both compounds with spaces between words and with hyphens between words, and accept any kind of whitespace as a space.

The good part is the dictionary already works with multi-word terms.
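
A rough sketch of the sliding-window idea proposed in this comment, under stated assumptions: `PhraseDict` and its methods are hypothetical, a `HashSet` stands in for the real dictionary, and the `in_phrase` set plays the role of the suggested "is used in multi-word terms" flag on `WordMetadata`. Words stay separate; they are only checked together.

```rust
use std::collections::HashSet;

/// Hypothetical mini-dictionary for illustration; not Harper's API.
struct PhraseDict {
    /// Single words and space-joined phrases, lowercased.
    entries: HashSet<String>,
    /// Words that occur inside some multi-word entry (the proposed flag).
    in_phrase: HashSet<String>,
}

impl PhraseDict {
    fn new(entries: &[&str]) -> Self {
        let entries: HashSet<String> = entries.iter().map(|e| e.to_lowercase()).collect();
        let in_phrase: HashSet<String> = entries
            .iter()
            .filter(|e| e.contains(' '))
            .flat_map(|e| e.split(' ').map(str::to_owned))
            .collect();
        Self { entries, in_phrase }
    }

    /// Returns the words accepted neither on their own nor as part of a
    /// two- or three-word dictionary entry.
    fn misspelled(&self, text: &str) -> Vec<String> {
        // Treat hyphens and any kind of whitespace as word separators.
        let words: Vec<String> = text
            .split(|c: char| c.is_whitespace() || c == '-')
            .filter(|w| !w.is_empty())
            .map(|w| w.to_lowercase())
            .collect();

        let mut ok = vec![false; words.len()];
        for (i, w) in words.iter().enumerate() {
            if self.entries.contains(w) {
                ok[i] = true;
            }
            // Cheap prune: skip the window checks for words that never
            // appear in a multi-word entry.
            if !self.in_phrase.contains(w) {
                continue;
            }
            for len in 2..=3 {
                if i + len > words.len() {
                    break;
                }
                if self.entries.contains(&words[i..i + len].join(" ")) {
                    ok[i..i + len].iter_mut().for_each(|o| *o = true);
                }
            }
        }

        words
            .into_iter()
            .zip(ok)
            .filter(|(_, good)| !*good)
            .map(|(w, _)| w)
            .collect()
    }
}
```

In this sketch "masse" is accepted when it follows "en" but flagged when it stands alone, which is the behavior that merging tokens was meant to achieve.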

hippietrail added 6 commits August 30, 2025 18:53

feat: condense loan phrases from foreign languages into single tokens

35688d5

fix: remove temporary separator comment

5156919

Merge branch 'master' of https://github.com/Automattic/harper into co…

4e7c551

…ndense-loan-phrases

Merge branch 'master' into condense-loan-phrases

061a827

Merge branch 'master' into condense-loan-phrases

3b38a02

Merge branch 'master' of https://github.com/Automattic/harper into co…

e930093

…ndense-loan-phrases

hippietrail requested a review from elijah-potter September 6, 2025 11:09

hippietrail mentioned this pull request Sep 9, 2025

False positives for various legal terms #1683

Open

ccoVeille reviewed Sep 9, 2025

hippietrail added 2 commits September 10, 2025 03:39

Merge branch 'master' of https://github.com/Automattic/harper into co…

de955c8

…ndense-loan-phrases

fix: problems pointed out by @ccoVeille

b4d3fc1

hippietrail marked this pull request as draft September 16, 2025 19:52
