feat: handle multiple derivations for words in the metadata #1035

hippietrail · 2025-04-06T12:03:43Z

Issues

N/A

Description

The code currently assumes each word has either 0 or 1 derivations.
That is, either the word was directly included in the dictionary, or it was derived from one entry in the dictionary.

But in fact, words can be derived from multiple dictionary entries via different affix rules.

Some derivations are surprising since they're left over from the curated dictionary's origin as a pure spellchecker dictionary from Hunspell, where affixes were more about making the data structure compact than about storing grammatical information.

Most words with multiple derivations are due to one having a prefix in its dictionary entry and a suffix added via the affix attributes, and the other having a suffix in its entry and prefix added via the affix attributes.

There's probably another kind of conflict which this work doesn't uncover: A word having an entry directly in the dictionary, and another resulting from a different base word in the dictionary with a computed affix added.
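As a rough illustration of how one word ends up with two derivations, here is a minimal sketch. The `Entry` struct and `derivations` function are hypothetical and much simpler than Harper's real affix expansion; they only show why the result has to be a set of sources rather than a single optional one.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical, simplified dictionary entry: a stem plus the prefixes and
/// suffixes its affix flags allow. Not Harper's actual data model.
struct Entry {
    stem: &'static str,
    prefixes: &'static [&'static str],
    suffixes: &'static [&'static str],
}

/// Expand every entry and record, for each derived word, all stems that produce it.
fn derivations(entries: &[Entry]) -> HashMap<String, BTreeSet<&'static str>> {
    let mut map: HashMap<String, BTreeSet<&'static str>> = HashMap::new();
    for e in entries {
        for p in e.prefixes {
            map.entry(format!("{p}{}", e.stem)).or_default().insert(e.stem);
        }
        for s in e.suffixes {
            map.entry(format!("{}{s}", e.stem)).or_default().insert(e.stem);
        }
    }
    map
}

fn main() {
    let entries = [
        // Prefix already in the entry, suffix added by an affix rule.
        Entry { stem: "proactive", prefixes: &[], suffixes: &["ly"] },
        // Suffix already in the entry, prefix added by an affix rule.
        Entry { stem: "actively", prefixes: &["pro"], suffixes: &[] },
    ];
    let map = derivations(&entries);
    let sources: Vec<_> = map["proactively"].iter().copied().collect();
    println!("{sources:?}"); // ["actively", "proactive"]
}
```

With a single `Option` slot, whichever entry was expanded last would overwrite the other; collecting into a set keeps both.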

Demo

Running target/debug/harper-cli metadata proactively

{
  "noun": null,
  "pronoun": null,
  "verb": null,
  "adjective": {
    "degree": null
  },
  "adverb": {},
  "conjunction": null,
  "swear": null,
  "dialect": null,
  "determiner": false,
  "preposition": false,
  "common": false,
  "derived_from": [
    {
      "hash": 9804073089439753757
    },
    {
      "hash": 16574093145829401086
    }
  ]
}
derived_from: ["proactive", "actively"]

Running target/debug/harper-cli metadata bed

{
  "noun": {
    "is_proper": null,
    "is_plural": false,
    "is_possessive": null
  },
  "pronoun": null,
  "verb": {
    "is_linking": null,
    "is_auxiliary": null,
    "tense": null
  },
  "adjective": null,
  "adverb": null,
  "conjunction": null,
  "swear": null,
  "dialect": null,
  "determiner": false,
  "preposition": false,
  "common": true,
  "derived_from": [
    {
      "hash": 4768722997514414045
    }
  ]
}
derived_from: ["b"]

How Has This Been Tested?

All tests still pass. Clippy is happy. No new tests were added yet.

Some other programmers' eyes should go over this since there are a few things I wasn't totally familiar with though everything ended up working.
Also, I'm not sure what consumes the output here. The way I'm outputting the array of derived_from is not JSON like the rest of the just getmetadata output. Maybe it would be better done another way.

Checklist

I have performed a self-review of my own code
I have added tests to cover my changes

elijah-potter · 2025-04-23T16:37:32Z

Also, I'm not sure what consumes the output here.

Nothing, yet. The idea is to eventually use this information to build a graph of word relationships, which we can then use to repair agreement problems.

hippietrail · 2025-04-23T18:53:24Z

Also, I'm not sure what consumes the output here.
Nothing, yet. The idea is to eventually use this information to build a graph of word relationships, which we can then use to repair agreement problems.

Nice! I also really want to continue improvement to the affix system based on setup work that was merged in the last couple of weeks. My tool for scanning Wiktionary and comparing with dictionary.dict is getting very close to useable. I was looking at adding mass vs uncountable for nouns to detect another kind of agreement error such as "an information" but Wiktionary keeps track of quite a few properties of words that are hard to find resources for, just in a slightly shaggy format.

elijah-potter · 2025-04-29T16:55:54Z

I was looking at adding mass vs uncountable for nouns to detect another kind of agreement error such as "an information"

Yes please!

hippietrail · 2025-04-29T17:03:22Z

I was looking at adding mass vs uncountable for nouns to detect another kind of agreement error such as "an information"

Yes please!

I was a bit worried that if I work on the affixes I might get in the way of what @RunDevelopment is working on at the moment. You know the codebase better than me so what do you think?

At this point I would probably add more annotations and comments for future logic changes but I probably wouldn't be going ahead and making such changes myself at this point.

Also I thought about bikeshedding the affix flags themselves en masse to try and make as many of the most-used ones as mnemonic as possible - what do you think? n v a = noun verb adjective 1 2 3 = first second person etc - at the moment maybe a third are mnemonic, a third i've memorized from using them a lot, and a third i always forget are there (-:

elijah-potter · 2025-04-29T17:46:43Z

I was a bit worried that if I work on the affixes I might get in the way of what @RunDevelopment is working on at the moment. You know the codebase better than me so what do you think?

That's likely a better question for him.

Also I thought about bikeshedding the affix flags themselves en masse to try and make as many of the most-used ones as mnemonic as possible - what do you think? n v a = noun verb adjective 1 2 3 = first second person etc - at the moment maybe a third are mnemonic, a third i've memorized from using them a lot, and a third i always forget are there (-:

If you think that's useful, go for it. You've got a much better handle on what's needed in that area than I do.

RunDevelopment · 2025-04-29T19:02:47Z

I was a bit worried that if I work on the affixes I might get in the way of what @RunDevelopment is working on at the moment.

I don't intend to touch anything in WordMetadata & co. for now, so I don't see any conflicts arising with this PR.

RunDevelopment · 2025-04-29T19:03:35Z

harper-core/src/token_kind.rs

+            TokenKind::Decade => {
+                0.hash(state);
+            }
+            TokenKind::Number(number) => {
+                number.hash(state);
+            }
+            TokenKind::Space(space) => {
+                space.hash(state);
+            }
+            TokenKind::Newline(newline) => {
+                newline.hash(state);
+            }
+            TokenKind::EmailAddress => {
+                0.hash(state);
+            }
+            TokenKind::Url => {
+                0.hash(state);
+            }
+            TokenKind::Hostname => {
+                0.hash(state);
+            }
+            TokenKind::Unlintable => {
+                0.hash(state);
+            }
+            TokenKind::ParagraphBreak => {
+                0.hash(state);
+            }
+            TokenKind::Regexish => {
+                0.hash(state);
+            }


These hashes seem highly dubious to me. Still work in progress?

These hashes seem highly dubious to me. Still work in progress?

I believe @elijah-potter added that with a plan in mind but so far those are not really used by anything.

Also worth noting, the field the hashes are written into is not checked to see if something is already there and the new one stomps the old one. I wrote a patch that's in a PR to use a set instead.

To be clear: my comment was in reference to the fact most variants have the same hash, which defeats the purpose of hashing.

Also worth noting, the field the hashes are written into is not checked to see if something is already there and the new one stomps the old one. I wrote a patch that's in a PR to use a set instead.

Wdym by "the field"? This function is generic over the hasher, so we don't know anything about how hashing is done.

Unless I'm on the wrong track and there's two kinds/places with hashing, the hash gets stored in a field in the metadata, derived_from.

Ah, so you were talking about WordMetadata::derived_from and not about this hashing function. I thought you were talking about the Hasher implementation used in this function.

And the hash in WordMetadata::derived_from is created by WordId, which is completely unrelated to this hashing function.
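On the point that most variants feed the same value into the hasher: a minimal sketch of one common alternative, hashing the variant's discriminant first. The `Kind` enum and `hash_of` helper are hypothetical stand-ins, not the real `TokenKind`; where every payload implements `Hash`, `#[derive(Hash)]` gives the same effect.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::mem::discriminant;

/// Hypothetical cut-down enum; the real TokenKind has many more variants.
#[derive(Debug, PartialEq, Eq)]
enum Kind {
    Url,
    Hostname,
    Number(u64),
}

impl Hash for Kind {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Feed the variant tag first so unit variants no longer share one input.
        discriminant(self).hash(state);
        // Then feed any payload the variant carries.
        if let Kind::Number(n) = self {
            n.hash(state);
        }
    }
}

/// Hash a value with the standard library's default hasher.
fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    // Unit variants now differ in their hash input.
    println!("{}", hash_of(&Kind::Url) != hash_of(&Kind::Hostname));
    // Equal values still hash equally, as the Hash contract requires.
    println!("{}", hash_of(&Kind::Number(7)) == hash_of(&Kind::Number(7)));
}
```

Writing `0.hash(state)` for every unit variant is still a valid `Hash` implementation, since equal values hash equally, but it sends all of those variants to the same bucket.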

RunDevelopment · 2025-04-29T19:08:52Z

harper-core/src/word_metadata.rs

@@ -28,7 +31,28 @@ pub struct WordMetadata {
    #[serde(default = "default_false")]
    pub common: bool,
    #[serde(default = "default_none")]
-    pub derived_from: Option<WordId>,
+    pub derived_from: Option<HashSet<WordId>>,


Is it really a good idea that we now need an allocation for almost every word token? This seems like it'll impact perf quite a bit.

Isn't this hash set pretty much constant for each word? If so, couldn't we replace it with an Arc<HashSet<WordId>>? (Or even more efficient: as an index + len pair into some global list.)

Is it really a good idea that we now need an allocation for almost every word token? This seems like it'll impact perf quite a bit.

Isn't this hash set pretty much constant for each word? If so, couldn't we replace it with an Arc<HashSet<WordId>>? (Or even more efficient: as an index + len pair into some global list.)

Oh I wonder if that's my patch I mentioned gone in? There's been a lot to try to keep up with today. The set is deficient in other ways too. It's only added to for derived forms so when a derived form of one word is the same as a dictionary word only the derived goes in.

I found the code a little messy by Harper's standards plus I didn't know what the intended final use cases were. @elijah-potter ?
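To make the `Arc<HashSet<WordId>>` suggestion concrete, here is a minimal sketch. `WordId` and `Metadata` below are hypothetical, trimmed-down stand-ins for the real types; the point is only that cloning the metadata for each token then shares one set instead of allocating a new one.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical stand-in for Harper's WordId.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct WordId(u64);

/// Hypothetical, trimmed-down metadata holding only the field under discussion.
#[derive(Clone, Debug)]
struct Metadata {
    derived_from: Option<Arc<HashSet<WordId>>>,
}

fn main() {
    let sources: HashSet<WordId> = [WordId(1), WordId(2)].into_iter().collect();
    let dictionary_copy = Metadata { derived_from: Some(Arc::new(sources)) };

    // Cloning for a token bumps a reference count; the set is not reallocated.
    let token_copy = dictionary_copy.clone();

    let a = dictionary_copy.derived_from.as_ref().unwrap();
    let b = token_copy.derived_from.as_ref().unwrap();
    println!("shared: {}", Arc::ptr_eq(a, b)); // shared: true
    println!("owners: {}", Arc::strong_count(a)); // owners: 2
}
```

One caveat, as far as I know: serializing an `Arc` through serde needs serde's `rc` feature, so the derive on `WordMetadata` would have to account for that.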

hippietrail · 2025-04-29T21:11:42Z

I was a bit worried that if I work on the affixes I might get in the way of what @RunDevelopment is working on at the moment.

I don't intend to touch anything in WordMetadata & co. for now, so I don't see any conflicts arising with this PR.

Great to know - will consider it in my "inbox" (-:

feat: handle multiple derivations for words in the metadata

60e0372

hippietrail added the enhancement New feature or request label Apr 27, 2025

RunDevelopment reviewed Apr 29, 2025

hippietrail added the harper-core label May 28, 2025

hippietrail force-pushed the multiple-derivations branch from ffa1389 to 60e0372 Compare July 30, 2025 01:07

hippietrail added 18 commits July 30, 2025 10:24

Merge commit 'bcc3df7718e4d4e766228a9e9f074cc14fd366aa' into multiple-derivations

4413bb4

Merge commit 'f4ce421bb496fd3b0b5c7bb7be3f363ab2a0d900' into multiple-derivations

2099ce9

Merge commit 'cc1f15bacac35bb86c5560c770235cb387aec9d2' into multiple-derivations

06916c6

Merge commit 'f7b6bc18cbb49e4ac3ca40075e4b64e676e68707' into multiple-derivations

b98c589

Merge commit '3144b95385aac58ba961c6c6c73d97628f919348' into multiple-derivations

80f63fe

Merge commit '00ea7330d5b078d14ebf6e05d5fc2d0fbc372eab' into multiple-derivations

da7b544

Merge commit 'e0c1acbbdbdc679214a6f02add24d7b9c3c069a6' into multiple-derivations

fb3ad6f

Merge commit 'c0b4dbbabbe679e0e733e5f20912850b866528cd' into multiple-derivations

535872d

Merge commit '32385de6abe68d548bac66e4bdd563b1ab323ef2' into multiple-derivations

02e9d0f

Merge commit '84fc9d8746b8ddf2f9e2ad26acb8abf45f3b7e78' into multiple-derivations

e538a37

chore: merging old branch bit by bit

123fd7f

Merge commit '8878c2c94a9edb307e33bba4d742b0b4d03b01c4' into multiple-derivations

abac832

Merge commit '6f5da14ca5c3c80187d0a3e132455022511efbad' into multiple-derivations

b520d77

Merge commit '09c826fe2fa5ea051aa059124dad886c5f0df47e' into multiple-derivations

50afbe0

Merge commit '5a70b9b69f9f5277a25eb03e679f0b1cd1e74ff8' into multiple-derivations

5756660

Merge commit '74608430b581cab91c8ef941301f5e2f8b7343ba' into multiple-derivations

3b26a8e

Merge commit '47aa8495f0949661e1fd2c5cde2ff9c1e9c1f404' into multiple-derivations

4782a86

Merge commit '1dc6a185a985fcb2ca462b1b7cdd08cf9a199b3e' into multiple-derivations

ef1f9c6

hippietrail added 30 commits July 30, 2025 13:22

Merge commit '55a475eface67d0cb509b56eeb42ff57ee07898a' into multiple-derivations

ae1d34c

Merge commit 'f30d08478d75715d2c2c9fc9da551c3a1884f6ea' into multiple-derivations

c04c6da

Merge commit '9bbe9b7051d03beb91b0c626174915b6314ffcb0' into multiple-derivations

88efd87

Merge commit 'a1fb3d4f4ba7185cf6d41028fdca3e58c97a7393' into multiple-derivations

567fc48

Merge commit '64b20a843008af5a94a6ebe85668d56c4d9082e6' into multiple-derivations

623d62c

Merge commit '1214bd8e1c65d1196bd583aea470914cd5441b4f' into multiple-derivations

0bb3a66

Merge commit '1cef35cb66cf2d8ad6a0a9c4fb2554f9df65540a' into multiple-derivations

a504e63

Merge commit '65b0292760a125f8ebf0b8a098002a79fefb9412' into multiple-derivations

3c648fc

Merge commit '2d358c24d0b7605a4406030d297208dec4255748' into multiple-derivations

637523c

Merge commit '569d6162f01b4755f874ee5d1730cd0422300229' into multiple-derivations

e97226e

Merge commit 'a604ec448ea85b9965d8ef93e2385ed5803b5cbb' into multiple-derivations

5d9bb1f

Merge commit 'a2bc3743a0b8cd7250a0411f290664dd45f6b040' into multiple-derivations

2635db9

Merge commit '4f09cecfc08d02b552d52836f9b3a6cd51b19497' into multiple-derivations

f396684

Merge commit '71521f2a1bbd2cd6631225e951c187df933d69be' into multiple-derivations

06886ae

Merge commit 'f79548fd2ed3e7cb63e1093f91487efc191b1512' into multiple-derivations

7ac10bf

Merge commit '37b0ac5675baad870cec1776038d9c8e09d1bc8e' into multiple-derivations

923647c

Merge commit '6849aad2b331a9b4efd9b5fc3a13e8f7c626eb40' into multiple-derivations

44588ee

Merge commit '90a66a9c8fc7f6308986b117ab2a623c9909a3dd' into multiple-derivations

159a320

Merge commit 'df118218f59a2411694e403b14313385de6ed730' into multiple-derivations

0591673

Merge commit '88244550f829afae8b0cd86fd42b972863c56ca7' into multiple-derivations

abd55f7

Merge branch 'master' of https://github.com/Automattic/harper into multiple-derivations

e58c3f2

Merge branch 'master' of https://github.com/Automattic/harper into multiple-derivations

aa2916e

fix: appease precommit

9403b1d

Merge branch 'master' of https://github.com/Automattic/harper into multiple-derivations

d1b7d5c

Merge branch 'master' into multiple-derivations

7def98a

Merge branch 'master' into multiple-derivations

1a1e0bc

Merge branch 'master' of https://github.com/Automattic/harper into multiple-derivations

2f0084b

chore: merge with upstream

f8f5c55

Merge branch 'master' into multiple-derivations

1c5e616

Merge branch 'master' into multiple-derivations

7a957cf
