Conversation

hippietrail
Collaborator

Issues

N/A

Description

The code currently assumes each word has either 0 or 1 derivations.
That is, either the word was directly included in the dictionary, or it was derived from one entry in the dictionary.

But in fact, words can be derived from multiple dictionary entries via different affix rules.

Some derivations are surprising since they're left over from the curated dictionary's origin as a pure spellchecker dictionary from Hunspell, where affixes were more about making the data structure compact than about storing grammatical information.

Most words with multiple derivations are due to one having a prefix in its dictionary entry and a suffix added via the affix attributes, and the other having a suffix in its entry and prefix added via the affix attributes.

There's probably another kind of conflict which this work doesn't uncover: a word having one entry directly in the dictionary, and another derivation resulting from a different base word in the dictionary with a computed affix added.
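The multiple-derivation idea can be sketched with a toy model (these are not Harper's actual types; `WordId` here is just a hash of the base word, and `record_derivation` is a hypothetical helper):

```rust
use std::collections::{HashMap, HashSet};
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for Harper's `WordId`: here, just a hash of the word.
type WordId = u64;

fn word_id(word: &str) -> WordId {
    let mut h = std::collections::hash_map::DefaultHasher::new();
    word.hash(&mut h);
    h.finish()
}

/// Record a derivation without stomping earlier ones: a set holds every base word.
fn record_derivation(index: &mut HashMap<String, HashSet<WordId>>, derived: &str, base: &str) {
    index.entry(derived.to_string()).or_default().insert(word_id(base));
}

fn main() {
    let mut index = HashMap::new();
    // "proactively" is reachable from "proactive" (suffix -ly) and "actively" (prefix pro-).
    record_derivation(&mut index, "proactively", "proactive");
    record_derivation(&mut index, "proactively", "actively");
    assert_eq!(index["proactively"].len(), 2);
}
```

Storing a set rather than a single `Option<WordId>` is what lets both derivations survive.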

Demo

Running `target/debug/harper-cli metadata proactively`:

```
{
  "noun": null,
  "pronoun": null,
  "verb": null,
  "adjective": {
    "degree": null
  },
  "adverb": {},
  "conjunction": null,
  "swear": null,
  "dialect": null,
  "determiner": false,
  "preposition": false,
  "common": false,
  "derived_from": [
    {
      "hash": 9804073089439753757
    },
    {
      "hash": 16574093145829401086
    }
  ]
}
derived_from: ["proactive", "actively"]
```

Running `target/debug/harper-cli metadata bed`:

```
{
  "noun": {
    "is_proper": null,
    "is_plural": false,
    "is_possessive": null
  },
  "pronoun": null,
  "verb": {
    "is_linking": null,
    "is_auxiliary": null,
    "tense": null
  },
  "adjective": null,
  "adverb": null,
  "conjunction": null,
  "swear": null,
  "dialect": null,
  "determiner": false,
  "preposition": false,
  "common": true,
  "derived_from": [
    {
      "hash": 4768722997514414045
    }
  ]
}
derived_from: ["b"]
```

How Has This Been Tested?

All tests still pass. Clippy is happy. No new tests were added yet.

This could use another pair of eyes, since it touches a few things I wasn't totally familiar with, though everything ended up working.
Also, I'm not sure what consumes the output here. The way I'm outputting the array of `derived_from` is not JSON, unlike the rest of the `metadata` output. Maybe it would be better done another way.

Checklist

  • I have performed a self-review of my own code
  • I have added tests to cover my changes

@elijah-potter
Collaborator

elijah-potter commented Apr 23, 2025

> Also, I'm not sure what consumes the output here.

Nothing, yet. The idea is to eventually use this information to build a graph of word relationships, which we can then use to repair agreement problems.

@hippietrail
Collaborator Author

> Also, I'm not sure what consumes the output here.

> Nothing, yet. The idea is to eventually use this information to build a graph of word relationships, which we can then use to repair agreement problems.

Nice! I also really want to continue improving the affix system, based on setup work that was merged in the last couple of weeks. My tool for scanning Wiktionary and comparing it with `dictionary.dict` is getting very close to usable. I was looking at adding mass vs uncountable for nouns to detect another kind of agreement error, such as "an information". Wiktionary keeps track of quite a few properties of words that are hard to find resources for, just in a slightly shaggy format.

@hippietrail added the enhancement (New feature or request) label Apr 27, 2025
@elijah-potter
Collaborator

> I was looking at adding mass vs uncountable for nouns to detect another kind of agreement error such as "an information"

Yes please!

@hippietrail
Collaborator Author

> I was looking at adding mass vs uncountable for nouns to detect another kind of agreement error such as "an information"

> Yes please!

I was a bit worried that if I work on the affixes I might get in the way of what @RunDevelopment is working on at the moment. You know the codebase better than I do, so what do you think?

For now I would probably add more annotations and comments for future logic changes, but I wouldn't go ahead and make such changes myself yet.

Also, I thought about bikeshedding the affix flags themselves en masse to try to make as many of the most-used ones as mnemonic as possible. What do you think? e.g. n/v/a = noun/verb/adjective, 1/2/3 = first/second/third person, etc. At the moment maybe a third are mnemonic, a third I've memorized from using them a lot, and a third I always forget are there (-:

@elijah-potter
Collaborator

> I was a bit worried that if I work on the affixes I might get in the way of what @RunDevelopment is working on at the moment. You know the codebase better than me so what do you think?

That's likely a better question for him.

> Also I thought about bikeshedding the affix flags themselves en masse to try and make as many of the most-used ones as mnemonic as possible - what do you think? n v a = noun verb adjective 1 2 3 = first second person etc - at the moment maybe a third are mnemonic, a third i've memorized from using them a lot, and a third i always forget are there (-:

If you think that's useful, go for it. You've got a much better handle on what's needed in that area than I do.

@RunDevelopment
Contributor

> I was a bit worried that if I work on the affixes I might get in the way of what @RunDevelopment is working on at the moment.

I don't intend to touch anything in WordMetadata & co. for now, so I don't see any conflicts arising with this PR.

Comment on lines +39 to +68
```rust
TokenKind::Decade => {
    0.hash(state);
}
TokenKind::Number(number) => {
    number.hash(state);
}
TokenKind::Space(space) => {
    space.hash(state);
}
TokenKind::Newline(newline) => {
    newline.hash(state);
}
TokenKind::EmailAddress => {
    0.hash(state);
}
TokenKind::Url => {
    0.hash(state);
}
TokenKind::Hostname => {
    0.hash(state);
}
TokenKind::Unlintable => {
    0.hash(state);
}
TokenKind::ParagraphBreak => {
    0.hash(state);
}
TokenKind::Regexish => {
    0.hash(state);
}
```
Contributor

These hashes seem highly dubious to me. Still work in progress?

Collaborator Author

> These hashes seem highly dubious to me. Still work in progress?

I believe @elijah-potter added those with a plan in mind, but so far they're not really used by anything.

Also worth noting: the field the hashes are written into is not checked to see if something is already there, so the new one stomps the old one. I wrote a patch, now in a PR, to use a set instead.

Contributor

To be clear: my comment was in reference to the fact most variants have the same hash, which defeats the purpose of hashing.

> Also worth noting, the field the hashes are written into is not checked to see if something is already there and the new one stomps the old one. I wrote a patch that's in a PR to use a set instead.

What do you mean by "the field"? This function is generic over the hasher, so we don't know anything about how hashing is done.
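The collision point can be reduced to a small sketch (hypothetical `Kind` enum, not Harper's `TokenKind`): when every match arm feeds the same constant to the hasher, distinct variants become indistinguishable by hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

enum Kind {
    EmailAddress,
    Url,
    Hostname,
}

// Mirrors the pattern under review: every arm contributes the same `0`,
// so all three variants produce identical hashes.
fn hash_kind(kind: &Kind) -> u64 {
    let mut state = DefaultHasher::new();
    match kind {
        Kind::EmailAddress => 0.hash(&mut state),
        Kind::Url => 0.hash(&mut state),
        Kind::Hostname => 0.hash(&mut state),
    }
    state.finish()
}

fn main() {
    // Different token kinds, same hash: the hash carries no information.
    assert_eq!(hash_kind(&Kind::Url), hash_kind(&Kind::Hostname));
    assert_eq!(hash_kind(&Kind::EmailAddress), hash_kind(&Kind::Url));
}
```

Hashing `std::mem::discriminant(kind)` or a distinct per-variant constant instead would keep the variants apart.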

Collaborator Author

Unless I'm on the wrong track and there are two kinds/places of hashing, the hash gets stored in a field in the metadata, `derived_from`.

Contributor

Ah, so you were talking about WordMetadata::derived_from and not about this hashing function. I thought you were talking about the Hasher implementation used in this function.

And the hash in WordMetadata::derived_from is created by WordId, which is completely unrelated to this hashing function.

```diff
@@ -28,7 +31,28 @@ pub struct WordMetadata {
     #[serde(default = "default_false")]
     pub common: bool,
     #[serde(default = "default_none")]
-    pub derived_from: Option<WordId>,
+    pub derived_from: Option<HashSet<WordId>>,
```
Contributor

Is it really a good idea that we now need an allocation for almost every word token? This seems like it'll impact perf quite a bit.

Isn't this hash set pretty much constant for each word? If so, couldn't we replace it with an Arc<HashSet<WordId>>? (Or even more efficient: as an index + len pair into some global list.)
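The `Arc` suggestion can be sketched like this (hypothetical `TokenMetadata` struct, not Harper's actual one): since the derivation set is constant per dictionary word, every token for that word can share one allocation instead of owning its own `HashSet`.

```rust
use std::collections::HashSet;
use std::sync::Arc;

type WordId = u64;

// Hypothetical per-token metadata: cloning it copies only a pointer,
// not the underlying set.
#[derive(Clone)]
struct TokenMetadata {
    derived_from: Option<Arc<HashSet<WordId>>>,
}

fn main() {
    let shared: Arc<HashSet<WordId>> = Arc::new([1u64, 2].into_iter().collect());
    let a = TokenMetadata { derived_from: Some(Arc::clone(&shared)) };
    let b = a.clone();
    // Both tokens point at the very same set; no per-token allocation happened.
    assert!(Arc::ptr_eq(a.derived_from.as_ref().unwrap(), b.derived_from.as_ref().unwrap()));
    assert_eq!(Arc::strong_count(&shared), 3);
}
```

The index + length pair into a global list would be more compact still, at the cost of threading that global list through the code.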

Collaborator Author

> Is it really a good idea that we now need an allocation for almost every word token? This seems like it'll impact perf quite a bit.

> Isn't this hash set pretty much constant for each word? If so, couldn't we replace it with an Arc<HashSet<WordId>>? (Or even more efficient: as an index + len pair into some global list.)

Oh, I wonder if that's the patch I mentioned having gone in? There's been a lot to try to keep up with today. The set is deficient in other ways too: it's only added to for derived forms, so when a derived form of one word is the same as a dictionary word, only the derived one goes in.

I found the code a little messy by Harper's standards, plus I didn't know what the intended final use cases were. @elijah-potter?

@hippietrail
Collaborator Author

> I was a bit worried that if I work on the affixes I might get in the way of what @RunDevelopment is working on at the moment.

> I don't intend to touch anything in WordMetadata & co. for now, so I don't see any conflicts arising with this PR.

Great to know - will consider it in my "inbox" (-:

Labels: enhancement (New feature or request), harper-core