-
Notifications
You must be signed in to change notification settings - Fork 203
feat: handle multiple derivations for words in the metadata #1035
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Nothing, yet. The idea is to eventually use this information to build a graph of word relationships, which we can then use to repair agreement problems. |
Nice! I also really want to continue improvement to the affix system based on setup work that was merged in the last couple of weeks. My tool for scanning Wiktionary and comparing with |
Yes please! |
I was a bit worried that if I work on the affixes I might get in the way of what @RunDevelopment is working on at the moment. You know the codebase better than me so what do you think? At this point I would probably add more annotations and comments for future logic changes but I probably wouldn't be going ahead and making such changes myself at this point. Also I thought about bikeshedding the affix flags themselves en masse to try and make as many of the most-used ones as mnemonic as possible - what do you think? n v a = noun verb adjective 1 2 3 = first second person etc - at the moment maybe a third are mnemonic, a third i've memorized from using them a lot, and a third i always forget are there (-: |
That's likely a better question for him.
If you think that's useful, go for it. You've got a much better handle on what's needed in that area than I do. |
I don't intend to touch anything in |
TokenKind::Decade => { | ||
0.hash(state); | ||
} | ||
TokenKind::Number(number) => { | ||
number.hash(state); | ||
} | ||
TokenKind::Space(space) => { | ||
space.hash(state); | ||
} | ||
TokenKind::Newline(newline) => { | ||
newline.hash(state); | ||
} | ||
TokenKind::EmailAddress => { | ||
0.hash(state); | ||
} | ||
TokenKind::Url => { | ||
0.hash(state); | ||
} | ||
TokenKind::Hostname => { | ||
0.hash(state); | ||
} | ||
TokenKind::Unlintable => { | ||
0.hash(state); | ||
} | ||
TokenKind::ParagraphBreak => { | ||
0.hash(state); | ||
} | ||
TokenKind::Regexish => { | ||
0.hash(state); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These hashes seem highly dubious to me. Still work in progress?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These hashes seem highly dubious to me. Still work in progress?
I believe @elijah-potter added that with a plan in mind but so far those are not really used by anything.
Also worth noting, the field the hashes are written into is not checked to see if something is already there and the new one stomps the old one. I wrote a patch that's in a PR to use a set instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear: my comment was in reference to the fact most variants have the same hash, which defeats the purpose of hashing.
Also worth noting, the field the hashes are written into is not checked to see if something is already there and the new one stomps the old one. I wrote a patch that's in a PR to use a set instead.
Wdym by "the field"? This function is generic over the hasher, so we don't know anything about how hashing is done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless I'm on the wrong track and there's two kinds/places with hashing, the hash gets stored in a field in the metadata, derived_from
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, so you were talking about WordMetadata::derived_from
and not about this hashing function. I thought you were talking about the Hasher
implementation used in this function.
And the hash in WordMetadata::derived_from
is created by WordId
, which completely unrelated this hashing function.
@@ -28,7 +31,28 @@ pub struct WordMetadata { | |||
#[serde(default = "default_false")] | |||
pub common: bool, | |||
#[serde(default = "default_none")] | |||
pub derived_from: Option<WordId>, | |||
pub derived_from: Option<HashSet<WordId>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it really a good idea that we now need an allocation for almost every word token? This seems like it'll impact perf quite a bit.
Isn't this hash set pretty much constant for each word? If so, couldn't we replace it with an Arc<HashSet<WordId>>
? (Or even more efficient: as an index + len pair into some global list.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it really a good idea that we now need an allocation for almost every word token? This seems like it'll impact perf quite a bit.
Isn't this hash set pretty much constant for each word? If so, couldn't we replace it with an
Arc<HashSet<WordId>>
? (Or even more efficient: as an index + len pair into some global list.)
Oh I wonder if that's my patch I mentioned gone in? There's been a lot to try to keep up with today. The set is deficient in other ways too. It's only added to for derived forms so when a derived form of one word is the same as a dictionary word only the derived goes in.
I found the code a little messy by Harper's standards plus I didn't know what the intended final uses cases were. @elijah-potter ?
Great to know - will consider it in my "inbox" (-: |
ffa1389
to
60e0372
Compare
…ltiple-derivations
…ltiple-derivations
…ltiple-derivations
…ltiple-derivations
Issues
N/A
Description
The code currently assumes each word has either 0 or 1 derivations.
That is, either the word was directly included in the dictionary, or it was derived from one entry in the dictionary.
But in fact, words can be derived from multiple dictionary entries via different affix rules.
Some derivations are surprising since they're left over from the curated dictionary's origin as a pure spellchecker dictionary from Hunspell, where affixes were more about making the data structure compact than about storing grammatical information.
Most words with multiple derivations are due to one having a prefix in its dictionary entry and a suffix added via the affix attributes, and the other having a suffix in its entry and prefix added via the affix attributes.
There's probably another kind of conflict which this work doesn't uncover: A word having an entry directly in the dictionary, and another resulting from a different base word in the dictionary with a computed affix added.
Demo
Running
target/debug/harper-cli metadata proactively
Running
target/debug/harper-cli metadata bed
How Has This Been Tested?
All tests still pass. Clippy is happy. No new tests were added yet.
Some other programmers' eyes should go over this since there are a few things I wasn't totally familiar with though everything ended up working.
Also, I'm not sure what consumes the output here. The way I'm outputting the array of
derived_from
is not JSON like the rest of thejust getmetadata
output. Maybe it would be better done another way.Checklist