Merge vocabulary and context building #55
Reverting could affect the results. The current code does not recalculate tags, so there is nothing to gain from replacing the use of an existing tag with new function calls.
if !Tag::is_numeric(word) && !self.is_unparsable(word) {
for &(left_word, left_uterm) in window.iter() {
if Tag::is_numeric(left_word) || self.is_unparsable(left_word) {
if tag != Tag::Digit && tag != Tag::Unparsable {
We can easily make "tags to discard" configurable and just use that here. LIAAD/yake wanted that, but never got to it.
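A minimal sketch of what a configurable "tags to discard" set could look like. The `Config` struct, its `discard_tags` field, the `keeps` helper, and the simplified `Tag` enum are all hypothetical names for illustration, not the crate's actual API:

```rust
use std::collections::HashSet;

// Hypothetical, simplified tag type; the crate's real `Tag` may have other variants.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum Tag {
    Digit,
    Unparsable,
    Acronym,
    Parsable,
}

// Hypothetical configuration holding the tags that should be skipped.
struct Config {
    discard_tags: HashSet<Tag>,
}

impl Default for Config {
    // Assumed default: mirrors the hard-coded `Tag::Digit` / `Tag::Unparsable` check.
    fn default() -> Self {
        Config {
            discard_tags: HashSet::from([Tag::Digit, Tag::Unparsable]),
        }
    }
}

impl Config {
    // Returns true when a word with this tag should be kept.
    fn keeps(&self, tag: Tag) -> bool {
        !self.discard_tags.contains(&tag)
    }
}
```

With something like this, the check `tag != Tag::Digit && tag != Tag::Unparsable` would become a single `config.keeps(tag)` call, and the default behaviour would stay the same.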
I just noticed that the CI checks are duplicated: there is one for pull_request and one for pull_request_target. That is completely unnecessary. I will look into how to fix it.
I like it! Allocations are better, as is the speed: 4-6%.
With my latest changes, it looks like we gained another 10% with this PR. On my laptop, that works out to 1 kB = 1 ms. With the power plugged in, it is even better.
Partially addresses #46
Context and vocabulary building each had their own separate loop over all tokens, and tag functions were called multiple times for the same word. Occurrences stored shifts and offsets, even though the only relevant information was whether the occurrence was at the start of a sentence.
This merges context and vocabulary building into one function, one loop, that calculates the tag for each occurrence only once.
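To illustrate the idea, here is a simplified sketch of a single pass that builds both structures and computes each tag once. It is not the code in this PR: `tag_of`, `Stats`, `build`, and the simplified `Tag` enum are hypothetical stand-ins, and the real types and tagging rules differ.

```rust
use std::collections::{HashMap, VecDeque};

// Hypothetical, simplified tag type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Tag {
    Digit,
    Unparsable,
    Parsable,
}

// Hypothetical classifier standing in for the real tagging logic.
fn tag_of(word: &str) -> Tag {
    if !word.is_empty() && word.chars().all(|c| c.is_ascii_digit()) {
        Tag::Digit
    } else if word.chars().any(|c| c.is_alphabetic()) {
        Tag::Parsable
    } else {
        Tag::Unparsable
    }
}

// Per-word statistics: only a sentence-start counter, no shifts or offsets.
#[derive(Debug, Default, PartialEq)]
struct Stats {
    occurrences: usize,
    sentence_starts: usize,
}

type Vocabulary = HashMap<String, Stats>;
type Contexts = HashMap<String, Vec<String>>; // word -> words seen to its left

// One loop over all tokens that fills both the vocabulary and the contexts.
fn build(sentences: &[Vec<&str>], window_size: usize) -> (Vocabulary, Contexts) {
    let mut vocab = Vocabulary::new();
    let mut contexts = Contexts::new();
    for sentence in sentences {
        // Preceding words are stored with their tags, so tags are never recomputed.
        let mut window: VecDeque<(String, Tag)> = VecDeque::new();
        for (i, word) in sentence.iter().enumerate() {
            let key = word.to_lowercase();
            let tag = tag_of(word); // computed once per occurrence

            let stats = vocab.entry(key.clone()).or_default();
            stats.occurrences += 1;
            if i == 0 {
                stats.sentence_starts += 1;
            }

            if tag != Tag::Digit && tag != Tag::Unparsable {
                for (left, left_tag) in &window {
                    if *left_tag == Tag::Digit || *left_tag == Tag::Unparsable {
                        continue;
                    }
                    contexts.entry(key.clone()).or_default().push(left.clone());
                }
            }

            if window_size > 0 {
                if window.len() == window_size {
                    window.pop_front();
                }
                window.push_back((key, tag));
            }
        }
    }
    (vocab, contexts)
}
```

Keeping the tag next to each word in the window is what lets the inner loop compare `left_tag` directly instead of calling the tag functions again.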
I suspect that extract_features and filtering could also be merged into the same loop.
I assume that reducing the number of loops and function calls is a performance improvement, but I will rely on @xamgore to verify this :)