Conversation

@bunny-therapist (Collaborator)

Partially addresses #46

Context and vocabulary building each had its own separate loop over all tokens, and the tag functions were called multiple times for the same word. Occurrences stored shifts and offsets even though the only relevant information was whether the occurrence was at the start of a sentence or not.

This merges context and vocabulary building into one function with a single loop that calculates the tag for each occurrence only once.
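
For illustration, here is a minimal sketch of what such a single pass can look like. The names (`Tag`, `WordStats`, `build_vocabulary_and_context`, the tagging rules) are simplified stand-ins rather than the crate's actual types and API; the point is only that each occurrence is tagged exactly once and that vocabulary and context are filled in the same loop.

```rust
// Hypothetical, simplified sketch (not the crate's actual types): one pass over
// the tokenized sentences that tags each word once and fills both the
// vocabulary (per-word stats) and the context (co-occurrence) data.
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Debug)]
enum Tag {
    Digit,
    Unparsable,
    Acronym,
    Uppercase,
    Parsable,
}

fn tag_of(word: &str, first_in_sentence: bool) -> Tag {
    // Placeholder tagging rules, only here to make the sketch self-contained.
    if word.chars().all(|c| c.is_ascii_digit()) {
        Tag::Digit
    } else if !word.chars().any(char::is_alphanumeric) {
        Tag::Unparsable
    } else if !first_in_sentence && word.chars().all(char::is_uppercase) {
        Tag::Acronym
    } else if !first_in_sentence && word.chars().next().map_or(false, char::is_uppercase) {
        Tag::Uppercase
    } else {
        Tag::Parsable
    }
}

#[derive(Default, Debug)]
struct WordStats {
    occurrences: Vec<Tag>,     // one tag per occurrence, computed exactly once
    sentence_starts: usize,    // the only positional fact that actually mattered
    left_context: Vec<String>, // co-occurring words from the sliding window
}

fn build_vocabulary_and_context(
    sentences: &[Vec<String>],
    window_size: usize,
) -> HashMap<String, WordStats> {
    let mut stats: HashMap<String, WordStats> = HashMap::new();

    for sentence in sentences {
        let mut window: Vec<(String, Tag)> = Vec::new();

        for (i, word) in sentence.iter().enumerate() {
            let key = word.to_lowercase();
            // The tag is computed once per occurrence and reused everywhere below.
            let tag = tag_of(word, i == 0);

            let entry = stats.entry(key.clone()).or_default();
            entry.occurrences.push(tag);
            if i == 0 {
                entry.sentence_starts += 1;
            }

            // Context building in the same pass: skip digits and unparsable words.
            if tag != Tag::Digit && tag != Tag::Unparsable {
                for (left_word, left_tag) in &window {
                    if *left_tag != Tag::Digit && *left_tag != Tag::Unparsable {
                        entry.left_context.push(left_word.clone());
                    }
                }
            }

            window.push((key, tag));
            if window.len() > window_size {
                window.remove(0);
            }
        }
    }

    stats
}

fn main() {
    let sentences = vec![
        vec!["Rust".to_string(), "is".to_string(), "fast".to_string()],
        vec!["Keyword".to_string(), "extraction".to_string(), "in".to_string(), "Rust".to_string()],
    ];
    let stats = build_vocabulary_and_context(&sentences, 2);
    println!("{:?}", stats.get("rust"));
}
```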

I suspect that extract_features and filtering could also be merged into the same loop.

I assume that reducing the number of loops and function calls is a performance improvement, but I will rely on @xamgore to verify this :)

Reverting could impact the results. The current code does not recalculate tags, so there is nothing to gain from replacing the use of the existing tag with new function calls.
if !Tag::is_numeric(word) && !self.is_unparsable(word) {
for &(left_word, left_uterm) in window.iter() {
if Tag::is_numeric(left_word) || self.is_unparsable(left_word) {
if tag != Tag::Digit && tag != Tag::Unparsable {
@bunny-therapist (Collaborator, Author)

We can easily make "tags to discard" configurable and just use that here. LIAAD/yake wanted that, but never got to it.
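
A minimal sketch of what that could look like, assuming a hypothetical `Config` struct with a `discard_tags` set and a simplified `Tag` enum (illustrative names, not the crate's actual API):

```rust
// Hypothetical sketch only: a configurable set of "tags to discard", used in
// place of the hard-coded `tag != Tag::Digit && tag != Tag::Unparsable` check.
use std::collections::HashSet;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum Tag {
    Digit,
    Unparsable,
    Parsable,
}

struct Config {
    /// Occurrences with these tags are skipped during context building.
    discard_tags: HashSet<Tag>,
}

impl Default for Config {
    fn default() -> Self {
        Config {
            // Mirrors the current hard-coded behavior.
            discard_tags: HashSet::from([Tag::Digit, Tag::Unparsable]),
        }
    }
}

fn contributes_to_context(config: &Config, tag: Tag) -> bool {
    !config.discard_tags.contains(&tag)
}

fn main() {
    let config = Config::default();
    assert!(!contributes_to_context(&config, Tag::Digit));
    assert!(contributes_to_context(&config, Tag::Parsable));
}
```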

@bunny-therapist (Collaborator, Author)

I just noticed that the CI checks are duplicated: there is one set for pull_request and one for pull_request_target, which is completely unnecessary. I will look into how to fix it.

@xamgore (Collaborator) commented Feb 8, 2025

I like it! Allocations are better, and so is the speed: 4-6% faster.

before

bench          fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ text_3kb    3.523 ms      │ 4.558 ms      │ 3.572 ms      │ 3.579 ms      │ 27925   │ 27925
├─ text_100kb  122.1 ms      │ 130.2 ms      │ 124.2 ms      │ 124.6 ms      │ 803     │ 803
╰─ text_170b   159.2 µs      │ 391 µs        │ 162.7 µs      │ 164 µs        │ 609118  │ 609118

after

bench          fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ text_3kb    3.378 ms      │ 22.48 ms      │ 3.416 ms      │ 3.424 ms      │ 29191   │ 29191
├─ text_100kb  115.7 ms      │ 120.3 ms      │ 116.5 ms      │ 116.5 ms      │ 858     │ 858
╰─ text_170b   151 µs        │ 357.6 µs      │ 153.5 µs      │ 154.1 µs      │ 648169  │ 648169
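
The table format matches the output of the divan benchmarking crate. Purely as an illustration, a benchmark of this shape could be set up roughly as below, assuming divan as a dev-dependency, a bench target with `harness = false`, and a placeholder `extract_keywords` function standing in for the real extraction entry point.

```rust
// Hypothetical sketch of a divan benchmark with the same shape as the tables
// above; `extract_keywords` is a placeholder, not the crate's real entry point.
fn main() {
    divan::main();
}

// Stand-in for the actual keyword-extraction call being measured.
fn extract_keywords(text: &str) -> usize {
    text.split_whitespace().count()
}

#[divan::bench]
fn text_3kb(bencher: divan::Bencher) {
    let text = "lorem ipsum dolor sit amet ".repeat(120); // roughly 3 kB of input
    bencher.bench(|| extract_keywords(divan::black_box(&text)));
}
```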

@xamgore merged commit c8b1a97 into master on Feb 8, 2025 (10 checks passed).
@xamgore deleted the unloop branch on February 8, 2025, at 15:08.
@xamgore (Collaborator) commented Feb 8, 2025

With my latest changes, it looks like we won another 10% with this PR. On my laptop that works out to roughly 1 ms per 1 kB; with the power plugged in, it is even better.

bench          fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ text_3kb    3.122 ms      │ 4.182 ms      │ 3.173 ms      │ 3.193 ms      │ 18775   │ 18775
├─ text_100kb  104.4 ms      │ 109.4 ms      │ 105.4 ms      │ 105.8 ms      │ 567     │ 567
╰─ text_170b   144 µs        │ 305.7 µs      │ 146.8 µs      │ 148.6 µs      │ 67226   │ 67226
