Conversation

@bunny-therapist (Collaborator)

Partially addresses #46

Context and vocabulary building each had its own separate loop over all tokens, and the tag functions were called multiple times for the same word. Occurrences stored shifts and offsets even though the only relevant information was whether the occurrence was at the start of a sentence or not.

This merges context and vocabulary building into one function with a single loop that calculates the tag for each occurrence only once.
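
For illustration, here is a minimal sketch of what such a single pass can look like. The names (`Tag`, `WordStats`, `build_vocabulary_and_context`, the tagging rules) are simplified stand-ins rather than the crate's actual types and API; the point is only that each occurrence is tagged exactly once and that vocabulary and context are filled in the same loop.

```rust
// Hypothetical, simplified sketch (not the crate's actual types): one pass over
// the tokenized sentences that tags each word once and fills both the
// vocabulary (per-word stats) and the context (co-occurrence) data.
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Debug)]
enum Tag {
    Digit,
    Unparsable,
    Acronym,
    Uppercase,
    Parsable,
}

fn tag_of(word: &str, first_in_sentence: bool) -> Tag {
    // Placeholder tagging rules, only here to make the sketch self-contained.
    if word.chars().all(|c| c.is_ascii_digit()) {
        Tag::Digit
    } else if !word.chars().any(char::is_alphanumeric) {
        Tag::Unparsable
    } else if !first_in_sentence && word.chars().all(char::is_uppercase) {
        Tag::Acronym
    } else if !first_in_sentence && word.chars().next().map_or(false, char::is_uppercase) {
        Tag::Uppercase
    } else {
        Tag::Parsable
    }
}

#[derive(Default, Debug)]
struct WordStats {
    occurrences: Vec<Tag>,     // one tag per occurrence, computed exactly once
    sentence_starts: usize,    // the only positional fact that actually mattered
    left_context: Vec<String>, // co-occurring words from the sliding window
}

fn build_vocabulary_and_context(
    sentences: &[Vec<String>],
    window_size: usize,
) -> HashMap<String, WordStats> {
    let mut stats: HashMap<String, WordStats> = HashMap::new();

    for sentence in sentences {
        let mut window: Vec<(String, Tag)> = Vec::new();

        for (i, word) in sentence.iter().enumerate() {
            let key = word.to_lowercase();
            // The tag is computed once per occurrence and reused everywhere below.
            let tag = tag_of(word, i == 0);

            let entry = stats.entry(key.clone()).or_default();
            entry.occurrences.push(tag);
            if i == 0 {
                entry.sentence_starts += 1;
            }

            // Context building in the same pass: skip digits and unparsable words.
            if tag != Tag::Digit && tag != Tag::Unparsable {
                for (left_word, left_tag) in &window {
                    if *left_tag != Tag::Digit && *left_tag != Tag::Unparsable {
                        entry.left_context.push(left_word.clone());
                    }
                }
            }

            window.push((key, tag));
            if window.len() > window_size {
                window.remove(0);
            }
        }
    }

    stats
}

fn main() {
    let sentences = vec![
        vec!["Rust".to_string(), "is".to_string(), "fast".to_string()],
        vec!["Keyword".to_string(), "extraction".to_string(), "in".to_string(), "Rust".to_string()],
    ];
    let stats = build_vocabulary_and_context(&sentences, 2);
    println!("{:?}", stats.get("rust"));
}
```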

I suspect that extract_features and filtering could also be merged into the same loop.

I assume that reducing the number of loops and function calls is a performance improvement, but I will rely on @xamgore to verify this :)

Reverting could impact the results. The current code does not recalculate tags, so there is nothing to gain from replacing the use of the existing tag with new function calls.
if !Tag::is_numeric(word) && !self.is_unparsable(word) {
for &(left_word, left_uterm) in window.iter() {
if Tag::is_numeric(left_word) || self.is_unparsable(left_word) {
if tag != Tag::Digit && tag != Tag::Unparsable {
@bunny-therapist (Collaborator, Author)

We can easily make "tags to discard" configurable and just use that here. LIAAD/yake wanted that, but never got to it.
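
A minimal sketch of what that could look like, assuming a hypothetical `Config` struct with a `discard_tags` set and a simplified `Tag` enum (illustrative names, not the crate's actual API):

```rust
// Hypothetical sketch only: a configurable set of "tags to discard", used in
// place of the hard-coded `tag != Tag::Digit && tag != Tag::Unparsable` check.
use std::collections::HashSet;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum Tag {
    Digit,
    Unparsable,
    Parsable,
}

struct Config {
    /// Occurrences with these tags are skipped during context building.
    discard_tags: HashSet<Tag>,
}

impl Default for Config {
    fn default() -> Self {
        Config {
            // Mirrors the current hard-coded behavior.
            discard_tags: HashSet::from([Tag::Digit, Tag::Unparsable]),
        }
    }
}

fn contributes_to_context(config: &Config, tag: Tag) -> bool {
    !config.discard_tags.contains(&tag)
}

fn main() {
    let config = Config::default();
    assert!(!contributes_to_context(&config, Tag::Digit));
    assert!(contributes_to_context(&config, Tag::Parsable));
}
```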

@bunny-therapist (Collaborator, Author)

I just noticed that the CI checks are duplicated: there is one set for pull_request and one for pull_request_target, which is completely unnecessary. I will look into how to fix it.

@xamgore (Collaborator) commented Feb 8, 2025

I like it! Allocations are better, and so is the speed: 4-6% faster.

before

bench          fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ text_3kb    3.523 ms      │ 4.558 ms      │ 3.572 ms      │ 3.579 ms      │ 27925   │ 27925
├─ text_100kb  122.1 ms      │ 130.2 ms      │ 124.2 ms      │ 124.6 ms      │ 803     │ 803
╰─ text_170b   159.2 µs      │ 391 µs        │ 162.7 µs      │ 164 µs        │ 609118  │ 609118

after

bench          fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ text_3kb    3.378 ms      │ 22.48 ms      │ 3.416 ms      │ 3.424 ms      │ 29191   │ 29191
├─ text_100kb  115.7 ms      │ 120.3 ms      │ 116.5 ms      │ 116.5 ms      │ 858     │ 858
╰─ text_170b   151 µs        │ 357.6 µs      │ 153.5 µs      │ 154.1 µs      │ 648169  │ 648169
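
The table format matches the output of the divan benchmarking crate. Purely as an illustration, a benchmark of this shape could be set up roughly as below, assuming divan as a dev-dependency, a bench target with `harness = false`, and a placeholder `extract_keywords` function standing in for the real extraction entry point.

```rust
// Hypothetical sketch of a divan benchmark with the same shape as the tables
// above; `extract_keywords` is a placeholder, not the crate's real entry point.
fn main() {
    divan::main();
}

// Stand-in for the actual keyword-extraction call being measured.
fn extract_keywords(text: &str) -> usize {
    text.split_whitespace().count()
}

#[divan::bench]
fn text_3kb(bencher: divan::Bencher) {
    let text = "lorem ipsum dolor sit amet ".repeat(120); // roughly 3 kB of input
    bencher.bench(|| extract_keywords(divan::black_box(&text)));
}
```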

@xamgore merged commit c8b1a97 into master on Feb 8, 2025 (10 checks passed).
@xamgore deleted the unloop branch on February 8, 2025, at 15:08.
@xamgore (Collaborator) commented Feb 8, 2025

With my latest changes, it looks like we won another 10% with this PR. On my laptop that works out to roughly 1 ms per 1 kB; with the power plugged in, it is even better.

bench          fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ text_3kb    3.122 ms      │ 4.182 ms      │ 3.173 ms      │ 3.193 ms      │ 18775   │ 18775
├─ text_100kb  104.4 ms      │ 109.4 ms      │ 105.4 ms      │ 105.8 ms      │ 567     │ 567
╰─ text_170b   144 µs        │ 305.7 µs      │ 146.8 µs      │ 148.6 µs      │ 67226   │ 67226
