Fix BPE tokenization alignment in ATEPC aspect extraction predictions #420
Problem
Fixes #417
ATEPC (Aspect Term Extraction and Polarity Classification) models were failing to correctly identify aspects during inference, especially with multilingual data where BPE tokenization splits words into multiple subtokens.
For example, when training on Spanish data with "amables" (kind) marked as a positive aspect, the model would incorrectly identify "todos" (all) as the aspect instead. This happened even when predicting on the exact same text the model was trained on.
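As a rough illustration of how a word like "amables" ends up split, one can run a multilingual subword tokenizer over such a sentence (the model name and sentence below are only examples, and the exact pieces depend on the tokenizer's vocabulary):

```python
from transformers import AutoTokenizer

# Illustrative only: any multilingual WordPiece/BPE tokenizer shows the effect;
# the exact subword pieces depend on its vocabulary.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")
print(tokenizer.tokenize("todos son muy amables"))
# A typical split: ['todos', 'son', 'muy', 'am', '##ables'] -- one word, two subtokens
```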
Root Cause
The bug was in the `_extract` method of `aspect_extractor.py` (lines 551-567). When mapping BPE token predictions back to original tokens, the code collected predictions sequentially without using the `valid_ids` tensor to filter which predictions correspond to the start of original words.

When a word like "amables" gets tokenized into multiple BPE subtokens (["am", "##ables"]), the code would incorrectly collect predictions from both the first subtoken and the continuation subtokens, causing misalignment with the original token sequence. This resulted in predictions being mapped to the wrong words.
Example:
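A minimal sketch of the misalignment, using illustrative variable names rather than the actual ones in `_extract`:

```python
# Sentence: "todos son muy amables" -- only "amables" is split by BPE.
orig_tokens    = ["todos", "son", "muy", "amables"]
bpe_tokens     = ["[CLS]", "todos", "son", "muy", "am", "##ables", "[SEP]"]
valid_ids      = [0, 1, 1, 1, 1, 0, 0]        # 1 = first subtoken of a word
subtoken_preds = ["O", "O", "O", "O", "B-ASP", "I-ASP", "O"]

# Buggy: collect predictions sequentially for every subtoken, so the
# word/label pairing drifts as soon as special tokens or subword splits appear.
buggy = list(zip(orig_tokens, subtoken_preds))
# -> [('todos', 'O'), ('son', 'O'), ('muy', 'O'), ('amables', 'O')]  aspect lost

# Fixed: keep only the prediction at the first subtoken of each original word.
fixed = [p for p, v in zip(subtoken_preds, valid_ids) if v == 1]
# -> ['O', 'O', 'O', 'B-ASP']  "amables" is correctly labelled as the aspect
```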
Solution
Added a check to only collect predictions where `valid_ids[i][j] == 1`, which marks the first BPE subtoken of each original word. This ensures proper alignment between BPE token predictions and the original tokens.

Changes:
- Converted the `valid_ids` tensor to a CPU numpy array (line 551)
- Added an `if valid_ids[i][j] == 1:` check before appending predictions (line 566)

Impact

With the alignment restored, ATEPC inference maps each prediction back to the correct original word, so aspects such as "amables" are identified instead of neighboring words like "todos".
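As a concrete check, re-running inference on the training sentence should now recover the trained aspect. A hedged sketch, assuming PyABSA's v1-style `ATEPCCheckpointManager` / `extract_aspect` entry points and an illustrative checkpoint name:

```python
from pyabsa import ATEPCCheckpointManager

# Illustrative checkpoint; in the linked issue this would be the user's own
# Spanish-trained ATEPC checkpoint rather than the public multilingual one.
extractor = ATEPCCheckpointManager.get_aspect_extractor(checkpoint="multilingual")

result = extractor.extract_aspect(
    inference_source=["todos son muy amables"],
    pred_sentiment=True,
)
# Each record lists the extracted aspects and their sentiments; before the fix
# the aspect could drift to "todos", afterwards it aligns with "amables".
print(result[0])
```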
Testing
Created unit tests that demonstrate the bug and verify that the fix correctly handles BPE tokenization alignment across multiple scenarios.
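A stripped-down version of the alignment check these tests exercise (hypothetical test name, illustrative data):

```python
def test_bpe_alignment_uses_valid_ids():
    # [CLS] todos son muy am ##ables [SEP]
    valid_ids      = [0, 1, 1, 1, 1, 0, 0]
    subtoken_preds = ["O", "O", "O", "O", "B-ASP", "I-ASP", "O"]

    # Keep exactly one prediction per original word: the one at its first subtoken.
    aligned = [p for p, v in zip(subtoken_preds, valid_ids) if v == 1]

    assert aligned == ["O", "O", "O", "B-ASP"]  # aspect stays on "amables"
```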