Fix BPE tokenization alignment in ATEPC aspect extraction predictions #420
Problem
Fixes #417
ATEPC (Aspect Term Extraction and Polarity Classification) models were failing to correctly identify aspects during inference, especially with multilingual data where BPE tokenization splits words into multiple subtokens.
For example, when training on Spanish data with "amables" (kind) marked as a positive aspect, the model would incorrectly identify "todos" (all) as the aspect instead. This happened even when predicting on the exact same text the model was trained on.
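As a rough illustration of how a word like "amables" ends up split, one can run a multilingual subword tokenizer over such a sentence (the model name and sentence below are only examples, and the exact pieces depend on the tokenizer's vocabulary):

```python
from transformers import AutoTokenizer

# Illustrative only: any multilingual WordPiece/BPE tokenizer shows the effect;
# the exact subword pieces depend on its vocabulary.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")
print(tokenizer.tokenize("todos son muy amables"))
# A typical split: ['todos', 'son', 'muy', 'am', '##ables'] -- one word, two subtokens
```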
Root Cause
The bug was in the `_extract` method of `aspect_extractor.py` (lines 551-567). When mapping BPE token predictions back to original tokens, the code collected predictions sequentially without using the `valid_ids` tensor to filter which predictions correspond to the start of original words.

When a word like "amables" gets tokenized into multiple BPE subtokens (["am", "##ables"]), the code would incorrectly collect predictions from both the first subtoken and the continuation subtokens, causing misalignment with the original token sequence. This resulted in predictions being mapped to the wrong words.
Example:
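A minimal sketch of the misalignment, using illustrative variable names rather than the actual ones in `_extract`:

```python
# Sentence: "todos son muy amables" -- only "amables" is split by BPE.
orig_tokens    = ["todos", "son", "muy", "amables"]
bpe_tokens     = ["[CLS]", "todos", "son", "muy", "am", "##ables", "[SEP]"]
valid_ids      = [0, 1, 1, 1, 1, 0, 0]        # 1 = first subtoken of a word
subtoken_preds = ["O", "O", "O", "O", "B-ASP", "I-ASP", "O"]

# Buggy: collect predictions sequentially for every subtoken, so the
# word/label pairing drifts as soon as special tokens or subword splits appear.
buggy = list(zip(orig_tokens, subtoken_preds))
# -> [('todos', 'O'), ('son', 'O'), ('muy', 'O'), ('amables', 'O')]  aspect lost

# Fixed: keep only the prediction at the first subtoken of each original word.
fixed = [p for p, v in zip(subtoken_preds, valid_ids) if v == 1]
# -> ['O', 'O', 'O', 'B-ASP']  "amables" is correctly labelled as the aspect
```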
Solution
Added a check to only collect predictions where `valid_ids[i][j] == 1`, which marks the first BPE subtoken of each original word. This ensures proper alignment between BPE token predictions and the original tokens.

Changes:
- Converted the `valid_ids` tensor to a CPU numpy array (line 551)
- Added an `if valid_ids[i][j] == 1:` check before appending predictions (line 566)

Impact

With the alignment restored, ATEPC inference maps each prediction back to the correct original word, so aspects such as "amables" are identified instead of neighboring words like "todos".
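As a concrete check, re-running inference on the training sentence should now recover the trained aspect. A hedged sketch, assuming PyABSA's v1-style `ATEPCCheckpointManager` / `extract_aspect` entry points and an illustrative checkpoint name:

```python
from pyabsa import ATEPCCheckpointManager

# Illustrative checkpoint; in the linked issue this would be the user's own
# Spanish-trained ATEPC checkpoint rather than the public multilingual one.
extractor = ATEPCCheckpointManager.get_aspect_extractor(checkpoint="multilingual")

result = extractor.extract_aspect(
    inference_source=["todos son muy amables"],
    pred_sentiment=True,
)
# Each record lists the extracted aspects and their sentiments; before the fix
# the aspect could drift to "todos", afterwards it aligns with "amables".
print(result[0])
```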
Testing
Created unit tests that demonstrate the bug and verify that the fix correctly handles BPE tokenization alignment across multiple scenarios.
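A stripped-down version of the alignment check these tests exercise (hypothetical test name, illustrative data):

```python
def test_bpe_alignment_uses_valid_ids():
    # [CLS] todos son muy am ##ables [SEP]
    valid_ids      = [0, 1, 1, 1, 1, 0, 0]
    subtoken_preds = ["O", "O", "O", "O", "B-ASP", "I-ASP", "O"]

    # Keep exactly one prediction per original word: the one at its first subtoken.
    aligned = [p for p, v in zip(subtoken_preds, valid_ids) if v == 1]

    assert aligned == ["O", "O", "O", "B-ASP"]  # aspect stays on "amables"
```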