Add ImpactRangeQuery for Impact-Based Document Range Prioritization #15023

Open
wants to merge 2 commits into
base: main

atris
Contributor

@atris atris commented Aug 1, 2025

Implements a query wrapper that prioritizes document ranges based on their
scoring potential using Lucene's impact information. The implementation
divides the document space into ranges and evaluates each range's maximum
possible score using ImpactsEnum data, then processes ranges in descending
order of scoring potential.

Key features:

  • Supports configurable range sizes and minimum and maximum document bounds
  • Uses ImpactsEnum when ScoreMode.TOP_SCORES indicates impacts are available
  • Falls back to standard scoring when impacts are unavailable
  • Supports early termination based on competitive scoring thresholds

This optimization is particularly beneficial for indices where document
clustering (BP-style ordering) groups similar documents with adjacent IDs,
allowing efficient skipping of low-scoring document ranges.
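
The range-prioritization idea described above can be sketched in plain Java. This is only an illustrative model, not the code in this PR: the class `ImpactRangeSketch`, its `Range` type, and the `topK` method are hypothetical names, and a plain `float[]` of per-document scores stands in for the upper bounds that would come from ImpactsEnum data in Lucene.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class ImpactRangeSketch {
    /** Half-open doc ID range [minDoc, maxDoc) with an upper bound on any score inside it. */
    static final class Range {
        final int minDoc;
        final int maxDoc;
        final float maxScore;

        Range(int minDoc, int maxDoc, float maxScore) {
            this.minDoc = minDoc;
            this.maxDoc = maxDoc;
            this.maxScore = maxScore;
        }
    }

    /** Splits [0, scores.length) into fixed-size ranges and records each range's score bound. */
    static List<Range> buildRanges(float[] scores, int rangeSize) {
        List<Range> ranges = new ArrayList<>();
        for (int start = 0; start < scores.length; start += rangeSize) {
            int end = Math.min(start + rangeSize, scores.length);
            // Assumption: scores are non-negative. A real implementation would read
            // this bound from impact data instead of scanning every document.
            float max = 0f;
            for (int doc = start; doc < end; doc++) {
                max = Math.max(max, scores[doc]);
            }
            ranges.add(new Range(start, end, max));
        }
        return ranges;
    }

    /**
     * Collects the k highest scores into the given min-heap, visiting ranges in
     * descending order of their bound. Returns how many ranges were actually scored.
     */
    static int topK(float[] scores, int rangeSize, int k, PriorityQueue<Float> heap) {
        List<Range> ranges = buildRanges(scores, rangeSize);
        ranges.sort(Comparator.comparingDouble((Range r) -> r.maxScore).reversed());
        int visited = 0;
        for (Range r : ranges) {
            // Early termination: once the heap is full, a range whose bound does not
            // exceed the current k-th best score cannot change the result.
            if (heap.size() == k && r.maxScore <= heap.peek()) {
                break;
            }
            visited++;
            for (int doc = r.minDoc; doc < r.maxDoc; doc++) {
                float s = scores[doc];
                if (heap.size() < k) {
                    heap.add(s);
                } else if (s > heap.peek()) {
                    heap.poll();
                    heap.add(s);
                }
            }
        }
        return visited;
    }
}
```

Because the ranges are sorted by bound, the first range that fails the threshold check lets the loop stop entirely, which is where clustered doc ID orderings would be expected to help.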

Signed-off-by: Atri Sharma <[email protected]>

github-actions bot commented Aug 1, 2025

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

Contributor Author

atris commented Aug 1, 2025

@jpountz Please review

Contributor

jpountz commented Aug 1, 2025

Very cool. Have you been able to measure any speedup with this approach?

FYI, this breaks some API contracts, e.g. a BulkScorer is expected to score ranges of doc IDs in doc ID order. You would need to create a new BulkScorer every time that you need to go back.

Somewhat related, I think that implementing this via a helper function that evaluates a query against an entire index would be better than a query wrapper, since you'd be subject to fewer expectations. E.g. IndexSearcher sometimes splits the doc ID space into ranges of increasing size via TimeLimitingBulkScorer. This would likely not play well with this change, which really expects to score the whole index at once. A helper function would also allow you to order ranges by priority across all segments and not only within a single segment.
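
A rough sketch of what such a helper could look like, in plain Java rather than against the Lucene API. Everything here is a hypothetical model: `ForwardOnlyScorer` merely imitates the "score in doc ID order" contract mentioned above, each `float[]` stands in for one segment's per-document scores, and `CrossSegmentRangeSketch`, `prioritize`, and `bestScore` are made-up names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class CrossSegmentRangeSketch {
    /** Hypothetical stand-in for a per-segment scorer that may only move forward. */
    static final class ForwardOnlyScorer {
        private final float[] scores;
        private int nextAllowedDoc = 0;

        ForwardOnlyScorer(float[] scores) {
            this.scores = scores;
        }

        /** Returns the best score in [minDoc, maxDoc); refuses to revisit earlier doc IDs. */
        float scoreRange(int minDoc, int maxDoc) {
            if (minDoc < nextAllowedDoc) {
                throw new IllegalStateException("cannot go back to doc " + minDoc);
            }
            nextAllowedDoc = maxDoc;
            float best = Float.NEGATIVE_INFINITY;
            for (int doc = minDoc; doc < maxDoc; doc++) {
                best = Math.max(best, scores[doc]);
            }
            return best;
        }
    }

    /** A doc ID range within one segment, tagged with an upper bound on its scores. */
    static final class SegmentRange {
        final int segment;
        final int minDoc;
        final int maxDoc;
        final float maxScore;

        SegmentRange(int segment, int minDoc, int maxDoc, float maxScore) {
            this.segment = segment;
            this.minDoc = minDoc;
            this.maxDoc = maxDoc;
            this.maxScore = maxScore;
        }
    }

    /** Builds ranges for every segment and orders them globally by descending bound. */
    static List<SegmentRange> prioritize(float[][] segments, int rangeSize) {
        List<SegmentRange> ranges = new ArrayList<>();
        for (int seg = 0; seg < segments.length; seg++) {
            float[] scores = segments[seg];
            for (int start = 0; start < scores.length; start += rangeSize) {
                int end = Math.min(start + rangeSize, scores.length);
                // Assumption: a real helper would take this bound from impacts.
                float max = Float.NEGATIVE_INFINITY;
                for (int doc = start; doc < end; doc++) {
                    max = Math.max(max, scores[doc]);
                }
                ranges.add(new SegmentRange(seg, start, end, max));
            }
        }
        ranges.sort(Comparator.comparingDouble((SegmentRange r) -> r.maxScore).reversed());
        return ranges;
    }

    /** Finds the single best score, creating a fresh scorer for every range visited. */
    static float bestScore(float[][] segments, int rangeSize) {
        float best = Float.NEGATIVE_INFINITY;
        for (SegmentRange r : prioritize(segments, rangeSize)) {
            if (r.maxScore <= best) {
                break; // no remaining range can beat the current best
            }
            // A new scorer per range, so going "back" within a segment never
            // violates the forward-only contract of an existing scorer.
            ForwardOnlyScorer scorer = new ForwardOnlyScorer(segments[r.segment]);
            best = Math.max(best, scorer.scoreRange(r.minDoc, r.maxDoc));
        }
        return best;
    }
}
```

In this model the priority order can jump from a late range of one segment to an early range of the same segment, which is exactly the case where reusing a single forward-only scorer would fail.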
