grussdorian

This commit modifies HNSWlib to filter duplicate document IDs during KNN search, ensuring only one embedding per unique document ID is returned. Key changes include:

  • Added internal_id_to_doc_id_ vector to HierarchicalNSW to map internal IDs to document IDs, populated in addPoint.
  • Introduced getMetadata method to retrieve document IDs.
  • Extended VisitedList with seen_doc_ids set to track seen document IDs thread-locally, avoiding mutex contention.
  • Updated searchBaseLayerST to skip candidates with already-seen document IDs using vl->is_doc_seen(doc_id).
  • Removed unused visited_metadata_ and visited_metadata_lock_, as filtering is now handled by VisitedList.

The duplicate filtering works as intended. However, knnQuery may raise a RuntimeError if k exceeds the number of unique document IDs, because of result array shape constraints. Tests for basic filtering, a single ID, and large datasets pass; the empty-index and insufficient-IDs cases require further handling.
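To make the filtering idea concrete, here is a minimal standalone sketch of per-search document ID deduplication. It is not the code from this commit and does not use HNSWlib's actual classes: `VisitedListSketch`, `mark_doc_seen`, `Candidate`, and `dedup_top_k` are hypothetical names chosen for illustration, and the graph traversal is replaced by a plain sort over an already-collected candidate list. Only `seen_doc_ids`, `is_doc_seen`, and the internal-ID-to-document-ID mapping mirror names from the description above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical stand-in for a per-search visited list; not HNSWlib's class.
// Each search owns its own instance, so no mutex is needed.
struct VisitedListSketch {
    std::unordered_set<std::size_t> seen_doc_ids;

    bool is_doc_seen(std::size_t doc_id) const {
        return seen_doc_ids.count(doc_id) != 0;
    }
    void mark_doc_seen(std::size_t doc_id) { seen_doc_ids.insert(doc_id); }
    void reset() { seen_doc_ids.clear(); }
};

using Candidate = std::pair<float, std::size_t>;  // (distance, internal ID)

// Returns up to k candidates, nearest first, keeping only the closest
// embedding per document ID. The result may hold fewer than k entries
// when there are fewer than k unique document IDs.
std::vector<Candidate> dedup_top_k(
        std::vector<Candidate> candidates,
        const std::vector<std::size_t>& internal_id_to_doc_id,
        std::size_t k,
        VisitedListSketch& vl) {
    vl.reset();  // start each search with an empty seen set
    std::sort(candidates.begin(), candidates.end());  // ascending distance

    std::vector<Candidate> result;
    for (const Candidate& c : candidates) {
        if (result.size() == k) break;
        assert(c.second < internal_id_to_doc_id.size());
        const std::size_t doc_id = internal_id_to_doc_id[c.second];
        if (vl.is_doc_seen(doc_id)) continue;  // skip duplicate document
        vl.mark_doc_seen(doc_id);
        result.push_back(c);
    }
    return result;
}
```

Note that when `k` is larger than the number of unique document IDs, this sketch simply returns a shorter vector. A caller that allocates a fixed `k`-sized result array would need to handle that case, which is plausibly the situation behind the RuntimeError mentioned above.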

Files modified:

  • hnswalg.h: Added duplicate filtering logic and mappings.
  • visited_list_pool.h: Enhanced VisitedList for document ID tracking.
