Fix/qdrant collection selection #66
Pull Request Overview
This pull request improves the logic for classifying document slices per Qdrant collection by implementing a more robust approach for mapping document slices to their respective collections. The changes focus on better handling of multilingual and monolingual collection naming patterns.
- Refactored the `classify_documents_per_collection` function to use language- and model-based collection matching
- Added comprehensive test coverage for edge cases, including multilingual collections with unexpected naming patterns
- Improved code maintainability by simplifying imports and removing unused variables
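The language- and model-based matching described above can be sketched roughly as follows. This is a minimal illustration, not the code from `qdrant_handler.py`: the `Slice` class, the `classify_slices` function, the `<prefix>_<lang>_<model>` naming scheme, the `mul` tag for multilingual collections, and the choice to prefer a monolingual match over a multilingual one are all assumptions made for the example.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)

# Hypothetical tag marking a multilingual collection (assumption, not from the PR).
MULTILINGUAL_TAG = "mul"


@dataclass(frozen=True)
class Slice:
    """Hypothetical stand-in for a document slice."""
    doc_id: str
    lang: str
    model: str


def classify_slices(collection_names, slices):
    """Group slice ids under the collection matching their language and model.

    Assumes collection names end with "_<lang>_<model>". A monolingual
    match is tried first; a multilingual collection is the fallback.
    """
    result = {}  # plain dict: collection name -> set of slice ids
    for s in slices:
        mono_suffix = f"_{s.lang}_{s.model}".lower()
        multi_suffix = f"_{MULTILINGUAL_TAG}_{s.model}".lower()
        match = next(
            (c for c in collection_names if c.lower().endswith(mono_suffix)), None
        )
        if match is None:
            match = next(
                (c for c in collection_names if c.lower().endswith(multi_suffix)), None
            )
        if match is None:
            # No collection fits this slice: log the problem and skip it.
            logger.error(
                "No collection found for slice %s (lang=%s, model=%s)",
                s.doc_id, s.lang, s.model,
            )
            continue
        result.setdefault(match, set()).add(s.doc_id)
    return result


collections = ["collection_en_modelA", "collection_mul_modelB"]
slices = [
    Slice("d1", "en", "modelA"),  # monolingual match
    Slice("d2", "fr", "modelB"),  # falls back to the multilingual collection
    Slice("d3", "de", "modelC"),  # no match, logged and skipped
]
classified = classify_slices(collections, slices)
```

Matching on the name suffix keeps the sketch tolerant of arbitrary prefixes, which is one way to cope with collections whose names do not follow the expected pattern exactly.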
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| qdrant_syncronizer.py | Added logging statement to track collection processing during synchronization |
| qdrant_handler.py | Completely refactored collection classification logic and cleaned up imports/unused variables |
| test_qdrant_handler.py | Added new test case for multilingual collections with non-standard naming patterns |
Co-authored-by: Copilot <[email protected]>
… into Fix/qdrant-collection-selection # Conflicts: # welearn_datastack/modules/qdrant_handler.py
welearn_datastack/nodes_workflow/QdrantSyncronizer/qdrant_syncronizer.py
…ronizer.py Co-authored-by: Sandra Guerreiro <[email protected]>
… into Fix/qdrant-collection-selection
This pull request improves the logic for classifying document slices per Qdrant collection, updates related tests, and makes minor code cleanups. The main focus is on more accurately mapping document slices to their respective collections, especially for multilingual and monolingual scenarios.
Improvements to collection classification logic:
- Refactored the `classify_documents_per_collection` function in `qdrant_handler.py` to use a more robust approach for determining the correct collection for each document slice. The function now checks for both multilingual and monolingual collection naming patterns, and logs an error if a matching collection is not found.
- Switched `classify_documents_per_collection` from a `defaultdict` to a standard dictionary for clarity and consistency.
Testing enhancements:
- Added a new test, `test_should_handle_multiple_slices_for_same_collection_with_multi_lingual_collection_and_gibberish`, to cover scenarios where document slices belong to multilingual collections or collections with unexpected names, ensuring the new classification logic works as intended.
Logging and code cleanup:
- Added a logging statement to `qdrant_syncronizer.py` to indicate which collection is being processed during synchronization.
- Cleaned up imports and removed unused variables in `qdrant_handler.py` for better code hygiene.
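The PR gives only "clarity and consistency" as the reason for dropping `defaultdict`. One general behavioral difference that may be relevant is sketched below, using hypothetical collection names and document ids: a `defaultdict` creates an entry whenever a missing key is read with `[]`, while a plain dictionary does not.

```python
from collections import defaultdict

# With defaultdict, merely reading a missing key inserts an empty entry.
by_collection_dd = defaultdict(set)
by_collection_dd["collection_a"].add("doc-1")
_ = by_collection_dd["collection_typo"]  # silently creates an empty set

# With a plain dict, entries are created only where the code asks for them.
by_collection = {}
by_collection.setdefault("collection_a", set()).add("doc-1")
missing = by_collection.get("collection_typo")  # no entry is created

dd_keys = sorted(by_collection_dd)
plain_keys = sorted(by_collection)
```

With the plain dictionary, a mistyped or unknown collection name never shows up as an empty group in the result, so callers can iterate over the keys without filtering.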