lpi-tn
Collaborator

@lpi-tn lpi-tn commented Sep 30, 2025

This pull request improves the logic for classifying document slices per Qdrant collection, updates related tests, and makes minor code cleanups. The main focus is on more accurately mapping document slices to their respective collections, especially for multilingual and monolingual scenarios.

Improvements to collection classification logic:

  • Refactored the classify_documents_per_collection function in qdrant_handler.py to use a more robust approach for determining the correct collection for each document slice. The function now checks for both multilingual and monolingual collection naming patterns, and logs an error if a matching collection is not found.
  • Changed the result container in classify_documents_per_collection from a defaultdict to a standard dictionary for clarity and consistency.
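
The classification described above can be sketched roughly as follows. This is a minimal illustration, not the code from qdrant_handler.py: the `Slice` class, the function name `classify_slices_per_collection`, the naming patterns `collection_<lang>_<model>` and `collection_mul_<model>`, and the order in which the two patterns are tried are all assumptions made for the example.

```python
import logging
from dataclasses import dataclass
from typing import Iterable

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class Slice:
    """Hypothetical stand-in for a document slice."""
    document_id: str
    lang: str
    model: str


def classify_slices_per_collection(
    slices: Iterable[Slice], collection_names: Iterable[str]
) -> dict:
    """Group document ids by the collection their slices belong to."""
    known = set(collection_names)
    result = {}  # plain dict, not a defaultdict
    for s in slices:
        # Assumed naming patterns; the real project may use different ones.
        monolingual = f"collection_{s.lang}_{s.model}"
        multilingual = f"collection_mul_{s.model}"
        if monolingual in known:
            target = monolingual
        elif multilingual in known:
            target = multilingual
        else:
            # No matching collection: log an error and skip the slice.
            logger.error(
                "No collection found for document %s (lang=%s, model=%s)",
                s.document_id, s.lang, s.model,
            )
            continue
        result.setdefault(target, set()).add(s.document_id)
    return result
```

With a plain dict, a collection key only appears in the result once at least one slice has been assigned to it, and looking up a missing key raises KeyError instead of silently creating an empty entry.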

Testing enhancements:

  • Added a new test, test_should_handle_multiple_slices_for_same_collection_with_multi_lingual_collection_and_gibberish, to cover scenarios where document slices belong to multilingual collections or collections with unexpected names, ensuring the new classification logic works as intended.
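
A test of this kind might look like the sketch below. It is not the test added in this pull request: the `classify` helper is a compact hypothetical stand-in for the function under test, slices are modeled as `(doc_id, lang, model)` tuples, and the collection names are invented for the example.

```python
import logging
import unittest

logger = logging.getLogger("qdrant_handler_sketch")


def classify(slices, collection_names):
    """Hypothetical stand-in: slices are (doc_id, lang, model) tuples."""
    known = set(collection_names)
    result = {}
    for doc_id, lang, model in slices:
        # Assumed naming patterns, monolingual first, then multilingual.
        candidates = (f"collection_{lang}_{model}", f"collection_mul_{model}")
        target = next((c for c in candidates if c in known), None)
        if target is None:
            logger.error("No collection found for document %s", doc_id)
            continue
        result.setdefault(target, set()).add(doc_id)
    return result


class TestClassification(unittest.TestCase):
    def test_multiple_slices_multilingual_and_gibberish(self):
        # One multilingual collection and one with an unexpected name.
        names = ["collection_mul_model-a", "qwertyuiop"]
        slices = [
            ("doc1", "en", "model-a"),
            ("doc2", "fr", "model-a"),
            ("doc3", "en", "model-b"),  # no collection exists for model-b
        ]
        with self.assertLogs(logger, level="ERROR") as captured:
            result = classify(slices, names)

        # Both model-a slices land in the same multilingual collection.
        self.assertEqual(result, {"collection_mul_model-a": {"doc1", "doc2"}})
        # The oddly named collection receives nothing.
        self.assertNotIn("qwertyuiop", result)
        # The unmatched slice produces exactly one error record.
        self.assertEqual(len(captured.records), 1)
```

The case can be run with `python -m unittest` from the directory containing the file.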

Logging and code cleanup:

  • Added an info-level log statement in qdrant_syncronizer.py to indicate which collection is being processed during synchronization.
  • Simplified import statements and removed unused variables in qdrant_handler.py for better code hygiene. [1] [2]
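
As an illustration of the logging change, a per-collection info log could be placed at the top of the synchronization loop. The function `synchronize` and the `sync_one` callback below are hypothetical names used only for this sketch; the actual structure of qdrant_syncronizer.py is not shown in this conversation.

```python
import logging
from typing import Callable, Dict, Set

logger = logging.getLogger(__name__)


def synchronize(
    documents_per_collection: Dict[str, Set[str]],
    sync_one: Callable[[str, Set[str]], None],
) -> None:
    """Process each collection in turn, logging which one is being handled."""
    for collection_name, document_ids in documents_per_collection.items():
        # Info-level marker so a run can be followed collection by collection.
        logger.info("Processing collection %s", collection_name)
        sync_one(collection_name, document_ids)
```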

Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This pull request improves the logic for classifying document slices per Qdrant collection by implementing a more robust approach for mapping document slices to their respective collections. The changes focus on better handling of multilingual and monolingual collection naming patterns.

  • Refactored classify_documents_per_collection function to use language and model-based collection matching
  • Added comprehensive test coverage for edge cases including multilingual collections with unexpected naming patterns
  • Improved code maintainability by simplifying imports and removing unused variables

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| qdrant_syncronizer.py | Added logging statement to track collection processing during synchronization |
| qdrant_handler.py | Completely refactored collection classification logic and cleaned up imports/unused variables |
| test_qdrant_handler.py | Added new test case for multilingual collections with non-standard naming patterns |

lpi-tn added 2 commits October 1, 2025 10:39
… into Fix/qdrant-collection-selection

# Conflicts:
#	welearn_datastack/modules/qdrant_handler.py
@lpi-tn lpi-tn merged commit cb713ea into main Oct 1, 2025
3 checks passed
@lpi-tn lpi-tn deleted the Fix/qdrant-collection-selection branch October 1, 2025 09:07
3 participants