Use LLM to deduplicate extracted similar entities during the insertion phase #2102
+1,437
−1
Description
During the insertion phase, use an LLM to deduplicate similar extracted entities.
Related Issues
#1323
Changes Made
1. Add a new deduplicate.py that implements the deduplication logic.
2. Add a feature toggle and the related configuration in lightrag.py.
3. Add the supporting functions in operate.py.
4. Add the relevant prompts and examples in prompt.py.
Checklist
Additional Notes
1. First, retrieve all entities to be inserted and cluster them (batch_size = 30) so that the entities passed to the LLM together are sufficiently similar, which improves merge accuracy. An example cluster:
['汽车', '洋车', '車', '车', '新车', '车口', '洋车夫', '车份儿', '西安门大街人和车厂', '洋车厂子', '人和车厂', '车厂', '洋车界', '北平的洋车夫', '洋车夫派别', '年轻力壮的洋车夫', '年轻人力车夫', '车份儿和嚼谷', '买上车再说']
2. Pass only the entity_name values to the LLM for a preliminary merge, and collect the preliminary merge results for the batch.
3. Attach the entity descriptions to the preliminary results and pass them to the LLM again to decide whether the entities should actually be merged.
4. The LLM returns the final result.
5. Throughout this process, similarity matching is performed on entity_name to ensure that every name in the results corresponds to a node that already exists.
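A minimal sketch of steps 1 and 5, for illustration only. The description above does not state which clustering method or similarity metric deduplicate.py uses, so this sketch assumes a simple greedy grouping based on character-level similarity from Python's difflib. The function names (cluster_names, snap_to_existing) and the threshold and cutoff values are hypothetical and are not taken from the PR.

```python
from difflib import SequenceMatcher, get_close_matches


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (assumed metric, not the PR's)."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_names(names, threshold=0.5, batch_size=30):
    """Greedily group similar names; each cluster holds at most batch_size names."""
    clusters = []
    for name in names:
        for cluster in clusters:
            # Compare against the first member of the cluster (its seed).
            if len(cluster) < batch_size and similarity(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            # No existing cluster accepted the name, so it seeds a new one.
            clusters.append([name])
    return clusters


def snap_to_existing(llm_names, existing_names, cutoff=0.8):
    """Map each name returned by the LLM to its closest existing entity name.

    Names with no sufficiently close match are dropped, so the result only
    contains names of nodes that already exist.
    """
    kept = []
    for name in llm_names:
        match = get_close_matches(name, existing_names, n=1, cutoff=cutoff)
        if match:
            kept.append(match[0])
    return kept


if __name__ == "__main__":
    names = ["车厂", "人和车厂", "洋车夫", "北平的洋车夫"]
    for batch in cluster_names(names):
        print(batch)
    # An LLM answer with a stray space and an invented name:
    print(snap_to_existing(["人和车厂 ", "车场公司"], names))
```

Each cluster would then be sent to the LLM as one batch (steps 2 to 4), and the names it returns would be passed through a check like snap_to_existing before any merge is applied, so that slightly altered or invented names never reach the graph.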