@FloretKu FloretKu commented Sep 15, 2025

Description

During the insertion phase, use an LLM to deduplicate similar extracted entities.

Related Issues

#1323

Changes Made

1. Create a new deduplicate.py to handle the entire functionality.

2. Add a feature toggle and related configuration in lightrag.py.

    enable_deduplication: bool = field(default=False)
    deduplication_config: dict[str, Any] = field(
        default_factory=lambda: {
            "strategy": "llm_based",  # Strategy name: currently only "llm_based" is implemented
            "llm_based": {
                "batch_size": get_env_value("DEDUP_BATCH_SIZE", 30, int),
                "similarity_threshold": get_env_value(
                    "DEDUP_SIMILARITY_THRESHOLD", 0.85, float
                ),
                "system_prompt": None,  # Use default if None
                "strictness_level": get_env_value(
                    "DEDUP_STRICTNESS_LEVEL", "strict", str
                ),  # "strict", "medium", "loose"
                # strict: merge nodes ONLY if they represent the exact same real-world concept (e.g., spelling variations, synonyms, or explicit duplicates). Never merge nodes that are merely topically related.
                # medium: merge nodes if they represent the same core concept, including near-synonyms or semantically equivalent phrasing.
                # loose: merge nodes if they represent the same thematic concept, including near-synonyms or semantically equivalent phrasing.
            },
            # Future strategies can be added here by extending the architecture
            # Example: "new_strategy": { ... }
        }
    )

3. Add relevant functions in operate.py.

    if deduplication_service and all_entities_for_dedup:
        logger.info(f"Starting comprehensive entity deduplication with {len(all_entities_for_dedup)} entities")
        # Extract deduplication configuration from global_config
        dedup_config_data = global_config.get("deduplication_config", {})
        strategy_name = dedup_config_data.get("strategy", "llm_based")
        strategy_config = dedup_config_data.get(strategy_name, {})

        # Create strategy-specific configuration using ConfigFactory
        try:
            from .duplicate import ConfigFactory

            dedup_config = ConfigFactory.create_config(
                strategy_name,
                {
                    "target_batch_size": strategy_config.get("batch_size", 30),
                    "similarity_threshold": strategy_config.get(
                        "similarity_threshold", 0.85
                    ),
                    "system_prompt": strategy_config.get("system_prompt"),
                    "strictness_level": strategy_config.get(
                        "strictness_level", "strict"
                    ),
                },
            )
    .........

4. Add relevant prompts and examples in prompt.py.

PROMPTS["goal_clean_strict"]
PROMPTS["goal_clean_medium"]
PROMPTS["goal_clean_loose"]
PROMPTS["goal_clean_examples"]
PROMPTS["name_only_analysis_instruction"]
PROMPTS["secondary_merge_verification"]
PROMPTS["secondary_verification_examples"]
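
To illustrate how the configuration in item 2 and the prompt keys in item 4 might fit together, here is a minimal sketch. It is not code from this PR: `resolve_dedup_settings` and the `goal_prompt_key` field are hypothetical names, and the mapping from `strictness_level` to a `goal_clean_*` key is an assumption based only on the key names listed above.

```python
from typing import Any

# Defaults mirror the values shown in the lightrag.py snippet above.
DEFAULTS: dict[str, Any] = {
    "batch_size": 30,
    "similarity_threshold": 0.85,
    "system_prompt": None,
    "strictness_level": "strict",
}
VALID_LEVELS = ("strict", "medium", "loose")


def resolve_dedup_settings(global_config: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: merge user settings over defaults for the chosen strategy."""
    dedup = global_config.get("deduplication_config", {})
    strategy = dedup.get("strategy", "llm_based")
    settings = {**DEFAULTS, **dedup.get(strategy, {})}

    level = settings["strictness_level"]
    if level not in VALID_LEVELS:
        raise ValueError(f"Unknown strictness_level: {level!r}")

    settings["strategy"] = strategy
    # Assumption: the goal prompt is chosen by strictness level.
    settings["goal_prompt_key"] = f"goal_clean_{level}"
    return settings
```

With an empty `global_config` this falls back to the defaults; a partial `llm_based` section overrides only the keys it sets.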

Checklist

  • Changes tested locally
  • Code reviewed
  • Documentation updated (if necessary)
  • Unit tests added (if applicable)

Additional Notes

1. First, retrieve all entities to be inserted and cluster them into batches (batch_size = 30), so that the entities passed to the LLM together are sufficiently similar. This improves the accuracy of merging.

['汽车', '洋车', '車', '车', '新车', '车口', '洋车夫', '车份儿', '西安门大街人和车厂', '洋车厂子', '人和车厂', '车厂', '洋车界', '北平的洋车夫', '洋车夫派别', '年轻力壮的洋车夫', '年轻人力车夫', '车份儿和嚼谷', '买上车再说']
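
The clustering method itself is not described here, so the following is only a sketch of one way to group similar names into size-limited batches. It uses a greedy pass with the standard library's string similarity; the actual PR may use embeddings or another technique. `cluster_names` and `min_ratio` are hypothetical.

```python
from difflib import SequenceMatcher


def cluster_names(names, batch_size=30, min_ratio=0.3):
    """Greedily group names whose spelling resembles a cluster's first member.

    Each cluster holds at most batch_size names. A name that matches no
    cluster with free space starts a new cluster.
    """
    clusters = []
    for name in names:
        best, best_score = None, min_ratio
        for cluster in clusters:
            if len(cluster) >= batch_size:
                continue  # this batch is already full
            score = SequenceMatcher(None, name, cluster[0]).ratio()
            if score >= best_score:
                best, best_score = cluster, score
        if best is None:
            clusters.append([name])
        else:
            best.append(name)
    return clusters
```

Comparing only against the first member keeps the sketch short. A real implementation would likely compare against a centroid or all members.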

2. Pass only the entity_name values to the LLM for a preliminary merge. The LLM returns the initial merge groups for the batch.

{
    "merge": [
        {"summary": "车","keywords": ["汽车","車","车","新车"]},
        {"summary": "西安门大街人和车厂","keywords": ["西安门大街人和车厂","人和车厂","洋车厂子","车厂"]},
        ..........
    ]
}
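
A response in this shape can be turned into an alias-to-canonical lookup. The sketch below assumes that the response is complete, valid JSON and that "summary" is the name to keep while "keywords" lists the members of the group. That reading is inferred from the example, and `alias_map_from_response` is a hypothetical helper.

```python
import json


def alias_map_from_response(raw: str) -> dict[str, str]:
    """Map every merged name to the canonical name of its group."""
    data = json.loads(raw)
    alias_to_canonical: dict[str, str] = {}
    for group in data.get("merge", []):
        canonical = group["summary"]
        for alias in group["keywords"]:
            if alias != canonical:
                # Keep the first assignment if a name appears in two groups.
                alias_to_canonical.setdefault(alias, canonical)
    return alias_to_canonical
```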

3. Attach the entity descriptions to the preliminary results and pass them to the LLM again to decide whether the entities in each group should really be merged.

1. 汽车
   Description: 祥子买的汽车是他辛勤工作的结果,也是他生活的象征和希望的来源。
2. 車
   Description: 祥子希望通过卖骆驼买一辆车。
3. 车
   Description: <SEP>祥子租赁了一辆破旧的车来练习拉车的技术。<SEP>祥子的车是他生活的依靠,他相信这辆车能产生烙饼和其他食物,是万能的土地。<SEP>祥子的车被兵匪劫走,成为他不幸经历的一部分。<SEP>祥子拥有的交通工具,被抢走后成为了他心中难以忘怀的事情。
4. 新车
   Description: 新车是指祥子想要购买的车辆,具有弓子软、铜活地道等特性。
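
A listing like the one above could be produced from a name-to-description mapping roughly as follows. This is a sketch, not the PR's code: `format_candidates` is hypothetical, and unlike the sample above it splits on the `<SEP>` marker and joins the fragments with spaces, which is a design choice made here for readability.

```python
SEP = "<SEP>"


def format_candidates(names, descriptions):
    """Render a numbered list of candidate names with their descriptions."""
    lines = []
    for index, name in enumerate(names, start=1):
        fragments = [
            part.strip()
            for part in descriptions.get(name, "").split(SEP)
            if part.strip()
        ]
        text = " ".join(fragments) if fragments else "(none)"
        lines.append(f"{index}. {name}")
        lines.append(f"   Description: {text}")
    return "\n".join(lines)
```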

4. The LLM returns the final result.

{
    "merge": [
        {"summary": "汽车","keywords": ["汽车", "車", "车"]}
    ]
}

5. Throughout this process, similarity matching is performed on entity_name to ensure that every name in the results corresponds to an originally existing node.
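
One way to do this kind of check is to snap each name returned by the LLM to the closest original name and drop anything that does not clear the threshold. The sketch below uses `difflib.get_close_matches` from the standard library; the PR's actual matching logic may differ. `snap_to_existing` is a hypothetical helper, and the 0.85 default is reused from `similarity_threshold` as an assumption.

```python
from difflib import get_close_matches


def snap_to_existing(returned, existing, threshold=0.85):
    """Keep only names that are, or closely resemble, an existing entity name."""
    resolved = []
    for name in returned:
        if name in existing:
            resolved.append(name)
            continue
        # Closest existing name at or above the threshold, if any.
        match = get_close_matches(name, existing, n=1, cutoff=threshold)
        if match:
            resolved.append(match[0])
    return resolved
```

This guards against the LLM inventing or slightly altering an entity name, since only names that were present in the input can be merged.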

@danielaskdd
Collaborator

This is a highly anticipated feature, and I’ll be able to dedicate time to researching and testing it only after addressing my current tasks. Please resolve the conflicts with the main branch first. Thank you.

@FloretKu
Author

> This is a highly anticipated feature, and I’ll be able to dedicate time to researching and testing it only after addressing my current tasks. Please resolve the conflicts with the main branch first. Thank you.

The conflict has been resolved. Thank you for your dedication and support to the project.
