Entity Matching with Sentence Transformers and FAISS

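This project performs blocking for entity matching between two datasets, using LLM embeddings and FAISS-based approximate nearest neighbor search. For each record in one dataset, blocking retrieves the top-k most similar records from the other, so that only those candidate pairs need to be compared downstream.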
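
The blocking idea can be illustrated with a minimal sketch. It is not the project's code: it uses exact brute-force search in NumPy instead of FAISS, and the toy 2-D vectors, the function name `topk_candidates`, and the variable names are all hypothetical stand-ins for real model embeddings.

```python
import numpy as np

def topk_candidates(query_emb, corpus_emb, k):
    """Return, for each query row, the indices of the k most similar corpus rows."""
    # L2-normalize so that the inner product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
    scores = q @ c.T  # shape: (num_queries, num_corpus)
    k = min(k, c.shape[0])
    # Sort each row by descending similarity and keep the first k columns.
    return np.argsort(-scores, axis=1)[:, :k]

# Toy 2-D "embeddings" standing in for real model output (assumption).
table_a = np.array([[1.0, 0.0], [0.0, 1.0]], dtype=np.float32)
table_b = np.array([[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]], dtype=np.float32)

candidates = topk_candidates(table_a, table_b, k=2)
# Each row of table_a is paired only with its k nearest rows of table_b.
candidate_pairs = [(i, int(j)) for i, row in enumerate(candidates) for j in row]
```

With k=2, the third row of `table_b` never appears in `candidate_pairs`, which is the point of blocking: dissimilar pairs are dropped before any detailed comparison.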

1. Set up a Virtual Environment

# Create a virtual environment
python3 -m venv venv

# Activate the virtual environment
source venv/bin/activate

2. Install Required Packages

# Upgrade pip
pip install --upgrade pip

# Install project dependencies
pip install -r requirements_gpu.txt

3. Prepare the Dataset

Ensure the following files exist under data/amazon_google/:

data/amazon_google/Amazon.csv
data/amazon_google/GoogleProducts.csv
data/amazon_google/Amzon_GoogleProducts_perfectMapping.csv

📂 Example Project Structure:

your_project/
├── data/
│   └── amazon_google/
│       ├── Amazon.csv
│       ├── GoogleProducts.csv
│       └── Amzon_GoogleProducts_perfectMapping.csv
├── amazon-google.py
└── requirements_gpu.txt
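
Before running the program, it may help to confirm the three files are in place. The following is a small optional check, not part of the project; the helper name `missing_dataset_files` is hypothetical, and it assumes the script is run from the project root.

```python
from pathlib import Path

# File names as listed above (the "Amzon" spelling is the actual file name).
REQUIRED_FILES = (
    "Amazon.csv",
    "GoogleProducts.csv",
    "Amzon_GoogleProducts_perfectMapping.csv",
)

def missing_dataset_files(data_dir="data/amazon_google"):
    """Return the required file names that are not present under data_dir."""
    base = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]

if __name__ == "__main__":
    missing = missing_dataset_files()
    if missing:
        print("Missing dataset files:", ", ".join(missing))
    else:
        print("All dataset files found.")
```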

4. Run the Program

python amazon-google.py \
    --batch_size 32 \
    --gpus 0 1 \
    --topk 10 \
    --model Alibaba-NLP/gte-large-en-v1.5 \
    --embedding_dim 1024 \
    --use_fp16

5. Argument Descriptions

| Argument | Description | Default |
|---|---|---|
| `--batch_size` | Batch size for embedding sentences | `1` |
| `--gpus` | GPU IDs to use (space-separated) | `0` |
| `--topk` | Number of top candidates retrieved per query | `10` |
| `--model` | Huggingface model name for embeddings | `'Alibaba-NLP/gte-large-en-v1.5'` |
| `--embedding_dim` | Dimension of embeddings (depends on model) | `1024` |
| `--use_fp16` | (Flag) Use half-precision (fp16) inference | `False` |

⚠️ Note: --use_fp16 is a flag and takes no value. Add --use_fp16 to the command to enable it; omit it to keep the default (False).
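
For reference, arguments with these names and defaults could be declared with `argparse` roughly as follows. This is a sketch reconstructed from the table above, not the parser in amazon-google.py, which may differ; `build_parser` is a hypothetical name.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Blocking for entity matching")
    parser.add_argument("--batch_size", type=int, default=1)
    # nargs="+" accepts one or more space-separated GPU IDs, e.g. --gpus 0 1
    parser.add_argument("--gpus", type=int, nargs="+", default=[0])
    parser.add_argument("--topk", type=int, default=10)
    parser.add_argument("--model", type=str, default="Alibaba-NLP/gte-large-en-v1.5")
    parser.add_argument("--embedding_dim", type=int, default=1024)
    # store_true makes this a value-less flag: present means True, absent means False
    parser.add_argument("--use_fp16", action="store_true")
    return parser

args = build_parser().parse_args(["--gpus", "0", "1", "--use_fp16"])
print(args.gpus, args.use_fp16, args.batch_size)  # [0, 1] True 1
```

Passing a value after a `store_true` flag (for example `--use_fp16 True`) would make argparse reject the extra token, which is why the flag is given on its own.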

6. Notes

Ensure your system has CUDA-enabled GPUs if you want to use the --gpus argument.
If --use_fp16 is enabled, embeddings are computed faster and with lower memory usage, but the final embeddings must be cast back to float32 before the FAISS search.
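
The cast back to float32 can be sketched as below. This assumes the embeddings are already available as a NumPy array; the helper name `to_faiss_input` and the all-ones array are hypothetical, and the project's own conversion may look different.

```python
import numpy as np

def to_faiss_input(embeddings):
    """Return the embeddings as a C-contiguous float32 array."""
    # FAISS's Python API works on float32 arrays, so fp16 output is upcast here.
    return np.ascontiguousarray(embeddings, dtype=np.float32)

# Stand-in for fp16 model output: 3 vectors of dimension 4 (assumption).
fp16_embeddings = np.ones((3, 4), dtype=np.float16)
faiss_ready = to_faiss_input(fp16_embeddings)
```

The upcast does not recover precision lost during fp16 inference; it only puts the array in the dtype and memory layout the index expects.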

Sujan242/distributed-sematic-search-entity-matching
