• This project implements scalable, Ray-based enrichment of datasets with CLIP embeddings to enable retrieval with image and text queries, cluster analysis, and deduplication.
• The notebook provides code to read in images, generate embeddings, and publish the dataset via Hugging Face.
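
As a rough illustration of how embeddings support retrieval and deduplication, the sketch below ranks items by cosine similarity to a query vector and flags pairs whose similarity exceeds a threshold. It is not the notebook's code: the helper names (`normalize`, `top_k`, `find_near_duplicates`), the 3-dimensional toy vectors, and the 0.98 threshold are all assumptions made for illustration. Real CLIP embeddings have hundreds of dimensions and would normally be handled with a vectorized library rather than plain Python lists.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine(a, b):
    """Dot product of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query, embeddings, k=3):
    """Return the indices of the k embeddings most similar to the query, best first."""
    q = normalize(query)
    scores = [(cosine(q, normalize(e)), i) for i, e in enumerate(embeddings)]
    # Sort by descending similarity; break ties by index for determinism.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scores[:k]]


def find_near_duplicates(embeddings, threshold=0.98):
    """Return index pairs (i, j), i < j, whose cosine similarity is >= threshold.

    The threshold is a hypothetical value; a suitable cutoff depends on the data.
    This is an O(n^2) comparison, fine for a sketch but not for large datasets.
    """
    unit = [normalize(e) for e in embeddings]
    pairs = []
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            if cosine(unit[i], unit[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Toy stand-ins for image embeddings (hypothetical values, not real CLIP output).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.999, 0.01, 0.0],  # nearly identical to item 0
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

nearest = top_k([1.0, 0.0, 0.0], toy_embeddings, k=2)
duplicates = find_near_duplicates(toy_embeddings)
```

With these toy vectors, `nearest` is `[0, 1]` and `duplicates` is `[(0, 1)]`. A text query works the same way, provided the text is embedded into the same space as the images.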



All images were sourced from Francesco/insects-mytwu.
The published dataset is available at https://huggingface.co/datasets/hkanade/insect_image_retrieval/ and can be loaded with the Hugging Face datasets library:
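from datasets import load_dataset

# Download the published dataset from the Hugging Face Hub.
ds_new = load_dataset("hkanade/insect_image_retrieval")
# Access the image of the first example in the train split.
ds_new["train"][0]["image"]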