This repository contains experiments and code for working with Retrieval-Augmented Generation (RAG), vector embeddings, and document ingestion/parsing using Python and popular libraries such as LangChain, HuggingFace, FAISS, and ChromaDB.
- Data Ingestion & Parsing: Notebooks and scripts for parsing and ingesting documents (PDF, DOCX, etc.) for downstream NLP tasks.
- Vector Embeddings & Databases: Examples of generating vector embeddings for text using HuggingFace models and storing/querying them with vector databases like FAISS and ChromaDB.
- Jupyter Notebooks: Interactive notebooks for step-by-step experimentation and visualization.
.
├── main.py
├── requirements.txt
├── pyproject.toml
├── .env
├── .gitignore
├── data/
│ └── attention .pdf
├── DataIngestionParsing/
│ └── dataingestion.ipynb
├── VectorEmbeddingAndDatabases/
│ └── embedding.ipynb
- main.py: Entry point for running the project.
- requirements.txt / pyproject.toml: Python dependencies.
- data/: Example data files.
- DataIngestionParsing/: Notebooks for document ingestion and parsing.
- VectorEmbeddingAndDatabases/: Notebooks for embedding and vector database experiments.
- Python Version: Requires Python 3.13 (see .python-version).
- Install Dependencies: Use pip or your preferred environment manager:
  pip install -r requirements.txt
- Environment Variables: Create a .env file for any required API keys or configuration.
- Run Notebooks: Open the notebooks in VS Code or JupyterLab and run the cells interactively.
- Run Main Script:
  python main.py
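For the environment-variable step, the following is a minimal sketch of reading a .env file using only the Python standard library. The helper names (`parse_env`, `load_env_file`) and the variable names (`EXAMPLE_API_KEY`, `EXAMPLE_MODE`) are hypothetical and chosen for illustration; the project may instead rely on a package such as python-dotenv, which this README does not confirm.

```python
import os
from pathlib import Path


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Trim whitespace and one layer of surrounding quotes.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Load a .env file into os.environ without overriding existing values."""
    env_path = Path(path)
    if not env_path.is_file():
        return {}
    values = parse_env(env_path.read_text(encoding="utf-8"))
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values


# Hypothetical .env contents, parsed from a string for demonstration.
example = parse_env("# comment\nEXAMPLE_API_KEY='abc123'\n\nEXAMPLE_MODE = dev\n")
print(example)
```

Calling `load_env_file()` at the top of a script or notebook would make the values available through `os.environ`. Keeping .env listed in .gitignore keeps keys out of version control.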
- Data Ingestion: See DataIngestionParsing/dataingestion.ipynb
- Vector Embedding & Visualization: See VectorEmbeddingAndDatabases/embedding.ipynb
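To give a feel for the embed, store, and query flow that the embedding experiments cover, here is a dependency-free sketch. It is not the notebooks' code: `toy_embed` is a hypothetical hashed bag-of-words stand-in for a HuggingFace embedding model, and `ToyVectorStore` is a hypothetical brute-force stand-in for FAISS or ChromaDB. The sample documents are made up for illustration.

```python
import hashlib
import math
from collections import Counter

DIM = 256  # toy embedding size (assumption); real models define their own


def toy_embed(text: str, dim: int = DIM) -> list[float]:
    """Hash each word into one of `dim` buckets and L2-normalise the counts.

    Captures word overlap only; a real embedding model captures meaning.
    """
    counts = Counter(
        int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        for word in text.lower().split()
    )
    vec = [float(counts.get(i, 0)) for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class ToyVectorStore:
    """In-memory store with brute-force cosine-similarity search."""

    def __init__(self) -> None:
        self._texts: list[str] = []
        self._vectors: list[list[float]] = []

    def add(self, texts: list[str]) -> None:
        """Embed each text and keep it alongside its vector."""
        for text in texts:
            self._texts.append(text)
            self._vectors.append(toy_embed(text))

    def query(self, text: str, k: int = 2) -> list[tuple[str, float]]:
        """Return the k stored texts most similar to the query."""
        q = toy_embed(text)
        # Vectors are unit length, so the dot product is the cosine similarity.
        scored = [
            (doc, sum(a * b for a, b in zip(q, vec)))
            for doc, vec in zip(self._texts, self._vectors)
        ]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


docs = [
    "attention is all you need",
    "vector databases store embeddings",
    "pdf parsing extracts text",
]
store = ToyVectorStore()
store.add(docs)
results = store.query("attention is all you need", k=2)
for doc, score in results:
    print(f"{score:.3f}  {doc}")
```

A dedicated vector database replaces the linear scan in `query` with an index, which matters once the collection grows beyond what a full scan can handle.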
Key libraries used:
See requirements.txt and pyproject.toml for the full list.
License: MIT (add a LICENSE file to state the terms explicitly).
This repo is a playground for learning and experimenting with RAG, embeddings, and vector databases. Contributions and suggestions are welcome!