NaS Knowledge Model

The NaS Knowledge Model is a modular Retrieval-Augmented Generation (RAG) system for large-scale ingestion and embedding of biomedical literature, and for fine-tuning a language model on it. Built as a research and infrastructure platform, it supports scalable training on open-access biomedical content, with scheduled, autonomous ingestion, fine-tuning, and deployment.

Highlights

FastAPI‑powered RAG API with top‑k semantic retrieval and context‑aware generation
Automated ingestion from PubMed Central, plus a drop‑box for arbitrary PDFs
Clean chunking and storage by year & month (data/clean/YYYY/MM)
Vector indexing using FAISS and SentenceTransformers
LoRA‑based fine‑tuning of Google TxGemma‑2B (MedGemma) on Apple Silicon
Ten distinct training buckets (instructions • dialogues • cited QA • structured tables • sequences • CoT • rag_pairs • tool_calls • safety • eval_holdout) merged automatically before each run
Dual‑bucket storage: *-dataset (LoRA corpora) and *-pdfs (raw RAG corpus)
Prefect‑orchestrated end‑to‑end pipeline with daily autonomous RAG‑refresh + weekly LoRA update

Directory Overview

knowledge-model/
├── api/                       # FastAPI RAG endpoint
├── adapters/                  # LoRA adapters per training batch
├── data/
│   ├── clean/YYYY/MM/         # Chunked, cleaned article text
│   └── index/YYYY/MM/         # FAISS index shards
├── deployments/               # Prefect deployment wrappers
├── ingestion/                 # PubMed + PDF fetching (parallel / back‑off)
├── pipelines/
│   ├── flows/                 # Prefect flow definitions
│   ├── tasks/                 # Task wrappers (fetch, build_faiss, eval)
│   └── utils/                 # Time helpers etc.
├── processing/                # Text cleaning and chunking pipeline
├── training/                  # Fine‑tuning with PEFT/LoRA
├── tests/                     # Unit tests and eval query set
└── README.md
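
As a rough illustration of the data/clean/YYYY/MM layout, the sketch below splits cleaned text into overlapping word windows and derives the target directory from a publication date. The function names (chunk_text, clean_dir_for), the chunk size, and the overlap are assumptions made for illustration and are not taken from the repository's processing code.

```python
from datetime import date
from pathlib import Path


def chunk_text(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` words; consecutive windows share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def clean_dir_for(published: date, root: Path = Path("data/clean")) -> Path:
    """Return the year/month directory for an article, e.g. data/clean/2024/03."""
    return root / f"{published.year:04d}" / f"{published.month:02d}"
```

Zero-padding the month keeps directory names sortable as plain strings, which is convenient when listing shards in order.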

RAG API

Accepts user questions and performs semantic retrieval from embedded biomedical chunks
Uses LoRA‑fine‑tuned TxGemma‑2B (MedGemma) as the response generator
Automatically trims retrieved context to fit the token limit and returns the answer with cited sources
Runs locally via FastAPI and can be deployed to the cloud
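
The retrieval and trimming steps listed above can be sketched without external dependencies. This stand-in ranks chunks by cosine similarity in plain Python rather than through FAISS, and approximates token counts by whitespace-separated words rather than the model tokenizer. The names top_k and trim_context and the budget parameter are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 3) -> list[int]:
    """Return the indices of the k chunks most similar to the query, best first."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


def trim_context(chunks: list[str], budget: int) -> list[str]:
    """Keep chunks in rank order until the approximate token budget is used up."""
    kept, used = [], 0
    for chunk in chunks:
        cost = len(chunk.split())  # crude stand-in for a tokenizer count
        if used + cost > budget:
            break
        kept.append(chunk)
        used += cost
    return kept
```

Trimming in rank order means the lowest-scoring chunks are dropped first when the prompt would exceed the limit.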

All retrieval uses the newest FAISS index, which the pipeline selects automatically.
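
One way the newest index could be located under data/index/YYYY/MM is to compare the year and month directory names numerically. The helper name latest_index_dir is hypothetical, and the sketch assumes purely numeric directory names; how the pipeline actually performs the selection is not described here.

```python
from pathlib import Path
from typing import Optional


def latest_index_dir(root: Path) -> Optional[Path]:
    """Return the most recent <root>/YYYY/MM directory, or None if there is none."""
    candidates = []
    for month_dir in root.glob("*/*"):
        year, month = month_dir.parent.name, month_dir.name
        # Skip anything that is not a numeric year/month directory pair.
        if month_dir.is_dir() and year.isdigit() and month.isdigit():
            candidates.append(((int(year), int(month)), month_dir))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]
```

For example, latest_index_dir(Path("data/index")) would return the 2024/11 directory if shards exist for 2023/12, 2024/02, and 2024/11.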

Pipeline & Automation

The entire workflow is orchestrated by Prefect:

Refresh‑Corpus – crawls data/corpus/raw/ for new PDFs, converts them to clean text, and rebuilds the FAISS index.
Eval‑Snapshot – runs a fixed recall@10 check; the flow fails if the score is below 0.80.
Finetune‑LoRA – trains TxGemma adapters on the updated ten‑bucket corpus.
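
The Eval-Snapshot gate can be illustrated with a small recall@k computation. The definition used here (the fraction of queries whose expected chunk id appears among the top k results) is an assumption, as are the function names recall_at_k and check_snapshot. Only k = 10 and the 0.80 threshold come from the description above.

```python
def recall_at_k(results: dict[str, list[str]], expected: dict[str, str], k: int = 10) -> float:
    """Fraction of queries whose expected id is among the first k retrieved ids."""
    if not expected:
        raise ValueError("expected must contain at least one query")
    hits = sum(1 for query, gold in expected.items() if gold in results.get(query, [])[:k])
    return hits / len(expected)


def check_snapshot(score: float, threshold: float = 0.80) -> None:
    """Raise if the score falls below the threshold, so the calling flow fails."""
    if score < threshold:
        raise RuntimeError(f"recall@10 {score:.2f} is below threshold {threshold:.2f}")
```

Raising an exception is a simple way to make an orchestrated step fail; a score exactly equal to the threshold passes in this sketch.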

A daily deployment (pipelines/flows/continuous.py) is scheduled for 03:00 local time via a Prefect cron schedule and is picked up by a Prefect worker polling the default queue.

Model & Training Pipeline

Adapters are re‑trained weekly on the merged ten‑bucket corpus
Base model: google/txgemma‑2b‑predict with 4‑bit QLoRA
All chunked data is stored in data/clean/YYYY/MM
Training files are written to data/science_articles/YYYY-MM.jsonl
LoRA adapters are saved and versioned per batch in adapters/
All artifacts are uploaded to AWS S3 using the integrated upload module
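
A minimal sketch of merging the ten buckets into a single monthly training file follows. It assumes each bucket lives in a <bucket>.jsonl file inside one directory and that each record is tagged with its bucket name; the file names, the tagging, and the function name merge_buckets are all assumptions. The bucket list and the data/science_articles/YYYY-MM.jsonl target come from this README.

```python
import json
from pathlib import Path

# Bucket names from the Highlights list; the snake_case file names are assumed.
BUCKETS = [
    "instructions", "dialogues", "cited_qa", "structured_tables", "sequences",
    "cot", "rag_pairs", "tool_calls", "safety", "eval_holdout",
]


def merge_buckets(bucket_dir: Path, out_file: Path) -> int:
    """Concatenate every bucket file into out_file and return the record count."""
    out_file.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with out_file.open("w", encoding="utf-8") as out:
        for bucket in BUCKETS:
            src = bucket_dir / f"{bucket}.jsonl"
            if not src.exists():
                continue  # this sketch tolerates missing buckets
            with src.open(encoding="utf-8") as fh:
                for line in fh:
                    if not line.strip():
                        continue  # skip blank lines
                    record = json.loads(line)
                    record["bucket"] = bucket  # keep provenance for later filtering
                    out.write(json.dumps(record, ensure_ascii=False) + "\n")
                    written += 1
    return written
```

Tagging each record with its source bucket lets a later step filter or reweight buckets without re-reading the original files.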

Technologies Used

Python 3.12
HuggingFace Transformers + PEFT
FAISS (vector store)
FastAPI (API layer)
PyMuPDF (PDF parsing)
PubMed E-Utilities (article ingestion)
AWS S3 (storage backend)
Prefect 2.x (orchestration)
BeautifulSoup4 + lxml (PDF link discovery)
tqdm / concurrent.futures (parallel ingestion)
Apple Silicon / Metal (MPS) backend for local fine‑tuning
