A minimal service to ingest, summarize, and search news articles using FastAPI, LangChain, and PostgreSQL with pgvector for semantic search capabilities. Includes automatic scheduled ingestion and a comprehensive evaluation framework.
- Article Ingestion: Extract and process news articles from URLs using LangChain loaders.
- AI Summarization: Generate concise summaries using OpenAI GPT models.
- Semantic Search: Vector-based similarity search using pgvector and cosine distance.
- Automatic Scheduled Ingestion: Continuously ingest articles for configured topics using APScheduler.
- Topic Management: Easy configuration and management of news topics and sources (RSS support included).
- Deduplication: Prevent duplicate articles using URL-based deduplication.
- Evaluation Framework:
- Golden Datasets: Manage ground truth datasets for evaluation.
- Multi-type Evaluation: Assess retrieval (Precision/Recall/F1), generation (ROUGE), and end-to-end RAG performance.
- Ragas Integration: Calculate advanced metrics like Faithfulness and Answer Relevancy.
- MLFlow Integration: Track experiments, metrics, and model performance.
- RESTful API: Clean, documented API endpoints with OpenAPI/Swagger integration.
- Error Handling & Monitoring: Robust error handling, structured logging, and health checks.
- Docker Support: Containerized deployment with Docker Compose.
- Database Migrations: Automated schema management with Alembic.
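The URL-based deduplication mentioned above can be sketched in a few lines. This is an illustrative, stand-alone version — the function names and normalization rules are assumptions, and the real service would enforce uniqueness with a database constraint rather than an in-memory set:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Canonicalize a URL so trivially different forms compare equal."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"  # treat /news/a and /news/a/ as one
    # Lowercase scheme and host, drop the fragment
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))

def url_fingerprint(url: str) -> str:
    """Stable hash suitable for a unique-constraint column."""
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()

seen: set[str] = set()  # in-memory stand-in for a database unique index

def is_duplicate(url: str) -> bool:
    fp = url_fingerprint(url)
    if fp in seen:
        return True
    seen.add(fp)
    return False

print(is_duplicate("https://Example.com/news/a/"))  # False — first sighting
print(is_duplicate("https://example.com/news/a"))   # True — same article after normalization
```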
| Method | Endpoint | Description |
|---|---|---|
| GET | /healthz | Health check endpoint |
| GET | /docs | Interactive API documentation |
| Core RAG | ||
| POST | /api/v1/ingest/url | Ingest article from URL |
| GET | /api/v1/search | Search articles with semantic similarity |
| GET | /api/v1/content/{id} | Get full article content by ID |
| Topics & Ingestion | ||
| POST | /api/v1/topics | Create a new topic |
| GET | /api/v1/topics | List all topics |
| GET | /api/v1/topics/{id} | Get topic details |
| PATCH | /api/v1/topics/{id} | Update a topic |
| DELETE | /api/v1/topics/{id} | Delete a topic |
| POST | /api/v1/topics/{id}/sources | Add source to topic |
| GET | /api/v1/topics/{id}/sources | List topic sources |
| DELETE | /api/v1/topics/{id}/sources/{source_id} | Delete source |
| POST | /api/v1/topics/{id}/ingest | Trigger manual ingestion |
| GET | /api/v1/topics/stats/summary | Get ingestion statistics |
| Evaluation Framework | ||
| POST | /api/v1/evaluation/golden-datasets | Create a golden dataset |
| POST | /api/v1/evaluation/run | Run an evaluation against a dataset |
| GET | /api/v1/evaluation/history | Get evaluation run history |
| POST | /api/v1/evaluation/compare | Compare metrics across multiple runs |
- Clone and configure the environment:

  ```bash
  git clone <repository-url>
  cd news-rag

  # Create a .env file for configuration
  touch .env
  ```
- Set the required environment variables in `.env`:

  Note: The `DATABASE_URL` below matches the credentials and service name defined in `docker-compose.yml`.

  ```env
  # Connection string for the 'db' service in Docker Compose
  DATABASE_URL=postgresql+asyncpg://user:password@db:5432/newsragdb
  OPENAI_API_KEY=your_openai_api_key_here

  # Optional: If using MLFlow tracking server
  # MLFLOW_TRACKING_URI=http://localhost:5000
  ```
- Start the services:

  ```bash
  docker compose up --build
  ```
- Access the API:

  - API Documentation: http://localhost:8080/docs
  - Health Check: http://localhost:8080/healthz
- Python 3.10+
- PostgreSQL (16+) with pgvector extension
- OpenAI API key
This project uses Poetry for dependency management.
- Install Poetry (if not already installed):

  ```bash
  curl -sSL https://install.python-poetry.org | python3 -
  ```

- Install dependencies:

  ```bash
  poetry install
  ```
- Set up environment variables: create a `.env` file in the root directory.

  ```env
  # Example for local development connection
  DATABASE_URL=postgresql+asyncpg://user:password@localhost:5432/newsragdb
  OPENAI_API_KEY=your_openai_api_key_here
  ```
- Run migrations:

  ```bash
  poetry run alembic upgrade head
  ```

- Run the development server:

  ```bash
  poetry run uvicorn app.main:app --reload --host 0.0.0.0 --port 8080
  ```
(Examples assume the service is running on http://localhost:8080.)

Ingest an article:

```bash
curl -X POST "http://localhost:8080/api/v1/ingest/url" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.bbc.co.uk/news/articles/cy85905dj2wo"}'
```

Search articles:

```bash
curl -X GET "http://localhost:8080/api/v1/search?query=university%20merger&k=5"
```

Create a topic:

```bash
curl -X POST "http://localhost:8080/api/v1/topics" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Technology News",
    "slug": "tech-news",
    "schedule_interval_minutes": 120,
    "is_active": true
  }'
```

Configuration is managed via environment variables, loaded using Pydantic Settings.
| Variable | Description | Example |
|---|---|---|
| DATABASE_URL | PostgreSQL connection string (must use asyncpg driver) | postgresql+asyncpg://user:pass@host:5432/db |
| OPENAI_API_KEY | OpenAI API key for LLM and embeddings | sk-... |
| Variable | Default | Description |
|---|---|---|
| General | ||
| LLM_MODEL | gpt-5-mini | OpenAI model for summarization |
| EMB_MODEL | text-embedding-3-small | OpenAI model for embeddings |
| EMB_DIM | 1536 | Embedding dimension (must match model) |
| LOG_LEVEL | INFO | Logging level |
| API_KEY | None | Optional API key for endpoint protection (X-API-Key header) |
| ALLOWED_DOMAINS | None | Comma-separated list of allowed domains for ingestion |
| Ingestion & Scheduling | ||
| HTTP_FETCH_TIMEOUT_SECONDS | 15 | URL fetch timeout |
| MAX_CONTENT_CHARS | 200000 | Maximum content length to process |
| ENABLE_SCHEDULER | True | Enable automatic scheduled ingestion |
| DEFAULT_SCHEDULE_INTERVAL_MINUTES | 60 | Default ingestion interval |
| RSS_DATE_THRESHOLD_HOURS | 24 | How far back to look in RSS feeds |
| Evaluation | ||
| ENABLE_EVALUATION | True | Enable the evaluation framework |
| MLFLOW_TRACKING_URI | http://localhost:5000 | URI for the MLFlow tracking server |
| EVALUATION_BATCH_SIZE | 10 | Batch size for evaluation processing |
| AUTO_EVALUATE_ON_UPDATE | False | Automatically run evaluation when models change |
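In the project these variables are loaded through Pydantic Settings; the stdlib-only sketch below illustrates the same idea of typed, defaulted environment configuration (field names and parsing rules here are assumptions, not the actual settings class):

```python
import os
from dataclasses import dataclass, field

def _as_bool(raw: str) -> bool:
    """Parse common truthy strings from environment variables."""
    return raw.strip().lower() in ("1", "true", "yes", "on")

@dataclass
class Settings:
    """Illustrative stdlib stand-in for a Pydantic Settings class."""
    database_url: str = field(default_factory=lambda: os.environ["DATABASE_URL"])
    openai_api_key: str = field(default_factory=lambda: os.environ["OPENAI_API_KEY"])
    llm_model: str = field(default_factory=lambda: os.environ.get("LLM_MODEL", "gpt-5-mini"))
    emb_model: str = field(default_factory=lambda: os.environ.get("EMB_MODEL", "text-embedding-3-small"))
    emb_dim: int = field(default_factory=lambda: int(os.environ.get("EMB_DIM", "1536")))
    enable_scheduler: bool = field(default_factory=lambda: _as_bool(os.environ.get("ENABLE_SCHEDULER", "true")))

# settings = Settings()  # reads the process environment at instantiation time
```

Required variables raise `KeyError` when missing; optional ones fall back to the defaults from the table above.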
The service supports automatic, scheduled ingestion of articles for configured topics.
Topics can be configured via the API or using a YAML configuration file.
See the "Create a Topic" usage example above. After creating a topic, add sources:
```bash
# Add an RSS source to the topic (replace {topic_id} with the actual ID)
curl -X POST "http://localhost:8080/api/v1/topics/{topic_id}/sources" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "OilPrice.com RSS",
    "url": "https://oilprice.com/rss/main",
    "source_type": "rss",
    "is_active": true
  }'
```

Create a `config/topics.yaml` file. Topics defined here are loaded automatically on startup by `app/utils/topic_loader.py`.
```yaml
# config/topics.yaml
topics:
  - name: "Crude Oil Markets"
    slug: "crude-oil-markets"
    schedule_interval_minutes: 60
    is_active: true
    sources:
      - name: "OilPrice.com RSS"
        url: "https://oilprice.com/rss/main"
        source_type: "rss"
```

Monitor ingestion status and history via the API:
```bash
# Get overall statistics
curl "http://localhost:8080/api/v1/topics/stats/summary"

# Get ingestion history for a topic
curl "http://localhost:8080/api/v1/topics/{topic_id}/runs"
```

The service includes a comprehensive framework for evaluating the performance of the RAG pipeline using golden datasets. It integrates with Ragas for metric calculation and MLFlow for experiment tracking.
- Golden Dataset: A curated set of queries with expected answers and/or expected retrieved documents (ground truth).
- Evaluation Run: An execution of the RAG pipeline against a Golden Dataset.
- Retrieval Evaluation (`retrieval`): Assesses the accuracy of the vector search component.
  - Metrics: Precision@k, Recall@k, F1 Score.
- Generation Evaluation (`generation`): Assesses the quality of the generated summaries/answers compared to ground-truth answers.
  - Metrics: ROUGE-1, ROUGE-2, ROUGE-L.
- End-to-End Evaluation (`end_to_end`): Assesses the entire pipeline from query to final answer.
  - Metrics (powered by Ragas): Context Precision, Context Recall, Faithfulness, Answer Relevancy.
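The retrieval metrics above reduce to a few lines of plain Python; the sketch below also includes a toy ROUGE-1 score for intuition (function names are illustrative, and real ROUGE implementations add stemming and the ROUGE-2/ROUGE-L variants):

```python
from collections import Counter

def retrieval_metrics(retrieved: list[str], relevant: set[str], k: int) -> dict[str, float]:
    """Precision@k, Recall@k and F1 over retrieved vs. ground-truth article IDs."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: clipped unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if not overlap:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(retrieval_metrics(["a", "b", "c", "d"], {"a", "c", "e"}, k=4))
```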
To track evaluations visually, run an MLFlow server:

```bash
poetry run mlflow server --host 0.0.0.0 --port 5000
```

Ensure `MLFLOW_TRACKING_URI` in the application's `.env` points to this server.
Define your evaluation criteria and ground truth data.
```bash
curl -X POST "http://localhost:8080/api/v1/evaluation/golden-datasets" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Q4 2025 Technology Trends Evaluation",
    "version": "1.0.0",
    "queries": [
      {
        "query_text": "What are the recent advancements in AI?",
        "expected_answer": "Recent advancements include large multimodal models...",
        "expected_article_ids": ["uuid-of-relevant-article-1"]
      }
    ]
  }'
```

Tip: Use the `scripts/init_golden_dataset.py` script to bootstrap an initial dataset based on existing articles.
Trigger an evaluation run against the dataset.
```bash
curl -X POST "http://localhost:8080/api/v1/evaluation/run" \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_id": "uuid-of-the-dataset",
    "evaluation_type": "end_to_end"
  }'
```

Results are stored in the database and tracked in MLFlow.
View History:

```bash
curl "http://localhost:8080/api/v1/evaluation/history?dataset_id=uuid-of-the-dataset"
```

Compare Runs:

```bash
curl -X POST "http://localhost:8080/api/v1/evaluation/compare" \
  -H "Content-Type: application/json" \
  -d '["uuid-of-run-1", "uuid-of-run-2"]'
```

MLFlow UI: Access the MLFlow UI (default: http://localhost:5000) to view detailed metrics and trends.
The project includes demo scripts to showcase functionality:
- `demo_ingestion.py`: Demonstrates the automatic ingestion workflow via API calls.
- `demo_evaluation.py`: A self-contained demonstration of the evaluation logic.
- `scripts/evaluation_demo.py`: An end-to-end demo running evaluations against the service.
- Optional API Key Authentication: Protect endpoints with the `API_KEY` environment variable (clients send it in the `X-API-Key` header).
- SSRF Protection: Validates ingested URLs to prevent access to internal networks (localhost, private IPs, reserved ranges). Implemented in `app/core/security.py`.
- Domain Restrictions: Limit ingestion to specific domains using the `ALLOWED_DOMAINS` environment variable.
- Input Validation: Comprehensive request validation with Pydantic.
- Request ID Tracking: Structured logging with request correlation IDs for traceability.
```text
                                ┌────────────────┐
                                │   OpenAI API   │
                                │ (LLM + Embed)  │
                                └────────────────┘
                                        ▲
                                        │
┌──────────────┐     ┌────────────┐     │     ┌───────────┐     ┌────────────┐
│  Scheduler   ├────▶│  FastAPI   │◀────┴─────┤ LangChain ├────▶│ PostgreSQL │
│ (APScheduler)│     │ (Async API)│           │Integration│     │ + pgvector │
└──────────────┘     └────────────┘           └───────────┘     └────────────┘
                           │
                           ▼
                    ┌─────────────┐
                    │ Evaluation  │
                    │(Ragas/ROUGE)│
                    └─────────────┘
                           │
                           ▼
                    ┌────────────┐
                    │   MLFlow   │
                    │  Tracking  │
                    └────────────┘
```
- FastAPI: Modern, async web framework.
- LangChain: Document loading (NewsURLLoader, WebBaseLoader) and LLM integration.
- pgvector: PostgreSQL extension for vector similarity search.
- Alembic: Database migration management.
- APScheduler: For scheduled ingestion tasks (`app/core/scheduler.py`).
- Ragas/ROUGE: For evaluation metrics.
- MLFlow: Experiment tracking and metric logging.
The project uses pytest. Configuration is in pytest.ini.
Run the test suite (assuming tests are present in the `tests` directory):

```bash
poetry run pytest
```

The `docker-compose.yml` file defines the application and database services.
```yaml
version: '3.8'
services:
  app:
    build: .
    # ... (see docker-compose.yml for full configuration)
    environment:
      - DATABASE_URL=postgresql+asyncpg://user:password@db:5432/newsragdb
    # ...
  db:
    # Uses PostgreSQL 17 image as defined in docker-compose.yml
    image: pgvector/pgvector:pg17
    # ...
```

- Vector Search: Uses cosine distance. Embeddings are normalized (L2 norm) client-side before storage.
- Indexing: Utilizes ivfflat indexes on the `articles.embedding` column for efficient vector search.
- Connection Pooling: Async SQLAlchemy connection pool configured in `app/db/session.py`.
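With unit-length (L2-normalized) vectors, cosine distance equals one minus the dot product, which is what pgvector's `<=>` operator computes over the indexed column. A small sketch of the normalization and that equivalence:

```python
import math

def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length, mirroring client-side normalization."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def cosine_distance(a: list[float], b: list[float]) -> float:
    """For unit-length vectors, cosine distance is simply 1 - dot product."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])
print(abs(round(cosine_distance(a, a), 6)))  # 0.0 — identical direction
print(round(cosine_distance(a, b), 6))       # 0.04
```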
View application logs:
```bash
# Docker Compose
docker compose logs -f app
```

Enable debug logging: set `LOG_LEVEL=DEBUG` in the `.env` file.