ArXiv Paper Curator

An automated system for discovering, filtering, and summarizing relevant research papers from arXiv with continuous learning from user feedback. Supports multiple research topics with individual configurations and scheduling.

Features

Multi-Topic Support: Configure multiple research topics with different search criteria and schedules
Advanced Search: Supports arXiv Advanced Search URLs for complex queries
AI-Powered Filtering: Uses Claude Opus to evaluate paper relevance (1-10 scale)
Code Detection: Automatically identifies papers with GitHub repositories
Smart Summarization: Generates concise summaries using Claude with feedback learning
Email Digests: Sends HTML/text digests via Postmark with customizable scheduling
Feedback Learning: Learns from email reply edits to improve future summaries
Database Tracking: SQLite database for papers, edits, and email queue management
Retry Capability: Email queue system with automatic retry for failed sends

Setup

Clone and Install:

git clone <repository>
cd arxiv-curator
pip install -r requirements.txt

Configure Environment:
- Copy .env.example to .env
- Copy topics.example.yaml to topics.yaml
- Add your API keys:
  - CLAUDE_API_KEY: Claude API key from Anthropic
  - POSTMARK_SERVER_TOKEN: Postmark API token
  - POSTMARK_FROM_EMAIL: Your verified sender email
  - DEFAULT_RECIPIENT_EMAIL: Default recipient for digests

Configure Topics:
- Edit topics.yaml to define your research interests
- Each topic can have:
  - Multiple search queries or Advanced Search URLs
  - Custom schedule (days of week)
  - Minimum relevance score threshold
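  - Recipient email override

Set Up Gmail (optional, for reply processing):

```
import ezgmail
ezgmail.init()  # Opens a browser window for the OAuth flow on first run
```
ezgmail.init() expects a credentials.json file (downloaded from the Google Cloud Console) in the working directory and creates token.json once the OAuth flow completes.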

Test Run:

python arxiv_curator.py --dry-run --force-run

Schedule with Cron:

crontab -e
# Add the cron jobs from crontab_config.txt

Usage

Command Line Options

python arxiv_curator.py - Normal run (respects day schedule)
python arxiv_curator.py --force-run - Run regardless of schedule
python arxiv_curator.py --check-replies-only - Only process email replies
python arxiv_curator.py --dry-run - Test without sending emails
python arxiv_curator.py --rebuild-summaries - Regenerate all summaries
python arxiv_curator.py --retry-emails - Retry failed emails from queue
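
The flags above could be wired up with argparse roughly as follows. This is a minimal sketch, not the script's actual parser: the helper name `build_parser` and the help strings are assumptions, and only the flag names come from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the documented flags (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="arxiv_curator.py")
    parser.add_argument("--force-run", action="store_true",
                        help="run regardless of each topic's schedule")
    parser.add_argument("--check-replies-only", action="store_true",
                        help="only process email replies")
    parser.add_argument("--dry-run", action="store_true",
                        help="run without sending emails")
    parser.add_argument("--rebuild-summaries", action="store_true",
                        help="regenerate all summaries")
    parser.add_argument("--retry-emails", action="store_true",
                        help="retry failed emails from the queue")
    return parser


# Equivalent to: python arxiv_curator.py --dry-run --force-run
args = build_parser().parse_args(["--dry-run", "--force-run"])
print(args.dry_run, args.force_run, args.retry_emails)
```

argparse converts dashes to underscores, so `--force-run` is read back as `args.force_run`.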

Email Feedback Format

Reply to digest emails with:

Selection: 1-3, 5, 7-10, 15

Edits:
Paper 3: [Your improved summary here]
Paper 7: [Another improved summary]
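
A reply in this format could be parsed along the following lines. This is an illustrative sketch only: the function names `parse_selection` and `parse_reply` are hypothetical, and it assumes each edit fits on a single line and that the reply body has already been stripped of quoted text.

```python
import re

# Matches lines such as "Paper 3: improved summary text".
EDIT_RE = re.compile(r"^Paper\s+(\d+):\s*(.+)$")


def parse_selection(line: str) -> list[int]:
    """Expand 'Selection: 1-3, 5' into [1, 2, 3, 5]."""
    _, _, spec = line.partition(":")
    numbers: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            numbers.extend(range(start, end + 1))
        else:
            numbers.append(int(part))
    return numbers


def parse_reply(body: str) -> tuple[list[int], dict[int, str]]:
    """Return (selected paper numbers, {paper number: edited summary})."""
    selection: list[int] = []
    edits: dict[int, str] = {}
    for raw in body.splitlines():
        line = raw.strip()
        if line.lower().startswith("selection:"):
            selection = parse_selection(line)
            continue
        match = EDIT_RE.match(line)
        if match:
            edits[int(match.group(1))] = match.group(2).strip()
    return selection, edits


reply = """Selection: 1-3, 5, 7-10, 15

Edits:
Paper 3: Shorter summary.
Paper 7: Another improved summary
"""
selection, edits = parse_reply(reply)
print(selection)
print(edits)
```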

Configuration

Topics Configuration (`topics.yaml`)

topics:
  topic_id:
    name: "Topic Display Name"
    searches:
      - query: "keyword search"  # Simple search
      - url: "https://arxiv.org/search/advanced?..."  # Advanced search URL
    max_results: 30
    days_back: 7
    min_relevance_score: 7
    schedule_days: ["Mon", "Wed", "Fri"]  # or ["daily"]
    recipient_email: "recipient@example.com"  # placeholder address
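
To illustrate how `schedule_days` might be interpreted, the sketch below decides whether a topic is due on a given date. It is a guess at the behaviour, not the project's code: the function name `should_run`, the plain-dict representation of a parsed topic, and the fallback to daily when `schedule_days` is missing are all assumptions.

```python
from datetime import date

# Abbreviations in the order used by date.weekday() (0 = Monday).
WEEKDAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]

# Hypothetical in-memory form of one topics.yaml entry after parsing.
topic = {
    "name": "Topic Display Name",
    "schedule_days": ["Mon", "Wed", "Fri"],
    "min_relevance_score": 7,
}


def should_run(topic: dict, today: date, force_run: bool = False) -> bool:
    """Return True if the topic is scheduled for `today` (assumed semantics)."""
    if force_run:  # mirrors the --force-run flag
        return True
    # Assumption: a topic without schedule_days runs every day.
    days = [d.lower() for d in topic.get("schedule_days", ["daily"])]
    if "daily" in days:
        return True
    return WEEKDAYS[today.weekday()] in days


print(should_run(topic, date(2025, 1, 6)))  # a Monday
print(should_run(topic, date(2025, 1, 7)))  # a Tuesday
```

Using `date.weekday()` with a fixed list avoids the locale dependence of `strftime("%a")`.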

Environment Variables (`.env`)

Key settings:

CLAUDE_MODEL: Model to use (default: claude-opus-4-20250514)
MAX_PAPERS_PER_EMAIL: Papers per digest (default: 20)
MIN_PAPERS_TO_SEND: Minimum papers required to send email (default: 3)
ENABLE_CODE_DETECTION: Detect GitHub links (default: true)
ENABLE_PDF_DOWNLOAD: Download PDFs for analysis (default: true)
ENABLE_REPLY_PROCESSING: Process email replies (default: true)
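
One way to read these settings with their documented defaults is sketched below. The `Settings` dataclass, the `load_settings` helper, and the set of strings treated as true are assumptions made for illustration; only the variable names and default values come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    claude_model: str
    max_papers_per_email: int
    min_papers_to_send: int
    enable_code_detection: bool
    enable_pdf_download: bool
    enable_reply_processing: bool


def _env_bool(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Parse a boolean flag; the accepted spellings are an assumption."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return Settings(
        claude_model=env.get("CLAUDE_MODEL", "claude-opus-4-20250514"),
        max_papers_per_email=int(env.get("MAX_PAPERS_PER_EMAIL", "20")),
        min_papers_to_send=int(env.get("MIN_PAPERS_TO_SEND", "3")),
        enable_code_detection=_env_bool(env, "ENABLE_CODE_DETECTION", True),
        enable_pdf_download=_env_bool(env, "ENABLE_PDF_DOWNLOAD", True),
        enable_reply_processing=_env_bool(env, "ENABLE_REPLY_PROCESSING", True),
    )


# Passing a dict instead of os.environ keeps the example deterministic.
settings = load_settings({"MAX_PAPERS_PER_EMAIL": "5", "ENABLE_PDF_DOWNLOAD": "false"})
print(settings)
```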

Database Schema

The system uses SQLite with four main tables:

papers: All discovered papers with metadata, scores, and summaries
summary_edits: User feedback for continuous improvement
email_logs: Tracking of sent digests and replies
email_queue: Queue for reliable email delivery with retry
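
For orientation, the four tables might be created along these lines. Only the table names come from the list above; every column name and type here is a hypothetical placeholder, so check the actual schema in the project's database before relying on any of them.

```python
import sqlite3

# Table names are from the README; all columns are illustrative guesses.
SCHEMA = """
CREATE TABLE IF NOT EXISTS papers (
    arxiv_id        TEXT PRIMARY KEY,
    title           TEXT NOT NULL,
    relevance_score INTEGER,
    summary         TEXT
);
CREATE TABLE IF NOT EXISTS summary_edits (
    id             INTEGER PRIMARY KEY,
    arxiv_id       TEXT NOT NULL REFERENCES papers(arxiv_id),
    edited_summary TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS email_logs (
    id       INTEGER PRIMARY KEY,
    topic_id TEXT NOT NULL,
    sent_at  TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS email_queue (
    id       INTEGER PRIMARY KEY,
    payload  TEXT NOT NULL,
    status   TEXT NOT NULL DEFAULT 'pending',
    attempts INTEGER NOT NULL DEFAULT 0
);
"""


def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create any missing tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


conn = init_db()
tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```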

Monitoring

Check logs in ./logs/arxiv_curator.log for:

Papers discovered
Filtering decisions
Email status
Error messages

Troubleshooting

No papers found:

Check that search terms are not too specific
Verify that the arXiv API is accessible
For Advanced Search URLs, test them in a browser first
Check that the days_back setting isn't too restrictive
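
To check API access independently of the curator, a query URL for the public arXiv API can be built and fetched by hand. The sketch below is an assumption about how such a check might look, not the project's `ArxivManager`: the helper names are hypothetical, and the `all:` field prefix and sort parameters are just one reasonable choice.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(query: str, max_results: int = 30) -> str:
    """Build an arXiv API URL that searches all fields, newest first."""
    params = {
        "search_query": f"all:{query}",
        "start": 0,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"


def is_reachable(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with HTTP 200 (performs a network call)."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return False


url = build_query_url("graph neural networks", max_results=5)
print(url)
```

Calling `is_reachable(url)` from the machine that runs the cron job shows whether the API can be reached from there; an empty result feed with a 200 response points at the search terms instead.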

Emails not sending:

Verify Postmark token and sender email
Check recipient email is correct
Use --retry-emails to retry failed sends
Check email queue in database
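
Inspecting the queue could look something like the sketch below. The `email_queue` table name is from this README, but the columns (`status`, `attempts`, `last_error`), the `'failed'` status value, and the retry cap are assumptions; the demo runs against a throwaway in-memory table, so adapt the query to the real schema.

```python
import sqlite3


def failed_emails(conn: sqlite3.Connection, max_attempts: int = 3) -> list[tuple]:
    """Return (id, attempts, last_error) for failed rows under the retry cap."""
    return conn.execute(
        "SELECT id, attempts, last_error FROM email_queue "
        "WHERE status = 'failed' AND attempts < ? ORDER BY id",
        (max_attempts,),
    ).fetchall()


# Demo against an in-memory table with the assumed columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE email_queue (id INTEGER PRIMARY KEY, status TEXT, "
    "attempts INTEGER, last_error TEXT)"
)
conn.executemany(
    "INSERT INTO email_queue (status, attempts, last_error) VALUES (?, ?, ?)",
    [("sent", 1, None), ("failed", 1, "timeout"), ("failed", 3, "bad token")],
)
rows = failed_emails(conn)
print(rows)
```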

Replies not processing:

Ensure ezgmail is initialized
Check Gmail API credentials
Verify ENABLE_REPLY_PROCESSING is true

High API costs:

Adjust BATCH_SIZE for filtering
Increase MIN_RELEVANCE_SCORE threshold
Reduce max_results per topic

Architecture

The system follows a modular design:

ArxivManager: Handles arXiv API searches and Advanced URL parsing
ClaudeManager: Manages AI filtering and summarization
CodeDetector: Identifies GitHub repositories in papers
EmailManager: Handles Postmark sending and Gmail reply processing
Database: SQLite interface with migration support
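
As an example of what one of these components might do internally, the sketch below shows a simple regex-based approach to GitHub link detection. It is not the project's `CodeDetector`; the pattern, the `find_github_repos` name, and the example URL are assumptions chosen for illustration.

```python
import re

# Captures the owner and repository segments of a github.com URL.
GITHUB_RE = re.compile(r"https?://github\.com/([\w.-]+)/([\w.-]+)", re.IGNORECASE)


def find_github_repos(text: str) -> list[str]:
    """Return unique 'owner/repo' slugs in order of first appearance."""
    slugs: list[str] = []
    for owner, repo in GITHUB_RE.findall(text):
        repo = repo.rstrip(".")  # drop sentence-ending punctuation
        if repo.endswith(".git"):
            repo = repo[: -len(".git")]
        slug = f"{owner}/{repo}"
        if slug not in slugs:
            slugs.append(slug)
    return slugs


abstract = (
    "Code is available at https://github.com/example-org/example-repo. "
    "Issues: https://github.com/example-org/example-repo/issues."
)
repos = find_github_repos(abstract)
print(repos)
```

A pattern like this only finds links that appear in the text it is given, so papers that mention code on a project page rather than in the abstract or PDF would need a different approach.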

Contributing

The system is designed to be extensible. Key extension points:

Add new paper sources beyond arXiv
Implement different AI models for summarization
Add support for other email providers
Enhance code detection algorithms
Add web interface for configuration

License

MIT License - See LICENSE file for details

info-wordcab/paper-curator