Gene-Disease Analysis Service

A web application that analyzes potential correlations between genes and diseases using public life sciences data and LLM capabilities. Built as a demonstration project showcasing modern Python development practices and AI integration.

Note: a live deployment is available at https://gene-disease-analysis-service-production.up.railway.app/

🚀 Quick Start

# Clone the repository
git clone <repository-url>
cd genestack_demo

# Run with Docker (recommended)
docker-compose up

# Access the application (macOS; on Linux use xdg-open, or open the URL in a browser)
open http://localhost:8000

The application will be fully functional after running docker-compose up.

🏗️ Architecture Overview

Technology Stack

Backend: FastAPI (async Python web framework)
Frontend: Vanilla JavaScript with modern CSS
Database: SQLite (lightweight, sufficient for demo)
LLM Integration: OpenAI and Anthropic APIs
Data Sources: OpenTargets, Ensembl, Europe PMC, GWAS Catalog, EBI OLS
Deployment: Docker & Docker Compose

System Design

┌─────────────┐     ┌──────────────┐     ┌─────────────────┐
│   Frontend  │────▶│   FastAPI    │────▶│  Public APIs    │
│  (JS/HTML)  │◀────│   Backend    │◀────│  (OpenTargets,  │
└─────────────┘     └──────────────┘     │   Ensembl, etc) │
                           │              └─────────────────┘
                           │
                           ├──────────────┐
                           │              │
                           ▼              ▼
                    ┌──────────────┐     ┌─────────────────┐
                    │   SQLite     │     │   LLM APIs      │
                    │   Database   │     │ (OpenAI/Claude) │
                    └──────────────┘     └─────────────────┘

📋 Features

Core Functionality

User Management: Session-based authentication with secure API key storage
Gene-Disease Analysis: Comprehensive correlation analysis using multiple data sources
Real-time Updates: WebSocket support for live analysis progress
History Tracking: Complete analysis history per user
Multi-LLM Support: Works with both OpenAI and Anthropic models

Data Integration

OpenTargets Platform: Disease-gene association scores and evidence
Ensembl: Gene symbol resolution and normalization
Europe PMC: Scientific literature evidence
GWAS Catalog: Genetic association studies
EBI OLS: Disease ontology mapping
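
As an illustration of how two of these sources can be queried, the sketch below builds request URLs for the public Ensembl REST symbol lookup and the Europe PMC search endpoint. Whether the service calls these exact endpoints is an assumption, and the helper names `ensembl_symbol_lookup_url` and `europe_pmc_search_url` are hypothetical. No network call is made.

```python
from urllib.parse import quote, urlencode

ENSEMBL_REST = "https://rest.ensembl.org"
EUROPE_PMC_REST = "https://www.ebi.ac.uk/europepmc/webservices/rest"


def ensembl_symbol_lookup_url(symbol: str, species: str = "homo_sapiens") -> str:
    """Build the Ensembl REST URL that resolves a gene symbol to a stable ID."""
    return (
        f"{ENSEMBL_REST}/lookup/symbol/{quote(species)}/{quote(symbol)}"
        "?content-type=application/json"
    )


def europe_pmc_search_url(gene: str, disease: str, page_size: int = 10) -> str:
    """Build a Europe PMC search URL for papers mentioning both terms."""
    # Quoting each term asks Europe PMC for an exact-phrase match.
    query = f'"{gene}" AND "{disease}"'
    params = urlencode({"query": query, "format": "json", "pageSize": page_size})
    return f"{EUROPE_PMC_REST}/search?{params}"
```

Either URL can be passed to any HTTP client (the project uses httpx) and the JSON response parsed from there.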

Security Features

In-memory API key storage (never persisted to database)
Session timeout after 20 minutes of inactivity
Rate limiting on API endpoints
Secure session management

🛠️ Development Setup

Prerequisites

Python 3.10+
Docker and Docker Compose (for containerized deployment)
OpenAI or Anthropic API key

Local Development

# Navigate to backend directory
cd backend

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run the application
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload

# Access at http://localhost:8000

Running Tests

cd backend

# Run all tests
pytest

# Run with coverage
pytest --cov=app --cov-report=term-missing

# Run specific test categories
pytest tests/unit/          # Unit tests only
pytest tests/integration/   # Integration tests only

📖 API Documentation

Once the application is running, access the interactive API documentation:

Swagger UI: http://localhost:8000/docs
ReDoc: http://localhost:8000/redoc

Key Endpoints

POST /api/v1/auth/login - User login with API key
POST /api/v1/analyses - Create new gene-disease analysis
GET /api/v1/analyses/{id} - Get analysis results
GET /api/v1/analyses - List user's analysis history
WS /api/v1/ws/{analysis_id} - WebSocket for real-time updates
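
To show how a client might address these endpoints, here is a sketch that builds (but does not send) the two POST requests and the WebSocket URL using only the standard library. The payload field names (`username`, `provider`, `api_key`, `gene`, `disease`) are assumptions, not taken from the service; the Swagger UI at /docs is the authoritative source for the request schemas.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000/api/v1"  # local address from the Quick Start


def build_json_post(path: str, payload: dict) -> Request:
    """Build (but do not send) a JSON POST request for the given API path."""
    return Request(
        f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def websocket_url(analysis_id: str) -> str:
    """Return the WebSocket URL for live progress on one analysis."""
    return f"ws://localhost:8000/api/v1/ws/{analysis_id}"


# Hypothetical payload fields; check /docs for the real request schemas.
login_request = build_json_post(
    "/auth/login",
    {"username": "demo", "provider": "openai", "api_key": "sk-..."},
)
analysis_request = build_json_post(
    "/analyses",
    {"gene": "BRCA2", "disease": "breast cancer"},
)
```

Each request can be sent with `urllib.request.urlopen(request)` once the server is running. How the session is carried between calls (cookie or header) is not specified here and would need to be checked against the API documentation.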

🎯 Design Decisions & Trade-offs

1. Async Architecture (FastAPI + httpx)

Decision: Used async/await throughout the application
Rationale: Efficient handling of multiple concurrent API calls to external services
Trade-off: Slightly more complex code, but significantly better performance for I/O-bound operations
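
The benefit for I/O-bound work can be sketched with `asyncio.gather`. In this illustration `fetch_source` is a hypothetical stand-in for an HTTP call (the real service uses httpx), and `asyncio.sleep` simulates network latency. The four simulated calls overlap, so the total wait is close to one delay rather than the sum of four.

```python
import asyncio


async def fetch_source(name: str, delay: float) -> dict:
    """Stand-in for an HTTP call; asyncio.sleep simulates network latency."""
    await asyncio.sleep(delay)
    return {"source": name, "status": "ok"}


async def gather_evidence(delay: float = 0.1) -> list:
    """Query all sources concurrently; results come back in argument order."""
    return await asyncio.gather(
        fetch_source("opentargets", delay),
        fetch_source("ensembl", delay),
        fetch_source("europe_pmc", delay),
        fetch_source("gwas_catalog", delay),
    )
```

Running `asyncio.run(gather_evidence())` returns four result dictionaries in roughly the time of a single call.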

2. In-Memory Session Storage

Decision: Store API keys in memory, not in database
Rationale: Enhanced security - API keys are never persisted
Trade-off: Sessions are lost on server restart, which is acceptable for a demo application
Benefit: No risk of API key exposure through database access
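
A minimal sketch of this idea, assuming a plain dictionary keyed by a random token and the 20-minute inactivity timeout described in this README. The names `Session` and `InMemorySessionStore` are hypothetical and not taken from the codebase.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Optional

SESSION_TIMEOUT_SECONDS = 20 * 60  # 20 minutes of inactivity


@dataclass
class Session:
    username: str
    api_key: str  # held in process memory only, never written to disk
    last_seen: float


class InMemorySessionStore:
    def __init__(self, timeout: float = SESSION_TIMEOUT_SECONDS, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock  # injectable for testing
        self._sessions = {}

    def create(self, username: str, api_key: str) -> str:
        """Register a session and return its opaque token."""
        token = secrets.token_urlsafe(32)
        self._sessions[token] = Session(username, api_key, self._clock())
        return token

    def get(self, token: str) -> Optional[Session]:
        """Return the session if still active, refreshing its idle timer."""
        session = self._sessions.get(token)
        if session is None:
            return None
        now = self._clock()
        if now - session.last_seen > self._timeout:
            del self._sessions[token]  # expired: forget the API key
            return None
        session.last_seen = now
        return session
```

Because the dictionary lives in the server process, a restart clears every session and every stored key, which is the trade-off noted above.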

3. Multiple Data Source Integration

Decision: Integrate 4+ public APIs instead of just one
Rationale: Comprehensive evidence gathering for better analysis quality
Trade-off: Increased complexity and potential points of failure
Mitigation: Graceful degradation - analysis continues even if some sources fail
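
One way to get this behaviour is `asyncio.gather(..., return_exceptions=True)`, which returns raised exceptions as values instead of aborting the whole batch. The sketch below assumes that approach; the fetcher functions and `collect_evidence` are hypothetical stand-ins, not the service's actual code.

```python
import asyncio


async def fetch_ok(name: str) -> dict:
    """Stand-in for a data source that responds normally."""
    await asyncio.sleep(0)
    return {"source": name, "records": 3}


async def fetch_failing(name: str) -> dict:
    """Stand-in for a data source that is down."""
    await asyncio.sleep(0)
    raise ConnectionError(f"{name} is unavailable")


async def collect_evidence(fetchers: dict) -> tuple:
    """Run every fetcher concurrently and split results from failures."""
    names = list(fetchers)
    outcomes = await asyncio.gather(
        *(fetchers[name](name) for name in names),
        return_exceptions=True,  # a failure becomes a value, not a crash
    )
    evidence, errors = {}, {}
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, Exception):
            errors[name] = str(outcome)
        else:
            evidence[name] = outcome
    return evidence, errors
```

The caller can then pass `evidence` to the LLM and report `errors` to the user as sources that were skipped.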

4. SQLite Database

Decision: Use SQLite instead of PostgreSQL/MySQL
Rationale: Simplicity for demo, no additional services needed
Trade-off: Not suitable for production scale
Note: Easy to migrate to PostgreSQL if needed (SQLAlchemy abstraction)
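
For a sense of how little SQLite needs, here is a dependency-free sketch using the standard-library `sqlite3` module to store each analysis result as a JSON string. The project itself goes through SQLAlchemy; the table name, column names, and helper functions below are hypothetical.

```python
import json
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) analyses table if it does not exist."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS analyses (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               username TEXT NOT NULL,
               gene TEXT NOT NULL,
               disease TEXT NOT NULL,
               result_json TEXT NOT NULL
           )"""
    )


def save_analysis(conn, username: str, gene: str, disease: str, result: dict) -> int:
    """Insert one analysis, serializing the result as JSON; return its row id."""
    cur = conn.execute(
        "INSERT INTO analyses (username, gene, disease, result_json) VALUES (?, ?, ?, ?)",
        (username, gene, disease, json.dumps(result)),
    )
    conn.commit()
    return cur.lastrowid


def list_history(conn, username: str) -> list:
    """Return the user's analyses, newest first, with results decoded."""
    rows = conn.execute(
        "SELECT id, gene, disease, result_json FROM analyses "
        "WHERE username = ? ORDER BY id DESC",
        (username,),
    ).fetchall()
    return [
        {"id": r[0], "gene": r[1], "disease": r[2], "result": json.loads(r[3])}
        for r in rows
    ]
```

Passing `sqlite3.connect(":memory:")` gives a throwaway database for experiments; a file path gives persistence with no extra service to run.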

5. Vanilla JavaScript Frontend

Decision: No frontend framework (React/Vue/Angular)
Rationale: Simplicity, no build process needed, focuses on backend skills
Trade-off: Less maintainable for larger applications
Benefit: Zero dependencies, instant loading, easy to understand

6. Structured LLM Prompts

Decision: Enforce JSON schema in LLM responses
Rationale: Reliable parsing and consistent output format
Trade-off: Occasional parse failures when the LLM does not follow the strict format
Mitigation: Fallback parsing strategies implemented
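
The README does not describe which fallback strategies are implemented. The sketch below shows one plausible chain, offered as an assumption: try the raw text, then the contents of a fenced code block, then the outermost brace-delimited span. The function name `parse_llm_json` is hypothetical.

```python
import json
import re
from typing import Optional

FENCE = "`" * 3  # markdown code fence, built here to keep this example readable
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_llm_json(text: str) -> Optional[dict]:
    """Extract a JSON object from an LLM reply, or return None if none parses."""
    candidates = [text.strip()]  # 1. the reply is already pure JSON

    fenced = FENCE_RE.search(text)  # 2. JSON wrapped in a markdown code fence
    if fenced:
        candidates.append(fenced.group(1).strip())

    start, end = text.find("{"), text.rfind("}")  # 3. JSON embedded in prose
    if start != -1 and end > start:
        candidates.append(text[start : end + 1])

    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None
```

A None result is the signal to retry the LLM call or surface a parsing error to the user.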

7. 20-Minute Session Timeout

Decision: Short session timeout based on inactivity
Rationale: Security best practice for API key protection
Trade-off: Users need to re-login more frequently
Benefit: Reduced risk of session hijacking

🚧 Limitations & Production Considerations

This is a demonstration application built in limited time. For production use, consider:

Database: Migrate to PostgreSQL/MySQL for better concurrency
Caching: Add Redis for session storage and API response caching
Authentication: Implement proper OAuth2/JWT authentication
Monitoring: Add logging, metrics, and error tracking (Sentry, Datadog)
API Keys: Use key management service (AWS KMS, HashiCorp Vault)
Rate Limiting: Implement per-user rate limiting with Redis
Frontend: Consider React/Vue for better state management
Testing: Expand test coverage (currently ~80%)
CI/CD: Add GitHub Actions for automated testing and deployment
Documentation: Add API versioning and deprecation policies

🌟 Bonus Features Implemented

Beyond the basic requirements, this implementation includes:

WebSocket Support: Real-time analysis progress updates
Advanced Data Integration: 4+ data sources vs required 1
Comprehensive Error Handling: Graceful degradation for all external services
Smart LLM Context: System prevents redundant data source recommendations
Literature Evidence Display: Direct links to scientific papers
Session Security: In-memory storage with automatic expiration
Test Suite: 73 tests covering unit and integration scenarios
CLI Tool: Standalone command-line interface included

📊 Performance Characteristics

Concurrent Requests: Handles 100+ simultaneous analyses
Response Time: <2s for cached data, 5-15s for full analysis
Memory Usage: ~200MB baseline, +5MB per active session
Database Size: ~10KB per analysis (efficient JSON storage)

📄 License

This project is provided as-is for demonstration purposes.

🙏 Acknowledgments

OpenTargets Platform for comprehensive gene-disease associations
EMBL-EBI for providing public bioinformatics APIs
OpenAI and Anthropic for LLM capabilities
FastAPI for the excellent async web framework

Michael31416/genestack_demo

Folders and files

Latest commit

History

Repository files navigation

Gene-Disease Analysis Service

🚀 Quick Start

🏗️ Architecture Overview

Technology Stack

System Design

📋 Features

Core Functionality

Data Integration

Security Features

🛠️ Development Setup

Prerequisites

Local Development

Running Tests

📖 API Documentation

Key Endpoints

🎯 Design Decisions & Trade-offs

1. Async Architecture (FastAPI + httpx)

2. In-Memory Session Storage

3. Multiple Data Source Integration

4. SQLite Database

5. Vanilla JavaScript Frontend

6. Structured LLM Prompts

7. 20-Minute Session Timeout

🚧 Limitations & Production Considerations

🌟 Bonus Features Implemented

📊 Performance Characteristics

📄 License

🙏 Acknowledgments

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Uh oh!

Languages

Packages