The Map of Open Source Science (MOSS) is an open-source application and collaborative effort to model and map the domain of open-source research software and its intersection with academic scholarship. It aims to reveal the hidden connections within this ecosystem by constructing a rich knowledge graph that links software repositories, scholarly publications, researchers, and institutions. This is accomplished through a reproducible framework that uses structured data integration, overlapping web traversal, and graph construction strategies. The system is grounded in existing ontologies (e.g., OpenAlex topics, Schema.org entities) and uses context-driven rules to establish relationships, ensuring data provenance and consistency. By making these connections visible, MOSS helps answer critical questions about the impact, sustainability, and collaborative nature of open source in science.
This repository contains the backend services, API, and frontend application for the MOSS platform. It provides the tools to ingest data from sources like GitHub and OpenAlex, store it in a structured database, and expose it for analysis and exploration.
- Data Ingestion:
  - Repository Ingestion: Ingest GitHub repositories directly via URL, asynchronously via keyword-search worker tasks, or discover them through connection traversal.
- Scholarly Linking:
  - DOI Extraction: Automatically extracts DOIs from repository files such as `README.md` and `CITATION.cff` (a minimal extraction sketch appears just after this feature list).
  - Publication Mapping: Resolves DOIs using the OpenAlex API to fetch detailed publication metadata.
  - Citation Traversal: Recursively processes citation networks (references and citations) to build a deeper graph.
- Entity Tracking:
  - Comprehensive Modeling: Stores detailed information for repositories, repository owners (users/organizations), contributors, scholarly works, authors, and institutions.
  - Relationship Tracking: Models affiliations between authors and institutions, and dependencies from common package manager files (e.g., `requirements.txt`, `package.json`).
- Data Provenance:
  - Discovery Chains: A robust `DiscoveryChain` system tracks the origin of every piece of data, recording how it was discovered and linked. This ensures transparency and reproducibility.
- Asynchronous Processing:
  - Background Tasks: Leverages Celery and Redis to handle long-running processes like repository discovery and citation traversal without blocking the API.
- Modern API:
  - FastAPI Backend: A high-performance RESTful API provides endpoints for triggering ingestion and querying the knowledge graph.
  - Interactive Docs: Automatic API documentation is available via Swagger UI and ReDoc.
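To make the DOI Extraction step concrete, here is a minimal sketch of how DOIs can be pulled from files such as `README.md` and `CITATION.cff`. The regex, function name, and file list are illustrative assumptions, not the project's actual implementation.

```python
import re
from pathlib import Path

# Loose DOI pattern (illustrative); real-world extraction usually needs extra
# cleanup for trailing punctuation and surrounding markup.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def extract_dois(repo_dir: str, candidates=("README.md", "CITATION.cff")) -> set[str]:
    """Scan well-known repository files and return the set of DOIs found."""
    dois: set[str] = set()
    for name in candidates:
        path = Path(repo_dir) / name
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="ignore")
            dois.update(DOI_PATTERN.findall(text))
    return dois


if __name__ == "__main__":
    # Example: scan the current checkout for citable DOIs.
    print(extract_dois("."))
```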
- Backend: Python + uvicorn + FastAPI
- Frontend: Node + React (Vite)
- Package Managers: uv + pnpm
- Database: PostgreSQL
- Background Tasks: Celery
- Message Broker / Cache: Redis
- ORM: SQLAlchemy
- Migrations: Alembic
- HTTP Client: Requests
- Logging: Python `logging`, `concurrent-log-handler`
- Configuration: `python-dotenv`
- API Clients: Custom clients for GitHub API v3 and OpenAlex API
- Analysis (Optional): NetworkX, python-louvain
Before you begin, ensure you have the following installed on your system:
- uv: Python package manager (uv - Install)
- pnpm: Node package manager (pnpm - Install)
- Docker: Containerization platform (Docker - Install); recommended for simplified setup of PostgreSQL and Redis
- Configure Environment:
  - Copy `.env.example` to `.env`: `cp .env.example .env`
  - Edit `.env` and add your GitHub token. This is the only variable you need to change to get started.
    - `GITHUB_API_TOKEN`: Your GitHub Personal Access Token (PAT).
      - Generate one at: https://github.com/settings/personal-access-tokens
      - Select the "Public repositories" option for repository access.
- Start Background Services: `docker compose up -d` starts PostgreSQL and Redis using Docker Compose.
- Set up the Database & Frontend: `uv run poe setup`
  - This command, powered by `poethepoet`, runs database migrations and installs all frontend dependencies. It uses `uv run` to execute `poe` from the virtual environment without needing to activate it.
- Run the Application: `uv run poe start`
  - This single command starts the API server, Celery worker, and frontend development server concurrently in one terminal.
  - To stop all services, press `Ctrl+C`.
- Access the Application:
  - The API documentation will be available at http://localhost:8000/docs.
  - The frontend will be available at the URL provided by the `pnpm dev` command (usually http://localhost:5173).
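With the stack running, the API can also be exercised from a script. The example below uses the Requests library against a hypothetical ingestion endpoint; the actual route names and payload shapes are documented in the Swagger UI at http://localhost:8000/docs, so treat this purely as a sketch.

```python
import requests

API_BASE_URL = "http://localhost:8000"  # default local setup

# Hypothetical ingestion endpoint and payload; check /docs for the real routes.
response = requests.post(
    f"{API_BASE_URL}/api/repositories",
    json={"url": "https://github.com/networkx/networkx"},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```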
This project uses `poethepoet` for task automation, which simplifies the setup process. The recommended setup above is the easiest path. If you wish to run commands manually, you can inspect the tasks defined in `pyproject.toml` under `[tool.poe.tasks]`.
The manual steps are:
- Create Virtual Environment: `uv venv`, then `source .venv/bin/activate`
- Install Dependencies: `uv sync`
- Run Database Migrations: `uv run python scripts/setup_db.py`
- Install Frontend Dependencies: `pnpm --dir frontend install`
- Start the Services (in 3 separate terminals):
  - Terminal 1: API Server: `uv run uvicorn backend.main:app --reload --host 0.0.0.0 --port 8000`
  - Terminal 2: Celery Worker: `uv run celery -A backend.celery_app worker -l info`
  - Terminal 3: Frontend Server: `pnpm --dir frontend dev`
If you are not using Docker, ensure PostgreSQL and Redis are installed and running, then:
- PostgreSQL:
  - Create a database (e.g., `moss_db`).
  - Create a user and password (e.g., `moss_user`).
  - Grant the user privileges on the database.
  - Update the `POSTGRES_*` variables and the `DATABASE_URL` in your `.env` file to match.
- Redis:
  - Ensure the Redis server is running.
  - Update `REDIS_HOST`, `REDIS_PORT`, `CELERY_BROKER_URL`, and `CELERY_RESULT_BACKEND_URL` in `.env` if your server is not on the default `localhost:6379`.
The `.env` file is crucial for configuring the application. Here is a detailed breakdown:
- `POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_DB`, `POSTGRES_HOST`, `POSTGRES_PORT`: Used by Docker Compose to initialize the PostgreSQL container. They are also used to construct the `DATABASE_URL`.
- `REDIS_HOST`, `REDIS_PORT`: Used by Docker Compose and to construct the Celery URLs.
- `DATABASE_URL`: The full connection string for PostgreSQL, used by SQLAlchemy. Must be consistent with the `POSTGRES_*` variables.
- `CELERY_BROKER_URL`: The URL for your Redis server (or other message broker) for task queuing.
- `CELERY_RESULT_BACKEND_URL`: The URL for your Redis server to store task results.
- `GITHUB_API_TOKEN`: (Required) Your GitHub Personal Access Token (PAT) for interacting with the GitHub API.
- `OPENALEX_EMAIL`: (Recommended) Your email address for the OpenAlex API "polite pool" to get better rate limits.
- `VITE_API_BASE_URL`: The base URL for the backend API, used by the frontend.
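To illustrate the consistency requirement between `DATABASE_URL` and the `POSTGRES_*` variables, here is a small sketch (not project code) that loads `.env` with `python-dotenv` and rebuilds the URL from its parts. The plain `postgresql://` scheme is an assumption; your `DATABASE_URL` may use a driver-specific variant.

```python
import os
from dotenv import load_dotenv

load_dotenv()  # read .env from the current working directory

# DATABASE_URL is expected to line up with the individual POSTGRES_* values.
# Note: the scheme prefix (e.g. postgresql+psycopg2://) may differ in your .env.
derived_url = (
    f"postgresql://{os.environ['POSTGRES_USER']}:{os.environ['POSTGRES_PASSWORD']}"
    f"@{os.environ['POSTGRES_HOST']}:{os.environ['POSTGRES_PORT']}/{os.environ['POSTGRES_DB']}"
)

if derived_url != os.environ.get("DATABASE_URL"):
    print("Warning: DATABASE_URL does not match the POSTGRES_* variables")
```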
If you make changes to the database models (`backend/data/models/`) later, you will need to:
- Generate a new migration script: `alembic revision --autogenerate -m "Short description of changes"` (Review the generated script in `backend/data/migrations/versions/`.)
- Apply the migration: `uv run python scripts/setup_db.py` (Alternatively, you can use `alembic upgrade head`.)
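For orientation, an autogenerated script under `backend/data/migrations/versions/` generally follows the shape below. The table name, column, and revision identifiers are hypothetical placeholders, not migrations from this project.

```python
"""Add stars column to repositories (hypothetical example).

Revision ID: abc123def456
Revises: 0123456789ab
"""
from alembic import op
import sqlalchemy as sa

# Revision identifiers used by Alembic (placeholders for illustration).
revision = "abc123def456"
down_revision = "0123456789ab"
branch_labels = None
depends_on = None


def upgrade() -> None:
    # Example of what --autogenerate emits for a newly added model field.
    op.add_column("repositories", sa.Column("stars", sa.Integer(), nullable=True))


def downgrade() -> None:
    op.drop_column("repositories", "stars")
```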
A high-level overview of the project structure:
- `moss/`: Project root.
  - `backend/`: Contains all the backend code (API, services, data layer).
    - `api/`: FastAPI endpoints and dependencies.
    - `config/`: Configuration loading (`settings.py`) and logging (`logging_config.py`).
    - `data/`: Database interaction (models, repositories, migrations).
    - `external/`: Clients for external APIs (GitHub, OpenAlex).
    - `schemas/`: Pydantic models for API request/response validation.
    - `services/`: Business logic layer.
    - `tasks/`: Celery background task definitions.
    - `utils/`: Shared utility functions.
  - `contrib/`: Location for contributed "recipe" scripts (analysis, affiliation, discovery).
  - `frontend/`: Contains the React frontend code (setup instructions not covered here).
  - `logs/`: Where log files (`moss_api.log`, `moss_celery.log`) are stored.
  - `scripts/`: Helper scripts (database setup).