
MOSS - Map of Open Source Science

License: Apache v2

Overview

The Map of Open Source Science (MOSS) is an open-source application and collaborative effort to model and map the domain of open-source research software and its intersection with academic scholarship. It aims to reveal the hidden connections within this ecosystem by constructing a rich knowledge graph that links software repositories, scholarly publications, researchers, and institutions. This is accomplished through a reproducible framework that uses structured data integration, overlapping web traversal, and graph construction strategies. The system is grounded in existing ontologies (e.g., OpenAlex topics, Schema.org entities) and uses context-driven rules to establish relationships, ensuring data provenance and consistency. By making these connections visible, MOSS helps answer critical questions about the impact, sustainability, and collaborative nature of open source in science.

This repository contains the backend services, API, and frontend application for the MOSS platform. It provides the tools to ingest data from sources like GitHub and OpenAlex, store it in a structured database, and expose it for analysis and exploration.

Key Features

  • Data Ingestion:
    • Repository Ingestion: Ingest GitHub repositories directly by URL, asynchronously via keyword searches handled by background workers, or discover them through connection traversal.
  • Scholarly Linking:
    • DOI Extraction: Automatically extracts DOIs from repository files like README.md and CITATION.cff.
    • Publication Mapping: Resolves DOIs using the OpenAlex API to fetch detailed publication metadata (a sketch of this flow follows this list).
    • Citation Traversal: Recursively processes citation networks (references and citations) to build a deeper graph.
  • Entity Tracking:
    • Comprehensive Modeling: Stores detailed information for repositories, repository owners (users/organizations), contributors, scholarly works, authors, and institutions.
    • Relationship Tracking: Models affiliations between authors and institutions, and dependencies from common package manager files (e.g., requirements.txt, package.json).
  • Data Provenance:
    • Discovery Chains: A robust DiscoveryChain system tracks the origin of every piece of data, recording how it was discovered and linked. This ensures transparency and reproducibility.
  • Asynchronous Processing:
    • Background Tasks: Leverages Celery and Redis to handle long-running processes like repository discovery and citation traversal without blocking the API.
  • Modern API:
    • FastAPI Backend: A high-performance RESTful API provides endpoints for triggering ingestion and querying the knowledge graph.
    • Interactive Docs: Automatic API documentation is available via Swagger UI and ReDoc.
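
For a concrete sense of the scholarly-linking flow above, here is a minimal sketch of DOI extraction and OpenAlex resolution, written with the Requests library from the stack below. The regex and helper names are illustrative assumptions, not the project's actual implementation:

    # Minimal sketch of the DOI-extraction + OpenAlex-resolution flow described
    # above. The regex and helper names are illustrative, not MOSS's actual code.
    import re
    import requests

    DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")

    def extract_dois(text: str) -> set[str]:
        """Pull DOI-like strings out of a file such as README.md or CITATION.cff."""
        return set(DOI_PATTERN.findall(text))

    def resolve_doi(doi: str, email: str | None = None) -> dict:
        """Fetch publication metadata for a DOI from the OpenAlex works endpoint."""
        params = {"mailto": email} if email else {}  # OpenAlex "polite pool"
        resp = requests.get(
            f"https://api.openalex.org/works/https://doi.org/{doi}", params=params
        )
        resp.raise_for_status()
        return resp.json()

The returned OpenAlex work record includes reference lists, which is what makes the recursive citation traversal described above possible.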

Technology Stack

  • Backend: Python + uvicorn + FastAPI

  • Frontend: Node + React (Vite)

  • Package Managers: uv + pnpm

  • Database: PostgreSQL

  • Background Tasks: Celery

  • Message Broker / Cache: Redis

  • ORM: SQLAlchemy

  • Migrations: Alembic

  • HTTP Client: Requests

  • Logging: Python logging, concurrent-log-handler

  • Configuration: python-dotenv

  • API Clients: Custom clients for GitHub API v3 and OpenAlex API

  • Analysis (Optional): NetworkX, python-louvain

Prerequisites

Before you begin, ensure you have the following installed on your system:

  1. uv: Python package manager | uv - Install
  2. pnpm: Node package manager | pnpm - Install
  3. Docker: Containerization platform | Docker - Install (Recommended for simplified setup of PostgreSQL and Redis)

Setup

  1. Configure Environment:

    • Copy .env.example to .env:
    cp .env.example .env
    • Edit .env and set GITHUB_API_TOKEN to your GitHub Personal Access Token (PAT). This is the only variable you need to change to get started.
  2. Start Background Services: This command starts PostgreSQL and Redis using Docker Compose.

    docker compose up -d
  3. Set up the Database & Frontend:

    • This command, powered by poethepoet, runs database migrations and installs all frontend dependencies. It uses uv run to execute poe from the virtual environment without needing to activate it.
    uv run poe setup
  4. Run the Application:

    • This single command starts the API server, Celery worker, and frontend development server concurrently in one terminal.
    uv run poe start
    • To stop all services, press Ctrl+C.
  5. Access the Application:

    • The API documentation will be available at http://localhost:8000/docs.
    • The frontend will be available at the URL provided by the pnpm dev command (usually http://localhost:5173).
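
As a quick sanity check once everything is running, you can query FastAPI's machine-readable schema, which is served by default alongside the Swagger UI:

    # Quick sanity check that the API is up (FastAPI serves /openapi.json by default).
    import requests

    resp = requests.get("http://localhost:8000/openapi.json")
    resp.raise_for_status()
    print(f"API is up with {len(resp.json()['paths'])} documented routes")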

Manual Setup & Configuration

Python Environment Setup

This project uses poethepoet for task automation, which simplifies the setup process. The recommended setup above is the easiest path. If you wish to run commands manually, you can inspect the tasks defined in pyproject.toml under [tool.poe.tasks].

The manual steps are:

  1. Create Virtual Environment:

    uv venv
    source .venv/bin/activate
  2. Install Dependencies:

    uv sync
  3. Run Database Migrations:

    uv run python scripts/setup_db.py
  4. Install Frontend Dependencies:

    pnpm --dir frontend install
  5. Start the Services (in 3 separate terminals):

    • Terminal 1: API Server

      uv run uvicorn backend.main:app --reload --host 0.0.0.0 --port 8000
    • Terminal 2: Celery Worker (a sketch of the Celery app module follows this list)

      uv run celery -A backend.celery_app worker -l info
    • Terminal 3: Frontend Server

      pnpm --dir frontend dev
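
The worker command above targets a Celery application object in backend/celery_app.py. As a rough sketch of what such a module typically looks like (the actual contents may differ; the URLs mirror the CELERY_* variables described below):

    # Rough sketch of a Celery app module like backend/celery_app.py; the real
    # module may differ. Broker/backend URLs mirror the CELERY_* .env variables.
    import os

    from celery import Celery

    celery_app = Celery(
        "moss",
        broker=os.environ.get("CELERY_BROKER_URL", "redis://localhost:6379/0"),
        backend=os.environ.get("CELERY_RESULT_BACKEND_URL", "redis://localhost:6379/1"),
    )

    @celery_app.task
    def traverse_citations(doi: str) -> None:
        """Hypothetical long-running task: walk a work's citation network."""
        ...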

Service Setup (PostgreSQL & Redis)

If you are not using Docker, ensure PostgreSQL and Redis are installed and running, then:

  • PostgreSQL:
    1. Create a database (e.g., moss_db).
    2. Create a user and password (e.g., moss_user).
    3. Grant the user privileges on the database.
    4. Update the POSTGRES_* variables and the DATABASE_URL in your .env file to match.
  • Redis:
    1. Ensure the Redis server is running.
    2. Update REDIS_HOST, REDIS_PORT, CELERY_BROKER_URL, and CELERY_RESULT_BACKEND_URL in .env if your server is not on the default localhost:6379.
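
To verify Redis connectivity, a quick ping from Python works; the redis package is installed alongside Celery's Redis broker support, and the host/port shown are the defaults:

    # Quick connectivity check against the Redis server (defaults shown).
    import redis

    r = redis.Redis(host="localhost", port=6379)
    assert r.ping(), "Redis did not respond"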

Environment Variable Details (.env)

The .env file is crucial for configuring the application. Here is a detailed breakdown:

  • POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_DB, POSTGRES_HOST, POSTGRES_PORT: These are used by Docker Compose to initialize the PostgreSQL container. They are also used to construct the DATABASE_URL.
  • REDIS_HOST, REDIS_PORT: Used by Docker Compose and to construct the Celery URLs.
  • DATABASE_URL: The full connection string for PostgreSQL, used by SQLAlchemy. Must be consistent with the POSTGRES_* variables.
  • CELERY_BROKER_URL: The URL for your Redis server (or other message broker) for task queuing.
  • CELERY_RESULT_BACKEND_URL: The URL for your Redis server to store task results.
  • GITHUB_API_TOKEN: (Required) Your GitHub Personal Access Token (PAT) for interacting with the GitHub API.
  • OPENALEX_EMAIL: (Recommended) Your email address for the OpenAlex API "polite pool" to get better rate limits.
  • VITE_API_BASE_URL: The base URL for the backend API, used by the frontend.
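
For reference, the backend loads these values with python-dotenv (listed in the stack above). A minimal sketch of the loading pattern, with variable names taken from the list above (the project's actual settings.py may differ):

    # Minimal sketch of .env loading with python-dotenv; the project's actual
    # settings.py may differ. Variable names come from the list above.
    import os

    from dotenv import load_dotenv

    load_dotenv()  # reads key=value pairs from .env into the process environment

    DATABASE_URL = os.environ["DATABASE_URL"]          # must match POSTGRES_* values
    GITHUB_API_TOKEN = os.environ["GITHUB_API_TOKEN"]  # required
    OPENALEX_EMAIL = os.environ.get("OPENALEX_EMAIL")  # recommended (polite pool)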

Database Migrations

If you later change the database models (backend/data/models/), you will need to:

  1. Generate a new migration script:

    uv run alembic revision --autogenerate -m "Short description of changes"

    (Review the generated script in backend/data/migrations/versions/)

  2. Apply the migration:

    uv run python scripts/setup_db.py

    (Alternatively, you can use alembic upgrade head)
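
When reviewing a generated script, expect the standard Alembic shape. This example revision is entirely hypothetical (identifiers, table, and column are illustrative only):

    # Hypothetical autogenerated revision under backend/data/migrations/versions/.
    # Revision identifiers and the example column are illustrative only.
    import sqlalchemy as sa
    from alembic import op

    revision = "a1b2c3d4e5f6"
    down_revision = "f6e5d4c3b2a1"

    def upgrade():
        op.add_column("repository", sa.Column("stargazer_count", sa.Integer(), nullable=True))

    def downgrade():
        op.drop_column("repository", "stargazer_count")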

Directory Structure

A high-level overview of the project structure:

  • moss/: Project root.
    • backend/: Contains all the backend code (API, services, data layer).
      • api/: FastAPI endpoints and dependencies.
      • config/: Configuration loading (settings.py) and logging (logging_config.py).
      • data/: Database interaction (models, repositories, migrations).
      • external/: Clients for external APIs (GitHub, OpenAlex).
      • schemas/: Pydantic models for API request/response validation.
      • services/: Business logic layer.
      • tasks/: Celery background task definitions.
      • utils/: Shared utility functions.
    • contrib/: Location for contributed "recipe" scripts (analysis, affiliation, discovery).
    • frontend/: Contains the React frontend code.
    • logs/: Where log files (moss_api.log, moss_celery.log) are stored.
    • scripts/: Helper scripts (database setup).
