Disease database

Code to create a historical disease database (19th-20th century) for municipalities in the Netherlands.

Figure: Cholera mention rates in the mid-1860s (source code).

This database was produced by:

🚜 Harvesting >80 million Dutch newspaper texts in the period 1830-1940 from Delpher.
🔎 Finding mentions of locations and diseases in these texts via hand-crafted regex.
💽 Processing the results and creating a user-friendly historical disease database for the following diseases:
- cholera, diphtheria, dysentery, influenza, malaria, measles, scarlet fever, smallpox, tuberculosis, and typhus.
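The steps above can be illustrated with a minimal sketch. It is not the project's actual code: the regex patterns, the sample sentences, and the `mention_rates` helper are hypothetical, and the rate definition used here (texts mentioning both a location and a disease, divided by texts mentioning that location) is an assumption about what a "mention rate" means.

```python
import re
from collections import defaultdict

# Hypothetical patterns for illustration only; the project's real
# hand-crafted regexes are not reproduced here.
DISEASE_PATTERNS = {
    "cholera": re.compile(r"\bcholera\b", re.IGNORECASE),
    "smallpox": re.compile(r"\b(?:pokken|kinderpokken)\b", re.IGNORECASE),
}
LOCATION_PATTERNS = {
    "Utrecht": re.compile(r"\butrecht\b", re.IGNORECASE),
    "Leiden": re.compile(r"\ble[iy]den\b", re.IGNORECASE),  # allows the old spelling "Leyden"
}


def mention_rates(texts):
    """Compute an assumed mention rate per (location, year, disease).

    texts: iterable of (year, text) pairs.
    rate = (# texts mentioning location AND disease) / (# texts mentioning location)
    """
    location_counts = defaultdict(int)
    joint_counts = defaultdict(int)
    for year, text in texts:
        # Diseases mentioned anywhere in this text
        diseases = [d for d, pat in DISEASE_PATTERNS.items() if pat.search(text)]
        for location, pat in LOCATION_PATTERNS.items():
            if not pat.search(text):
                continue
            location_counts[(location, year)] += 1
            for disease in diseases:
                joint_counts[(location, year, disease)] += 1
    return {
        (location, year, disease): joint_counts[(location, year, disease)] / n
        for (location, year), n in location_counts.items()
        for disease in DISEASE_PATTERNS
    }


# Made-up sample sentences, not taken from Delpher
texts = [
    (1866, "De cholera heerscht te Utrecht."),
    (1866, "Markt te Utrecht: prijzen stabiel."),
    (1866, "Te Leyden zijn de pokken uitgebroken."),
]
rates = mention_rates(texts)
print(rates[("Utrecht", 1866, "cholera")])
```

Normalising by the number of texts that mention the location, rather than reporting raw counts, keeps municipalities with heavy newspaper coverage from dominating the result. Whether the database normalises in exactly this way is not stated here.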

⏬ Download the database from the latest release page ⏬

Other resources related to this database:

🐻‍❄️ Polars is the engine that powers this data processing pipeline, together with the Apache Parquet format
🌍 NLGIS provides historical geographic data for mapping, plotting, and more.
🗺️ Disease database viewer: an experimental R shiny app to interactively view the disease database.
🕵️‍♀️ Initial exploration into smoothing the mention rates within the disease database, using spatial, temporal, and spatiotemporal models.

Installation

This project uses pyproject.toml to handle its dependencies. You can install them using pip like so:

pip install .

However, we recommend using uv to manage the environment. First, install uv, then clone / download this repo, then run:

uv sync

This will automatically install the right Python version, create a virtual environment, and install the required packages. If you choose not to use uv, you can replace uv run in the code examples in this repo with python.

🍏 macOS note: if you encounter error: command 'cmake' failed: No such file or directory, you need to install cmake first, e.g., through brew install cmake. Similarly, you may have to install apache-arrow separately as well (brew install apache-arrow). Once these dependency issues are solved, run uv sync one more time.

Running the data processing pipeline

The full data processing pipeline looks like this:

[Figure: data processing pipeline diagram]

Each of the separate processing steps (rectangles in the above image) has its own subfolder with its own readme documentation:

Open archive processing in ./src/process_open_archive/
Delpher API harvesting in ./src/harvest_delpher_api/
Final database creation in ./src/create_database/

Data analysis

For a basic analysis after the database has been created, take a look at the file src/analysis/query_db.py.

For more in-depth analysis and usage scripts, take a look at our analysis repository: disease_database_analysis.

Contact

This project is developed and maintained by the ODISSEI Social Data Science (SoDa) team.

Do you have questions, suggestions, or remarks? File an issue in the issue tracker or contact the team at odissei-soda.nl.
