Code to create a historical disease database (19th-20th century) for municipalities in the Netherlands.
Figure: Cholera mention rates in the mid-1860s (source code).
This database was produced by:
- 🚜 Harvesting >80 million Dutch newspaper texts in the period 1830-1940 from Delpher.
- 🔎 Finding mentions of locations and diseases in these texts via hand-crafted regex.
- 💽 Processing the results and creating a user-friendly historical disease database for the following diseases:
  - cholera, diphtheria, dysentery, influenza, malaria, measles, scarlet fever, smallpox, tuberculosis, and typhus.
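
To make the regex step above more concrete, here is a minimal sketch of how disease and location mentions could be detected and turned into a mention rate. Everything in it is an assumption for illustration: the patterns, the helper names `find_mentions` and `mention_rate`, the example texts, and the definition of the rate (share of texts mentioning a location that also mention a disease). The project's actual hand-crafted regexes and normalisation live in the pipeline subfolders and may differ.

```python
import re

# Hypothetical patterns for illustration only; the project's real
# hand-crafted regexes are not shown here and are likely more elaborate.
DISEASE_PATTERNS = {
    "cholera": re.compile(r"\bcholera\b", re.IGNORECASE),
    "smallpox": re.compile(r"\bpokken\b", re.IGNORECASE),  # Dutch for smallpox
}
LOCATION_PATTERNS = {
    "Utrecht": re.compile(r"\bUtrecht\b"),
    "Leiden": re.compile(r"\bLe[iy]den\b"),  # allow the older spelling "Leyden"
}


def find_mentions(text, patterns):
    """Return the set of pattern names that match somewhere in the text."""
    return {name for name, pattern in patterns.items() if pattern.search(text)}


def mention_rate(texts, location, disease):
    """Share of texts mentioning `location` that also mention `disease`.

    Returns None when no text mentions the location. This definition of
    "mention rate" is an assumption made for this sketch.
    """
    n_location = 0
    n_both = 0
    for text in texts:
        if location in find_mentions(text, LOCATION_PATTERNS):
            n_location += 1
            if disease in find_mentions(text, DISEASE_PATTERNS):
                n_both += 1
    return n_both / n_location if n_location else None


# Made-up example snippets, not real Delpher texts.
texts = [
    "Te Utrecht zijn gisteren vier gevallen van cholera aangegeven.",
    "De markt te Utrecht was druk bezocht.",
    "Te Leyden heerschen de pokken.",
]

if __name__ == "__main__":
    print(mention_rate(texts, "Utrecht", "cholera"))
```

Normalising by the number of texts that mention the location, rather than using raw counts, keeps places with heavy newspaper coverage from dominating simply because more was written about them.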
⏬ Download the database from the latest release page ⏬
Other resources related to this database:
- 🐻‍❄️ Polars is the engine that powers this data processing pipeline, together with the Apache Parquet format.
- 🌍 NLGIS provides historical geographic data for mapping, plotting, and more.
- 🗺️ Disease database viewer: an experimental R shiny app to interactively view the disease database.
- 🕵️‍♀️ Initial exploration into smoothing the mention rates within the disease database, using spatial, temporal, and spatiotemporal models.
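
As a rough illustration of what temporal smoothing of mention rates means, the sketch below applies a centered moving average to a short series. This is a deliberately simple stand-in, not the spatial, temporal, or spatiotemporal models used in the exploration linked above; the function name `moving_average` and the monthly values are made up for this example.

```python
def moving_average(values, window=3):
    """Centered moving average; the window shrinks at the series edges."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        # Take up to `half` neighbours on each side, fewer near the edges.
        neighbours = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(neighbours) / len(neighbours))
    return smoothed


# Made-up monthly mention rates with a single isolated spike.
monthly_rates = [0.0, 0.0, 0.75, 0.0, 0.0]

if __name__ == "__main__":
    print(moving_average(monthly_rates, window=3))
```

A moving average spreads an isolated spike over neighbouring months, which reduces noise from months with few newspaper texts but also blurs the onset of a real outbreak; model-based smoothers handle that trade-off more carefully.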
This project uses pyproject.toml to handle its dependencies. You can install them using pip like so:

    pip install .

However, we recommend using uv to manage the environment. First, install uv, then clone / download this repo, then run:

    uv sync

This will automatically install the right Python version, create a virtual environment, and install the required packages. If you choose not to use uv, you can replace uv run in the code examples in this repo with python.
🍏 macOS note: if you encounter the error

    error: command 'cmake' failed: No such file or directory

you need to install cmake first, e.g., through brew install cmake. Similarly, you may have to install apache-arrow separately as well (brew install apache-arrow). Once these dependency issues are solved, run uv sync one more time.
The full data processing pipeline looks like this:
Each of the separate processing steps (rectangles in the above image) has its own subfolder with its own readme documentation:
- Open archive processing in ./src/process_open_archive/
- Delpher API harvesting in ./src/harvest_delpher_api/
- Final database creation in ./src/create_database/
For a basic analysis after the database has been created, take a look at the file src/analysis/query_db.py.
For more in-depth analysis and usage scripts, take a look at our analysis repository: disease_database_analysis.
This project is developed and maintained by the ODISSEI Social Data Science (SoDa) team.
Do you have questions, suggestions, or remarks? File an issue in the issue tracker or feel free to contact the team at odissei-soda.nl.
