sodascience/disease_database

Disease database

Project Status: Active. The project has reached a stable, usable state and is being actively developed.

Code to create a historical disease database (19th-20th century) for municipalities in the Netherlands.

Figure: Cholera in the Netherlands. Cholera mention rates in the mid-1860s (source code).

This database was produced by:

  • 🚜 Harvesting >80 million Dutch newspaper texts in the period 1830-1940 from Delpher.
  • 🔎 Finding mentions of locations and diseases in these texts via hand-crafted regex.
  • 💽 Processing the results and creating a user-friendly historical disease database for the following diseases:
    • cholera, diphtheria, dysentery, influenza, malaria, measles, scarlet fever, smallpox, tuberculosis, and typhus.
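
As a rough illustration of the regex step above, the sketch below flags which diseases and locations occur in a single newspaper text. The patterns, the example sentence, and the helper name `find_mentions` are hypothetical and heavily simplified; the project's actual hand-crafted patterns are not reproduced here.

```python
import re

# Hypothetical, simplified patterns for illustration only.
# The spelling variants (typhus/tiphus, Leiden/Leyden) are assumptions
# about the kind of variation historical texts may contain.
DISEASE_PATTERNS = {
    "cholera": re.compile(r"\bcholera\b", re.IGNORECASE),
    "typhus": re.compile(r"\bt[yi]phus\b", re.IGNORECASE),
}
LOCATION_PATTERNS = {
    "Utrecht": re.compile(r"\bUtrecht\b"),
    "Leiden": re.compile(r"\bLe[iy]den\b"),
}


def find_mentions(text: str) -> dict[str, list[str]]:
    """Return the diseases and locations whose pattern matches the text."""
    return {
        "diseases": [name for name, pat in DISEASE_PATTERNS.items() if pat.search(text)],
        "locations": [name for name, pat in LOCATION_PATTERNS.items() if pat.search(text)],
    }


# Made-up example sentence, not taken from Delpher.
example = "Te Leyden zijn gisteren drie nieuwe gevallen van cholera gemeld."
mentions = find_mentions(example)
print(mentions)
```

Keeping one compiled pattern per disease and per location makes it straightforward to count, for every text, which combinations co-occur.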

Download the database from the latest release page.

Other resources related to this database:

  • 🐻‍❄️ Polars is the engine that powers this data processing pipeline, together with the Apache Parquet format
  • 🌍 NLGIS provides historical geographic data for mapping, plotting, and more.
  • 🗺️ Disease database viewer: an experimental R shiny app to interactively view the disease database.
  • 🕵️‍♀️ Initial exploration into smoothing the mention rates within the disease database, using spatial, temporal, and spatiotemporal models.

Installation

This project uses pyproject.toml to handle its dependencies. You can install them using pip like so:

pip install .

However, we recommend using uv to manage the environment. First, install uv, then clone / download this repo, then run:

uv sync

This will automatically install the right Python version, create a virtual environment, and install the required packages. If you choose not to use uv, you can replace uv run in the code examples in this repo with python.

🍏 macOS note: if you encounter error: command 'cmake' failed: No such file or directory, you need to install cmake first, e.g., through brew install cmake. Similarly, you may have to install apache-arrow separately as well (brew install apache-arrow). Once these dependency issues are solved, run uv sync one more time.

Running the data processing pipeline

The full data processing pipeline looks like this:

disease database flow

Each of the separate processing steps (rectangles in the above image) has its own subfolder with its own README documentation.

Data analysis

For a basic analysis after the database has been created, take a look at the file src/analysis/query_db.py.
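
The script itself is not reproduced here. As a hedged sketch of the kind of quantity such an analysis works with, the example below computes a mention rate per year and municipality. It assumes the rate is the share of texts mentioning a municipality that also mention the disease; that definition, the record layout, and the function name `mention_rates` are illustrative assumptions, and plain Python is used instead of Polars so the example runs without extra dependencies.

```python
from collections import defaultdict

# Hypothetical per-text records: (year, municipality mentioned, disease mentioned or None).
# These values are made up for illustration and are not taken from the database.
records = [
    (1866, "Utrecht", "cholera"),
    (1866, "Utrecht", None),
    (1866, "Utrecht", "cholera"),
    (1866, "Leiden", None),
    (1867, "Utrecht", None),
]


def mention_rates(rows, disease):
    """Share of texts per (year, municipality) that mention the given disease."""
    totals = defaultdict(int)  # texts mentioning the municipality
    hits = defaultdict(int)    # of those, texts also mentioning the disease
    for year, municipality, mentioned in rows:
        key = (year, municipality)
        totals[key] += 1
        if mentioned == disease:
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}


rates = mention_rates(records, "cholera")
for (year, municipality), rate in sorted(rates.items()):
    print(year, municipality, round(rate, 3))
```

Normalising by the number of texts that mention the municipality keeps places with heavy newspaper coverage from dominating purely because more was written about them.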

For more in-depth analysis and usage scripts, take a look at our analysis repository: disease_database_analysis.

Contact

This project is developed and maintained by the ODISSEI Social Data Science (SoDa) team.

Do you have questions, suggestions, or remarks? File an issue in the issue tracker or feel free to contact the team at odissei-soda.nl.
