Conversation

@hannesill

Why?

  • Local support for the full MIMIC-IV dataset lets researchers avoid Google BigQuery and is free.
  • SQLite didn’t scale for the full MIMIC-IV dataset and required upfront ingestion; it also made the demo and full paths diverge.
  • Parquet + DuckDB provides columnar performance, streaming conversion, a low memory footprint, and a unified local backend for both demo and full datasets (a short sketch of the conversion follows).
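
For illustration, the CSV.gz → Parquet conversion can be sketched with DuckDB's Python API. This is a minimal sketch, not the actual m3 implementation; the paths below are hypothetical:

```python
import duckdb

# Hypothetical paths; m3 resolves the real ones from its runtime config.
src = "raw/mimic-iv-demo/hosp/admissions.csv.gz"
dst = "parquet/mimic-iv-demo/hosp/admissions.parquet"

con = duckdb.connect()  # an in-memory connection is enough for a file-to-file copy
# DuckDB reads the gzipped CSV and writes Parquet in a streaming fashion,
# so the whole table never has to fit in memory at once.
con.execute(f"COPY (SELECT * FROM read_csv_auto('{src}')) TO '{dst}' (FORMAT PARQUET)")
con.close()
```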

What changed?

  • Config
    • Added mimic-iv-full to SUPPORTED_DATASETS with default DuckDB path and verification table.
    • Runtime config tracks duckdb_paths and parquet_roots for both demo and full, with detection and active_dataset.
  • CLI
    • m3 convert mimic-iv-[demo|full]: stream CSV.gz → Parquet via DuckDB.
    • m3 init mimic-iv-[demo|full]: create DuckDB views over Parquet.
    • m3 use [demo|full|bigquery] and m3 status updated to handle both local datasets.
  • Data IO
    • Generic Parquet → DuckDB view creation with progress logging and resource knobs (see the sketch after this list).
    • Utilities for directory size and ensuring DB/view creation.
  • Docs
    • README adds “Local Full Dataset (DuckDB + Parquet)” with steps and performance tuning.
  • Tests
    • Added/updated tests for convert/init flows, verification checks, and CLI behaviors.
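
As a rough illustration of the "Generic Parquet → DuckDB view creation" item above (a sketch only; the directory layout, database path, and loop are assumptions, not the actual m3 code):

```python
import duckdb
from pathlib import Path

# Hypothetical locations; m3 tracks the real ones in duckdb_paths / parquet_roots.
parquet_root = Path("parquet/mimic-iv-full")
db_path = "databases/mimic_iv_full.duckdb"

con = duckdb.connect(db_path)
for parquet_file in sorted(parquet_root.rglob("*.parquet")):
    view_name = parquet_file.stem  # e.g. admissions.parquet -> admissions
    # Views keep the .duckdb file tiny: the data itself stays in Parquet.
    con.execute(
        f"CREATE OR REPLACE VIEW {view_name} AS "
        f"SELECT * FROM read_parquet('{parquet_file.as_posix()}')"
    )
con.close()
```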

Important notes

  • Local demo now uses DuckDB over Parquet (SQLite is no longer used).
  • Users should run:
    • m3 download mimic-iv-demo && m3 convert mimic-iv-demo (once), then m3 init mimic-iv-demo
    • For full: download the dataset manually, then run m3 convert mimic-iv-full followed by m3 init mimic-iv-full

Out of scope

  • Full dataset downloader remains manual. m3 download still supports demo only.

Performance tuning

  • The M3_CONVERT_MAX_WORKERS, M3_DUCKDB_MEM, and M3_DUCKDB_THREADS environment variables are supported (see the sketch below).
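
A sketch of how such knobs can be applied (the variable names come from this PR, but the defaults and the exact mapping to DuckDB settings are assumptions):

```python
import os
import duckdb

# Illustrative defaults only.
mem = os.environ.get("M3_DUCKDB_MEM", "4GB")
threads = int(os.environ.get("M3_DUCKDB_THREADS", "4"))
max_workers = int(os.environ.get("M3_CONVERT_MAX_WORKERS", "4"))  # parallel CSV→Parquet conversions

con = duckdb.connect()
con.execute(f"SET memory_limit='{mem}'")  # cap DuckDB's memory use
con.execute(f"SET threads={threads}")     # threads per DuckDB connection
```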

Tests

  • All tests pass, including the newly added tests covering the new CLI and Parquet + DuckDB workflows.

Branch

  • local_full_dataset

@simonprovost
Collaborator

Super nice to see DuckDB support! Can't wait to try it out! Surely @rafiattrach will proceed with his review, but I pre-approve this nice PR!

Though, may we wait for #60 to land first? That would make it much easier for you, @hannesill, to rebase and resolve any small forthcoming README conflicts.

@rafiattrach
Owner

@hannesill Thanks for this excellent PR! 🙏 Really appreciate the work on local full dataset support, this will definitely be a super nice addition!

I'll look into this in more detail over the coming days, but from a quick scan of your notes I have some questions about the flow and user experience, especially for clinicians who need simple workflows as we've seen in datathons:

Main concern

The current flow seems to add complexity:

m3 download mimic-iv-demo && m3 convert mimic-iv-demo (once), then m3 init mimic-iv-demo

Questions:

  1. Simplicity vs. separation - Would splitting download/convert/init make onboarding harder for clinicians? The original m3 init was designed to be one-command simple.

  2. Why init after convert? - If we're separating steps, what does init add that convert doesn't handle?

  3. Unified approach? - What about keeping it simple with:

    • m3 init demo (downloads + converts)
    • m3 init full (downloads + converts)

    Or if you prefer separation, could you explain the benefit of the three-step process?

Thanks once again though!

@simonprovost
Collaborator

To avoid confusion, I added @rafiattrach as the main peer reviewer so this isn't misinterpreted and merged by mistake. ^^

@hannesill
Author

hannesill commented Oct 30, 2025

@rafiattrach thanks for the comment! I was mainly doing it for flexibility as a dev:

  • init as a fast command to (re)create the DuckDB views over the Parquet files and do some quick DB verification
  • download and convert as commands that take much longer and may require special flags like --continue if, for example, the download is interrupted (downloading the full dataset took 10h for me; I did it with wget, but we may want to support m3 download full in the future)

However, I see your point that it should be super simple for clinicians. We could perhaps do it like this (a rough sketch of this fallback logic follows the list below):

  • m3 init mimic-iv-demo (or just demo):

    • If Parquet exists: create/refresh views only
    • Else if raw CSVs exist: convert → init
    • Else: download → convert → init
  • m3 init mimic-iv-full (or just full):

    • If Parquet exists: init only
    • Else if raw CSVs exist: convert → init
    • Else: Display a clear message explaining where to place the CSV files (since full download isn't available via CLI), show the expected directory structure, and link to documentation
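
A minimal sketch of that fallback logic (the helper functions and directory layout are hypothetical placeholders, not the actual m3 code):

```python
from pathlib import Path

# Hypothetical layout and helpers; m3 keeps the real paths in its runtime config.
PARQUET_ROOT, RAW_ROOT = Path("parquet"), Path("raw")

def refresh_views(name: str) -> None: ...       # placeholder: CREATE OR REPLACE VIEW over Parquet
def convert_to_parquet(name: str) -> None: ...  # placeholder: stream CSV.gz -> Parquet via DuckDB
def download_demo() -> None: ...                # placeholder: fetch the demo CSV.gz files

def init_dataset(name: str) -> None:
    """Sketch of the proposed one-command `m3 init` fallback behavior."""
    if any((PARQUET_ROOT / name).rglob("*.parquet")):
        refresh_views(name)                     # fast path: (re)create views only
    elif any((RAW_ROOT / name).rglob("*.csv.gz")):
        convert_to_parquet(name)
        refresh_views(name)
    elif name == "mimic-iv-demo":
        download_demo()                         # demo is small enough to auto-download
        convert_to_parquet(name)
        refresh_views(name)
    else:
        # Full dataset is not downloadable via the CLI: tell the user where to
        # place the CSVs, show the expected layout, and link to the docs.
        raise SystemExit(f"Place the {name} CSV.gz files under {RAW_ROOT / name}, then re-run m3 init.")
```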

Optional flags for power users:

  • --force-download: Force download even if files exist
  • --force-convert: Force conversion even if Parquet exists
  • --continue / -c: Resume interrupted downloads or conversions (especially useful for the 10h full dataset download where you don't want to restart if it got interrupted)
  • Maybe you can think of other flags that may be useful

This approach keeps the clinical workflow simple (one command) while giving devs/power-users the control they need by hiding the extra functionality/complexity in optional flags. What do you think?

@rafiattrach
Owner

@hannesill yes sounds good, thanks a lot Hannes!

@simonprovost added the enhancement, good first issue, datasets, and CLI labels on Nov 4, 2025
@hannesill
Author

@rafiattrach here's the modified version as described in my suggestion.

The available flags right now for m3 init are:

  • --src if you have the dataset downloaded somewhere else and want to initialize from there
  • --db_path for specifying where to put the .duckdb file (useful if you don't want to overwrite a .duckdb file in the default database directory)

The flags I suggested above are probably only useful when we start supporting auto-download for big datasets like mimic-iv-full.

Important note: For using the demo, the CLI is unchanged now. Clinicians and devs already using m3 with the demo dataset don't need to relearn anything.

All tests still pass.

@rafiattrach
Owner

@hannesill this is fantastic work! I've tested the entire workflow for the DuckDB demo, BigQuery, and the new full local dataset, and it all works perfectly!

Could you please just rebase the branch on main to incorporate the latest updates and resolve the README conflicts? A thought I had: we could extend the newly added Quick Start table with a third column for the new Local Full Dataset option. It would be a really clear way to show users all three setup paths side by side.

Commits

  • DuckDB becomes useful when we want to query not just the small demo dataset locally, but the full MIMIC-IV dataset. …ith duckdb backend.
    Also, fixed the working path to not be m3/src/m3 anymore, but m3/ instead.
  • … keep init fast (views only)
    * Remove SQLite; unify local backends on DuckDB + Parquet for demo and full
    * CLI:
      * add m3 download (demo only) to fetch CSV.gz
      * add m3 convert (demo/full) to convert CSV.gz → Parquet
      * m3 init now creates/refreshes DuckDB views over existing Parquet
    * Update status/use/config to new dataset model
    * Refresh tests and README for new workflow
@rafiattrach
Owner

Also @hannesill, could you check the test failures? They're probably from the recent additions, since you mentioned all tests passed before.

@hannesill
Author

@rafiattrach Weird, all tests pass for me locally. I'll look into it.
