Conversation

@hannesill

Why?

  • Local support for the full MIMIC-IV dataset lets researchers avoid Google BigQuery and is free.
  • SQLite didn’t scale for the full MIMIC-IV dataset and required upfront ingestion; it also made the demo and full paths diverge.
  • Parquet + DuckDB provides columnar performance, streaming conversion, a low memory footprint, and a unified local backend for both demo and full datasets (a short sketch of the conversion follows).
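
For illustration, the CSV.gz → Parquet conversion can be sketched with DuckDB's Python API. This is a minimal sketch, not the actual m3 implementation; the paths below are hypothetical:

```python
import duckdb

# Hypothetical paths; m3 resolves the real ones from its runtime config.
src = "raw/mimic-iv-demo/hosp/admissions.csv.gz"
dst = "parquet/mimic-iv-demo/hosp/admissions.parquet"

con = duckdb.connect()  # an in-memory connection is enough for a file-to-file copy
# DuckDB reads the gzipped CSV and writes Parquet in a streaming fashion,
# so the whole table never has to fit in memory at once.
con.execute(f"COPY (SELECT * FROM read_csv_auto('{src}')) TO '{dst}' (FORMAT PARQUET)")
con.close()
```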

What changed?

  • Config
    • Added mimic-iv-full to SUPPORTED_DATASETS with default DuckDB path and verification table.
    • Runtime config tracks duckdb_paths and parquet_roots for both demo and full, with detection and active_dataset.
  • CLI
    • m3 convert mimic-iv-[demo|full]: stream CSV.gz → Parquet via DuckDB.
    • m3 init mimic-iv-[demo|full]: create DuckDB views over Parquet.
    • m3 use [demo|full|bigquery] and m3 status updated to handle both local datasets.
  • Data IO
    • Generic Parquet → DuckDB view creation with progress logging and resource knobs (see the sketch after this list).
    • Utilities for directory size and ensuring DB/view creation.
  • Docs
    • README adds “Local Full Dataset (DuckDB + Parquet)” with steps and performance tuning.
  • Tests
    • Added/updated tests for convert/init flows, verification checks, and CLI behaviors.
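
As a rough illustration of the "Generic Parquet → DuckDB view creation" item above (a sketch only; the directory layout, database path, and loop are assumptions, not the actual m3 code):

```python
import duckdb
from pathlib import Path

# Hypothetical locations; m3 tracks the real ones in duckdb_paths / parquet_roots.
parquet_root = Path("parquet/mimic-iv-full")
db_path = "databases/mimic_iv_full.duckdb"

con = duckdb.connect(db_path)
for parquet_file in sorted(parquet_root.rglob("*.parquet")):
    view_name = parquet_file.stem  # e.g. admissions.parquet -> admissions
    # Views keep the .duckdb file tiny: the data itself stays in Parquet.
    con.execute(
        f"CREATE OR REPLACE VIEW {view_name} AS "
        f"SELECT * FROM read_parquet('{parquet_file.as_posix()}')"
    )
con.close()
```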

Important notes

  • Local demo now uses DuckDB over Parquet (SQLite is no longer used).
  • Users should run:
    • m3 download mimic-iv-demo && m3 convert mimic-iv-demo (once), then m3 init mimic-iv-demo
    • For full: download the dataset manually, then run m3 convert mimic-iv-full followed by m3 init mimic-iv-full

Out of scope

  • Full dataset downloader remains manual. m3 download still supports demo only.

Performance tuning

  • The M3_CONVERT_MAX_WORKERS, M3_DUCKDB_MEM, and M3_DUCKDB_THREADS environment variables are supported (see the sketch below).
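
A sketch of how such knobs can be applied (the variable names come from this PR, but the defaults and the exact mapping to DuckDB settings are assumptions):

```python
import os
import duckdb

# Illustrative defaults only.
mem = os.environ.get("M3_DUCKDB_MEM", "4GB")
threads = int(os.environ.get("M3_DUCKDB_THREADS", "4"))
max_workers = int(os.environ.get("M3_CONVERT_MAX_WORKERS", "4"))  # parallel CSV→Parquet conversions

con = duckdb.connect()
con.execute(f"SET memory_limit='{mem}'")  # cap DuckDB's memory use
con.execute(f"SET threads={threads}")     # threads per DuckDB connection
```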

Tests

  • All tests pass, including the newly added tests covering the new CLI and Parquet + DuckDB workflows.

Branch

  • local_full_dataset

@simonprovost
Collaborator

Super nice to see DuckDB support! Can't wait to try it out! Surely @rafiattrach will proceed with his review, but I pre-approve this nice PR!

Though, may we wait for #60 to land first? That would make it much easier for you, @hannesill, to rebase and resolve any small forthcoming README conflicts.

@rafiattrach
Owner

@hannesill Thanks for this excellent PR! 🙏 Really appreciate the work on local full dataset support, this will definitely be a super nice addition!

I'll look into this in more detail over the coming days, but from a quick scan of your notes I have some questions about the flow and user experience, especially for clinicians who need simple workflows as we've seen in datathons:

Main concern

The current flow seems to add complexity:

m3 download mimic-iv-demo && m3 convert mimic-iv-demo (once), then m3 init mimic-iv-demo

Questions:

  1. Simplicity vs. separation - Would splitting download/convert/init make onboarding harder for clinicians? The original m3 init was designed to be one-command simple.

  2. Why init after convert? - If we're separating steps, what does init add that convert doesn't handle?

  3. Unified approach? - What about keeping it simple with:

    • m3 init demo (downloads + converts)
    • m3 init full (downloads + converts)

    Or if you prefer separation, could you explain the benefit of the three-step process?

Thanks once again though!

@simonprovost
Collaborator

To avoid confusion, I added @rafiattrach as the main peer reviewer so this isn't misinterpreted and merged by mistake. ^^

@hannesill
Author

hannesill commented Oct 30, 2025

@rafiattrach thanks for the comment! I was mainly doing it for flexibility as a dev:

  • init as a fast command to (re)create the DuckDB views over the Parquet files and do some quick DB verification
  • download and convert as commands that take much longer and may require special flags like --continue if, for example, the download is interrupted (downloading the full dataset took 10h for me; I did it with wget, but we may want to support m3 download full in the future)

However, I see your point that it should be super simple for clinicians. We could perhaps do it like this (a rough sketch of this fallback logic follows the list below):

  • m3 init mimic-iv-demo (or just demo):

    • If Parquet exists: create/refresh views only
    • Else if raw CSVs exist: convert → init
    • Else: download → convert → init
  • m3 init mimic-iv-full (or just full):

    • If Parquet exists: init only
    • Else if raw CSVs exist: convert → init
    • Else: Display a clear message explaining where to place the CSV files (since full download isn't available via CLI), show the expected directory structure, and link to documentation
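
A minimal sketch of that fallback logic (the helper functions and directory layout are hypothetical placeholders, not the actual m3 code):

```python
from pathlib import Path

# Hypothetical layout and helpers; m3 keeps the real paths in its runtime config.
PARQUET_ROOT, RAW_ROOT = Path("parquet"), Path("raw")

def refresh_views(name: str) -> None: ...       # placeholder: CREATE OR REPLACE VIEW over Parquet
def convert_to_parquet(name: str) -> None: ...  # placeholder: stream CSV.gz -> Parquet via DuckDB
def download_demo() -> None: ...                # placeholder: fetch the demo CSV.gz files

def init_dataset(name: str) -> None:
    """Sketch of the proposed one-command `m3 init` fallback behavior."""
    if any((PARQUET_ROOT / name).rglob("*.parquet")):
        refresh_views(name)                     # fast path: (re)create views only
    elif any((RAW_ROOT / name).rglob("*.csv.gz")):
        convert_to_parquet(name)
        refresh_views(name)
    elif name == "mimic-iv-demo":
        download_demo()                         # demo is small enough to auto-download
        convert_to_parquet(name)
        refresh_views(name)
    else:
        # Full dataset is not downloadable via the CLI: tell the user where to
        # place the CSVs, show the expected layout, and link to the docs.
        raise SystemExit(f"Place the {name} CSV.gz files under {RAW_ROOT / name}, then re-run m3 init.")
```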

Optional flags for power users:

  • --force-download: Force download even if files exist
  • --force-convert: Force conversion even if Parquet exists
  • --continue / -c: Resume interrupted downloads or conversions (especially useful for the 10h full dataset download where you don't want to restart if it got interrupted)
  • Maybe you can think of other flags that may be useful

This approach keeps the clinical workflow simple (one command) while giving devs/power-users the control they need by hiding the extra functionality/complexity in optional flags. What do you think?

@rafiattrach
Owner

@hannesill yes sounds good, thanks a lot Hannes!

@simonprovost added the enhancement, good first issue, datasets, and CLI labels on Nov 4, 2025
@hannesill
Author

@rafiattrach here's the modified version as described in my suggestion.

The available flags right now for m3 init are:

  • --src if you have the dataset downloaded somewhere else and want to initialize from there
  • --db_path for specifying where to put the .duckdb file (useful if you don't want to overwrite a .duckdb file in the default database directory)

The flags I suggested above are probably only useful when we start supporting auto-download for big datasets like mimic-iv-full.

Important note: For using the demo, the CLI is unchanged now. Clinicians and devs already using m3 with the demo dataset don't need to relearn anything.

All tests still pass.

@rafiattrach
Owner

@hannesill this is fantastic work! I've tested the entire workflow for the DuckDB demo, BigQuery, and the new full local dataset, and it all works perfectly!

Could you please just rebase the branch on main to incorporate the latest updates and resolve the README conflicts? A thought I had: we could extend the newly added Quick Start table with a third column for the new Local Full Dataset option. It would be a really clear way to show users all three setup paths side by side.

Commits

  • DuckDB becomes useful when we want to query not just the small demo dataset locally, but the full MIMIC-IV dataset. …ith duckdb backend.
    Also, fixed the working path to not be m3/src/m3 anymore, but m3/ instead.
  • … keep init fast (views only)
    * Remove SQLite; unify local backends on DuckDB + Parquet for demo and full
    * CLI:
      * add m3 download (demo only) to fetch CSV.gz
      * add m3 convert (demo/full) to convert CSV.gz → Parquet
      * m3 init now creates/refreshes DuckDB views over existing Parquet
    * Update status/use/config to new dataset model
    * Refresh tests and README for new workflow
@rafiattrach
Owner

Also @hannesill, could you check the test failures? They're probably from the recent additions, since you mentioned all tests passed before.

@hannesill
Author

@rafiattrach Weird, all tests pass for me locally. I'll look into it.
