-
Notifications
You must be signed in to change notification settings - Fork 12
feat: Add local MIMIC-IV full dataset support (Replace SQLite with Parquet + DuckDB) #62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
simonprovost
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super nice to see a DuckDB support! Can't wait to try it out! Surely @rafiattrach will proceed with his review but I pre-approve this nice PR!
Though may we wait for #60 to pass first, much easier that way for you @hannesill to rebase and solve the README potential forthcoming tiny conflicts?
|
@hannesill Thanks for this excellent PR! 🙏 Really appreciate the work on local full dataset support, this will definitely be a super nice addition! I'll look into this in more detail over the coming days, but from a quick scan of your notes I have some questions about the flow and user experience, especially for clinicians who need simple workflows as we've seen in datathons: Main concernThe current flow seems to add complexity: m3 download mimic-iv-demo && m3 convert mimic-iv-demo (once), then m3 init mimic-iv-demoQuestions:
Thanks once again though! |
|
To avoid confusion, I added @rafiattrach as the main peer reviewer to prevent this from being misinterpreted and merged by mistake. ^^ |
|
@rafiattrach thanks for the comment! I was mainly doing it for flexibility as a dev:
However, I see what you mean with that it should be super simple for clinicians. We could perhaps do it like that:
Optional flags for power users:
This approach keeps the clinical workflow simple (one command) while giving devs/power-users the control they need by hiding the extra functionality/complexity in optional flags. What do you think? |
|
@hannesill yes sounds good, thanks a lot Hannes! |
|
@rafiattrach here's the modified version as described in my suggestion. The available flags right now for
The flags I suggested above are probably only useful when we start supporting auto-download for big datasets like mimic-iv-full. Important note: For using the demo, the CLI is unchanged now. Clinicians and devs already using m3 with the demo dataset don't need to relearn anything. All tests still pass. |
|
@hannesill this is fantastic work! I've tested the entire workflow for the demo duckdb, BigQuery, and the new full local dataset, and it all works perfectly! Could you please just rebase the branch on main to incorporate the latest updates especially the conflicts with the README? A thought I had was to extend the newly added Quick Start table with a third column for the new Local Full Dataset option. It could be a really clear way to show users all three setup paths side-by-side. |
DuckDB becomes useful when we want to query not just the small demo dataset locally, but the full MIMIC-IV dataset.
…ith duckdb backend. Also, fixed the working path to not be m3/src/m3 anymore, but m3/ instead.
…pied into m3_data/raw manually.
… keep init fast (views only) * Remove SQLite; unify local backends on DuckDB + Parquet for demo and full * CLI: * add m3 download (demo only) to fetch CSV.gz * add m3 convert (demo/full) to convert CSV.gz → Parquet * m3 init now creates/refreshes DuckDB views over existing Parquet * Update status/use/config to new dataset model * Refresh tests and README for new workflow
b91f32a to
4e09739
Compare
|
also @hannesill if you could check the test failures, probably from the recent additions since you mentioned all passed before |
|
@rafiattrach Weird, all tests pass for me locally. I'll have a look into it. |
Why?
What changed?
mimic-iv-fulltoSUPPORTED_DATASETSwith default DuckDB path and verification table.duckdb_pathsandparquet_rootsfor bothdemoandfull, with detection andactive_dataset.m3 convert mimic-iv-[demo|full]: stream CSV.gz → Parquet via DuckDB.m3 init mimic-iv-[demo|full]: create DuckDB views over Parquet.m3 use [demo|full|bigquery]andm3 statusupdated to handle both local datasets.Important notes
m3 download mimic-iv-demo && m3 convert mimic-iv-demo(once), thenm3 init mimic-iv-demom3 convert mimic-iv-fullthenm3 init mimic-iv-fullOut of scope
m3 downloadstill supports demo only.Performance tuning
M3_CONVERT_MAX_WORKERS,M3_DUCKDB_MEM,M3_DUCKDB_THREADSenvironment variables supported.Tests
Branch
local_full_dataset