This repository provides a full pipeline for simulating longitudinal patient-note data, extracting contextual embeddings from clinical notes, and training various deep learning architectures to predict chronic kidney disease (CKD) progression. The pipeline consists of two main scripts:
This script reads tabular event data and ICD mappings to generate synthetic, per-patient, per-day clinical notes. These notes are constructed by integrating demographic attributes, diagnoses (ICD-10 codes), medications, and procedures. The script maps GFR readings to CKD stages (1--5) using forward-filled values while enforcing monotonic progression. The resulting notes are encoded into dense representations (CLS embeddings) using a pretrained transformer model (e.g., ClinicalBERT).
Each embedding is stored in a per-patient directory with corresponding metadata (GFR, CKD stage, text). Metadata is saved to a CSV file for downstream modeling.
Usage:
bash run_embedding.shThis script reads the embeddings and associated metadata and builds sequences of patient-note embeddings. The task is to predict whether the patient will progress to stage 4 or higher CKD at the next time step. Labels are derived from cleaned CKD stages, converted into a binary label indicating whether stage 4+ is reached.
It supports training and evaluation of the following sequence models:
- RNN and LSTM with optional bidirectionality
- Transformer encoder with positional encoding
- MLP over flattened embedding windows
- Temporal Convolutional Network (TCN)
- Neural ODE (if
torchdiffeqis installed)
Each model is trained using early stopping and learning rate scheduling. Test performance includes Accuracy, F1, Precision, Recall, AUROC, and AUPRC. An optional label-switch analysis tracks whether models anticipate CKD progression earlier than the ground truth.
Usage:
bash run_modeling.shThe script embedding.py produces a metadata CSV file named patient_embedding_metadata.csv. Each row in the file corresponds to a single patient-day pair and contains the following fields:
| PatientID | EventDate | GFR | CKD_stage | text | embedding_file |
|-----------|------------|------|------------|----------------------------------------|---------------------------|
| Z1062220 | 2022-07-14 | 89.0 | 2 | ions: Ophthalmics - Misc. | Z1062220_20220714.npz |
| Z1062220 | 2022-07-15 | 89.0 | 2 | Central Muscle Relaxants | Z1062220_20220715.npz |
| Z1062220 | 2022-07-19 | 89.0 | 2 | uptake Inhibitors (SNRIs) | Z1062220_20220719.npz |
| Z1062220 | 2022-07-20 | 89.0 | 2 | ions: Antiperistaltic Agents | Z1062220_20220720.npz |
| Z1062220 | 2022-07-21 | 89.0 | 2 | ons: Antiperistaltic Agents | Z1062220_20220721.npz |
| Z1062220 | 2022-07-25 | 89.0 | 2 | cations: Analgesics Other | Z1062220_20220725.npz |
| Z1062220 | 2022-07-27 | 89.0 | 2 | counter: APPOINTMENT | Z1062220_20220727.npz |
| Z1062220 | 2022-08-01 | 89.0 | 2 | s: Penicillin Combinations | Z1062220_20220801.npz |
| Z1062220 | 2022-08-03 | 89.0 | 2 | e - Labs: WBC - Labs: K | Z1062220_20220803.npz |
| Z1062220 | 2022-08-04 | 89.0 | 2 | 110.1T: Unknown condition | Z1062220_20220804.npz |
| Z1062220 | 2022-08-10 | 89.0 | 2 | order, recurrent, moderate | Z1062220_20220810.npz |
| Z1062220 | 2022-08-15 | 89.0 | 2 | ions: Oil Soluble Vitamins | Z1062220_20220815.npz |
| Z1062220 | 2022-08-16 | 89.0 | 2 | ions: Antibiotics - Topical | Z1062220_20220816.npz |
| Z1062220 | 2022-08-17 | 89.0 | 2 | ions: Ophthalmics - Misc. | Z1062220_20220817.npz |Each embedding is stored in a compressed .npz file and follows the folder structure:
{output_dir}/{PatientID}/{PatientID}_{EventDate}.npz
For instance, embeddings for patient Z1062220 would be located at:
ckd_embeddings_100/Z1062220/Z1062220_20220714.npz
ckd_embeddings_100/Z1062220/Z1062220_20220715.npz
...
This design allows efficient access of per-patient longitudinal embeddings for modeling.
run_embedding.sh
#!/bin/bash
python embeddings.py \
--csv patients_subset_100.csv \
--icd icd_mapping.csv \
--output_dir ckd_embeddings_100 \
--model_name emilyalsentzer/Bio_ClinicalBERT \
--embed_dim 768 \
--batch_size 128run_modeling.sh
#!/bin/bash
python ckd_prediction.py \
--embedding-root ckd_embeddings_100 \
--window-size 10 \
--embed-dim 768 \
--epochs 50 \
--batch-size 64 \
--lr 5e-3 \
--patience 5 \
--scheduler-patience 2 \
--metadata-file patient_embedding_metadata.csv \
--hidden-dim 128 \
--num-layers 2 \
--rnn-dropout 0.2 \
--rnn-bidir \
--transformer-nhead 4 \
--transformer-dim-feedforward 256 \
--transformer-dropout 0.2Make sure to chmod +x run_embedding.sh run_modeling.sh before executing the scripts.
This work utilized resources provided by the UCLA Department of Computational Medicine.
This work was also supported by OptumLabs.
For further inquiries or collaboration, please contact:
- Simon A. Lee: [email protected]
- Jeffrey N. Chiang: [email protected]