MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records

MedAlign is a benchmark dataset of 983 clinician-curated natural language instructions grounded in 275 longitudinal electronic health records (EHRs). It includes 303 reference responses to support evaluation of large language models (LLMs) on clinical reasoning, timeline understanding, and multi-document synthesis.

Warning: MedAlign is a test-only benchmark. It is not intended for supervised model training.

📊 Dataset Contents

The benchmark includes:

Component	Count
Patients	275
Clinical notes	46,252
Distinct note types	128
Clinical events (OMOP format)	3.6 million
Instructions (deduplicated)	983
Reference responses	303

All EHR data is longitudinal and standardized using the OMOP Common Data Model (CDM).
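Because OMOP stores each event type in its own table with its own date column, working with a longitudinal record usually starts by merging those tables into one chronological timeline per patient. The following is a minimal sketch of that idea, not the MedAlign loading code: the rows are made-up, the concept IDs are placeholders, and the function name `build_timeline` is hypothetical. Only the table and column names (`condition_occurrence`, `drug_exposure`, `measurement`, `person_id`, and the date fields) come from the OMOP CDM; the actual file layout of the MedAlign release may differ.

```python
from datetime import date

# Made-up rows with placeholder concept IDs; this is NOT MedAlign data.
tables = {
    "condition_occurrence": [
        {"person_id": 1, "condition_concept_id": 1001, "condition_start_date": date(2019, 3, 2)},
        {"person_id": 2, "condition_concept_id": 1002, "condition_start_date": date(2020, 1, 1)},
    ],
    "drug_exposure": [
        {"person_id": 1, "drug_concept_id": 2001, "drug_exposure_start_date": date(2019, 3, 5)},
    ],
    "measurement": [
        {"person_id": 1, "measurement_concept_id": 3001, "measurement_date": date(2019, 1, 15)},
    ],
}

# Each OMOP table names its date column differently.
DATE_COLUMNS = {
    "condition_occurrence": "condition_start_date",
    "drug_exposure": "drug_exposure_start_date",
    "measurement": "measurement_date",
}


def build_timeline(tables, person_id):
    """Return (date, table_name, row) tuples for one patient, oldest first."""
    events = []
    for table_name, rows in tables.items():
        date_col = DATE_COLUMNS[table_name]
        for row in rows:
            if row["person_id"] == person_id:
                events.append((row[date_col], table_name, row))
    # Sort by date, then by table name so ties have a stable order.
    events.sort(key=lambda event: (event[0], event[1]))
    return events


for event_date, table_name, _row in build_timeline(tables, person_id=1):
    print(event_date.isoformat(), table_name)
```

A timeline in this shape can then be serialized into whatever context format a model under evaluation expects.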

⚠️ Usage Restrictions

Caution: Violations of the data use agreement will result in revoked access and may trigger institutional reporting.

❌ Do not train or fine-tune models on MedAlign data (evaluation only).
❌ Do not transmit MedAlign data to commercial APIs (e.g., ChatGPT, Claude, Gemini) that are not HIPAA compliant.
❌ Do not redistribute dataset files or any derivative datasets.
✅ Derived research artifacts (e.g., annotations, synthetic data) must be hosted on Redivis with prior approval from the MedAlign team.

All usage must strictly comply with the MedAlign DUA.
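One way to reduce the risk of accidentally sending data to an unapproved service is to check every inference endpoint against a locally maintained allowlist before any request is made. The sketch below illustrates that pattern only. The names `APPROVED_HOSTS` and `assert_endpoint_approved` are hypothetical, the allowlist contents are an assumption, and passing this check does not by itself establish HIPAA compliance or conformance with the DUA.

```python
from urllib.parse import urlparse

# Hypothetical allowlist: hosts your institution has approved for this data.
# The entries here are placeholders for a locally hosted model server.
APPROVED_HOSTS = {"localhost", "127.0.0.1"}


def assert_endpoint_approved(url, approved_hosts=APPROVED_HOSTS):
    """Raise PermissionError unless the URL's host is on the allowlist."""
    host = urlparse(url).hostname  # None if the URL has no network location
    if host is None or host not in approved_hosts:
        raise PermissionError(
            f"Refusing to send MedAlign data to unapproved host: {host!r}"
        )
    return host


# A locally served model passes the check.
assert_endpoint_approved("http://localhost:8000/v1/generate")

# An external host is rejected before any data leaves the machine.
try:
    assert_endpoint_approved("https://api.example.com/v1/chat")
except PermissionError as err:
    print(err)
```

Calling the check at the single place where requests are built keeps the guard hard to bypass by accident.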

🔐 Access Requirements

To gain access, please complete the following steps:

1. Apply via Redivis Portal
   Use an academic, government, or industry research email. Applications from personal (e.g., Gmail) accounts will be rejected.
2. Complete HIPAA-compliant CITI training
   You must include proof of training in your application.
3. Describe your research use case
   A short paragraph outlining your intended use is sufficient.
4. Sign the MedAlign Data Use Agreement (DUA)
   This will be sent to you after your application is reviewed.
5. Verify encryption and secure storage
   You must attest to storing the data on encrypted, access-controlled machines. Cloud use requires HIPAA compliance.

⏱️ Applications are reviewed within 7–10 business days.

📚 Citation

If you use MedAlign in your work, please cite:

@inproceedings{DBLP:conf/aaai/FlemingLHJRTBGS24,
  author       = {Scott L. Fleming and Alejandro Lozano and William J. Haberkorn and Jenelle A. Jindal and Eduardo Reis and Rahul Thapa and Louis Blankemeier and Julian Z. Genkins and Ethan Steinberg and Ashwin Nayak and Birju S. Patel and Chia{-}Chun Chiang and Alison Callahan and Zepeng Huo and Sergios Gatidis and Scott J. Adams and Oluseyi Fayanju and Shreya J. Shah and Thomas Savage and Ethan Goh and Akshay S. Chaudhari and Nima Aghaeepour and Christopher D. Sharp and Michael A. Pfeffer and Percy Liang and Jonathan H. Chen and Keith E. Morse and Emma P. Brunskill and Jason A. Fries and Nigam H. Shah},
  title        = {MedAlign: {A} Clinician-Generated Dataset for Instruction Following with Electronic Medical Records},
  booktitle    = {Thirty-Eighth {AAAI} Conference on Artificial Intelligence},
  year         = {2024},
  url          = {https://doi.org/10.1609/aaai.v38i20.30205},
  doi          = {10.1609/AAAI.V38I20.30205},
}
