[AAAI 2025] EcoDatum: Quality over Quantity: Boosting Data Efficiency Through Ensembled Multimodal Data Curation
This repository provides the implementation of EcoDatum, a data curation framework introduced in the paper Quality over Quantity: Boosting Data Efficiency Through Ensembled Multimodal Data Curation. EcoDatum enhances dataset quality by integrating various unimodal and multimodal data curation operators within a weak supervision ensemble framework, leading to improved model training efficiency.
- 2025/04: Print version is released.
- 2025/03: Code is released.
- 2025/02: Paper is published on arXiv.
- 2024/12: Paper is accepted at AAAI 2025.
- 2024/08: SOTA on the DataComp leaderboard.
In the era of big data, effectively curating web-crawled datasets is crucial for optimizing model performance. Traditional heuristic curation methods often fail to capture complex features, leading to biases and the exclusion of relevant data. EcoDatum addresses these challenges by strategically integrating various data curation operators within a weak supervision ensemble framework, utilizing automated optimization to score each data point effectively. This approach significantly improves data curation quality and efficiency, outperforming existing state-of-the-art techniques.
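To make the ensemble idea concrete, here is a minimal sketch of how votes from several curation operators could be combined into one score per data point. It is a simplified stand-in: EcoDatum relies on a weak supervision ensemble with automated optimization, whereas this example uses a plain unweighted vote fraction. The `ABSTAIN` convention, the `vote_fraction` helper, and the example votes are all assumptions made for illustration.

```python
ABSTAIN = -1  # assumed convention: an operator may decline to vote


def vote_fraction(votes):
    """Fraction of non-abstaining votes that say 'keep' (1); None if all abstain."""
    cast = [v for v in votes if v != ABSTAIN]
    if not cast:
        return None
    return sum(cast) / len(cast)


# Rows are samples, columns are votes from three hypothetical operators.
vote_matrix = [
    [1, 1, ABSTAIN],
    [0, 1, 0],
    [ABSTAIN, ABSTAIN, ABSTAIN],
]
scores = [vote_fraction(row) for row in vote_matrix]
print(scores)
```

A learned label model would typically weight operators by their estimated accuracy instead of counting every vote equally.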
[Figure: examples of bad data and good data]
- Ensembled Multimodal Data Curation: Combines multiple data curation operators to enhance dataset quality.
- Quality-Guided Deduplication: Ensures balanced feature distributions by removing redundant data based on quality metrics.
- Automated Optimization: Utilizes a composite metric and a small labeled dataset to fine-tune the integration of curation operators.
- Improved Model Training Efficiency: Demonstrated to enhance model performance across diverse evaluation datasets.
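One way to picture quality-guided deduplication is the sketch below. It assumes that near-duplicate samples have already been grouped (for example, by clustering embeddings) and that each sample carries a quality score. The field names `cluster_id` and `quality` and the `keep_per_cluster` parameter are hypothetical and are not taken from the EcoDatum code, which may work differently.

```python
from collections import defaultdict


def dedup_by_quality(samples, keep_per_cluster=1):
    """Keep the highest-quality samples within each duplicate cluster.

    `samples` is a list of dicts with hypothetical keys
    "id", "cluster_id", and "quality".
    """
    clusters = defaultdict(list)
    for s in samples:
        clusters[s["cluster_id"]].append(s)

    kept = []
    for members in clusters.values():
        # Sort so the best-scoring members come first, then truncate.
        members.sort(key=lambda s: s["quality"], reverse=True)
        kept.extend(members[:keep_per_cluster])
    return kept


samples = [
    {"id": "a", "cluster_id": 0, "quality": 0.9},
    {"id": "b", "cluster_id": 0, "quality": 0.4},
    {"id": "c", "cluster_id": 1, "quality": 0.7},
]
kept_ids = sorted(s["id"] for s in dedup_by_quality(samples))
print(kept_ids)  # ['a', 'c']
```

Raising `keep_per_cluster` trades redundancy for coverage: dense clusters keep more members, but never more than they contain.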
To use EcoDatum, clone this repository and install the required dependencies:
git clone git@github.com:Daming-W/ecodatum.git
cd ecodatum
conda env create -f environment.yml
or
pip install -r requirements.txt
EcoDatum can be used to curate datasets before training vision-language models. Here's a basic example of how to apply EcoDatum to your dataset:
Place your dataset in JSONL format into examples/data/, following the same file structure used in the examples/ template.
Read the operator descriptions in Vaquitai/README.md and choose the operators you need.
Run the selected operators to produce prediction files (JSONL) and move them to examples/ops_results/.
Edit the YAML under examples/config/:
- data_path: absolute or relative path to the files in examples/data/.
- ops_results_path: path to the JSONL files in examples/ops_results/.
- Also set any operator-specific thresholds or parameters.
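Before running the ensemble, it can help to confirm that the config contains the two paths described here. The sketch below works on a config that has already been parsed into a Python dict. The helper name `missing_config_keys` is hypothetical, and the actual config schema in examples/config/ may include more fields than shown.

```python
REQUIRED_KEYS = ("data_path", "ops_results_path")


def missing_config_keys(config):
    """Return the required keys that are absent or empty in a parsed config dict."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Example of what the parsed YAML might look like (values are placeholders).
config = {
    "data_path": "examples/data/",
    "ops_results_path": "examples/ops_results/",
}
print(missing_config_keys(config))  # []
```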
- Option A - Apply pre-defined LFs: check labeling_functions.py. EcoDatum provides a few pre-defined LFs, and you may want to try different parameter combinations.
- Option B - Apply your own LFs: open labeling_functions.py and add any LFs you like that reference the operators you ran inference with in step 1.
Make sure each LF is added to the 'lfs' list in ensemble.py.
Tip: Keep LF names self‑explanatory; this helps when reading the LFAnalysis summary.
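As a rough illustration of what a threshold-based LF could look like, the sketch below builds one from an operator score. It is not copied from labeling_functions.py: the `clip_score` field, the thresholds, the `make_threshold_lf` helper, and the label values (1 for good, 0 for bad, -1 for abstain) are assumptions, and the real LFs may use a weak supervision library's decorator and a different signature.

```python
ABSTAIN, BAD, GOOD = -1, 0, 1  # assumed label convention


def make_threshold_lf(field, low, high):
    """Build an LF that votes on one operator score and abstains between thresholds."""
    def lf(sample):
        score = sample.get(field)
        if score is None:
            return ABSTAIN
        if score >= high:
            return GOOD
        if score <= low:
            return BAD
        return ABSTAIN

    # Self-explanatory name, per the tip above.
    lf.__name__ = f"lf_{field}_{low}_{high}"
    return lf


lf_clip = make_threshold_lf("clip_score", low=0.2, high=0.3)
votes = [lf_clip({"clip_score": s}) for s in (0.35, 0.25, 0.1)]
print(lf_clip.__name__, votes)  # lf_clip_score_0.2_0.3 [1, -1, 0]
```

Generating LFs from a factory like this makes it easy to try several parameter combinations for the same operator.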
python ensemble.py --config examples/config/your_config.yaml
Output:
examples/output/curated_dataset.jsonl
A curated dataset (JSONL) is saved to examples/output/.
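To sanity-check the result, a small script can load the curated JSONL file and report how many records it holds and which fields they contain. This is a generic sketch using only the standard library. The helper names `load_jsonl` and `summarize` are hypothetical, and no particular record schema is assumed.

```python
import json


def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def summarize(records):
    """Return the record count and the sorted set of keys seen across records."""
    keys = sorted({k for r in records for k in r})
    return len(records), keys
```

For example, `summarize(load_jsonl("examples/output/curated_dataset.jsonl"))` returns the number of curated records together with their field names.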
EcoDatum has been evaluated on the DataComp leaderboard, achieving an average performance score of 0.182 across 38 diverse evaluation datasets. This represents a 28% improvement over the DataComp baseline method, demonstrating its effectiveness in improving dataset curation and model training efficiency.
We welcome contributions to EcoDatum! If you'd like to contribute, please fork the repository and use a feature branch. Pull requests are warmly welcome.
This project is licensed under the MIT License. See the LICENSE file for details.
For more information, please refer to our paper.