KL3M Training Data

Collection and Preprocessing of Training Data for KL3M

Description

This ALEA project contains the complete source code to collect and preprocess all training data related to the KL3M embedding and generative models. The KL3M Data Project provides a comprehensive, copyright-clean dataset for training large language models, addressing legal risks in AI data collection.

Key Features

Over 132 million documents spanning trillions of tokens
Verifiably public domain or appropriately licensed sources
Complete source code for document acquisition and processing
Multi-stage data access with original formats, extracted content, and pre-tokenized representations
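
To make the multi-stage access model concrete, here is a minimal sketch of how one document could be addressed at each of the three stages named above. The `Stage` enum, the `DocumentRef` class, the `doc_id` value, and the `stage/source/doc_id` key layout are all hypothetical names chosen for illustration; they are not the project's actual storage scheme.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    """The three access stages named in the feature list."""
    ORIGINAL = "original"    # document in the format it was retrieved in
    EXTRACTED = "extracted"  # content extracted from the original
    TOKENIZED = "tokenized"  # pre-tokenized representation


@dataclass(frozen=True)
class DocumentRef:
    source: str  # a source identifier such as "uk/legislation"
    doc_id: str  # hypothetical per-document identifier

    def key(self, stage: Stage) -> str:
        """Build a hypothetical storage key for this document at one stage."""
        return f"{stage.value}/{self.source}/{self.doc_id}"


# One document, addressed at every stage (illustrative values only).
ref = DocumentRef(source="uk/legislation", doc_id="example-0001")
keys = {stage.name: ref.key(stage) for stage in Stage}
```

The point of the sketch is that the same document identity carries across all stages, so a consumer can pick the stage that matches its needs without re-running acquisition or extraction.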

Paper

The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models

Dataset

Hugging Face Dataset: kl3m-data-snapshot-20250324

Citation

@misc{bommarito2025kl3mdata,
  title={The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models},
  author={Bommarito II, Michael J. and Bommarito, Jillian and Katz, Daniel Martin},
  year={2025},
  eprint={2504.07854},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Primary Sources

Summary

TODO: Table

US

EU ("Federal")

eu/eurlex_oj: EU Official Journal via Cellar/Europa

UK

uk/legislation: All enacted UK legislation via legislation.gov.uk bulk download

Germany

de/bundesgesetzblatt: Bundesgesetzblatt (BGBl) 2023- from recht.bund.de

Australia

Canada

India

Tasks

Extraction

Summarization

Transform and Convert

Installation

```
# Clone the repository
git clone https://github.com/alea-institute/kl3m-data.git
cd kl3m-data

# Install dependencies using Poetry
poetry install
```
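
After installing, a quick way to confirm that the environment is usable is to check whether the package can be located. The sketch below assumes the installed package is importable as `kl3m_data`, the name of the package directory in the repository; the helper `is_importable` is a hypothetical name introduced here.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Assumption: the project installs a top-level package named `kl3m_data`.
print("kl3m_data importable:", is_importable("kl3m_data"))
```

Saved as a script, this can be run inside the Poetry environment with `poetry run python <script>`.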

Usage

Accessing the Dataset

The KL3M dataset is available through multiple channels:

Hugging Face:

```python
from datasets import load_dataset
dataset = load_dataset("alea-institute/kl3m-data-snapshot-20250324")
```

S3 Bucket:

```
aws s3 ls s3://data.kl3m.ai/
```

Project Website: Visit https://gallery.kl3m.ai/ for more information.
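
Because the snapshot is large, it may be more practical to stream records than to download everything first. The following is a minimal sketch using the `streaming=True` option of `load_dataset`. It assumes the snapshot exposes a split named `train`, which is not confirmed here; `peek` and `stream_snapshot` are hypothetical helper names.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List

# Dataset ID taken from the Hugging Face example above.
DATASET_ID = "alea-institute/kl3m-data-snapshot-20250324"


def peek(records: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Return at most the first n records from any iterable of records."""
    return list(islice(records, n))


def stream_snapshot(split: str = "train"):
    """Open the snapshot lazily; the split name 'train' is an assumption."""
    # Imported here so the helpers above work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split, streaming=True)


# Example usage (requires network access and the `datasets` package):
# for record in peek(stream_snapshot()):
#     print(sorted(record.keys()))
```

Printing the keys of a few records is a low-cost way to discover the actual field names before writing any processing code against them.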

License

The source code for this ALEA project is released under the MIT License. See the LICENSE file for details.

Top-level dependencies are all licensed under MIT, BSD-3, or Apache 2.0. See poetry show --tree for details.

Support

If you encounter any issues or have questions about using this ALEA project, please open an issue on GitHub.

Learn More

To learn more about ALEA and our KL3M models and data, visit the ALEA website.
