This ALEA project contains the complete source code to collect and preprocess all training data related to the KL3M embedding and generative models. The KL3M Data Project provides a comprehensive, copyright-clean dataset for training large language models, addressing legal risks in AI data collection.
- Over 132 million documents spanning trillions of tokens
- Verifiably public domain or appropriately licensed sources
- Complete source code for document acquisition and processing
- Multi-stage data access with original formats, extracted content, and pre-tokenized representations
The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models
Hugging Face Dataset: kl3m-data-snapshot-20250324
@misc{bommarito2025kl3mdata,
title={The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models},
author={Bommarito II, Michael J. and Bommarito, Jillian and Katz, Daniel Martin},
year={2025},
eprint={2504.07854},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
TODO: Table
- us/dockets: PACER/RECAP docket sheets via archive.org
- us/dotgov: filtered .gov TLD domains via direct retrieval
- us/ecfr: Electronic Code of Federal Regulations (eCFR) via NARA/GPO API
- us/edgar: SEC EDGAR data via SEC feed
- us/fdlp: US Federal Depository Library Program (FDLP) via GPO
- us/fr: Federal Register data via NARA/GPO API
- us/govinfo: US Government Publishing Office (GPO) data via GovInfo API
- us/recap: RECAP raw documents via S3
- us/recap_docs: RECAP attached docs (Word, WordPerfect, PDF, MP3) via S3
- us/reg_docs: Documents associated with regulations.gov dockets via regulations.gov API
- us/usc: US Code releases via Office of the Law Revision Counsel (OLRC)
- us/uspto_patents: USPTO patent grants via USPTO bulk data
- eu/eurlex_oj: EU Official Journal via Cellar/Europa
- uk/legislation: All enacted UK legislation via legislation.gov.uk bulk download
- de/bundesgesetzblatt: Bundesgesetzblatt (BGBl) 2023- from recht.bund.de
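The identifiers above follow a jurisdiction/name pattern. As a minimal sketch, they can be grouped by their jurisdiction prefix. The helper name `group_by_jurisdiction` is hypothetical and is not part of the kl3m-data source code; the identifiers themselves are copied from the list above.

```python
from collections import defaultdict

# Dataset identifiers copied verbatim from the list above.
DATASET_IDS = [
    "us/dockets", "us/dotgov", "us/ecfr", "us/edgar", "us/fdlp", "us/fr",
    "us/govinfo", "us/recap", "us/recap_docs", "us/reg_docs", "us/usc",
    "us/uspto_patents", "eu/eurlex_oj", "uk/legislation",
    "de/bundesgesetzblatt",
]


def group_by_jurisdiction(dataset_ids):
    """Group 'jurisdiction/name' identifiers by jurisdiction prefix.

    Hypothetical helper for illustration only.
    """
    groups = defaultdict(list)
    for dataset_id in dataset_ids:
        # partition splits on the first "/" only.
        jurisdiction, _, name = dataset_id.partition("/")
        groups[jurisdiction].append(name)
    return dict(groups)


if __name__ == "__main__":
    grouped = group_by_jurisdiction(DATASET_IDS)
    for jurisdiction in sorted(grouped):
        print(f"{jurisdiction}: {len(grouped[jurisdiction])} dataset(s)")
```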
# Clone the repository
git clone https://github.com/alea-institute/kl3m-data.git
cd kl3m-data
# Install dependencies using Poetry
poetry install
The KL3M dataset is available through multiple channels:
- Hugging Face:
    from datasets import load_dataset
    dataset = load_dataset("alea-institute/kl3m-data-snapshot-20250324")
- S3 Bucket:
    aws s3 ls s3://data.kl3m.ai/
- Project Website: Visit https://gallery.kl3m.ai/ for more information.
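Because the snapshot is large, it may be preferable to stream records rather than download everything first. The following is a minimal sketch, not an official usage pattern: it assumes the Hugging Face `datasets` package is installed and that the snapshot exposes a `train` split, which this README does not confirm. The helper names `peek` and `stream_snapshot` are hypothetical.

```python
from itertools import islice

# Dataset identifier taken from the Hugging Face snippet above.
SNAPSHOT_ID = "alea-institute/kl3m-data-snapshot-20250324"


def peek(records, n=3):
    """Return the first n items of any iterable as a list."""
    return list(islice(records, n))


def stream_snapshot(split="train"):
    """Open the snapshot in streaming mode.

    Assumption: a split named "train" exists; adjust if it does not.
    """
    # Imported here so that peek() is usable without the package installed.
    from datasets import load_dataset

    return load_dataset(SNAPSHOT_ID, split=split, streaming=True)


# Example (requires network access and the `datasets` package):
#
#     for record in peek(stream_snapshot(), n=3):
#         print(sorted(record.keys()))
```

Printing only the keys of the first few records is a cheap way to inspect the schema before committing to a full download.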
The source code for this ALEA project is released under the MIT License. See the LICENSE file for details.
Top-level dependencies are all licensed under MIT, BSD-3, or Apache 2.0. See poetry show --tree for details.
If you encounter any issues or have questions about using this ALEA project, please open an issue on GitHub.
To learn more about ALEA and our KL3M models and data, visit the ALEA website.