SwallowCode and SwallowMath are two high-quality, openly licensed datasets designed to enhance the performance of large language models (LLMs) in program synthesis and mathematical reasoning. Derived from public corpora and released under the Llama 3.3 Community License, these datasets address the limitations of existing pre-training corpora by applying rigorous filtering and LLM-driven rewriting to eliminate noise and improve educational value.
- SwallowCode (~16.1 billion tokens): A Python code dataset refined from The-Stack-v2-train-smol-ids through a four-stage pipeline: language filtering, syntax validation, pylint-based style filtering, and two-stage LLM rewriting (Style-Guided Code Rewriting, SGCR, followed by Self-Contained Optimization Rewriting, SCOR). It delivers self-contained, algorithmically efficient code snippets.
- SwallowMath (~2.3 billion tokens): A mathematical reasoning dataset derived from FineMath-4+ via LLM rewriting to remove boilerplate, restore missing context, and reformat solutions into concise, step-by-step explanations.
Datasets:
- SwallowCode: 🤗 https://huggingface.co/datasets/tokyotech-llm/swallow-code
- SwallowMath: 🤗 https://huggingface.co/datasets/tokyotech-llm/swallow-math
Paper: Rewriting Pre-Training Data Boosts LLM Performance in Math and Code
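Both datasets can be loaded directly with the 🤗 `datasets` library. The sketch below streams SwallowCode so the full ~16.1B-token dataset is not downloaded up front; the `train` split name is an assumption, so check the dataset page if loading fails.

```python
from datasets import load_dataset

# Stream SwallowCode instead of materializing ~16.1B tokens on disk;
# the "train" split name is an assumption.
code_ds = load_dataset(
    "tokyotech-llm/swallow-code", split="train", streaming=True
)
for sample in code_ds.take(1):
    print(sample)
```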
The performance of LLMs in specialized domains, such as coding and mathematics, is constrained by the quality of the pre-training data. SwallowCode and SwallowMath address this by transforming raw corpora into high-quality, curated datasets through advanced filtering and rewriting techniques. Our experiments demonstrate significant performance gains:
- SwallowCode: Improves pass@1 by +17.0 on HumanEval and +17.7 on HumanEval+ compared to Stack-Edu in a 50B-token continual pre-training of Llama-3.1-8B.
- SwallowMath: Boosts accuracy by +12.4 on GSM8K and +7.6 on MATH compared to FineMath-4+ in a similar setting.
SwallowCode refines Python code from The-Stack-v2 through a four-stage pipeline, reducing the dataset from 41M to 24.1M samples while enhancing quality:
- Programming Language Filter: Selects Python code exclusively to ensure consistency and facilitate automated evaluation.
- Python Syntax Error Filter: Uses Python's `compile()` function to remove invalid code, reducing samples by 9.7% (from 41M to 37M).
- Linter Filter: Applies pylint with a score threshold of 7.0 and a custom comment penalty heuristic, further reducing samples by 34.3% (to 24.1M). A minimal sketch of these two filters appears below.
- LLM Rewriting:
- Style-Guided Code Rewriting (SGCR): Enforces Google Python Style Guide criteria using Llama-3.3-70B-Instruct, improving readability and consistency.
- Self-Contained Optimization Rewriting (SCOR): Ensures self-containment, optimizes algorithms, and transforms trivial snippets into educational examples.
Caption: "Four-stage pipeline for SwallowCode: language filtering, syntax validation, linter filtering, and two-stage LLM rewriting (SGCR and SCOR)."
The job scripts for the SwallowCode pipeline are located in the scripts/code/ directory.
- `scripts/code/filter.sh`: syntax error checking, pylint scoring, and comment language detection.
  - The output JSONL's "analysis_results" field contains the result of the Python `compile()` check.
  - The output JSONL's "pylint_score" field contains the pylint score.
  - The output JSONL's "language_type" field contains the result of the comment language detection.
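For example, the filter output can be inspected as follows. The file name is a placeholder and the field names match the list above, but the exact value formats are best verified against a real output file.

```python
import json

# "filter_output.jsonl" is a placeholder path for a filter.sh output file.
with open("filter_output.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        print(
            record["analysis_results"],  # compile() check result
            record["pylint_score"],      # pylint score
            record["language_type"],     # comment-language detection result
        )
```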
- `scripts/code/sgcr_python.sh`: LLM rewriting using Llama-3.3-70B-Instruct for SGCR.
  - The output JSONL's "improved_code" field contains the SGCR result.
- `scripts/code/scor_python.sh`: LLM rewriting using Llama-3.3-70B-Instruct for SCOR.
  - The output JSONL's "text" field contains the SCOR result.
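Both rewriting stages query Llama-3.3-70B-Instruct. The sketch below shows one way to issue such a request against an OpenAI-compatible inference server (e.g., vLLM); the endpoint, model name, and prompt are illustrative stand-ins, not the exact SGCR/SCOR prompts, which are provided in this repository.

```python
from openai import OpenAI

# Assumes an OpenAI-compatible server (e.g., vLLM) hosting
# Llama-3.3-70B-Instruct at this endpoint; adjust both as needed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Illustrative paraphrase only; the real SGCR/SCOR prompts live in this repo.
STYLE_PROMPT = (
    "Rewrite the following Python code to follow the Google Python Style "
    "Guide, improving naming, docstrings, and structure:\n\n{code}"
)


def rewrite(code: str) -> str:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct",
        messages=[{"role": "user", "content": STYLE_PROMPT.format(code=code)}],
        temperature=0.0,
    )
    return response.choices[0].message.content
```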
SwallowMath enhances FineMath-4+ through a tailored LLM rewriting pipeline using Llama-3.3-70B-Instruct:
- Boilerplate Removal: Eliminates web headers, footers, privacy notices, and metadata (e.g., timestamps).
- Context Restoration: Fills in missing information in incomplete questions or answers.
- Explanation Reformatting: Rewrites solutions into concise, step-by-step explanations for clarity and educational value.
Pipeline details, including prompts and scripts, are available in this repository.
The FineMath-4+ rewriting job script is located in the scripts/math/finemath-4+-rewrite-v1.sh file.
- `scripts/math/finemath-4+-rewrite-v1.sh`: LLM rewriting using Llama-3.3-70B-Instruct.
  - The output JSONL's "text" field contains the rewriting result.
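To spot-check the rewritten explanations, SwallowMath can be streamed and its `text` field printed; as above, the `train` split name is an assumption.

```python
from datasets import load_dataset

# Stream a few rewritten samples; "text" holds the step-by-step explanation.
math_ds = load_dataset(
    "tokyotech-llm/swallow-math", split="train", streaming=True
)
for sample in math_ds.take(3):
    print(sample["text"][:300], "\n---")
```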
We conducted extensive ablation experiments to evaluate each pipeline stage, detailed in the paper.
- Model: Llama-3.1-8B, continually pre-trained for 50B tokens.
- Data Mix:
- SwallowCode: 16% code (8B tokens) + 84% multilingual text.
- SwallowMath: 4.79% math (2.4B tokens), 13% code, 82.2% text.
- Hardware: 64 NVIDIA H100 GPUs on the TSUBAME supercomputer.
- Software: Megatron-LM (core_r0.9.0), lm-evaluation-harness, BigCodeBench.
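Downstream scores such as GSM8K were measured with lm-evaluation-harness. A sketch of an equivalent evaluation call follows; the checkpoint path is a placeholder, and the `simple_evaluate` API assumes harness v0.4 or later.

```python
import lm_eval

# Sketch: evaluate a continually pre-trained checkpoint on GSM8K with
# lm-evaluation-harness (v0.4+); the checkpoint path is a placeholder.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=/path/to/llama-3.1-8b-swallow,dtype=bfloat16",
    tasks=["gsm8k"],
    batch_size=8,
)
print(results["results"]["gsm8k"])
```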
- SwallowCode: Ablation datasets (exp1–exp13) are available in the `ablation/` directory of the SwallowCode dataset. Experiment 11 (SCOR) achieves the highest performance (HumanEval: 0.5396, HumanEval+: 0.5445 at 50B tokens).
- SwallowMath: Experiment 2 (rewritten FineMath-4+) outperforms the baseline (GSM8K: +12.4, MATH: +7.6).
Caption: "FineMath-4+ rewriting: boilerplate removal, context restoration, and explanation reformatting."
Evaluation results and model checkpoints are available in the SwallowCode and SwallowMath collections.
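Individual ablation mixes can be pulled with `data_dir`; the exact subdirectory name under `ablation/` below is an assumption, so verify it against the SwallowCode dataset page.

```python
from datasets import load_dataset

# "ablation/exp11-scor" is an assumed directory name; check the SwallowCode
# dataset page for the real layout under ablation/ before use.
exp11 = load_dataset(
    "tokyotech-llm/swallow-code",
    data_dir="ablation/exp11-scor",
    split="train",
    streaming=True,
)
```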
Both datasets are released under the Llama 3.3 Community License. Usage is subject to:
- The-Stack-v2’s licensing terms for SwallowCode.
- CommonCrawl’s Terms of Use for both datasets.
@misc{fujii2025rewritingpretrainingdataboosts,
title={Rewriting Pre-Training Data Boosts LLM Performance in Math and Code},
author={Kazuki Fujii and Yukito Tajima and Sakae Mizuki and Hinari Shimada and Taihei Shiotani and Koshiro Saito and Masanari Ohi and Masaki Kawamura and Taishi Nakamura and Takumi Okamoto and Shigeki Ishida and Kakeru Hattori and Youmi Ma and Hiroya Takamura and Rio Yokota and Naoaki Okazaki},
year={2025},
eprint={2505.02881},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2505.02881},
}