A one-stop repository containing comprehensive instructions, scripts, and resources for continued pretraining and supervised fine-tuning (SFT) of large language models (LLMs), tailored to adapting models such as Qwen-2.5 and Llama-3.1 to the chemical domain.
This repository provides:
- Complete setup guides for continued pretraining using Megatron and Nanotron.
- Conversion scripts for seamless transitions between Hugging Face and Megatron checkpoints.
- Preprocessing pipelines for chemical data integration.
- Resources for generating synthetic chemical data.
- A modified Megatron pipeline from Alibaba: Pai-Megatron-Patch.
- Pretraining data: `/work/liac/pretrain_data`
- SFT data: `/work/liac/sft_data`
Use `convert.slurm`:
Modify the following variables at the top of the script:
- Environment modules (`dev_env`, `torch_env`)
- Toolkit path
- `PYTHONPATH` to `Pai-Megatron-Patch` and `PAI-Megatron-LM-240718`
- Tensor Parallelism (`TP`) and Pipeline Parallelism (`PP`). Use:
  - `TP = 2` for models ≤ 3B
  - `PP = 1` (default unless pipeline parallelism is required)
- Node and GPU count (`nnodes`, `GPUS_PER_NODE`)
- Source path (HF checkpoints)
- Target path (Megatron checkpoints)
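For reference, the variable block at the top of `convert.slurm` looks roughly like the sketch below. Apart from `TP`, `PP`, `nnodes`, and `GPUS_PER_NODE`, the variable names and all paths are illustrative, so match them against the actual script:

```bash
# Illustrative sketch only -- names other than TP/PP/nnodes/GPUS_PER_NODE
# and all paths are placeholders; check convert.slurm for the exact names.
module load dev_env torch_env                    # environment modules

TOOLKIT_PATH=/path/to/Pai-Megatron-Patch/toolkits
export PYTHONPATH=/path/to/Pai-Megatron-Patch:/path/to/PAI-Megatron-LM-240718:${PYTHONPATH}

TP=2                                             # tensor parallelism: 2 for models <= 3B
PP=1                                             # pipeline parallelism: 1 unless required
nnodes=1
GPUS_PER_NODE=2

SOURCE_CKPT_PATH=/path/to/hf_checkpoint          # source: Hugging Face checkpoint
TARGET_CKPT_PATH=/path/to/megatron_checkpoint    # target: Megatron checkpoint
```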
Launch the script:

```bash
sbatch convert.slurm
```
Use `process_data.slurm`:
- Modify paths, sequence length, padding, and batch size as needed.
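The typical knobs at the top of `process_data.slurm` look roughly like this (names and values are placeholders; use the ones from the actual script):

```bash
# Placeholder names and values -- edit to match process_data.slurm.
INPUT_PATH=/work/liac/pretrain_data   # raw pretraining data
OUTPUT_PATH=/path/to/preprocessed     # tokenized / packed output
SEQ_LEN=4096                          # sequence length (assumed value)
PADDING=true                          # pad sequences to SEQ_LEN
BATCH_SIZE=8                          # preprocessing batch size (assumed value)
```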
Launch preprocessing:

```bash
sbatch process_data.slurm
```
Pretraining with `pretrain.slurm`:
- Adjust `number_of_steps`, `seq_len`, and the GPU configuration.
Calculate the total number of steps:

```bash
# Example:
# global_batch_size * batch_accumulation_steps * dp = 8 * 4 * 1 = 32
# full_batch_size = seq_len * global_batch_size * batch_accumulation_steps * dp
# steps_per_epoch = total_tokens / full_batch_size
```
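A worked version of this calculation as a small shell snippet; `seq_len` and `total_tokens` below are placeholder values, so substitute the numbers for your own run:

```bash
# Worked example of the step calculation; seq_len and total_tokens are assumptions.
seq_len=4096
global_batch_size=8
batch_accumulation_steps=4
dp=1
total_tokens=1000000000   # e.g. a 1B-token corpus

full_batch_size=$(( seq_len * global_batch_size * batch_accumulation_steps * dp ))
steps_per_epoch=$(( total_tokens / full_batch_size ))
echo "full_batch_size = ${full_batch_size} tokens"   # 131072
echo "steps_per_epoch = ${steps_per_epoch}"          # 7629
```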
Launch pretraining:

```bash
sbatch pretrain.slurm
```
Convert Megatron checkpoints back to Hugging Face format using `convert_back.slurm`. Note that for SFT, the HF checkpoint must be converted back to Megatron format with `prepare_sft.slurm` before launching SFT with `run_sft.sbatch`.
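In command form, the post-pretraining sequence looks like this (wait for each job to finish before submitting the next, or chain them with Slurm's `--dependency=afterok:<jobid>`):

```bash
sbatch convert_back.slurm   # Megatron checkpoint -> Hugging Face format
sbatch prepare_sft.slurm    # Hugging Face checkpoint -> Megatron format (SFT-ready)
sbatch run_sft.sbatch       # launch supervised fine-tuning
```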
Diagnostic logs will appear in the output files (currently no W&B integration).
Minimal modifications are required for Megatron-based pretraining. Refer to:
- Llama checkpoint conversion: `qwen_pretrain/kuma/Pai-Megatron-Patch/toolkits/model_checkpoints_convertor/llama`
The Nanotron pretraining setup (CSCS-specific) is available here:
Scripts to integrate chemical structures (SMILES) into textual datasets:
- Extract chemical entities:
  - Run `chem_extract.py` (requires `chemdataextractor2`)
  - Outputs a pickle with the entities' positions
- Interleave SMILES into the text:
  - Run `fineweb_smiles.py` (requires `empty4.pkl` and `string2smiles3.pkl` for entity handling)
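A minimal sketch of the two-step pipeline; it assumes both scripts read their input and output paths from constants defined inside the files, so adjust those before running:

```bash
# Step 1: entity extraction (requires chemdataextractor2);
#         writes a pickle containing the positions of the chemical entities.
python chem_extract.py

# Step 2: interleave SMILES into the text;
#         reads the entity pickle plus empty4.pkl and string2smiles3.pkl.
python fineweb_smiles.py
```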
Use SMILESbench to generate synthetic chemical data (previously used for benchmarking and exam purposes).
Feel free to open issues, submit PRs, or contact maintainers for any questions or suggestions.