Just Go Parallel: Improving the Multilingual Capabilities of Large Language Models

This repository provides the code to systematically investigate the impact of adding parallel data on LLMs' multilingual capabilities, as reported in the following publication:

Just Go Parallel: Improving the Multilingual Capabilities of Large Language Models
Muhammad Reza Qorib, Junyi Li, and Hwee Tou Ng
The 63rd Annual Meeting of the Association for Computational Linguistics (to appear)

The codebase is built upon TinyLlama.

Model

No Parallel: nusnlp/JGP-No-Parallel
Multilingual: nusnlp/JGP-Multilingual
Parallel Non-Adjacent: nusnlp/JGP-Parallel-Non-Adjacent
Parallel First: nusnlp/JGP-Parallel-First
Parallel Distributed: nusnlp/JGP-Parallel-Distributed
Parallel Last (all): nusnlp/JGP-Parallel-Last-all
Parallel Last (uni):
- EN→ID: nusnlp/JGP-Parallel-Last-EN-ID
- ID→EN: nusnlp/JGP-Parallel-Last-ID-EN
- EN→ZH: nusnlp/JGP-Parallel-Last-EN-ZH
- ZH→EN: nusnlp/JGP-Parallel-Last-ZH-EN
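
As a rough sketch of how one of the checkpoints listed above might be loaded: the snippet below assumes the identifiers are Hugging Face Hub repository IDs and that the checkpoints are stored in a format the `transformers` Auto classes can read. Neither point is confirmed in this README. The names `JGP_MODELS` and `load_jgp_model` are hypothetical helpers, not part of this codebase.

```python
# Repository IDs copied from the model list above.
JGP_MODELS = {
    "No Parallel": "nusnlp/JGP-No-Parallel",
    "Multilingual": "nusnlp/JGP-Multilingual",
    "Parallel Non-Adjacent": "nusnlp/JGP-Parallel-Non-Adjacent",
    "Parallel First": "nusnlp/JGP-Parallel-First",
    "Parallel Distributed": "nusnlp/JGP-Parallel-Distributed",
    "Parallel Last (all)": "nusnlp/JGP-Parallel-Last-all",
    "Parallel Last (EN→ID)": "nusnlp/JGP-Parallel-Last-EN-ID",
    "Parallel Last (ID→EN)": "nusnlp/JGP-Parallel-Last-ID-EN",
    "Parallel Last (EN→ZH)": "nusnlp/JGP-Parallel-Last-EN-ZH",
    "Parallel Last (ZH→EN)": "nusnlp/JGP-Parallel-Last-ZH-EN",
}


def load_jgp_model(name):
    """Return (tokenizer, model) for one of the experiments above.

    Assumption: the checkpoint is hosted on the Hugging Face Hub in a
    transformers-compatible format. This is not stated in the README.
    """
    # Imported lazily so the mapping is usable without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo_id = JGP_MODELS[name]
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    return tokenizer, model
```

Calling `load_jgp_model("Parallel First")` would then download the weights on first use, provided the assumptions above hold.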

Training Data

Experiment	Datasets
No Parallel	nusnlp/JGP-SlimPajama
Multilingual	nusnlp/JGP-SlimPajama + nusnlp/JGP-Multilingual
Parallel Non-Adjacent	nusnlp/JGP-SlimPajama + nusnlp/JGP-Parallel-Non-Adjacent
Parallel First, Parallel Distributed, Parallel Last (all)	nusnlp/JGP-SlimPajama + nusnlp/JGP-Parallel
Parallel Last (uni): EN→ID	nusnlp/JGP-SlimPajama + nusnlp/JGP-Parallel-EN-ID
Parallel Last (uni): ID→EN	nusnlp/JGP-SlimPajama + nusnlp/JGP-Parallel-ID-EN
Parallel Last (uni): EN→ZH	nusnlp/JGP-SlimPajama + nusnlp/JGP-Parallel-EN-ZH
Parallel Last (uni): ZH→EN	nusnlp/JGP-SlimPajama + nusnlp/JGP-Parallel-ZH-EN
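
The table above can also be written down as a small lookup, which may be convenient when scripting data downloads. This is only a sketch: it assumes the dataset names are Hugging Face Hub dataset repositories, and `EXPERIMENT_DATASETS`, `datasets_for`, and `download_datasets` are hypothetical names that do not exist in this codebase. For the actual data preparation steps, PRETRAIN.md remains the reference.

```python
BASE = "nusnlp/JGP-SlimPajama"  # used by every experiment in the table
PARALLEL = "nusnlp/JGP-Parallel"

# Experiment name -> datasets, transcribed from the table above.
EXPERIMENT_DATASETS = {
    "No Parallel": [BASE],
    "Multilingual": [BASE, "nusnlp/JGP-Multilingual"],
    "Parallel Non-Adjacent": [BASE, "nusnlp/JGP-Parallel-Non-Adjacent"],
    "Parallel First": [BASE, PARALLEL],
    "Parallel Distributed": [BASE, PARALLEL],
    "Parallel Last (all)": [BASE, PARALLEL],
    "Parallel Last (EN→ID)": [BASE, "nusnlp/JGP-Parallel-EN-ID"],
    "Parallel Last (ID→EN)": [BASE, "nusnlp/JGP-Parallel-ID-EN"],
    "Parallel Last (EN→ZH)": [BASE, "nusnlp/JGP-Parallel-EN-ZH"],
    "Parallel Last (ZH→EN)": [BASE, "nusnlp/JGP-Parallel-ZH-EN"],
}


def datasets_for(experiment):
    """Return a copy of the dataset list for the given experiment."""
    return list(EXPERIMENT_DATASETS[experiment])


def download_datasets(experiment):
    """Download every dataset for an experiment and return the local paths.

    Assumption: each name is a dataset repository on the Hugging Face Hub.
    """
    # Imported lazily so the lookup works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return [
        snapshot_download(repo_id=repo_id, repo_type="dataset")
        for repo_id in datasets_for(experiment)
    ]
```

For example, `datasets_for("Parallel Distributed")` returns the SlimPajama and parallel datasets without touching the network, while `download_datasets` fetches them.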

Installation

These instructions assume that CUDA 11.8 or later is installed.

Install PyTorch

Follow the official guidance to install the PyTorch version that matches your installed CUDA version.
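
Before building the extensions in the next steps, it can help to confirm that the installed PyTorch build targets a supported CUDA version. The following is a minimal sketch of such a check, not a script shipped with this repository; `cuda_is_supported` and `report_torch_cuda` are hypothetical helper names, and the 11.8 minimum is taken from the requirement stated above.

```python
def parse_version(text):
    """Turn a version string such as '11.8' into a tuple of ints (11, 8)."""
    return tuple(int(part) for part in text.split(".")[:2])


def cuda_is_supported(cuda_version, minimum="11.8"):
    """Return True if cuda_version is at least the minimum.

    cuda_version may be None, which is what torch.version.cuda
    reports for CPU-only PyTorch builds.
    """
    if cuda_version is None:
        return False
    return parse_version(cuda_version) >= parse_version(minimum)


def report_torch_cuda():
    """Print the CUDA version PyTorch was built with and whether a GPU is visible."""
    import torch  # imported lazily; requires PyTorch to be installed

    print("PyTorch CUDA build:", torch.version.cuda)
    print("GPU visible:", torch.cuda.is_available())
    print("Meets CUDA>=11.8:", cuda_is_supported(torch.version.cuda))
```

Running `report_torch_cuda()` in the target environment prints the three checks. Comparing tuples rather than strings keeps versions such as 11.10 ordered correctly.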

Install XFormers

You can install the pre-built version or build from source as shown below:

pip uninstall ninja -y && pip install ninja -U
pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers

Install Flash-Attention 2 and other fused operators:

git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention
python setup.py install
cd csrc/rotary && pip install .
cd ../layer_norm && pip install .
cd ../xentropy && pip install .
cd ../../.. && rm -rf flash-attention

Install Remaining Dependencies

Install the remaining dependencies:

pip install -r requirements.txt tokenizers sentencepiece

Building XFormers and Flash-Attention can take 5 minutes or longer. This is expected, even if the process appears to stall or the terminal prints many warnings.

Then you are ready to go 🎉!

Pretrain

Please refer to PRETRAIN.md for instructions on reproducing the pretraining of our models.

Evaluation

Please use ALMA to evaluate translation performance and LM-Evaluation-Harness to evaluate common-sense reasoning.

License

This repository is licensed under the Apache-2.0 license.

Acknowledgements

This repository is built upon TinyLlama, which was built upon lit-gpt and flash-attention.

@misc{zhang2024tinyllama,
  title         = {TinyLlama: An Open-Source Small Language Model},
  author        = {Peiyuan Zhang and Guangtao Zeng and Tianduo Wang and Wei Lu},
  year          = {2024},
  eprint        = {2401.02385},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}
@online{lit-gpt,
  author = {Lightning AI},
  title  = {Lit-GPT},
  url    = {https://github.com/Lightning-AI/lit-gpt},
  year   = {2023}
}
@article{dao2023flashattention2,
  title  = {Flash{A}ttention-2: Faster Attention with Better Parallelism and Work Partitioning},
  author = {Dao, Tri},
  year   = {2023}
}
