Language Matters: How Do Multilingual Input and Reasoning Paths Affect Large Reasoning Models?

In which language do these models reason when solving problems presented in different languages? Our findings reveal that, despite multilingual training, LRMs tend to default to reasoning in high-resource languages (e.g., English) at test time.

Paper: arXiv:2505.17407

Setup environment

pip install -r requirements.txt
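
Optionally, install into an isolated virtual environment first (a minimal sketch; the repo does not pin a specific Python version):

python -m venv .venv              # create an isolated environment
source .venv/bin/activate         # activate it (Linux/macOS)
pip install -r requirements.txt   # install the repo's dependencies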

Setup Dataset

You can unzip all of the .zip files to access the datasets.

For the segmentation model, unzip:

unzip segmentation_train_subset.zip
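
To unpack every archive at once, a one-liner such as the following should work (assuming all .zip files sit in the repository root):

for f in *.zip; do unzip -o "$f"; done   # -o overwrites existing files without prompting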

Recreate experiments

CulturalBench:

To force reasoning in the local language, pass --thinking_prefill multi; replace multi with "Okay" to force English reasoning instead:

python run_culture_bench.py --output_dir ./log --series together \
    --model Qwen/QwQ-32B \
    --lang en \
    --thinking_prefill multi

To force an English prefill:

python run_culture_bench.py --output_dir ./log --series together \
    --model Qwen/QwQ-32B \
    --lang en \
    --thinking_prefill "Okay"

MMLU:

Switch --lang among en, sw, es, ja, ko, and zh-CN to vary the input language, and change --thinking_prefill to an opening token in the desired reasoning language (here "Ili kup", a Swahili opener):

python run_mmlu_multilingual.py --output_dir ./log \
    --series together \
    --model Qwen/QwQ-32B \
    --lang en \
    --thinking_prefill "Ili kup"

MATH-500:

python run_math_multilingual.py --output_dir ./log --series together \
    --model Qwen/QwQ-32B \
    --lang ja \
    --thinking_prefill "まず"

LMsys-toxic:

python run_toxic_gen.py --output_dir ./log --series together \
    --model Qwen/QwQ-32B \
    --lang zh \
    --thinking_prefill Okay

Once inference has finished, extract the answers; the extracted answers are used to judge whether each response is correct. For example:

python extract_answer_mmlu.py --input_jsonl log/mmlu/sw/DeepSeek-R1-Distill-Qwen-14B.jsonl
python extract_toxic_gen.py --input_jsonl log/toxic_bench/zh/together__QwQ-32B__thinking_prefill-嗯.jsonl
python extract_answer_math_async.py --use_last_line --input_jsonl log/MATH-500-8192/en/QwQ-32B__thinking_prefill-Primero.jsonl
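
To extract every MMLU log at once, a glob loop like this should work (assuming the directory layout shown above):

for f in log/mmlu/*/*.jsonl; do
    python extract_answer_mmlu.py --input_jsonl "$f"
done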

Note: you might need to provide an OPENAI_API_KEY via an environment variable: export OPENAI_API_KEY="xxx"

Once the answers have been processed into correct / incorrect labels, you can run the visualization code:

python get_behavior_result.py
python get_table_result.py
python viz_culture_plot.py
python viz_math_plot.py

Docker Setup

For local models (Qwen3-30B-A3B, DeepSeek-R1-Distill-Llama-8B, DeepSeek-R1-Distill-Qwen-14B), we serve them with sglang using the following setup:

docker run --gpus "device=0" \
    --shm-size 32g \
    -p 30002:30002 \
    -v $PWD_DIR:/root/.cache/huggingface \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B --host 0.0.0.0 --port 30002
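
Once the container is up, you can sanity-check the OpenAI-compatible endpoint (standard /v1/models route; SERVER_IP is a placeholder for your host):

curl http://SERVER_IP:30002/v1/models   # should list DeepSeek-R1-Distill-Llama-8B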

You can then point the experiment scripts at the local server via environment variables:

export CUSTOM_API_URL="http://SERVER_IP:30002/v1"
export CUSTOM_API_KEY="sk-XXX"
echo $CUSTOM_API_URL

The CUSTOM_API_KEY value doesn't matter; any placeholder works for the local server.

You can now run the experiments with --series openai:

python run_culture_bench.py --output_dir ./log --series openai \
    --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --lang en \
    --thinking_prefill multi

Citation

@article{Tam2025LanguageMH,
  title={Language Matters: How Do Multilingual Input and Reasoning Paths Affect Large Reasoning Models?},
  author={Zhi Rui Tam and Cheng-Kuang Wu and Yu Ying Chiu and Chieh-Yen Lin and Yun-Nung Chen and Hung-yi Lee},
  journal={arXiv preprint arXiv:2505.17407},
  year={2025}
}
