Guided by Gut (GG) is an efficient Test-Time Scaling (TTS) framework that combines:

- Reinforcement-learning fine-tuning (confidence-based GRPO), implemented in `openr1/`, and
- Self-guided tree search, implemented in `Search/`.
```text
GG
├── openr1/                      # RL fine-tuning (modified Open-R1 fork)
│   ├── src/
│   │   ├── open_r1/             # Core Open-R1 code (lightly patched)
│   │   │   └── rewards.py       # ➜ confidence_based_reward()
│   │   └── trl0.16/
│   │       └── grpo_trainer.py  # ➜ custom GRPO trainer
│   ├── merge_lora.py            # Utility to merge LoRA weights
│   └── ...                      # recipes/, accelerate configs, YAMLs
└── Search/
    ├── src/
    │   └── sal/search/diverse_verifier_tree_search_GG.py
    └── recipes/                 # YAML configs of search hyper-parameters
```
| Step | What you do | Where |
|---|---|---|
| 1️⃣ | Install deps (make sure TRL 0.16.0) | root |
| 2️⃣ | Copy the custom `grpo_trainer.py` over TRL's `grpo_trainer.py` | root → `site-packages/trl/...` |
| 3️⃣ | Fine-tune with GRPO + confidence reward | `openr1/` |
| 4️⃣ | Merge LoRA weights into the base checkpoint | `openr1/merge_lora.py` |
| 5️⃣ | Run GG search via DVTS recipes | `Search/` |
| 6️⃣ | Evaluate & score results | `Search/` (`eval.py`) |
Make sure your environment satisfies the software and hardware prerequisites listed in `GG/openr1/README.md`. That document covers CUDA, Python, and all Python-package versions (notably `trl == 0.16.0`) needed for confidence-based GRPO training.
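The TRL pin in particular matters, since the custom trainer below targets that exact version:

```bash
pip install trl==0.16.0
```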
After installing TRL 0.16.0, overwrite its default GRPO trainer with the customised file at `GG/openr1/src/trl0.16/grpo_trainer.py`.

**Manual step:** copy the file above to `site-packages/trl/trainer/grpo_trainer.py`, replacing the original. Re-apply this patch any time you upgrade or reinstall TRL.
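One way to apply the patch, assuming TRL is importable from the active environment (a sketch run from the repository root; adjust paths for your setup):

```bash
# Find where TRL is installed, then overwrite its GRPO trainer with ours
TRL_DIR=$(python -c "import trl, os; print(os.path.dirname(trl.__file__))")
cp openr1/src/trl0.16/grpo_trainer.py "$TRL_DIR/trainer/grpo_trainer.py"
```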
```bash
# From the repository root …
cd openr1   # <-- step 1

# Two A100 80 GB cards; adjust CUDA_VISIBLE_DEVICES as needed
CUDA_VISIBLE_DEVICES=0,1 \
ACCELERATE_LOG_LEVEL=info \
accelerate launch \
  --config_file recipes/accelerate_configs/zero2.yaml \
  --num_processes 1 \
  src/open_r1/grpo.py \
  --config recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/LIMO_ConfReward.yaml
```
**Key points**

- `grpo.py` is patched to perform PEFT (LoRA) instead of full fine-tuning.
- Reward is `confidence_based_reward` (`open_r1/rewards.py`).
- Logs & checkpoints land under `outputs/` (path configurable in the YAML).
After training, bake the LoRA deltas into a standalone HF model:
```bash
python openr1/merge_lora.py \
  --base_model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
  --lora_path openr1/outputs/ckpt-lora \
  --output_dir openr1/outputs/merged-gg-1p5B
```

Edit the paths inside `merge_lora.py` or supply CLI args, then upload to the Hub if desired.
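Under the hood, a LoRA merge typically boils down to PEFT's `merge_and_unload`; a minimal sketch for reference (illustrative only, not the repo's `merge_lora.py`):

```python
# Illustrative LoRA merge via PEFT; merge_lora.py is the authoritative version.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
model = PeftModel.from_pretrained(base, "openr1/outputs/ckpt-lora")
model = model.merge_and_unload()  # fold the LoRA deltas into the base weights
model.save_pretrained("openr1/outputs/merged-gg-1p5B")

tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
tok.save_pretrained("openr1/outputs/merged-gg-1p5B")
```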
We provide the following models, fine-tuned using the "Guided by Gut" (GG) methodology with a Confidence-Based Reward via Group Relative Policy Optimization (GRPO) on the LIMO dataset:

- **AmirhoseinGH/DS-Qwen-1.5b-GG-CalibratedConfRL**
  - Base model: DeepSeek-R1 Qwen Distill 1.5B
  - Description: Fine-tuned to enhance its intrinsic confidence calibration, making it highly effective with the Guided-by-Gut search strategy.
  - Link: DS-Qwen-1.5b-GG-CalibratedConfRL
- **AmirhoseinGH/DS-Qwen-7b-GG-CalibratedConfRL**
  - Base model: DeepSeek-R1 Qwen Distill 7B
  - Description: Like its 1.5B counterpart, this 7B-parameter model has undergone confidence-calibration fine-tuning.
  - Link: DS-Qwen-7b-GG-CalibratedConfRL
To use these models with the search strategies below, set `model_path` in your YAML configuration file to the respective Hugging Face identifier (e.g., `AmirhoseinGH/DS-Qwen-1.5b-GG-CalibratedConfRL`).
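For example, as a one-line recipe excerpt (the full key list is documented below):

```yaml
model_path: AmirhoseinGH/DS-Qwen-1.5b-GG-CalibratedConfRL
```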
To run the test-time scaling code, first create a Python virtual environment, e.g. with Conda:

```bash
conda create -n sal python=3.11 && conda activate sal
cd Search
pip install -e '.[dev]'
```

Next, log in to your Hugging Face account:

```bash
huggingface-cli login
```

Finally, install Git LFS so that you can push models to the Hugging Face Hub:

```bash
sudo apt-get install git-lfs
```
You can now check out the `scripts` and `recipes` directories for instructions on how to scale test-time compute for open models!
All YAMLs live in `Search/recipes/`. Example (GG, DVTS, N = 32, AIME 2024):

```text
GG/Search/recipes/R1_Distll_Qwen1.5B-7B/GG/R1-1p5B__n32__aime_2024/config_approach_dvts4--dataset_name_HuggingFaceH4aime_2024--dataset_split_train--model_name__R1-1p5B--n_32--seed_10--temperature_0_6--temperature0_1_0.yaml
```
To run search with a specific setting, use the following command:

```bash
export CONFIG=recipes/path_to_option_file.yaml
python scripts/test_time_compute.py $CONFIG
```
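For instance, for the AIME 2024 recipe shown above (assuming the command is run from inside `Search/`):

```bash
export CONFIG=recipes/R1_Distll_Qwen1.5B-7B/GG/R1-1p5B__n32__aime_2024/config_approach_dvts4--dataset_name_HuggingFaceH4aime_2024--dataset_split_train--model_name__R1-1p5B--n_32--seed_10--temperature_0_6--temperature0_1_0.yaml
python scripts/test_time_compute.py $CONFIG
```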
| Key | Type | Meaning / Allowed values |
|---|---|---|
| `approach` | str | Which search algorithm to run. Use `dvts4` for Guided-by-Gut and `dvts` for normal DVTS search with a PRM; `best_of_n`, `cot`, etc. for baselines. |
| `use_prm` | bool | `false` → use GG's intrinsic confidence + novelty. `true` → score with an external PRM (give its HF path in `prm_path`). |
| `system_prompt` | str | System prompt. |
| `temperature0` | float | LLM temperature for the first 10 expansion steps (higher → more diverse roots). |
| `temperature` | float | Temperature after step 10. |
| `max_tokens` | int | Hard cap on tokens generated per path. |
| `n` | int | Compute budget: target number of final paths kept. |
| `beam_width` | int | Parallel beams explored at each tree depth. |
| `min_step_tokens` | int | Minimum length of a reasoning step. |
| `num_iterations` | int | Maximum search steps (tree expansions). |
| `novelty_bonus` | float | Weight of the novelty score in GG's intrinsic reward. |
| `model_path` | str | HF identifier or local path of the base model. |
| `search_batch_size` | int | How many partial paths to expand per forward pass. |
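Putting the keys together, a GG recipe might look like the following sketch (values are illustrative examples, not the shipped recipes; `temperature`/`temperature0`/`n` here match the AIME 2024 example above):

```yaml
# Illustrative GG search config; consult Search/recipes/ for the exact settings
approach: dvts4          # Guided-by-Gut
use_prm: false           # intrinsic confidence + novelty
model_path: AmirhoseinGH/DS-Qwen-1.5b-GG-CalibratedConfRL
system_prompt: "Solve the following problem step by step."
n: 32                    # compute budget: final paths kept
beam_width: 4
temperature0: 1.0        # first 10 expansion steps
temperature: 0.6         # afterwards
max_tokens: 8192
min_step_tokens: 64
num_iterations: 40
novelty_bonus: 0.1
search_batch_size: 8
```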
**Quick switch-over guide**

- **GG intrinsic signals:** `approach: dvts4` + `use_prm: false` (default).
- **PRM-guided search:** `use_prm: true` and set `prm_path:` to one of `RLHFlow/Llama3.1-8B-PRM-Deepseek-Data`, `peiyi9979/math-shepherd-mistral-7b-prm`, `Skywork/Skywork-o1-Open-PRM-Qwen-2.5-1.5B`, or `Skywork/Skywork-o1-Open-PRM-Qwen-2.5-7B`.
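For instance, switching a recipe from GG to PRM guidance touches just these keys (an illustrative excerpt):

```yaml
approach: dvts           # standard DVTS, scored by a PRM
use_prm: true
prm_path: RLHFlow/Llama3.1-8B-PRM-Deepseek-Data
```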
- **AIME / AMC:** Before running the script, open `Search/src/sal/search/eval.py` and set the `COMPLETIONS_PATH` (or equivalent variable) to the location of your JSONL file, then run it:

  ```bash
  # In eval.py, e.g.: INPUT_FILE = "results/R1-1p5B/aime24/seed0/dvts_n32.jsonl"
  python Search/src/sal/search/eval.py
  ```

- **MATH-500:** To get the final numbers for the evaluations, we use a fork of the Qwen2.5-Math evaluation repo. Please follow the installation and usage instructions in our fork to obtain accuracies on MATH-500.
If you find our work useful, please consider citing our paper in your research.
```bibtex
@misc{ghasemabadi2025guidedgutefficienttesttime,
      title={Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence},
      author={Amirhosein Ghasemabadi and Keith G. Mills and Baochun Li and Di Niu},
      year={2025},
      eprint={2505.20325},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.20325},
}
```
Huge thanks to the authors and maintainers of
- huggingface/search-and-learn – the foundation of our Guided-by-Gut tree-search code, and
- huggingface/open-r1 – the basis for our confidence-RL fine-tuning pipeline.
Their open-source work made this project possible.