This repository contains the code implementation for our research paper "Systematic Optimization of Open-Source Large Language Models for Mathematical Reasoning". We present a comprehensive framework for optimizing and evaluating five state-of-the-art open-source LLMs on mathematical reasoning tasks, with a focus on the GSM8K benchmark.
Our approach combines ReAct (Reasoning + Acting) methodology with adaptive planning, tool integration, and model-specific parameter optimization to achieve significant improvements in mathematical reasoning capabilities across all tested models.
- Multi-model Optimization: Systematic parameter optimization for 5 leading open-source LLMs (Qwen2.5-72B, Llama-3.1-70B, DeepSeek-V3, Mixtral-8x22B, Yi-Lightning)
- ReAct Framework Enhancement: Extended ReAct with adaptive planning intervals and dynamic plan revision
- Tool Integration: Intelligent tool selection and integration with cost-benefit analysis
- Efficiency Metrics: Novel cost-of-pass metric balancing accuracy and computational efficiency
- Reproducible Evaluation: Comprehensive evaluation methodology on the GSM8K benchmark
| Model | Size | Base Accuracy | Optimized Accuracy | Efficiency Gain |
|---|---|---|---|---|
| Qwen2.5-72B-Instruct | 72B | 84.2% | 88.7% | +14.6% |
| Llama-3.1-70B-Instruct | 70B | 82.1% | 86.3% | +12.8% |
| DeepSeek-V3 | 67B | 80.9% | 85.1% | +15.2% |
| Mixtral-8x22B-Instruct | 176B (8×22B) | 79.4% | 83.7% | +10.3% |
| Yi-Lightning | 34B | 76.2% | 81.5% | +18.4% |
Our framework consists of several key components:
The foundation of our system is a modular agent architecture with the following components:
- BaseAgent: Standard interface for all LLM agents with unified inference handling
- ModelManager: Efficient management of multiple models with shared resources
- ReActAgent: Implementation of Reasoning + Acting methodology for mathematical problem solving
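For orientation, the sketch below shows how these pieces could fit together. The class and method names are illustrative, not necessarily the exact API in `open_source_agents/agents/`.

```python
from abc import ABC, abstractmethod

class BaseAgent(ABC):
    """Unified inference interface shared by all agent implementations (hypothetical sketch)."""

    def __init__(self, model_key: str, model_manager, **gen_params):
        self.model_key = model_key
        self.model_manager = model_manager   # shared model/tokenizer resources
        self.gen_params = gen_params         # temperature, top_p, max_steps, ...

    @abstractmethod
    def solve(self, problem: str) -> dict:
        """Return at least {'final_answer': ..., 'reasoning': ...}."""

class ReActAgent(BaseAgent):
    """Alternates Thought -> Action -> Observation steps until an answer is reached."""

    def solve(self, problem: str) -> dict:
        ...  # interleave reasoning with tool calls via self.model_manager
```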
Our enhanced planning system provides:
- Task Decomposition: Breaking complex problems into manageable steps
- Adaptive Planning: Dynamic adjustment of planning intervals based on problem complexity
- Plan Revision: Real-time revision capabilities based on execution context
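To make the adaptive-interval idea concrete, here is a simplified, hypothetical heuristic (the actual logic in `planner.py` and `plan_revision.py` may differ): harder-looking problems are re-planned more often, and a failed tool call triggers an immediate revision.

```python
def adaptive_interval(problem: str, base_interval: int = 3) -> int:
    """Shorten the planning interval for problems that look more complex."""
    # Crude complexity proxy: clause count and problem length
    complexity = problem.count(",") + problem.count("?") + len(problem.split()) // 25
    if complexity >= 4:
        return max(1, base_interval - 2)   # re-plan almost every step
    if complexity >= 2:
        return max(1, base_interval - 1)
    return base_interval

def should_replan(step: int, interval: int, last_action_failed: bool) -> bool:
    # Revise the plan on a fixed cadence, or immediately after a failed tool call
    return step % interval == 0 or last_action_failed
```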
We integrate mathematical tools with:
- Calculator Tool: Advanced mathematical expression parsing and calculation
- Tool Selection: Intelligent selection based on query analysis and historical performance
- Efficiency Analysis: Comprehensive tracking of tool usage impact on performance
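As an illustration of the calculator tool, the sketch below evaluates arithmetic expressions by walking a restricted AST rather than calling `eval()`; the actual `tool_framework.py` likely supports a wider operator set and ties into tool selection and analytics.

```python
import ast
import operator

# Supported operators for the minimal calculator sketch
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
        ast.Div: operator.truediv, ast.Pow: operator.pow, ast.USub: operator.neg}

def calculate(expression: str) -> float:
    """Safely evaluate a basic arithmetic expression."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp):
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp):
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"Unsupported expression: {expression}")
    return _eval(ast.parse(expression, mode="eval"))

print(calculate("48 - 48 * 3/4"))  # 12.0
```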
Our parameter optimization system includes:
- Model-Specific Configurations: Tailored parameter spaces for each model
- Parameter Grid Generation: Advanced sampling strategies for efficient exploration
- Configuration Validation: Robust validation of parameter combinations
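For intuition, a minimal grid generator might look like the following; the function name and the example search space are illustrative, not the exact contents of `parameter_grid.py`.

```python
from itertools import product

def generate_grid(space):
    """Yield every combination from a dict mapping parameter name -> candidate values."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

# Hypothetical search space for one model
search_space = {
    "temperature": [0.1, 0.2, 0.3, 0.5],
    "top_p": [0.9, 0.95],
    "planning_interval": [2, 3, 4],
}

for config in generate_grid(search_space):
    ...  # evaluate the agent under this configuration and record its metrics
```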
Our evaluation methodology features:
- Comprehensive Metrics: Accuracy, cost-of-pass, token efficiency, and inference time
- GSM8K Benchmark: Standardized evaluation on mathematical reasoning tasks
- Comparative Analysis: Cross-model performance comparison with statistical significance
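Because the compared models answer the same GSM8K problems, significance can be assessed with a paired test. The sketch below uses a simple paired bootstrap; the field names and the choice of test are assumptions, not necessarily what `evaluation_system.py` implements.

```python
import random

def paired_bootstrap(correct_a, correct_b, n_resamples=10_000, seed=0):
    """correct_a / correct_b: per-problem 0/1 outcomes for models A and B on the same items."""
    rng = random.Random(seed)
    n = len(correct_a)
    observed = (sum(correct_a) - sum(correct_b)) / n
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]          # resample problems with replacement
        diff = sum(correct_a[i] - correct_b[i] for i in idx) / n
        if diff > 0:
            wins += 1
    p_value = 1 - wins / n_resamples   # one-sided: "A is better than B"
    return observed, p_value
```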
We measure success through several key metrics:
- Accuracy: Percentage of correctly solved mathematical problems
- Cost-of-Pass: Average token usage for correct answers (lower is better)
- Token Efficiency: Number of correct answers per 1000 tokens
- Inference Time: Average time to solve problems
- Planning Efficiency: Impact of planning intervals on performance
- Tool Integration Benefit: Performance improvement from tool usage
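A compact sketch of how the core metrics could be computed from per-problem records follows; the field names and the exact cost-of-pass definition (total tokens divided by the number of correct solutions) are assumptions for illustration.

```python
def compute_metrics(records):
    """records: list of dicts with 'correct' (bool), 'tokens' (int), 'seconds' (float)."""
    n = len(records)
    n_correct = sum(r["correct"] for r in records)
    total_tokens = sum(r["tokens"] for r in records)

    accuracy = n_correct / n
    cost_of_pass = total_tokens / max(n_correct, 1)       # tokens spent per correct answer
    token_efficiency = 1000 * n_correct / total_tokens    # correct answers per 1000 tokens
    avg_inference_time = sum(r["seconds"] for r in records) / n

    return {"accuracy": accuracy, "cost_of_pass": cost_of_pass,
            "token_efficiency": token_efficiency, "avg_inference_time": avg_inference_time}
```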
```bash
# Clone repository
git clone https://github.com/username/systematic-optimization-of-llms.git
cd systematic-optimization-of-llms

# Install dependencies
pip install -r requirements.txt
```
You can either run the full pipeline from the provided Jupyter notebooks in `notebooks/` or execute the modular `.py` scripts inside `open_source_agents/`.
For quick testing, try `python scripts/run_optimization.py --help` to see available options.
```bash
# Set up HuggingFace authentication (required for model access)
export HF_TOKEN=your_huggingface_token
```
```
Systematic-Optimization-of-Open-Source-Large-Language-Models-for-Mathematical-Reasoning/
├── Final_draft.tex and pdf/            # Contains main paper and .tex file
│   ├── 3d_parameter_landscapes.png
│   ├── Configuration by model.png
│   ├── Performance1.png
│   ├── Sequence_Diagram_SFD.png
│   ├── The_Latest_Draft.pdf            # Compiled draft document (PDF version)
│   ├── The_Latest_Draft.tex            # LaTeX source file for the draft
│   ├── correlation_network.png
│   ├── efficiency_frontier.png
│   ├── optimization_dashboard.png
│   ├── parameter_space_exploration.png
│   ├── performance_radar_chart.png
│   └── problem_category_performance.png
│
├── configs/                            # Configuration files
│   └── optimization_summary.json       # Summary of all model configurations
│
├── data/                               # Data for evaluation
│   ├── raw/                            # Raw GSM8K dataset
│   └── processed/                      # Processed evaluation data
│
├── notebooks/
│   ├── Code.ipynb                      # Main experiment notebook
│   ├── model_exploration.ipynb         # Model exploration and analysis
│   └── results_visualization.ipynb     # Results visualization
│
├── open_source_agents/                 # Main project package
│   ├── __init__.py
│   ├── agents/                         # Agent implementations
│   │   ├── __init__.py
│   │   ├── base_agent.py               # Base agent architecture
│   │   ├── react_agent.py              # ReAct reasoning implementation
│   │   └── tool_framework.py           # Tool integration framework
│   ├── configs/                        # Configuration files
│   │   ├── qwen2.5-72b_config.json
│   │   ├── llama-3.1-70b_config.json
│   │   ├── deepseek-v3_config.json
│   │   ├── mixtral-8x22b_config.json
│   │   └── yi-lightning_config.json
│   ├── data/                           # Data utilities
│   │   ├── __init__.py
│   │   └── data_loader.py              # Dataset loading utilities
│   ├── models/                         # Model management
│   │   ├── __init__.py
│   │   └── model_loader.py             # Model loading utilities
│   └── utils/                          # Utility functions
│       ├── __init__.py
│       ├── cost_tracker.py             # Token and cost tracking
│       ├── evaluation_system.py        # Evaluation utilities
│       ├── logger.py                   # Logging system
│       ├── metrics.py                  # Performance and evaluation metrics
│       ├── model_optimization.py       # Functions for model optimization
│       ├── model_templates.py          # Predefined templates for models
│       ├── optimization_config.py      # Optimization configuration
│       ├── parameter_grid.py           # Parameter grid generation
│       ├── plan_revision.py            # Plan refinement and revision logic
│       ├── planner.py                  # Planning and reasoning
│       └── tool_analytics.py           # Analytics and monitoring for tools
│
├── outputs/                            # Generated outputs
│   ├── plots/                          # Generated plots, charts, and visualizations
│   └── reports/                        # Generated reports (PDF, LaTeX, summaries, etc.)
│
├── tests/                              # Unit tests
│   ├── test_agents.py                  # Agent tests
│   ├── test_tools.py                   # Tool framework tests
│   └── test_optimization.py            # Optimization tests
│
├── results/                            # Experimental results
│   └── figures/                        # Generated figures and plots
│
├── README.md                           # Project documentation
└── requirements.txt                    # Project dependencies
```
```python
from open_source_agents import OptimizedAgent

# Initialize with optimized configuration
agent = OptimizedAgent(model_name="qwen2.5-72b")

# Solve a mathematical problem
result = agent.solve("If a store has 48 apples and sells 3/4 of them, how many apples are left?")
print(f"Answer: {result['final_answer']}")
```
```python
from open_source_agents.utils.evaluator import Evaluator

# Evaluate a specific model on GSM8K
evaluator = Evaluator(benchmark="gsm8k")
results = evaluator.evaluate_model("llama-3.1-70b")

# Compare multiple models
comparison = evaluator.compare_models(["qwen2.5-72b", "llama-3.1-70b", "deepseek-v3"])
```
Our repository includes a command-line script for running optimization experiments:
```bash
# Run optimization for Mistral 7B on GSM8K
python scripts/run_optimization.py --model mistral-7b-instruct --benchmark gsm8k --agent-type react --num-examples 10 --output results/optimization_runs/mistral_gsm8k.json --visualize

# Run optimization for Phi-3 Mini on MATH
python scripts/run_optimization.py --model phi-3-mini --benchmark math --agent-type react --num-examples 5 --output results/optimization_runs/phi3_math.json --visualize --quantize
```
The optimization script supports the following options:
- `--model`: Model key (e.g., `qwen2.5-72b`, `mistral-7b-instruct`)
- `--benchmark`: Benchmark dataset (e.g., `gsm8k`, `math`)
- `--agent-type`: Agent type (`base` or `react`)
- `--num-examples`: Number of examples to evaluate
- `--output`: Output file path for results
- `--visualize`: Create visualizations of results
- `--quantize`: Quantize the model to reduce memory usage
- `--verbose`: Print verbose output
```python
from open_source_agents.utils.config_loader import ConfigLoader
from open_source_agents.models import ModelLoader
from open_source_agents.agents.react_agent import ReActAgent

# Load configuration
config_loader = ConfigLoader()

# Get default parameters for a model
default_params = config_loader.get_best_params("llama-3.1-70b")

# Customize parameters
custom_params = default_params.copy()
custom_params["temperature"] = 0.25
custom_params["top_p"] = 0.92

# Initialize model loader with quantization for memory efficiency
model_loader = ModelLoader(quantize=True)

# Create agent with custom parameters
agent = ReActAgent(
    model_key="llama-3.1-70b",
    model_loader=model_loader,
    **custom_params
)

# Solve a problem
solution = agent.solve_problem("If a triangle has sides of length 3, 4, and 5, what is its area?")
print(f"Answer: {solution['final_answer']}")
print(f"Reasoning: {solution['reasoning']}")
```
Each model is optimized using tailored parameter spaces:
- Optimization Focus: Analytical reasoning precision
- Temperature: 0.1 - 0.5 (lower for precision)
- Key Finding: Performs best with low temperature (0.2-0.3) and frequent planning (interval=2)
- Optimization Focus: Conversational reasoning balance
- Temperature: 0.2 - 0.6 (moderate for balanced exploration)
- Key Finding: Benefits from higher temperature (0.4) with moderate planning frequency
- Optimization Focus: Deep reasoning chains
- Max Steps: 8 - 16 (higher for complex reasoning)
- Key Finding: Excels with long reasoning chains (12+ steps) and very low temperature (0.15-0.2)
- Optimization Focus: Token efficiency
- Planning Interval: 2 - 4 (less frequent planning)
- Key Finding: Most token-efficient with moderate steps (6) and temperature (0.3)
- Optimization Focus: Balanced performance
- Temperature: 0.2 - 0.5
- Key Finding: Most consistent performance across parameter settings
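For illustration only, a per-model configuration capturing such findings might look like the snippet below; the keys, values, and the actual JSON schema in `open_source_agents/configs/` are assumptions.

```python
# Hypothetical best-parameter entry for a precision-focused model
best_params = {
    "temperature": 0.25,        # low temperature for analytical precision
    "top_p": 0.9,
    "max_steps": 8,
    "planning_interval": 2,     # frequent re-planning
}
```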
Our experiments show significant improvements across all models:
- Accuracy Improvement: +4.5% average improvement over base configurations
- Cost Efficiency: 22% reduction in token usage for successful solutions
- Planning Impact: Adaptive planning intervals provide 15% performance boost over fixed intervals
If you find our work useful, please cite our paper:
```bibtex
@article{author2023systematic,
  title={Systematic Optimization of Open-Source Large Language Models for Mathematical Reasoning},
  author={Author, A. and Author, B.},
  journal={Conference/Journal Name},
  year={2023}
}
```
We provide two Python scripts for analyzing the optimization results and generating visualizations:
The `simple_analysis.py` script analyzes the performance data from the research paper and generates key insights about the model optimization results:

```bash
python simple_analysis.py
```
This script will:
- Calculate performance improvements across models
- Generate summary statistics for accuracy, cost, and speed
- Save a comprehensive performance summary to `analysis_output/performance_summary.txt`
The `viz_generator.py` script creates visualizations similar to those in the `gohil` folder:

```bash
python viz_generator.py
```

This script generates the following visualizations in the `analysis_output/visualizations` directory:
- `bar_chart_improvements.png`: Comparison of baseline vs. optimized accuracy across models
- `bar_chart_top_p.png`: Optimal top-p values for each model
- `cost_reduction_chart.png`: Cost reduction from optimization for each model
- `param_heatmap.png`: Heatmap showing optimal temperature and max-steps settings
These visualizations provide a clear visual representation of the optimization benefits and parameter settings that yield the best performance for each model.
This project is licensed under the MIT License - see the LICENSE file for details.
- We thank the developers of the open-source models used in this research
- GSM8K dataset creators for providing a standardized benchmark
- HuggingFace for model hosting and API access