
# Evaluating Small Language Models on the JEEBench Benchmark

**A Comprehensive Analysis of STEM Reasoning in State-of-the-Art SLMs**

This repository contains the code and data for the M.Tech thesis project, "Evaluating Small Language Models on an Advanced Benchmark Dataset." The study presents a rigorous evaluation of seven leading Small Language Models (SLMs) on JEEBench, a highly challenging benchmark derived from India's prestigious IIT JEE-Advanced examinations, to assess their mathematical and scientific reasoning capabilities.

## 📜 Abstract

Despite the growing interest in deploying Small Language Models (SLMs) for educational applications, a systematic evaluation of their mathematical and scientific reasoning capabilities is lacking. This research addresses that gap by evaluating seven SLMs on JEEBench, a challenging benchmark of 120 problems from the IIT JEE-Advanced exams covering Physics, Chemistry, and Mathematics. We test three prompting methods (zero-shot, few-shot, and Chain-of-Thought) using a production-grade evaluation framework with strict exact-match scoring. The analysis, based on 2,880 evaluations, reveals that mathematical specialization is decisive, parameter efficiency is a key factor, and certain models exhibit an "anti-prompting" phenomenon. A universal failure on numerical computation tasks highlights fundamental architectural gaps, suggesting that while SLMs can offer significant educational support, careful deployment and safety considerations are essential.

## 🚀 Breakthrough Discoveries

This research uncovered several findings that challenge current assumptions about SLM performance and provide insights for future development:

- 🧠 **Mathematical Specialization Supremacy:** Domain-specific models like Qwen2.5-Math-7B achieved 40-92% higher accuracy than general-purpose models, showing that specialized training is critical for complex STEM reasoning.
- ⚡ **Parameter Efficiency Revolution:** The Qwen2-1.5B-Instruct model (1.5B parameters) ranked #2 overall, outperforming five larger 7-8B models. This demonstrates that high-quality training and architecture can matter more than parameter count.
- ⚠️ **The "Anti-Prompting" Phenomenon:** A novel finding in which advanced prompting catastrophically degrades performance. Meta-Llama-3-8B showed a 55% drop in accuracy with few-shot prompting, revealing a critical deployment risk.
- ⛓️ **Chain-of-Thought (CoT) Effectiveness:** CoT was the optimal strategy for 4 of the 7 models, confirming its benefits for specialized reasoning, especially in math-tuned models.
- 🔢 **Universal Computational Barrier:** A systematic 93.3% failure rate on numeric problems across all models reveals fundamental architectural limitations in handling precise calculations.
- 🔬 **Subject Difficulty Hierarchy:** A clear difficulty gradient was established: Chemistry (avg 18.2%) > Physics (avg 15.7%) > Mathematics (avg 12.1%).

## 📊 Key Visualizations

### Comparative Performance Across Prompting Techniques

This heatmap shows the exact-match accuracy of each model across the three prompting methods. The dominance of Qwen2.5-Math-7B and the severe underperformance of Meta-Llama-3-8B in the few-shot setting are clearly visible.

### Performance by Subject Under Optimal Conditions

This visualization breaks down the peak performance of each model by subject, highlighting domain-specific strengths and weaknesses.

### Overall Model Performance by Prompting Technique

This bar chart provides a direct comparison of all models, illustrating the impact of different prompting strategies on their accuracy.

## 📚 The JEEBench Benchmark

The evaluation is performed on JEEBench, a curated dataset of 120 problems from the highly competitive IIT JEE-Advanced examinations (2016-2023). The benchmark is designed to test deep, multi-step reasoning and domain knowledge across three subjects:

- Physics (40 problems)
- Chemistry (40 problems)
- Mathematics (40 problems)

The dataset files used in this project are:

- `data/dataset120best.json`: the main dataset of 120 problems.
- `data/few_shot_examples.json`: curated examples for few-shot prompting.

## 🤖 Models Evaluated

Seven state-of-the-art Small Language Models were evaluated, representing a diverse range of architectures and training methodologies:

- Qwen2.5-Math-7B-Instruct (7B)
- Qwen2-1.5B-Instruct (1.5B)
- Qwen2.5-Math-1.5B (1.5B)
- Meta-Llama-3-8B-Instruct (8B)
- DeepSeek-R1-Distill-Qwen-7B (7B)
- DeepSeek-R1-Distill-Qwen-1.5B (1.5B)
- Mistral-7B-Instruct-v0.3 (7B)

## 📁 Repository Structure

```
.
├── assets/
│   ├── heatmap_comparative.png
│   ├── heatmap_subjects.png
│   └── figure_4_1_slm_performance.jpg
├── data/
│   ├── dataset120best.json
│   └── few_shot_examples.json
├── finalpy.py
├── requirements.txt
├── README.md
└── LICENSE
```
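For orientation, the sketch below shows one way to load the dataset file and apply the kind of strict exact-match check described in the abstract. It is illustrative only: the actual scoring logic lives in `finalpy.py`, and the field names used here (`subject`, `gold`) are assumptions about the JSON layout rather than its documented schema.

```python
# inspect_dataset.py -- illustrative sketch only; the real evaluation and
# scoring code is in finalpy.py. Field names below ("subject", "gold") are
# assumptions about the JSON layout of data/dataset120best.json.
import json
from collections import Counter

with open("data/dataset120best.json", encoding="utf-8") as f:
    problems = json.load(f)  # assumed to be a list of problem records

# Rough per-subject count (expected: 40 Physics, 40 Chemistry, 40 Mathematics).
print(Counter(p.get("subject", "unknown") for p in problems))

def exact_match(prediction: str, gold: str) -> bool:
    """Strict exact-match scoring: credit only when normalized answers are identical."""
    return prediction.strip().upper() == gold.strip().upper()

print(exact_match(" b ", "B"))        # True  -- option letters match after normalization
print(exact_match("3.14", "3.1416"))  # False -- no partial credit for near-misses
```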

## ⚙️ Installation

To set up the environment and run the evaluation script, follow these steps:

1. **Clone the repository:**

   ```bash
   git clone https://github.com/Abduhu1/Evaluating-SLMs-on-JEEBench.git
   cd Evaluating-SLMs-on-JEEBench
   ```

2. **Create and activate a Python virtual environment:**

   ```bash
   python3 -m venv venv
   source venv/bin/activate
   ```

   On Windows, use `venv\Scripts\activate` instead.

3. **Install the required dependencies:**

   ```bash
   pip install -r requirements.txt
   ```
> **Note:** Ensure you have a compatible CUDA version installed for GPU acceleration.

## ▶️ How to Run the Evaluation

The `finalpy.py` script is the main entry point for running the evaluations. It is highly configurable via command-line arguments.

**Key Arguments:**

- `--model_name`: the Hugging Face model identifier (e.g., `Qwen/Qwen2.5-Math-7B-Instruct`).
- `--method`: the prompting method to use (`zero_shot`, `few_shot`, `cot`, `both`, `all`).
- `--dataset`: path to the dataset file (default: `data/dataset120best.json`).
- `--max_problems`: the number of problems to evaluate from the dataset.
- `--temperature`, `--temp_cot`, `--temp_few_shot`: sampling temperatures for the different methods.

**Example Commands:**

Run a zero-shot evaluation on the first 10 problems with Qwen2.5-Math-7B:

```bash
python finalpy.py --model_name "Qwen/Qwen2.5-Math-7B-Instruct" --method zero_shot --max_problems 10
```

Run all three prompting methods on the full 120-problem dataset with Llama-3-8B:

```bash
python finalpy.py --model_name "meta-llama/Meta-Llama-3-8B-Instruct" --method all --max_problems 120
```

Run a few-shot evaluation using custom examples:

```bash
python finalpy.py --model_name "mistralai/Mistral-7B-Instruct-v0.3" --method few_shot --few_shot_examples "path/to/my_examples.json"
```
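To sweep the full grid of models and prompting methods rather than launching runs one at a time, a small driver script can call `finalpy.py` repeatedly. The sketch below is a hypothetical helper, not part of the repository: it assumes the command-line flags behave exactly as documented above, and the Hugging Face identifiers for the seven models should be double-checked before use.

```python
# batch_eval.py -- hypothetical convenience script (not included in this repo).
# Assumes finalpy.py accepts the --model_name, --method, and --max_problems
# flags exactly as documented in the README.
import subprocess
import sys

MODELS = [  # double-check these Hugging Face identifiers before running
    "Qwen/Qwen2.5-Math-7B-Instruct",
    "Qwen/Qwen2-1.5B-Instruct",
    "Qwen/Qwen2.5-Math-1.5B",
    "meta-llama/Meta-Llama-3-8B-Instruct",
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    "mistralai/Mistral-7B-Instruct-v0.3",
]
METHODS = ["zero_shot", "few_shot", "cot"]

for model in MODELS:
    for method in METHODS:
        # One evaluation run per (model, method) pair on the full 120-problem set.
        subprocess.run(
            [sys.executable, "finalpy.py",
             "--model_name", model,
             "--method", method,
             "--max_problems", "120"],
            check=True,
        )
```

If `finalpy.py` handles the combined mode internally, passing `--method all` in a single invocation per model would be an equivalent, slightly shorter alternative.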

## 🏆 Master Results

The final ranking of all seven models, based on their peak exact-match accuracy on JEEBench, is as follows:

| Rank | Model | Parameters | Best Method | Peak Accuracy | Consistency Score |
|------|-------|------------|-------------|---------------|-------------------|
| 1 | Qwen2.5-Math-7B-Instruct | 7B | CoT | 22.5% | 9.2 / 10 |
| 2 | Qwen2-1.5B-Instruct | 1.5B | CoT | 16.1% | 8.8 / 10 |
| 3 | Qwen2.5-Math-1.5B | 1.5B | CoT | 15.6% | 8.5 / 10 |
| 4 | Meta-Llama-3-8B-Instruct | 8B | Zero-Shot | 15.0% | 6.2 / 10 |
| 5 | DeepSeek-R1-Distill-Qwen-7B | 7B | Zero-Shot | 14.2% | 9.5 / 10 |
| 6 | DeepSeek-R1-Distill-Qwen-1.5B | 1.5B | Zero/Few-Shot | 14.2% | 9.8 / 10 |
| 7 | Mistral-7B-Instruct-v0.3 | 7B | CoT | 11.7% | 7.5 / 10 |

## ✍️ Citation

If you find this work useful in your research, please consider citing the thesis:

```bibtex
@mastersthesis{Andrabi2025JeeBench,
  author     = {Abdullah Khurshid Andrabi},
  title      = {Evaluating Small Language Models on an Advanced Benchmark Dataset},
  school     = {National Institute of Technology, Srinagar},
  year       = {2025},
  month      = {July},
  address    = {Srinagar, Jammu \& Kashmir, India},
  note       = {M.Tech Thesis},
  supervisor = {Dr. Shaima Qureshi}
}
```

## 🙏 Acknowledgments

I would like to express my deepest gratitude to my guide, Dr. Shaima Qureshi, for her invaluable guidance and unwavering support throughout this project. I also thank the Department of Computer Science and Engineering at the National Institute of Technology, Srinagar, for providing the resources and environment for this research.

## 📄 License

This project is licensed under the MIT License. See the `LICENSE` file for details.
