# Evaluating Small Language Models on the JEEBench Benchmark

**A Comprehensive Analysis of STEM Reasoning in State-of-the-Art SLMs**

This repository contains the code and data for the M.Tech thesis project, "Evaluating Small Language Models on an Advanced Benchmark Dataset." The study presents a rigorous evaluation of seven leading Small Language Models (SLMs) on JEEBench, a highly challenging benchmark derived from India's prestigious IIT JEE-Advanced examinations, to assess their mathematical and scientific reasoning capabilities.

## 📜 Abstract

Despite the growing interest in deploying Small Language Models (SLMs) for educational applications, a systematic evaluation of their mathematical and scientific reasoning capabilities is lacking. This research addresses that gap by evaluating seven SLMs on JEEBench, a challenging benchmark of 120 problems from the IIT JEE-Advanced exams covering Physics, Chemistry, and Mathematics. We test three prompting methods (zero-shot, few-shot, and Chain-of-Thought) using a production-grade evaluation framework with strict exact-match scoring (a minimal sketch of such a scorer follows the findings below). The analysis, based on 2,880 evaluations, reveals that mathematical specialization is decisive, that parameter efficiency is a key factor, and that certain models exhibit an "anti-prompting" phenomenon: their accuracy is highest with plain zero-shot prompts. A universal failure on numerical computation tasks highlights fundamental architectural gaps, suggesting that while SLMs can offer significant educational support, careful deployment and safety considerations are essential.

## 🚀 Breakthrough Discoveries

This research uncovered several findings that challenge current assumptions about SLM performance and provide critical insights for future development:

- 🧠 **Mathematical Specialization Supremacy:** Domain-specific models like Qwen2.5-Math-7B achieved 40-92% higher accuracy than general-purpose models, demonstrating that specialized training is critical for complex STEM reasoning.
- ⚡ **Parameter Efficiency Revolution:** The Qwen2-1.5B-Instruct model (1.5B parameters) ranked #2 overall, outperforming five larger 7-8B models. This shows that high-quality training and architecture can be more important than parameter count.
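The framework scores responses by strict exact match. The snippet below is a minimal, hypothetical sketch of such a scorer, assuming multiple-choice answers are normalized to sorted option letters before comparison; the helper names and normalization rules are illustrative assumptions, not code taken from `finalpy.py`.

```python
import re

def normalize_answer(raw: str) -> str:
    """Illustrative normalization: collapse multiple-choice answers
    like '(A), (C)' to the sorted letter string 'ac'; otherwise fall
    back to a trimmed, lowercased string comparison."""
    letters = re.findall(r"[a-dA-D]", raw)
    if letters:
        return "".join(sorted(l.lower() for l in letters))
    return raw.strip().lower()

def exact_match(prediction: str, gold: str) -> bool:
    """Strict exact match: full credit only when the normalized
    prediction equals the normalized gold answer; no partial credit."""
    return normalize_answer(prediction) == normalize_answer(gold)

# A multi-answer MCQ counts as correct only if every selected
# option matches the key exactly -- this strictness is what makes
# exact-match scoring so unforgiving on JEE-style problems.
assert exact_match("(A), (C)", "AC")
assert not exact_match("(A)", "AC")
```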
## ⚙️ Installation

To set up the environment and run the evaluation script, follow these steps:

1. Clone the repository:

```bash
git clone https://github.com/Abduhu1/Evaluating-SLMs-on-JEEBench.git
cd Evaluating-SLMs-on-JEEBench
```
2. Create and activate a Python virtual environment:

```bash
python3 -m venv venv
source venv/bin/activate
```
3. Install the required dependencies:

```bash
pip install -r requirements.txt
```
**Note:** Ensure you have a compatible CUDA version installed for GPU acceleration.
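Before launching an evaluation, you can verify that PyTorch detects your GPU (this assumes the framework loads models via PyTorch, as the Hugging Face model IDs in the usage examples below suggest):

```python
import torch

# Quick sanity check for GPU acceleration: prints True and the
# device name if a CUDA-capable GPU and matching driver are found.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```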
## 💻 Usage

Run all three prompting methods on the full 120-problem dataset with Llama-3-8B:

```bash
python finalpy.py --model_name "meta-llama/Meta-Llama-3-8B-Instruct" --method all --max_problems 120
```
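The `--method` flag selects among the three prompting styles named in the abstract. The templates below are a hypothetical sketch of how such prompts are commonly structured; the template text and function names are assumptions, not excerpts from `finalpy.py`.

```python
# Hypothetical prompt templates for the three evaluated methods;
# the actual wording used by finalpy.py may differ.
PROMPT_TEMPLATES = {
    # Zero-shot: the bare problem, no guidance.
    "zero_shot": "Question: {question}\nAnswer:",
    # Few-shot: worked examples are prepended to the question.
    "few_shot": "{examples}\n\nQuestion: {question}\nAnswer:",
    # Chain-of-Thought: explicitly ask for step-by-step reasoning.
    "cot": (
        "Question: {question}\n"
        "Let's think step by step, then state the final answer."
    ),
}

def build_prompt(method: str, question: str, examples: str = "") -> str:
    """Fill the selected template; `examples` is only used for few-shot."""
    return PROMPT_TEMPLATES[method].format(question=question, examples=examples)
```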
Run a few-shot evaluation using custom examples:

```bash
python finalpy.py --model_name "mistralai/Mistral-7B-Instruct-v0.3" --method few_shot --few_shot_examples "path/to/my_examples.json"
```
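The exact schema of the custom examples file is not shown here; the snippet below sketches one plausible layout (question/reasoning/answer records) and writes it to disk. The field names are illustrative assumptions, not the format `finalpy.py` necessarily requires.

```python
import json

# Hypothetical few-shot examples file; the field names ("question",
# "reasoning", "answer") are illustrative assumptions.
examples = [
    {
        "question": "A particle moves with constant acceleration ...",
        "reasoning": "Apply v = u + at, then ...",
        "answer": "(B)",
    },
    {
        "question": "The pH of a 0.01 M HCl solution is ...",
        "reasoning": "pH = -log10[H+] = -log10(0.01) = 2.",
        "answer": "2",
    },
]

with open("my_examples.json", "w") as f:
    json.dump(examples, f, indent=2)
```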
## 🏆 Master Results

The final ranking of all seven models based on their peak exact-match accuracy on JEEBench is as follows:

| Rank | Model | Parameters | Best Method | Peak Accuracy | Consistency Score |
|------|-------|------------|-------------|---------------|-------------------|
| 1 | Qwen2.5-Math-7B-Instruct | 7B | CoT | 22.5% | 9.2 / 10 |
| 2 | Qwen2-1.5B-Instruct | 1.5B | CoT | 16.1% | 8.8 / 10 |
| 3 | Qwen2.5-Math-1.5B | 1.5B | CoT | 15.6% | 8.5 / 10 |
| 4 | Meta-Llama-3-8B-Instruct | 8B | Zero-Shot | 15.0% | 6.2 / 10 |
| 5 | DeepSeek-R1-Distill-Qwen-7B | 7B | Zero-Shot | 14.2% | 9.5 / 10 |
| 6 | DeepSeek-R1-Distill-Qwen-1.5B | 1.5B | Zero/Few-Shot | 14.2% | 9.8 / 10 |
| 7 | Mistral-7B-Instruct-v0.3 | 7B | CoT | 11.7% | 7.5 / 10 |

## ✍️ Citation

If you find this work useful in your research, please consider citing the thesis:

```bibtex
@mastersthesis{Andrabi2025JeeBench,
  author     = {Abdullah Khurshid Andrabi},
  title      = {Evaluating Small Language Models on an Advanced Benchmark Dataset},
  school     = {National Institute of Technology, Srinagar},
  year       = {2025},
  month      = {July},
  address    = {Srinagar, Jammu \& Kashmir, India},
  note       = {M.Tech Thesis},
  supervisor = {Dr. Shaima Qureshi}
}
```
## 🙏 Acknowledgments

I would like to express my deepest gratitude to my guide, Dr. Shaima Qureshi, for her invaluable guidance and unwavering support throughout this project. I also thank the Department of Computer Science and Engineering at the National Institute of Technology, Srinagar, for providing the resources and environment for this research.

## 📄 License

This project is licensed under the MIT License. See the LICENSE file for details.