AtmosSci-Bench is a comprehensive benchmark framework for evaluating Large Language Models (LLMs) on atmospheric science tasks. This repository contains the code and resources for the paper: "AtmosSci-Bench: Evaluating the Recent Advance of Large Language Model for Atmospheric Science".
The benchmark consists of:
- Multiple-Choice Questions (MCQ): Rigorous, physics-based questions generated using symbolic techniques
- Open-Ended Questions (OEQ): Problems requiring step-by-step reasoning and detailed explanations
- Evaluation Framework: Tools for assessing LLM performance on atmospheric science tasks
AtmosSci-Bench evaluates LLMs on their ability to understand and reason about atmospheric science concepts. The benchmark covers various domains including:
- Atmospheric Dynamics
- Hydrology
- Geophysics
- Climate Science
- Meteorology
- Python 3.10.9
- Dependencies listed in `requirements.txt`
- GPU resources for local model inference (optional)
- Clone this repository:

  ```bash
  git clone https://github.com/your-username/atmossci-bench.git
  cd atmossci-bench
  ```

- Run the setup script:

  ```bash
  ./setup.sh
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
The MCQ Generation Framework creates rigorous, physics-based multiple-choice questions using symbolic techniques. It ensures questions test genuine reasoning ability rather than pattern matching.
- Symbolic Question Generation: Creates questions with variable parameters
- Template-Based Perturbation: Uses placeholder variables that can be systematically instantiated
- Rule-Based Mechanism: Ensures logical coherence and alignment with physical laws
- Diverse Question Types: Covers various domains in atmospheric science
To generate the MCQ dataset:

```bash
./mcq_gen_framework/scripts/generate_mcq.sh
```
For more details, see `mcq_gen_framework/README.md`.
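As a rough illustration of the idea (the function, parameter ranges, and field names below are made up for this sketch, not the framework's actual API), a symbolic template samples parameter values within physical ranges, computes the correct answer from the governing equation, and derives distractors with rule-based perturbations:

```python
import random

# Illustrative sketch of symbolic, template-based MCQ generation
# (hypothetical names, not the framework's actual API).
def geostrophic_wind_question(seed: int):
    rng = random.Random(seed)
    # Sample symbolic parameters within physically reasonable ranges.
    dp_dn = rng.uniform(1e-3, 3e-3)      # pressure gradient [Pa/m]
    rho = 1.2                            # air density [kg/m^3]
    f = rng.uniform(8e-5, 1.4e-4)        # Coriolis parameter [1/s]

    correct = dp_dn / (rho * f)          # geostrophic wind speed [m/s]
    # Rule-based distractors derived from common algebraic mistakes.
    distractors = [correct * rho, correct * 2, dp_dn / f]

    options = [correct] + distractors
    rng.shuffle(options)
    return {
        "question": (
            f"Given a pressure gradient of {dp_dn:.2e} Pa/m, air density "
            f"{rho} kg/m^3 and Coriolis parameter {f:.2e} s^-1, what is the "
            "geostrophic wind speed?"
        ),
        "options": [f"{v:.2f} m/s" for v in options],
        "answer": f"{correct:.2f} m/s",
    }

print(geostrophic_wind_question(seed=42))
```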
The benchmark dataset is available in the `data/` directory. It includes:
- Main MCQ Set: Core multiple-choice questions
- Extra MCQ Set: Additional multiple-choice questions
- OEQ Set: Open-ended questions requiring detailed explanations
If you generate the dataset using the MCQ Generation Framework, the output will be placed in `data/`.
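Assuming the question sets are stored as JSONL under `data/` (the file name below is hypothetical), a set can be loaded with a few lines of Python:

```python
import json
from pathlib import Path

# Hypothetical example: load one of the MCQ sets as a list of records.
# The exact file names and record fields may differ from this sketch.
def load_jsonl(path: str):
    with Path(path).open() as f:
        return [json.loads(line) for line in f if line.strip()]

questions = load_jsonl("data/main_mcq.jsonl")  # hypothetical file name
print(len(questions), "questions loaded")
```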
Models hosted by providers such as OpenAI, Google, DeepSeek, and TogetherAI are accessed via public inference APIs.
- Create a `.env` file using `.env_example` as a template
- Add your API keys to the `.env` file
- Run inference using the provided scripts
We use the Ray Python library for parallel execution of API requests, enabling efficient large-scale evaluation.
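The general pattern looks roughly like the sketch below (a minimal illustration, not the repository's inference code; the environment-variable name and the stubbed API call are assumptions):

```python
import os
import ray

# Illustrative pattern only: fan out API requests as Ray remote tasks
# and gather the responses in parallel.
ray.init()

@ray.remote
def query_llm(prompt: str) -> str:
    api_key = os.environ.get("OPENAI_API_KEY", "")  # hypothetical key name
    # ... call the provider's chat/completions endpoint here ...
    return f"[stub response for: {prompt[:40]}]"

prompts = ["State the geostrophic balance.", "Define potential temperature."]
futures = [query_llm.remote(p) for p in prompts]
responses = ray.get(futures)
print(responses)
```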
Models available through HuggingFace can be run locally using:
- HuggingFace `transformers` library
- `Accelerate` for GPU acceleration
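A minimal local-inference sketch using the `transformers` pipeline, with `device_map="auto"` so Accelerate places the model across available GPUs (the model id is only an example, not the benchmark's prescribed setup):

```python
from transformers import pipeline

# Minimal sketch of local inference with transformers + Accelerate.
# device_map="auto" lets Accelerate shard the model across available GPUs.
generator = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-7B-Instruct",  # example model id
    device_map="auto",
)
out = generator(
    "Explain the Coriolis effect in one sentence.",
    max_new_tokens=64,
)
print(out[0]["generated_text"])
```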
Our evaluation hardware setups include:
- Single machine with 8×NVIDIA RTX 4090 GPUs
- Two nodes with 4×NVIDIA A800 GPUs each
Performance varies by model size:
- 70B models: ~90 hours with batch size 4
- 7B models: ~6 hours with batch size 64
All output is stored in the `/output` directory, with separate folders for each model and dataset.
- Find the LLM name and base in `src/models/__init__.py` (`BASE_REGISTRY`)
- Run the appropriate script:
  - API-based models:

    ```bash
    # Edit parameters in the script first
    ./scripts/generate_api/generate.sh
    ```

  - Local models:

    ```bash
    # Edit parameters in the script first
    ./scripts/generate_gpu/generate_2gpu.sh
    ```
- Add the model to the appropriate file (see the sketch after this list):
  - API-based: `src/models/api_base.py` and `src/models/__init__.py`
  - Local-based: `src/models/local_base.py` and `src/models/__init__.py`
- Run inference using the scripts mentioned above
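The exact base-class interface lives under `src/models/`; purely as a hypothetical illustration of the wrapper-plus-registry pattern (none of the names below are the repository's actual code), adding a model amounts to something like:

```python
# Hypothetical illustration only; the real base classes and BASE_REGISTRY
# entries are defined under src/models/ and may look different.
class MyNewAPIModel:
    """Minimal wrapper around a hosted LLM API (names are made up)."""

    def __init__(self, model_name: str):
        self.model_name = model_name

    def generate(self, prompt: str) -> str:
        # Call the provider's inference API here and return the text output.
        raise NotImplementedError


BASE_REGISTRY = {"my-new-model": MyNewAPIModel}  # hypothetical registry entry
```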
To evaluate model responses:
- Edit parameters in `scripts/evaluate/evaluate.sh`
- Run the evaluation script:

  ```bash
  ./scripts/evaluate/evaluate.sh
  ```

This creates `evaluation.jsonl` and `results.json` in the same folder as the inference output.
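For instance, once `evaluation.jsonl` exists, overall accuracy can be recomputed with a few lines (the output path and the `correct` field name are assumptions about the record format):

```python
import json

# Hypothetical post-processing example: compute overall accuracy from
# evaluation.jsonl; path and field names are assumptions.
with open("output/my-model/mcq_main/evaluation.jsonl") as f:
    records = [json.loads(line) for line in f if line.strip()]

accuracy = sum(r.get("correct", False) for r in records) / len(records)
print(f"Accuracy: {accuracy:.2%}")
```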
To analyze evaluation results:
- Ensure result consistency:

  ```bash
  python scripts/generate_result2.py
  ```

- Generate analysis results:

  ```bash
  python scripts/analysis/analysis_*.py
  ```

- Generate instance analysis:

  ```bash
  python scripts/instance_analysis/create_instance_acc.py
  ```
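As a rough sketch of what per-instance analysis produces (field names such as `template_id` and `correct` are assumptions, not the scripts' actual schema), records can be grouped by their parent question template and averaged:

```python
import json
from collections import defaultdict

# Hypothetical sketch of per-instance accuracy: group evaluation records by
# their parent question template and average correctness per group.
by_template = defaultdict(list)
with open("output/my-model/mcq_main/evaluation.jsonl") as f:  # hypothetical path
    for line in f:
        r = json.loads(line)
        by_template[r.get("template_id")].append(r.get("correct", False))

for template_id, hits in by_template.items():
    print(template_id, f"{sum(hits) / len(hits):.2%}")
```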
The dataset is hosted on Kaggle: AtmosSci-Bench Dataset
Additional documentation:
- Croissant metadata: `doc/croissant/atmossci-bench-metadata-croissant.json`
- Validation report: `doc/croissant/report_croissant-validation_ATMOSSCI-BENCH.md`
Detailed information about hyperparameters and experimental compute resources is available in `doc/settings.md`.
Information about model and library usage licenses, data sources, and usage statements can be found in `doc/sources_license.md`.