This repository provides Python programs for creating training data to fine-tune models using scientific papers. The system uses a modern pydantic-based configuration system that supports both convenient shortnames (like `o4mini`) and traditional `provider:model` formats. There are two workflows implemented (or being implemented) here:
Multiple Choice Question (MCQ) Workflow does the following:
- Converts PDF-format papers into JSON
- Uses an AI model to generate Multiple Choice Questions (MCQs) for each paper. Each paper is split into n-token chunks, and the model creates an MCQ for each chunk.
- Uses one or more models to answer the MCQs
- Each model scores the answers produced by all of the other models.
New Knowledge Nugget (NKN) Workflow (under construction) will:
- Convert PDF-format papers into JSON
- Use an AI model to extract Knowledge Nuggets from each paper. Each paper is split into n-token chunks, and the model will extract knowledge nuggets from each chunk.
- Test each nugget using a model to be fine-tuned, eliminating nuggets that are already known to the model. This will create a set of New Knowledge Nuggets (NKNs) for fine-tuning the target model.
Finally, this repo contains a work-in-progress, exploratory project that uses the components from these two workflows as part of an agentic system.
The repository is thus organized as follows:
- The stable MCQ workflow in legacy/scripts uses components in src, including:
- src/common - tools common to both the MCQ and Nugget workflows, including model access, configuration, etc.,
- src/mcq_workflow - tools specific to generating, answering, and scoring MCQs,
- src/nugget_workflow - tools specific to extracting knowledge nuggets and screening for those not already known by a target model,
- src/test - test routines including a stub model for testing workflows quickly without model delays (including offline testing), and
- src/tune_workflow - tools that use MCQs (and eventually NKNs) to fine-tune a model (under construction, and thus not yet included in either workflow).
All of the components in src/common, src/mcq_workflow, and src/nugget_workflow work both as Python modules (called from the CLI) and as part of an exploratory agent-based system, where each pipeline component is a subclass of agent_base.Agent that enforces a Python contract of the form:
def run(context: dict) -> dict
Each component of the pipeline performs its specific set of tasks and returns its results to a shared context. A lightweight orchestrator.py imports and runs the agents.
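For illustration, here is a minimal sketch of what such an agent might look like. Only the agent_base.Agent base class and the run(context) -> dict contract come from the repository; the class name, context keys, and placeholder logic below are hypothetical:

from agent_base import Agent  # base class for pipeline components

class ExampleMCQAgent(Agent):  # hypothetical class name, for illustration only
    """Sketch of a pipeline component honoring the run(context) -> dict contract."""

    def run(self, context: dict) -> dict:
        # Read whatever earlier agents placed in the shared context
        # (the 'papers' key here is a made-up example).
        papers = context.get('papers', [])
        # Do this component's work (placeholder logic only).
        context['mcqs'] = [f'MCQ generated from {p}' for p in papers]
        # Return the updated context for the next agent in the pipeline.
        return context

The orchestrator can then simply thread a single context dict through each agent in turn.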
The remainder of this README is currently specific to the CLI (legacy, stable) MCQ workflow.
Contact: Please email {foster|stevens|catlett}@anl.gov if you see things that are unclear or missing.
This pipeline converts scientific papers in PDF format into JSON and then uses AI models to:
- Generate multiple-choice questions (MCQs)
- Create answers to those MCQs
- Score the generated answers
Step-by-Step Workflow:
- Convert PDFs to JSON representations
- Generate MCQs from JSON files
- Combine MCQ JSON files
- Select a subset of MCQs (optional)
- Generate additional answers for MCQs
- Score AI-generated answers
- Review MCQ generation and scoring status
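These steps can be run individually with the commands shown later in this README, or end to end with the workflow driver, for example:

# Run the complete MCQ workflow with 2 parallel workers, verbose output
python -m mcq_workflow.run_workflow -p 2 -v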
In addition to the src packages listed above, the CLI workflow uses:
- legacy/scripts - shell scripts to execute the workflow (replaced with a Python script in src/mcq_workflow).
Clone the Repository:
git clone https://github.com/auroraGPT-ANL/MCQ-and-SFT-code.git
cd MCQ-and-SFT-code
Create Conda Environment:
conda env create -f environment.yml
conda activate augpt_env
Set Python Path (Required - do this every time):
export PYTHONPATH="$PWD:$PWD/src${PYTHONPATH:+:$PYTHONPATH}"
echo "PYTHONPATH set to: $PYTHONPATH"
💡 Tip: Add this to your shell profile (`~/.bashrc` or `~/.zshrc`) to make it permanent:
echo 'export PYTHONPATH="$HOME/path/to/MCQ-and-SFT-code:$HOME/path/to/MCQ-and-SFT-code/src${PYTHONPATH:+:$PYTHONPATH}"' >> ~/.zshrc
Create Working Directories:
mkdir -p _PAPERS _JSON _MCQ _RESULTS
The system uses 4 configuration files with this precedence (highest → lowest): env vars ▸ `config.local.yml` ▸ `servers.yml` ▸ `config.yml`
| File | Tracked? | Purpose | Action Needed |
|---|---|---|---|
| `config.yml` | ✓ | Stable defaults | Already exists |
| `servers.yml` | ✓ | Endpoint catalog | Already exists |
| `config.local.yml` | ✗ | Your models & settings | You must create |
| `secrets.yml` | ✗ | Your API keys | You must create |
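Environment variables sit at the top of this precedence order, so they can temporarily override anything set in the YAML files. The FAQ at the end of this README notes that workflow settings use `AUGPT_WORKFLOW__…` variable names; the exact variable below is an illustrative guess at that pattern, not a documented name:

# Hypothetical override: env vars take precedence over config.local.yml
export AUGPT_WORKFLOW__EXTRACTION="openai:gpt-4o-mini"
python -m mcq_workflow.run_workflow -p 2 -v   # this run uses the overridden extraction model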
Step 2a: Create config.local.yml
# Create your local configuration file
cat > config.local.yml << 'EOF'
# Your model choices for this run - use convenient shortnames!
workflow:
  extraction: o4mini                 # Model for generating MCQs
  contestants: [o4mini, gpt41nano]   # Models for answering MCQs
  target: o4mini                     # Target model (optional)

# Optional: Override defaults
timeout: 60
default_temperature: 0.7
EOF
Step 2b: Create secrets.yml
# Create your secrets file (never commit this!)
cat > secrets.yml << 'EOF'
# Add your API credentials here
openai_api_key: "sk-your-openai-key-here"
# Uncomment and add other credentials as needed:
# argo_username: "your-argo-username"
# alcf_token: "your-alcf-token"
EOF
"sk-your-openai-key-here"
with your actual OpenAI API key.
Verify Configuration:
python -c "
try:
from settings import load_settings
settings = load_settings()
print('✅ Configuration system loaded successfully')
print('Available endpoints:', len(settings.endpoints))
print('Configured workflow models:', settings.workflow.contestants)
print('✅ Setup looks good! You can proceed to run the workflow.')
except Exception as e:
print(f'❌ Configuration error: {e}')
print('Check that you have created config.local.yml and secrets.yml')
"
Add Sample Papers:
# Place PDF files in _PAPERS directory
cp /path/to/your/papers/*.pdf _PAPERS/
Run Quick Test:
# Test the complete workflow with minimal settings
python -m mcq_workflow.run_workflow -p 2 -v
Our modern configuration system provides several conveniences:
- 🏷️ Shortnames: Use `o4mini` instead of `openai:gpt-4o-mini`
- 🔒 Automatic credential validation: Only checks credentials for models you're actually using
- 🛡️ Secure secrets: Credentials are automatically wrapped in SecretStr for security
- 📁 Hierarchical loading: Environment variables > config.local.yml > servers.yml > config.yml
- ✅ Smart validation: Prevents common configuration errors
Both shortnames and full `provider:model` formats work seamlessly. Check `servers.yml` to see the available endpoints. Common patterns:
- OpenAI: `openai:gpt-4o-mini`, `openai:gpt-4o`
- Local servers: Use shortnames like `scout`, `qwen`
- ALCF: `alcf:meta-llama/Meta-Llama-3-70B-Instruct`
- Test models: `test:all` (for offline development)
Simple OpenAI setup:

workflow:
  extraction: openai:gpt-4o-mini
  contestants: [openai:gpt-4o-mini]

Multiple models:

workflow:
  extraction: openai:gpt-4o
  contestants: [openai:gpt-4o, openai:gpt-4o-mini, scout]
  target: openai:gpt-4o

Offline testing:

workflow:
  extraction: test:all
  contestants: [test:all]
- Add the endpoint to `servers.yml`:

my_endpoint:
  shortname: my_model
  provider: openai              # or: argo, alcf, local, hf
  base_url: https://api.example.com/v1
  model: my-model-name
  cred_key: my_api_key
- Add the credential to `secrets.yml`:

my_api_key: "your-secret-key"
- Reference it in `config.local.yml`:

workflow:
  extraction: my_model   # or: openai:my-model-name
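After adding the endpoint, you can confirm it is picked up with the model-listing utility described later in this README, for example:

python -m common.list_models -p 4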
Configuration not working?
# Check file exists and format
ls -la config.local.yml secrets.yml
python -c "import yaml; print(yaml.safe_load(open('config.local.yml')))"
Import errors?
# Ensure PYTHONPATH is set
echo $PYTHONPATH
conda activate augpt_env
API errors?
# Test credentials
python -c "
from settings import load_settings
settings = load_settings()
print('Available endpoints:', list(settings.endpoints.keys()))
print('Configured models:', settings.workflow.contestants)
"
Full MCQ workflow with 4 parallel workers:
python -m mcq_workflow.run_workflow -p 4 -v
Start from specific step (1-5):
python -m mcq_workflow.run_workflow --step 3 -p 4
Process subset of MCQs:
python -m mcq_workflow.run_workflow -n 50 -p 4
Convert PDFs to JSON:
python -m common.simple_parse -i _PAPERS -o _JSON
Generate MCQs:
python -m mcq_workflow.generate_mcqs -p 4 -v
python -m mcq_workflow.generate_mcqs -m openai:gpt-4o -a 5 # 5-choice MCQs
Generate answers:
python -m mcq_workflow.generate_answers -i MCQ-combined.json -m openai:gpt-4o -p 4
Score answers:
python -m mcq_workflow.score_answers -a openai:gpt-4o -b argo:mistral-7b -p 4
Combine MCQ files:
python -m common.combine_json_files -o MCQ-combined.json
Integration test with specific PDF:
./src/test/test_workflow.sh -i path/to/paper.pdf -v
Test model verification (offline testing):
python -m test.test_model_verification -v
Legacy workflow test:
./legacy/scripts/run_mcq_workflow.sh -p 2 -v
List configured models:
python -m common.list_models -p 4
Review processing status:
python -m mcq_workflow.review_status -o _RESULTS
Check ALCF service status:
python -m common.check_alcf_service_status
Extract Q&A pairs from results:
python -m mcq_workflow.extract_qa -i input.json -o output.json
Select random MCQ subset:
python -m common.select_mcqs_at_random -i MCQ-combined.json -o MCQ-subset.json -n 100
Authenticate with the ALCF inference service:

python -m common.inference_auth_token authenticate
python -m common.inference_auth_token get_access_token   # Check current token
For OpenAI, add to `secrets.yml`:

openai_api_key: "sk-your-key-here"
For Argo, add to `secrets.yml`:

argo_username: "your-argo-username"
- Q: I changed models in `config.local.yml` but the run still calls the old ones.
  A: Make sure you didn't also set `AUGPT_WORKFLOW__…` environment variables; env vars override YAML.

- Q: How do I add a brand-new endpoint?
  A:
  - Add a block in `servers.yml` with `provider`, `base_url`, `model`, `cred_key`.
  - Add the credential in `secrets.yml`.
  - Reference its `shortname` (or `provider:model`) in `config.local.yml`.

- Q: Where did the old `-m` flags go?
  A: They still work, but the recommended way is to list models in `config.local.yml` so the whole workflow is reproducible.
Last updated: 2025-07-01