This repository provides Python programs for creating training data to fine-tune models using scientific papers. The system uses a modern pydantic-based configuration system that supports both convenient shortnames (like `o4mini`) and traditional `provider:model` formats. There are two workflows implemented (or being implemented) here:
Multiple Choice Question (MCQ) Workflow does the following:
- Converts PDF-format papers into JSON
- Uses an AI model to generate Multiple Choice Questions (MCQs) for each paper. Each paper is split into n-token chunks, and the model creates an MCQ for each chunk.
- Uses one or more models to answer the MCQs
- Each model scores the answers produced by all of the other models.
New Knowledge Nugget (NKN) Workflow (under construction) will:
- Convert PDF-format papers into JSON
- Use an AI model to extract Knowledge Nuggets from each paper. Each paper is split into n-token chunks, and the model will extract knowledge nuggets from each chunk.
- Test each nugget using a model to be fine-tuned, eliminating nuggets that are already known to the model. This will create a set of New Knowledge Nuggets (NKNs) for fine-tuning the target model.
Finally, this repo contains a work-in-progress, exploratory project that uses the components from these two workflows as part of an agentic system.
The repository is thus organized as follows:
- The stable MCQ workflow in legacy/scripts uses components in src, including:
- src/common - tools common to both the MCQ and Nugget workflows, including model access, configuration, etc.,
- src/mcq_workflow - tools specific to generating, answering, and scoring MCQs,
- src/nugget_workflow - tools specific to extracting knowledge nuggets and screening for those not already known by a target model,
- src/test - test routines including a stub model for testing workflows quickly without model delays (including offline testing), and
- src/tune_workflow - tools that use MCQs (and eventually NKNs) to fine-tune a model (under construction, and thus not yet included in either workflow).
All of the components in src/common, src/mcq_workflow, and src/nugget_workflow work both as Python modules (called from the CLI) and as part of an exploratory agent-based system, where each pipeline component is a subclass of agent_base.Agent that enforces a Python contract of the form:
def run(context: dict) -> dict
Each component of the pipeline performs its specific set of tasks and returns its results to a shared context. A lightweight orchestrator.py imports and runs the agents.
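For illustration, here is a minimal sketch of what such an agent might look like. Only the agent_base.Agent base class and the run(context) -> dict contract come from the repository; the class name, context keys, and placeholder logic below are hypothetical:

from agent_base import Agent  # base class for pipeline components

class ExampleMCQAgent(Agent):  # hypothetical class name, for illustration only
    """Sketch of a pipeline component honoring the run(context) -> dict contract."""

    def run(self, context: dict) -> dict:
        # Read whatever earlier agents placed in the shared context
        # (the 'papers' key here is a made-up example).
        papers = context.get('papers', [])
        # Do this component's work (placeholder logic only).
        context['mcqs'] = [f'MCQ generated from {p}' for p in papers]
        # Return the updated context for the next agent in the pipeline.
        return context

The orchestrator can then simply thread a single context dict through each agent in turn.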
The remainder of this README is currently specific to the CLI (legacy, stable) MCQ workflow.
Contact: Please email {foster|stevens|catlett}@anl.gov if you see things that are unclear or missing.
This pipeline converts scientific papers in PDF format into JSON and then uses AI models to:
- Generate multiple-choice questions (MCQs)
- Create answers to those MCQs
- Score the generated answers
Step-by-Step Workflow:
- Convert PDFs to JSON representations
- Generate MCQs from JSON files
- Combine MCQ JSON files
- Select a subset of MCQs (optional)
- Generate additional answers for MCQs
- Score AI-generated answers
- Review MCQ generation and scoring status
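These steps can be run individually with the commands shown later in this README, or end to end with the workflow driver, for example:

# Run the complete MCQ workflow with 2 parallel workers, verbose output
python -m mcq_workflow.run_workflow -p 2 -v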
In addition to the src packages listed above, the CLI workflow uses:
- legacy/scripts - shell scripts to execute the workflow (replaced with a Python script in src/mcq_workflow).
Clone the Repository:
git clone https://github.com/auroraGPT-ANL/MCQ-and-SFT-code.git
cd MCQ-and-SFT-code
Create Conda Environment:
conda env create -f environment.yml
conda activate augpt_env
Set Python Path (Required - do this every time):
export PYTHONPATH="$PWD:$PWD/src${PYTHONPATH:+:$PYTHONPATH}"
echo "PYTHONPATH set to: $PYTHONPATH"
💡 Tip: Add this to your shell profile (`~/.bashrc` or `~/.zshrc`) to make it permanent:
echo 'export PYTHONPATH="$HOME/path/to/MCQ-and-SFT-code:$HOME/path/to/MCQ-and-SFT-code/src${PYTHONPATH:+:$PYTHONPATH}"' >> ~/.zshrc
Create Working Directories:
mkdir -p _PAPERS _JSON _MCQ _RESULTS
The system uses 4 configuration files with this precedence (highest → lowest): env vars ▸ `config.local.yml` ▸ `servers.yml` ▸ `config.yml`
| File | Tracked? | Purpose | Action Needed |
|---|---|---|---|
| `config.yml` | ✓ | Stable defaults | Already exists |
| `servers.yml` | ✓ | Endpoint catalog | Already exists |
| `config.local.yml` | ✗ | Your models & settings | You must create |
| `secrets.yml` | ✗ | Your API keys | You must create |
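Environment variables sit at the top of this precedence order, so they can temporarily override anything set in the YAML files. The FAQ at the end of this README notes that workflow settings use `AUGPT_WORKFLOW__…` variable names; the exact variable below is an illustrative guess at that pattern, not a documented name:

# Hypothetical override: env vars take precedence over config.local.yml
export AUGPT_WORKFLOW__EXTRACTION="openai:gpt-4o-mini"
python -m mcq_workflow.run_workflow -p 2 -v   # this run uses the overridden extraction model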
Step 2a: Create config.local.yml
# Create your local configuration file
cat > config.local.yml << 'EOF'
# Your model choices for this run - use convenient shortnames!
workflow:
  extraction: o4mini                 # Model for generating MCQs
  contestants: [o4mini, gpt41nano]   # Models for answering MCQs
  target: o4mini                     # Target model (optional)

# Optional: Override defaults
timeout: 60
default_temperature: 0.7
EOF
Step 2b: Create secrets.yml
# Create your secrets file (never commit this!)
cat > secrets.yml << 'EOF'
# Add your API credentials here
openai_api_key: "sk-your-openai-key-here"
# Uncomment and add other credentials as needed:
# argo_username: "your-argo-username"
# alcf_token: "your-alcf-token"
EOF
"sk-your-openai-key-here"
with your actual OpenAI API key.
Verify Configuration:
python -c "
try:
from settings import load_settings
settings = load_settings()
print('✅ Configuration system loaded successfully')
print('Available endpoints:', len(settings.endpoints))
print('Configured workflow models:', settings.workflow.contestants)
print('✅ Setup looks good! You can proceed to run the workflow.')
except Exception as e:
print(f'❌ Configuration error: {e}')
print('Check that you have created config.local.yml and secrets.yml')
"
Add Sample Papers:
# Place PDF files in _PAPERS directory
cp /path/to/your/papers/*.pdf _PAPERS/
Run Quick Test:
# Test the complete workflow with minimal settings
python -m mcq_workflow.run_workflow -p 2 -v
Our modern configuration system provides several conveniences:
- 🏷️ Shortnames: Use `o4mini` instead of `openai:gpt-4o-mini`
- 🔒 Automatic credential validation: Only checks credentials for models you're actually using
- 🛡️ Secure secrets: Credentials are automatically wrapped in SecretStr for security
- 📁 Hierarchical loading: Environment variables > config.local.yml > servers.yml > config.yml
- ✅ Smart validation: Prevents common configuration errors
Both shortnames and full `provider:model` formats work seamlessly. Check `servers.yml` to see the available endpoints. Common patterns:
- OpenAI: `openai:gpt-4o-mini`, `openai:gpt-4o`
- Local servers: Use shortnames like `scout`, `qwen`
- ALCF: `alcf:meta-llama/Meta-Llama-3-70B-Instruct`
- Test models: `test:all` (for offline development)
Simple OpenAI setup:

workflow:
  extraction: openai:gpt-4o-mini
  contestants: [openai:gpt-4o-mini]

Multiple models:

workflow:
  extraction: openai:gpt-4o
  contestants: [openai:gpt-4o, openai:gpt-4o-mini, scout]
  target: openai:gpt-4o

Offline testing:

workflow:
  extraction: test:all
  contestants: [test:all]
- Add the endpoint to `servers.yml`:

my_endpoint:
  shortname: my_model
  provider: openai              # or: argo, alcf, local, hf
  base_url: https://api.example.com/v1
  model: my-model-name
  cred_key: my_api_key
- Add the credential to `secrets.yml`:

my_api_key: "your-secret-key"
- Reference it in `config.local.yml`:

workflow:
  extraction: my_model   # or: openai:my-model-name
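After adding the endpoint, you can confirm it is picked up with the model-listing utility described later in this README, for example:

python -m common.list_models -p 4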
Configuration not working?
# Check file exists and format
ls -la config.local.yml secrets.yml
python -c "import yaml; print(yaml.safe_load(open('config.local.yml')))"
Import errors?
# Ensure PYTHONPATH is set
echo $PYTHONPATH
conda activate augpt_env
API errors?
# Test credentials
python -c "
from settings import load_settings
settings = load_settings()
print('Available endpoints:', list(settings.endpoints.keys()))
print('Configured models:', settings.workflow.contestants)
"
Full MCQ workflow with 4 parallel workers:
python -m mcq_workflow.run_workflow -p 4 -v
Start from specific step (1-5):
python -m mcq_workflow.run_workflow --step 3 -p 4
Process subset of MCQs:
python -m mcq_workflow.run_workflow -n 50 -p 4
Convert PDFs to JSON:
python -m common.simple_parse -i _PAPERS -o _JSON
Generate MCQs:
python -m mcq_workflow.generate_mcqs -p 4 -v
python -m mcq_workflow.generate_mcqs -m openai:gpt-4o -a 5 # 5-choice MCQs
Generate answers:
python -m mcq_workflow.generate_answers -i MCQ-combined.json -m openai:gpt-4o -p 4
Score answers:
python -m mcq_workflow.score_answers -a openai:gpt-4o -b argo:mistral-7b -p 4
Combine MCQ files:
python -m common.combine_json_files -o MCQ-combined.json
Integration test with specific PDF:
./src/test/test_workflow.sh -i path/to/paper.pdf -v
Test model verification (offline testing):
python -m test.test_model_verification -v
Legacy workflow test:
./legacy/scripts/run_mcq_workflow.sh -p 2 -v
List configured models:
python -m common.list_models -p 4
Review processing status:
python -m mcq_workflow.review_status -o _RESULTS
Check ALCF service status:
python -m common.check_alcf_service_status
Extract Q&A pairs from results:
python -m mcq_workflow.extract_qa -i input.json -o output.json
Select random MCQ subset:
python -m common.select_mcqs_at_random -i MCQ-combined.json -o MCQ-subset.json -n 100
Authenticate with the ALCF inference service:

python -m common.inference_auth_token authenticate
python -m common.inference_auth_token get_access_token   # Check current token
For OpenAI, add to `secrets.yml`:

openai_api_key: "sk-your-key-here"
For Argo, add to `secrets.yml`:

argo_username: "your-argo-username"
- Q: I changed models in `config.local.yml` but the run still calls the old ones.
  A: Make sure you didn't also set `AUGPT_WORKFLOW__…` environment variables; env vars override YAML.

- Q: How do I add a brand-new endpoint?
  A:
  - Add a block in `servers.yml` with `provider`, `base_url`, `model`, `cred_key`.
  - Add the credential in `secrets.yml`.
  - Reference its `shortname` (or `provider:model`) in `config.local.yml`.

- Q: Where did the old `-m` flags go?
  A: They still work, but the recommended way is to list models in `config.local.yml` so the whole workflow is reproducible.
Last updated: 2025-07-01