Code for Creating Scientific Training Data from Papers

Overview

This repository provides Python programs for creating training data from scientific papers for fine-tuning models. The system uses a modern pydantic-based configuration system that supports both convenient shortnames (like o4mini) and traditional provider:model formats. Two workflows are implemented (or being implemented) here:

Multiple Choice Question (MCQ) Workflow does the following:

  1. Converts PDF-format papers into JSON
  2. Uses an AI model to generate Multiple Choice Questions (MCQs) for each paper. Each paper is split into n-token chunks, and the model creates an MCQ for each chunk.
  3. Uses one or more models to answer the MCQs
  4. Has each model score the answers produced by every other model.

New Knowledge Nugget (NKN) Workflow (under construction) will:

  1. Convert PDF-format papers into JSON
  2. Use an AI model to extract Knowledge Nuggets from each paper. Each paper is split into n-token chunks, and the model will extract knowledge nuggets from each chunk (a minimal chunking sketch follows this list).
  3. Test each nugget using a model to be fine-tuned, eliminating nuggets that are already known to the model. This will create a set of New Knowledge Nuggets (NKNs) for fine-tuning the target model.
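
Both workflows rely on the same chunking idea. The sketch below illustrates it with a simple whitespace tokenizer; the repository's real chunker (in src/common) may use a different tokenizer and chunk size, so treat the names and numbers here as placeholders.

# Illustrative only: approximate n-"token" chunking using whitespace tokens.
def chunk_text(text: str, chunk_tokens: int = 1000) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + chunk_tokens])
            for i in range(0, len(words), chunk_tokens)]

# Each chunk is then sent to the extraction model to produce one MCQ (or nugget) per chunk.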

Finally, this repo contains a work-in-progress, exploratory project to use the components from these two workflows as part of an agentic system.

The repository is thus organized as follows:

  1. Stable MCQ workflow in legacy/scripts uses components in src including:
  • src/common - tools common to both the MCQ and Nugget workflows, including model access, configuration, etc.,
  • src/mcq_workflow - tools specific to generating, answering, and scoring MCQs,
  • src/nugget_workflow - tools specific to extracting knowledge nuggets and screening for those not already known by a target model,
  • src/test - test routines including a stub model for testing workflows quickly without model delays (including offline testing), and
  • src/tune_workflow - tools to take MCQs (and eventually NKNs) to fine-tune a model. (also under construction, thus not yet included in either workflow)

All of the components in src/common, src/mcq_workflow, and src/nugget_workflow work both as Python modules (called from the CLI) and as part of an exploratory agent-based system, where each pipeline component is a subclass of agent_base.Agent, which enforces a Python contract of the form:

def run(context: dict) -> dict 

Each component of the pipeline performs its specific set of tasks and returns its results to a shared context. A lightweight orchestrator.py imports and runs the agents.
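
A minimal sketch of that contract and the orchestrator loop follows. agent_base.Agent and orchestrator.py are real names from the repo; the concrete agent below and the method-style signature are illustrative simplifications, not the repository's actual implementation.

# Sketch only; not the repository's actual code.
class Agent:
    """Base contract: take the shared context dict, return the updated context."""
    def run(self, context: dict) -> dict:
        raise NotImplementedError

class GenerateMCQsAgent(Agent):            # hypothetical pipeline stage
    def run(self, context: dict) -> dict:
        chunks = context.get("chunks", [])
        context["mcqs"] = [f"MCQ for: {c[:40]}..." for c in chunks]   # stand-in for a model call
        return context

# A lightweight orchestrator just threads the shared context through each agent in order.
def orchestrate(agents: list, context: dict) -> dict:
    for agent in agents:
        context = agent.run(context)
    return context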

The remainder of this README is currently specific to the CLI (legacy, stable) MCQ workflow.

Contact: Please email {foster|stevens|catlett}@anl.gov if you see things that are unclear or missing.

MCQ Workflow Overview

This pipeline converts scientific papers in PDF format into JSON and then uses AI models to:

  • Generate multiple-choice questions (MCQs)
  • Create answers to those MCQs
  • Score the generated answers

Step-by-Step Workflow:

  1. Convert PDFs to JSON representations
  2. Generate MCQs from JSON files
  3. Combine MCQ JSON files
  4. Select a subset of MCQs (optional)
  5. Generate additional answers for MCQs
  6. Score AI-generated answers
  7. Review MCQ generation and scoring status


Quick Start Guide

1. Clone and Setup Environment

Clone the Repository:

git clone https://github.com/auroraGPT-ANL/MCQ-and-SFT-code.git
cd MCQ-and-SFT-code

Create Conda Environment:

conda env create -f environment.yml
conda activate augpt_env

Set Python Path (Required - do this every time):

export PYTHONPATH="$PWD:$PWD/src${PYTHONPATH:+:$PYTHONPATH}"
echo "PYTHONPATH set to: $PYTHONPATH"

💡 Tip: Add this to your shell profile (~/.bashrc or ~/.zshrc) to make it permanent:

echo 'export PYTHONPATH="$HOME/path/to/MCQ-and-SFT-code:$HOME/path/to/MCQ-and-SFT-code/src${PYTHONPATH:+:$PYTHONPATH}"' >> ~/.zshrc

Create Working Directories:

mkdir -p _PAPERS _JSON _MCQ _RESULTS

2. Configuration Setup

The system uses 4 configuration files with this precedence (highest → lowest):
env vars → config.local.yml → servers.yml → config.yml

File               Tracked?   Purpose                  Action Needed
config.yml         yes        Stable defaults          Already exists
servers.yml        yes        Endpoint catalog         Already exists
config.local.yml   no         Your models & settings   You must create
secrets.yml        no         Your API keys            You must create
Create Your Local Configuration

Step 2a: Create config.local.yml

# Create your local configuration file
cat > config.local.yml << 'EOF'
# Your model choices for this run - use convenient shortnames!
workflow:
  extraction: o4mini                 # Model for generating MCQs
  contestants: [o4mini, gpt41nano]   # Models for answering MCQs
  target: o4mini                     # Target model (optional)

# Optional: Override defaults
timeout: 60
default_temperature: 0.7
EOF

Step 2b: Create secrets.yml

# Create your secrets file (never commit this!)
cat > secrets.yml << 'EOF'
# Add your API credentials here
openai_api_key: "sk-your-openai-key-here"

# Uncomment and add other credentials as needed:
# argo_username: "your-argo-username"
# alcf_token: "your-alcf-token"
EOF

⚠️ Important: Replace "sk-your-openai-key-here" with your actual OpenAI API key.

3. Test Your Setup

Verify Configuration:

python -c "
try:
    from settings import load_settings
    settings = load_settings()
    print('✅ Configuration system loaded successfully')
    print('Available endpoints:', len(settings.endpoints))
    print('Configured workflow models:', settings.workflow.contestants)
    print('✅ Setup looks good! You can proceed to run the workflow.')
except Exception as e:
    print(f'❌ Configuration error: {e}')
    print('Check that you have created config.local.yml and secrets.yml')
"

Add Sample Papers:

# Place PDF files in _PAPERS directory
cp /path/to/your/papers/*.pdf _PAPERS/

Run Quick Test:

# Test the complete workflow with minimal settings
python -m mcq_workflow.run_workflow -p 2 -v

4. Understanding the Configuration System

New Pydantic Configuration Features 🎉

Our modern configuration system provides several conveniences:

  • 🏷️ Shortnames: Use o4mini instead of openai:gpt-4o-mini
  • 🔒 Automatic credential validation: Only checks credentials for models you're actually using
  • 🛡️ Secure secrets: Credentials are automatically wrapped in SecretStr for security
  • 📁 Hierarchical loading: Environment variables > config.local.yml > servers.yml > config.yml
  • ✅ Smart validation: Prevents common configuration errors

Both shortnames and full provider:model formats work seamlessly.

Available Model Endpoints

Check servers.yml to see available endpoints. Common patterns:

  • OpenAI: openai:gpt-4o-mini, openai:gpt-4o
  • Local servers: Use shortnames like scout, qwen
  • ALCF: alcf:meta-llama/Meta-Llama-3-70B-Instruct
  • Test models: test:all (for offline development)

Model Configuration Examples

Simple OpenAI setup:

workflow:
  extraction: openai:gpt-4o-mini
  contestants: [openai:gpt-4o-mini]

Multiple models:

workflow:
  extraction: openai:gpt-4o
  contestants: [openai:gpt-4o, openai:gpt-4o-mini, scout]
  target: openai:gpt-4o

Offline testing:

workflow:
  extraction: test:all
  contestants: [test:all]

Adding New Endpoints

  1. Add endpoint to servers.yml:

my_endpoint:
  shortname: my_model
  provider: openai  # or: argo, alcf, local, hf
  base_url: https://api.example.com/v1
  model: my-model-name
  cred_key: my_api_key

  2. Add credential to secrets.yml:

my_api_key: "your-secret-key"

  3. Reference in config.local.yml:

workflow:
  extraction: my_model  # or: openai:my-model-name
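
To confirm the new entry is visible to the workflow, you can reuse the same settings loader as in step 3. The names my_endpoint / my_model match the example above; substitute your own, and note that the key under which your entry appears may differ:

from settings import load_settings

settings = load_settings()
names = list(settings.endpoints.keys())   # same call as in the troubleshooting section
print("new endpoint visible:", "my_endpoint" in names or "my_model" in names)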

5. Troubleshooting Setup

Configuration not working?

# Check file exists and format
ls -la config.local.yml secrets.yml
python -c "import yaml; print(yaml.safe_load(open('config.local.yml')))"

Import errors?

# Ensure PYTHONPATH is set
echo $PYTHONPATH
conda activate augpt_env

API errors?

# Test credentials
python -c "
from settings import load_settings
settings = load_settings()
print('Available endpoints:', list(settings.endpoints.keys()))
print('Configured models:', settings.workflow.contestants)
"

Detailed Workflow Commands

Main Workflow Execution

Full MCQ workflow with 4 parallel workers:

python -m mcq_workflow.run_workflow -p 4 -v

Start from specific step (1-5):

python -m mcq_workflow.run_workflow --step 3 -p 4

Process subset of MCQs:

python -m mcq_workflow.run_workflow -n 50 -p 4

Individual Pipeline Components

Convert PDFs to JSON:

python -m common.simple_parse -i _PAPERS -o _JSON

Generate MCQs:

python -m mcq_workflow.generate_mcqs -p 4 -v
python -m mcq_workflow.generate_mcqs -m openai:gpt-4o -a 5  # 5-choice MCQs

Generate answers:

python -m mcq_workflow.generate_answers -i MCQ-combined.json -m openai:gpt-4o -p 4

Score answers:

python -m mcq_workflow.score_answers -a openai:gpt-4o -b argo:mistral-7b -p 4

Combine MCQ files:

python -m common.combine_json_files -o MCQ-combined.json

Testing Commands

Integration test with specific PDF:

./src/test/test_workflow.sh -i path/to/paper.pdf -v

Test model verification (offline testing):

python -m test.test_model_verification -v

Legacy workflow test:

./legacy/scripts/run_mcq_workflow.sh -p 2 -v

Utility Commands

List configured models:

python -m common.list_models -p 4

Review processing status:

python -m mcq_workflow.review_status -o _RESULTS

Check ALCF service status:

python -m common.check_alcf_service_status

Extract Q&A pairs from results:

python -m mcq_workflow.extract_qa -i input.json -o output.json

Select random MCQ subset:

python -m common.select_mcqs_at_random -i MCQ-combined.json -o MCQ-subset.json -n 100
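
Conceptually, the selection step samples MCQs without replacement from the combined file. A minimal sketch of the idea follows, assuming the combined file is a JSON array of MCQ records; this may not match the module's exact I/O format and is not its actual implementation:

import json
import random

# Stand-in illustration of what common.select_mcqs_at_random does conceptually.
with open("MCQ-combined.json") as f:
    mcqs = json.load(f)                      # assumed: a JSON array of MCQ records

subset = random.sample(mcqs, k=min(100, len(mcqs)))

with open("MCQ-subset.json", "w") as f:
    json.dump(subset, f, indent=2)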

Authentication Setup

ALCF Inference Service

python -m common.inference_auth_token authenticate
python -m common.inference_auth_token get_access_token  # Check current token

OpenAI

Add to secrets.yml:

openai_api_key: "sk-your-key-here"

Argo

Add to secrets.yml:

argo_username: "your-argo-username"

FAQ

  • Q: I changed models in config.local.yml but the run still calls the old ones.
    A: Make sure you didn't also set AUGPT_WORKFLOW__… environment variables; env vars override YAML.

  • Q: How do I add a brand-new endpoint?

    1. Add a block in servers.yml with provider, base_url, model, cred_key.
    2. Add the credential in secrets.yml.
    3. Reference its shortname (or provider:model) in config.local.yml.
  • Q: Where did the old -m flags go?
    A: They still work, but the recommended way is to list models in config.local.yml so the whole workflow is reproducible.


Last updated: 2025-07-01
