Convert HuggingFace datasets to Apple Foundation Model training format (JSONL) with intelligent format detection and Claude Code SDK integration.
- Intelligent Dataset Analysis: Automatically detects conversation patterns and optimal field mappings
- Claude Code SDK Integration: Optional AI-powered analysis for complex dataset structures
- Multiple Format Support: Handles instruction-response, question-answer, multi-turn conversations, and more
- Comprehensive Logging: Generates detailed conversion rationale with decision explanations
- Backward Compatible: Works with or without Claude Code SDK
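
To make "format detection" more concrete, here is a minimal sketch of how a field mapping could be picked for a single example. It is not the script's actual implementation: `KNOWN_PAIRS`, `detect_field_mapping`, and `to_apple_messages` are hypothetical names, the candidate field pairs are the ones this README lists as supported formats, and the order in which they are tried is an assumption.

```python
from typing import List, Optional, Tuple

# Candidate (prompt_field, response_field) pairs, taken from the supported
# formats listed in this README. The ordering is an assumption.
KNOWN_PAIRS = [
    ("instruction", "output"),
    ("question", "answer"),
    ("prompt", "response"),
    ("input", "output"),
]


def detect_field_mapping(example: dict) -> Optional[Tuple[str, str]]:
    """Return the first known pair whose two fields are non-empty strings."""
    for prompt_field, response_field in KNOWN_PAIRS:
        prompt = example.get(prompt_field)
        response = example.get(response_field)
        if (
            isinstance(prompt, str)
            and isinstance(response, str)
            and prompt.strip()
            and response.strip()
        ):
            return prompt_field, response_field
    return None


def to_apple_messages(example: dict) -> Optional[List[dict]]:
    """Build the user/assistant message pair, or None if no pair matches."""
    mapping = detect_field_mapping(example)
    if mapping is None:
        return None
    prompt_field, response_field = mapping
    return [
        {"role": "user", "content": example[prompt_field]},
        {"role": "assistant", "content": example[response_field]},
    ]
```

An Alpaca-style record with `instruction`, an empty `input`, and `output` would resolve to the `instruction` + `output` pair under this sketch, because the empty `input` field is skipped.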
# Basic installation
pip install -r requirements.txt
# For Claude Code SDK integration (optional)
pip install anthropic

# Basic usage
python hf_to_apple_jsonl.py dataset_name output_dir

# With Claude Code SDK analysis
python hf_to_apple_jsonl.py dataset_name output_dir --use-claude-hook

# With all options
python hf_to_apple_jsonl.py dataset_name output_dir \
--split train \
--max-examples 1000 \
--train-split-ratio 0.9 \
--text-field "text" \
--conversation-field "messages" \
--use-claude-hook

The script converts datasets to Apple's expected format:
[
{"role": "user", "content": "PROMPT"},
{"role": "assistant", "content": "RESPONSE"}
]

Supported input formats:
- Instruction-Response: instruction + output fields
- Question-Answer: question + answer fields
- Prompt-Response: prompt + response fields
- Input-Output: input + output fields
- Multi-turn Conversations: Arrays with role and content fields
- Text with Markers: Single text fields with Human: and Assistant: markers

Output files:
- train.jsonl: Training examples
- valid.jsonl: Validation examples (if split ratio < 1.0)
- conversion_rationale.txt: Detailed analysis and decision log (with --use-claude-hook)
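
The marker-based format and the train/validation outputs described above can be sketched as follows. This is an illustration under stated assumptions, not the script's own code: `parse_marked_text` and `write_split` are hypothetical helpers, the split is a simple ordered cut at the `--train-split-ratio` value, and each JSONL line is assumed to hold one message array in the format shown earlier.

```python
import json
import re
from pathlib import Path

# Matches the "Human:" / "Assistant:" markers and captures which one it was.
MARKER = re.compile(r"(Human|Assistant):")


def parse_marked_text(text):
    """Split 'Human: ... Assistant: ...' text into role/content messages."""
    # With one capture group, re.split yields
    # [prefix, marker, body, marker, body, ...].
    parts = MARKER.split(text)
    messages = []
    for marker, body in zip(parts[1::2], parts[2::2]):
        body = body.strip()
        if body:
            role = "user" if marker == "Human" else "assistant"
            messages.append({"role": role, "content": body})
    return messages


def write_split(examples, output_dir, train_split_ratio=0.9):
    """Write train.jsonl, plus valid.jsonl when the ratio is below 1.0."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    cut = int(len(examples) * train_split_ratio)
    splits = {"train.jsonl": examples[:cut]}
    if train_split_ratio < 1.0:
        splits["valid.jsonl"] = examples[cut:]
    for name, rows in splits.items():
        with open(out / name, "w", encoding="utf-8") as f:
            for row in rows:
                # One JSON array of messages per line.
                f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return {name: len(rows) for name, rows in splits.items()}
```

A real converter would likely shuffle before splitting and handle marker strings that appear inside message bodies; both are left out here to keep the sketch short.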
# Convert Alpaca dataset
python hf_to_apple_jsonl.py tatsu-lab/alpaca ./output
# Convert with intelligent analysis
python hf_to_apple_jsonl.py microsoft/DialoGPT-medium ./output --use-claude-hook
# Convert specific field mapping
python hf_to_apple_jsonl.py dataset_name ./output --text-field "conversation"

Requirements:
- Python 3.7+
- datasets >= 2.0.0
- huggingface_hub >= 0.15.0
- anthropic >= 0.25.0 (optional, for Claude Code SDK)
MIT License