Create diverse, synthetic datasets, either from scratch or by augmenting an existing dataset. Define your dataset schema as a set of generators, typically LLM calls with a prompt describing the kind of examples you want.
Basic installation (includes OpenAI, Anthropic, and core functionality):
```bash
pip install chatan
```
With optional features:
```bash
# For local model support (transformers + PyTorch)
pip install chatan[local]

# For advanced evaluation features (semantic similarity, BLEU score)
pip install chatan[eval]

# For all optional features
pip install chatan[all]
```
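A quick way to confirm the base install is simply importing the package (this raises `ImportError` if something went wrong; no extras are needed for the import itself):

```python
import chatan  # fails loudly if the package is not installed correctly
```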
```python
import chatan

# Create a generator
gen = chatan.generator("openai", "YOUR_API_KEY")

# Define a dataset schema
ds = chatan.dataset({
    "topic": chatan.sample.choice(["Python", "JavaScript", "Rust"]),
    "prompt": gen("write a programming question about {topic}"),
    "response": gen("answer this question: {prompt}"),
})

# Generate the data with a progress bar
df = ds.generate(n=10)
```
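The result can then be inspected and saved like any tabular data. A minimal sketch, assuming `generate` returns a standard pandas DataFrame (as the `df` name suggests); the file names are placeholders:

```python
# Inspect the first few generated rows
print(df.head())

# Persist the dataset; to_csv / to_json are standard pandas methods
df.to_csv("programming_qa.csv", index=False)
df.to_json("programming_qa.jsonl", orient="records", lines=True)
```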
Supported providers:

```python
# OpenAI
gen = chatan.generator("openai", "YOUR_OPENAI_API_KEY")

# Anthropic
gen = chatan.generator("anthropic", "YOUR_ANTHROPIC_API_KEY")

# HuggingFace Transformers (requires the `local` extra)
gen = chatan.generator("transformers", model="microsoft/DialoGPT-medium")
```
Create Data Mixes
```python
from chatan import dataset, generator, sample

gen = generator("openai", "YOUR_API_KEY")

mix = [
    "san antonio, tx",
    "marfa, tx",
    "paris, fr",
]

ds = dataset({
    "id": sample.uuid(),
    "topic": sample.choice(mix),
    "prompt": gen("write an example question about the history of {topic}"),
    "response": gen("respond to: {prompt}"),
})
```
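To sanity-check the mix, generate a batch and count how often each topic was drawn. A sketch under the same assumption that the output is a pandas DataFrame:

```python
df = ds.generate(n=30)

# Each location should appear roughly uniformly,
# since sample.choice draws from the list
print(df["topic"].value_counts())
```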
Augment Datasets
```python
from chatan import generator, dataset, sample
from datasets import load_dataset

gen = generator("openai", "YOUR_API_KEY")
hf_data = load_dataset("some/dataset")

ds = dataset({
    "original_prompt": sample.from_dataset(hf_data, "prompt"),
    "variation": gen("rewrite this prompt: {original_prompt}"),
    "response": gen("respond to: {variation}"),
})
```
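If you want the augmented rows back in Hugging Face format, `datasets.Dataset.from_pandas` converts the generated frame. A sketch under the same DataFrame assumption as above:

```python
from datasets import Dataset

df = ds.generate(n=100)

# Convert back to a Hugging Face dataset,
# e.g. for push_to_hub or a training pipeline
augmented = Dataset.from_pandas(df)
```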
Evaluate rows inline or compute aggregate metrics:
```python
from chatan import dataset, eval, sample

ds = dataset({
    "col1": sample.choice(["a", "a", "b"]),
    "col2": "b",
    "score": eval.exact_match("col1", "col2"),
})
df = ds.generate()

# Aggregate exact-match score
aggregate = ds.evaluate({
    "exact_match": ds.eval.exact_match("col1", "col2")
})

# Semantic similarity using sentence transformers (requires the `eval` extra)
aggregate = ds.evaluate({
    "semantic_sim": ds.eval.semantic_similarity("col1", "col2")
})

# BLEU score evaluation (requires the `eval` extra)
aggregate = ds.evaluate({
    "bleu": ds.eval.bleu_score("col1", "col2")
})
```
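As a cross-check on the inline `score` column, the same exact-match rate can be computed directly with pandas; this snippet assumes the string columns come back unchanged in `df`:

```python
# Fraction of rows where col1 equals col2; with the sampler above this
# should be roughly 1/3, since "b" is drawn one time in three
manual_exact_match = (df["col1"] == df["col2"]).mean()
print(manual_exact_match)
```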
| Feature | Install Command | What's Included |
|---|---|---|
| Basic | `pip install chatan` | OpenAI, Anthropic, core sampling, basic evaluation |
| Local Models | `pip install chatan[local]` | + HuggingFace Transformers, PyTorch |
| Advanced Eval | `pip install chatan[eval]` | + Semantic similarity, BLEU scores, NLTK |
| Everything | `pip install chatan[all]` | All features above |
If you use this code in your research, please cite:
```bibtex
@software{reetz2025chatan,
  author = {Reetz, Christian},
  title = {chatan: Create synthetic datasets with LLM generators.},
  url = {https://github.com/cdreetz/chatan},
  year = {2025}
}
```
Community contributions are more than welcome: bug reports, bug fixes, feature requests, and feature additions. Please refer to the Issues tab.