Chatan

Create diverse, synthetic datasets. Start from scratch or augment an existing dataset. Define your dataset schema as a set of generators, typically LLM calls with a prompt describing the kind of examples you want.
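Conceptually, each column in the schema is either a sampler or a prompt template whose `{name}` placeholders are filled from columns generated earlier in the same row. A minimal plain-Python sketch of that substitution idea (illustrative only, not chatan's internals):

```python
import random

def build_row(schema):
    """Build one row: callables produce a value, strings are
    templates filled from previously generated columns."""
    row = {}
    for name, spec in schema.items():
        if callable(spec):
            row[name] = spec(row)           # sampler / generator stand-in
        else:
            row[name] = spec.format(**row)  # template like "{topic}"
    return row

schema = {
    "topic": lambda row: random.choice(["Python", "Rust"]),
    "prompt": "write a programming question about {topic}",
}
row = build_row(schema)
```

Because columns are built in order, `prompt` can reference `topic` the same way the examples below reference earlier columns.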

Installation

Basic installation (includes OpenAI, Anthropic, and core functionality):

pip install chatan

With optional features:

# For local model support (transformers + PyTorch)
pip install chatan[local]

# For advanced evaluation features (semantic similarity, BLEU score)
pip install chatan[eval]

# For all optional features
pip install chatan[all]

Getting Started

import chatan

# Create a generator
gen = chatan.generator("openai", "YOUR_API_KEY")

# Define a dataset schema
ds = chatan.dataset({
    "topic": chatan.sample.choice(["Python", "JavaScript", "Rust"]),
    "prompt": gen("write a programming question about {topic}"),
    "response": gen("answer this question: {prompt}")
})

# Generate the data with a progress bar
df = ds.generate(n=10)

Generator Options

API-based Generators (included in base install)

# OpenAI
gen = chatan.generator("openai", "YOUR_OPENAI_API_KEY")

# Anthropic
gen = chatan.generator("anthropic", "YOUR_ANTHROPIC_API_KEY")

Local Model Support (requires pip install chatan[local])

# HuggingFace Transformers
gen = chatan.generator("transformers", model="microsoft/DialoGPT-medium")

Examples

Create Data Mixes

from chatan import dataset, generator, sample
import uuid

gen = generator("openai", "YOUR_API_KEY")

mix = [
    "san antonio, tx",
    "marfa, tx",
    "paris, fr"
]

ds = dataset({
    "id": sample.uuid(),
    "topic": sample.choice(mix),
    "prompt": gen("write an example question about the history of {topic}"),
    "response": gen("respond to: {prompt}"),
})
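The two samplers used here map onto standard-library primitives; an illustrative stand-in for what `sample.uuid()` and `sample.choice(mix)` each contribute per row (not chatan's implementation):

```python
import random
import uuid

mix = ["san antonio, tx", "marfa, tx", "paris, fr"]

# Stand-ins: each call yields one cell value for a generated row.
row_id = str(uuid.uuid4())   # like sample.uuid()
topic = random.choice(mix)   # like sample.choice(mix)
```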

Augment datasets

from chatan import generator, dataset, sample
from datasets import load_dataset

gen = generator("openai", "YOUR_API_KEY")
hf_data = load_dataset("some/dataset")

ds = dataset({
    "original_prompt": sample.from_dataset(hf_data, "prompt"),
    "variation": gen("rewrite this prompt: {original_prompt}"),
    "response": gen("respond to: {variation}")
})
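`sample.from_dataset` presumably draws values from a column of the existing dataset to seed each new row. A rough plain-Python stand-in over a list of records (the records here are invented for illustration):

```python
import random

# Stand-in corpus in place of a Hugging Face dataset: a list of
# records, each with a "prompt" field.
records = [
    {"prompt": "What is a closure?"},
    {"prompt": "Explain borrowing in Rust."},
]

# Rough equivalent of sample.from_dataset(hf_data, "prompt"):
# pick one existing value, which the generators then rewrite.
original_prompt = random.choice(records)["prompt"]
```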

Evaluation

Evaluate rows inline or compute aggregate metrics:

from chatan import dataset, eval, sample

ds = dataset({
    "col1": sample.choice(["a", "a", "b"]),
    "col2": "b",
    "score": eval.exact_match("col1", "col2")
})

df = ds.generate()
aggregate = ds.evaluate({
    "exact_match": ds.eval.exact_match("col1", "col2")
})
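Exact match is per-row string equality, and the aggregate is its mean over the dataset. A plain-Python sketch of that computation on the columns above (illustrative only):

```python
# Per-row exact match (1.0 on equality, else 0.0), then the mean
# as the aggregate score.
col1 = ["a", "a", "b"]
col2 = ["b", "b", "b"]

scores = [1.0 if x == y else 0.0 for x, y in zip(col1, col2)]
aggregate = sum(scores) / len(scores)
```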

Advanced Evaluation (requires pip install chatan[eval])

# Semantic similarity using sentence transformers
aggregate = ds.evaluate({
    "semantic_sim": ds.eval.semantic_similarity("col1", "col2")
})

# BLEU score evaluation
aggregate = ds.evaluate({
    "bleu": ds.eval.bleu_score("col1", "col2")
})
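As a rough intuition for what a BLEU-style score measures, here is a heavily simplified clipped unigram precision (real BLEU combines clipped n-gram precisions over several n with a brevity penalty; this is not chatan's or NLTK's implementation):

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that appear in the reference,
    with per-token counts clipped to the reference counts."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    overlap = sum(min(n, ref[tok]) for tok, n in cand.items())
    return overlap / max(sum(cand.values()), 1)

p = unigram_precision("the cat sat", "the cat sat down")
```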

Installation Options Summary

| Feature | Install Command | What's Included |
| --- | --- | --- |
| Basic | `pip install chatan` | OpenAI, Anthropic, core sampling, basic evaluation |
| Local Models | `pip install chatan[local]` | + HuggingFace Transformers, PyTorch |
| Advanced Eval | `pip install chatan[eval]` | + Semantic similarity, BLEU scores, NLTK |
| Everything | `pip install chatan[all]` | All features above |

Citation

If you use this code in your research, please cite:

@software{reetz2025chatan,
  author = {Reetz, Christian},
  title = {chatan: Create synthetic datasets with LLM generators.},
  url = {https://github.com/cdreetz/chatan},
  year = {2025}
}

Contributing

Community contributions are more than welcome: bug reports, bug fixes, feature requests, and feature additions. Please refer to the Issues tab.
