
Dataset Builder

Synthetic Conversations

Badges: GitHub · Hugging Face · Report an Error · License · My Website

Patch Notes (0.1 Beta, 5/15/2025):

  • Qwen3 0.6B was removed because it could not handle large inputs, such as when it had to review the other models' votes. Some Qwen outputs may still be present in the early lines of the dataset.

About

The Synthetic Conversations dataset is a collection of inputs and outputs that was generated and assembled entirely by AI language models. The models used include DeepSeek R1 Llama 70B Distil, Google's Gemini 2.0 Flash, Microsoft's Phi 3, and Qwen3 0.6B.


Development Process

This is a fully automated dataset: Google's Gemini 2.0 Flash asks complex questions, and the other AI models answer them. I used:

  • DeepSeek R1 Llama 70B Distil
  • Gemini 2.0 Flash
  • Phi 4 Reasoning
  • Qwen3 0.6B

Only the best response to each question is selected and added to the dataset. This is done by having all of the AI models vote on which output they think is best, with no model allowed to vote for its own output. A rough sketch of this selection loop follows.
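The sketch below illustrates, in simplified form, how one dataset record is produced. The function names (ask_question, get_answer, cast_vote) and the model list are placeholders for illustration; they are not the actual functions used in this repository.

```python
# Illustrative sketch of the generate / answer / vote loop.
# Function names and the model list are placeholders, not the repository's real API.

MODELS = ["deepseek", "gemini", "phi"]

def build_one_record(ask_question, get_answer, cast_vote):
    # 1. Gemini 2.0 Flash generates a complex question.
    question = ask_question()

    # 2. Every model answers the same question.
    answers = {model: get_answer(model, question) for model in MODELS}

    # 3. Each model votes for the best answer, excluding its own.
    votes = {}
    for voter in MODELS:
        candidates = {m: a for m, a in answers.items() if m != voter}
        choice = cast_vote(voter, question, candidates)
        votes[choice] = votes.get(choice, 0) + 1

    # 4. The answer with the most votes becomes the dataset record.
    winner = max(votes, key=votes.get)
    return {"user_input": question, "output": answers[winner]}
```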


File Map

The main dataset file is new_dataset.jsonl. Each record contains two keys:

  • "user_input": the question generated by Gemini 2.0 Flash
  • "output": the winning response

If you are looking for all of the prompts that were asked to generate the outputs, look in the asked.txt file.
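As a quick illustration, assuming the keys listed above, each line of new_dataset.jsonl is a standalone JSON object and can be read like this:

```python
import json

# Read the dataset one JSON object per line (JSONL format).
with open("new_dataset.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        print("Prompt:", record["user_input"])
        print("Reply: ", record["output"])
```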

The Resources folder contains static images, GIF assets, and the files for the tools that were used to create them. Resources/synthetic-conversations.gif and Resources/cluster-logo.png are used as the title images for this README.

The outputs folder stores the replies from all of the AI models after they answer the question generated by Gemini 2.0 Flash.

vote.txt is where the AI models write their votes; dataset_builder.py reads it to find the winner. A sketch of that tallying step follows.
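The exact on-disk format of vote.txt is not documented here, so the snippet below is only a sketch that assumes one vote (a model name) per line; the real logic in dataset_builder.py may differ.

```python
from collections import Counter

def find_winner(vote_file="vote.txt"):
    """Return the model name with the most votes.

    Assumes each non-empty line of vote.txt holds a single model name;
    the actual format used by dataset_builder.py may differ.
    """
    with open(vote_file, encoding="utf-8") as f:
        votes = [line.strip() for line in f if line.strip()]
    return Counter(votes).most_common(1)[0][0]
```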

dataset_builder.py is the main file. Running it starts building the dataset.

Each AI has its own .py file that is used to interact with it through its API. (Be sure to create your own .env file and add your API keys under the names used in the corresponding AI file!) A minimal wrapper sketch appears after the list below.

The AIs included in this version of the program are:

  • gemini.py: uses gemini-2.0-flash via the official Gemini API and the google-genai SDK.
  • deepseek.py: runs DeepSeek R1 Llama 70B Distil as a DigitalOcean AI Agent; interact with the agent through the openai SDK.
  • phi.py: uses the OpenRouter API and the openai SDK.
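As a rough illustration of the wrapper pattern, a phi.py-style client built on OpenRouter and the openai SDK might look like the sketch below. The environment variable name and model identifier are assumptions for illustration, not the repository's actual values.

```python
# Hypothetical sketch of a per-model wrapper in the style of phi.py.
# The .env variable name and model ID are assumptions, not the repo's real values.
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # load API keys from your local .env file

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],  # assumed variable name
)

def answer(question: str) -> str:
    """Send the Gemini-generated question to the model and return its reply."""
    response = client.chat.completions.create(
        model="microsoft/phi-4-reasoning",  # placeholder model ID
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content
```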

All dependencies can be found in and installed from requirements.txt.

