A powerful tool for creating fine-tuning datasets for Large Language Models
Features • Quick Start • Documentation • Contributing • License
If you like this project, please give it a Star⭐️, or buy the author a coffee => Donate ❤️!
Easy Dataset is an application specifically designed for creating fine-tuning datasets for Large Language Models (LLMs). It provides an intuitive interface for uploading domain-specific files, intelligently splitting content, generating questions, and producing high-quality training data for model fine-tuning.
With Easy Dataset, you can transform domain knowledge into structured datasets, compatible with all LLM APIs that follow the OpenAI format, making the fine-tuning process simple and efficient.
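Compatibility with OpenAI-format APIs just means requests follow the standard chat-completions shape. As a rough sketch (the endpoint URL and model name below are placeholders, not values from Easy Dataset):

```python
# Sketch of the OpenAI-format chat-completions request body that any
# compatible backend accepts. BASE_URL and MODEL are hypothetical placeholders.
BASE_URL = "https://api.example.com/v1/chat/completions"  # hypothetical endpoint
MODEL = "your-model-name"                                 # hypothetical model id

def build_chat_payload(system_prompt: str, user_question: str, model: str = MODEL) -> dict:
    """Assemble a standard OpenAI-format request body."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_question},
        ],
    }

payload = build_chat_payload("You are a helpful domain expert.", "What is fine-tuning?")
```

Any backend that accepts this body (and returns the matching response shape) can be configured as the project's LLM provider.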
- Intelligent Document Processing: Intelligently recognizes and processes multiple formats, including PDF, Markdown, and DOCX
- Intelligent Text Splitting: Supports multiple intelligent text-splitting algorithms and customizable visual segmentation
- Intelligent Question Generation: Extracts relevant questions from each text segment
- Domain Labels: Intelligently builds a global domain-label tree for datasets, giving them consistent, corpus-wide structure
- Answer Generation: Uses the configured LLM API to generate comprehensive answers and Chain of Thought (CoT) reasoning
- Flexible Editing: Edit questions, answers, and datasets at any stage of the process
- Multiple Export Formats: Export datasets in various formats (Alpaca, ShareGPT, multilingual-thinking) and file types (JSON, JSONL)
- Wide Model Support: Compatible with all LLM APIs that follow the OpenAI format
- User-Friendly Interface: Intuitive UI designed for both technical and non-technical users
- Custom System Prompts: Add custom system prompts to guide model responses
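Easy Dataset ships its own splitting algorithms; purely as an illustration of what length-bounded text splitting means (this is not the project's actual implementation), a minimal paragraph-packing splitter looks like:

```python
def split_text(text: str, max_len: int = 200) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks of at most
    max_len characters; a single paragraph longer than max_len becomes its
    own chunk. Illustrative only -- real splitters also respect headings,
    sentence boundaries, and overlap between chunks."""
    chunks, current = [], ""
    for para in filter(None, (p.strip() for p in text.split("\n\n"))):
        if current and len(current) + 2 + len(para) > max_len:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```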
 
| Windows | macOS | Linux |
| --- | --- | --- |
| Setup.exe | Intel / M (Apple Silicon) | AppImage |
- Clone the repository:

  ```bash
  git clone https://github.com/ConardLi/easy-dataset.git
  cd easy-dataset
  ```

- Install dependencies:

  ```bash
  npm install
  ```

- Build and start the server:

  ```bash
  npm run build
  npm run start
  ```

- Open your browser and visit http://localhost:1717
- Clone the repository:

  ```bash
  git clone https://github.com/ConardLi/easy-dataset.git
  cd easy-dataset
  ```

- Modify the `docker-compose.yml` file:

  ```yaml
  services:
    easy-dataset:
      image: ghcr.io/conardli/easy-dataset
      container_name: easy-dataset
      ports:
        - '1717:1717'
      volumes:
        - ./local-db:/app/local-db
        # - ./prisma:/app/prisma  # If mounting is required, manually initialize the database file first.
      restart: unless-stopped
  ```

  Note: Set the volume mount sources to the actual paths where you want to store the local database. It is recommended to use the `local-db` and `prisma` folders in the code repository directory, to stay consistent with the database paths used when starting via NPM.

  Note: If you need to mount the database file (`prisma`), run `npm run db:push` in advance to initialize the database file.

- Start with docker-compose:

  ```bash
  docker-compose up -d
  ```

- Open a browser and visit http://localhost:1717
If you want to build the image yourself, use the Dockerfile in the project root directory:
- Clone the repository:

  ```bash
  git clone https://github.com/ConardLi/easy-dataset.git
  cd easy-dataset
  ```

- Build the Docker image:

  ```bash
  docker build -t easy-dataset .
  ```

- Run the container:

  ```bash
  docker run -d \
    -p 1717:1717 \
    -v {YOUR_LOCAL_DB_PATH}:/app/local-db \
    -v {LOCAL_PRISMA_PATH}:/app/prisma \
    --name easy-dataset \
    easy-dataset
  ```

  Note: Replace `{YOUR_LOCAL_DB_PATH}` and `{LOCAL_PRISMA_PATH}` with the actual paths where you want to store the local database. It is recommended to use the `local-db` and `prisma` folders in the code repository directory, to stay consistent with the database paths used when starting via NPM.

- Open a browser and visit http://localhost:1717
- Click the "Create Project" button on the homepage;
 - Enter a project name and description;
 - Configure your preferred LLM API settings
 
- Upload your files in the "Text Split" section (supports PDF, Markdown, txt, DOCX);
 - View and adjust the automatically split text segments;
 - View and adjust the global domain tree
 
- Batch construct questions based on text blocks;
 - View and edit the generated questions;
 - Organize questions using the label tree
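Organizing questions by hierarchical domain labels amounts to grouping them under a tree of label paths. A minimal sketch of such a grouping (the label paths and questions here are made up for illustration):

```python
def build_label_tree(items: list[tuple[str, str]]) -> dict:
    """Nest questions under slash-separated label paths, e.g. 'Physics/Optics'.
    Leaf questions are collected in a '_questions' list at each node."""
    tree: dict = {}
    for path, question in items:
        node = tree
        for part in path.split("/"):
            node = node.setdefault(part, {})
        node.setdefault("_questions", []).append(question)
    return tree

tree = build_label_tree([
    ("Physics/Optics", "What is refraction?"),
    ("Physics/Optics", "Define focal length."),
    ("Physics/Mechanics", "State Newton's second law."),
])
```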
 
- Batch construct datasets based on questions;
 - Generate answers using the configured LLM;
 - View, edit, and optimize the generated answers
 
- Click the "Export" button in the Datasets section;
 - Choose your preferred format (Alpaca, ShareGPT, or multilingual-thinking);
 - Select the file format (JSON or JSONL);
 - Add custom system prompts as needed;
 - Export your dataset
 
- View the demo video of this project: Easy Dataset Demo Video
 - For detailed documentation on all features and APIs, visit our Documentation Site
 - View the paper of this project: Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents
 
- Easy Dataset × LLaMA Factory: Enabling LLMs to Efficiently Learn Domain Knowledge
 - Easy Dataset Practical Guide: How to Build High-Quality Datasets?
 - Interpretation of Key Feature Updates in Easy Dataset
 - Foundation Models Fine-tuning Datasets: Basic Knowledge Popularization
 
We welcome contributions from the community! If you'd like to contribute to Easy Dataset, please follow these steps:
- Fork the repository
- Create a new branch (`git checkout -b feature/amazing-feature`)
- Make your changes
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request (submit to the `dev` branch)
Please ensure that tests are appropriately updated and adhere to the existing coding style.
Contact us: https://docs.easy-dataset.com/geng-duo/lian-xi-wo-men
This project is licensed under the AGPL 3.0 License - see the LICENSE file for details.
If this work is helpful, please kindly cite as:
```bibtex
@misc{miao2025easydataset,
  title={Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents},
  author={Ziyang Miao and Qiyu Sun and Jingyuan Wang and Yuchen Gong and Yaowei Zheng and Shiqi Li and Richong Zhang},
  year={2025},
  eprint={2507.04009},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.04009}
}
```