ACE-Step Training Fork

Complete Beginner Guide for Dataset Processing & LoRA Training


🎯 This fork specializes in:

Automated Dataset Processing with Faster-Whisper
Clean LoRA Training CLIs with real-time dashboards
Complete Beginner Workflows from raw audio to trained model
Step-by-step automation - no manual file editing required

🏆 What Makes This Fork Special

This is a specialized fork of the original ACE-Step project, focused entirely on making training and dataset preparation accessible to beginners.

✨ Complete Automation

🤖 Automated dataset processing from raw audio files
🎤 Faster-Whisper integration for automatic transcription
⚙️ Auto-generated LoRA configs based on your dataset
🚀 One-command training with beautiful dashboards

🎯 Beginner-Friendly

📚 Step-by-step guides for every process
🖥️ Clean CLI interfaces with progress tracking
💡 Helpful tips and troubleshooting
🔧 No manual configuration required

🏗️ Professional Training Tools

📊 Real-time training dashboard with live metrics
💾 Resume capability for interrupted training
🎛️ Resource optimization for different hardware setups
📈 Built-in validation and progress tracking

🙏 Credits & Original Work

Original ACE-Step Project: ace-step/ACE-Step
Original Authors: Junmin Gong, Wenxiao Zhao, Sen Wang, Shengyuan Xu, Jing Guo
Organizations: ACE Studio and StepFun

This fork builds upon their excellent foundation model work, adding specialized tooling for training workflows. All credit for the core ACE-Step model and research goes to the original team.

🚀 Quick Start Guide

🎯 Complete Workflow: Raw Audio → Trained Model

This fork provides a 3-step automated workflow from raw audio files to a trained ACE-Step model:

# Step 1: Process your audio files into training dataset
python -m dataset_cli_tool

# Step 2: Train your LoRA model with real-time dashboard
python train_cli_advanced.py --dataset_path ./prepared_dataset --lora_config_path ./lora_config.json

# Step 3: Use your trained model for music generation
acestep --port 7865

That's it! The tools handle everything else automatically.

🎬 What Each Step Does

📁 Step 1: Dataset Processing

Scans your audio files (MP3, WAV, FLAC, etc.)
Converts to consistent format
Uses Faster-Whisper to generate transcriptions automatically
Creates training-ready dataset structure
Generates optimized LoRA configuration

🏋️ Step 2: LoRA Training

Beautiful real-time dashboard with live metrics
Automatic checkpointing and resume capability
GPU optimization and resource management
Progress tracking with ETA calculations

🎵 Step 3: Music Generation

Load your trained LoRA adapter
Generate music in your custom style
Web interface for easy interaction
Export and share your creations

📦 Installation

🎯 One-Command Setup

# Clone the repository
git clone https://github.com/WebChatAppAi/ACE-Step.git
cd ACE-Step

# Create environment and install
conda create -n ace_step python=3.10 -y
conda activate ace_step
pip install -e .

📋 System Requirements

Minimum Requirements:

Python 3.10+
8GB RAM
CUDA-compatible GPU (GTX 1660+ or RTX series)
20GB free disk space

Recommended Setup:

Python 3.10
16GB+ RAM
RTX 3090/4090 or A100
50GB+ SSD storage
Fast internet for model downloads

Training Requirements:

CUDA 11.8+ or 12.1+
12GB+ VRAM for LoRA training
FFmpeg (for dataset processing)

🔧 Additional Dependencies

For dataset processing, install:

# Quote the version specifiers so the shell does not treat ">" as a redirect
pip install "faster-whisper>=1.0.0" "rich>=13.0.0" loguru librosa soundfile

Windows users need FFmpeg:

Download from FFmpeg.org
Add to system PATH
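
Before starting, it can help to confirm that the basics from the requirements list are in place. The following is a minimal sketch, not part of the fork: the function name `preflight` and the thresholds are illustrative and simply mirror the minimum requirements listed above (Python 3.10+, FFmpeg on PATH, 20GB of free disk space). It does not check the GPU or VRAM.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)     # "Python 3.10+" from the minimum requirements
MIN_FREE_DISK_GB = 20    # "20GB free disk space"


def preflight(path="."):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < MIN_PYTHON:
        found = sys.version.split()[0]
        problems.append(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, found {found}")
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH (needed for dataset processing)")
    free_gb = shutil.disk_usage(path).free / 1024**3
    if free_gb < MIN_FREE_DISK_GB:
        problems.append(f"only {free_gb:.1f} GB free at {path!r}, need {MIN_FREE_DISK_GB} GB")
    return problems


for problem in preflight():
    print("WARNING:", problem)
```

If nothing is printed, all three checks passed for the current directory.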

🎯 Dataset Processing

🚀 Automated Dataset Tool

Our dataset CLI tool handles everything from raw audio to training-ready dataset:

# Interactive dataset processing
python -m dataset_cli_tool

What it does:

🔍 Scans for audio files (MP3, WAV, FLAC, M4A, etc.)
🔄 Converts to consistent format
🎤 Generates transcriptions with Faster-Whisper
✅ Validates dataset structure
⚙️ Creates optimized LoRA configuration

📁 Manual Dataset Processing

If you prefer step-by-step control:

# Step 1: Scan audio files
python -m dataset_cli_tool scan --path /your/audio/folder

# Step 2: Convert audio format
python -m dataset_cli_tool convert --input /audio --output /dataset --format mp3

# Step 3: Generate transcriptions
python -m dataset_cli_tool transcribe --input /dataset --model distil-large-v3

# Step 4: Generate LoRA config
python -m dataset_cli_tool generate-lora --dataset /dataset
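
The four manual steps can also be chained from a small Python script. This is a sketch under assumptions: the helper names `build_pipeline` and `run_pipeline` are hypothetical, the subcommands and flags are copied from the commands above, and it assumes the folder you scan is the same folder you convert. It only prints the commands unless `dry_run=False` is passed.

```python
import subprocess
import sys


def build_pipeline(audio_dir, dataset_dir, audio_format="mp3", whisper_model="distil-large-v3"):
    """Return the four documented subcommands as argument lists, in order."""
    base = [sys.executable, "-m", "dataset_cli_tool"]
    return [
        base + ["scan", "--path", audio_dir],
        base + ["convert", "--input", audio_dir, "--output", dataset_dir, "--format", audio_format],
        base + ["transcribe", "--input", dataset_dir, "--model", whisper_model],
        base + ["generate-lora", "--dataset", dataset_dir],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; when dry_run is False, run it and stop at the first failure."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # raises CalledProcessError on a non-zero exit


# Dry run: shows what would be executed without touching any files.
run_pipeline(build_pipeline("/your/audio/folder", "/dataset"))
```

Using `check=True` means a failed transcription stops the script before a LoRA config is generated from an incomplete dataset.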

🎤 Faster-Whisper Models

Model	Speed	Quality	VRAM	Best For
`distil-large-v3`	⚡⚡⚡	🌟🌟🌟🌟🌟	6GB	Recommended
`large-v3`	⚡	🌟🌟🌟🌟🌟	10GB	Best quality
`base`	⚡⚡⚡⚡	🌟🌟🌟	1GB	Low VRAM
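
To make the table concrete, here is a small sketch that picks a model name from the free VRAM you have. The function `pick_whisper_model` is hypothetical and not part of the dataset tool; the VRAM figures are the ones from the table, and the selection order (recommended model first, `large-v3` only on request) is an assumption.

```python
# Approximate VRAM needs in GB, copied from the table above.
WHISPER_VRAM_GB = {"distil-large-v3": 6, "large-v3": 10, "base": 1}


def pick_whisper_model(free_vram_gb, best_quality=False):
    """Pick a model name from the table that fits in the given free VRAM (GB)."""
    if best_quality and free_vram_gb >= WHISPER_VRAM_GB["large-v3"]:
        return "large-v3"
    if free_vram_gb >= WHISPER_VRAM_GB["distil-large-v3"]:
        return "distil-large-v3"
    if free_vram_gb >= WHISPER_VRAM_GB["base"]:
        return "base"
    raise ValueError(f"{free_vram_gb} GB is below the smallest listed model (base, 1 GB)")


print(pick_whisper_model(8))                      # distil-large-v3
print(pick_whisper_model(24, best_quality=True))  # large-v3
print(pick_whisper_model(2))                      # base
```

The returned name can be passed to the `--model` option of the `transcribe` step.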

Complete dataset processing guide: DATASET_CLI_GUIDE.md

🎵 Training Your Model

🎨 Simple Training (Basic CLI)

python train_cli.py --dataset_path ./prepared_dataset --lora_config_path ./lora_config.json

Clean, organized output with progress tracking:

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃                    Training Configuration                    ┃
┃ Dataset Path:     ./prepared_dataset                        ┃
┃ LoRA Config:      ./lora_config.json                        ┃
┃ Learning Rate:    1.00e-04                                  ┃
┃ Max Steps:        2,000,000                                 ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

Step: 1000/2000000 (0.1%) | Loss: 0.4532 | LR: 1.00e-04

📊 Advanced Training (Dashboard CLI)

python train_cli_advanced.py --dataset_path ./prepared_dataset --lora_config_path ./lora_config.json

Beautiful real-time dashboard:

╔═══════════════════════════════════════════════════════════╗
║           🎵 ACE-Step Training Dashboard 🎵               ║
╚═══════════════════════════════════════════════════════════╝

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃         Progress              ┃      Metrics             ┃
┃ Step:     1,234/2,000,000    ┃ Total Loss:    0.4532    ┃
┃ Progress: 0.1%               ┃ Denoising:     0.3421    ┃
┃ Speed:    1.45 steps/s       ┃ Learning Rate: 1.00e-04  ┃
┃ ETA:      12h 34m            ┃ VRAM Usage:    8.2GB     ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━━┛

⚡ Quick Training Examples

# High-end GPU (RTX 4090, A100)
python train_cli.py --dataset_path ./dataset --lora_config_path ./config.json \
                    --batch_size 4 --precision 16

# Mid-range GPU (RTX 3080, 3090)
python train_cli.py --dataset_path ./dataset --lora_config_path ./config.json \
                    --batch_size 2 --precision 16

# Lower VRAM (GTX 1660, RTX 3060)
python train_cli.py --dataset_path ./dataset --lora_config_path ./config.json \
                    --batch_size 1 --accumulate_grad_batches 4
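
The lower-VRAM example trades memory for time: gradients from several small batches are accumulated before each optimizer step. The sketch below only does the arithmetic. It assumes `--accumulate_grad_batches` defaults to 1 when the flag is omitted, which this README does not state.

```python
def effective_batch_size(batch_size, accumulate_grad_batches=1):
    """Number of samples that contribute to each optimizer step."""
    return batch_size * accumulate_grad_batches


# (label, batch_size, accumulate_grad_batches) from the three examples above.
# Accumulation of 1 is assumed where the example does not pass the flag.
PRESETS = [
    ("High-end GPU", 4, 1),
    ("Mid-range GPU", 2, 1),
    ("Lower VRAM", 1, 4),
]

for label, batch_size, accum in PRESETS:
    total = effective_batch_size(batch_size, accum)
    print(f"{label}: {batch_size} x {accum} = {total} samples per optimizer step")
```

Under that assumption, the lower-VRAM preset reaches the same effective batch size as the high-end preset (4), at the cost of four forward/backward passes per step.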

Complete training guide and examples: CLI_GUIDE.md

🖥️ Using Your Trained Model

🚀 Launch ACE-Step with Your LoRA

# Basic usage
acestep --port 7865

# With optimizations
acestep --torch_compile true --cpu_offload true --overlapped_decode true --port 7865

🎛️ Load Your Custom Model

In the web interface:

Navigate to the Settings tab
Find LoRA Settings
Upload your trained .safetensors file
Set LoRA scale (usually 0.7-1.0)
Generate music with your custom style!

💡 Performance Tips

# Memory optimization (8GB VRAM)
acestep --cpu_offload true --overlapped_decode true

# Speed optimization (High-end GPU)
acestep --torch_compile true --bf16 true

# Windows users (need triton)
pip install triton-windows

📚 Detailed Guides

This fork provides comprehensive documentation for every aspect:

📖 Core Guides

CLI Training Guide (CLI_GUIDE.md) - Complete training CLI reference with examples
Dataset Processing Guide (DATASET_CLI_GUIDE.md) - Audio-to-dataset automation
Original Training Guide (TRAIN_INSTRUCTION.md) - Traditional training approach

🎯 Quick References

Troubleshooting - Common issues and solutions
Performance Tuning - Hardware optimization
LoRA Configuration - Custom training settings
Model Integration - Using trained models

🔧 Advanced Topics

Multi-GPU training setup
Dataset format specifications
Custom model architectures
Evaluation and validation

🤝 Community & Support

💬 Get Help

Discord: ACE-Step Community
GitHub Issues: Report bugs and request features
Discussions: Share your trained models and results

🤲 Contributing

This fork welcomes contributions to:

Improve CLI interfaces
Add dataset processing features
Enhance training workflows
Fix bugs and optimize performance

📢 Share Your Work

We'd love to see what you create:

Share your trained models
Post your generated music
Help other beginners learn

📄 Original Research

🎯 Baseline Quality

🌈 Diverse Styles & Genres

🎸 Supports all mainstream music styles with various description formats including short tags, descriptive text, or use-case scenarios
🎷 Capable of generating music across different genres with appropriate instrumentation and style

🌍 Multiple Languages

🗣️ Supports 19 languages; the 10 best-performing are:
- 🇺🇸 English, 🇨🇳 Chinese, 🇷🇺 Russian, 🇪🇸 Spanish, 🇯🇵 Japanese, 🇩🇪 German, 🇫🇷 French, 🇵🇹 Portuguese, 🇮🇹 Italian, 🇰🇷 Korean
⚠️ Due to data imbalance, less common languages may underperform

🎻 Instrumental Styles

🎹 Supports various instrumental music generation across different genres and styles
🎺 Capable of producing realistic instrumental tracks with appropriate timbre and expression for each instrument
🎼 Can generate complex arrangements with multiple instruments while maintaining musical coherence

🎤 Vocal Techniques

🎙️ Capable of rendering various vocal styles and techniques with good quality
🗣️ Supports different vocal expressions including various singing techniques and styles

🎛️ Controllability

🔄 Variations Generation

⚙️ Implemented using training-free, inference-time optimization techniques
🌊 Flow-matching model generates initial noise, then uses trigFlow's noise formula to add additional Gaussian noise
🎚️ Adjustable mixing ratio between original initial noise and new Gaussian noise to control variation degree
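
The mixing step can be illustrated with plain Python lists. This is a sketch, not the ACE-Step implementation: it assumes the mix takes the trigonometric form `cos(a) * original + sin(a) * fresh` with `a = variance * pi / 2`, which keeps unit variance when the two noises are independent. The function name `mix_noise` and the mapping from the ratio to an angle are assumptions.

```python
import math
import random


def mix_noise(original, fresh, variance):
    """Blend two equally long noise vectors; variance in [0, 1] sets how far to move."""
    if not 0.0 <= variance <= 1.0:
        raise ValueError("variance must be between 0 and 1")
    angle = variance * math.pi / 2
    # cos^2 + sin^2 = 1, so independent unit-variance inputs give unit-variance output.
    return [math.cos(angle) * o + math.sin(angle) * f for o, f in zip(original, fresh)]


rng = random.Random(0)
original = [rng.gauss(0.0, 1.0) for _ in range(4)]
fresh = [rng.gauss(0.0, 1.0) for _ in range(4)]

same = mix_noise(original, fresh, 0.0)  # variance 0 reproduces the original noise
new = mix_noise(original, fresh, 1.0)   # variance 1 is (numerically) the fresh noise
half = mix_noise(original, fresh, 0.5)  # in between: a related but different starting point
```

A small variance therefore yields a take close to the original, while a variance near 1 behaves like a new seed.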

🎨 Repainting

🖌️ Implemented by adding noise to the target audio input and applying mask constraints during the ODE process
🔍 When input conditions change from the original generation, only specific aspects can be modified while preserving the rest
🔀 Can be combined with Variations Generation techniques to create localized variations in style, lyrics, or vocals
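
The mask constraint can be pictured on a toy sequence of frames. The sketch below is illustrative only: `repaint_mask` and `apply_mask` are hypothetical helpers, the frame layout is an assumption, and the real model applies the constraint to latents at each ODE step rather than once to a list.

```python
import math


def repaint_mask(num_frames, duration_s, start_s, end_s):
    """Return a 0/1 list with 1 for frames that fall inside [start_s, end_s)."""
    if not 0 <= start_s < end_s <= duration_s:
        raise ValueError("need 0 <= start_s < end_s <= duration_s")
    frames_per_second = num_frames / duration_s
    first = math.floor(start_s * frames_per_second)
    last = math.ceil(end_s * frames_per_second)
    return [1 if first <= i < last else 0 for i in range(num_frames)]


def apply_mask(source, generated, mask):
    """Keep source frames where the mask is 0 and take generated frames where it is 1."""
    return [g if m else s for s, g, m in zip(source, generated, mask)]


mask = repaint_mask(num_frames=10, duration_s=10.0, start_s=2.0, end_s=5.0)
print(mask)  # [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]
print(apply_mask(["s"] * 10, ["g"] * 10, mask))
```

Only the frames between the start and end times come from the new generation; everything outside the window is carried over from the source.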

✏️ Lyric Editing

💡 Innovatively applies flow-edit technology to enable localized lyric modifications while preserving melody, vocals, and accompaniment
🔄 Works with both generated content and uploaded audio, greatly enhancing creative possibilities
ℹ️ Current limitation: can only modify small segments of lyrics at once to avoid distortion, but multiple edits can be applied sequentially

🚀 Applications

🎤 Lyric2Vocal (LoRA)

🔊 Based on a LoRA fine-tuned on pure vocal data, allowing direct generation of vocal samples from lyrics
🛠️ Offers numerous practical applications such as vocal demos, guide tracks, songwriting assistance, and vocal arrangement experimentation
⏱️ Provides a quick way to test how lyrics might sound when sung, helping songwriters iterate faster

📝 Text2Samples (LoRA)

🎛️ Similar to Lyric2Vocal, but fine-tuned on pure instrumental and sample data
🎵 Capable of generating conceptual music production samples from text descriptions
🧰 Useful for quickly creating instrument loops, sound effects, and musical elements for production

🔮 Coming Soon

🎤 RapMachine

🔥 Fine-tuned on pure rap data to create an AI system specialized in rap generation
🏆 Expected capabilities include AI rap battles and narrative expression through rap
📚 Rap has exceptional storytelling and expressive capabilities, offering extraordinary application potential

🎛️ StemGen

🎚️ A controlnet-lora trained on multi-track data to generate individual instrument stems
🎯 Takes a reference track and specified instrument (or instrument reference audio) as input
🎹 Outputs an instrument stem that complements the reference track, such as creating a piano accompaniment for a flute melody or adding jazz drums to a lead guitar

🎤 Singing2Accompaniment

🔄 The reverse process of StemGen, generating a mixed master track from a single vocal track
🎵 Takes a vocal track and specified style as input to produce a complete vocal accompaniment
🎸 Creates full instrumental backing that complements the input vocals, making it easy to add professional-sounding accompaniment to any vocal recording

📋 Roadmap

Release training code 🔥
Release LoRA training code 🔥
Release RapMachine LoRA 🎤
Release evaluation performance and technical report 📄
Train and Release ACE-Step V1.5
Release ControlNet training code 🔥
Release Singing2Accompaniment ControlNet 🎮

🖥️ Hardware Performance

We have evaluated ACE-Step across different hardware setups, yielding the following throughput results:

Device	RTF (27 steps)	Time to render 1 min audio (27 steps)	RTF (60 steps)	Time to render 1 min audio (60 steps)
NVIDIA RTX 4090	34.48 ×	1.74 s	15.63 ×	3.84 s
NVIDIA A100	27.27 ×	2.20 s	12.27 ×	4.89 s
NVIDIA RTX 3090	12.76 ×	4.70 s	6.48 ×	9.26 s
MacBook M2 Max	2.27 ×	26.43 s	1.03 ×	58.25 s

We use RTF (Real-Time Factor) to measure the performance of ACE-Step; higher values indicate faster generation. For example, an RTF of 27.27× means that generating 1 minute of music takes about 2.2 seconds (60 / 27.27).
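
The relationship between RTF and render time is a single division. The snippet below recomputes the 27-step render times from the RTF values in the table; the function name `render_seconds` is illustrative.

```python
def render_seconds(rtf, audio_seconds=60.0):
    """Wall-clock seconds needed to generate audio_seconds of audio at a given RTF."""
    return audio_seconds / rtf


# RTF at 27 steps, copied from the table above.
RTF_27_STEPS = {
    "NVIDIA RTX 4090": 34.48,
    "NVIDIA A100": 27.27,
    "NVIDIA RTX 3090": 12.76,
    "MacBook M2 Max": 2.27,
}

for device, rtf in RTF_27_STEPS.items():
    print(f"{device}: {render_seconds(rtf):.2f} s per minute of audio")
```

The same function works for longer tracks, for example `render_seconds(12.76, audio_seconds=240)` for a four-minute song on an RTX 3090.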

📦 Installation

1. Clone the Repository

First, clone the ACE-Step repository to your local machine and navigate into the project directory:

git clone https://github.com/WebChatAppAi/ACE-Step.git
cd ACE-Step

2. Prerequisites

Ensure you have the following installed:

Python: Version 3.10 or later is recommended. You can download it from python.org.
Conda or venv: For creating a virtual environment (Conda is recommended).

3. Set Up a Virtual Environment

It is highly recommended to use a virtual environment to manage project dependencies and avoid conflicts. Choose one of the following methods:

Option A: Using Conda

Create the environment named ace_step with Python 3.10:
```
conda create -n ace_step python=3.10 -y
```
Activate the environment:
```
conda activate ace_step
```

Option B: Using venv

Navigate to the cloned ACE-Step directory.
Create the virtual environment (commonly named venv):
```
python -m venv venv 
```
Activate the environment:
- On Windows (cmd.exe):
```
venv\Scripts\activate.bat
```
- On Windows (PowerShell):
```
.\venv\Scripts\Activate.ps1 
```
  (If you encounter execution policy errors, you might need to run Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope Process first)
- On Linux / macOS (bash/zsh):
```
source venv/bin/activate
```

4. Install Dependencies

Once your virtual environment is activated:

a. (Windows Only) If you are on Windows and plan to use an NVIDIA GPU, install PyTorch with CUDA support first:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

(Adjust cu126 if you have a different CUDA version. For other PyTorch installation options, refer to the official PyTorch website).

b. Install ACE-Step and its core dependencies:

pip install -e .

The ACE-Step application is now installed. The GUI works on Windows, macOS, and Linux. For instructions on how to run it, please see the Usage section.

🚀 Usage

📥 Model Download (Recommended First Step)

Download models to a custom location with full control:

# Download full model (7.2 GB)
python modeldownloader.py --output_dir ./models

# Download quantized model (2.5 GB, faster)
python modeldownloader.py --output_dir ./models --quantized

# Download to custom path
python modeldownloader.py --output_dir /path/to/my/models

The model downloader ensures proper directory structure and verifies all components.

🔍 Basic Usage

# Use downloaded models
acestep --checkpoint_path ./models --port 7865

# Auto-download (if no models specified)
acestep --port 7865

⚙️ Advanced Usage

acestep --checkpoint_path ./models --port 7865 --device_id 0 --share true --bf16 true

Model Loading Priority:

If --checkpoint_path is set and models exist at the path, load from checkpoint_path.
If --checkpoint_path is set but models do not exist at the path, auto download models to checkpoint_path.
If --checkpoint_path is not set, auto download models to the default path ~/.cache/ace-step/checkpoints.
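
The three rules above reduce to "pick a directory, then download only if it has no models". Here is a sketch of that decision, not the actual loader: `resolve_checkpoint` and `has_models` are hypothetical names, "models exist" is approximated as "the directory is non-empty", and reuse of an already populated default cache is assumed.

```python
from pathlib import Path

# Default location named in the rules above.
DEFAULT_CHECKPOINT_DIR = Path.home() / ".cache" / "ace-step" / "checkpoints"


def has_models(directory):
    """Assumption: a non-empty directory counts as 'models exist'."""
    directory = Path(directory)
    return directory.is_dir() and any(directory.iterdir())


def resolve_checkpoint(checkpoint_path=None):
    """Return (directory, needs_download) following the three rules above."""
    if checkpoint_path:
        target = Path(checkpoint_path).expanduser()
    else:
        target = DEFAULT_CHECKPOINT_DIR
    return target, not has_models(target)


directory, needs_download = resolve_checkpoint("./models")
print(directory, "-> download" if needs_download else "-> load from disk")
```

A real check would look for the specific model files rather than any file, so treat the non-empty test as a placeholder.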

Note: Use python modeldownloader.py for reliable downloads with progress tracking.

If you are using macOS, please use --bf16 false to avoid errors.

🔍 API Usage

If you intend to integrate ACE-Step as a library into your own Python projects, you can install the latest version directly from GitHub using the following pip command.

Direct Installation via pip:

Ensure Git is installed: This method requires Git to be installed on your system and accessible in your system's PATH.
Execute the installation command:
```
pip install git+https://github.com/WebChatAppAi/ACE-Step.git
```
It's recommended to use this command within a virtual environment to avoid conflicts with other packages.

🛠️ Command Line Arguments

--checkpoint_path: Path to the model checkpoint (default: downloads automatically)
--server_name: IP address or hostname for the Gradio server to bind to (default: '127.0.0.1'). Use '0.0.0.0' to make it accessible from other devices on the network.
--port: Port to run the Gradio server on (default: 7865)
--device_id: GPU device ID to use (default: 0)
--share: Enable Gradio sharing link (default: False)
--bf16: Use bfloat16 precision for faster inference (default: True)
--torch_compile: Use torch.compile() to optimize the model, speeding up inference (default: False).
- Windows users need to install triton:
```
pip install triton-windows
```
--cpu_offload: Offload model weights to CPU to save GPU memory (default: False)
--overlapped_decode: Use overlapped decoding to speed up inference (default: False)
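
If you launch ACE-Step from scripts, the flags above can be assembled programmatically. This is a sketch: `build_launch_command` is a hypothetical helper, it uses only flags documented in this README, and it applies the macOS advice (`--bf16 false`) and the low-VRAM combination (`--cpu_offload` plus `--overlapped_decode`) from the sections above.

```python
import platform
import shlex


def build_launch_command(checkpoint_path=None, port=7865, low_vram=False, system=None):
    """Assemble an `acestep` command line from the documented flags."""
    system = system or platform.system()
    cmd = ["acestep", "--port", str(port)]
    if checkpoint_path:
        cmd += ["--checkpoint_path", checkpoint_path]
    if system == "Darwin":
        # macOS users are advised to disable bfloat16 to avoid errors.
        cmd += ["--bf16", "false"]
    if low_vram:
        # Memory-saving combination suggested for 8GB VRAM cards.
        cmd += ["--cpu_offload", "true", "--overlapped_decode", "true"]
    return cmd


print(shlex.join(build_launch_command("./models", low_vram=True)))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues with paths that contain spaces.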

📱 User Interface Guide

The ACE-Step interface provides several tabs for different music generation and editing tasks:

📝 Text2Music Tab

📋 Input Fields:
- 🏷️ Tags: Enter descriptive tags, genres, or scene descriptions separated by commas
- 📜 Lyrics: Enter lyrics with structure tags like [verse], [chorus], and [bridge]
- ⏱️ Audio Duration: Set the desired duration of the generated audio (-1 for random)
⚙️ Settings:
- 🔧 Basic Settings: Adjust inference steps, guidance scale, and seeds
- 🔬 Advanced Settings: Fine-tune scheduler type, CFG type, ERG settings, and more
🚀 Generation: Click "Generate" to create music based on your inputs
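
Structure tags go on their own line, each followed by the lyric lines of that section. The sketch below shows one plausible layout and a small checker that splits it into sections. The lyrics are placeholders and `split_sections` is a hypothetical helper, not something the interface exposes.

```python
import re

# A structure tag is a line such as [verse], [chorus] or [bridge].
TAG_PATTERN = re.compile(r"^\[(\w+)\]\s*$")


def split_sections(lyrics):
    """Return a list of (tag, lines) pairs in order of appearance."""
    sections = []
    for raw in lyrics.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = TAG_PATTERN.match(line)
        if match:
            sections.append((match.group(1).lower(), []))
        elif sections:
            sections[-1][1].append(line)
        else:
            # Lines before the first tag are kept under a placeholder label.
            sections.append(("untagged", [line]))
    return sections


LYRICS = """\
[verse]
City lights are fading slow
[chorus]
Hold on, hold on
[bridge]
One more night
"""

for tag, lines in split_sections(LYRICS):
    print(tag, len(lines))
```

Running a check like this before pasting lyrics into the Lyrics field makes it easy to spot a missing or misspelled tag.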

🔄 Retake Tab

🎲 Regenerate music with slight variations using different seeds
🎚️ Adjust variance to control how much the retake differs from the original

🎨 Repainting Tab

🖌️ Selectively regenerate specific sections of the music
⏱️ Specify start and end times for the section to repaint
🔍 Choose the source audio (text2music output, last repaint, or upload)

✏️ Edit Tab

🔄 Modify existing music by changing tags or lyrics
🎛️ Choose between "only_lyrics" mode (preserves melody) or "remix" mode (changes melody)
🎚️ Adjust edit parameters to control how much of the original is preserved

📏 Extend Tab

➕ Add music to the beginning or end of an existing piece
📐 Specify left and right extension lengths
🔍 Choose the source audio to extend

📂 Examples

The examples/input_params directory contains sample input parameters that can be used as references for generating music.

📜 License & Disclaimer

This project is licensed under the Apache License 2.0.

ACE-Step enables original music generation across diverse genres, with applications in creative production, education, and entertainment. While designed to support positive and artistic use cases, users should be mindful of ethical considerations and intellectual property rights in the application of this technology.

📖 Citation

If you find this project useful for your research, please consider citing the original work:

@misc{gong2025acestep,
	title={ACE-Step: A Step Towards Music Generation Foundation Model},
	author={Junmin Gong and Wenxiao Zhao and Sen Wang and Shengyuan Xu and Jing Guo},
	howpublished={\url{https://github.com/ace-step/ACE-Step}},
	year={2025},
	note={GitHub repository}
}
