ACE-Step 1.5 - One Click to Run AI on Your Own Computer

Features

Open Source, Music

System Requirements

16GB RAM recommended. 28GB+ storage recommended.
macOS 15+: M-series chips required.
Windows 10/11: Intel/AMD GPUs supported, NVIDIA GPU recommended.
Note: For NVIDIA GPUs, update to a recent driver version before running.

Introduction

ACE-Step 1.5 is an open-source music generation foundation model co-led by ACE Studio and StepFun. It is one of the leading music generation models that can run locally, and its generation quality exceeds that of most commercial music generation models. It runs smoothly on consumer-grade hardware and gives musicians, producers and content creators an efficient AI tool for making music.

Core Functions & Product Features

Ultra-fast Generation & Flexible Duration: A complete song takes less than 2 seconds to generate on an A100 GPU and less than 10 seconds on an RTX 3090. Audio length can range from 10 seconds to 10 minutes, and up to 8 songs can be generated in a single batch.
Commercial-grade Audio Quality & Rich Styles: Audio quality falls between Suno v4.5 and Suno v5. The model supports more than 1000 instruments and styles with fine-grained timbre descriptions, and handles lyrics in more than 50 languages.
Full-scenario Creation Control: Reference audio can guide the generation style. The model also supports cover creation from existing audio, selective repainting and editing of audio regions, track separation, multi-track layering and automatic BGM generation for vocal tracks. Metadata such as duration, BPM, key/scale and time signature can be set precisely, and a complete song can be generated directly from a simple text description. The AI also optimizes tags and lyrics automatically.
Lightweight Personalized Training: One-click LoRA training is available in the Gradio interface. Only 8 reference songs are needed, and a custom style model can be trained in about 1 hour on an RTX 3090 using roughly 12GB of VRAM.
Intelligent Audio Parsing: The model can extract BPM, key/scale and time signature from audio and generate a text description of it. It can also produce timestamped LRC lyrics for generated music, and a built-in audio quality score helps creators assess their results.
Low VRAM Requirement & Hardware Compatibility: Local inference requires less than 4GB of VRAM. CUDA, MPS and Intel XPU devices are supported, as is CPU-only operation.
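
The limits listed above (10 seconds to 10 minutes per clip, up to 8 songs per batch, controllable BPM, key/scale and time signature) can be checked before a generation job is submitted. The following is a minimal sketch in Python, not the ACE-Step API: the `GenerationRequest` class, its field names and the `problems` method are hypothetical names introduced for illustration, and only the numeric limits come from the feature list.

```python
from dataclasses import dataclass
from typing import List, Optional

# Limits taken from the feature list; all names below are hypothetical.
MIN_DURATION_S = 10        # shortest supported clip: 10 seconds
MAX_DURATION_S = 10 * 60   # longest supported clip: 10 minutes
MAX_BATCH_SIZE = 8         # up to 8 songs per batch


@dataclass
class GenerationRequest:
    """Illustrative container for user-controllable settings (not the real API)."""
    prompt: str
    duration_s: float = 60.0
    batch_size: int = 1
    bpm: Optional[int] = None             # e.g. 120
    key_scale: Optional[str] = None       # e.g. "C major"
    time_signature: Optional[str] = None  # e.g. "4/4"

    def problems(self) -> List[str]:
        """Return human-readable problems; an empty list means the request fits the limits."""
        issues: List[str] = []
        if not self.prompt.strip():
            issues.append("prompt is empty")
        if not MIN_DURATION_S <= self.duration_s <= MAX_DURATION_S:
            issues.append(
                f"duration must be {MIN_DURATION_S}-{MAX_DURATION_S} seconds"
            )
        if not 1 <= self.batch_size <= MAX_BATCH_SIZE:
            issues.append(f"batch size must be 1-{MAX_BATCH_SIZE}")
        if self.bpm is not None and self.bpm <= 0:
            issues.append("bpm must be positive")
        return issues


if __name__ == "__main__":
    request = GenerationRequest(
        prompt="warm lo-fi piano with soft vinyl crackle",
        duration_s=90,
        batch_size=4,
        bpm=80,
        key_scale="A minor",
        time_signature="4/4",
    )
    print(request.problems())
```

Returning a list of problems instead of raising on the first one lets a user interface show every out-of-range setting at once.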

Underlying Core Technology

The model uses a hybrid architecture that combines a Language Model (LM) with a Diffusion Transformer (DiT):

Language Model (LM): Fine-tuned from the Qwen3 series and available in three sizes (0.6B, 1.7B and 4B). It acts as an "omni-capable planner": using Chain-of-Thought, it turns a simple user request into a complete song blueprint and generates the lyrics, metadata and text description. It also handles audio understanding and query rewriting.
Diffusion Transformer (DiT): Available in four versions (base, sft, turbo and turbo-rl). It carries out the core tasks of audio generation, cover creation and repainting according to the blueprint from the language model. The turbo version produces high-quality audio in only 8 diffusion steps, which greatly increases generation speed.
Unique Alignment Method: The LM and DiT are aligned through intrinsic reinforcement learning, with no external reward model or human preference annotations, which avoids the biases that such external signals can introduce.
Engineering Optimization: The model supports INT8 quantization, CPU offloading and both vLLM and PyTorch backends, and it defines a GPU compatibility tier system for GPUs with different VRAM capacities. It also uses PEFT, TorchAO and related tools for efficient training and inference.
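
As an illustration of how a VRAM-based compatibility tier system could combine the LM sizes, INT8 quantization and CPU offloading mentioned above, here is a minimal Python sketch. It is not the project's actual tier table: the `Tier` type, the `pick_tier` function and every VRAM threshold are hypothetical values chosen only to show the idea.

```python
from typing import NamedTuple


class Tier(NamedTuple):
    """Hypothetical settings for one VRAM tier (illustrative only)."""
    min_vram_gb: float   # smallest VRAM size that qualifies for this tier
    lm_size: str         # one of the LM sizes named above: "0.6B", "1.7B", "4B"
    int8: bool           # whether to use INT8 quantization
    cpu_offload: bool    # whether to offload idle weights to CPU memory


# Ordered from most to least capable. Thresholds are assumptions, not
# values published by the ACE-Step project.
TIERS = [
    Tier(min_vram_gb=24.0, lm_size="4B", int8=False, cpu_offload=False),
    Tier(min_vram_gb=12.0, lm_size="1.7B", int8=False, cpu_offload=False),
    Tier(min_vram_gb=6.0, lm_size="0.6B", int8=True, cpu_offload=False),
    Tier(min_vram_gb=0.0, lm_size="0.6B", int8=True, cpu_offload=True),
]


def pick_tier(vram_gb: float) -> Tier:
    """Return the first (most capable) tier whose threshold the GPU meets."""
    if vram_gb < 0:
        raise ValueError("vram_gb must be non-negative")
    for tier in TIERS:
        if vram_gb >= tier.min_vram_gb:
            return tier
    return TIERS[-1]  # unreachable with a 0.0 floor, kept as a safe default


if __name__ == "__main__":
    for vram in (4, 8, 16, 24):
        print(vram, pick_tier(vram))
```

Keeping the tiers in a single ordered table means supporting a new class of GPU only requires adding a row, not changing the selection logic.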

GitHub: https://github.com/ace-step/ACE-Step-1.5

License: MIT