ACE-Step 1

Turns lyrics into songs in seconds, generates music by style keywords.

Features

Open Source, Music

System Requirements

Minimum 16GB RAM. 10GB+ storage recommended.
macOS 15+: M-series chips required.
Windows 10/11: NVIDIA GPU with 8GB+ VRAM required.
Note: For NVIDIA GPUs, install a newer driver.
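
The requirements above can be expressed as a small pre-install check. This is a minimal sketch, not part of ACE-Step: the `Machine` fields and the `check_requirements` function are hypothetical names, and the values must be supplied by the caller because the sketch does not probe the hardware itself.

```python
from dataclasses import dataclass

# Thresholds copied from the System Requirements list.
MIN_RAM_GB = 16           # "Minimum 16GB RAM"
RECOMMENDED_DISK_GB = 10  # "10GB+ storage recommended"
MIN_MACOS_MAJOR = 15      # "macOS 15+"
MIN_VRAM_GB = 8           # "NVIDIA GPU with 8GB+ VRAM"


@dataclass
class Machine:
    """Hypothetical description of a target machine (filled in by the caller)."""
    os_name: str                 # "macos" or "windows"
    os_major: int                # e.g. 15 for macOS 15, 10 or 11 for Windows
    ram_gb: float
    free_disk_gb: float
    apple_silicon: bool = False  # True for M-series chips
    nvidia_vram_gb: float = 0.0  # 0 means no NVIDIA GPU present


def check_requirements(m: Machine) -> tuple[list[str], list[str]]:
    """Return (errors, warnings) for the given machine description."""
    errors: list[str] = []
    warnings: list[str] = []

    if m.ram_gb < MIN_RAM_GB:
        errors.append(f"RAM: {m.ram_gb} GB < {MIN_RAM_GB} GB minimum")
    # Storage is only a recommendation, so it produces a warning, not an error.
    if m.free_disk_gb < RECOMMENDED_DISK_GB:
        warnings.append(f"Disk: {m.free_disk_gb} GB free < {RECOMMENDED_DISK_GB} GB recommended")

    if m.os_name == "macos":
        if m.os_major < MIN_MACOS_MAJOR:
            errors.append(f"macOS {m.os_major} < macOS {MIN_MACOS_MAJOR}")
        if not m.apple_silicon:
            errors.append("M-series chip required on macOS")
    elif m.os_name == "windows":
        if m.os_major not in (10, 11):
            errors.append("Windows 10 or 11 required")
        if m.nvidia_vram_gb < MIN_VRAM_GB:
            errors.append(f"NVIDIA VRAM: {m.nvidia_vram_gb} GB < {MIN_VRAM_GB} GB minimum")
    else:
        errors.append(f"Unsupported OS: {m.os_name}")

    return errors, warnings


if __name__ == "__main__":
    errors, warnings = check_requirements(
        Machine(os_name="windows", os_major=11, ram_gb=32, free_disk_gb=50, nvidia_vram_gb=6)
    )
    print("errors:", errors)
    print("warnings:", warnings)
```

An empty error list means the described machine meets the listed minimums; the driver note is not checked because the page does not name a specific driver version.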

Introduction

ACE-Step is a novel open-source foundation model for music generation that overcomes key limitations of existing approaches and achieves state-of-the-art performance through a holistic architectural design. Current methods face inherent trade-offs between generation speed, musical coherence, and controllability. For instance, LLM-based models (e.g., YuE, SongGen) excel at lyric alignment but suffer from slow inference and structural artifacts. Diffusion models (e.g., DiffRhythm), on the other hand, enable faster synthesis but often lack long-range structural coherence.

ACE-Step bridges this gap by integrating diffusion-based generation with Sana’s Deep Compression AutoEncoder (DCAE) and a lightweight linear transformer. It further leverages MERT and m-hubert to align semantic representations (REPA) during training, enabling rapid convergence. As a result, the model synthesizes up to 4 minutes of music in just 20 seconds on an A100 GPU—15× faster than LLM-based baselines—while achieving superior musical coherence and lyric alignment across melody, harmony, and rhythm metrics. Moreover, ACE-Step preserves fine-grained acoustic details, enabling advanced control mechanisms such as voice cloning, lyric editing, remixing, and track generation (e.g., lyric2vocal, singing2accompaniment).

The vision is not to build yet another end-to-end text-to-music pipeline but to establish a foundation model for music AI: a fast, general-purpose, efficient yet flexible architecture that makes it easy to train sub-tasks on top of it. This paves the way for developing powerful tools that seamlessly integrate into the creative workflows of music artists, producers, and content creators. In short, the aim is to build the Stable Diffusion moment for music.

Features

  • Text2Music Foundation Model: Supports music generation from text, including Lyric2Vocal (LoRA) and Text2Samples (LoRA).
  • Diverse Styles & Genres: Supports all mainstream music styles and generates music across genres with appropriate instrumentation and style.
  • Multiple Languages: Supports 19 languages; the 10 best-performing include English, Chinese, Russian, and Spanish.
  • Instrumental Styles: Generates instrumental music across genres and styles, producing realistic tracks with appropriate timbre and expression.
  • Vocal Techniques: Renders a variety of vocal styles, techniques, and expressions with good quality.
  • Controllability: Includes functions such as Variations Generation, Repainting, and Lyric Editing, allowing users to fine-tune the generated music.
  • Applications: Covers applications such as Lyric2Vocal, Text2Samples, RapMachine, StemGen, and Singing2Accompaniment.

Hardware Performance

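ACE-Step was evaluated on several hardware setups. RTF (real-time factor) is the ratio of audio duration to render time, so higher values mean faster generation; for example, 27.27× means that 1 minute of audio renders in about 2.2 seconds. The results are as follows: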

Device          | RTF (27 steps) | Time to render 1 min audio (27 steps) | RTF (60 steps) | Time to render 1 min audio (60 steps)
NVIDIA RTX 4090 | 34.48×         | 1.74 s                                | 15.63×         | 3.84 s
NVIDIA A100     | 27.27×         | 2.20 s                                | 12.27×         | 4.89 s
NVIDIA RTX 3090 | 12.76×         | 4.70 s                                | 6.48×          | 9.26 s
MacBook M2 Max  | 2.27×          | 26.43 s                               | 1.03×          | 58.25 s
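
The two column types in the table are related by simple arithmetic: render time equals audio duration divided by RTF. The sketch below shows that conversion in both directions; the function names are illustrative and not part of ACE-Step, and the constants are copied from the A100 and RTX 4090 rows.

```python
def render_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimate render time from audio duration and real-time factor (RTF)."""
    if rtf <= 0:
        raise ValueError("RTF must be positive")
    return audio_seconds / rtf


def rtf_from_timing(audio_seconds: float, seconds_taken: float) -> float:
    """Compute RTF from a measured render time."""
    if seconds_taken <= 0:
        raise ValueError("render time must be positive")
    return audio_seconds / seconds_taken


if __name__ == "__main__":
    # A100 at 27 steps: RTF 27.27 gives roughly 2.2 s per minute of audio.
    print(f"{render_seconds(60, 27.27):.2f} s")

    # RTX 4090 at 27 steps: 1.74 s per minute of audio gives RTF of about 34.48.
    print(f"{rtf_from_timing(60, 1.74):.2f}x")

    # A 4-minute track on an A100 at 60 steps (RTF 12.27).
    print(f"{render_seconds(4 * 60, 12.27):.1f} s")
```

The last estimate comes out just under 20 seconds, which appears to be in line with the "4 minutes of music in 20 seconds on an A100" figure quoted in the Introduction, assuming that figure refers to the 60-step setting.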