ACE-Step 1.5 XL

Upgraded model architecture with greatly improved sound quality and song integrity, plus support for ultra-long durations and multilingual creation.

Features

Open Source, Music

System Requirements

32GB RAM recommended. 50GB+ storage recommended.
Windows 10/11 64-bit: NVIDIA GPU with 16GB+ VRAM required.
Note: NVIDIA GPUs need a recent driver; update it if yours is out of date.

Introduction

ACE‑Step 1.5 is an open‑source, high‑performance music foundation model co‑developed by StepFun and ACE Studio, designed to deliver commercial‑grade music generation on consumer hardware. Its latest flagship, ACE‑Step 1.5 XL, brings major upgrades in audio quality, creativity, and inference efficiency.

1. Key Highlights & Hardware Requirements of ACE‑Step 1.5 XL

  • Key Highlights
    • Larger LM+DiT architecture for stronger consistency in structure, melody, and lyrics, with quality near top commercial music models.
    • Ultra‑fast inference: <2 seconds per full song on A100, <10 seconds on RTX 3090; supports batch generation of up to 8 tracks.
    • Flexible duration: generates music from 10 seconds up to 10 minutes.
    • Multilingual support: works with lyrics and prompts in 50+ languages.
    • Native reinforcement learning alignment: uses internal mechanisms without external reward models, reducing biases.
  • Higher Hardware Requirements
    • For XL, ≥24GB VRAM is recommended so that the 4B LM and the large DiT can load and run at full speed without offloading.
    • With 16–24GB VRAM, the 1.7B LM can run with moderate offloading; less than 12GB VRAM is not recommended.
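
The VRAM tiers above can be read as a simple lookup. The following is a minimal Python sketch of that reading, not part of ACE‑Step itself: the `Suggestion` class and `suggest_config` function are hypothetical names introduced for illustration. Two points are assumptions: exactly 24GB is treated as belonging to the 4B tier, and the 12–16GB range is reported as "no guidance" because the text does not cover it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Suggestion:
    lm_size: Optional[str]  # "4B", "1.7B", or None when no LM tier is suggested
    offloading: str         # "none", "moderate", or "unknown"
    note: str


def suggest_config(vram_gb: float) -> Suggestion:
    """Map available VRAM (in GB) to the tiers listed above (hypothetical helper)."""
    if vram_gb < 0:
        raise ValueError("vram_gb must be non-negative")
    if vram_gb >= 24:
        # Assumption: exactly 24GB counts as the top tier.
        return Suggestion("4B", "none", "4B LM + large DiT at full speed")
    if vram_gb >= 16:
        return Suggestion("1.7B", "moderate", "1.7B LM with moderate offloading")
    if vram_gb >= 12:
        # The text gives no guidance for 12-16GB, so none is invented here.
        return Suggestion(None, "unknown", "no guidance given for this range")
    return Suggestion(None, "unknown", "not recommended")


if __name__ == "__main__":
    for gb in (8, 12, 16, 24, 48):
        print(gb, suggest_config(gb))
```

Note that the System Requirements section lists 16GB+ VRAM as required on Windows, so the two lowest branches are informational only on that platform.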

2. Easy‑to‑Understand Core Features

  • Text‑to‑Music: Generate full songs with lyrics from simple text descriptions.
  • Cover Generation: Create stylized covers from reference audio.
  • Repaint Editing: Modify specific sections without breaking the whole track.
  • Vocal‑to‑BGM: Auto‑generate accompaniment from vocal tracks.
  • Multi‑track Layering: Add melodies, drums, bass, etc. like building blocks.
  • Track Separation: Split songs into vocal, guitar, drum, bass stems.
  • One‑click LoRA: Train custom styles from ~8 songs in about 1 hour.
  • Metadata Control: Set duration, BPM, key, time signature precisely.
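
To make the Metadata Control item concrete, here is a small Python sketch of a request object that carries those settings and checks them against the limits stated earlier (10 seconds to 10 minutes, up to 8 tracks per batch). It is an illustration only: `TrackRequest`, its field names, and the `problems` method are hypothetical and do not reflect ACE‑Step's actual interface. The BPM, key, and time‑signature checks are generic sanity checks, not documented constraints.

```python
from dataclasses import dataclass
from typing import List

MIN_DURATION_S = 10       # "from 10 seconds"
MAX_DURATION_S = 10 * 60  # "up to 10 minutes"
MAX_BATCH_SIZE = 8        # "batch generation of up to 8 tracks"


@dataclass
class TrackRequest:
    # All field names are hypothetical, chosen for this sketch.
    prompt: str
    duration_s: float
    bpm: int
    key: str
    time_signature: str = "4/4"
    batch_size: int = 1

    def problems(self) -> List[str]:
        """Return a list of human-readable issues; empty means the request looks sane."""
        issues = []
        if not MIN_DURATION_S <= self.duration_s <= MAX_DURATION_S:
            issues.append("duration_s must be between 10 and 600 seconds")
        if not 1 <= self.batch_size <= MAX_BATCH_SIZE:
            issues.append("batch_size must be between 1 and 8")
        if self.bpm <= 0:
            issues.append("bpm must be positive")
        if not self.key.strip():
            issues.append("key must not be empty")
        parts = self.time_signature.split("/")
        if len(parts) != 2 or not all(p.isdecimal() and int(p) > 0 for p in parts):
            issues.append("time_signature must look like '4/4'")
        return issues


if __name__ == "__main__":
    request = TrackRequest("lo-fi piano ballad", 180, 90, "C minor", "4/4", 4)
    print(request.problems())
```

Collecting every issue in a list, rather than raising on the first one, lets a UI show all invalid fields at once.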

3. Underlying Technology & Use Cases

  • Technology: An LM (based on Qwen3) acts as the planner, using chain‑of‑thought to draft song blueprints; a DiT handles audio synthesis; alignment uses intrinsic RL; INT8 quantization and CPU offloading are supported; acceleration is available across multiple backends (vLLM/MLX/ROCm/XPU).
  • Scenarios: Music production, content creation soundtracks, game/film audio, education, personal DIY music, and AI music tool development.