Step-Audio-EditX

Zero-shot voice cloning and expressive editing of emotion, style, and paralinguistic cues

Features

Open Source, TTS

System Requirements

32 GB RAM and at least 20 GB of free storage recommended.
Windows 10/11: an NVIDIA GPU with at least 12 GB of VRAM is required.
Note: For NVIDIA GPUs, update to a recent driver before installing.

Introduction

Step-Audio-EditX is an open-source large audio model for speech generation and editing, developed by the StepFun AI team. It is designed for expressive, iterative speech editing and is meant to be usable by non-technical users: provide a short reference audio clip and a text instruction, and the model adjusts the emotion, speaking style, or paralinguistic cues of the speech accordingly.

For the complete list of officially supported tags, see "Step-Audio-EditX Usage Tips and Installation Guide".

🎯 User-Friendly Features

  • Zero-Shot Voice Cloning (TTS): Upload any short audio sample as a “voice template,” and synthesize new text in that exact voice. Supports Mandarin, English, Sichuanese, Cantonese, and more.
  • Emotion Editing: Add tags like [Happy], [Sad], or [Angry] to instantly infuse the desired emotion into the synthesized speech.
  • Speaking Style Control: Choose from dozens of styles, including whisper, childlike, serious, exaggerated, and act_coy, to make speech more vivid and natural.
  • Paralinguistic Editing: Precisely insert human-like nuances such as breathing, laughter, surprised “oh!”, hesitant “uhm”, or sighs for lifelike expressiveness.
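
As a rough illustration of the tag-based emotion editing described above, the sketch below prefixes input text with a bracketed emotion tag. The helper name `add_emotion_tag`, the tag set, and the placement of the tag at the start of the text are assumptions made for illustration only; they are not part of Step-Audio-EditX, and the official guide remains the reference for supported tags and their exact syntax.

```python
# Hypothetical helper for illustration; not part of Step-Audio-EditX.
# Only the three emotion tags named in the feature list are included here.
KNOWN_EMOTION_TAGS = {"Happy", "Sad", "Angry"}


def add_emotion_tag(text: str, emotion: str) -> str:
    """Return the text prefixed with a bracketed emotion tag."""
    if emotion not in KNOWN_EMOTION_TAGS:
        raise ValueError(f"unknown emotion tag: {emotion!r}")
    # Assumption: the tag goes in front of the text, separated by a space.
    return f"[{emotion}] {text.strip()}"


print(add_emotion_tag("We won the match!", "Happy"))
```

Rejecting unknown tags early, as this sketch does, is one way to catch typos before any audio is synthesized.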

⚙️ Technical Architecture & Advantages

Built on a 3-billion-parameter Audio Large Language Model (Audio LLM), Step-Audio-EditX integrates three core components:

  1. Dual-Codebook Audio Tokenizer: Efficiently converts raw audio into discrete symbolic sequences.
  2. Audio LLM: Generates new token sequences conditioned on both text instructions and reference audio.
  3. Flow Matching-Based Audio Decoder: Reconstructs high-fidelity, natural-sounding waveforms from token sequences.
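
The data flow through the three components listed above can be sketched as follows. This is a minimal sketch of the order of operations only: `EditPipeline` and the `toy_*` functions are hypothetical stand-ins invented for this example, and their bodies do not reproduce the real dual-codebook tokenizer, Audio LLM, or flow-matching decoder.

```python
from dataclasses import dataclass
from typing import Callable, List

Waveform = List[float]  # raw audio samples
Tokens = List[int]      # discrete audio tokens


@dataclass
class EditPipeline:
    """Hypothetical wrapper that chains the three stages in order."""
    tokenize: Callable[[Waveform], Tokens]       # stage 1: audio -> tokens
    generate: Callable[[str, Tokens], Tokens]    # stage 2: instruction + tokens -> new tokens
    decode: Callable[[Tokens], Waveform]         # stage 3: tokens -> audio

    def edit(self, instruction: str, reference: Waveform) -> Waveform:
        ref_tokens = self.tokenize(reference)
        new_tokens = self.generate(instruction, ref_tokens)
        return self.decode(new_tokens)


# Toy stand-ins (assumptions) so that the sketch runs end to end.
def toy_tokenize(wave: Waveform) -> Tokens:
    return [round(sample * 10) for sample in wave]


def toy_generate(instruction: str, ref_tokens: Tokens) -> Tokens:
    # A real model would condition on the instruction; this toy ignores it.
    return list(reversed(ref_tokens))


def toy_decode(tokens: Tokens) -> Waveform:
    return [token / 10 for token in tokens]


pipeline = EditPipeline(toy_tokenize, toy_generate, toy_decode)
print(pipeline.edit("[Happy] hello", [0.1, 0.2, 0.3]))
```

Keeping the stages as separate callables mirrors the component split in the list above and lets each stage be swapped or tested independently.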