Zero-shot voice cloning and expressive editing of emotion, style, and paralinguistic cues
32GB RAM recommended. 20GB+ storage recommended.
Windows 10/11: NVIDIA GPU with 12GB+ VRAM required.
Note: For NVIDIA GPUs, install a newer driver.
Step-Audio-EditX is an open-source large model for audio generation and editing, developed by the StepFun AI team. Designed for expressive and iterative speech editing, it lets even non-technical users modify voice characteristics: provide a short reference audio clip and a text instruction, and the model adjusts emotion, speaking style, or paralinguistic cues.
For the complete list of officially supported tags, see "Step-Audio-EditX Usage Tips and Installation Guide".
Use a tag such as [Happy], [Sad], or [Angry] to instantly infuse the desired emotion into the synthesized speech.
Built on a 3-billion-parameter Audio Large Language Model (Audio LLM), Step-Audio-EditX integrates three core components: