A speech synthesis system that generates multi-speaker conversations with voice cloning and multilingual support.
Minimum 16GB RAM. The app is slow to start, so an SSD is recommended. 30GB+ storage recommended.
macOS 15+: M-series chips required.
Windows 10/11: CPU supported, NVIDIA GPU recommended.
Note: For NVIDIA GPUs, install a newer driver.
FireRedTTS-2 is an advanced speech synthesis system developed by the audio technology team of Xiaohongshu. It is primarily designed for podcast generation and interactive chatbot scenarios.
For general users or beginners, it easily enables the following functions:
FireRedTTS-2 introduces two main technical innovations that set it apart from traditional TTS systems:
Efficient Speech Tokenizer: This component acts like a "speech compressor," converting continuous speech signals into a compact sequence of discrete symbols (tokens) at a low frame rate of 12.5 Hz. Fewer frames per second of audio means less computation per second, which speeds up generation and lets the system handle longer conversations.
Text-Speech Interleaved Sequence & Dual-Transformer Architecture: This is the core innovation. Instead of processing text and speech separately, the system interleaves each speaker's text with its corresponding speech tokens in chronological order, forming a single, cohesive sequence. A large 1.5B-parameter Transformer models the broad context and rhythm of the conversation, while a smaller 0.2B-parameter Transformer refines acoustic details. Working together, the two models produce more stable and natural speaker switching and contextually appropriate prosody in the generated dialogue.
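To make the two ideas above more concrete, here is a minimal Python sketch. It is not FireRedTTS-2's actual code or API. The only value taken from the description is the 12.5 Hz frame rate. Everything else is an illustrative assumption: the `Turn` class, the `("speaker", ...)`, `("text", ...)`, and `("speech", ...)` entry format, whitespace word splitting in place of a real text tokenizer, rounding frame counts up, and one placeholder entry per speech frame (a real tokenizer may emit several codes per frame).

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

FRAME_RATE_HZ = 12.5  # frame rate stated in the description above


def num_speech_frames(duration_s: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of token frames needed to cover a clip (rounding up is an assumption)."""
    return math.ceil(duration_s * frame_rate_hz)


@dataclass
class Turn:
    """Hypothetical container for one dialogue turn."""
    speaker: str
    text: str
    duration_s: float


def interleave(turns: List[Turn]) -> List[Tuple[str, str]]:
    """Build one sequence: for each turn in chronological order, a speaker
    marker, then that turn's text, then that turn's speech frames."""
    sequence: List[Tuple[str, str]] = []
    for turn in turns:
        sequence.append(("speaker", turn.speaker))
        # Whitespace splitting stands in for a real text tokenizer.
        for word in turn.text.split():
            sequence.append(("text", word))
        # Placeholder entries stand in for real speech tokens.
        for i in range(num_speech_frames(turn.duration_s)):
            sequence.append(("speech", f"{turn.speaker}_frame_{i}"))
    return sequence


dialogue = [
    Turn("S1", "hello there", 2.0),
    Turn("S2", "hi", 1.0),
]
sequence = interleave(dialogue)
print(num_speech_frames(60.0))  # frames needed for one minute of audio
print(len(sequence))            # total entries in the interleaved sequence
```

At 12.5 Hz, one minute of audio needs only 750 frames, which is why a low frame rate keeps long conversations within a manageable sequence length. In the sketch, each speaker's text sits directly before that speaker's speech frames, so a model reading the sequence left to right sees the full earlier dialogue (text and audio) as context for the next turn.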