FireRedTTS2

A speech synthesis system that generates multi-speaker conversations with voice cloning and multilingual support.

Features

Open Source, TTS

System Requirements

Minimum 16 GB RAM. The app starts slowly, so an SSD is recommended. 30 GB or more of storage is recommended.
macOS 15+: M-series chips required.
Windows 10/11: CPU supported, NVIDIA GPU recommended.
Note: For NVIDIA GPUs, update to a recent driver version.

Introduction

FireRedTTS-2 is an advanced speech synthesis system developed by the audio technology team of Xiaohongshu. It is primarily designed for podcast generation and interactive chatbot scenarios.

For general users and beginners, it offers the following capabilities:

  • AI Podcast Production: Generate dialogue audio from text alone, with up to 4 speakers and up to 3 minutes per dialogue, and with the potential to scale to longer dialogues and more speakers.
  • Voice Cloning: Use its "zero-shot cloning" capability to mimic a voice from only a short audio sample, even without recording equipment.
  • Multilingual Support: Synthesizes speech in Chinese as well as English, Japanese, Korean, French, German, and Russian.
  • Real-time Speech Interaction: Delivers low-latency voice responses in chat applications, with first speech segment latency as low as 140 ms, so conversations feel fluid with almost no waiting.

What Are Its Technical Features?

FireRedTTS-2 introduces two main technical innovations that set it apart from traditional TTS systems:

  1. Efficient Speech Tokenizer: This component acts like a "speech compressor," converting continuous speech signals into a sequence of discrete symbols (tokens) at a low frame rate of 12.5 Hz. A lower frame rate means fewer frames per second of audio, so sequences are shorter for the same duration. This reduces the computation needed per second of audio, which speeds up generation and lets the system handle longer conversations.

  2. Text-Speech Interleaved Sequence & Dual-Transformer Architecture: This is the core innovation. Instead of processing text and speech separately, the system interleaves each speaker's text with its corresponding speech tokens in chronological order to form a single, cohesive sequence. A large 1.5B-parameter Transformer models the broad context and rhythm of the conversation, while a smaller 0.2B-parameter Transformer refines acoustic details. Together, they produce more stable and natural speaker switching and contextually appropriate prosody in the generated dialogue.
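
The two ideas above can be illustrated with a minimal Python sketch. It is not the FireRedTTS-2 implementation. The `Turn` class, the `[S1]`-style speaker tags, the `<speech:...>` placeholders, and the whitespace text splitting are all hypothetical stand-ins chosen for illustration. Only the 12.5 Hz frame rate and the idea of a single chronological text-speech sequence come from the description above.

```python
import math
from dataclasses import dataclass

FRAME_RATE_HZ = 12.5  # tokenizer frame rate stated in the text


@dataclass
class Turn:
    """One dialogue turn (hypothetical structure, for illustration only)."""
    speaker: str       # speaker label, e.g. "S1"
    text: str          # what the speaker says
    duration_s: float  # length of the spoken audio in seconds
    start_s: float     # when the turn starts within the conversation


def num_speech_frames(duration_s: float) -> int:
    """Number of token frames needed to cover duration_s at 12.5 Hz."""
    return math.ceil(duration_s * FRAME_RATE_HZ)


def interleave(turns: list[Turn]) -> list[str]:
    """Build one sequence: for each turn in chronological order, emit a
    speaker tag, the text tokens, then that turn's speech-frame placeholders."""
    seq: list[str] = []
    for turn in sorted(turns, key=lambda t: t.start_s):
        seq.append(f"[{turn.speaker}]")
        # Whitespace splitting stands in for a real text tokenizer.
        seq.extend(turn.text.split())
        # Placeholders stand in for the discrete speech tokens of each frame.
        n_frames = num_speech_frames(turn.duration_s)
        seq.extend(f"<speech:{turn.speaker}:{i}>" for i in range(n_frames))
    return seq


if __name__ == "__main__":
    # A 3-minute dialogue spans 180 s * 12.5 frames/s = 2250 frames.
    print(num_speech_frames(180))

    turns = [
        Turn("S2", "Hi there", duration_s=1.0, start_s=2.0),
        Turn("S1", "Hello", duration_s=2.0, start_s=0.0),
    ]
    print(interleave(turns)[:4])
```

Because the sequence length grows linearly with the frame rate, a low rate such as 12.5 Hz keeps even a multi-minute, multi-speaker dialogue within a few thousand frames in this sketch.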