IndexTTS 2

An upgraded TTS system featuring multilingual support, real-time style switching and efficient inference

Features

Open Source, TTS

System Requirements

Minimum 16 GB RAM; 24 GB or more of storage recommended.
macOS 15+: requires an Apple M-series chip.
Windows 10/11: runs on CPU, but an NVIDIA GPU with 8 GB+ VRAM is recommended.
Note: For NVIDIA GPUs, install a recent driver.

Introduction

IndexTTS2 is an industrial-grade, controllable, and efficient zero-shot Text-to-Speech (TTS) system developed by Bilibili. It converts text into natural, fluent speech in both Chinese and English, and is aimed at everyday users as well as developers who want to build on top of it.

1. Upgraded Core Features

Functions and Features

  1. Pronunciation Correction: For Chinese text, pinyin annotations can override the pronunciation of individual characters, which makes it quick to fix mispronounced polyphonic or rare characters.
  2. Pause Control: Punctuation marks control pauses at any position in the text, so the speech follows natural, everyday speaking rhythm.
  3. Zero-Shot Voice Cloning: It synthesizes speech in a consistent voice from speaker reference features alone, without training on samples of the target voice. A small amount of reference information is enough to replicate a specific voice style.
  4. High-Quality Voice Output: It combines a Conformer conditioning encoder with a BigVGAN2-based speech code decoder, which improves training stability, timbre similarity, and sound quality. Synthesized speech reaches a MOS (Mean Opinion Score) of 4.01, indicating high naturalness and fidelity to the human voice.

Technical Advantages

  1. Hybrid Modeling Method: A character-pinyin hybrid modeling method, optimized specifically for Chinese, handles polyphonic and rare characters and gives the model better control over sentence rhythm and intonation.
  2. Rich Data: It is trained on tens of thousands of hours of varied speech data, which supports diversity and consistency of the synthesized voices in both content and timbre.
  3. Leading Performance: In multiple tests, it outperforms mainstream TTS systems such as XTTS, CosyVoice2, Fish-Speech, and F5-TTS on metrics such as Word Error Rate (WER) and Speaker Similarity (SS).
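
For reference, Word Error Rate is conventionally defined as the word-level edit distance between a reference transcript and the recognized transcript of the synthesized audio, divided by the number of reference words. The sketch below is a generic implementation of that definition, not the evaluation script behind the results mentioned above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the ref prefix processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```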

Key Features and Upgrades from Version 1.x

IndexTTS2 is a next-generation large speech model released by Bilibili and officially open-sourced on September 8, 2025. Compared with Version 1.x, it adds the following features and upgrades:

  • Precise Duration Control: IndexTTS2 brings precise duration control to an autoregressive architecture for the first time and supports two generation modes. In the first, the user explicitly specifies the number of tokens to generate, which fixes the output duration. In the second, the model generates freely while preserving the prosodic features of the input prompt. Version 1.x offers no duration control. This makes IndexTTS2 particularly effective in scenarios that require strict audio-visual synchronization, such as film and television dubbing, where it achieves a synchronization error of less than 0.02%.

  • Decoupling of Timbre and Emotion: The model decouples emotional features from speaker timbre, so users can specify the timbre source and the emotion source independently. For example, they can keep the timbre from one audio clip and take the emotion from a different clip or from a text description. Under zero-shot conditions, the model reproduces the target timbre accurately while fully restoring the specified emotion. Version 1.x lacks this capability, which makes it less flexible in combining emotional expression and timbre.

  • Multiple Emotion Control Methods: IndexTTS2 introduces four emotion control methods: an emotional reference audio, an emotion vector, an emotional text description, and the default method, which reuses the timbre reference audio as the emotion source. Users can choose among them to regulate the emotional expression of the synthesized speech precisely. Version 1.x has comparatively limited emotion control options.

  • Text-Driven Emotion Control: A built-in T2E (Text-to-Emotion) module, fine-tuned from the Qwen3 model, converts natural language descriptions into emotion vectors. Users can therefore drive the emotional expression of the synthesized speech simply by typing a description (e.g., "questioning angrily"), which significantly lowers the barrier to use. Version 1.x does not appear to offer a comparable text-driven control.

  • Integration of GPT Latent Representations: IndexTTS2 integrates GPT latent representations and uses a three-stage training strategy. This improves the stability and clarity of speech in highly emotional passages, mitigates insufficient data and overfitting, and makes the synthesized results more natural and fluent. Version 1.x, by contrast, may articulate unclearly when expressing strong emotions.

  • Performance Improvements: Experiments on multiple datasets show that IndexTTS2 outperforms current state-of-the-art zero-shot TTS models in word error rate (WER), speaker similarity, and emotion fidelity. For instance, the WER of IndexTTS2 is 1.883%, compared with 1.921% for Version 1.x, an absolute reduction of 0.038 percentage points (about 2% in relative terms).
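
For the Precise Duration Control item, the explicit-token mode implies some conversion from a target clip length to a token count. The sketch below shows one plausible way to do that conversion, assuming a fixed token rate. The helper names and the rate of 25 tokens per second are hypothetical placeholders, not published IndexTTS2 values.

```python
def tokens_for_duration(target_seconds: float, tokens_per_second: float) -> int:
    """Convert a target clip length into a token budget (hypothetical helper)."""
    if target_seconds <= 0 or tokens_per_second <= 0:
        raise ValueError("duration and token rate must be positive")
    return max(1, round(target_seconds * tokens_per_second))


def relative_duration_error(target_seconds: float, token_count: int,
                            tokens_per_second: float) -> float:
    """Relative gap between the requested length and the length the budget implies."""
    actual_seconds = token_count / tokens_per_second
    return abs(actual_seconds - target_seconds) / target_seconds


# 25 tokens per second is a placeholder rate, not a published IndexTTS2 value.
RATE = 25.0
budget = tokens_for_duration(3.2, RATE)
print(budget, relative_duration_error(3.2, budget, RATE))
```

Under this assumption, rounding to a whole number of tokens is the only source of mismatch, so the error is bounded by half a token's duration.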
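
For the timbre and emotion items, the four emotion control methods can be pictured as a simple choice of where the emotion comes from, with the timbre reference as the fallback. The sketch below models that choice. The class, the field names, and the priority order among the optional inputs are assumptions for illustration and do not reflect the actual IndexTTS2 API.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class EmotionRequest:
    """Hypothetical request object; field names are illustrative only."""
    speaker_audio: str                                # timbre reference, always required
    emotion_audio: Optional[str] = None               # separate emotional reference clip
    emotion_vector: Optional[Sequence[float]] = None  # explicit emotion vector
    emotion_text: Optional[str] = None                # e.g. "questioning angrily"


def resolve_emotion_source(req: EmotionRequest) -> tuple[str, object]:
    """Pick one emotion source; the priority order here is an assumption."""
    if req.emotion_vector is not None:
        return ("vector", list(req.emotion_vector))
    if req.emotion_text:
        return ("text", req.emotion_text)
    if req.emotion_audio:
        return ("audio", req.emotion_audio)
    # Default method: reuse the timbre reference as the emotion reference.
    return ("speaker_audio", req.speaker_audio)


print(resolve_emotion_source(EmotionRequest("voice.wav")))
print(resolve_emotion_source(EmotionRequest("voice.wav", emotion_text="questioning angrily")))
```

Keeping the timbre reference mandatory and the emotion inputs optional mirrors the decoupling described in the list: the voice identity is fixed first, and the emotion is layered on from whichever source is supplied.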
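
For the Performance Improvements item, the gap between the two WER figures can be read as an absolute or a relative difference. The short calculation below shows both, using only the two numbers quoted in that item.

```python
wer_v1 = 1.921  # percent, Version 1.x (figure quoted in the list)
wer_v2 = 1.883  # percent, IndexTTS2 (figure quoted in the list)

absolute_drop = wer_v1 - wer_v2               # difference in percentage points
relative_drop = absolute_drop / wer_v1 * 100  # difference as a percent of the 1.x value

print(f"absolute: {absolute_drop:.3f} percentage points")
print(f"relative: {relative_drop:.2f}%")
```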