Spark-TTS is an efficient text-to-speech (TTS) system built on a large language model (LLM). It was jointly developed by several institutions, including the Hong Kong University of Science and Technology, Shanghai Jiao Tong University, Northwestern Polytechnical University, and NetEase Fuxi AI Lab, together with independent researchers, and has been deployed commercially by Mobvoi. Its main features and functions are described below:
- Core Technological Innovations
- BiCodec Voice "Track Separation" Technology: BiCodec decomposes speech into semantic tokens and global tokens, much like multi-track recording. Semantic tokens capture "what is said", encoding the linguistic content at a low rate of 50 tokens per second, which helps keep the generated speech semantically accurate. Global tokens capture "how it is said", encoding attributes such as the speaker's timbre and intonation in a fixed-length code. The model can then combine content and style freely, like mixing from a "voice color palette", which gives fine-grained control over speech generation while remaining efficient.
- VoxBox Voice Dataset: The research team built VoxBox, an open-source speech dataset containing 100,000 hours of speech. It covers multiple languages and scenarios and carries detailed attribute annotations such as gender, pitch, and speaking rate. The data is strictly cleaned, and its quality is described as comparable to that of professional recordings. It gives the model a large, high-quality training library that spans styles ranging from a "gentle female voice" to a "passionate speech".
- Excellent Voice Cloning Ability
- Zero-shot Voice Cloning: With only a 3-second reference audio clip, it can closely reproduce a speaker's voice, with similarity reported to exceed that of existing techniques. It needs no training data from the target speaker, which makes it suitable for cross-language and code-switching scenarios where the same voice must move between languages. For example, it can imitate Jay Chou's voice reading both Chinese and English articles, and the result is highly realistic.
- Fine-grained Voice Customization
- Coarse-grained Adjustment: Users can pick a gender, one of 5 pitch levels, and one of 5 speaking-rate levels to quickly set a rough voice style that covers the basic needs of different scenarios. For example, a male voice can be switched to a female voice, or the speaking rate can be changed to match the rhythm of the content.
- Fine-grained Adjustment: Users can set a specific pitch value (such as A4 = 440 Hz) and a speaking rate in syllables per second, tuning the voice precisely to their own requirements for a more personalized result.
- Multilingual Support
- Smooth Chinese-English Switching: It supports Chinese and English and can synthesize across the two languages while maintaining high naturalness and accuracy. For example, a reference voice recorded in one language can be used to read text in the other, which suits speech synthesis in a globalized context such as international customer service and cross-language education.
- Efficient and Simple Architecture
- Based on the Qwen2.5 Architecture: It is built entirely on a large language model and needs no additional generative model such as a flow-matching model; audio is reconstructed directly from the codes predicted by the LLM. This design simplifies the synthesis pipeline, improves efficiency, and reduces complexity. With only 0.5B parameters and only 40% of the training data used by similar models, it is reported to achieve better results and can run efficiently in both research and production environments.
- Voice Cloning and Style Transfer
- Style Feature Extraction and Transfer: It can extract style features from a small number of speech samples and transfer them to the synthesized speech, replicating a personalized voice style. For example, users can apply the style of a specific speech sample to a virtual speaker's voice to give it a distinctive character, which opens up creative possibilities in fields such as content creation and virtual character dubbing.
- Flexible Usage Modes
- Plugin-based Architecture: Some functions are provided as plugins, so developers can load or remove modules as needed, such as pre-processing tools, text normalization processors, and vocoders (for example HiFi-GAN or WaveGlow). This makes it convenient to customize the system for the requirements of a specific project.
- Command-line Tool: The command-line interface (CLI) is relatively intuitive. Users can run speech synthesis and batch processing with simple commands instead of writing complex scripts, which makes it quick to convert text to speech.
- Multi-platform Deployment Support: It supports multiple operating systems, including Windows, Linux, and macOS, and can be combined with containerization and orchestration tools such as Docker and Kubernetes for more flexible production deployments across different devices and environments.
- GPU/CPU Adaptation: It automatically detects the available hardware and allocates resources accordingly. If a GPU is detected, it gives priority to the GPU for accelerated inference. In a CPU-only environment, it automatically falls back to the CPU while keeping synthesis relatively smooth, making full use of the hardware that is available.
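
To make the BiCodec token split from the Core Technological Innovations item more concrete, here is a minimal Python sketch of the arithmetic. The 50 tokens per second rate comes from the description above; the global token count of 32 is only an illustrative assumption (the text says "fixed-length" without giving a number), and `TokenBudget` and `token_budget` are hypothetical names, not part of Spark-TTS.

```python
from dataclasses import dataclass

SEMANTIC_TOKENS_PER_SECOND = 50  # rate stated in the feature description
GLOBAL_TOKEN_COUNT = 32          # ASSUMPTION: illustrative fixed length, not given in the text


@dataclass(frozen=True)
class TokenBudget:
    """Hypothetical container for the two token streams of one utterance."""
    semantic: int  # "what is said": grows linearly with audio duration
    global_: int   # "how it is said": constant regardless of duration

    @property
    def total(self) -> int:
        return self.semantic + self.global_


def token_budget(duration_seconds: float) -> TokenBudget:
    """Estimate how many tokens represent an utterance of the given length."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    semantic = round(duration_seconds * SEMANTIC_TOKENS_PER_SECOND)
    return TokenBudget(semantic=semantic, global_=GLOBAL_TOKEN_COUNT)


if __name__ == "__main__":
    for seconds in (3, 10, 60):
        budget = token_budget(seconds)
        print(f"{seconds} s: {budget.semantic} semantic + "
              f"{budget.global_} global = {budget.total} tokens")
```

The point of the sketch is that only the semantic stream scales with length, while the speaker and style information stays a constant-size code that can be swapped independently of the content.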
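
The coarse-grained and fine-grained controls from the Fine-grained Voice Customization item can be pictured as one small settings object. The sketch below is only an illustration under stated assumptions: `VoiceControl`, its field names, and the five level names are hypothetical and are not taken from the Spark-TTS code base. It shows coarse levels as defaults, with optional exact values (pitch in Hz, rate in syllables per second) taking precedence when supplied.

```python
from dataclasses import dataclass
from typing import Optional

# ASSUMPTION: hypothetical names for the 5 pitch and 5 speaking-rate levels.
LEVELS = ("very_low", "low", "moderate", "high", "very_high")
GENDERS = ("female", "male")


@dataclass(frozen=True)
class VoiceControl:
    """Hypothetical settings object combining coarse and fine controls."""
    gender: str
    pitch_level: str = "moderate"                 # coarse: one of LEVELS
    speed_level: str = "moderate"                 # coarse: one of LEVELS
    pitch_hz: Optional[float] = None              # fine: exact pitch, e.g. 440.0 for A4
    syllables_per_second: Optional[float] = None  # fine: exact speaking rate

    def __post_init__(self) -> None:
        if self.gender not in GENDERS:
            raise ValueError(f"gender must be one of {GENDERS}")
        for name in ("pitch_level", "speed_level"):
            if getattr(self, name) not in LEVELS:
                raise ValueError(f"{name} must be one of {LEVELS}")
        for name in ("pitch_hz", "syllables_per_second"):
            value = getattr(self, name)
            if value is not None and value <= 0:
                raise ValueError(f"{name} must be positive")

    def describe(self) -> str:
        """Summarize the settings, preferring fine values over coarse levels."""
        pitch = f"{self.pitch_hz} Hz" if self.pitch_hz is not None else self.pitch_level
        speed = (f"{self.syllables_per_second} syllables/s"
                 if self.syllables_per_second is not None else self.speed_level)
        return f"gender={self.gender}, pitch={pitch}, speed={speed}"


if __name__ == "__main__":
    print(VoiceControl(gender="female", pitch_level="high").describe())
    print(VoiceControl(gender="male", pitch_hz=440.0, syllables_per_second=4.5).describe())
```

Validating the values at construction time keeps invalid combinations from reaching the synthesis step, which is one reasonable design choice for this kind of control surface.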
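
The GPU-first, CPU-fallback behavior described in the GPU/CPU Adaptation item can be sketched in a few lines of Python. This is not the project's actual detection code: it assumes a PyTorch runtime, and `pick_device` is a hypothetical helper name. The only library call used is `torch.cuda.is_available()`.

```python
def pick_device(prefer_gpu: bool = True) -> str:
    """Return "cuda" when a CUDA GPU is usable, otherwise fall back to "cpu"."""
    if not prefer_gpu:
        return "cpu"
    try:
        # ASSUMPTION: the model runs on PyTorch; treated as optional here so the
        # helper still works (and returns "cpu") when PyTorch is not installed.
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Selected device: {pick_device()}")
```

A caller would pass the returned string to whatever loads the model, so the same script runs unchanged on a GPU server and on a CPU-only laptop.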