VoxCPM is an open-source, end-to-end text-to-speech (TTS) model family jointly developed by OpenBMB (ModelBest) and the HCSI Lab at Tsinghua University. Built on the MiniCPM large language model backbone, it uses a tokenizer-free diffusion autoregressive architecture for highly natural, controllable speech synthesis and voice cloning.
Key Upgrades: VoxCPM (Original) vs VoxCPM2
- Greatly improved naturalness
More natural prosody, better long-text coherence, more human-like pauses and stress.
- Stronger voice cloning
Higher zero-shot similarity, more accurate reproduction of timbre, accent and emotional details.
- Faster generation & better real-time performance
Lower streaming latency and higher inference efficiency; runs smoothly on consumer GPUs.
- Enhanced audio quality & denoising
Lower noise, higher clarity, improved post-processing.
- Finer controllability
Richer adjustment of style, emotion and speaking speed for flexible voice design.
- Better multilingual & dialect performance
Higher accuracy and naturalness for Mandarin, English and Cantonese.
Core Features
- High-Naturalness Speech Synthesis
Generates smooth, natural, human-like speech from text, supporting long texts, dialogues, news, stories, etc., with prosody, pauses and intonation aligned with semantics.
- Zero-Shot Voice Cloning
Clones timbre, speaking rate and intonation from a short reference audio clip without additional training; suitable for personalized voiceovers.
- Streaming Real-Time Synthesis
Supports progressive generation and playback with low latency, enabling real-time inference on consumer GPUs for voice assistants and interactive systems.
- Fine-Grained Voice Design & Control
Allows adjustment of speed, emotion intensity, style and denoising strength; supports text normalization (automatic reading of numbers, symbols, dates) for clearer and more controllable speech.
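The text normalization step mentioned above (reading numbers and symbols aloud) can be illustrated with a small, self-contained Python sketch. This is not VoxCPM's normalizer: the function names `number_to_words` and `normalize_text`, the symbol table, and the 0 to 999 limit are all assumptions made for illustration. A real normalizer also handles dates, decimals, currencies and other languages.

```python
import re

# Hypothetical lookup tables for this sketch (English only).
_ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
_TENS = ["", "", "twenty", "thirty", "forty", "fifty",
         "sixty", "seventy", "eighty", "ninety"]
_SYMBOLS = {"%": " percent", "&": " and ", "+": " plus "}


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only supports 0..999")
    if n < 20:
        return _ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return _TENS[tens] + ("-" + _ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = _ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def normalize_text(text: str) -> str:
    """Rewrite short integers and a few symbols as speakable words."""
    # Spell out standalone runs of 1-3 digits; longer runs are left untouched.
    text = re.sub(
        r"(?<!\d)\d{1,3}(?!\d)",
        lambda m: number_to_words(int(m.group())),
        text,
    )
    # Replace each known symbol with its spoken form.
    for symbol, spoken in _SYMBOLS.items():
        text = text.replace(symbol, spoken)
    # Collapse the extra whitespace introduced by the replacements.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_text("Battery at 85% & rising"))
# Battery at eighty-five percent and rising
```

Normalizing before synthesis means the model only sees words it can pronounce, which is why toggling such a step changes how digits and symbols are read.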
Supported Languages & Dialects
- Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese
- Chinese dialects: Sichuanese, Cantonese, Wu, Northeastern Mandarin, Henan dialect, Shaanxi dialect, Shandong dialect, Tianjin dialect, Hokkien (Southern Min)
Underlying Technology
- Based on the MiniCPM large language model backbone.
- Uses a tokenizer-free, end-to-end diffusion autoregressive architecture.
- Adopts semantic-acoustic decoupling, finite scalar quantization (FSQ) and streaming generation for high audio quality, speed and stability.
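To make the FSQ item above more concrete, here is a minimal sketch of finite scalar quantization in plain Python. It shows the general technique only: each latent dimension is squashed into a bounded range and rounded to a small integer grid. The level counts, the function names `fsq_quantize` and `fsq_index`, and the use of `tanh` for bounding are illustrative assumptions, not VoxCPM's actual configuration. Training-time details such as the straight-through gradient estimator are omitted.

```python
import math


def fsq_quantize(z, levels):
    """Round each latent dimension to one of `levels[i]` integer values.

    Assumes odd level counts so the grid is symmetric around zero.
    """
    if len(z) != len(levels):
        raise ValueError("z and levels must have the same length")
    codes = []
    for value, n_levels in zip(z, levels):
        if n_levels < 3 or n_levels % 2 == 0:
            raise ValueError("this sketch expects odd level counts >= 3")
        half = (n_levels - 1) / 2
        bounded = math.tanh(value) * half  # now strictly inside (-half, half)
        codes.append(round(bounded))       # snap to the nearest grid point
    return codes


def fsq_index(codes, levels):
    """Flatten per-dimension codes into one index of the implicit codebook."""
    index = 0
    for code, n_levels in zip(codes, levels):
        half = (n_levels - 1) // 2
        index = index * n_levels + (code + half)  # shift code to 0..n_levels-1
    return index


levels = [5, 5, 3]                 # implicit codebook size: 5 * 5 * 3 = 75
codes = fsq_quantize([0.3, -2.0, 0.0], levels)
print(codes)                       # [1, -2, 0]
print(fsq_index(codes, levels))    # 46
```

Because the codebook is implied by the grid rather than learned, there are no codebook vectors to store or update, which is one reason this family of quantizers is considered stable to train.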
Application Scenarios
Audiobooks, short video dubbing, virtual human voice, intelligent customer service announcements, in-car voice assistants, educational reading, personalized voice customization, etc.