VoxCPM2

Supporting Mandarin, English and Cantonese, with natural speech synthesis and zero-shot voice cloning

Features

Open Source, TTS

System Requirements

Minimum 16 GB RAM; 17 GB or more of free storage recommended.
macOS 15 or later: Apple M-series chip required.
Windows 10/11: running on CPU is supported, but an NVIDIA GPU with 8 GB or more of VRAM is recommended.
Note: for NVIDIA GPUs, update to a recent driver before running.
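
As a rough illustration, the thresholds above can be written down as a small pre-flight check. This is a minimal sketch, not part of VoxCPM2: the function names `check_requirements` and `free_disk_gb` are hypothetical, and RAM and VRAM are passed in by the caller because the Python standard library has no portable way to query them.

```python
import shutil

MIN_RAM_GB = 16      # minimum RAM listed above
MIN_STORAGE_GB = 17  # recommended free storage listed above
REC_VRAM_GB = 8      # recommended NVIDIA VRAM listed above


def free_disk_gb(path="."):
    """Free space, in GiB, on the volume that contains `path`."""
    return shutil.disk_usage(path).free / 1024**3


def check_requirements(ram_gb, free_storage_gb, vram_gb=None):
    """Return a list of warnings; an empty list means every check passed.

    `vram_gb` is optional because a dedicated GPU is recommended, not required.
    """
    warnings = []
    if ram_gb < MIN_RAM_GB:
        warnings.append(f"RAM: {ram_gb} GB is below the {MIN_RAM_GB} GB minimum")
    if free_storage_gb < MIN_STORAGE_GB:
        warnings.append(
            f"Storage: {free_storage_gb:.1f} GB free, {MIN_STORAGE_GB} GB recommended"
        )
    if vram_gb is not None and vram_gb < REC_VRAM_GB:
        warnings.append(f"VRAM: {vram_gb} GB is below the recommended {REC_VRAM_GB} GB")
    return warnings
```

For example, `check_requirements(ram_gb=16, free_storage_gb=free_disk_gb())` returns only a storage warning, or nothing at all, depending on the free space of the current volume.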

Introduction

VoxCPM is an open-source end-to-end text-to-speech (TTS) model family jointly developed by OpenBMB (ModelBest) and Tsinghua University HCSI Lab. Built on the MiniCPM large language model backbone, it uses a tokenizer-free diffusion autoregressive architecture to achieve highly natural, controllable speech synthesis and voice cloning.

Key Upgrades: VoxCPM (Original) vs VoxCPM2

  1. Greatly improved naturalness: more natural prosody, better long-text coherence, more human-like pauses and stress.
  2. Stronger voice cloning: higher zero-shot similarity, more accurate reproduction of timbre, accent and emotional details.
  3. Faster generation and better real-time performance: lower streaming latency, higher inference efficiency, smooth running on common GPUs.
  4. Enhanced audio quality and denoising: lower noise, higher clarity, improved post-processing.
  5. Finer controllability: richer adjustment of style, emotion and speaking speed for flexible voice design.
  6. Better multilingual and dialect performance: higher accuracy and naturalness for Mandarin, English and Cantonese.

Core Features

  1. High-Naturalness Speech Synthesis: generates smooth, natural, human-like speech from text, supporting long texts, dialogues, news, stories, etc., with prosody, pauses and intonation aligned with semantics.
  2. Zero-Shot Voice Cloning: quickly clones timbre, speed and intonation from a short reference audio clip without heavy training, suitable for personalized voiceovers.
  3. Streaming Real-Time Synthesis: supports progressive generation and playback with low latency, enabling real-time inference on consumer GPUs for voice assistants and interactive systems.
  4. Fine-Grained Voice Design & Control: allows adjustment of speed, emotion intensity, style and denoising strength; supports text normalization (automatic reading of numbers, symbols, dates) for clearer and more controllable speech.
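
To make the text normalization step in item 4 concrete, here is a minimal sketch of what such a pass can look like for English. It is an illustration only, not VoxCPM2's own normalizer: the names `number_to_words` and `normalize_text` and the small symbol table are hypothetical, and the sketch handles only standalone integers from 0 to 999 and a few symbols, not decimals or dates.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
# Hypothetical symbol table; a real normalizer would cover far more cases.
SYMBOLS = {"%": " percent", "&": " and ", "+": " plus "}


def number_to_words(n):
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def normalize_text(text):
    """Expand short integers and a few symbols into speakable words."""
    # Only standalone runs of 1 to 3 digits are replaced; longer numbers are left as-is.
    text = re.sub(r"\b\d{1,3}\b", lambda m: number_to_words(int(m.group())), text)
    for symbol, spoken in SYMBOLS.items():
        text = text.replace(symbol, spoken)
    # Collapse the extra whitespace introduced by the symbol replacements.
    return re.sub(r"\s+", " ", text).strip()
```

With this sketch, `normalize_text("Save 25% & win")` returns `"Save twenty-five percent and win"`, which is the kind of spoken form a TTS front end needs before synthesis.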

Supported Languages & Dialects

  • Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese
  • Chinese dialects: Sichuanese, Cantonese, Wu, Northeastern Mandarin, Henan dialect, Shaanxi dialect, Shandong dialect, Tianjin dialect, Hokkien (Southern Min)

Underlying Technology

  • Based on the MiniCPM large language model backbone;
  • Uses a tokenizer-free end-to-end diffusion autoregressive architecture;
  • Adopts semantic-acoustic decoupling, FSQ coding and streaming generation for high audio quality, speed and stability.
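
The FSQ step can be illustrated in a few lines. The sketch below assumes that FSQ here means finite scalar quantization, which the text does not spell out. It shows the generic technique (bound each latent dimension, then round it to a small set of integer levels) and does not reflect VoxCPM2's actual level counts or latent sizes. The names `fsq_quantize` and `fsq_index` are hypothetical, and only odd level counts are handled to keep the example short.

```python
import math


def fsq_quantize(z, levels):
    """Quantize each dimension of `z` to one of `levels[i]` integer values.

    For an odd level count L, the output lies in {-(L // 2), ..., L // 2}.
    """
    if len(z) != len(levels):
        raise ValueError("z and levels must have the same length")
    codes = []
    for value, n_levels in zip(z, levels):
        if n_levels % 2 == 0:
            raise ValueError("this sketch handles odd level counts only")
        half = n_levels // 2
        bounded = half * math.tanh(value)  # squash into the open interval (-half, half)
        codes.append(round(bounded))       # snap to the nearest integer level
    return codes


def fsq_index(codes, levels):
    """Map per-dimension codes to one integer index in the implicit codebook."""
    index, basis = 0, 1
    for code, n_levels in zip(codes, levels):
        index += (code + n_levels // 2) * basis  # shift each code to 0..L-1
        basis *= n_levels
    return index
```

Because the codebook is implicit (the product of the per-dimension level counts), no codebook vectors need to be learned or stored; three dimensions with five levels each give 125 possible codes.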

Application Scenarios

Typical uses include audiobooks, short video dubbing, virtual human voices, intelligent customer service announcements, in-car voice assistants, educational reading, and personalized voice customization.