VoxCPM2 One-click PC Deployment Tool | One Click to Run AI on Your Own Computer

Features

Open Source, TTS

System Requirements

RAM: 16 GB minimum. Storage: 17 GB or more recommended.
macOS 15 or later: Apple M-series chip required.
Windows 10/11: runs on CPU, but an NVIDIA GPU with 8 GB or more of VRAM is recommended.
Note: if you use an NVIDIA GPU, update to a recent driver version first.
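
For readers who want to check a machine against these numbers before downloading, here is a minimal Python sketch. It is not part of the deployment tool; the function names check_requirements and free_disk_gb are hypothetical, and the RAM and VRAM figures must be supplied by the caller because the Python standard library has no portable way to read them.

```python
import shutil

# Thresholds copied from the System Requirements above.
MIN_RAM_GB = 16
RECOMMENDED_DISK_GB = 17
RECOMMENDED_VRAM_GB = 8


def free_disk_gb(path="."):
    """Free space, in GiB, on the drive that holds `path`."""
    return shutil.disk_usage(path).free / 1024**3


def check_requirements(ram_gb, disk_gb, vram_gb=None):
    """Return a list of warnings; an empty list means every check passed.

    Pass vram_gb=None when running on CPU only or on Apple silicon.
    """
    warnings = []
    if ram_gb < MIN_RAM_GB:
        warnings.append(f"RAM: {ram_gb:.0f} GB is below the {MIN_RAM_GB} GB minimum")
    if disk_gb < RECOMMENDED_DISK_GB:
        warnings.append(
            f"Storage: {disk_gb:.0f} GB is below the recommended {RECOMMENDED_DISK_GB} GB"
        )
    if vram_gb is not None and vram_gb < RECOMMENDED_VRAM_GB:
        warnings.append(
            f"VRAM: {vram_gb:.0f} GB is below the recommended {RECOMMENDED_VRAM_GB} GB"
        )
    return warnings
```

A typical call would be check_requirements(ram_gb=16, disk_gb=free_disk_gb(), vram_gb=8), with the RAM and VRAM values read from the system settings.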

Introduction

VoxCPM is an open-source end-to-end text-to-speech (TTS) model family jointly developed by OpenBMB (ModelBest) and Tsinghua University HCSI Lab. Built on the MiniCPM large language model backbone, it uses a tokenizer-free diffusion autoregressive architecture to achieve highly natural, controllable speech synthesis and voice cloning.

Key Upgrades: VoxCPM (Original) vs VoxCPM2

Greatly improved naturalness: More natural prosody, better long-text coherence, more human-like pauses and stress.
Stronger voice cloning: Higher zero-shot similarity, more accurate reproduction of timbre, accent and emotional details.
Faster generation & better real-time performance: Lower streaming latency, higher inference efficiency, smooth running on common GPUs.
Enhanced audio quality & denoising: Lower noise, higher clarity, improved post-processing.
Finer controllability: Richer adjustment of style, emotion and speaking speed for flexible voice design.
Better multilingual & dialect performance: Higher accuracy and naturalness for Mandarin, English and Cantonese.

Core Features

High-Naturalness Speech Synthesis: Generates smooth, natural, human-like speech from text, supporting long texts, dialogues, news, stories, etc., with prosody, pauses and intonation aligned with semantics.
Zero-Shot Voice Cloning: Quickly clones timbre, speed and intonation from a short reference audio without heavy training, suitable for personalized voiceovers.
Streaming Real-Time Synthesis: Supports progressive generation and playback with low latency, enabling real-time inference on consumer GPUs for voice assistants and interactive systems.
Fine-Grained Voice Design & Control: Allows adjustment of speed, emotion intensity, style and denoising strength; supports text normalization (automatic reading of numbers, symbols, dates) for clearer and more controllable speech.
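
To make the idea of text normalization concrete, the sketch below rewrites small integers and percent signs as words before the text would reach a synthesizer. It only illustrates the general technique; it does not reproduce VoxCPM's own normalizer, and the function names normalize and number_to_words are hypothetical. Dates, decimals and numbers above 999 are deliberately left out.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def normalize(text):
    """Replace '%' after a digit with ' percent', then spell out integers up to 999."""
    # Handle the symbol first so that "50%" becomes "50 percent".
    text = re.sub(r"(\d)\s*%", r"\1 percent", text)

    def spell(match):
        value = int(match.group())
        # Larger numbers are left untouched in this sketch.
        return number_to_words(value) if value <= 999 else match.group()

    return re.sub(r"\d+", spell, text)
```

For example, normalize("Gate 42 opens in 5 minutes") returns "Gate forty-two opens in five minutes". A production normalizer also has to decide between readings such as "twenty twenty-four" and "two thousand twenty-four", which is why this step is usually language-specific.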

Supported Languages & Dialects

Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese
Chinese dialects: Sichuanese, Cantonese, Wu, Northeastern Mandarin, Henan dialect, Shaanxi dialect, Shandong dialect, Tianjin dialect, Hokkien (Minnan)

Underlying Technology

Based on the MiniCPM large language model backbone;
Uses a tokenizer-free, end-to-end diffusion autoregressive architecture;
Adopts semantic-acoustic decoupling, FSQ coding and streaming generation for high audio quality, speed and stability.
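
FSQ is commonly read as finite scalar quantization: each dimension of a latent vector is squashed into a fixed range and rounded to one of a small number of levels, so no learned codebook is needed. The sketch below shows that general idea in plain Python. How VoxCPM applies FSQ internally is not described here, so the level counts, the function names and the restriction to odd level counts are all assumptions made for illustration.

```python
import math


def fsq_quantize(z, levels):
    """Quantize each component of z to an integer code.

    With an odd level count L, the codes lie in -(L-1)/2 .. (L-1)/2.
    """
    codes = []
    for value, num_levels in zip(z, levels):
        if num_levels % 2 == 0:
            raise ValueError("this sketch handles odd level counts only")
        half = (num_levels - 1) / 2
        bounded = math.tanh(value) * half  # squash into (-half, half)
        codes.append(round(bounded))       # snap to the nearest integer level
    return codes


def code_to_index(codes, levels):
    """Pack per-dimension codes into a single integer (mixed-radix encoding)."""
    index = 0
    for code, num_levels in zip(codes, levels):
        half = (num_levels - 1) // 2
        index = index * num_levels + (code + half)  # shift the code into 0..L-1
    return index
```

With three dimensions of five levels each, the implied codebook has 5 * 5 * 5 = 125 entries, and fsq_quantize([0.0, 10.0, -10.0], [5, 5, 5]) gives the codes [0, 2, -2]. During training, rounding is usually paired with a straight-through gradient estimator, which this sketch omits.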

Application Scenarios

Audiobooks, short video dubbing, virtual human voice, intelligent customer service announcements, in-car voice assistants, educational reading, personalized voice customization, etc.

GitHub: https://github.com/OpenBMB/VoxCPM

License: Apache-2.0