VoxCPM

Clones voices from short audio and generates natural speech.

Features

Open Source, TTS

System Requirements

Minimum 8 GB RAM; 10 GB or more of free storage recommended.
macOS 15+: Apple M-series chip required.
Windows 10/11: runs on CPU; an NVIDIA GPU is recommended.
Note: If you use an NVIDIA GPU, update to a recent driver first.

Introduction

1. Project Overview: What is it?

VoxCPM is a text-to-speech (TTS) model that converts text into natural, human-like speech. It was jointly developed by OpenBMB, the open-source community associated with ModelBest (rendered in some translations as "Facewall Intelligence"), and the Human-Computer Speech Interaction Lab (THUHCSI) at Tsinghua University Shenzhen International Graduate School, and it was open-sourced in September 2025. With only 0.5 billion parameters it is small by current standards, yet it performs strongly, which places it in the "Small But Mighty" model series.

2. Core Features & Capabilities: What can it do?

For beginners and non-technical users, the most impressive aspects of VoxCPM are its practical capabilities:

  • 🎙️ Highly Realistic Speech Synthesis: The speech VoxCPM generates is natural in emotion, timbre, accent, pausing, and prosody, and can be hard to distinguish from recordings of real speakers.
  • 👥 Zero-Shot Voice Cloning: This is the flagship feature. Given a reference audio clip only a few seconds long, VoxCPM can clone the speaker's timbre, accent, and even emotional tone, then speak any new text in that voice with no additional training on the target speaker.
  • 🌐 Multilingual & Special Content Support: The model is primarily trained on Chinese and English, supporting high-quality generation in both languages and cross-lingual voice cloning (e.g., using a Chinese voice to speak English). It can also handle complex text like mathematical formulas and symbols, and supports custom pronunciation correction.
  • ⚡ Efficient & Lightweight, Supports Real-Time Generation: Despite its powerful capabilities, VoxCPM is highly efficient. It can run smoothly on consumer-grade GPUs (like the NVIDIA RTX 4090) and supports streaming synthesis with very low latency, suitable for real-time interactive applications.
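
To illustrate what streaming synthesis means in practice, here is a minimal sketch in which audio is handed to the caller in chunks as soon as enough frames exist, rather than after the whole utterance is finished. This is a generic illustration, not VoxCPM's API: `fake_model_frames` is a hypothetical stand-in for a model that emits one audio frame per step, and the frame and chunk sizes are arbitrary assumptions.

```python
from typing import Iterable, Iterator, List


def stream_chunks(frames: Iterable[List[float]], frames_per_chunk: int) -> Iterator[List[float]]:
    """Group incoming audio frames into chunks and yield each chunk once it is full."""
    if frames_per_chunk < 1:
        raise ValueError("frames_per_chunk must be at least 1")
    buffer: List[float] = []
    count = 0
    for frame in frames:
        buffer.extend(frame)
        count += 1
        if count == frames_per_chunk:
            yield buffer
            buffer, count = [], 0
    if count:
        yield buffer  # flush the final, partially filled chunk


def fake_model_frames(n_frames: int, frame_len: int = 160) -> Iterator[List[float]]:
    """Hypothetical stand-in for a model: emits one silent frame per generation step."""
    for _ in range(n_frames):
        yield [0.0] * frame_len


chunks = stream_chunks(fake_model_frames(10), frames_per_chunk=4)
first = next(chunks)   # ready after only 4 of the 10 frames have been generated
rest = list(chunks)    # remaining chunks, including the shorter final one
```

Because both functions are generators, the first chunk can be played while later frames are still being produced, which is what keeps the perceived latency low in an interactive application.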

3. Technical Characteristics: Why is it so powerful?

The technical innovations behind VoxCPM are the foundation of its outstanding performance:

  • Development Team: A collaboration between ModelBest (Facewall Intelligence) / OpenBMB, which brings expertise in efficient large models, and Tsinghua SIGS, which brings academic speech research.
  • Core Technology: Unlike conventional methods that convert speech into discrete tokens, VoxCPM uses an end-to-end diffusion-autoregressive architecture.
  • Technical Advantages:
    • Tokenizer-Free Design: It models speech directly in a continuous space, avoiding information loss caused by discretization, resulting in smoother and more natural sound.
    • Semantic-Acoustic Decoupling: Through hierarchical language modeling and Finite Scalar Quantization (FSQ) constraints, the model implicitly separates the semantic content of the text from the acoustic features of the voice. This supports deeper text understanding and yields more expressive, more stable speech.
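
To make the FSQ idea concrete, here is a minimal sketch of the generic operation: each latent dimension is squashed into a bounded range and rounded to one of a few evenly spaced levels. It illustrates the quantizer only and is not VoxCPM's code. The function name `fsq_quantize` and the level counts are assumptions chosen for illustration, and how the model places FSQ inside its hierarchy is not shown here.

```python
import math
from typing import List, Sequence


def fsq_quantize(z: Sequence[float], levels: Sequence[int]) -> List[float]:
    """Snap each dimension of z to one of levels[i] evenly spaced values in [-1, 1].

    Assumes every level count is odd (3 or more), so the grid is symmetric around zero.
    """
    if len(z) != len(levels):
        raise ValueError("z and levels must have the same length")
    quantized = []
    for value, n_levels in zip(z, levels):
        if n_levels < 3 or n_levels % 2 == 0:
            raise ValueError("this sketch only handles odd level counts of 3 or more")
        half = (n_levels - 1) / 2
        bounded = math.tanh(value) * half      # squash into (-half, half)
        quantized.append(round(bounded) / half)  # round to the grid, rescale to [-1, 1]
    return quantized


levels = [5, 5, 5]                   # illustrative values, not VoxCPM's configuration
codebook_size = math.prod(levels)    # 125 implicit codes, with no learned codebook
quantized = fsq_quantize([0.3, -2.0, 0.0], levels)  # [0.5, -1.0, 0.0]
```

The rounding step is where fine detail is discarded. In the design described above, FSQ acts as a constraint inside the model rather than as a tokenizer for the output, so the generated speech itself stays in a continuous space.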

4. Application Scenarios

VoxCPM can be widely applied in:

  • Intelligent Voice Assistants (providing more human-like interaction)
  • Audiobooks and Content Creation
  • Virtual Characters and Game Dubbing
  • Personalized Voice Cloning Services
  • Education (e.g., language learning, standard pronunciation demonstration)