Chatterbox TTS

Zero-shot cloning and emotional control across 23 languages

Features

Open Source, TTS, Voice Conversion

System Requirements

Minimum 8GB RAM. 18GB+ storage recommended.
macOS 15+: Supports both Intel and M-series chips.
Windows 10/11: Intel/AMD GPUs supported, NVIDIA GPU recommended.
Note: If you use an NVIDIA GPU, update to a recent driver before installing.
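If you want to check a machine against these requirements before installing, a small preflight script can help. The sketch below is hypothetical: `preflight` and `free_disk_gb` are illustrative names, not part of the app. It treats "GB" as 10^9 bytes (an assumption), and it takes RAM as an argument because Python's standard library has no portable way to query installed memory.

```python
import shutil
from typing import List

MIN_RAM_GB = 8            # "Minimum 8GB RAM"
RECOMMENDED_DISK_GB = 18  # "18GB+ storage recommended"


def free_disk_gb(path: str = ".") -> float:
    """Free space on the drive that holds `path`, in GB (assumed 10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def preflight(ram_gb: float, disk_gb: float) -> List[str]:
    """Return human-readable problems; an empty list means both checks passed."""
    problems: List[str] = []
    if ram_gb < MIN_RAM_GB:
        problems.append(f"RAM {ram_gb:.1f} GB is below the {MIN_RAM_GB} GB minimum")
    if disk_gb < RECOMMENDED_DISK_GB:
        problems.append(
            f"free storage {disk_gb:.1f} GB is below the recommended {RECOMMENDED_DISK_GB} GB"
        )
    return problems


if __name__ == "__main__":
    # RAM is supplied by hand here (hypothetical value); disk space is measured.
    print(preflight(ram_gb=16, disk_gb=free_disk_gb()))
```

The OS version and GPU driver checks are left out, since they depend on platform-specific tools.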

Introduction

2026-01-29 Update Notes: Added support for the Chatterbox Turbo 350M model, which generates speech even faster.

Note: This application currently offers suboptimal support for Chinese, which may result in irregular speech rhythms or artifacts; however, it delivers high-quality and natural synthesis for English, German, and Spanish. Please evaluate your language requirements before proceeding with the installation.

Chatterbox, developed by Resemble AI, is a lightweight open-source Text-to-Speech (TTS) model designed to deliver high-fidelity, expressive, and multilingual voice synthesis with minimal hardware requirements.

🌟 Key Features

  • 23-Language Support: Natively supports 23 languages, including English, Chinese, French, German, and Spanish. Cross-lingual cloning lets you, for example, use a Chinese reference clip to make that voice speak fluent German or English while retaining the original persona.
  • Zero-Shot Cloning: Clone a voice from a 5-10 second sample, with no additional training required. In blind listening tests, over 63% of listeners preferred its output to that of competing models.
  • Fine-Grained Emotion Control: A unique "exaggeration" parameter lets users adjust emotional intensity, from calm narration to dramatic performance, by changing a single numeric value.
  • Lightweight: The original model is built on a 0.5B-parameter Llama backbone and the Turbo variant has 350M parameters, so both run on consumer hardware within the System Requirements above.
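To make zero-shot cloning and the exaggeration control concrete, here is a minimal sketch of how those settings might be passed to the model in code. It assumes the `generate(text, audio_prompt_path=..., exaggeration=..., cfg_weight=...)` call shown in the open-source `chatterbox-tts` package's README. `VoiceSettings` and `synthesize` are hypothetical helpers, and the non-negative check is our own assumption rather than a documented rule. The model is passed in as an argument, so the helper can be exercised with a stub and does not need model weights.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Protocol


class TTSModel(Protocol):
    """Assumed shape of a Chatterbox-style model: text in, waveform out."""

    def generate(self, text: str, **kwargs: Any) -> Any: ...


@dataclass
class VoiceSettings:  # hypothetical helper, not part of the package
    # Path to a short (about 5-10 s) reference clip; None keeps the default voice.
    reference_clip: Optional[str] = None
    # Emotional intensity: lower values sound calmer, higher values more dramatic.
    exaggeration: float = 0.5
    # Classifier-free guidance weight (default of 0.5 assumed).
    cfg_weight: float = 0.5


def synthesize(model: TTSModel, text: str, settings: VoiceSettings) -> Any:
    """Translate VoiceSettings into keyword arguments for model.generate()."""
    if settings.exaggeration < 0:  # assumed guard, not a documented limit
        raise ValueError("exaggeration must be non-negative")
    kwargs: Dict[str, Any] = {
        "exaggeration": settings.exaggeration,
        "cfg_weight": settings.cfg_weight,
    }
    if settings.reference_clip is not None:
        # Zero-shot: the clip is supplied at inference time; there is no training step.
        kwargs["audio_prompt_path"] = settings.reference_clip
    return model.generate(text, **kwargs)


# Possible use with the real package (names from its README; verify against your version):
#   import torchaudio
#   from chatterbox.tts import ChatterboxTTS
#   model = ChatterboxTTS.from_pretrained(device="cuda")
#   wav = synthesize(model, "Hello!", VoiceSettings("my_voice.wav", exaggeration=0.8, cfg_weight=0.3))
#   torchaudio.save("out.wav", wav, model.sr)
```

Keeping the settings in one object makes it easy to store named presets (for example, a calm narrator voice and a dramatic one) and reuse them across many lines of text.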

🔬 Technical Advantages

  • LLaMA 3 Foundation: Built on the LLaMA 3 architecture and pre-trained on 500,000+ hours of premium multilingual audio data.
  • Low Latency: Streaming inference and KV caching keep latency below 200 ms, which suits real-time AI agents and NPCs.
  • Neural Watermarking: Features the Perth neural watermark to ensure AI-generated content is traceable and used responsibly.
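This page does not describe Chatterbox's decoder internals. As a generic illustration of the two latency techniques named above, the toy below shows single-head attention with a KV cache and chunked streaming. Each new position projects only its own key and value and reuses the cached ones instead of recomputing the whole history. Outputs are yielded in chunks as soon as they are ready. The vectors, the `project` scalings, and the chunk size are hypothetical stand-ins for learned weights and real audio tokens.

```python
import math
from typing import Iterator, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention over every key/value seen so far."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, k) * scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(len(values[0]))]


def project(x: Vector, scale: float) -> Vector:
    # Hypothetical stand-in for a learned linear projection.
    return [scale * xi for xi in x]


class KVCache:
    def __init__(self) -> None:
        self.keys: List[Vector] = []
        self.values: List[Vector] = []

    def step(self, x: Vector) -> Vector:
        # Only the newest position is projected; earlier keys/values are reused.
        # The cache holds only past and current positions, so no causal mask is needed.
        self.keys.append(project(x, 0.5))
        self.values.append(project(x, 2.0))
        return attend(project(x, 1.0), self.keys, self.values)


def attend_without_cache(xs: List[Vector]) -> Vector:
    # Baseline: recompute every key/value from scratch at each step.
    keys = [project(x, 0.5) for x in xs]
    values = [project(x, 2.0) for x in xs]
    return attend(project(xs[-1], 1.0), keys, values)


def stream(inputs: List[Vector], chunk_size: int) -> Iterator[List[Vector]]:
    """Yield outputs in chunks as they are produced instead of after the last step."""
    cache = KVCache()
    chunk: List[Vector] = []
    for x in inputs:
        chunk.append(cache.step(x))
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk
```

With the cache, each step does a constant number of projections, while the uncached baseline does work proportional to the sequence length. Streaming lets playback start after the first chunk rather than after the whole utterance.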