CosyVoice One-click PC Deployment Tool | One Click to Run AI on Your Own Computer

Features

Open Source, TTS

System Requirements

Minimum 16GB RAM. 33GB+ storage recommended.
macOS 15+: M-series chips required.
Windows 10/11: NVIDIA GPU with 4GB+ VRAM required.
Note: On NVIDIA GPUs, update to a recent driver version before running.
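
As a rough illustration, the requirements above can be expressed as a small pre-flight check. This is a sketch, not part of the deployment tool: the names `check_requirements` and `free_disk_gb` are hypothetical, the caller supplies the RAM and VRAM figures (the Python standard library does not report them portably), and the macOS 15+ version check is left out.

```python
import shutil

# Thresholds taken from the requirements listed above.
MIN_RAM_GB = 16
RECOMMENDED_DISK_GB = 33
MIN_VRAM_GB = 4


def free_disk_gb(path="."):
    """Free space on the volume containing `path`, in GiB."""
    return shutil.disk_usage(path).free / 1024**3


def check_requirements(ram_gb, disk_gb, os_name, vram_gb=None, apple_silicon=False):
    """Return a list of problems; an empty list means every check passed.

    os_name follows platform.system(): "Windows" or "Darwin" (macOS).
    vram_gb is None when no NVIDIA GPU is present.
    """
    problems = []
    if ram_gb < MIN_RAM_GB:
        problems.append(f"RAM: {ram_gb} GB is below the {MIN_RAM_GB} GB minimum")
    if disk_gb < RECOMMENDED_DISK_GB:
        problems.append(
            f"Storage: {disk_gb:.0f} GB is below the recommended {RECOMMENDED_DISK_GB} GB"
        )
    if os_name == "Windows":
        if vram_gb is None or vram_gb < MIN_VRAM_GB:
            problems.append(f"GPU: NVIDIA GPU with {MIN_VRAM_GB} GB+ VRAM required")
    elif os_name == "Darwin":
        if not apple_silicon:
            problems.append("CPU: M-series chip required")
    else:
        problems.append(f"OS: {os_name} is not listed as supported")
    return problems


if __name__ == "__main__":
    # Example values; replace with figures for the actual machine.
    for line in check_requirements(16, free_disk_gb(), "Windows", vram_gb=6):
        print(line)
```

An empty result means the machine meets the listed figures; each string otherwise names the requirement that was missed.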

Introduction

Developed by the FunAudioLLM team, which specializes in audio large language model research, CosyVoice is a next-generation Text-to-Speech (TTS) system built on deeply optimized lightweight audio generation models and designed for use across application scenarios. It balances high-naturalness speech generation with low-resource deployment through an integrated streaming architecture, multi-dimensional emotional modeling, and device-level optimization, and it is widely applied in smart hardware, content creation, and multilingual interaction.

Core Technical Architecture & Features:

Lightweight Audio LLM Core
- Uses mixed-precision quantization (e.g., INT8/INT4) and model distillation to compress the parameter count to one fifth that of traditional TTS models, while dynamic sparsification algorithms retain more than 98% of the sound quality.
- Introduces a "Progressive Feature Fusion Network" that combines autoregressive (AR) and non-autoregressive (Non-AR) architectures, preserving speech coherence while reducing inference latency to under 50 ms (a 3x improvement over traditional models).
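
To make the quantization idea above concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It illustrates the general technique only; the exact scheme CosyVoice uses is not described here, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.50, -1.27, 0.03, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored in one byte instead of four, at the cost of a rounding error of at most half the scale per value.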
Multi-Language & Multi-Modal Expression
- Supports speech generation for 80+ languages and dialects, covering major languages such as Chinese, English, Japanese, Korean, Spanish, and French as well as varieties such as Wu Chinese and Cantonese. Cross-lingual acoustic model sharing enables a rapid cold start for low-resource languages.
- Embeds 12 basic emotion models (joy, sadness, anger, tenderness, etc.) and three intonation adjustment dimensions (speech rate, pitch, pause frequency), supporting dynamic style adjustment via text annotation or real-time parameter control.
Streaming Real-Time Generation Engine
- Adopts a WebRTC-based streaming protocol that pipelines text chunking, speech generation, and audio output, enabling a zero-buffer experience for live dubbing, real-time translation, and similar scenarios.
- Integrates dynamic speech rate control to adjust generation speed based on network/device load, ensuring smooth playback even over 4G networks.
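
The chunk-generate-output pipeline described above can be sketched with Python generators. This is an illustration under stated assumptions, not the CosyVoice engine: `fake_synthesize` is a hypothetical stand-in that returns silent PCM bytes, the names `chunk_text` and `stream_tts` are invented for the example, and no WebRTC transport is involved.

```python
import re
from typing import Callable, Iterator


def chunk_text(text: str) -> Iterator[str]:
    """Split text into sentence-like chunks after ., !, ? or their CJK forms."""
    for piece in re.split(r"(?<=[.!?。！？])\s*", text):
        piece = piece.strip()
        if piece:
            yield piece


def fake_synthesize(chunk: str) -> bytes:
    """Hypothetical stand-in for a TTS model: 160 silent 16-bit samples per character."""
    return b"\x00\x00" * (len(chunk) * 160)


def stream_tts(text: str, synthesize: Callable[[str], bytes]) -> Iterator[bytes]:
    """Yield audio chunk by chunk, so output can start before all text is processed."""
    for chunk in chunk_text(text):
        yield synthesize(chunk)


if __name__ == "__main__":
    for audio in stream_tts("Hello there. How are you? Fine", fake_synthesize):
        # A real player would write `audio` to an output device here.
        print(len(audio), "bytes ready")
```

Because every stage is a generator, the first audio chunk is produced as soon as the first sentence has been synthesized, which is the property a streaming engine relies on.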

Core Product Advantages:

Naturalness Breakthrough: Human-Level Speech Quality
Optimizes glottal wave modeling with Generative Adversarial Networks (GANs), achieving a 92% spectral envelope match with human speech. Prosody follows the norms of the target language (e.g., Chinese tones, English linking), and audiobook user surveys show a difference of less than 5% from professional voice actors.
Scenario-Oriented Solution Matrix
- Content Creation: Integrates with major audio and video editors (Audition, Jianying) through APIs and plugins, offering batch dubbing and character voice customization. Adopted by a leading audiobook platform, where it boosted production efficiency by 300%.
- Intelligent Interaction: Maintains emotional consistency in multi-turn conversations. In customer service, adjusts response intonation via speech emotion recognition, reducing complaint rates by 25%.
- Education & Healthcare: Provides multilingual TTS+ASR closed-loop solutions, used in cross-border hospital guidance systems and minority language teaching apps, serving 100,000+ users.
Open-Source Ecosystem & Technical Support
Licensed under Apache 2.0, the project offers complete model training/inference toolchains (data preprocessing scripts, custom voice training guides). The community has contributed 30+ third-party plugins (Unity/Unreal integration, browser WebAssembly) and regularly releases pretrained model libraries (child voices, dialects, virtual idol voices).

Homepage: https://funaudiollm.github.io/cosyvoice2/

GitHub: https://github.com/FunAudioLLM/CosyVoice

License: Apache-2.0