GLM-TTS

Zero-shot voice cloning, emotion expression and streaming inference

Features

Open Source, TTS

System Requirements

8 GB RAM and 20 GB or more of free storage recommended.
macOS 15+: Supports both Intel and Apple M-series chips.
Windows 10/11: Intel and AMD GPUs are supported; an NVIDIA GPU is recommended.
Note: For NVIDIA GPUs, update to a recent driver before running.

Introduction

I. Project Overview

GLM-TTS is a high-quality text-to-speech (TTS) synthesis system developed by the ZhipuAI team. It is built on large language models (LLMs) and supports zero-shot voice cloning and streaming inference among its core features. Its synthesis quality is comparable to that of commercial systems, and it is intended for a range of voice generation scenarios.

II. Core Features

  1. Zero-shot Voice Cloning: Clones a speaker’s voice from only 3-10 seconds of reference audio, with no per-speaker tuning required, enabling personalized voice generation.
  2. Emotion-Expressive Speech: Reinforcement learning is used so that the generated speech carries natural emotion (e.g., joy, calmness), addressing the flat, unexpressive delivery of traditional TTS systems.
  3. Streaming Inference: Generates audio in real time, emitting output while the input is still being processed instead of waiting for the full text. This suits interactive scenarios such as smart assistants and online customer service.
  4. Multi-language Support: Primarily optimized for Chinese, with full support for Chinese-English mixed text (e.g., accurately synthesizing phrases like "今天的 meeting 很顺利").
  5. Precise Pronunciation Control: Resolves the ambiguous pronunciation of polyphonic characters (e.g., "行" can be pronounced xíng or háng) and rare words through "text + phoneme" hybrid input. This suits audiobooks, educational content, and other scenarios that require accurate pronunciation.
  6. Flexible Usage Methods: Runs from the command line, from scripts, or through an interactive web interface, so it is approachable even for beginners.
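
To make the "text + phoneme" hybrid input from item 5 more concrete, here is a minimal Python sketch. It assumes a hypothetical representation in which selected characters are replaced by tone-numbered pinyin wrapped in braces. The function names, the brace markup, and the pinyin notation are illustrative assumptions, not the input syntax GLM-TTS actually uses.

```python
from typing import Dict, List, Tuple

# A unit is (kind, value), where kind is "text" or "phoneme".
Unit = Tuple[str, str]


def build_hybrid_input(text: str, overrides: Dict[int, str]) -> List[Unit]:
    """Replace the characters at the given indices with explicit phonemes.

    `overrides` maps a character index to a pronunciation string.
    Characters without an override are passed through as plain text.
    """
    units: List[Unit] = []
    for index, char in enumerate(text):
        if index in overrides:
            units.append(("phoneme", overrides[index]))
        else:
            units.append(("text", char))
    return units


def render(units: List[Unit]) -> str:
    """Render units as one string, wrapping phonemes in braces.

    The brace markup is a hypothetical convention chosen for this sketch.
    """
    return "".join(
        value if kind == "text" else "{" + value + "}"
        for kind, value in units
    )


# In "银行行长" (bank president), both 行 characters read háng, not xíng.
units = build_hybrid_input("银行行长", {1: "hang2", 2: "hang2", 3: "zhang3"})
print(render(units))
```

Only the ambiguous characters need an override; everything else stays as ordinary text and is left to the model's default pronunciation.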

III. Underlying Technology & Architecture

  1. Core Technology Stack:

    • Foundation: Llama-based LLM + Flow Matching model + Vocoder
    • Key Technologies: Multi-Reward Reinforcement Learning, GRPO (Group Relative Policy Optimization) algorithm, zero-shot voice feature extraction, phoneme-level modeling
    • Auxiliary Tools: HuggingFace/ModelScope model distribution, Gradio interactive interface, Whisper speech tokenization
  2. Two-Stage Synthesis Pipeline:

    • Stage 1: The LLM converts input text into speech token sequences.
    • Stage 2: The Flow model transforms these tokens into high-quality mel-spectrograms, which a vocoder then converts into audio waveforms.

  3. Reinforcement Learning Optimization: Uses multi-dimensional reward functions (similarity, Character Error Rate (CER), emotion expression, etc.) to continuously optimize the model’s generation strategy, resulting in more natural and expressive speech.
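
The multi-reward optimization in item 3 can be illustrated with a short Python sketch of the group-relative advantage at the core of GRPO: several candidates are generated for the same prompt, each gets a scalar reward, and rewards are normalized against the group's mean and standard deviation. The reward names, the weights, and the weighted-sum combination are assumptions made for illustration; the source does not specify how GLM-TTS combines its rewards.

```python
from statistics import mean, pstdev
from typing import Dict, List


def combine_rewards(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-dimension scores (an assumed combination rule).

    CER is an error rate, so it carries a negative weight here:
    a lower CER should yield a higher overall reward.
    """
    return sum(weights[name] * scores[name] for name in weights)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its group: (r - mean) / (std + eps).

    Population standard deviation is used here; implementations differ
    on population versus sample standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical weights and scores for three candidates of one prompt.
weights = {"similarity": 1.0, "cer": -1.0, "emotion": 0.5}
candidates = [
    {"similarity": 0.8, "cer": 0.10, "emotion": 0.6},
    {"similarity": 0.7, "cer": 0.30, "emotion": 0.4},
    {"similarity": 0.9, "cer": 0.05, "emotion": 0.7},
]

rewards = [combine_rewards(c, weights) for c in candidates]
advantages = group_relative_advantages(rewards)
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.3f} advantage={advantage:+.3f}")
```

Candidates that score above the group average receive a positive advantage and are reinforced; those below receive a negative one. Because the baseline comes from the group itself, no separate value model is needed.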