MOSS-TTS-Nano

An ultra-lightweight TTS tool, supporting multilingual speech synthesis and zero-shot voice cloning on ordinary CPU

Features

Open Source, TTS

System Requirements

16 GB RAM and at least 8 GB of storage recommended.
macOS 15+: Supports both Intel and Apple M-series chips.
Windows 10/11: Intel and AMD GPUs are supported; an NVIDIA GPU is recommended.
Note: For NVIDIA GPUs, update to a recent driver version.

Introduction

MOSS-TTS-Nano is a fully open-source, lightweight text-to-speech (TTS) project jointly developed by the OpenMOSS open-source team and MOSI.AI. It targets low-end hardware, fully offline local operation, and low-cost speech synthesis, making it an accessible speech generation tool for ordinary users, developers, hobbyists, and lightweight commercial scenarios.

Unlike traditional speech synthesis solutions that require high-performance GPUs or depend on cloud APIs, this project is designed for minimal deployment and offline use. With only 0.1 billion parameters, the model does not need a discrete GPU: it runs stably and smoothly on the entry-level CPUs found in ordinary home computers, thin-and-light laptops, low-spec servers, and embedded edge devices, with low memory usage and low resource consumption. After a simple download and local deployment, it can be used completely offline, which avoids network latency, API rate limits, and the risk of uploading private data.
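To put the 0.1 billion parameter figure in perspective, the sketch below estimates how much memory the weights alone would occupy at a few common numeric precisions. It is back-of-the-envelope arithmetic, not a measurement of MOSS-TTS-Nano: the precision the project actually ships in is not stated here, and activations, caches, and runtime overhead are ignored. The function name `weight_memory_mib` is hypothetical.

```python
# Rough weight-memory estimate for a model with a given parameter count.
# Assumption: only the weights are counted; activations, caches, and
# runtime overhead are ignored, so real usage will be higher.

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_mib(num_params: int, dtype: str = "float32") -> float:
    """Return the approximate size of the weights in MiB."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 2)


if __name__ == "__main__":
    params = 100_000_000  # 0.1 billion parameters
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: ~{weight_memory_mib(params, dtype):.0f} MiB")
```

Even at 32-bit precision the weights come to a few hundred MiB, which is consistent with the claim that the model fits comfortably in the memory of an ordinary laptop.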

Its core features and advantages are:

  1. Multilingual support: natively supports more than 20 major languages, including Chinese, English, Japanese, Korean, French, and Spanish, with seamless switching between them.
  2. Zero-shot voice cloning: no training or fine-tuning is required. Users upload a short reference audio clip to replicate the target timbre and create custom voices.
  3. High-fidelity audio output: supports a 48 kHz sampling rate and stereo audio generation. The synthesized speech sounds natural, with sensible sentence segmentation and human-like intonation, which reduces the robotic quality of the output.
  4. Low-latency streaming synthesis: supports streaming inference with a short time to first audio. Long texts are split automatically so that large volumes of content can be read continuously.
  5. Multiple access methods: a built-in local web demo, command-line tools, and a Python interface. The web demo suits non-technical users, while the command-line tools and Python interface support integration and further development.
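The automatic splitting of long texts mentioned in item 4 can be illustrated with a minimal sketch. This is not the project's actual segmentation logic, which is not described here; it is one simple way to cut text at sentence boundaries and pack sentences into chunks that could be synthesized and played back one at a time. The function name `split_into_chunks` and the `max_chars` limit are hypothetical.

```python
import re
from typing import Iterator

# A "sentence" is any run of text ending in Latin or CJK sentence punctuation,
# or the remaining text at the end of the input. Illustrative only: it will
# also split on decimal points such as "3.14".
_SENTENCE = re.compile(r".+?(?:[.!?。！？]+|$)", flags=re.S)


def split_into_chunks(text: str, max_chars: int = 200) -> Iterator[str]:
    """Yield chunks of whole sentences, each at most max_chars long.

    A single sentence longer than max_chars is yielded as its own
    oversized chunk rather than being cut mid-sentence.
    """
    current = ""
    for match in _SENTENCE.finditer(text):
        sentence = match.group().strip()
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                yield current
            current = sentence
    if current:
        yield current


if __name__ == "__main__":
    for chunk in split_into_chunks("One. Two! Three?", max_chars=10):
        print(chunk)
```

Because the function is a generator, a caller can start synthesizing the first chunk before the rest of the text has been processed, which is the property streaming playback relies on.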

At the technical level, the project combines a large language model (LLM) with MOSS-Audio-Tokenizer-Nano, a lightweight audio tokenizer developed in-house. It uses a purely autoregressive audio generation framework in place of the traditional pipeline of a separate acoustic model and vocoder. By relying on the LLM's semantic understanding and on lightweight audio encoding, it greatly reduces hardware requirements and model size while keeping pronunciation natural and sound quality high, balancing light weight, operational stability, and generation quality.
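The general shape of autoregressive audio-token generation can be sketched as follows. This is a toy illustration of the idea, not MOSS-TTS-Nano's implementation: `toy_next_token` stands in for the language model, `toy_decode` stands in for the audio tokenizer's decoder, and the token ids, `EOS_TOKEN`, and all function names are hypothetical.

```python
from typing import Callable, List, Sequence

EOS_TOKEN = 0  # hypothetical end-of-audio token id


def generate_audio_tokens(
    text_tokens: Sequence[int],
    next_token: Callable[[Sequence[int]], int],
    max_new_tokens: int = 1000,
) -> List[int]:
    """Predict audio tokens one at a time, each conditioned on the text
    tokens plus every audio token generated so far."""
    context = list(text_tokens)
    audio_tokens: List[int] = []
    for _ in range(max_new_tokens):
        token = next_token(context)
        if token == EOS_TOKEN:
            break
        audio_tokens.append(token)
        context.append(token)
    return audio_tokens


def toy_next_token(context: Sequence[int]) -> int:
    """Stand-in for the language model: deterministic, not learned."""
    if len(context) >= 8:
        return EOS_TOKEN  # stop once the context holds 8 tokens
    return (sum(context) % 5) + 1  # token id in 1..5


def toy_decode(audio_tokens: Sequence[int]) -> List[float]:
    """Stand-in for the tokenizer's decoder: one sample in [-1, 1] per token."""
    return [(t - 3) / 2 for t in audio_tokens]


if __name__ == "__main__":
    tokens = generate_audio_tokens([3, 1, 4], toy_next_token)
    print(tokens)              # [4, 3, 1, 2, 4]
    print(toy_decode(tokens))  # [0.5, 0.0, -1.0, -0.5, 0.5]
```

In a design like this, a single model produces the audio tokens directly and the tokenizer's decoder turns them into a waveform, so no separate acoustic model or vocoder stage is needed. Tokens can also be decoded as they are produced, which is what makes streaming output possible.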

It suits a wide range of personal and lightweight development scenarios, including novel narration, voice-overs for short videos by independent content creators, local voice assistants, voice features for mini-programs and lightweight software, voice announcements on offline devices, creative voice production, and multilingual listening and reading practice for education.