An ultra-lightweight TTS tool that supports multilingual speech synthesis and zero-shot voice cloning on an ordinary CPU.
16GB RAM recommended. 8GB+ storage recommended.
macOS 15+: Supports both Intel and M-series chips.
Windows 10/11: Intel/AMD GPUs supported, NVIDIA GPU recommended.
Note: For NVIDIA GPUs, install a newer driver.

MOSS-TTS-Nano is a fully open-source, lightweight text-to-speech (TTS) project jointly developed by the OpenMOSS open-source team and MOSI.AI. It focuses on running on low-end hardware, offline local operation and cost-effective speech synthesis, and is intended as a general-purpose, accessible speech generation tool for ordinary users, developers, hobbyists and lightweight commercial scenarios.
Unlike traditional speech synthesis solutions that require high-performance GPUs and depend on cloud APIs, this project targets minimal deployment and offline use. The model has only 0.1 billion parameters and needs no discrete GPU: it runs on the entry-level CPUs found in ordinary home computers, thin-and-light laptops, low-spec servers and edge or embedded devices, with low memory use and low resource consumption. After a simple download and local deployment it works completely offline, which avoids network latency, API rate limits and the risk of uploading private data.
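To give a feel for why a 0.1-billion-parameter model fits comfortably on a CPU-only machine, the following back-of-the-envelope sketch estimates the storage needed for the weights alone. It is an illustration, not a measurement: the numeric precisions listed are assumptions (the description does not say which one the project ships), and activations, caches and the audio tokenizer are ignored.

```python
# Rough weight-memory estimate for a 0.1-billion-parameter model.
# Assumption: only the weights are counted; runtime buffers are ignored.
PARAMS = 100_000_000  # 0.1 billion parameters, as stated in the description

# Bytes per parameter for some common precisions (assumed, not project-specific).
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_mb(params: int, dtype: str) -> float:
    """Return the approximate weight storage in megabytes (10**6 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1_000_000


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_mb(PARAMS, dtype):.0f} MB")
```

Even at full 32-bit precision the weights come to roughly 400 MB, a small fraction of the recommended 16GB of RAM.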
In terms of core functions and product advantages, it integrates comprehensive practical features:
At the technical level, the project combines a large language model (LLM) with a self-developed lightweight audio tokenizer, MOSS-Audio-Tokenizer-Nano. It uses a purely autoregressive audio generation framework instead of the traditional pairing of an acoustic model and a vocoder. By relying on the LLM's semantic understanding and on lightweight audio encoding, it substantially reduces hardware requirements and model size while keeping pronunciation natural and sound quality high, balancing light weight, operational stability and generation quality.
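The general shape of a purely autoregressive pipeline can be sketched with toy stand-ins. This is not the project's code or API: `toy_next_token`, `toy_decode`, `EOS` and `VOCAB_SIZE` are hypothetical names invented for illustration, the "model" is a random rule, and the "decoder" simply maps token ids to constant sample values. The sketch only shows the control flow: audio tokens are produced one at a time, each conditioned on the text and on the tokens generated so far, and a tokenizer decoder then turns them into a waveform.

```python
import random
from typing import Callable, List

EOS = 0          # hypothetical end-of-audio token id
VOCAB_SIZE = 16  # toy audio codebook size (assumed for illustration)


def toy_next_token(text_tokens: List[int], audio_tokens: List[int],
                   rng: random.Random) -> int:
    """Stand-in for the LLM: pick the next audio token given the context."""
    # Toy stopping rule: emit two audio tokens per text token, then stop.
    if len(audio_tokens) >= 2 * len(text_tokens):
        return EOS
    return rng.randrange(1, VOCAB_SIZE)


def generate_audio_tokens(text_tokens: List[int],
                          next_token: Callable[[List[int], List[int], random.Random], int],
                          max_len: int = 1000, seed: int = 0) -> List[int]:
    """Autoregressive loop: each new token depends on all previous ones."""
    rng = random.Random(seed)
    audio_tokens: List[int] = []
    while len(audio_tokens) < max_len:
        tok = next_token(text_tokens, audio_tokens, rng)
        if tok == EOS:
            break
        audio_tokens.append(tok)
    return audio_tokens


def toy_decode(audio_tokens: List[int], samples_per_token: int = 4) -> List[float]:
    """Stand-in for the tokenizer's decoder: token ids to samples in [-1, 1]."""
    wave: List[float] = []
    for tok in audio_tokens:
        wave.extend([tok / VOCAB_SIZE * 2 - 1] * samples_per_token)
    return wave


tokens = generate_audio_tokens([5, 6, 7], toy_next_token)
waveform = toy_decode(tokens)
print(len(tokens), len(waveform))
```

In a real system the stand-in functions would be replaced by the trained LLM and the audio tokenizer's decoder; the loop itself is what "pure autoregressive" refers to, with no separate acoustic model or vocoder stage.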
It suits a wide range of personal and lightweight development scenarios, including novel reading, dubbing short videos for self-media creators, building local voice assistants, adding voice features to mini-programs and lightweight software, voice announcements on offline devices, creative voice production, and multilingual listening and reading for education.