OmniVoice

Supporting 600+ languages, voice design, voice cloning, natural speech, and ultra-fast inference

Features

Open Source, TTS

System Requirements

16 GB RAM and at least 15 GB of free storage recommended.
macOS 15+: Supports both Intel and Apple M-series chips.
Windows 10/11: Intel and AMD GPUs are supported; an NVIDIA GPU is recommended.
Note: For NVIDIA GPUs, make sure a recent driver is installed.

Introduction

OmniVoice is an open-source, massively multilingual zero-shot text-to-speech (TTS) system developed by the k2-fsa team. The core team consists of key original developers behind the well-known open-source speech project Kaldi, led by Dr. Daniel Povey, who currently serves as Chief Speech Scientist at Xiaomi. The project is strongly supported by Xiaomi AI Lab and represents an important open-source achievement in Xiaomi’s intelligent speech technology research.

OmniVoice is designed for a wide range of scenarios, including global voice content generation, multilingual intelligent interaction, accessibility narration, video dubbing, virtual human voice synthesis, and content creation in dialects and low-resource languages. It addresses common limitations of traditional TTS systems: limited language support, poor performance on low-resource languages, unnatural timbre, and slow inference. Built on a novel architecture that combines diffusion models with language models, OmniVoice delivers highly natural speech synthesis with very fast inference. Its real-time factor (RTF, the ratio of synthesis time to the duration of the generated audio) can be as low as 0.025, meaning it can generate audio about 40 times faster than real time. This makes it suitable for high-concurrency, low-latency industrial deployment.
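The relationship between an RTF of 0.025 and the "about 40 times faster than real time" figure is simple arithmetic: the speedup is the reciprocal of the RTF. The sketch below illustrates this; the function names and the example timings are illustrative assumptions, not part of OmniVoice.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return RTF: time spent synthesizing divided by the duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(rtf: float) -> float:
    """Return how many seconds of audio are produced per second of compute (1 / RTF)."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Hypothetical timing: 0.25 s of compute to produce a 10 s clip.
rtf = real_time_factor(synthesis_seconds=0.25, audio_seconds=10.0)
print(f"RTF = {rtf:.3f}, speedup = {speedup_over_real_time(rtf):.0f}x real time")
```

An RTF below 1 means synthesis finishes before the audio would finish playing; the lower the value, the more concurrent requests a single machine can serve.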

One of its most remarkable advantages is its extensive language coverage, supporting more than 600 languages and dialects worldwide, ranging from major global languages to many low-resource and regional varieties, such as:

  • Chinese varieties: Mandarin, Cantonese, etc.
  • Major international languages: English, Spanish, French, German, Russian, Japanese, Korean, Arabic, Hindi, Portuguese, Italian, etc.
  • Regional and low-resource languages: Swahili, Hausa, Vietnamese, Thai, Indonesian, Urdu, and numerous indigenous languages from Africa, the Americas, and Oceania.

All languages are supported in a zero-shot manner: no additional fine-tuning is required for any language, and natural speech can be synthesized directly from text input.

In terms of advanced voice capabilities, OmniVoice provides a rich set of professional features:

  1. Voice Design: Without any reference audio, users can control voice attributes through text instructions, including gender, age, timbre style, pitch, speaking rate, emotional tone, accent strength, dialect features, and even qualities like whispering, deepness, or brightness, enabling highly customizable voice generation.
  2. Few-shot Voice Cloning: With only 3 to 10 seconds of a target speaker’s audio, the model can accurately clone their voice with high fidelity and stability, ideal for personalized broadcasting and virtual avatar dubbing.
  3. Fine-grained Pronunciation Control: It supports phonetic, phonemic, and pinyin annotations to precisely handle polyphonic characters, rare words, and loanwords. It also supports paralinguistic cues such as laughter, sighs, and pauses to make synthesized speech more expressive and human-like.
  4. Efficient Inference and Easy Deployment: The model supports FP16 / BF16 mixed-precision inference with moderate GPU memory usage. It provides a command-line interface, Python API, and a simple WebUI for easy integration into various applications.
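The 3 to 10 second reference length mentioned for voice cloning can be checked before a clip is handed to any cloning tool. The sketch below uses only Python's standard `wave` module, so it handles uncompressed WAV files only. All function names here, including the demo helper that writes a silent clip, are hypothetical and are not part of OmniVoice's API.

```python
import os
import tempfile
import wave

# Bounds taken from the "3 to 10 seconds" guidance in the text above.
MIN_SECONDS = 3.0
MAX_SECONDS = 10.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_suitable_reference(path: str,
                          min_seconds: float = MIN_SECONDS,
                          max_seconds: float = MAX_SECONDS) -> bool:
    """Return True if the clip length falls inside the recommended range."""
    return min_seconds <= wav_duration_seconds(path) <= max_seconds


def write_silent_wav(path: str, seconds: float, sample_rate: int = 16000) -> None:
    """Demo helper (assumption): write a mono 16-bit silent clip of the given length."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(seconds * sample_rate))


# Demo: create a 5-second clip in a temporary directory and check it.
with tempfile.TemporaryDirectory() as tmp_dir:
    clip_path = os.path.join(tmp_dir, "reference.wav")
    write_silent_wav(clip_path, seconds=5.0)
    print(f"{wav_duration_seconds(clip_path):.1f} s, suitable: {is_suitable_reference(clip_path)}")
```

A length check like this only validates duration; it says nothing about recording quality, background noise, or whether the clip contains a single speaker.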

In summary, OmniVoice is one of the most comprehensive open-source multilingual TTS systems available today, offering exceptional language coverage, high audio quality, fast inference, and rich functionality, making it valuable for both academic research and real-world industrial applications.