Supports high-quality text-to-speech (TTS) and zero-shot voice cloning with extremely high timbre similarity.
16 GB RAM and 40 GB+ of storage recommended.
macOS 15+: Supports both Intel and M-series chips.
Windows 10/11: Intel/AMD GPUs supported, NVIDIA GPU recommended.
Note: For NVIDIA GPUs, update to a recent driver.
Mac computers with Apple M-series chips can run MLX-optimized AI models, leveraging unified memory and GPU/NPU acceleration for significantly faster content generation.
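The storage recommendation above can be checked before installing. The following is a minimal sketch using only the Python standard library; it assumes the 40 GB figure refers to free space on the install volume, and the function names are illustrative rather than part of the product.

```python
import shutil

REQUIRED_FREE_GB = 40  # storage figure from the requirements above (assumed to mean free space)


def free_space_gb(path="."):
    """Return free space, in GB (10**9 bytes), on the volume containing `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_storage(path=".", required_gb=REQUIRED_FREE_GB):
    """True if the volume containing `path` has at least `required_gb` GB free."""
    return free_space_gb(path) >= required_gb


if __name__ == "__main__":
    free = free_space_gb(".")
    status = "OK" if has_enough_storage(".") else "below the recommended 40 GB"
    print(f"Free space: {free:.1f} GB ({status})")
```

Pass the intended install directory as `path` to check a different volume.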
LongCat-AudioDiT is an open-source, high-fidelity text-to-speech (TTS) project based on the DiT architecture and diffusion models, developed by the Meituan LongCat Team. It focuses on generating natural, smooth, and highly realistic speech, and on high-quality zero-shot voice cloning. Unlike traditional speech generation models, it does not rely on intermediate acoustic features such as mel-spectrograms; instead, it generates speech directly in the waveform latent space, which improves generation efficiency while significantly enhancing naturalness, clarity, and timbre fidelity.
High-Quality Text-to-Speech: Supports both Chinese and English text input and directly generates high-sample-rate, high-fidelity natural speech with smooth intonation, accurate pronunciation, and minimal artifacts, close to human-recorded quality.
Powerful Zero-Shot Voice Cloning (Core Highlight): Given only a very short audio clip from a target speaker, and without fine-tuning or large-scale training data, it reproduces the speaker's timbre, intonation, and other vocal characteristics. It achieves extremely high timbre similarity and strong prosodic consistency, so the generated speech is very close to the original speaker's voice and difficult to distinguish from it.
High Audio Quality & Stability: Adopts an improved diffusion generation strategy to reduce distortion, noise, and discontinuities in speech. The output is stable, coherent, and rich in detail, suitable for scenarios with strict sound-quality requirements.
Batch Inference & Research Deployment: Supports batch speech generation for model evaluation, dataset synthesis, and industrial deployment. The full code and inference pipeline are open source, facilitating further research and development.
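The timbre similarity mentioned under zero-shot voice cloning is commonly quantified as the cosine similarity between speaker embeddings of the reference clip and the generated speech. The sketch below shows only that final comparison step. It assumes the embeddings come from a separate speaker-verification model (not shown), and the vectors are made-up placeholders, not measured results.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Placeholder vectors standing in for speaker embeddings (made-up numbers).
reference_embedding = [0.12, -0.40, 0.88, 0.05]
cloned_embedding = [0.10, -0.35, 0.90, 0.07]
score = cosine_similarity(reference_embedding, cloned_embedding)
```

A score near 1 means the two embeddings point in almost the same direction; how that maps to perceived similarity depends on the embedding model used.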
The model uses DiT (Diffusion Transformer) as its backbone network, combined with waveform latent space modeling. It applies Adaptive Projection Guidance (APG) to optimize the generation process, addressing the training-inference mismatch found in traditional diffusion models, and reports state-of-the-art (SOTA) results on public speech benchmarks.
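To make the guidance step concrete, here is a minimal sketch of APG as it is generally described: the classifier-free guidance update is split into components parallel and orthogonal to the conditional prediction, and the parallel part is down-weighted. Predictions are represented as flat Python lists to avoid dependencies. The function name, parameter names, and default values are illustrative assumptions, the momentum term of the full method is omitted, and nothing here is taken from the LongCat-AudioDiT code.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def apg_guidance(pred_cond, pred_uncond, guidance_scale=4.0, eta=0.0, norm_threshold=None):
    """Combine conditional and unconditional predictions with APG-style guidance.

    With eta=1.0 and norm_threshold=None this reduces to classifier-free guidance.
    Default values are arbitrary illustrative choices.
    """
    # Guidance direction: the same difference used by classifier-free guidance.
    diff = [c - u for c, u in zip(pred_cond, pred_uncond)]

    # Optional rescaling: cap the norm of the update at norm_threshold.
    if norm_threshold is not None:
        diff_norm = math.sqrt(_dot(diff, diff))
        if diff_norm > norm_threshold:
            scale = norm_threshold / diff_norm
            diff = [d * scale for d in diff]

    # Split the update into parts parallel and orthogonal to pred_cond.
    cond_sq = _dot(pred_cond, pred_cond)
    coeff = _dot(diff, pred_cond) / cond_sq if cond_sq > 0.0 else 0.0
    parallel = [coeff * c for c in pred_cond]
    orthogonal = [d - p for d, p in zip(diff, parallel)]

    # Keep the orthogonal part; scale the parallel part by eta.
    update = [o + eta * p for o, p in zip(orthogonal, parallel)]
    return [c + (guidance_scale - 1.0) * u for c, u in zip(pred_cond, update)]
```

In a real sampler this function would be called once per denoising step on the model's two predictions; setting `eta` below 1 is what distinguishes it from plain classifier-free guidance.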
Use cases: short video dubbing, audiobook production, VTuber voice generation, personalized voices for smart assistants, film and animation dubbing, speech data synthesis, voice interaction systems, AI voice cloning, and speech technology research.