Update (2026-03-21): SVC model support is now available. Click "Update" to get the latest version and enable this feature.
SoulX-Singer is an open-source project for high-quality zero-shot Singing Voice Synthesis (SVS), designed for real-world application scenarios and developed by Soul AI Lab under Soul App. It addresses common limitations of traditional SVS systems, namely the need to fine-tune for each specific singer, weak multilingual adaptation, and limited controllability, and it can generate high-fidelity singing in the timbre of singers not seen during training. Users without professional audio production knowledge can create and edit singing in many styles through simple operations.
Core Functions
- Zero-shot Singing Generation: Generates high-fidelity singing for new singers without fine-tuning the model, removing the fixed-timbre limitation of traditional models;
- Precise Dual-mode Control: Supports two driving modes, MIDI score and melody (F0 contour), for precise control over pitch, rhythm, and singing expression, which suits both creating songs from scratch and covering or style-transferring existing melodies;
- Multilingual Synthesis: Supports singing generation in Mandarin, English, and Cantonese, along with cross-lingual timbre cloning that preserves a singer's distinctive vocal timbre;
- Singing Editing and Timbre Replication: Keeps singing prosody natural when lyrics are modified, and replicates a singer's timbre across languages and singing styles for personalized singing edits;
- Convenient Operation Methods: Provides a local WebUI, a Hugging Face online demo, and a MIDI editor, supporting both local deployment and online use to serve professional developers and ordinary users alike.
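To make the relationship between the two driving modes concrete, here is a minimal sketch of how a MIDI-style score could be expanded into a frame-level F0 contour. It is an illustration only, not SoulX-Singer's actual preprocessing: the function names, the (note, duration) score format, and the 100 frames-per-second rate are assumptions made for this example. The only fixed fact used is the standard equal-temperament mapping, in which MIDI note 69 is A4 at 440 Hz.

```python
import numpy as np

A4_HZ = 440.0      # reference tuning for MIDI note 69
FRAME_RATE = 100   # frames per second; an assumed value for illustration


def midi_to_hz(note):
    """Equal-temperament conversion from a MIDI note number to Hz."""
    return A4_HZ * 2.0 ** ((note - 69) / 12.0)


def notes_to_f0(notes, frame_rate=FRAME_RATE):
    """Expand (midi_note, duration_seconds) pairs into a frame-level F0 contour.

    A midi_note of None marks a rest and is rendered as 0 Hz (unvoiced).
    This hypothetical helper produces a flat pitch per note; a real melody
    contour extracted from audio would also carry vibrato and glides.
    """
    frames = []
    for note, duration in notes:
        n_frames = int(round(duration * frame_rate))
        hz = 0.0 if note is None else midi_to_hz(note)
        frames.append(np.full(n_frames, hz))
    return np.concatenate(frames) if frames else np.zeros(0)


# A4 for 0.5 s, a 0.25 s rest, then A5 for 0.5 s.
score = [(69, 0.5), (None, 0.25), (81, 0.5)]
f0 = notes_to_f0(score)
```

The sketch also shows why the two modes suit different tasks: a score gives discrete, editable notes, while an F0 contour can carry the expressive detail of an existing performance.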
Target Users
Music creators, content producers, AI developers, university researchers, and music lovers who want to create personalized singing or covers.
Application Scenarios
Virtual singer creation, UGC music creation, song covers and style adaptation, multilingual song production, and audio content creation. It is also suitable for academic research and technical development in the field of singing voice synthesis.
Underlying Technology and Training Foundation
- Core Technology: Uses a generative modeling paradigm based on Flow Matching and models singing voice synthesis as an audio infilling task. A fine-grained note-level alignment mechanism matches lyrics, MIDI notes, and acoustic features precisely, which allows individual notes to be controlled and edited independently. The project also draws on open-source projects such as F5-TTS and Amphion, and integrates established audio processing techniques such as speech separation, dereverberation, and fundamental frequency extraction.
- Training Data: Trained on more than 42,000 hours of high-quality aligned singing data in Mandarin, English, and Cantonese, covering a variety of timbres and singing styles, which underpins the stability and generalization of zero-shot synthesis.
- Deployment Support: Developed with Python 3.10, with support for Conda environment deployment, a complete pre-trained model and preprocessing pipeline, and compatibility with the Hugging Face ecosystem for a quick online trial.
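The idea of treating synthesis as audio infilling under Flow Matching can be sketched as follows. This is a generic, simplified illustration and not the project's implementation: it assumes a straight-line path between noise and data, a mel-spectrogram-like array as the audio representation, and hypothetical function names (fm_training_pair, masked_mse, euler_sample). A real system would replace velocity_fn with a trained network that is also conditioned on lyrics and notes.

```python
import numpy as np


def fm_training_pair(x1, mask, rng):
    """Build one flow-matching training example for an infilling task.

    x1:   target features, shape (frames, bins)
    mask: boolean, shape (frames,), True where audio must be generated
    Returns the noisy input x_t, the time t, the visible context, and the
    target velocity (x1 - x0) that a network would regress on masked frames.
    """
    x0 = rng.standard_normal(x1.shape)            # noise sample
    t = rng.uniform()                             # time in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1                 # straight-line path (assumed)
    context = np.where(mask[:, None], 0.0, x1)    # prompt frames stay visible
    velocity = x1 - x0
    return x_t, t, context, velocity


def masked_mse(pred, target, mask):
    """Mean squared error over the masked (to-be-generated) frames only."""
    diff = (pred - target)[mask]
    return float(np.mean(diff ** 2))


def euler_sample(velocity_fn, context, mask, rng, steps=8):
    """Integrate dx/dt = velocity_fn(x, t, context) from t=0 to t=1.

    Generated frames fill the masked region; visible frames are copied
    from the context unchanged.
    """
    x = rng.standard_normal(context.shape)
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt, context)
    return np.where(mask[:, None], x, context)
```

In this framing, zero-shot generation corresponds to keeping the reference singer's frames visible as context and masking the frames to be sung, so the model fills in new audio that continues the prompt's timbre.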
License and Usage Guidelines
SoulX-Singer is released under the Apache 2.0 open-source license, and researchers and developers can use the code and model weights free of charge. It is intended only for academic research, education, legitimate personalized creation, and similar scenarios. Imitating other people's voices or producing deceptive audio without authorization is prohibited, and the developers shall not be liable for misuse of the model.