Features
Open Source, Foley Sound
System Requirements
Minimum 16GB RAM.
Note: The model requires significant storage space; at least 80GB of free disk space is recommended.
Windows 10/11 64-bit: NVIDIA GPU with 8GB+ VRAM required.
Note: On NVIDIA GPUs, update to a recent driver before running.
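The requirements above can be turned into a quick pre-flight check. The following is a minimal sketch, not part of ThinkSound: `check_requirements` and `free_disk_gb` are hypothetical helper names, and the RAM and VRAM figures are assumed to be entered by hand (for example, read from Task Manager or `nvidia-smi`), since only free disk space is measured here with the standard library.

```python
import shutil

# Thresholds taken from the requirements listed above.
MIN_RAM_GB = 16
MIN_FREE_DISK_GB = 80
MIN_VRAM_GB = 8


def check_requirements(ram_gb, free_disk_gb, vram_gb):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if ram_gb < MIN_RAM_GB:
        problems.append(f"RAM: {ram_gb:.1f} GB found, {MIN_RAM_GB} GB required")
    if free_disk_gb < MIN_FREE_DISK_GB:
        problems.append(
            f"Disk: {free_disk_gb:.1f} GB free, {MIN_FREE_DISK_GB} GB recommended"
        )
    if vram_gb < MIN_VRAM_GB:
        problems.append(f"VRAM: {vram_gb:.1f} GB found, {MIN_VRAM_GB} GB required")
    return problems


def free_disk_gb(path="."):
    """Free space on the drive that holds `path`, in GB (1 GB = 1024**3 bytes)."""
    return shutil.disk_usage(path).free / 1024**3


if __name__ == "__main__":
    # Hypothetical values: replace 32 and 12 with your machine's RAM and VRAM.
    for line in check_requirements(32, free_disk_gb(), 12) or ["All checks passed"]:
        print(line)
```

Run it from the drive where you plan to store the model so that the disk check measures the right volume.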
Introduction
ThinkSound is an "Any2Audio" generation and editing tool. It generates or modifies audio to match a video, a text prompt, or existing audio, and lets you refine the result step by step.
For example, given a video of an owl perching, preparing to take off, and then flying away, it can automatically generate the sounds of the owl chirping, its wings flapping, and the branches swaying. If the result is not detailed enough, you can click on the branches in the video to refine the sound of the branches swaying specifically, or enter text such as "add some other birds' chirps" to add new sounds while keeping the original owl's chirping.
**License**
Note: The code, models, and dataset are for research and educational purposes only. Commercial use is NOT permitted. For commercial licensing, please contact the authors.
1. Stable Audio Open VAE (by Stability AI): This project includes a fine-tuned VAE from Stable Audio Open, licensed under the Stability AI Community License. Commercial use and redistribution require prior permission from Stability AI.
2. All other code and models are released under the Apache License 2.0.
**Development Team**
The project is developed by the FunAudioLLM team. Its core authors include Huadai Liu and Jialei Wang.
**Underlying Technologies**
ThinkSound is implemented in PyTorch. Its core technologies include:
- Chain-of-Thought (CoT) reasoning from Multimodal Large Language Models (MLLMs): analyzes the request step by step, as a person would, and breaks audio generation down into stages;
- Flow matching: ensures the generated audio matches the inputs (e.g., video frames) in timing and detail;
- Integration of third-party components such as the Stable Audio Open VAE and an MM-DiT backbone to improve audio generation quality.
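To make the flow matching item above more concrete, here is a minimal NumPy sketch of the general technique. It assumes the common straight-line path between noise and data; the exact formulation ThinkSound uses is not described here. `predict_velocity` and `cond` are hypothetical stand-ins for the trained network and its conditioning features (such as video or text embeddings).

```python
import numpy as np


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1


def target_velocity(x0, x1):
    """Velocity of the straight path; this is the regression target."""
    return x1 - x0


def flow_matching_loss(predict_velocity, x0, x1, t, cond):
    """Mean squared error between predicted and target velocity at time t."""
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t, cond)
    return float(np.mean((v_pred - target_velocity(x0, x1)) ** 2))


def sample(predict_velocity, x0, cond, steps=10):
    """Integrate dx/dt = v(x, t, cond) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * predict_velocity(x, i * dt, cond)
    return x
```

During training, the loss is minimized over random draws of `x0`, `x1`, and `t`. At generation time, `sample` starts from noise and follows the learned velocity field, with `cond` steering the result toward audio that fits the input.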
**Technical Features**
1. **"One-stop" generation**: A unified framework supports audio generation from multiple inputs (video, text, audio, etc.) without switching tools;
2. **Interactive editing**: Supports step-by-step optimization of specific sounds via clicking on objects in videos (located by Grounded-SAM-2) or text instructions;
3. **Lightweight and efficient**: The model has been slimmed down, significantly reducing memory and GPU usage so that it runs smoothly on ordinary devices;
4. **User-friendly**: Provides Windows batch scripts and PyPI dependencies for one-click environment setup without complex configuration.
**Advantages**
Compared with traditional tools, it captures temporal information (e.g., the sequence of the owl's actions) and subtle details (e.g., the correlation between wing flapping and branch swaying) more accurately. It also lets users adjust audio step by step, like assembling building blocks, balancing professional-grade control with ease of use, and it achieves state-of-the-art performance in tasks such as video-to-audio (VT2A).