Creates high-quality, realistic sound effects from text or video. Only English prompts are supported.
16GB RAM recommended. 40GB+ storage recommended.
macOS 15+: M-series chips required.
Windows 10/11 64-bit: NVIDIA GPU with 8GB+ VRAM required.
Note: For NVIDIA GPUs, make sure a recent driver is installed.
Legal Restrictions & Permitted Scenarios: This project (Woosh) and its model weights are governed by the CC BY-NC 4.0 license. Commercial, for-profit use is strictly prohibited.
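The system requirements above can be written down as a small pre-flight check. The sketch below is only an illustration: `MachineSpec` and `check_requirements` are hypothetical names, not part of Woosh, and the caller is assumed to fill in the machine details by whatever means is available. Hard requirements are reported as errors and the RAM and storage recommendations as warnings.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MachineSpec:
    """Hypothetical description of the target machine (values supplied by the caller)."""
    os_name: str                 # "macOS" or "Windows"
    os_version: int              # major version: 15 for macOS 15, 10 or 11 for Windows
    ram_gb: float
    free_storage_gb: float
    apple_silicon: bool = False  # True for M-series chips
    is_64bit: bool = True
    nvidia_vram_gb: float = 0.0  # 0 means no NVIDIA GPU present


def check_requirements(spec: MachineSpec) -> Tuple[List[str], List[str]]:
    """Return (errors, warnings) for the requirements listed in this description."""
    errors: List[str] = []
    warnings: List[str] = []

    if spec.os_name == "macOS":
        if spec.os_version < 15:
            errors.append("macOS 15 or later is required")
        if not spec.apple_silicon:
            errors.append("an M-series chip is required")
    elif spec.os_name == "Windows":
        if spec.os_version not in (10, 11) or not spec.is_64bit:
            errors.append("Windows 10/11 64-bit is required")
        if spec.nvidia_vram_gb < 8:
            errors.append("an NVIDIA GPU with 8GB+ VRAM is required")
    else:
        # Assumption: platforms not listed in the requirements are treated as unsupported.
        errors.append(f"{spec.os_name} is not listed in the requirements")

    # RAM and storage are recommendations, so they only produce warnings.
    if spec.ram_gb < 16:
        warnings.append("16GB RAM is recommended")
    if spec.free_storage_gb < 40:
        warnings.append("40GB+ storage is recommended")

    return errors, warnings


mac_ok = MachineSpec("macOS", 15, ram_gb=32, free_storage_gb=200, apple_silicon=True)
win_low = MachineSpec("Windows", 11, ram_gb=8, free_storage_gb=100, nvidia_vram_gb=6)
mac_result = check_requirements(mac_ok)
win_result = check_requirements(win_low)
```

Keeping the check as a pure function over a plain data object means the rules can be tested without touching the real hardware.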
Woosh is an open‑source sound effects foundation model developed by Sony Research. It focuses on two core scenarios: text‑to‑audio (T2A) and video‑to‑audio (V2A). Designed for content creators, developers, and general users, it can quickly generate high‑quality, realistic environmental sounds, action sounds, and special effects.
Currently, this project only supports English prompts and does not accept Chinese input.
Woosh is built on latent diffusion models (LDM), flow matching, and multimodal alignment to balance quality and speed. It supports local inference, a Gradio web demo, and API deployment.
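To give a feel for what flow matching means at generation time, here is a minimal, generic sketch of the sampling loop: start from random noise and follow a velocity field from t = 0 to t = 1 with fixed Euler steps. This is not Woosh's code. In a real system the velocity would come from a trained network conditioned on the text or video prompt, and the result would be a latent that a decoder turns into audio. Here `make_toy_velocity` is a hypothetical stand-in that simply points along a straight line toward a fixed target vector.

```python
import random
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def euler_sample(velocity: VelocityFn, x0: Vector, num_steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 using fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the toy field never divides by zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> VelocityFn:
    """Stand-in for a trained model: velocity of the straight path from x toward target."""
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4)]  # starting point: Gaussian noise
target_latent = [0.5, -1.0, 2.0, 0.0]            # made-up "latent" for illustration
sample = euler_sample(make_toy_velocity(target_latent), noise, num_steps=8)
```

Because the toy field follows a straight line, a handful of Euler steps is enough to land on the target. Learned velocity fields are not exactly straight, so the number of steps becomes a trade-off between quality and speed.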
Typical use cases include short-video dubbing, game sound design, film post-production, podcast creation, interactive media, and AI application development.