Woosh


Creates high-quality, realistic sound effects from text or video. Only English prompts are supported.

Features

Open Source, Foley Sound

System Requirements

16GB RAM recommended. 40GB+ storage recommended.
macOS 15+: M-series chips required.
Windows 10/11 64-bit: NVIDIA GPU with 8GB+ VRAM required.
Note: For NVIDIA GPUs, update to a recent driver before running the model.
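The requirements above can be written as a small pre-flight check. This is a minimal sketch, not part of Woosh: the `Machine` class, its field names, and both functions are hypothetical, and the values are entered by hand rather than detected from the system. It treats the OS, chip, and VRAM lines as hard requirements and the RAM and storage lines as recommendations, following the wording of the list.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Machine:
    """Hypothetical description of a host machine (illustrative fields only)."""
    os_name: str                  # "macos" or "windows"
    os_major: int                 # e.g. 15 for macOS 15, 11 for Windows 11
    ram_gb: float
    free_storage_gb: float
    apple_silicon: bool = False   # True for M-series chips
    nvidia_vram_gb: float = 0.0   # 0 means no NVIDIA GPU


def unmet_requirements(m: Machine) -> list[str]:
    """Return the hard requirements from the list above that the machine fails."""
    problems = []
    if m.os_name == "macos":
        if m.os_major < 15:
            problems.append("macOS 15 or later is required")
        if not m.apple_silicon:
            problems.append("macOS needs an M-series chip")
    elif m.os_name == "windows":
        if m.os_major not in (10, 11):
            problems.append("Windows 10 or 11 (64-bit) is required")
        if m.nvidia_vram_gb < 8:
            problems.append("Windows needs an NVIDIA GPU with 8GB+ VRAM")
    else:
        problems.append(f"unsupported OS: {m.os_name}")
    return problems


def recommendations(m: Machine) -> list[str]:
    """Return the recommended (not mandatory) thresholds the machine misses."""
    notes = []
    if m.ram_gb < 16:
        notes.append("16GB RAM recommended")
    if m.free_storage_gb < 40:
        notes.append("40GB+ storage recommended")
    return notes


if __name__ == "__main__":
    laptop = Machine("windows", 11, ram_gb=8, free_storage_gb=20, nvidia_vram_gb=6)
    print(unmet_requirements(laptop))
    print(recommendations(laptop))
```

A machine that returns an empty list from `unmet_requirements` meets the stated minimums. Anything returned by `recommendations` is advisory only.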

Introduction

Legal Restrictions & Permitted Scenarios: This project (Woosh) and its model weights are licensed under CC BY-NC 4.0. Commercial, for-profit use is strictly prohibited.

  • Permitted Use: Limited to academic research, personal learning, technical testing, and non-profit public demonstrations.
  • Prohibited Use: Paid video production, commercial advertisements, integration into paid software or platforms, and any other commercial activity that directly or indirectly generates financial gain. For commercial inquiries, please contact the rights holder, Sony Research.

Woosh is an open‑source sound effects foundation model developed by Sony Research. It focuses on two core scenarios: text‑to‑audio (T2A) and video‑to‑audio (V2A). Designed for content creators, developers, and general users, it can quickly generate high‑quality, realistic environmental sounds, action sounds, and special effects.

Core Functions

  1. Text-to-Audio (T2A): Input a text prompt (e.g., “rain,” “footsteps,” “explosion”) to generate matching realistic sound effects.
  2. Video-to-Audio (V2A): Upload a video, and the model automatically generates synchronized audio aligned with visuals; text prompts can refine style and details.
  3. Basic Components:
     • Audio encoder/decoder (Woosh‑AE): High‑fidelity compression and reconstruction for stable generation.
     • Text‑audio alignment model (Woosh‑CLAP): Connects text descriptions to corresponding sounds accurately.
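To illustrate what a text-audio alignment model does, here is a minimal sketch of the general CLAP-style idea: text and audio are mapped into a shared embedding space, and a prompt is matched to the sound whose embedding is most similar. The three-dimensional vectors and function names below are made up for illustration. They are not Woosh-CLAP outputs or its API, and the real model's embedding size and scoring details are not stated on this page.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_sounds(text_vec, audio_vecs):
    """Sort sound names from most to least similar to the text embedding."""
    scores = {name: cosine_similarity(text_vec, vec) for name, vec in audio_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings (assumed values, chosen by hand for the example).
AUDIO_EMBEDDINGS = {
    "rain": [0.9, 0.1, 0.0],
    "footsteps": [0.1, 0.9, 0.1],
    "explosion": [0.0, 0.2, 0.9],
}
TEXT_EMBEDDING_RAIN = [0.8, 0.2, 0.1]  # stands in for an encoded "rain" prompt

if __name__ == "__main__":
    print(rank_sounds(TEXT_EMBEDDING_RAIN, AUDIO_EMBEDDINGS))
```

In a generation pipeline, an embedding like this would typically be used to condition the generator rather than to look up an existing clip. The ranking here only shows how the shared space relates prompts to sounds.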

Important Reminder

Currently, this project only supports English prompts and does not accept Chinese input.
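Because Chinese input is not accepted, a front end could screen prompts before sending them to the model. The sketch below is a hypothetical helper, not part of Woosh. It only detects characters in the main CJK ideograph Unicode blocks, so it does not prove that a prompt is English.

```python
def contains_chinese(prompt: str) -> bool:
    """True if the prompt has a character in the CJK Unified Ideographs
    block (U+4E00..U+9FFF) or Extension A (U+3400..U+4DBF)."""
    return any(
        "\u4e00" <= ch <= "\u9fff" or "\u3400" <= ch <= "\u4dbf"
        for ch in prompt
    )


def validate_prompt(prompt: str) -> str:
    """Return the stripped prompt, or raise ValueError if it is empty
    or contains Chinese characters."""
    cleaned = prompt.strip()
    if not cleaned:
        raise ValueError("prompt is empty")
    if contains_chinese(cleaned):
        raise ValueError("only English prompts are supported")
    return cleaned


if __name__ == "__main__":
    print(validate_prompt("  heavy rain on a tin roof  "))
```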

Underlying Technologies

Woosh is built on Latent Diffusion Models (LDM), Flow Matching, and multimodal alignment, balancing quality and speed. It supports local inference, a Gradio web demo, and API deployment.
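As a rough picture of how Flow Matching produces a sample, the sketch below integrates an ordinary differential equation dx/dt = v(x, t) from t = 0 (noise) to t = 1 (data) with fixed Euler steps. It is a toy under stated assumptions: the velocity function is a hand-written stand-in that points straight at a known target, whereas a real model would use a trained network conditioned on the text or video, and the result would be a latent that the audio decoder turns into a waveform. All names here are hypothetical.

```python
def make_toy_velocity(target):
    """Stand-in for a learned velocity field: points from x to a fixed target.

    For the straight-line path x_t = (1 - t) * x0 + t * x1, the velocity
    (x1 - x_t) / (1 - t) equals the constant x1 - x0.
    """
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for xi, ti in zip(x, target)]
    return velocity


def euler_sample(velocity, x0, steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 in equal Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # always below 1, so the toy velocity never divides by zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    noise = [0.3, -1.2, 0.7]       # would be random noise in practice
    target = [1.0, -2.0, 0.5]      # assumed "data" point for the toy field
    print(euler_sample(make_toy_velocity(target), noise, steps=8))
```

Because the toy path is a straight line, a handful of steps already lands on the target. The number of steps is the knob that trades generation speed against accuracy when the learned field is not perfectly straight.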

Typical Scenarios

Short‑video dubbing, game sound design, film post‑production, podcast creation, interactive media, and AI application development.

Key Features

  • Open‑source: Developed by Sony Research with public code and weights.
  • Dual capabilities: Supports both text and video inputs for sound effects.
  • Easy to use: Local scripts, web demo, and API for beginners and developers.
  • Professional quality: Studio‑grade output for content creation and industrial use.