SenseVoice

Multilingual speech recognition, emotion & audio event detection—efficient and accurate

Features

Open Source, ASR

System Requirements

8GB RAM recommended. 10GB+ storage recommended.
macOS 15+: Supports both Intel and M-series chips.
Windows 10/11: Intel/AMD GPUs supported, NVIDIA GPU recommended.
Note: For NVIDIA GPUs, update to a recent driver version.

Introduction

## I. Core Positioning

SenseVoice is an all-in-one speech foundation model developed by the Alibaba team. Built on the FunASR speech recognition toolkit, it integrates four core capabilities: Automatic Speech Recognition (ASR), Spoken Language Identification (LID), Speech Emotion Recognition (SER), and Audio Event Detection (AED). It acts like an "all-round speech assistant" that can understand, identify, and analyze speech in multiple languages.

## II. Core Features

1. **Multilingual Speech-to-Text**: Supports over 50 languages, including Chinese (including Cantonese), English, Japanese, and Korean. It can accurately transcribe daily conversations, meeting recordings, and foreign-language content, with recognition performance exceeding that of the well-known Whisper model.
2. **Speech Emotion Recognition**: Automatically detects emotions in speech such as happiness, sadness, anger, and neutrality. It can capture subtle emotional changes even in movie lines and daily chats.
3. **Audio Event Detection**: Identifies common sounds in audio, such as background music (BGM), applause, laughter, crying, coughing, and sneezing, making it easy to detect key audio events in the environment.
4. **Ultra-Fast Processing**: Takes only 70 ms to process 10 seconds of audio, 15 times faster than Whisper-Large. Even long audio files are processed quickly.
5. **Easy to Use**: Supports all audio formats. It can be operated through a simple web interface or integrated into your own projects through code, and it is compatible with multiple programming languages, including Python, C++, and Java.

## III. Technical Highlights & Advantages

- Solid Training Data: Trained on over 400,000 hours of speech data, ensuring high accuracy in multilingual recognition.
- Efficient Architecture: The small model (SenseVoice-Small) adopts a non-autoregressive end-to-end framework. With a parameter count (234M) similar to Whisper-Small, its inference is more than 5 times faster.
- Fine-Tuning Support: Provides ready-to-use fine-tuning scripts. Users can adapt the model to their business scenarios (e.g., industry-specific terminology, dialects) to meet special needs.
- Flexible Deployment: Supports local operation, API calls, web interaction, and export to formats such as ONNX and Libtorch. It is compatible with various devices such as GPUs, mobile phones, and development boards.

## IV. Application Scenarios

- Daily office work (meeting recording transcription)
- Customer service quality inspection (identifying customer emotions)
- Content moderation (detecting special events in audio)
- Multilingual communication (real-time translation and transcription)
- Smart device interaction (recognizing environmental sounds and voice commands)
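The four capabilities described under Core Features (language, emotion, audio event, transcript) typically arrive together in a single result. As a rough illustration of how such a combined result might be consumed in your own code, here is a minimal Python sketch. It assumes the model returns a string in which `<|...|>` tags precede the transcript, for example `<|en|><|HAPPY|><|Laughter|><|withitn|>text`. That layout, the label sets, and the names `ParsedResult` and `parse_tagged` are assumptions made for this example and are not taken from this page, so check them against the actual model output before relying on them.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed layout: zero or more "<|...|>" tags followed by the transcript.
TAG_RE = re.compile(r"<\|([^|]*)\|>")

# Hypothetical label sets for illustration; the real model may emit others.
LANGUAGES = {"zh", "yue", "en", "ja", "ko"}
EMOTIONS = {"HAPPY", "SAD", "ANGRY", "NEUTRAL"}
EVENTS = {"BGM", "Applause", "Laughter", "Cry", "Cough", "Sneeze"}


@dataclass
class ParsedResult:
    language: Optional[str] = None
    emotion: Optional[str] = None
    events: List[str] = field(default_factory=list)
    text: str = ""


def parse_tagged(raw: str) -> ParsedResult:
    """Split a tagged result string into language, emotion, events, and text."""
    result = ParsedResult()
    for tag in TAG_RE.findall(raw):
        if tag in LANGUAGES:
            result.language = tag
        elif tag in EMOTIONS:
            result.emotion = tag
        elif tag in EVENTS:
            result.events.append(tag)
        # Tags outside the sets above are ignored in this sketch.
    # Remove every tag and keep only the transcript text.
    result.text = TAG_RE.sub("", raw).strip()
    return result


if __name__ == "__main__":
    sample = "<|en|><|HAPPY|><|Laughter|><|withitn|>Nice to meet you."
    print(parse_tagged(sample))
```

Keeping the label sets as plain Python sets makes it easy to extend them once the actual tags produced by your installation are known.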