FunASR

Multilingual real-time and offline speech recognition that is easy to use and efficient

Features

Open Source ASR

System Requirements

16GB RAM recommended. 12GB+ storage recommended.
macOS 15+: Supports both Intel and M-series chips.
Windows 10/11: Intel/AMD GPUs supported, NVIDIA GPU recommended.
Note: For NVIDIA GPUs, install a newer driver.

Introduction

## I. Core Positioning

FunASR is an open-source fundamental speech recognition toolkit developed by Alibaba DAMO Academy. It aims to bridge academic research and industrial application, so that researchers and developers can conveniently carry out speech recognition research, put models into production, and help grow the speech recognition ecosystem.

## II. Core Features

- Comprehensive functions: One-stop coverage of speech tasks. Beyond speech-to-text, it handles speaker recognition, speaker diarization (distinguishing the voices of different speakers), voice activity detection (detecting valid speech), automatic punctuation, and speech emotion recognition, and it also supports keyword spotting and multilingual recognition and translation.
- Easy to get started: It can be called from a simple command line or a few lines of Python code, and it comes with detailed tutorials and examples, so beginners without a professional background can get started quickly.
- High-quality models: It includes a large number of industrial-grade pre-trained models trained on massive data. They are accurate and fast, and they can be deployed directly without training from scratch.
- Strong compatibility: It runs on Windows, macOS (including M1/M2 chips), Linux, and other systems, and supports CPU and GPU acceleration. It can process single audio files or transcribe speech in real time, covering both offline and real-time scenarios.

## III. Underlying Technology

- Built on the PyTorch deep learning framework, it mainly adopts non-autoregressive end-to-end speech recognition (for example, the Paraformer model), combines mainstream neural network structures such as Transformer and RNN-T, and integrates supporting modules for voice activity detection, punctuation restoration, and speaker recognition to form a complete speech processing chain.
- Training data support: Some core models are trained on hundreds of thousands of hours of industrial speech data (for example, the SenseVoice model, trained on 300,000 hours of data), which supports their usability in practical scenarios.

## IV. Core Function List

- Automatic Speech Recognition (ASR): Multilingual recognition (Chinese, English, Japanese, Korean, and others), real-time streaming recognition, and offline file recognition with timestamps.
- Auxiliary speech processing: Voice activity detection (keeping only valid speech), punctuation restoration, and inverse text normalization (for example, converting "one hundred and twenty-three" to "123").
- Speaker-related: Speaker verification and multi-speaker diarization (distinguishing who is speaking).
- Featured functions: Speech emotion recognition (angry, happy, neutral, sad, and others), keyword spotting, and multimodal audio-text interaction (Qwen-Audio model).
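
To make the inverse text normalization item in the function list concrete, here is a minimal sketch of the idea: rewriting a spoken-form English number as digits. This is a toy illustration using only the Python standard library, not FunASR's own ITN module; the function name `words_to_number` and the word tables are assumptions made for this example.

```python
# Toy inverse text normalization (ITN): spoken-form English cardinals -> digits.
# Illustrative only; this is NOT FunASR's implementation, and all names here
# are hypothetical.

UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {"thousand": 1_000, "million": 1_000_000}


def words_to_number(phrase: str) -> int:
    """Convert a phrase such as 'one hundred and twenty-three' to an int."""
    tokens = phrase.lower().replace("-", " ").split()
    total = 0    # sum of completed thousand/million groups
    current = 0  # value of the group currently being built (below 1000)
    for tok in tokens:
        if tok == "and":
            continue  # filler word in British-style number phrases
        if tok in UNITS:
            current += UNITS[tok]
        elif tok in TENS:
            current += TENS[tok]
        elif tok == "hundred":
            current = max(current, 1) * 100  # bare "hundred" means 100
        elif tok in SCALES:
            total += max(current, 1) * SCALES[tok]
            current = 0
        else:
            raise ValueError(f"not a number word: {tok!r}")
    return total + current


if __name__ == "__main__":
    print(words_to_number("one hundred and twenty-three"))  # 123
```

A production ITN component also has to locate number phrases inside running text and handle dates, currencies, ordinals, and other languages, which this sketch deliberately leaves out.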