Qwen3-ASR

Multilingual support for 52 languages/dialects and exceptional robustness in song and contextual transcription

Features

Open Source, ASR

System Requirements

16GB RAM recommended. 25GB+ storage recommended.
macOS 15+: Supports both Intel and Apple silicon (M-series) chips.
Windows 10/11: Intel and AMD GPUs are supported; an NVIDIA GPU is recommended.
Note: For NVIDIA GPUs, update to a recent driver version before running the model.
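
As a quick way to check the storage recommendation above before installing, here is a minimal sketch using only the Python standard library. The helper names (`free_bytes_at`, `meets_storage_recommendation`) are hypothetical and are not part of Qwen3-ASR. The sketch also assumes that "25GB" means decimal gigabytes (10^9 bytes).

```python
import shutil

# Assumption: "GB" in the requirements means decimal gigabytes (10**9 bytes).
GB = 10**9


def free_bytes_at(path: str = ".") -> int:
    """Return the free space, in bytes, on the volume that contains `path`."""
    return shutil.disk_usage(path).free


def meets_storage_recommendation(free_bytes: int, recommended_gb: float = 25) -> bool:
    """Return True if `free_bytes` covers the recommended storage (25 GB by default)."""
    return free_bytes >= recommended_gb * GB


free = free_bytes_at(".")
status = "OK" if meets_storage_recommendation(free) else "below the 25 GB recommendation"
print(f"Free space: {free / GB:.1f} GB ({status})")
```

Pass the directory where the model files will be stored instead of `"."` to check the relevant volume. Installed RAM is not checked here because the standard library has no portable call for it.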

Introduction

Qwen3-ASR is an open-source Automatic Speech Recognition (ASR) model series developed by the Alibaba Qwen Team. More than just a transcription tool, it serves as an "intelligent ear" integrated with the reasoning power of large language models.

User-Friendly Features:

  • Multilingual Expertise: It supports 52 languages and dialects in total, including 30 global languages and 22 Chinese dialects, along with various international English accents.
  • Beyond Speech: It excels at transcribing lyrics from songs (even with heavy background music), fast-paced rap, and complex multi-speaker conversations.
  • Contextual Intelligence: Users can provide "hotwords" or reference text to guide the model, which helps it transcribe specialized terms and names correctly.
  • Ultra-Fast Performance: The 0.6B version is highly efficient, capable of processing over 5 hours of audio in just 10 seconds.
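
To put the throughput figure above in perspective, the short sketch below converts "5 hours of audio in 10 seconds" into a speed-up relative to real time. The function name `real_time_speedup` is illustrative only. The inputs are the figures quoted above, not measurements, and actual speed will depend on hardware and batch settings.

```python
def real_time_speedup(audio_seconds: float, processing_seconds: float) -> float:
    """Return how many seconds of audio are transcribed per second of processing."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


audio_seconds = 5 * 60 * 60   # 5 hours of audio, as quoted above
processing_seconds = 10       # 10 seconds of processing, as quoted above

speedup = real_time_speedup(audio_seconds, processing_seconds)
print(f"{speedup:.0f}x real time")  # prints "1800x real time"
```

In other words, the quoted figure corresponds to processing audio at least 1,800 times faster than its playback duration.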

Detailed Language & Dialect Support:

  • 30 Global Languages: Including Mandarin, English (US/UK/Regional), Japanese, Korean, French, German, Italian, Spanish, Portuguese, Russian, Arabic, Hindi, Indonesian, Dutch, Thai, Turkish, Vietnamese, and more.
  • 22 Chinese Dialects: Covering Cantonese (Guangdong/HK), Wu, Minnan (Hokkien), Sichuanese, Dongbei, Tianjin, Hebei, Shandong, Shanxi, Shaanxi, Gansu, Ningxia, Henan, Hubei, Hunan, Jiangxi, Zhejiang, Anhui, Guizhou, Yunnan, Fujian dialects, etc.

Underlying Technology:

The project is built upon the Qwen3-Omni multimodal foundation model. Its architecture pairs an AuT (Audio Transformer) encoder with the Qwen3 Large Language Model (LLM): the encoder turns audio into acoustic representations, and the LLM generates the transcript from them. This hybrid approach combines acoustic precision with semantic reasoning, maintaining state-of-the-art (SOTA) accuracy in noisy and complex environments.