Automatic Speech Recognition (ASR)

Definition

Automatic speech recognition (ASR) is a technology that converts spoken audio into text. It is used for tasks such as meeting transcription and subtitle generation, and it is a key component of voice-based AI experiences.

Instantly transcribing a meeting recording. Automatically adding subtitles to a video. Sending a message just by talking to your phone -- all of these are powered by speech recognition technology. Automatic Speech Recognition (ASR) is a technology that automatically converts human speech into text data.

Whisper: A Leap in Accuracy

Speech recognition has a long history, but dramatic improvements in accuracy are a recent development. A notable example is Whisper, released by OpenAI in 2022: a general-purpose model trained on 680,000 hours of multilingual audio data that supports speech recognition in many languages, including Japanese. Its code and model weights are released under the open-source MIT license, so it can be run locally at no licensing cost. This has led to widespread adoption, from individual developers to enterprises.

The Speech-to-Text Conversion Process

ASR processing involves multiple steps. First, the audio signal is divided into short, overlapping time frames, and acoustic features (such as a mel spectrogram) are extracted from each frame. Then, a trained neural network converts these features into phonemes or tokens, ultimately outputting text. End-to-end models like Whisper map the acoustic features to text with a single neural network. This makes the system simpler than traditional pipelines built from separately trained modules and, given enough training data, often more accurate.
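The feature-extraction step described above can be sketched as follows. This is a minimal, standard-library-only illustration, not how Whisper or any particular toolkit implements it: the function names, the 25 ms frame, the 10 ms hop, and the 8 mel bands are assumptions chosen for readability, and the naive O(N^2) DFT stands in for the FFT that a real system would use.

```python
import math

def hz_to_mel(hz):
    # Common HTK-style mel formula (one of several conventions).
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def frame_signal(samples, frame_len, hop_len):
    """Split samples into overlapping frames; a trailing partial frame is dropped."""
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop_len)]

def power_spectrum(frame):
    """Apply a Hann window, then return the power in DFT bins 0..N/2 (naive DFT)."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(frame)]
    power = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(windowed))
        im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(windowed))
        power.append(re * re + im * im)
    return power

def mel_filterbank(n_mels, frame_len, sample_rate):
    """Build triangular filters whose centers are evenly spaced on the mel scale."""
    max_mel = hz_to_mel(sample_rate / 2)
    edges = [mel_to_hz(max_mel * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / frame_len for k in range(frame_len // 2 + 1)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for f in bin_hz:
            if lo <= f <= mid:
                row.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def log_mel_spectrogram(samples, sample_rate=16000, frame_ms=25, hop_ms=10, n_mels=8):
    """Return one list of n_mels log-energies per frame."""
    frame_len = sample_rate * frame_ms // 1000
    hop_len = sample_rate * hop_ms // 1000
    bank = mel_filterbank(n_mels, frame_len, sample_rate)
    features = []
    for frame in frame_signal(samples, frame_len, hop_len):
        power = power_spectrum(frame)
        features.append([
            math.log10(max(sum(w * p for w, p in zip(row, power)), 1e-10))
            for row in bank
        ])
    return features
```

Calling log_mel_spectrogram on a list of float samples yields a frames-by-bands matrix, which is the kind of input the neural network then converts into tokens. Production systems typically use many more mel bands and an FFT-based implementation from a numerical library.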

Meeting Transcription and Minutes Generation

One of the most common business applications is meeting transcription. Major video conferencing tools like Zoom, Microsoft Teams, and Google Meet offer real-time transcription features. Furthermore, workflows that use LLMs to summarize the transcribed text and automatically generate meeting minutes or lists of key points are becoming widespread. "Reviewing a one-hour meeting in five minutes" has become a reality.

Subtitle Generation and Content Production

Adding subtitles to video content is another important use of ASR. YouTube's auto-caption feature is based on speech recognition and supports multilingual subtitle generation. In video production, ASR is also used for media management -- converting footage audio to searchable text -- and for streamlining the creation of translated subtitles. From an accessibility standpoint, it is an essential technology for giving people who are deaf or hard of hearing access to information.
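As an illustration of the subtitle step, the sketch below turns timestamped ASR output into SubRip (SRT) subtitle text. It assumes each segment is a dictionary with "start" and "end" times in seconds and a "text" string; that shape, and the function names, are assumptions made for this example, so the output of a given ASR tool may need adapting first.

```python
def format_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def segments_to_srt(segments):
    """Render segments as SRT cues: index, time range, text, blank line."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        time_range = f"{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}"
        cues.append(f"{index}\n{time_range}\n{seg['text'].strip()}\n")
    return "\n".join(cues)

if __name__ == "__main__":
    demo = [
        {"start": 0.0, "end": 2.5, "text": " Welcome to the meeting."},
        {"start": 2.5, "end": 5.0, "text": " Let's get started."},
    ]
    print(segments_to_srt(demo))
```

Because the transcript is kept as plain timestamped text, the same segments can also feed a search index or be handed to a translation step before rendering.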

Factors Affecting Accuracy

Speech recognition accuracy is influenced by several factors. Heavy background noise, multiple people speaking at the same time, strong accents, and specialized terminology can all reduce accuracy. Effective countermeasures include applying noise-reduction preprocessing, using speaker diarization to separate speakers, and performing additional domain-specific training. In fields with heavy specialized terminology, such as medicine and law, registering custom vocabulary and fine-tuning the model are particularly effective for improving accuracy.
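To judge whether a countermeasure helps, accuracy has to be measured. A widely used metric is word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the ASR output into a reference transcript, divided by the number of reference words. The sketch below is a minimal implementation that assumes whitespace-separated words and no text normalization; the function name is chosen for this example. For languages written without spaces, such as Japanese, a character-level error rate or a tokenizer is typically used instead.

```python
def word_error_rate(reference, hypothesis):
    """Return the word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (standard Levenshtein dynamic program).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

if __name__ == "__main__":
    ref = "the patient has acute myocardial infarction"
    hyp = "the patient has a cute myocardial infection"
    print(word_error_rate(ref, hyp))  # 3 edits over 6 reference words -> 0.5
```

Comparing WER on the same test recordings before and after a change, such as adding custom vocabulary, gives a concrete way to check whether domain-specific terms are being recognized more reliably.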