FFmpeg, an essential open-source media tool, now includes a new audio filter, whisper (implemented in af_whisper.c), that enables automatic speech recognition (ASR) directly within the FFmpeg ecosystem. It is built on the whisper.cpp library, a C/C++ port of OpenAI’s Whisper speech recognition model, and so brings an AI model into media processing workflows. This is a significant step for FFmpeg because it takes the software beyond traditional media processing into the world of AI.
The new filter’s options allow for flexible transcription, including choosing the AI model, specifying the language, and setting the output format to text, SRT, or JSON. It can handle both pre-recorded files and live audio streams. Users can also enable voice activity detection (VAD) to improve transcription accuracy and efficiency.
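The filter buffers incoming audio in a queue before passing it to the model, and the queue size lets users balance transcription accuracy against latency: a small queue is processed more often and yields faster but less accurate results, while a larger queue gives the model more context at the cost of a longer delay. It also supports GPU acceleration, which can significantly speed up the transcription process. For users, this feature removes the need for external, multi-step transcription processes, consolidating tasks into a single, efficient command-line workflow.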
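To make the options described above concrete, here is a minimal Python sketch that assembles a filter string for FFmpeg's -af option. It assumes the filter is invoked as whisper and that the option names are model, language, queue, use_gpu, destination, format, and vad_model, as in the FFmpeg filter documentation at the time of writing; check ffmpeg -h filter=whisper on your build before relying on them. The function name, its defaults, and the model file name are hypothetical.

```python
def whisper_filter(model, language="en", fmt="srt", destination="out.srt",
                   queue=3, use_gpu=True, vad_model=None):
    """Return a filter string for FFmpeg's -af option.

    Option names are assumed from the FFmpeg documentation; the helper
    itself is illustrative and not part of FFmpeg.
    """
    options = {
        "model": model,              # path to a whisper.cpp model file
        "language": language,
        "queue": queue,              # seconds of audio buffered per pass
        "use_gpu": 1 if use_gpu else 0,
        "destination": destination,  # where the transcript is written
        "format": fmt,               # "text", "srt", or "json"
    }
    if vad_model is not None:
        options["vad_model"] = vad_model  # optional VAD model file

    # Sketch-level safety: refuse values that would need filtergraph
    # escaping rather than trying to escape them here.
    for key, value in options.items():
        if any(ch in str(value) for ch in ":,'\\"):
            raise ValueError(f"{key} needs escaping: {value!r}")

    return "whisper=" + ":".join(f"{k}={v}" for k, v in options.items())


if __name__ == "__main__":
    print(whisper_filter("ggml-base.en.bin", queue=10, fmt="json",
                         destination="out.json"))
```

Raising queue in this sketch corresponds to the slower, more accurate end of the trade-off, and setting use_gpu to False would keep transcription on the CPU.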
The new filter can generate subtitle files, such as SRT files for videos and podcasts, and it enables live transcription for streaming and other real-time applications. It can also attach the transcription to its output as metadata, which can be used for further automation within FFmpeg. The feature simplifies the process for content creators, archivists, and developers, and it saves significant time and effort for anyone who wants to transcribe audio content.
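As an illustration of the subtitle use case, the following Python sketch builds the argument list for a single FFmpeg invocation that writes an SRT file next to a video. It assumes an FFmpeg build with the whisper filter enabled and a downloaded whisper.cpp model; the file names and the srt_command helper are hypothetical, and the command is only printed here, not executed.

```python
import shlex


def srt_command(media, model, srt_path, language="en"):
    """Build an ffmpeg argv that transcribes `media` into `srt_path`.

    Assumes the whisper filter and its option names (model, language,
    queue, destination, format) are available in the local ffmpeg build.
    """
    audio_filter = (
        f"whisper=model={model}:language={language}"
        f":queue=3:destination={srt_path}:format=srt"
    )
    # -vn ignores the video stream; "-f null -" discards the decoded
    # audio, because only the subtitle file written by the filter matters.
    return ["ffmpeg", "-i", media, "-vn", "-af", audio_filter,
            "-f", "null", "-"]


if __name__ == "__main__":
    cmd = srt_command("talk.mp4", "ggml-base.en.bin", "talk.srt")
    # Print a copy-pasteable shell command; pass `cmd` to subprocess.run
    # to actually execute it.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems if it is later handed to subprocess.run.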
This integration sets a precedent for FFmpeg to add other AI and machine learning models in the future. It also solidifies FFmpeg’s position as an industry-standard media tool. While some people may be concerned about AI, it seems clear that it will permeate most software going forward.