I recently watched a movie on Netflix with scenes in multiple languages: English, Korean, French, and Italian. During the foreign language scenes, there was no translation, just the name of the language spoken, like “[Korean]”. How disappointing…
In a nerdy fit of revenge I decided to fix this myself. So, I obtained an .mp4 video file of the movie and went to work. The tech I’m about to describe uses AI to listen to your movie’s audio, translate it from almost any language, and create subtitles. You could also use these tools for other tasks such as generating lyrics for music.
The tools involved are a combination of ffmpeg and mlx-whisper – a version of OpenAI’s Whisper model optimized to take advantage of Apple Silicon chips. Generating subtitles for the hour-and-a-half movie I mentioned took less than 5 minutes on my Apple M2 Max MacBook Pro with 32 GB of memory. I asked ChatGPT what makes mlx-whisper faster on Apple Silicon chips and this is what it said:
- Unified Memory Architecture (UMA): Apple Silicon chips use a unified memory architecture, allowing the CPU, GPU, and Neural Engine to access the same memory pool. This reduces latency and improves data transfer speeds, which are crucial for machine learning tasks.
- Metal Performance Shaders (MPS): MLX leverages Metal Performance Shaders, Apple’s framework optimized for machine learning on Apple Silicon. This ensures efficient execution of operations compared to traditional frameworks like PyTorch with MPS backend.
- Specialized Machine Learning Accelerators: The Apple Silicon chips include dedicated machine learning accelerators that provide hardware-level support for common algorithms and models, significantly boosting performance for tasks like transcription in Whisper.
- Batched Decoding: Implementations like Lightning Whisper MLX utilize features like batched decoding, which improves throughput by processing multiple inputs simultaneously.
- Neural Engine: The Neural Engine on Apple Silicon is designed for AI tasks, offering high-speed performance for operations related to speech recognition and transcription.
- Optimized Operations: MLX uses custom-implemented operations that are optimized for Apple Silicon, outperforming traditional CUDA and other GPU-based setups in some benchmarks.
What you’ll need
- A modern Mac using Apple Silicon
- The Terminal app
This is how you get ffmpeg and mlx-whisper on your Mac.
- Homebrew
- https://brew.sh/
- On the web page, you can copy the install command for your terminal
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
- Follow the resulting instructions displayed in the terminal to add brew to your PATH. These were mine, specific to my user name on the machine; copy yours from the terminal
echo >> /Users/bubba/.bash_profile
echo 'eval "$(/opt/homebrew/bin/brew shellenv)"' >> /Users/bubba/.bash_profile
eval "$(/opt/homebrew/bin/brew shellenv)"
- Give it a quick test: type
brew
and hit return to see if it works
- FFmpeg
brew install ffmpeg
- Give it a quick test by typing
ffmpeg
and hit return
- Python
- https://www.python.org/
- Download and run the package installer
- Pip
- Download the script from https://bootstrap.pypa.io/get-pip.py into a folder you can run the terminal from. You can also right-click the link and save it
python3 get-pip.py
- Give it a quick test by typing
pip
and hit return
- MLX-Whisper
pip install mlx-whisper
- Give it a quick test by typing
mlx_whisper
and hit return
- The Whisper model – a roughly 3 GB speech-to-text model
pip install huggingface_hub hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1
huggingface-cli download --local-dir whisper-large-v3-mlx mlx-community/whisper-large-v3-mlx
- Run this in the folder where the video will reside, so the model downloads there
Now that ffmpeg and mlx_whisper are installed, along with the Whisper model, let’s assume you have a video to subtitle, called input.mp4.
To create an external subtitle file in the .srt format (--task translate translates the speech into English; --condition-on-previous-text False helps keep the model from getting stuck repeating a phrase):
mlx_whisper input.mp4 --task translate --model whisper-large-v3-mlx --output-format srt --verbose False --condition-on-previous-text False
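If you have several files to translate, the same invocation is easy to script from Python. The helper below is a hypothetical convenience of my own (not part of mlx-whisper); it just assembles the argument list from the command above so you can pass it to subprocess.run in a loop:

```python
import subprocess

def whisper_cmd(video, model="whisper-large-v3-mlx"):
    """Build the mlx_whisper command line used above for one video file."""
    return [
        "mlx_whisper", video,
        "--task", "translate",
        "--model", model,
        "--output-format", "srt",
        "--verbose", "False",
        "--condition-on-previous-text", "False",
    ]

for movie in ["input.mp4"]:
    cmd = whisper_cmd(movie)
    print(" ".join(cmd))
    # Uncomment to actually run the transcription:
    # subprocess.run(cmd, check=True)
```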
You can open the .srt file with a text editor to take a look and make manual edits if desired. Then you can either burn the subtitles into the video, or add them as a track that can be turned on and off when viewing.
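One edit you may want beyond fixing wording: if the subtitles come out consistently early or late, you can shift every cue by a fixed offset. The sketch below (my own helper, not part of any of the tools above) finds the SRT timestamps, which use the HH:MM:SS,mmm format, with a regular expression and offsets them:

```python
import re

# Matches one SRT timestamp, e.g. 00:01:23,456
TS = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def shift_srt(text, offset_ms):
    """Shift every SRT timestamp in text by offset_ms milliseconds (may be negative)."""
    def bump(m):
        h, mnt, s, ms = (int(g) for g in m.groups())
        total = max(0, ((h * 60 + mnt) * 60 + s) * 1000 + ms + offset_ms)
        h, rem = divmod(total, 3_600_000)
        mnt, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{mnt:02}:{s:02},{ms:03}"
    return TS.sub(bump, text)

cue = "1\n00:00:01,500 --> 00:00:03,000\nHello\n"
print(shift_srt(cue, 250))  # cue now starts at 00:00:01,750
```

Read input.srt, pass its contents through shift_srt, and write the result back out to apply the offset to the whole file.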
To burn the subtitles into the video (this re-encodes the video stream):
ffmpeg -i input.mp4 -vf subtitles=input.srt -c:a copy output.mp4
To add the subtitles as an optional track instead (no re-encoding, so this is much faster):
ffmpeg -i input.mp4 -i input.srt -c copy -c:s mov_text output.mp4
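The two ffmpeg variants differ only in their arguments, so if you are batch-processing files you can pick between them with a flag. The helper below is again hypothetical glue code of my own, building either argument list from above:

```python
def ffmpeg_cmd(video, srt, out, burn_in=False):
    """Build one of the two ffmpeg commands above: burn-in re-encodes the
    video with the subtitles filter; otherwise the SRT is stream-copied in
    as a mov_text subtitle track."""
    if burn_in:
        return ["ffmpeg", "-i", video, "-vf", f"subtitles={srt}",
                "-c:a", "copy", out]
    return ["ffmpeg", "-i", video, "-i", srt,
            "-c", "copy", "-c:s", "mov_text", out]

print(" ".join(ffmpeg_cmd("input.mp4", "input.srt", "output.mp4")))
```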