I am completely amazed at how well voice cloning works locally on my machine.
I already had a local MLX Whisper model running, which I use to generate subtitles from video. I thought that was all I needed. Then I asked Codex to build me a voice cloning tool in Python, and it went much deeper than I expected.
The tool ended up using:
- Python
- FFmpeg and FFprobe
- MLX Audio text-to-speech
- MLX Whisper transcription
- Hugging Face model snapshots
A quick note: cloning someone’s voice without their consent can get you into legal trouble, especially if they object or decide to sue. That said, Winston Churchill is no longer around to complain.
What the Tool Does
- Lets me choose a reference audio or video file
- Lets me choose a speech text file
- Detects and skips leading silence (FFmpeg)
- Extracts and cleans a reference voice sample (FFmpeg)
- Transcribes the cleaned reference using an MLX Whisper model (whisper-large-v3-mlx)
- Generates new speech from text using a local MLX Audio model (higgs-audio-v2-3B-mlx-q8)
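The two FFmpeg steps in that list can be sketched roughly as follows. This is not the script Codex generated, just a minimal illustration under stated assumptions. It relies on FFmpeg's `silencedetect` filter, which logs `silence_start` and `silence_end` times to stderr. The function names, the -35 dB threshold, the 15-second clip length, the 24 kHz mono output, and the `highpass` plus `loudnorm` cleanup chain are all hypothetical choices, not values taken from the tool.

```python
import re
import subprocess

# silencedetect writes lines such as:
#   [silencedetect @ 0x...] silence_start: 0
#   [silencedetect @ 0x...] silence_end: 1.25 | silence_duration: 1.25
SILENCE_START = re.compile(r"silence_start:\s*(-?\d+(?:\.\d+)?)")
SILENCE_END = re.compile(r"silence_end:\s*(-?\d+(?:\.\d+)?)")


def leading_silence_end(log: str, tolerance: float = 0.05) -> float:
    """Return the time (seconds) where leading silence ends, or 0.0 if none."""
    start = SILENCE_START.search(log)
    # Only count the first silent stretch if it begins at the very start.
    if start is None or float(start.group(1)) > tolerance:
        return 0.0
    end = SILENCE_END.search(log, start.end())
    return float(end.group(1)) if end else 0.0


def detect_cmd(src: str, noise_db: int = -35, min_silence: float = 0.3) -> list[str]:
    """Build an FFmpeg command that only analyses audio and writes no file."""
    return [
        "ffmpeg", "-hide_banner", "-nostats", "-i", src, "-vn",
        "-af", f"silencedetect=noise={noise_db}dB:d={min_silence}",
        "-f", "null", "-",
    ]


def extract_cmd(src: str, dst: str, offset: float, duration: float = 15.0) -> list[str]:
    """Build an FFmpeg command that cuts and cleans a mono reference clip."""
    return [
        "ffmpeg", "-y", "-ss", f"{offset:.3f}", "-i", src,
        "-t", f"{duration:.3f}", "-vn", "-ac", "1", "-ar", "24000",
        # Assumed cleanup: remove low rumble, then normalise loudness.
        "-af", "highpass=f=80,loudnorm",
        dst,
    ]


def prepare_reference(src: str, dst: str) -> float:
    """Skip leading silence in src and write a cleaned clip to dst."""
    result = subprocess.run(detect_cmd(src), capture_output=True, text=True, check=True)
    offset = leading_silence_end(result.stderr)
    subprocess.run(extract_cmd(src, dst, offset), check=True)
    return offset
```

Calling `prepare_reference("speech.mp4", "reference.wav")` (placeholder file names) would produce the clip that the transcription and generation steps then consume. Keeping the log parsing separate from the FFmpeg calls makes the silence logic easy to check without any media files.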
The wild part is that all of this runs locally. No cloud service. No API subscription. Just my Mac, some open-source tools, and a Python script that suddenly feels a little too powerful.