Voice Cloning

I am completely amazed at how well voice cloning works locally on my machine.

I already had a local MLX Whisper model running, which I use to generate subtitles from video. I thought that was all I needed. Then I asked Codex to build me a voice cloning tool in Python, and it went much deeper than I expected.

The tool ended up using:

  • Python
  • FFmpeg and FFprobe
  • MLX Audio text-to-speech
  • MLX Whisper transcription
  • Hugging Face model snapshots

A quick note: cloning someone’s voice without their permission can get you into legal trouble if they object or decide to sue. That said, Winston Churchill is no longer around to complain.

What the Tool Does

  1. Lets me choose a reference audio or video file
  2. Lets me choose a speech text file
  3. Detects and skips leading silence (FFmpeg)
  4. Extracts and cleans a reference voice sample (FFmpeg)
  5. Transcribes the cleaned reference using an MLX Whisper model (whisper-large-v3-mlx)
  6. Generates new speech from text using a local MLX Audio model (higgs-audio-v2-3B-mlx-q8)
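
To give a feel for steps 3 and 4, here is a minimal sketch of how the FFmpeg part could work. It is not the script Codex wrote for me. The function names, the -35 dB / 0.3 second silence thresholds, the 15 second clip length, the 24 kHz mono output, and the highpass plus loudnorm "cleaning" chain are all assumptions chosen for illustration. It relies on FFmpeg's silencedetect filter, which reports silence_start and silence_end times on stderr.

```python
import re
import subprocess


def silencedetect_command(source, noise_db=-35.0, min_silence=0.3):
    """Build an FFmpeg command that only analyses audio and logs silent spans."""
    return [
        "ffmpeg", "-hide_banner", "-nostats", "-i", str(source),
        "-vn",  # ignore any video stream
        "-af", f"silencedetect=noise={noise_db}dB:d={min_silence}",
        "-f", "null", "-",  # discard the output, we only want the log
    ]


def detect_leading_silence(stderr_text, tolerance=0.05):
    """Return where the leading silence ends, or 0.0 if the file starts with sound.

    silencedetect logs lines such as:
        [silencedetect @ 0x...] silence_start: 0
        [silencedetect @ 0x...] silence_end: 1.25 | silence_duration: 1.25
    """
    number = r"(-?\d+(?:\.\d+)?)"
    starts = re.findall(r"silence_start:\s*" + number, stderr_text)
    ends = re.findall(r"silence_end:\s*" + number, stderr_text)
    # Only the first silent span counts, and only if it begins at the very start.
    if starts and ends and float(starts[0]) <= tolerance:
        return max(0.0, float(ends[0]))
    return 0.0


def extract_command(source, dest, start, duration=15.0):
    """Build an FFmpeg command that cuts and cleans a mono reference clip."""
    return [
        "ffmpeg", "-hide_banner", "-y",
        "-ss", f"{start:.3f}",      # skip the leading silence
        "-i", str(source),
        "-t", f"{duration:.3f}",    # assumed clip length
        "-vn", "-ac", "1", "-ar", "24000",  # assumed mono, 24 kHz
        "-af", "highpass=f=80,loudnorm",    # assumed cleaning chain
        str(dest),
    ]


def prepare_reference(source, dest):
    """Run both steps and return the number of seconds skipped."""
    probe = subprocess.run(
        silencedetect_command(source), capture_output=True, text=True, check=True
    )
    start = detect_leading_silence(probe.stderr)
    subprocess.run(extract_command(source, dest, start), check=True)
    return start
```

Calling prepare_reference("speech.mp4", "reference.wav") (both file names are placeholders) would write the cleaned clip and return how many seconds of silence were skipped. The command builders are kept separate from the code that runs them so the parsing and the arguments can be checked without FFmpeg installed.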

The wild part is that all of this runs locally. No cloud service. No subscription API. Just my Mac, some open-source tools, and a Python script that suddenly feels a little too powerful.
