I'm working on the same project myself and was planning to write a blog post similar to the author's. However, I'll share some additional tips and tricks that really made a difference for me.
For preprocessing, I found it best to convert files to 16kHz WAV. I also add high-pass and low-pass filters to remove non-speech sounds. To avoid hallucinations, I run Silero VAD on the entire audio file to find the timestamps where someone is actually speaking. A side note on this: Silero needs careful tuning to keep segments from being chopped up and clipped, and I add a post-processing step that merges adjacent VAD chunks so the segments fed to Whisper stay cohesive.
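Roughly, the preprocessing looks like this (a sketch, assuming ffmpeg is on PATH and the torch.hub Silero VAD API; the filter cutoffs, file names, and 0.5 s merge gap are illustrative starting points, not tuned values):

```python
import subprocess
import torch

# 1. Resample to 16 kHz mono WAV and band-limit to the speech range.
#    The 80 Hz / 8 kHz cutoffs are illustrative; tune them for your audio.
subprocess.run(
    ["ffmpeg", "-y", "-i", "input.mp3",
     "-ac", "1", "-ar", "16000",
     "-af", "highpass=f=80,lowpass=f=8000",
     "clean.wav"],
    check=True,
)

# 2. Run Silero VAD over the whole file to find speech timestamps.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("clean.wav", sampling_rate=16000)
speech = get_speech_timestamps(
    wav, model,
    sampling_rate=16000,
    return_seconds=True,   # start/end in seconds instead of samples
)

# 3. Merge adjacent chunks so Whisper gets cohesive segments
#    instead of clipped fragments (0.5 s gap is an arbitrary choice).
merged = []
for seg in speech:
    if merged and seg["start"] - merged[-1]["end"] < 0.5:
        merged[-1]["end"] = seg["end"]
    else:
        merged.append(dict(seg))

print(merged)  # [{'start': ..., 'end': ...}, ...]
```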
For the Whisper step, I run Whisper on small audio chunks that correspond to the VAD timestamps. Otherwise it hallucinates during silences and regurgitates the passed-in prompt. If you're on a Mac, use the whisper-mlx models from Hugging Face to speed up transcription; in my benchmark, a model built for the Apple Neural Engine was 22x faster.
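The chunked transcription is then something like this (a sketch that assumes the `merged` segments from the VAD step above and the `mlx_whisper` package with an mlx-community model repo; off-Mac you'd swap in plain `whisper` or `faster-whisper`):

```python
import subprocess
import tempfile
import mlx_whisper  # pip install mlx-whisper (Apple Silicon)

MODEL = "mlx-community/whisper-large-v3-mlx"  # any mlx-community Whisper repo

def transcribe_segment(path, start, end):
    # Cut one VAD segment out with ffmpeg and transcribe it on its own,
    # so Whisper never sees long silences it could hallucinate over.
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        subprocess.run(
            ["ffmpeg", "-y", "-i", path,
             "-ss", str(start), "-to", str(end), tmp.name],
            check=True, capture_output=True,
        )
        result = mlx_whisper.transcribe(tmp.name, path_or_hf_repo=MODEL)
        return result["text"].strip()

# `merged` comes from the VAD/merge step above
transcript = [
    (seg["start"], transcribe_segment("clean.wav", seg["start"], seg["end"]))
    for seg in merged
]
for start, text in transcript:
    print(f"[{start:7.1f}s] {text}")
```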
For post-processing, I've found that running the generated SRT files through ChatGPT to identify and remove hallucinated chunks yields better results.
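For that pass I just feed the whole SRT in with a narrow prompt, roughly like this (a sketch using the openai Python SDK; the model name, prompt wording, and file names are only examples):

```python
from openai import OpenAI  # pip install openai; reads OPENAI_API_KEY from the env

client = OpenAI()

with open("talk.srt", encoding="utf-8") as f:
    srt_text = f.read()

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative; any capable model works
    messages=[
        {"role": "system",
         "content": "You clean Whisper SRT files. Remove subtitle blocks that look like "
                    "hallucinations (repeated phrases, prompt echoes, 'thanks for watching' "
                    "during silence). Renumber the remaining blocks. Output valid SRT only."},
        {"role": "user", "content": srt_text},
    ],
)

with open("talk.cleaned.srt", "w", encoding="utf-8") as f:
    f.write(resp.choices[0].message.content)
```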
I added EQ to a task after reading this and got much more accurate and consistent results from Whisper. Thanks for the obvious-in-retrospect tip.
btw, if you want local dictation (speak and get a transcript, rather than transcribing files), I built a Python tool called hns [1]. It's open source, uses faster-whisper, and you can run it with `uvx hns` or just `hns` after `uv tool install hns`.
[1]: https://github.com/primaprashant/hns
That's a neat lil python script, it deserves a github page :)
Nice job. I made a similar Python script available as a GitHub gist [1] a while back that, given an audio file, does the following:
- Converts to 16kHz WAV
- Transcribes using native ggerganov whisper
- Calls out to a local LLM to clean the text
- Prints out the final cleaned up transcription
I found that accuracy/success increased significantly when I added the LLM post-processor even with modestly sized 12-14b models.
I've been using it with great success to convert very old dictated memos from over a decade ago despite a lot of background noise (wind, traffic, etc).
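The LLM step is just a narrow prompt to a local model; roughly like this (a sketch using the `ollama` Python client rather than whatever the gist actually calls, and the model tag is only an example of a ~12-14B instruct model):

```python
import ollama  # pip install ollama; assumes a local Ollama server is running

raw_text = open("memo_transcript.txt", encoding="utf-8").read()

response = ollama.chat(
    model="qwen2.5:14b",  # illustrative; any modestly sized local model
    messages=[
        {"role": "system",
         "content": "Clean up this raw speech transcript: fix punctuation, casing, and "
                    "obvious mis-hearings. Do not add, remove, or summarize content."},
        {"role": "user", "content": raw_text},
    ],
)

print(response["message"]["content"])
```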
[1] https://gist.github.com/scpedicini/455409fe7656d3cca8959c123...
This tool requires ffmpeg, but don't forget that the latest version of ffmpeg has speech-to-text built in!
I'm sure there are use cases where using Whisper directly is better, but it's a great addition to an already versatile tool.
I was going to go the opposite way and suggest that if you want python audio transcription, you can skip ffmpeg and just use whisper directly. Using the whisper module directly gives you a variety of outputs, including text and srt.
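For reference, the direct route is just a few lines (a sketch with the openai-whisper package; note it still calls ffmpeg under the hood for decoding, you just don't have to drive it yourself, and the model size and file name here are placeholders):

```python
import whisper
from whisper.utils import get_writer

model = whisper.load_model("small")         # tiny/base/small/medium/large
result = model.transcribe("interview.mp3")  # decoding handled for you

print(result["text"])                       # plain text output

writer = get_writer("srt", ".")             # also "vtt", "tsv", "json", "txt", "all"
# Some whisper versions require this options dict, others accept just (result, path).
writer(result, "interview.mp3",
       {"max_line_width": None, "max_line_count": None, "highlight_words": False})
```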
Yep. Whisper is great. I use it on podcasts as part of removing ads. Last time I used one of the official versions it would only accept .wav files so I had to convert with ffmpeg first.
I always thought this was a great implementation if you have a Cuda layer: https://github.com/rgcodeai/Kit-Whisperx
I had an old Acer laptop hanging around, so I implemented this: https://github.com/Sanborn-Young/MP3ToTXT
I forget all the details of my tweaks, but I remember that I had better throughput on my version.
I know the OP talked about wanting it local, but thomasmol/whisper-diarization on replicate is fast and cheap. Here's a hacked front end to parse the JSON: https://github.com/Sanborn-Young/MP3_2transcript
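If you go that route, the client side is tiny (a sketch with the official `replicate` Python client; I don't remember the model's exact input schema, so treat the input field names and output shape below as placeholders and check the model page):

```python
import replicate  # pip install replicate; needs REPLICATE_API_TOKEN in the env

# Input field names are placeholders -- check the schema on the
# thomasmol/whisper-diarization model page before relying on them.
output = replicate.run(
    "thomasmol/whisper-diarization",
    input={
        "file_url": "https://example.com/episode.mp3",
        "num_speakers": 2,
    },
)

# The model returns JSON with per-segment speaker labels (shape assumed here).
for segment in output["segments"]:
    print(segment)
```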
You should throw in some diarization; there are some pretty effective Python libraries that don't need pretraining on the voices to separate speakers.
I would suggest 2 speaker-diarization libraries:
- https://huggingface.co/pyannote/speaker-diarization-3.1
- https://github.com/narcotic-sh/senko
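A minimal pyannote run looks like this (a sketch; the 3.1 pipeline is gated on Hugging Face, so you need an access token after accepting the model terms, and the file name here is a placeholder):

```python
from pyannote.audio import Pipeline  # pip install pyannote.audio

# Gated model: accept the terms on Hugging Face and supply your own token.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_xxx",  # placeholder token
)

diarization = pipeline("meeting_16k.wav")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:6.1f}s - {turn.end:6.1f}s  {speaker}")
```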
I personally love senko since it can run in seconds, whereas pyannote took hours, but there is a 10% WER (word error rate) that is tough to get around.
Nice suggestion, I'll look them up.
whisperx does this all quite well and can be run with `uvx whisperx`
https://github.com/m-bain/whisperX
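The Python API, if you'd rather not use the CLI, looks roughly like this (a sketch following the whisperX README; exact names can shift between versions, the diarization step needs a Hugging Face token for the underlying pyannote models, and the file name and token are placeholders):

```python
import whisperx
from whisperx.diarize import DiarizationPipeline, assign_word_speakers

device = "cuda"  # or "cpu" with compute_type="int8"
audio = whisperx.load_audio("podcast.mp3")

# 1. Transcribe with batched faster-whisper
model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# 2. Align for accurate word-level timestamps
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], align_model, metadata, audio, device,
                        return_char_alignments=False)

# 3. Diarize and attach speaker labels (pyannote under the hood)
diarize_model = DiarizationPipeline(use_auth_token="hf_xxx", device=device)
diarize_segments = diarize_model(audio)
result = assign_word_speakers(diarize_segments, result)

for seg in result["segments"]:
    print(seg.get("speaker", "?"), seg["start"], seg["text"])
```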
What's the best solution right now for STT that supports speaker diarisation?
AssemblyAI (YC S17) is currently the one that stands out in the WER and accuracy benchmarks (https://www.assemblyai.com/benchmarks). Though its models are accessed through a web API rather than locally hosted, and speaker diarization is enabled through a parameter in the API call (https://www.assemblyai.com/docs/speech-to-text/pre-recorded-...).
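For reference, turning on diarization through their Python SDK is a single config flag (a sketch based on their docs; the API key and file name are placeholders):

```python
import assemblyai as aai  # pip install assemblyai

aai.settings.api_key = "YOUR_ASSEMBLYAI_KEY"

config = aai.TranscriptionConfig(speaker_labels=True)  # enables diarization
transcript = aai.Transcriber().transcribe("meeting.mp3", config=config)

for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")
```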
I like this version of Whisper which has diarization built in: https://github.com/Purfview/whisper-standalone-win
Fantastic project.
I have an old project that relies on AWS transcription and I'd love to migrate it to something local.
All I can say is you’re a legend. This is a great resource, thank you!