How to Convert Audio or Video to Text for Free with Buzz


Buzz is a free, open-source desktop app that transcribes audio and video files to text using OpenAI’s Whisper model — locally, on your own computer, with no subscription and no upload to a cloud service. Install it, drop in an MP3 / MP4 / WAV / M4A file, pick a Whisper model size, hit Run, and it produces a .txt, .srt, or .vtt transcript. Once the model is downloaded on first run, the whole pipeline works offline.

Applies to: Buzz (current version) on Windows 10 (22H2), Windows 11 (23H2, 24H2, 25H2), macOS, and Linux | Last updated: April 20, 2026

Key Takeaways

  • Buzz is free and open source, wraps OpenAI Whisper, and runs entirely on your machine: your audio never leaves your PC, and after the one-time model download it works without an internet connection.
  • Whisper “medium” is the sweet spot for English on a typical CPU: roughly real-time speed and high accuracy (often around 95% of words correct) on clean recordings. “Large” is more accurate but needs a GPU or a lot of patience.
  • Exports include plain text (.txt), SRT subtitles (.srt), and WebVTT (.vtt) — SRT is what YouTube, Premiere Pro, and DaVinci Resolve expect.

Quick Steps

  1. Download Buzz from the GitHub Releases page.
  2. Install it (click More info → Run anyway if SmartScreen complains on Windows).
  3. Open Buzz, click +, select an audio or video file (change the file filter to All files if video doesn’t appear).
  4. Pick Whisper → medium as the model, leave task on Transcribe, and language on Detect Language.
  5. Click Run. First run downloads the model (~1.5 GB for medium). Subsequent runs are fully offline.
  6. Double-click the finished job to view the transcript; export as .txt, .srt, or .vtt.

Why Buzz Over Online Tools

Web transcription services such as Otter, Descript, and Rev are genuinely good, but they share the same tradeoffs: you upload your audio to someone else’s servers, you’re capped by free-tier minutes, and the best output quality is often locked behind a subscription. Buzz removes all three: your audio stays on your machine, there’s no time limit, and you get the full Whisper model at whichever size you choose, with nothing held back for a paid tier.

The tradeoff is speed. On a mid-range CPU, the medium Whisper model transcribes roughly in real time (one minute of audio takes one minute of processing). On a GPU, it’s 5–10× faster. Browser-based services run on their own GPU farms, so they feel faster for short clips — but once you factor in upload time on a slow connection, Buzz often beats them for anything longer than a couple of minutes.
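To make that comparison concrete, here is a minimal back-of-the-envelope sketch. The function names and the example file size and connection speed are illustrative assumptions, not measurements from Buzz; the speed factors (1× on CPU, 5–10× on GPU) are the rough figures quoted above.

```python
def processing_minutes(audio_minutes: float, speed_factor: float = 1.0) -> float:
    """Estimate local transcription time.

    speed_factor is minutes of audio processed per minute of wall-clock
    time: 1.0 means real time, 5.0 means five times faster than real time.
    """
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return audio_minutes / speed_factor


def upload_minutes(file_megabytes: float, upload_mbps: float) -> float:
    """Estimate how long a file takes to upload to an online service."""
    if upload_mbps <= 0:
        raise ValueError("upload_mbps must be positive")
    seconds = file_megabytes * 8 / upload_mbps  # megabytes -> megabits
    return seconds / 60


# A 60-minute recording with the medium model:
print(processing_minutes(60))       # CPU at roughly real time
print(processing_minutes(60, 5))    # GPU, low end of the 5-10x range
print(processing_minutes(60, 10))   # GPU, high end of the 5-10x range

# Hypothetical 150 MB file on a 10 Mbps uplink:
print(upload_minutes(150, 10))
```

The upload figure is only one part of an online service's turnaround, so treat the result as a lower bound on the cloud side rather than a full comparison.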

Step 1: Install Buzz

Buzz GitHub releases page listing Windows, macOS, and Linux installers.
  1. Open the Buzz GitHub Releases page.
  2. Download the installer for your OS:
    • Windows: Buzz-x.x.x-windows.exe
    • macOS: Buzz-x.x.x-mac.dmg (universal — works on Intel and Apple Silicon)
    • Linux: Buzz-x.x.x-unix.tar.gz or install via pipx install buzz-captions
  3. Run the installer. On Windows, you’ll hit a SmartScreen warning because Buzz isn’t code-signed with an expensive certificate — click More info → Run anyway. The project is open source and the source code is on the same GitHub repository.
Windows SmartScreen warning with the More info link highlighted to reveal the Run anyway button.

Step 2: Transcribe a File

  1. Open Buzz. The main window lists past transcription jobs (empty on first launch).
  2. Click the + button in the top-left to start a new transcription.
  3. In the file picker, switch the filter from Audio files to All files if you want to transcribe a video (MP4, MOV, MKV — Buzz extracts audio internally).
  4. Select the file and click Open.
Buzz new transcription window showing model, task, and language selectors.

Choosing the Right Whisper Model

Buzz lets you pick from Whisper’s five model sizes. Each trades speed against accuracy:

  • Tiny (~75 MB) — Very fast, error-prone. Good for quick draft transcripts where you’ll edit heavily.
  • Base (~140 MB) — Fast, decent on clean audio. Noticeably worse on accents.
  • Small (~460 MB) — Balanced. Good default on a low-end laptop.
  • Medium (~1.5 GB) — Roughly real-time on CPU. The sweet spot for most people.
  • Large (~3 GB) — Best accuracy, especially for non-English and noisy audio. Slow on CPU — use it with a GPU or an overnight job.

Models are downloaded on first use and cached — you only pay the download cost once per model.
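As a rough illustration of that one-time cost, the sketch below estimates the download time for each model from the approximate sizes listed above. The function name and the 50 Mbps connection speed are assumptions for the example; actual file sizes vary slightly between Whisper releases.

```python
# Approximate download sizes in megabytes, taken from the list above.
MODEL_SIZES_MB = {
    "tiny": 75,
    "base": 140,
    "small": 460,
    "medium": 1500,
    "large": 3000,
}


def download_minutes(model: str, download_mbps: float) -> float:
    """Estimate the first-run download time for a Whisper model."""
    if model not in MODEL_SIZES_MB:
        raise KeyError(f"unknown model size: {model!r}")
    if download_mbps <= 0:
        raise ValueError("download_mbps must be positive")
    seconds = MODEL_SIZES_MB[model] * 8 / download_mbps  # megabytes -> megabits
    return seconds / 60


# Hypothetical 50 Mbps connection:
for name in MODEL_SIZES_MB:
    print(f"{name:>6}: about {download_minutes(name, 50):.1f} min")
```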

Transcribe vs. Translate

  • Transcribe — Output in the same language as the input. Use this for captions or notes.
  • Translate — Whisper outputs English regardless of the input language. One-way only — you can’t translate English → Spanish with this feature. For that you’d pipe the English transcript through a separate translator.

Leave Language on Detect Language unless Whisper is mis-detecting — it’s usually accurate. Manually set the language if you’re transcribing a clip with very short speech where detection is unreliable.

  5. Click Run. Buzz downloads the model if needed, then starts processing. Progress shows as a percentage on the job in the main window.
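For readers who want to see what the Task and Language settings correspond to outside the GUI, here is a minimal sketch using the open-source openai-whisper Python package directly. This is not Buzz's own code; it assumes you have installed the package (pip install openai-whisper) and have ffmpeg available. The helper names whisper_options and transcribe_file are hypothetical.

```python
from typing import Optional


def whisper_options(task: str = "transcribe", language: Optional[str] = None) -> dict:
    """Build keyword arguments for Whisper's transcribe() call.

    task: "transcribe" keeps the input language, "translate" outputs English.
    language: a code such as "es", or None to let Whisper detect it.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    return {"task": task, "language": language}


def transcribe_file(path: str, model_name: str = "medium", **settings) -> str:
    """Run Whisper on one file and return the transcript text."""
    import whisper  # openai-whisper; imported here so the helper above works without it

    model = whisper.load_model(model_name)  # downloads the model on first use
    result = model.transcribe(path, **whisper_options(**settings))
    return result["text"]


# Equivalent of "Transcribe" + "Detect Language":
print(whisper_options())
# Equivalent of "Translate" with the language set manually to Spanish:
print(whisper_options(task="translate", language="es"))
```

Calling transcribe_file("interview.mp3", task="translate") would return English text whatever the spoken language, which mirrors the one-way behaviour described above.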

Step 3: Export the Transcript

When the job finishes, double-click it to view the full transcript in a separate window. From there, use the Export menu to save the output:

Buzz export dialog showing TXT, SRT, and VTT output format options.
  • Text (.txt) — Plain text, no timestamps. Good for blog drafts, meeting notes, or content you’ll paste into a document.
  • SubRip (.srt) — Subtitles with timestamps. This is the format YouTube, DaVinci Resolve, Premiere, and most other video tools import natively.
  • WebVTT (.vtt) — Web-standard subtitles. Use this if you’re publishing captions alongside HTML5 video on your own site.

For YouTube, upload the .srt through Studio → Subtitles. YouTube’s own auto-captions are noticeably worse than Whisper medium, so this often replaces them directly.
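If you are curious how the two subtitle formats differ, the sketch below writes the same segments as SRT and as WebVTT. It is an illustration of the file formats, not Buzz's exporter; the segment tuples and function names are made up for the example.

```python
def timestamp(seconds: float, separator: str) -> str:
    """Format seconds as HH:MM:SS<sep>mmm (SRT uses ',', WebVTT uses '.')."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{separator}{millis:03d}"


def to_srt(segments) -> str:
    """SRT: numbered cues, comma before the milliseconds."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{timestamp(start, ',')} --> {timestamp(end, ',')}\n{text}"
        )
    return "\n\n".join(blocks) + "\n"


def to_vtt(segments) -> str:
    """WebVTT: 'WEBVTT' header, dot before the milliseconds, numbering optional."""
    cues = [
        f"{timestamp(start, '.')} --> {timestamp(end, '.')}\n{text}"
        for start, end, text in segments
    ]
    return "WEBVTT\n\n" + "\n\n".join(cues) + "\n"


# Hypothetical segments: (start seconds, end seconds, text)
segments = [(0.0, 2.5, "Hello and welcome."), (2.5, 5.0, "Today we test Buzz.")]
print(to_srt(segments))
print(to_vtt(segments))
```

The practical difference is small: the header line, the cue numbers, and the decimal separator in the timestamps.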

Real-Time Transcription

Buzz also supports live transcription from the microphone, which is useful for voice notes, meeting notes, or live-captioning yourself while you record. Click Record on the main screen, select your input device, and Buzz streams transcription as you speak.

On older or low-spec CPUs, real-time transcription can lag behind the audio noticeably. If that’s a problem on your machine, the workaround is to record with Audacity or Windows’ built-in Voice Recorder, then transcribe the saved file with Buzz afterwards — you get the same accuracy with no real-time latency.

Buzz real-time recording window with microphone input selected and Whisper streaming transcription.

Accuracy Tips

  • Clean audio wins. Whisper is much more accurate on a good-quality recording than a noisy one. A quick pass through Audacity’s noise reduction, or through a web tool like Adobe Podcast’s Enhance Speech (which does require uploading the file), can bump accuracy noticeably before you even hit transcribe.
  • Short segments trip up language detection. For clips under ten seconds, set the language manually instead of leaving it on Detect.
  • Brand names and technical terms need proofreading. Whisper has a strong bias toward common dictionary words — expect to fix product names, URLs, and jargon-heavy passages.
  • Model size matters more than transcription settings. If accuracy is poor, stepping up from medium to large gives a much bigger improvement than any tweaking within the same model.
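For the proofreading tip about brand names, one low-effort approach is a small glossary pass over the exported text. The sketch below is a hypothetical post-processing step you would run yourself on the .txt export; it is not a Buzz feature, and the glossary entries are examples.

```python
import re


def apply_glossary(text: str, glossary: dict) -> str:
    """Replace commonly misheard terms with their correct spelling.

    Matching is case-insensitive and limited to whole words or phrases.
    """
    for wrong, right in glossary.items():
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", flags=re.IGNORECASE)
        # A function replacement keeps `right` literal (no backslash escapes).
        text = pattern.sub(lambda match, right=right: right, text)
    return text


# Example glossary: how Whisper might write a term -> how it should appear.
glossary = {
    "da vinci resolve": "DaVinci Resolve",
    "cap cut": "CapCut",
}

draft = "I edit in Da Vinci resolve and sometimes cap cut."
print(apply_glossary(draft, glossary))
```

A pass like this only catches mistakes you have already seen, so it complements a read-through rather than replacing it.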

Conclusion

Buzz is one of the cleaner wrappers around Whisper: no account signup, no subscription, and no upload of your audio to a third party. The medium model on a modern CPU will handle most podcast episodes, long lectures, or meeting recordings in roughly the same amount of time as the recording lasted, and on clean audio the output is competitive with many paid services.

If you’re transcribing footage you’re about to edit, pairing Buzz with DaVinci Resolve or CapCut gives you a complete free workflow — record, transcribe, edit, and export with zero subscriptions.


Frequently Asked Questions

Is Buzz really completely free?

Yes. Buzz is open source under the MIT licence, downloadable from GitHub, with no paywall, no trial, and no feature gating. The Whisper models it uses are also free and released by OpenAI under the MIT licence.

Does Buzz actually work offline?

Yes, after the model is downloaded on first use. The model file is cached locally (usually in %LOCALAPPDATA% on Windows), and subsequent runs never reach out to the internet. You can physically disconnect and Buzz keeps working.

What languages does Buzz support?

Whisper — and therefore Buzz — supports over 90 languages for transcription, including Spanish, French, German, Arabic, Chinese, Hindi, Japanese, and many smaller languages. The translate mode outputs English regardless of the input language. Accuracy varies by language — English and major European languages are strongest, low-resource languages less so.

Can Buzz use my GPU?

Yes — Buzz can use an NVIDIA GPU with CUDA acceleration. In Preferences, pick the CUDA backend if you have a compatible card. On an RTX 3060 or better, this is 5–10× faster than CPU. Buzz also supports Apple Silicon GPU acceleration on Macs.

What export formats does Buzz support?

Plain text (.txt), SubRip subtitles (.srt), and WebVTT (.vtt). SRT is accepted by YouTube and most video editors, so it’s the one to pick for captions. Plain text is best for notes, blog drafts, or content you’ll paste elsewhere.
