What an SRT File Is
SRT (SubRip Subtitle) is the plain-text subtitle format nearly everything accepts: video editors, media players, YouTube, and course platforms. Each entry pairs a start and end timestamp with a line of text, so the words appear on screen in sync with the audio.
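To make the layout concrete, here is a minimal sketch of how cues map onto SRT text: a running index, a `start --> end` line with a comma before the milliseconds, the caption text, and a blank line between entries. The function names and the sample cues are illustrative only, not part of any tool mentioned on this page.

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues: list[tuple[float, float, str]]) -> str:
    """Build SRT text from (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    # Joining with a newline leaves one blank line between entries.
    return "\n".join(blocks)


if __name__ == "__main__":
    # Made-up sample cues, purely for illustration.
    print(to_srt([
        (0.0, 2.5, "Welcome to the show."),
        (2.5, 6.0, "Today we talk about subtitles."),
    ]))
```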
Generating one by hand means typing the transcript and timing every cue. A speech model does both at once, which is why you can now generate subtitles in a single pass on your own computer.
Method 1: Spokenly File Transcription (Mac and Windows)
1. Download Spokenly (free, no account) and open file transcription.
2. Drop in the audio: podcast episode, lecture, recording, any common format (M4A, MP3, WAV and more).
3. Pick a local Whisper model to keep the file on your machine (a cloud engine is available for tougher audio, but that path does upload).
4. Export as SRT, or VTT for web players. The other formats are compared in the table below.
The FCPXML export option matters if you cut video in Final Cut Pro, because the captions land straight on the timeline. For DaVinci Resolve or Premiere, a standard SRT file imports directly. And since local models have no per-minute pricing, a three-hour recording is as free as a three-minute one.
Method 2: whisper.cpp on the Command Line
The open-source whisper.cpp also writes .srt files (pass its SRT output flag) and runs on Mac, Windows, and Linux. It fits when you want to script the work, such as batch-generating subtitles across a podcast archive. It has no graphical interface and you build it from source; if you would rather skip the terminal, Method 1 exports the same files. Our guide on running Whisper locally covers the setup.
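As a rough sketch of the batch case, the script below loops over a folder and calls whisper.cpp once per file. The binary path, model path, and folder name are assumptions to adjust for your own build: recent whisper.cpp builds typically name the executable `whisper-cli` (older ones used `main`). The `-m`, `-f`, `-osrt`, and `-of` options are the model, input file, SRT output, and output base name as listed in the tool's help text. The sketch only picks up `.wav` files, since whisper.cpp has traditionally expected 16 kHz WAV input and other formats may need converting first.

```python
import subprocess
from pathlib import Path

# Assumed locations; change these to match your own whisper.cpp build.
WHISPER_BIN = Path("./build/bin/whisper-cli")
MODEL = Path("./models/ggml-base.en.bin")


def build_command(audio: Path) -> list[str]:
    """Return the whisper.cpp invocation that writes <audio stem>.srt beside the input."""
    return [
        str(WHISPER_BIN),
        "-m", str(MODEL),                     # model file to load
        "-f", str(audio),                     # input audio
        "-osrt",                              # write SubRip (.srt) output
        "-of", str(audio.with_suffix("")),    # output path without extension
    ]


def transcribe_folder(folder: Path) -> None:
    """Generate an .srt for every .wav in the folder that does not have one yet."""
    for audio in sorted(folder.glob("*.wav")):
        if audio.with_suffix(".srt").exists():
            continue  # already transcribed on an earlier run
        subprocess.run(build_command(audio), check=True)


# Example call (hypothetical folder name):
# transcribe_folder(Path("podcast_archive"))
```

Skipping files that already have an `.srt` makes the script safe to rerun after an interruption, which helps on a long archive.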
About Online Audio-to-SRT Converters
Web converters work, but check the limits. They upload your audio to their servers, free tiers usually cap minutes, and full exports often need a signup or a paid plan. For interviews, client work, or anything under NDA, on-device transcription keeps the file off outside servers. If your audio is public anyway (a published podcast), the trade-off is smaller; it is mostly about cost per minute at volume.
SRT vs VTT vs Plain Text
| Format | Best for | Notes |
|---|---|---|
| SRT | Video editors, YouTube, players | Universal; timestamps + text, no styling |
| VTT (WebVTT) | HTML5 video, web platforms | Web-native; supports styling cues |
| TXT / Markdown | Notes, articles, show notes | No timing; just the transcript |
| FCPXML | Final Cut Pro | Captions placed on the timeline |
| JSON | Scripts and pipelines | Structured segments for processing |
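The SRT and VTT rows are close enough that one can be derived from the other. A minimal sketch, assuming a well-formed SRT input: WebVTT needs a `WEBVTT` header line and uses a period rather than a comma before the milliseconds, while the numeric SRT indices can stay because WebVTT allows cue identifiers. The function name is illustrative.

```python
import re

# Matches an SRT timestamp such as 00:01:02,345
_SRT_TIME = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT text to minimal WebVTT: add the header, use '.' for milliseconds."""
    lines = []
    for line in srt_text.strip().splitlines():
        if "-->" in line:
            # Only timing lines are rewritten, so commas in captions are untouched.
            line = _SRT_TIME.sub(r"\1.\2", line)
        lines.append(line)
    return "WEBVTT\n\n" + "\n".join(lines) + "\n"
```

This does not add any styling; it only produces a file that web players should accept in place of the SRT.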
FAQ
How do I convert audio to an SRT file for free?
Two free paths. Drop the audio into Spokenly's file transcription on Mac or Windows, pick a local Whisper model, and export as SRT; everything runs on your machine. Or use the open-source whisper.cpp from the command line, which writes .srt files directly. Neither uploads your audio or caps minutes.
Can I create an SRT file without uploading my audio anywhere?
Yes. Local transcription is the point of both methods on this page: Spokenly's local models and whisper.cpp both process the file on your own computer. Online subtitle generators, by contrast, require uploading the audio to their servers.
What is the difference between SRT and VTT?
Both are timed subtitle formats. SRT is the older, universally supported format that most video editors and players accept. VTT (WebVTT) is the web-native flavor used by HTML5 video players and platforms like YouTube, and it supports styling cues. Spokenly exports both, so pick whichever your platform asks for.
Does this work for podcast episodes and YouTube captions?
Yes. Transcribe the episode audio to SRT and upload the file as captions on YouTube, or feed it to your podcast host if it accepts subtitle files. For long episodes, local transcription has no per-minute fees, which is where online tools usually start charging.
How accurate are the subtitle timestamps?
Whisper-class models generate segment-level timing that is accurate enough for standard captioning. For frame-perfect work (broadcast subtitling), expect to nudge a few cues in your editor afterward; the transcription still saves the bulk of the typing and timing work.
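When the drift is uniform (every caption a little early or late), a small script can shift all cues at once instead of nudging them one by one. This is a sketch under that assumption; the function name is illustrative, and cues that need individual corrections still belong in a subtitle editor.

```python
import re

# Captures hours, minutes, seconds, milliseconds of an SRT timestamp.
_TIME = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def shift_srt(srt_text: str, offset_ms: int) -> str:
    """Shift every timestamp by offset_ms; negative values move cues earlier."""

    def shift(match: re.Match) -> str:
        h, m, s, ms = (int(g) for g in match.groups())
        # Clamp at zero so a large negative offset cannot produce negative times.
        total = max(0, ((h * 60 + m) * 60 + s) * 1000 + ms + offset_ms)
        h, rem = divmod(total, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    shifted = (
        _TIME.sub(shift, line) if "-->" in line else line
        for line in srt_text.splitlines()
    )
    return "\n".join(shifted) + "\n"
```

For example, `shift_srt(text, 250)` delays every cue by a quarter of a second.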
Can I get a plain transcript instead of subtitles?
Yes. The same file transcription flow exports plain text (TXT) and Markdown when you do not need timing, and JSON when you want to process the result programmatically.