How to Transcribe an Interview (Free, Speaker Labels)

Manual vs Automatic Transcription

Typing an interview by hand is slow: estimates commonly land at four or more hours of typing per hour of audio, and longer with crosstalk or a noisy cafe recording. Manual transcription does force you to absorb the material, which some qualitative methods treat as part of the analysis. A careful review pass over a machine-generated draft recovers some of that immersion, though not all; if your method leans on transcription-as-familiarization, weigh that loss against the hours saved.

The practical workflow is hybrid. A speech model produces the full draft with speaker labels in a fraction of the recording's running time (how fast depends on your hardware and model choice), and you spend your time on the part that needs a human: correcting names and terms, judging unclear passages, and formatting to your style guide. Plan for one focused review pass that takes roughly as long as the recording itself.

Step by Step: Transcribe an Interview Free

1. Get the recording as a file on your computer. Phone memos, dedicated recorders, and call recordings are all ordinary audio files (M4A, MP3, WAV).
2. Open file transcription in Spokenly (free on Mac and Windows) and drop the file in.
3. Pick a local Whisper model so the audio never leaves your machine, and enable speaker labels.
4. Transcribe, then export: plain text or Markdown for analysis, SRT if you need timestamps tied to the audio.
5. Do one review pass against the audio: fix names, technical terms, and any turns the speaker labels got wrong.
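
If you have not worked with SRT before, it helps to see what the format looks like: each cue is a running number, a start and end time written as HH:MM:SS,mmm, and the text. The Python sketch below builds that layout from a list of segments. It only illustrates the file format and is not Spokenly's export code; the (start, end, text) tuples and the function names are assumptions made for the example.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render (start_seconds, end_seconds, text) tuples as SRT cues.

    The tuple layout is a hypothetical input shape chosen for this sketch.
    """
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(cues)


if __name__ == "__main__":
    demo = [
        (0.0, 2.5, "Interviewer: Thanks for joining."),
        (2.5, 6.0, "P1: Happy to be here."),
    ]
    print(to_srt(demo))
```

Because every cue carries its own start time, a quote found in the SRT export can be located in the audio during the review pass.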

If the interview came from a video call, the recording is still just a file; the Zoom recording guide covers where those files live.

Verbatim vs Clean Verbatim

Decide the style before you start editing, because it changes what counts as an error:

Full verbatim

Every um, false start, laugh, and repetition is kept. Required when the way something is said matters: discourse analysis, some qualitative methods, legal contexts.

Clean (intelligent) verbatim

Filler words and false starts are dropped, grammar is lightly tidied, meaning is untouched. The default for journalism, content, and most research interviews.

Speech models naturally produce something close to clean verbatim. If your method demands full verbatim, transcribe first, then restore fillers during the review pass while listening; it is still faster than typing from zero.
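
To make the difference between the two styles concrete, here is a minimal Python sketch that turns a full-verbatim line into something closer to clean verbatim by dropping a short list of fillers. The filler list and the function name are assumptions for illustration. The sketch does not detect false starts, repair capitalization, or tidy grammar, so treat it as a first pass before human review rather than a substitute for it.

```python
import re

# Hypothetical filler list; adjust it for your speakers and language.
FILLERS = ("um", "uh", "er", "erm")

# Match a filler as a whole word, plus an optional trailing comma and spaces.
_FILLER_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(f) for f in FILLERS) + r")\b,?\s*",
    flags=re.IGNORECASE,
)


def to_clean_verbatim(text: str) -> str:
    """Drop listed filler words and collapse any leftover double spaces."""
    cleaned = _FILLER_RE.sub("", text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()


if __name__ == "__main__":
    line = "So, um, we started in March, uh, after the funding came."
    print(to_clean_verbatim(line))
```

Going the other way, from clean to full verbatim, cannot be automated from the text alone, which is why restoring fillers happens while listening.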

Formatting for Research and Dissertations

- Label speakers consistently and anonymously where required: Interviewer / P1, or pseudonyms defined in one legend.
- Add timestamps in a consistent format ([hh:mm:ss] at each speaker turn is a common convention) so quotes can be traced back to the audio.
- Anonymize identifying details (names, employers, places) during the review pass, and keep the mapping in a separate secure file.
- Keep three artifacts: the original audio, the raw machine transcript, and the corrected final. That is your audit trail.
- If your method includes member checking, send participants their corrected transcript for confirmation before analysis, and note the step in your write-up.
- Follow your department's or publication's template for line spacing and quote citation; the transcript exports as plain text, so any template applies cleanly.
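
The labeling, timestamp, and anonymization conventions above can be sketched in a few lines of Python. This is one possible way to apply them, assuming you already have each turn's start time, speaker label, and text. The legend contents, names, and function names are hypothetical, and in practice the legend would be loaded from the separate secure file rather than written into a script.

```python
import re


def turn_timestamp(seconds: float) -> str:
    """Format a turn's start time as [hh:mm:ss], dropping fractions of a second."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"[{hours:02d}:{minutes:02d}:{secs:02d}]"


def pseudonymize(text: str, legend: dict) -> str:
    """Replace each identifying term with its pseudonym.

    Matching is whole-word and case-insensitive. Longer terms are replaced
    first so "Dana Reyes" is handled before "Dana".
    """
    for real in sorted(legend, key=len, reverse=True):
        alias = legend[real]
        pattern = rf"\b{re.escape(real)}\b"
        # A function replacement avoids backslash handling in the alias.
        text = re.sub(pattern, lambda _m, a=alias: a, text, flags=re.IGNORECASE)
    return text


def format_turn(start_seconds: float, speaker: str, text: str, legend: dict) -> str:
    """Render one speaker turn as '[hh:mm:ss] Speaker: text'."""
    return f"{turn_timestamp(start_seconds)} {speaker}: {pseudonymize(text, legend)}"


if __name__ == "__main__":
    # Hypothetical legend for illustration only.
    legend = {"Acme Health": "[Employer]", "Dana Reyes": "P2"}
    print(format_turn(3725, "P1", "I joined Acme Health after Dana Reyes left.", legend))
```

A find-and-replace like this only catches the terms in the legend. Indirect identifiers, such as a job title that only one person holds, still need a human read during the review pass.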

Confidential Interviews: Keep the Audio Local

Check your consent forms and ethics approvals first: they often restrict where recordings may be stored or processed, and an upload to a transcription website can violate that before you have typed a word. Local transcription removes the upload from the equation; with Spokenly's local models your computer runs the speech model, and Local Only Mode additionally blocks all network requests while you work. Local processing is one piece of compliance, not all of it: storage encryption, retention, and destruction of recordings are still governed by your protocol, so confirm the full lifecycle with your ethics board or editor.

Tool Options Compared

Option	Cost	Where audio goes	Fit
Spokenly file transcription	Free (local models)	Stays on your machine	Interviews, speaker labels, subtitle export
whisper.cpp (open source)	Free	Stays on your machine	Scripted batches, command-line users
Word transcribe (Microsoft 365)	Included in some plans	Microsoft cloud	Non-sensitive files within plan upload limits
Meeting services (Otter etc.)	Subscription	Their cloud	Auto-notes for live meetings; also accepts file uploads
Human transcription services	Per audio minute	Their staff and systems	Broadcast-grade full verbatim on deadline

The meeting-service category is compared in detail in Spokenly vs Otter.ai.

FAQ

How do I transcribe an interview for free?

Move the recording to your computer, open file transcription in Spokenly (free on Mac and Windows), and transcribe with a local Whisper model. Turn on speaker labels so each voice is tagged, then export as text and clean up names and formatting. The whole process runs on your machine with no upload and no per-minute fees.

How long does it take to transcribe an interview?

Typing it out by hand is commonly estimated at four or more hours per hour of audio, depending on typing speed and audio quality. Automatic transcription turns that into minutes of processing plus a review pass; for most interviews, budget roughly the length of the recording for a careful read-through and correction pass.

How do I transcribe an interview for a dissertation or qualitative research?

Use clean verbatim unless your method requires full verbatim, keep speaker labels consistent (Interviewer / P1), add timestamps at reasonable intervals or at each turn, and anonymize identifying details during the review pass. Keep the original audio, the raw machine transcript, and the corrected transcript as separate files so your audit trail is intact.

Can transcription label who is speaking?

Yes. Speaker labels (diarization) tag each segment by voice, which breaks one block of text into an attributed back-and-forth. Spokenly supports speaker labels in file transcription; review the labels during your correction pass, since overlapping or very short turns can be mislabeled.
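
As a rough illustration of how labeled segments become an attributed back-and-forth, the Python sketch below merges consecutive segments from the same speaker into one turn and then maps the raw labels to style-guide labels. The (speaker, text) tuples and the SPEAKER_0 style labels are assumptions for the example; real tools name and structure their output differently.

```python
from itertools import groupby


def merge_turns(segments):
    """Collapse consecutive (speaker, text) segments by the same speaker into turns."""
    turns = []
    for speaker, group in groupby(segments, key=lambda seg: seg[0]):
        text = " ".join(seg[1].strip() for seg in group)
        turns.append((speaker, text))
    return turns


def relabel(turns, names):
    """Map raw diarization labels to style-guide labels; unknown labels pass through."""
    return [(names.get(speaker, speaker), text) for speaker, text in turns]


if __name__ == "__main__":
    # Hypothetical diarized segments for illustration only.
    segments = [
        ("SPEAKER_0", "How did the project start?"),
        ("SPEAKER_1", "Slowly."),
        ("SPEAKER_1", "Then it grew."),
        ("SPEAKER_0", "When was that?"),
    ]
    names = {"SPEAKER_0": "Interviewer", "SPEAKER_1": "P1"}
    for speaker, text in relabel(merge_turns(segments), names):
        print(f"{speaker}: {text}")
```

Merging only joins neighbors that already share a label, so a mislabeled short turn shows up as an odd one-line interruption, which makes it easier to spot during the correction pass.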

How do I transcribe an interview without uploading it anywhere?

Choose a tool that runs the speech model on your own computer. Spokenly's local Whisper and Parakeet models and the open-source whisper.cpp both process audio entirely on-device, which matters when consent forms or ethics approval restrict where recordings can go.

Can Microsoft Word transcribe an interview?

Word's online version includes a transcribe feature on some Microsoft 365 plans; it uploads the audio to Microsoft's cloud and has upload limits. It can work for non-sensitive material, but for confidential interviews or long recordings, local transcription avoids the upload and the caps.

Ready to try Spokenly?

Free to use with local models. No account required.
