Whisper, but easier: a single-binary subtitle pipeline

The advice "just use Whisper" is correct and useless. It is correct because Whisper is an excellent open-source ASR model. It is useless because nobody actually wants Whisper. They want subtitled video, in production, callable from their app or their agent, without having to think about Python virtualenvs or GPU drivers.

This is what we kept running into, so we built the pipeline that should have existed already. Here is what is in it and what disappears when you use it.

The pain of "just use Whisper"

The unhappy path of integrating Whisper looks like this:

Python deps. pip install openai-whisper pulls torch and a model blob. You now have a Python service in your stack whether you wanted one or not.
GPU setup. CPU inference works but is slow. To go fast you need CUDA, the right driver, the right torch wheel, and a base image that matches all three. This is fine until your base image updates.
Audio extraction. Whisper takes audio. Your users send video. ffmpeg is the answer, but you are now writing ffmpeg invocations and handling its many failure modes.
Output formatting. Whisper returns JSON with segments. Your users want SRT, or VTT, or burned-in subtitles. More glue code, more edge cases around line wrapping and CJK characters.
Burn-in. If you want subtitles in the video frame, you go back to ffmpeg with libass and an SRT file. ffmpeg's subtitle filter has its own surprises around fonts, scaling, and styling.
HTTP wrapper. Now wrap all of that in a service so the rest of your stack can call it. Add upload handling. Add a job queue because transcription is slow. Add cleanup because video files are big.

You can do all of this. We did all of this. Then we packaged it so the next person does not have to.

What we built on top

Three layers, all open source:

A compression stage. ffmpeg with sane defaults — re-encode to a reasonable bitrate, normalize audio, extract a 16 kHz mono WAV for Whisper. This stage is what makes the rest of the pipeline predictable.
A transcription stage. OpenAI Whisper large model, called either via the OpenAI API or a local install. Output is normalized into SRT and JSON regardless of which backend you use.
A burn-in stage. ffmpeg again, libass under the hood, sensible font fallbacks for CJK and RTL scripts. Output is an MP4 with subtitles rendered into frames so they survive every player and every social platform.

A single Go binary glues the three together. The HTTP API in front of it accepts an upload, returns a job ID, and serves the burned video when the job is done. The MCP server in front of that exposes the same operations as tools to an agent.

The API surface that emerges

When you simplify the inside, the outside gets smaller. Here is the entire HTTP surface:

# kick off a job
POST /api/jobs
{ "video_url": "...", "language": "en" }
=> { "id": "abc123" }

# poll
GET /api/jobs/abc123
=> { "id": "abc123", "status": "done", "download_url": "..." }

# download the burned video
GET /api/jobs/abc123/download

# or just the transcript
GET /api/jobs/abc123/transcript.srt

Four endpoints. No SDK required. The full reference is at /docs/api.

The MCP surface is even smaller because the model handles the polling loop itself:

start_upload — returns a presigned URL the agent uploads to with curl
get_video_status
get_transcript
get_download_url — returns a presigned URL the agent downloads from with curl

If you want subtitles on a video and you are talking to an LLM, you say "subtitle this" and four tool calls happen. That is the surface. Video bytes never enter the model's context — uploads and downloads happen server-to-server, with the agent only holding the URLs.

What goes away

When you use the pipeline instead of building it, here is what you stop maintaining:

A Python service with torch in it.
A CUDA base image and the version drift around it.
ffmpeg invocations for audio extraction, compression, and burn-in.
An SRT formatter that handles line breaks correctly across scripts.
Font-fallback logic for CJK and RTL.
A job queue and a status endpoint.
An upload handler that does not OOM on large files.
A cleanup job that prunes intermediate artifacts.

That is roughly two engineer-weeks of plumbing the first time, and an ongoing tax forever after, that you do not have to budget for.

When to use it vs raw Whisper

There is still a case for using Whisper directly. Take it if:

You only need transcripts, never video.
You are doing research-grade work where you need to swap models, decoder settings, or post-processing constantly.
You already operate a Python ML stack and adding one more model is free for you.

For everything else — apps, agents, internal tools, content workflows — wrapping Whisper to the level we have already wrapped it is wasted effort.

Try it

The hosted pipeline runs at brains.subtitlesking.com. The free tier handles videos up to 100 MB with 24 hour retention, no card required. To try it through the web, use /try.

If you want to run the same pipeline on your own box, the binary is at github.com/kirillzubovsky/subtitlesking-mcp and the install steps are on /self-host.

For the apples-to-apples comparison against rolling your own Whisper service, the deep version is at /vs/whisper-self-hosted.