
From script to shot list: where pipelines win over single-shot prompts

There is a tidy version of the text-to-video pitch where you paste a paragraph into a single model, wait three minutes and get a finished film back. Like most tidy versions, it works in the demo and breaks the moment your input is longer than a couple of sentences, or your output needs to be longer than ten seconds, or someone else needs to be able to edit it later.

direktor takes the other route. Six stages, six checkpoints, six files on disk. This post is about why — and what we learned that pushed us out of the single-shot camp.

What single-shot actually means

A single-shot text-to-video system takes text in and returns video out, with no exposed intermediates. There is, of course, a pipeline inside it: tokenisation, planning, frame generation, audio synthesis if any, encoding. You just cannot see it. You cannot pause it, edit it or resume it. If anything inside fails, you re-run from the top and pay for the lot again.

For ten-second clips this is fine. The expected output is small, the cost is bounded and the prompt is usually short enough that you can iterate by rewriting it. For anything longer the maths changes quickly. A six-minute explainer is roughly thirty-six times the output of a ten-second clip (360 seconds against 10), but the prompt is not thirty-six times as informative, and the failure modes compound rather than add.
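To put a rough number on "compound rather than add", here is a toy calculation. It assumes each ten-second segment succeeds independently with the same probability, which is a simplification we are making for illustration and not a measurement of any real model. The 0.98 figure is likewise invented.

```python
def whole_run_success(per_segment_success: float, segments: int) -> float:
    """Chance that every segment succeeds, assuming independent segments."""
    return per_segment_success ** segments


# Hypothetical figure: each ten-second segment comes out usable 98% of the time.
ten_second_clip = whole_run_success(0.98, 1)
six_minute_video = whole_run_success(0.98, 36)

print(f"ten-second clip:  {ten_second_clip:.0%}")
print(f"six-minute video: {six_minute_video:.0%}")
```

Under those assumptions the short clip is almost always fine, while the six-minute run comes out clean less than half the time, and every miss means paying for the whole run again.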

Decomposition is the actual product

The thing direktor sells is not narration, or images, or composition. It is the seams between them.

Stage 1 produces podcast_script.txt. It is plain text. You can read it. You can rewrite it. You can delete the last paragraph and watch the rest of the pipeline shorten itself to match. The model that wrote it was given a system prompt asking for a single-person podcast script and no host annotations. If you do not like its prose style, you do not need to fork the project — you edit one file.

Stage 2 reads that file and produces audio.mp3. The text is chunked at ~150 characters and each chunk is sent to BARK via Replicate, then concatenated with FFmpeg. If three chunks out of forty fail, the run does not crash; the failures get logged and the rest of the audio gets assembled around the gaps. We chose chunking deliberately: at this size, a single failed chunk costs you a few seconds of audio, not a half-hour run.
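A minimal sketch of the chunk-and-tolerate idea described above. This is not direktor's actual code: `chunk_text`, `synthesise_all` and the `synthesise` callable are hypothetical names, and the real stage would call BARK where the callable is.

```python
import logging

log = logging.getLogger("direktor.sketch")


def chunk_text(text: str, limit: int = 150) -> list[str]:
    """Split text on whitespace into chunks of at most `limit` characters.

    A single word longer than `limit` becomes its own oversized chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= limit or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def synthesise_all(chunks, synthesise):
    """Call `synthesise(chunk)` for every chunk; log failures and carry on."""
    clips, failed = [], []
    for index, chunk in enumerate(chunks):
        try:
            clips.append(synthesise(chunk))
        except Exception as exc:  # one bad chunk should not end the run
            log.warning("chunk %d failed: %s", index, exc)
            failed.append(index)
    return clips, failed
```

The design point is the `except` branch: a failed chunk is recorded and skipped, so the surviving clips can still be concatenated and the failed indices retried later.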

Stage 3 transcribes the audio you just generated. This is, on its face, redundant — we just turned text into audio, why transcribe it back? Because BARK does not return per-word timestamps, and we want them. The transcript is what drives stages 4 and 5: we use the timestamped chunks to decide where the picture should change.
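To make "decide where the picture should change" concrete, here is one way it could work. The chunk shape (a dict with `text` and a `(start, end)` `timestamp`) and the minimum shot length are both assumptions for illustration; the real transcript format and cutting rule may differ.

```python
def shots_from_transcript(chunks, min_shot_seconds=3.0):
    """Group timestamped transcript chunks into shots of a minimum length.

    Each chunk is assumed to look like:
        {"text": "some words", "timestamp": (start_seconds, end_seconds)}
    """
    shots = []
    start, texts = None, []
    for chunk in chunks:
        begin, end = chunk["timestamp"]
        if start is None:
            start = begin
        texts.append(chunk["text"].strip())
        if end - start >= min_shot_seconds:
            shots.append({"start": start, "end": end, "text": " ".join(texts)})
            start, texts = None, []
    if texts:  # trailing words that did not fill a whole shot
        last_end = chunks[-1]["timestamp"][1]
        shots.append({"start": start, "end": last_end, "text": " ".join(texts)})
    return shots
```

Each resulting shot carries the text an image prompt could be written from, plus the start and end times the final composition needs.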

Stages 4 and 5 turn the transcript into image prompts and then into images. Stage 6 hands the audio file and the list of images, with their times, to FFmpeg. FFmpeg does the boring part: concat demuxer, video stream at 1920×1080, audio stream copied from the MP3, optional drawtext overlays for keywords.
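For readers who have not used the concat demuxer, the hand-off to FFmpeg might look roughly like this. The shot dictionaries, file names and encoder choice (`libx264`) are assumptions for the sketch, and the drawtext overlays are left out.

```python
def concat_list_text(shots):
    """Build the text of an FFmpeg concat-demuxer list: one image per shot."""
    lines = []
    for shot in shots:
        lines.append(f"file '{shot['image']}'")
        lines.append(f"duration {shot['end'] - shot['start']:.3f}")
    if shots:
        # The last image is commonly listed a second time, because the
        # demuxer may otherwise not honour the final duration.
        lines.append(f"file '{shots[-1]['image']}'")
    return "\n".join(lines) + "\n"


def ffmpeg_command(list_path, audio_path, output_path):
    """Assemble the FFmpeg call as an argument list (it is not run here)."""
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0", "-i", str(list_path),  # image sequence
        "-i", str(audio_path),                                # narration
        "-vf", "scale=1920:1080,format=yuv420p",
        "-c:v", "libx264",
        "-c:a", "copy",  # copy the MP3 stream instead of re-encoding it
        "-shortest",
        str(output_path),
    ]
```

Write the list text to a file, then pass the command to `subprocess.run(cmd, check=True)`. Keeping the command as a plain list makes it easy to print and inspect before anything is rendered.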

In a single-shot system, every one of these hand-offs is hidden inside the model, where you cannot inspect or repair it. In direktor, they are the product.

Specific things the pipeline buys you

A few wins are not obvious until you start working in this shape.

Re-running stage N is cheap

In a single-shot system, regenerating “the bit at 2:14” means re-rendering the whole video. In direktor, you change one entry in image_prompts.json and re-run stage 5 with that index. The audio, transcript and other images stay put. You pay one FLUX call.
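A sketch of what a one-shot fix could look like in code. It assumes image_prompts.json is a flat JSON list of prompt strings, which may not match the real file, and `render` stands in for whatever actually calls the image model. The function names and the `shot_NNN.png` naming are hypothetical.

```python
import json
from pathlib import Path


def replace_prompt(prompts_path, index, new_prompt):
    """Overwrite one entry in the prompts file and return the updated list."""
    path = Path(prompts_path)
    prompts = json.loads(path.read_text(encoding="utf-8"))
    prompts[index] = new_prompt
    path.write_text(
        json.dumps(prompts, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return prompts


def rerender(prompts, indices, render, out_dir="images"):
    """Call `render(prompt, out_path)` for the requested indices only."""
    written = []
    for i in indices:
        out_path = Path(out_dir) / f"shot_{i:03d}.png"
        render(prompts[i], out_path)
        written.append(out_path)
    return written
```

Everything not named in `indices` is left alone on disk, which is where the single-call cost comes from.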

Intermediates are human-editable

A common complaint about generative pipelines is that they hide where the bad output came from. Was the script weak? Was the prompt weak? Was the image weak? In direktor each of those is a separate file with a separate stage. You can stare at image_prompts.json and decide that “vivid cinematic photograph of two engineers at a whiteboard” is doing too much heavy lifting, swap it for something more specific, and re-render exactly that shot.

Model swaps stay local

Every model id is an environment variable. BARK_MODEL, FLUX_MODEL, GPT4_MODEL, DISTIL_MODEL. We never wrote a “use FLUX” assumption into the code; we wrote a “call whatever model id is in the env var” assumption. When a better TTS lands on Replicate next month, the diff is one line in a .env file. No code change, no fork.
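In code, that assumption can be as small as the sketch below. The four variable names come from this post; the `load_model_ids` helper and its fail-fast behaviour are our illustration, not necessarily how direktor reads them.

```python
import os

MODEL_VARS = ("BARK_MODEL", "FLUX_MODEL", "GPT4_MODEL", "DISTIL_MODEL")


def load_model_ids(env=os.environ):
    """Read every model id from the environment, failing early if one is unset."""
    missing = [name for name in MODEL_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing model ids: {', '.join(missing)}")
    return {name: env[name] for name in MODEL_VARS}
```

No model id appears in the code itself, so swapping a model means changing a value in .env and nothing else.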

Failure stays local

This is the biggest practical advantage. When stage 5 dies because Replicate had a hiccup, the outputs of stages 1 to 4 are still on disk. You resume from 5. You do not regenerate the audio. You do not re-transcribe. You do not re-prompt. This is the difference between a tool that costs a dollar to retry and a tool that costs ten.
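The resume logic does not need to be clever. A minimal sketch, assuming each stage is a callable that writes one named output into a working directory; `run_from` and that stage signature are hypothetical, and the real pipeline.py may be organised differently.

```python
from pathlib import Path


def run_from(stages, workdir, start_at=1):
    """Run stages in order, beginning at `start_at` (1-based).

    `stages` is a list of (output_name, stage_callable) pairs. Earlier stages
    are skipped, but their outputs must already exist on disk.
    """
    workdir = Path(workdir)
    ran = []
    for number, (output_name, stage) in enumerate(stages, start=1):
        output = workdir / output_name
        if number < start_at:
            if not output.exists():
                raise FileNotFoundError(
                    f"stage {number} output missing: {output}"
                )
            continue
        stage(workdir)
        ran.append(number)
    return ran
```

Resuming after a stage 5 failure is then `run_from(stages, workdir, start_at=5)`. The existence check turns a missing intermediate into an immediate, readable error before any paid model call is made.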

Where pipelines hurt

We want to be honest about this. Pipelines also lose, in a couple of specific ways.

Latency. A pipeline is sequential by construction. Six stages each waiting on a network call to a generative model is not fast. A single-shot model can parallelise internally in ways direktor cannot. If your only success metric is wall-clock time to first preview, single-shot wins.

Surface area. Six stages means six failure modes, six sets of credentials, six things to keep up to date. The first install of direktor is genuinely harder than the first install of a SaaS that does the same thing. We accept the trade.

Coherence between shots. Each image prompt is generated from one transcript segment, in isolation. The model does not know that the shot before it had a red kitchen and a Persian cat. Continuity is the hardest thing to get right in a decomposed pipeline. We talk about this in “Compositing AI-generated shots without losing continuity”.

What “win” means here

We do not think pipelines win on every axis. They lose on latency. They lose on installer surface area. They are harder to demo because the demo is six commands instead of one.

What they win on is everything that happens after the demo. They win when you want to ship a fourteen-minute explainer next week. They win when the script changes and you do not want to re-render the visuals. They win when you want to teach the tool to a junior teammate by handing them a folder of intermediates instead of a prompt-engineering ritual. They win when a model gets deprecated and you need to swap it without rebuilding your output pipeline.

The single-shot approach treats text-to-video as a generation problem. The pipeline approach treats it as an editing problem with generation steps. We think the second framing maps more cleanly onto how people actually make video — and onto how the underlying model landscape will keep changing.

direktor is one expression of that bet. It is opinionated, six stages long, and entirely re-runnable from any stage. If you want a one-line text-to-video API, it is the wrong tool. If you want a pipeline you can read, edit and own, the file you want to look at first is pipeline.py. It is a hundred lines. Read it, then run it.

