blog

Compositing AI-generated shots without losing continuity

The biggest visual problem in a decomposed text-to-video pipeline is not the quality of any individual image. Models like FLUX-schnell produce pleasant 1920×1080 frames at a reasonable speed. The problem is what happens when you stitch sixty of them together under a continuous narration: every frame was generated in isolation, by a model with no memory of the others, and the audience notices instantly.

This post is the messy, working version of what we know about that problem. There is no silver bullet. There is a stack of small tricks that compound.

What continuity actually means here

“Continuity” in a pipeline like direktor’s is not the film school definition. We are not worried about an actor’s coffee cup moving between cuts, because there are no actors and no coffee cups that persist. We are worried about three lower-order things.

Subject continuity. If the script mentions a chef twice, the chef in scene 4 and the chef in scene 17 should plausibly be the same person. They will not be — the model has no memory of the first generation — but they should at least be recognisably the same character type.

Visual register continuity. If scene 1 is a moody dim kitchen at dusk and scene 2 is a flat fluorescent office, the cut feels like an editing mistake, not a story beat. Register should drift slowly unless the script explicitly asks for a shift.

Composition continuity. If half the shots are tight close-ups and half are wide establishing shots, the rhythm feels broken. Composition should follow the script’s pace, not the model’s mood that day.

direktor handles none of these automatically. Below is what we do, what works, and what doesn’t.

The default failure

In a naive pipeline, each transcript segment goes to a single GPT call that produces a stand-alone image prompt. The prompt is something like “a vivid cinematic shot of a chef preparing dough at a wooden counter.” It is independent of the previous prompt. FLUX renders it independently of the previous render. You end up with sixty perfectly good images that look like they came from sixty different films.

The artefact is jarring. The cuts feel arbitrary. The narration sounds continuous but the visuals do not.

What helps, in order of effort

We will go from “easy and worth doing” to “harder and partial.”

1. Lock the visual register in the system prompt

The single biggest lift, and the cheapest. The system prompt for image-prompt generation should say things like “cinematic, warm tungsten lighting, shallow depth of field, 35mm film grain, muted palette.” Every scene then inherits that aesthetic regardless of subject matter. The chef and the office are both shot like they belong in the same film.

This costs nothing. It is fifteen extra words in the system prompt and it covers maybe 60% of the problem. We recommend doing it before anything else.

2. Include a one-line “style bible” with every prompt

Build a short string at the start of the run — “Visual style: ink, amber, cream. Documentary-toned. Warm key light from frame-left. Low-saturation backgrounds.” — and concatenate it onto every prompt before it goes to FLUX. The image model now sees the style requirement on every call, not just implicitly through the GPT-side system prompt.

This is mechanical and adds a few tokens per call. It does not solve subject continuity but it gives visual register a much stronger anchor.

3. Aggregate the transcript before prompting

Stable Diffusion prompts work better when they describe a single visual idea. Whisper-style transcripts arrive as small chunks — five or ten seconds each. If you prompt one image per chunk, you end up with a cut every ten seconds, which is too fast for most narration, and the prompts get fragmented because there is not enough text in each chunk to anchor a strong visual.

direktor aggregates chunks until each segment is approximately 30 seconds long, then prompts once per segment. The cuts are slower, the prompts are richer, and the model has more room to produce something coherent. The aggregation is in transcript.py:aggregate_chunks and it is one of the more underrated pieces of the pipeline.

If two prompts are related — same character, same location — passing the same FLUX seed makes the outputs noticeably more similar. We do not currently do this in the default pipeline because we have not yet built the “are these prompts related?” classifier. It is on the list.

The crude version is to use the same seed for the whole run. This makes everything look a little more like itself, at the cost of variety. The slightly better version is to bucket prompts by topic and assign one seed per bucket.

5. Let the script imply the cuts

The naive approach is “one image per N seconds.” The slightly better approach is “one image per transcript chunk.” The right approach is “one image per beat in the script.” direktor does the second and is moving toward the third. The work involves having the script-generation step emit explicit beat markers that the image-prompt step can read.

This is the highest-leverage continuity move we have not yet finished. When a beat ends, you cut. When a beat is still going, you do not. The result is a video where the visuals breathe with the script instead of flipping past it.

6. Compose, do not just concatenate

FFmpeg’s concat demuxer will glue images together at hard cuts. This is fine for documentary-style content, where hard cuts read as editorial. It is not fine if you want the visual to feel produced. The next layer up is a slow crossfade between adjacent images, optionally with a Ken Burns push or pull.

direktor’s default is hard cuts because they are honest about what the pipeline produced. Crossfades and Ken Burns moves are easy to add at the FFmpeg layer, but they also smooth over real discontinuities and can make the output feel like it is trying to hide something. We leave them off by default and let users opt in.

7. Accept that the model will not remember

Every single one of the above techniques is about working around the fact that FLUX-schnell does not have memory between calls. A model with persistent character embeddings — one where you could attach an identity token to the chef and have it survive across thirty calls — would make most of these tricks unnecessary. That is not what we have today.

When such a model exists and ships with a sensible API, the direktor pipeline absorbs it as a one-line config change. Until then, we treat continuity as a compositing problem, not a generation problem.

What does not help

A few things look tempting but turn out to fight the medium.

Asking GPT to “make the prompt consistent with the previous one” produces worse prompts, not better ones. The model spends its capacity restating the previous prompt instead of describing the current scene. We tried it. The prompts became muddled and the images got worse.

Generating one giant image and crop-zooming for different “shots” is clever but produces a wall of identical-feeling visuals. The audience reads it as static.

Overlaying a watermark or logo “to unify the look” reads exactly the way it sounds — like a sticker on the bottom-right of every frame. Skip it.

The honest summary

Continuity in a pipeline like direktor is not free. The default output, with no work, looks like sixty unrelated images stitched to narration. The first three techniques above — register lock in the system prompt, a per-call style bible, and proper transcript aggregation — fix most of what an attentive viewer notices. The later ones are higher-effort, higher-payoff, and tracked in the repo as work in progress.

We think the right way to ship a pipeline like this is to give people honest defaults, expose the seams, and let them harden the continuity layer to whatever level their content actually needs. A research summary that nobody will watch in full does not need the same continuity discipline as a marketing piece that is going to play on a landing page. The pipeline should not pretend otherwise.


← All build notes