
Why text-to-video is mostly text-to-storyboard right now

The phrase “text to video” implies one thing and delivers another. What you actually get from almost every tool on the market today — including direktor — is something closer to “text to storyboard, with synthesised narration on top.” Once you accept that framing, the field gets a lot easier to navigate and a lot easier to build for.

This post is a slightly grumpy taxonomy.

The pitch versus the artefact

The pitch is: type a paragraph, get a film. The implicit artefact is a continuous moving image, with motion vectors, parallax, lighting changes and characters that hold continuity from shot to shot.

The actual artefact for most tools today is one of three things.

  1. Stills under audio. Generated images cut to a narration track at segment boundaries. Any Ken Burns-style pan is added by the compositing layer, not by the image model. (Our compositing layer is FFmpeg, and we do not add one.)
  2. Short motion clips. A handful of seconds of model-generated video, with the model picking the camera move and the action inside the clip. Hard to chain into anything longer than ten or fifteen seconds without breaks in continuity.
  3. Animated stills. A still that the model has been asked to animate, usually by hallucinating a small amount of camera or subject motion. Coherent for a few seconds, then it drifts.

direktor lives squarely in category one. We generate still images with FLUX-schnell at 1920×1080 and cut between them at transcript-derived boundaries. We are honest about it: the “video” output is a slideshow with timing, narration and overlays. It looks like video because the audio is continuous and the cuts are paced to the speech. The visual content underneath is a sequence of stills.

Why this is fine

Two reasons.

First, the medium most people want this for is already storyboarded thinking. Explainer videos, documentary-style segments, podcast clips with visuals — these are formats where the visual is illustrative rather than narrative-driving. Nobody watches an explainer because the camera move was great. They watch it because the script was clear. A pipeline that produces strong narration and adequate stills will outperform a pipeline that produces beautiful motion but a weaker script.

Second, storyboards compose into video at zero marginal cost. FFmpeg does that part for free. The expensive part is getting the right image at the right time. That is a discrete problem, not a continuous one. Treating it as discrete buys you tractable evaluation: “is this the right image for this 30-second chunk?” is a question with a yes-or-no answer. “Is this the right ten-second motion clip for this segment?” is not.
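To make the "FFmpeg does that part for free" claim concrete, here is a minimal sketch of how a list of timed stills could be handed to FFmpeg's concat demuxer. This is not direktor's actual invocation. The function names, the file names and the choice of the concat demuxer are assumptions for illustration, and the sketch assumes image paths contain no single quotes.

```python
from typing import List, Sequence, Tuple


def build_concat_list(shots: Sequence[Tuple[str, float]]) -> str:
    """Render (image_path, duration_seconds) pairs as an ffconcat script.

    Hypothetical helper: paths are assumed to contain no single quotes.
    """
    lines = ["ffconcat version 1.0"]
    for path, duration in shots:
        lines.append(f"file '{path}'")
        lines.append(f"duration {duration:.3f}")
    if shots:
        # The concat demuxer tends to ignore the duration of the final entry,
        # so the last still is listed once more without a duration.
        lines.append(f"file '{shots[-1][0]}'")
    return "\n".join(lines) + "\n"


def build_ffmpeg_command(list_path: str, audio_path: str, out_path: str) -> List[str]:
    """Assemble an argument list that muxes the stills with the narration."""
    return [
        "ffmpeg",
        "-f", "concat", "-safe", "0", "-i", list_path,  # timed stills
        "-i", audio_path,                               # narration track
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        "-shortest",
        out_path,
    ]


if __name__ == "__main__":
    print(build_concat_list([("shot_000.png", 30.0), ("shot_001.png", 28.5)]))
    print(" ".join(build_ffmpeg_command("shots.txt", "narration.wav", "out.mp4")))
```

The point of the sketch is that the whole "video" step reduces to a text file of paths and durations. Everything hard happens before that file is written.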

Why the framing matters

Calling it “text to storyboard” changes what you build, not just what you say.

It changes the model selection. If you are generating stills, you can use the strongest available image model — and the image-generation field is years ahead of the video-generation field in quality, controllability and cost. We use FLUX-schnell because it is fast and cheap; the same pipeline would happily call SDXL or anything else with a Replicate adapter.

It changes the editorial loop. If your visual outputs are stills, your “edit” is reorganising prompts in a JSON file. That is grep-able, diff-able, fork-able. The minute the visual outputs are short video clips, your edit becomes a timeline operation and you have rebuilt Premiere.
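As a rough illustration of what "diff-able" means here, the sketch below serialises two versions of a storyboard to JSON and diffs them with the standard library. The storyboard schema (start, end, prompt) is a hypothetical one invented for this example, not direktor's actual file format.

```python
import difflib
import json
from typing import Any, Dict, List


def storyboard_diff(before: List[Dict[str, Any]], after: List[Dict[str, Any]]) -> List[str]:
    """Return a unified diff between two storyboards serialised as JSON."""
    a = json.dumps(before, indent=2, sort_keys=True).splitlines()
    b = json.dumps(after, indent=2, sort_keys=True).splitlines()
    return list(
        difflib.unified_diff(a, b, fromfile="before.json", tofile="after.json", lineterm="")
    )


if __name__ == "__main__":
    # Hypothetical storyboard: one entry per ~30-second segment.
    before = [{"start": 0.0, "end": 30.0, "prompt": "a chef in a yellow apron"}]
    after = [{"start": 0.0, "end": 30.0, "prompt": "a chef in a red apron"}]
    print("\n".join(storyboard_diff(before, after)))
```

An edit to one shot shows up as a one-line change to one prompt, which is the property that lets ordinary text tooling stand in for a timeline editor.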

It changes the user-facing pitch. We tell people direktor produces a podcast-style video — narration with synchronised stills — because that is what it is. We do not tell people they are going to get something that competes with a Sora clip, because they will not.

What is genuinely new

Some of the pipeline pieces are new, even if “video” is the wrong word for the output.

The TTS layer is new. BARK and its successors produce narration that is close enough to a human voice that a passive listener does not notice within the first thirty seconds. Five years ago this required a voice actor.

The transcription layer is new in a different sense. We re-transcribe the audio we just generated, because Whisper-class models give us per-word timestamps that BARK does not. This is a quietly useful trick: synthesise audio, transcribe it to recover structure, use the structure to drive everything downstream. We get away with it because Whisper is essentially free at this scale.

The image-prompt layer is new in the most boring way possible: one GPT call per ~30-second segment, asking for a Stable Diffusion-style prompt that captures the most striking visual element of the text. The interesting part is not the prompt but the segmentation. We merge consecutive Whisper chunks until the merged segment is ~30 seconds long, and treat each segment boundary as a shot boundary. Shot boundaries derived from speech timing turn out to be a much better default than shot boundaries derived from sentence structure.
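The segmentation step can be sketched as a greedy merge. This is a minimal version under stated assumptions, not the exact code we run: it assumes each chunk is a dict with a "text" string and a numeric "timestamp" pair of (start, end) in seconds, and the function name and the 30-second default are illustrative.

```python
from typing import Any, Dict, List


def aggregate_chunks(
    chunks: List[Dict[str, Any]], target_seconds: float = 30.0
) -> List[Dict[str, Any]]:
    """Greedily merge consecutive timestamped chunks into ~target_seconds segments.

    Assumed chunk shape: {"text": str, "timestamp": (start, end)}.
    """
    segments: List[Dict[str, Any]] = []
    current = None
    for chunk in chunks:
        start, end = chunk["timestamp"]
        text = chunk["text"].strip()
        if current is None:
            # Open a new segment at this chunk's start time.
            current = {"start": start, "end": end, "text": text}
        else:
            # Extend the open segment to cover this chunk.
            current["end"] = end
            current["text"] += " " + text
        if current["end"] - current["start"] >= target_seconds:
            # Long enough: close the segment, which marks a shot boundary.
            segments.append(current)
            current = None
    if current is not None:
        # Whatever is left becomes a final, shorter segment.
        segments.append(current)
    return segments


if __name__ == "__main__":
    demo = [{"text": f" word{i}", "timestamp": (i * 10.0, (i + 1) * 10.0)} for i in range(7)]
    for seg in aggregate_chunks(demo):
        print(seg["start"], seg["end"], seg["text"])
```

Because a segment only closes at a chunk boundary, cuts never land mid-chunk. The cost is that segments run slightly over the target, and the last one can be short.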

What we are not going to pretend

direktor does not produce camera motion. The stills do not move. There are no pans, no zooms, no parallax. If you watch the output and feel like the visuals are static, that is correct. They are.

direktor does not handle character continuity. Each FLUX call is independent. If you ask for a scene with a chef in a yellow apron and the next scene also features the same chef, you will probably get a different chef. Continuity is an open problem in our pipeline and in every pipeline we have seen.

direktor does not lipsync. There are no faces synced to speech. If you want a talking head, you need a different tool stacked on top.

The honest version of the pitch

“direktor turns text into a podcast-style video — narrated audio, AI-generated stills, keyword overlays, FFmpeg composition.” Every word in that sentence is load-bearing. We do not say “AI video” because it implies motion we do not produce. We do not say “text to video” without qualification because the qualification is the whole point.

We think more tools in this space should say it this way. Not because honesty wins on its own — it does not, particularly in marketing — but because honest naming makes for better tools. If you sell stills under narration as “AI video,” you eventually have to ship motion to justify the noun. If you sell stills under narration as stills under narration, you can spend your engineering effort on the things that actually matter for that format: better script generation, better narration quality, better segmentation, better prompt selection.

Text-to-storyboard is the field today. It will quietly become text-to-video as the underlying models get better, and the pipelines that wrote down their stages will absorb those improvements without needing rewrites. That is the bet underneath direktor.

A small practical note for builders

If you are building in this space, write down what your pipeline actually produces before you write down what you are going to call it on the landing page. The two should match. We have seen too many demos where the marketing copy describes one artefact (“cinematic AI video”) and the model output is a different artefact (“a slideshow with narration”). The gap erodes user trust on the first run.

There is nothing wrong with shipping a slideshow with narration. It is a useful, tractable artefact that scales to long-form content in a way that actual motion video does not yet, at any price. The only mistake is selling it as something it is not. The corollary, for a user evaluating tools: ask the demo to render a five-minute output, not a five-second one, and watch what the visuals do in the middle.

