About

A tool, not a magic show.

direktor is a small, opinionated Python library. It takes a text file and produces a 1920×1080 MP4 with synthesised narration, AI-generated stills and optional keyword overlays. That is the entire premise.

Why we built it

Most "text to video" tools either (a) hand the whole job to a single video model and pray, or (b) hide the pipeline behind a polished UI that you cannot edit. Neither works well when you are making a 6-minute explainer from a research note. You want to see the script before it gets narrated. You want to rewrite an image prompt that came out weird. You want to swap the TTS voice without losing the rest of the run.

direktor exposes the pipeline as six discrete stages, each of which writes its output to a file on disk. Crash on stage 5 and you have not lost the script, the audio or the transcript. Decide the FLUX prompts are too generic and you can hand-edit image_prompts.json before re-running stage 5.
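The resume-from-disk behaviour can be sketched as a checkpointed runner. This is a minimal illustration under stated assumptions, not direktor's code: the stage table, `run_pipeline`, `demo_stages` and every filename except `image_prompts.json` are hypothetical, and the stub builders write placeholder text where real stages would produce audio, images and video.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Optional, Tuple

# (stage number, artifact filename, builder). A builder receives the work
# directory and returns the artifact's contents as text (a simplification:
# real stages would write audio, images and video).
Stage = Tuple[int, str, Callable[[Path], str]]


def run_pipeline(workdir: Path, stages: List[Stage],
                 rerun_from: Optional[int] = None) -> List[int]:
    """Run stages in order, reusing artifacts already on disk.

    A stage is rebuilt when its artifact is missing, when its number is
    >= rerun_from, or when any earlier stage was rebuilt in this run
    (so downstream files never go stale). Returns the stages that ran.
    """
    workdir.mkdir(parents=True, exist_ok=True)
    ran: List[int] = []
    for number, filename, build in stages:
        artifact = workdir / filename
        forced = rerun_from is not None and number >= rerun_from
        if artifact.exists() and not forced and not ran:
            continue  # checkpoint hit: keep the file from an earlier run
        artifact.write_text(build(workdir))
        ran.append(number)
    return ran


def demo_stages() -> List[Stage]:
    # Hypothetical filenames; only image_prompts.json appears in the docs.
    names = ["script.txt", "narration.txt", "transcript.json",
             "image_prompts.json", "images.txt", "video.txt"]
    return [(i, name, lambda d, n=name: f"stub output for {n}")
            for i, name in enumerate(names, start=1)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        print(run_pipeline(workdir, demo_stages()))                # [1, 2, 3, 4, 5, 6]
        print(run_pipeline(workdir, demo_stages()))                # []
        # After hand-editing image_prompts.json, rebuild only stages 5 and 6.
        print(run_pipeline(workdir, demo_stages(), rerun_from=5))  # [5, 6]
```

The design choice worth noting is the downstream invalidation: once any stage is rebuilt, everything after it is rebuilt too, so a fresh transcript can never sit next to stale image prompts.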

Who it is for

Honest scope

direktor is not Sora. It is not Veo. It does not generate motion video. Stages 5 and 6 generate still images with FLUX-schnell and stitch them under the audio track with FFmpeg. Cuts between images fall at roughly 30-second segment boundaries derived from the transcript.
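One way to derive those cut points is to walk the transcript segments and close a shot at the first segment boundary at or past 30 seconds, then hand the resulting durations to FFmpeg's concat demuxer. This is a sketch assuming the transcript is a list of timed segments; `Segment`, `cut_points`, `concat_list` and the shot filenames are hypothetical, not direktor's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start: float  # seconds from the start of the narration
    end: float
    text: str


def cut_points(segments: List[Segment], target: float = 30.0) -> List[float]:
    """Return the end time of each shot; shot i spans [cut i-1 (or 0), cut i].

    A shot closes at the first segment end that is at least `target` seconds
    after the shot began, so cuts never fall mid-segment. A short tail becomes
    its own shot (a real implementation might merge it into the previous one).
    """
    if not segments:
        return []
    cuts: List[float] = []
    shot_start = 0.0
    for seg in segments:
        if seg.end - shot_start >= target:
            cuts.append(seg.end)
            shot_start = seg.end
    last = segments[-1].end
    if not cuts or cuts[-1] < last:
        cuts.append(last)
    return cuts


def concat_list(images: List[str], cuts: List[float]) -> str:
    """Build an FFmpeg concat-demuxer script that shows image i until cut i."""
    if not images or len(images) != len(cuts):
        raise ValueError("need exactly one image per shot")
    lines = []
    previous = 0.0
    for image, cut in zip(images, cuts):
        lines.append(f"file '{image}'")
        lines.append(f"duration {cut - previous:.3f}")
        previous = cut
    # FFmpeg's slideshow notes say the last file must be listed again,
    # without a duration, for its duration directive to take effect.
    lines.append(f"file '{images[-1]}'")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    transcript = [Segment(0.0, 12.0, "a"), Segment(12.0, 25.0, "b"),
                  Segment(25.0, 33.0, "c"), Segment(33.0, 50.0, "d"),
                  Segment(50.0, 64.0, "e"), Segment(64.0, 70.0, "f")]
    cuts = cut_points(transcript)
    print(cuts)  # [33.0, 64.0, 70.0]
    print(concat_list(["shot_001.png", "shot_002.png", "shot_003.png"], cuts))
```

The resulting list could then be rendered under the narration with an invocation along the lines of `ffmpeg -f concat -i shots.txt -i narration.wav -vf scale=1920:1080 -c:v libx264 -pix_fmt yuv420p -shortest out.mp4`; that command is illustrative, not necessarily what direktor runs.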

If you need camera moves, lipsync or per-frame video, direktor is the wrong tool. If you want a controllable narration-plus-illustration pipeline you can ship from a CLI, you are in the right place.

What is built today

What is on the table

The architecture treats every model choice as an environment variable. Swapping BARK for a different TTS engine, FLUX for a different image model, or GPT-4 for a local LLM is a config change, not a code change. Future work tracked in the repo includes pluggable narrators, multi-voice scripts and tighter alignment between transcript segments and image cuts.
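A minimal sketch of the "config change, not code change" idea: each model choice is read from the environment with a fallback, and a registry maps the chosen name to an adapter. The variable names (`DIREKTOR_LLM_MODEL` and friends), the default identifiers, `load_config`, `TTS_ADAPTERS` and the `other-tts` entry are all hypothetical, chosen for illustration; check the project documentation for the real settings.

```python
import os
from dataclasses import dataclass
from typing import Callable, Dict, Mapping, Optional


@dataclass(frozen=True)
class ModelConfig:
    llm: str
    tts: str
    image: str


# Hypothetical variable names and default identifiers, for illustration only.
DEFAULTS = {
    "DIREKTOR_LLM_MODEL": "gpt-4",
    "DIREKTOR_TTS_MODEL": "bark",
    "DIREKTOR_IMAGE_MODEL": "flux-schnell",
}


def load_config(env: Optional[Mapping[str, str]] = None) -> ModelConfig:
    """Resolve each model choice from the environment, falling back to defaults."""
    source = os.environ if env is None else env

    def pick(key: str) -> str:
        return source.get(key) or DEFAULTS[key]  # empty string counts as unset

    return ModelConfig(
        llm=pick("DIREKTOR_LLM_MODEL"),
        tts=pick("DIREKTOR_TTS_MODEL"),
        image=pick("DIREKTOR_IMAGE_MODEL"),
    )


# Hypothetical adapter registry: supporting a new backend means adding an
# entry; choosing between registered backends is only an environment variable.
TTS_ADAPTERS: Dict[str, Callable[[str], str]] = {
    "bark": lambda text: f"[bark] {text}",
    "other-tts": lambda text: f"[other-tts] {text}",
}


def narrate(text: str, config: ModelConfig) -> str:
    try:
        adapter = TTS_ADAPTERS[config.tts]
    except KeyError:
        raise ValueError(f"unknown TTS model {config.tts!r}") from None
    return adapter(text)


if __name__ == "__main__":
    print(load_config({"DIREKTOR_TTS_MODEL": "other-tts"}))
```

Passing the mapping explicitly, rather than always reading `os.environ`, keeps the resolution logic easy to test.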

Maker notes

direktor lives at github.com/Skelf-Research/direktor, MIT licensed. Documentation: docs.skelfresearch.com/direktor. Issues, ideas and pull requests are welcome, especially around alternative model adapters and better continuity between generated shots.