Expert overview

This manual assumes a reader who has made short-form video or worked with a video pipeline, has handled ffmpeg or yt-dlp in a terminal, and has worked in Claude Code or Codex before. You don’t need Codex image generation or Whisper internals memorized. How the two tracks fork and where the gates sit are unpacked in the sections that follow.

This section is not a command reference; it covers “why it’s shaped this way.” Read through the reasoning once, then move on to architecture, and the tracks and verification sections will make more sense. scene-studio has the most external-tool dependencies of any base, so its prerequisites are heavier than the others’, which is worth flagging up front.

Hand short-form video to AI and one cut comes out fine. The problem is what comes after. When AI spits out a finished product in one shot, the cut lengths are uneven, the hook is weak, the remix ignores copyright labels, and image generation is slow enough to take an hour per video. scene-studio is a starting point with a gate or a policy planted at each of those spots.

You run two tracks on one channel. Only the entry points fork; both tracks reuse the same shared assets (caption-styling, publish-meta, pipeline-performance, global-virality, approval-gate).

  • Generative (scene-flow-orchestrator): from a free-text requirement to planning, scenes, images (Codex), and video (ffmpeg). Four gates (A plan, B-1 scenes, B-2 images, C video) and publish.
  • Remix (remix-flow-orchestrator): from a source video URL to collection (yt-dlp), analysis (Whisper and hook), and auto-editing. Two gates (A analysis, B editing) and publish.

The trigger forks by input. A request with a URL attached goes to remix; a request that says “original” or “character” goes to generative. If the request is ambiguous (just “make a short-form video”), the orchestrator doesn’t guess; it stops and asks. The two tracks are detailed in the tracks section.
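The routing rule above can be sketched as a small function. This is a minimal illustration, not the orchestrator's actual logic: the `route` function, the `Track` enum, the URL pattern, and the plain keyword match are all assumptions made for the example.

```python
import re
from enum import Enum


class Track(Enum):
    GENERATIVE = "scene-flow-orchestrator"
    REMIX = "remix-flow-orchestrator"
    ASK = "ask-user"  # hypothetical marker for "stop and ask"


# Assumption: a source video shows up as an http(s) URL in the request text.
URL_RE = re.compile(r"https?://\S+")
# Keywords taken from the text; matching them by substring is an assumption.
GENERATIVE_KEYWORDS = ("original", "character")


def route(request: str) -> Track:
    """Pick a track from the request text, or refuse to guess."""
    if URL_RE.search(request):
        return Track.REMIX
    lowered = request.lower()
    if any(keyword in lowered for keyword in GENERATIVE_KEYWORDS):
        return Track.GENERATIVE
    return Track.ASK
```

A request such as `route("make a short-form video")` falls through to `Track.ASK`, which mirrors the stop-and-ask behavior rather than defaulting to either track.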

It isn’t a flow where AI makes a finished product in one shot. Each phase stops at an approval gate, and a human approves, edits, or rolls back. Because outputs are kept as files in _workspace/ rather than in the chat, you can recover the exact spot from the status in each *.meta.json even after a session break.
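As a rough sketch of that recovery step, the snippet below walks a workspace folder and reads the status out of every *.meta.json file. The `status` key comes from the text; the function name, the folder layout, and the `"unknown"` fallback are assumptions for illustration.

```python
import json
from pathlib import Path
from typing import Dict


def find_resume_points(workspace: Path) -> Dict[str, str]:
    """Map each *.meta.json path (relative to the workspace) to its status."""
    points: Dict[str, str] = {}
    for meta in sorted(workspace.rglob("*.meta.json")):
        data = json.loads(meta.read_text(encoding="utf-8"))
        # Assumption: a file without a status field is reported as "unknown".
        points[meta.relative_to(workspace).as_posix()] = data.get("status", "unknown")
    return points
```

Calling `find_resume_points(Path("_workspace"))` after a session break gives one line per output, which is enough to see which phase was last approved.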

Fast-preview mode cuts human waiting time without breaking the gate philosophy. In preview mode, the planning-stage gates (A, B-1, B-2) auto-pass and the pipeline runs straight through to a video draft; a human then reviews the actual video, not the intermediate JSON, once at Gate C. The gate outputs still remain in _workspace/, so you can roll back to any stage, and the no-auto-publish rule stays in force.
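The gate behavior in the two modes can be written down as a small lookup. This is a sketch under assumptions: the mode names `"preview"` and `"final"` follow the preview/final split named in the performance policies, and treating final mode as stopping at every gate is an inference from the default gate behavior, not a documented rule.

```python
from typing import List

GATES = ["A", "B-1", "B-2", "C"]          # generative-track gates, in order
AUTO_PASS_IN_PREVIEW = {"A", "B-1", "B-2"}  # planning-stage gates


def needs_human_review(gate: str, mode: str) -> bool:
    """Return True if the pipeline should stop at this gate for a human."""
    if gate not in GATES:
        raise ValueError(f"unknown gate: {gate}")
    if mode not in ("preview", "final"):
        raise ValueError(f"unknown mode: {mode}")
    if mode == "preview":
        return gate not in AUTO_PASS_IN_PREVIEW
    return True  # assumption: final mode stops at every gate


def review_stops(mode: str) -> List[str]:
    """List the gates where a human is asked to review, in order."""
    return [gate for gate in GATES if needs_human_review(gate, mode)]
```

Publishing is deliberately absent from this table: in either mode it stays a manual step.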

The remix track enforces license-policy L2 at every stage: license-label verification, Fair Use of 30% or less, at least three required transformations, automatic source attribution, and automatic copyright-claim blocking. A source with no license label is refused. The only bypass is for the user to accept responsibility explicitly (license_responsibility.json). The generative track doesn’t take a source, so it has no such gate.
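A minimal sketch of how such checks might be expressed follows. Several points are assumptions rather than the policy's definition: the 30% limit is read here as the share of the source's running time that is reused, the responsibility file is treated as waiving only the missing-label refusal, and `RemixPlan` with its field names is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MAX_SOURCE_RATIO = 0.30   # "Fair Use of 30% or less" (interpretation assumed)
MIN_TRANSFORMATIONS = 3   # "at least three" transformations


@dataclass
class RemixPlan:  # hypothetical container, not the project's schema
    license_label: Optional[str]
    source_seconds_used: float
    source_seconds_total: float
    transformations: List[str] = field(default_factory=list)
    attribution: str = ""
    # Assumption: True when license_responsibility.json has been provided.
    responsibility_accepted: bool = False


def license_violations(plan: RemixPlan) -> List[str]:
    """Return the list of policy problems; an empty list means the plan passes."""
    if plan.source_seconds_total <= 0:
        raise ValueError("source_seconds_total must be positive")
    problems: List[str] = []
    if not plan.license_label and not plan.responsibility_accepted:
        problems.append("no license label")
    if plan.source_seconds_used / plan.source_seconds_total > MAX_SOURCE_RATIO:
        problems.append("uses more than 30% of the source")
    if len(plan.transformations) < MIN_TRANSFORMATIONS:
        problems.append("fewer than three transformations")
    if not plan.attribution.strip():
        problems.append("missing source attribution")
    return problems
```

Returning every violation at once, instead of failing on the first, lets a gate show the full list to the person approving the edit.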

The generative track doesn’t write free scenarios. It picks one of five validated structures (problem_twist_solution, listicle, story_arc, curiosity_gap, transformation) and fills the slots. Free scenarios are forbidden because they drift toward AI-smell. The scene type (hook, beat, cta) becomes a downstream branching key. Remix is different. The source timeline is the answer, so the remix track doesn’t write a new scenario; it finds a hook in the source and reworks it. The five structures don’t apply to remix.
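The constraint on generative plans can be sketched as a validator. The structure and scene-type names come from the text; the plan shape (a structure name plus a list of scene dicts with a `type` key) and the function name are assumptions for illustration.

```python
from typing import Dict, List

STRUCTURES = (
    "problem_twist_solution",
    "listicle",
    "story_arc",
    "curiosity_gap",
    "transformation",
)
SCENE_TYPES = ("hook", "beat", "cta")


def validate_plan(structure: str, scenes: List[Dict[str, str]]) -> List[str]:
    """Return validation errors for a generative plan; empty means valid."""
    errors: List[str] = []
    if structure not in STRUCTURES:
        errors.append(f"unknown structure: {structure}")
    for index, scene in enumerate(scenes):
        scene_type = scene.get("type")  # hypothetical field name
        if scene_type not in SCENE_TYPES:
            errors.append(f"scene {index}: unknown type {scene_type!r}")
    return errors
```

Because the scene type is a branching key downstream, rejecting an unknown type at planning time is cheaper than discovering it during image or video generation.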

pipeline-performance enforces seven policies: a preview/final two-mode split, parallel Codex calls, a hash cache, single-pass ffmpeg, per-step timings, a 4–6 cut budget, and fast image backends for preview. The target is 20 minutes per episode, and 12 minutes from the second episode of a series onward. Codex CLI calls are gathered in one place, image-director (S7); calling Codex from several agents breaks style consistency and leaks cost.
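Two of these policies are easy to illustrate in a few lines: the hash cache and the cut budget. The sketch below is an assumption about how they could work, not the policy's implementation; in particular, which inputs feed the cache key (prompt, style, mode) and the function names are hypothetical.

```python
import hashlib
import json

CUT_BUDGET = range(4, 7)  # 4 to 6 cuts, inclusive


def image_cache_key(prompt: str, style: str, mode: str) -> str:
    """Derive a stable key so identical image requests can be served from cache."""
    # Assumption: prompt, style, and mode together determine the image.
    payload = json.dumps(
        {"prompt": prompt, "style": style, "mode": mode},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def within_cut_budget(cut_count: int) -> bool:
    """Check a plan's cut count against the 4-6 cut budget."""
    return cut_count in CUT_BUDGET
```

Including the mode in the key keeps a fast preview image from being reused as a final render, which seems to be the point of separating the two modes.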

publish-copywriter makes only the publish.md metadata. The platform upload is done by the user (S8). Auto-publishing is absolutely forbidden on either track.

The automation entry point points at one place

.claude/agents/ and .claude/skills/ are the Claude Code storage locations, and AGENTS.md is the Codex entry point. Both read AI_AUTOMATION.md. Track routing, the scene-type spec, license-policy levels, and the security baseline S1–S8 live in that one file, so the phase and output contracts are the same whichever runtime you work in.

To be honest, this base has unfinished parts. The manual and landing page are public, but the template itself is under a commercial-license model, and the payment flow isn’t wired up yet. COMMERCIAL-LICENSE.md is still a draft. Right after a fork, _workspace/ is just a skeleton, so what this manual teaches is the procedure by which that empty skeleton fills in through the two-track phase flow.

OS-by-OS install branches, general usage of ffmpeg, Whisper, and Codex, platform upload screens, copyright law in general — none of that is here. If you need it, see the beginner manual.

The next section is architecture. It covers how the 8 agents mesh across the two-track phases and how several videos are isolated by folder, with two diagrams.