The DSL Approach to Video Scripting: Why We Built It

Every video tool eventually shows you a timeline. Clips on tracks, handles you drag to trim, keyframes for audio fades. It looks intuitive until you are three hours in and trying to make 15 videos that differ only in the narration text.

That is the problem we were actually trying to solve when we built InkSlop: not one video, but many videos, quickly, with consistency. The answer was a domain-specific language rather than a UI.

What the DSL Looks Like

The InkSlop DSL is a tag-based markup language for describing video structure. A basic script looks like this:

[tts voice="af_heart"]
Did you know the mantis shrimp can punch with the force of a bullet?
[/tts]

[music file="ambient/deep-focus" volume="0.3"]
[video file="stock/ocean-floor" duration="6s"]

Each block declares what should happen: a line of speech rendered with a specific voice, a music bed at a given volume, a video clip that runs for a fixed duration. The renderer assembles them in order, handles the audio sync, and produces the final file.

Tags can be self-closing for single operations or block-level when they wrap content:

[pause duration="1.5s"/]

[style font-size="32" color="#ffffff"]
Text that appears as a subtitle overlay
[/style]

The Full Tag Set

The DSL covers the production elements you would normally handle in a timeline editor:

[tts] for text-to-speech segments (voice, speed, language)
[video] for background video clips from the library or user uploads
[music] for audio beds with volume and fade control
[image] for static image overlays
[ai-image] to generate an image from a text prompt inline
[ai-video] for AI-generated video clips (Veo)
[sound] for one-shot sound effects
[pause] for explicit silence
[zoom], [pan], [tilt], [shake] for camera motion effects
[style] for subtitle and text overlay styling
[social-post] for Reddit-style formatted posts

Motion effects accept timing and easing parameters. Time anchors let you reference another element's start or end point, with an optional offset: [zoom start="clip1:start" end="clip1:end+2s"] starts a zoom when clip1 begins and ends it two seconds after clip1 finishes.
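To make the anchor idea concrete, here is a minimal sketch of how an expression like clip1:end+2s could be resolved to an absolute time. It is not InkSlop's resolver: the anchor grammar is inferred from the single example above, and resolve_anchor and the spans dictionary are hypothetical names used for illustration.

```python
import re

# Assumed anchor syntax, inferred from the example in the text:
#   "<element-id>:<start|end>" optionally followed by "+Ns" or "-Ns".
ANCHOR_RE = re.compile(
    r"^(?P<ref>[\w-]+):(?P<edge>start|end)(?P<offset>[+-]\d+(?:\.\d+)?s)?$"
)


def resolve_anchor(expr: str, spans: dict[str, tuple[float, float]]) -> float:
    """Turn an anchor like 'clip1:end+2s' into an absolute time in seconds.

    `spans` maps an element id to its (start, end) time on the timeline.
    """
    match = ANCHOR_RE.match(expr.strip())
    if match is None:
        raise ValueError(f"not a valid time anchor: {expr!r}")
    ref = match["ref"]
    if ref not in spans:
        raise KeyError(f"unknown element id: {ref!r}")
    start, end = spans[ref]
    base = start if match["edge"] == "start" else end
    offset = match["offset"]
    # Drop the trailing "s" unit; float() handles the leading sign.
    return base + (float(offset[:-1]) if offset else 0.0)


# Suppose clip1 occupies seconds 4.0 to 10.0 on the final timeline.
spans = {"clip1": (4.0, 10.0)}
zoom_start = resolve_anchor("clip1:start", spans)
zoom_end = resolve_anchor("clip1:end+2s", spans)
```

With those spans, the zoom would run from the moment clip1 starts until two seconds after it ends.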

Why Plain Text Beats a Timeline for Repetitive Work

A timeline is great for one video you are crafting by hand. It is painful for anything repeatable.

Plain text scales in ways a GUI cannot. You can template it. You can generate it with code. You can diff two versions. You can store it in git. You can write a script that produces 30 variations with different opening lines and submit all of them to the renderer in a batch.

The DSL was designed from the start to be machine-writable, not just human-writable.
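As a rough illustration of what machine-writable means here, the sketch below uses Python's standard string.Template to stamp out 30 scripts that differ only in the narration line. The template reuses the tags from the basic script shown earlier; the opening lines are placeholders, and render_variations is a hypothetical helper, not part of InkSlop.

```python
from string import Template

# The script skeleton; $hook is the only part that changes between videos.
SCRIPT_TEMPLATE = Template(
    '[tts voice="af_heart"]\n'
    "$hook\n"
    "[/tts]\n"
    "\n"
    '[music file="ambient/deep-focus" volume="0.3"]\n'
    '[video file="stock/ocean-floor" duration="6s"]\n'
)


def render_variations(hooks: list[str]) -> list[str]:
    """Return one complete DSL script per opening line."""
    return [SCRIPT_TEMPLATE.substitute(hook=hook.strip()) for hook in hooks]


# Placeholder narration; in practice these would come from a file or an LLM.
hooks = [f"Opening line {n} goes here." for n in range(1, 31)]
scripts = render_variations(hooks)
```

Each string in scripts is a standalone script that can be written to its own file, committed, diffed, or submitted for rendering.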

LLMs Write This Format Well

The deeper reason to build a structured language is that large language models are genuinely good at writing one.

When you describe a video in a free-form chat, the model produces prose. When you give it a grammar to follow, it produces structured output you can actually render. InkSlop's AI generation feature prompts the model with the DSL spec and a description of what you want, and the model writes the script directly. You can tweak it, regenerate a section, or publish as-is.

This also means you can pipe topics into the AI outside InkSlop, get structured DSL back, and submit it to the API for rendering. The format was designed to survive that round-trip cleanly.
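If you run that round-trip yourself, one cheap safeguard is to check the model's output for tags outside the documented set before submitting it. The sketch below assumes the tag names are exactly those listed under The Full Tag Set; the regular expression is a deliberate simplification rather than the real grammar, and unknown_tags is a hypothetical helper.

```python
import re

# Tag names taken from the tag list earlier in this post.
KNOWN_TAGS = {
    "tts", "video", "music", "image", "ai-image", "ai-video", "sound",
    "pause", "zoom", "pan", "tilt", "shake", "style", "social-post",
}

# Matches the name after "[" or "[/". Simplified: narration text that
# happens to contain "[word" would also be picked up.
TAG_RE = re.compile(r"\[(/?)([a-z][a-z-]*)")


def unknown_tags(script: str) -> set[str]:
    """Return tag names used in the script that are not in KNOWN_TAGS."""
    names = set()
    for line in script.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip comment lines
        for _slash, name in TAG_RE.findall(line):
            names.add(name)
    return names - KNOWN_TAGS


llm_output = (
    "# Intro segment\n"
    '[tts voice="am_michael"]\n'
    "Three things nobody tells you about starting a business.\n"
    "[/tts]\n"
    '[transition type="fade"/]\n'
)
problems = unknown_tags(llm_output)
```

Here the model invented a transition tag, so problems is non-empty and the script can be regenerated or fixed before it ever reaches the renderer.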

How the Parser Works

The grammar is defined in Lark, a Python parsing library. The parser produces an AST (abstract syntax tree) of typed node objects: TTSTag, VideoTag, MusicTag, ZoomTag, and so on. Each node carries its parameters as typed fields rather than raw strings, which means rendering code deals with validated, structured data rather than parsing strings a second time.
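To show what typed fields buy you, here is a small sketch of two node classes built from raw attribute strings. The class names VideoTag and MusicTag come from the parser described above, but the field names, the from_attrs constructor, and the 0 to 1 volume range are assumptions made for illustration; the real nodes are produced from the Lark grammar, not by hand like this.

```python
from dataclasses import dataclass


def parse_seconds(raw: str) -> float:
    """Convert a duration attribute such as '6s' or '1.5s' to seconds."""
    if not raw.endswith("s"):
        raise ValueError(f"duration must end in 's': {raw!r}")
    return float(raw[:-1])


@dataclass(frozen=True)
class VideoTag:
    file: str
    duration: float  # seconds, already parsed

    @classmethod
    def from_attrs(cls, attrs: dict[str, str]) -> "VideoTag":
        return cls(file=attrs["file"], duration=parse_seconds(attrs["duration"]))


@dataclass(frozen=True)
class MusicTag:
    file: str
    volume: float

    @classmethod
    def from_attrs(cls, attrs: dict[str, str]) -> "MusicTag":
        volume = float(attrs["volume"])
        # Assumed valid range; the real limits may differ.
        if not 0.0 <= volume <= 1.0:
            raise ValueError(f"volume out of range: {volume}")
        return cls(file=attrs["file"], volume=volume)


video = VideoTag.from_attrs({"file": "stock/ocean-floor", "duration": "6s"})
music = MusicTag.from_attrs({"file": "ambient/deep-focus", "volume": "0.3"})
```

Validation happens once, when the node is built. Everything downstream can read video.duration as a number without re-checking the string "6s".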

Comments are supported with # syntax, so you can annotate a script:

# Intro segment
[tts voice="am_michael"]
Three things nobody tells you about starting a business.
[/tts]

# Cut to stock footage while the list plays
[video file="stock/office-timelapse" duration="8s"]

A transformer then walks the AST and generates a timeline with calculated segment durations, audio offsets, and effect parameters. A Celery worker renders that timeline with FFmpeg.
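The timeline step can be pictured as a running cursor over the script's elements. The sketch below assumes every element plays strictly one after another and already knows its duration; the real transformer also has to handle parallel tracks such as music beds, and Clip, Segment, and build_timeline are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    name: str
    duration: float  # seconds


@dataclass(frozen=True)
class Segment:
    name: str
    start: float  # offset from the beginning of the video, in seconds
    end: float


def build_timeline(clips: list[Clip]) -> list[Segment]:
    """Lay clips end to end and record each one's start and end offset."""
    segments = []
    cursor = 0.0
    for clip in clips:
        segments.append(Segment(clip.name, cursor, cursor + clip.duration))
        cursor += clip.duration
    return segments


timeline = build_timeline([
    Clip("intro-tts", 3.5),
    Clip("pause", 1.5),
    Clip("office-timelapse", 8.0),
])
```

In this example the pause occupies seconds 3.5 to 5.0 and the whole video runs 13 seconds. Offsets like these are what a renderer needs to place each audio and video segment.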

What This Enables in Practice

A few things become straightforward once you have a text-based scripting layer:

Batch production. Write 10 scripts in a text editor or with an LLM, submit them all, come back when they are rendered. No clicking through a UI ten times.

Version control. Scripts are files. You can track them in git, review diffs, and roll back a bad change without losing the original.

Programmatic variation. Any templating language can produce DSL output. If you want 30 variations of a video with different opening statistics, a small script can generate all 30 DSL files without touching the editor.

Reproducibility. A DSL file is a complete description of a video. Two years from now you can render it again and get the same output, assuming the referenced assets still exist.

The DSL documentation has the full syntax reference if you want to dig into the specifics. And if you would rather describe what you want in plain English and let the AI write the script, the AI generator handles that too.