Making a Historical Short Film with AI: The Complete Huo Qubing Production Log

Ima Claw Team

2026-03-10

More than 2,000 years ago, a 19-year-old general named Huo Qubing led 50,000 cavalry north across the Gobi Desert and performed the famous ritual at Wolf Juxu Mountain, the furthest any Han dynasty army had ever reached.

Today, we recreated that story with AI — 12 shots, 2 minutes 24 seconds, entirely AI-generated from script to final cut.

The Final Result

Watch the finished film first:

A complete vertical short film featuring:

12 AI-generated video clips (Kling O1 model)
AI voiceover narration (deep male voice, documentary style)
AI-composed original score (war drums + traditional Chinese instruments)

Designed for platforms like Douyin (the Chinese version of TikTok), Xiaohongshu, and Instagram Reels.

Step 1: Script Design

Good short films start with good scripts. We didn't jump straight into generation — we spent time crafting a 12-shot narrative structure first.

Three-Act Structure

Act	Shots	Narrative Arc
Act I: The March	A–D	Imperial command → Departure → Army rides north
Act II: The Battle	E–H	Archery → Close combat → Pursuit → Looking back
Act III: The Legend	I–L	Mountain summit ritual → Army cheers → Lone rider at sunset → Epilogue

The 12-Shot Breakdown

Shot	Scene	Description	Mood
A	Imperial Court	Emperor grants command, young general accepts	Solemn
B	Court Close-up	Huo Qubing salutes, eyes fixed northward	Determined
C	City Gate	Mounting horse outside Chang'an, cavalry in formation	Bold
D	Aerial Desert	50,000 cavalry charging across the Gobi	Vast
E	Mountain Battle	Drawing bow, arrow flies like thunder	Tense
F	Close Combat	Blood-stained armor, a smile on his face	Fierce
G	Sunset Pursuit	Solo rider chasing at sunset, red sky	Intense
H	Looking Back	Reining horse, surveying 2,000 li of conquest	Reflective
I	Mountain Summit	Sword raised to sky, army kneeling below	Climax
J	Army Cheers	Ten thousand voices shake the steppe	Triumphant
K	Lone Rider Sunset	Silhouette against the setting sun	Melancholic
L	Epilogue	Young general looks up, freeze frame	Bittersweet

Key Lesson: Emotional arc matters more than visual spectacle. Each shot serves the overall rhythm — from solemn to vast to tense to climactic to melancholic.

Step 2: Character Design

Character consistency is critical for historical films. We established Huo Qubing's look first, then maintained it across all 12 shots.

Character Specs

Reference: Zhang Ruoyun's facial features (clean-cut, heroic)
Age: 19 years old
Costume: Black-gold Han dynasty battle armor, red cape
Style: Photorealistic cinematic, not animation

We used Gemini 3 Pro Image (Nano Banana Pro) to generate character design sheets, then used them as the reference for all subsequent keyframes.

Step 3: Keyframe Generation

For each of the 12 shots, we first generated a static keyframe image. Only after the composition was approved did we convert to video.

Prompt Example

For the climactic summit scene (Shot I):

Low-angle shot of a young Chinese general (19 years old, resembling Zhang Ruoyun) 
standing triumphant on the summit of Wolf Juxu Mountain, drawing his sword pointed 
skyward. Black-gold Han dynasty battle armor with flowing red cape. 
Thousands of soldiers kneel in formation on the mountainside. 
Golden hour sunlight, epic cinematic composition.

Negative: No god rays, no supernatural glow, no light beams, no lens flare.

Key Lessons:

Negative prompts are essential: Shot I initially had unnatural "god rays" that were only fixed by explicitly excluding them
Lock aspect ratio early: All keyframes were generated in 9:16 vertical format from the start
Character description consistency: Every prompt included identical character descriptors
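
One way to keep the character descriptors identical across shots is to assemble every prompt from shared fragments. The sketch below is a minimal illustration, not the script we actually ran: `build_prompt` is a hypothetical helper, the fragments are copied from the Shot I prompt, and reusing the same negative line for every shot is an assumption. The 9:16 aspect ratio is set in the image tool itself and is not shown here.

```shell
#!/bin/sh
# Shared fragments, copied from the Shot I prompt above.
CHARACTER="young Chinese general (19 years old, resembling Zhang Ruoyun)"
COSTUME="Black-gold Han dynasty battle armor with flowing red cape."
# Assumption: the same exclusions are reused for every shot.
NEGATIVE="Negative: No god rays, no supernatural glow, no light beams, no lens flare."

# build_prompt CAMERA ACTION SCENE
# Hypothetical helper: wraps shot-specific text around the fixed fragments,
# so the character and costume wording never drifts between shots.
build_prompt() {
  printf '%s of a %s %s %s %s\n\n%s\n' \
    "$1" "$CHARACTER" "$2" "$COSTUME" "$3" "$NEGATIVE"
}

# Rebuild the Shot I prompt from its parts.
build_prompt "Low-angle shot" \
  "standing triumphant on the summit of Wolf Juxu Mountain, drawing his sword pointed skyward." \
  "Thousands of soldiers kneel in formation on the mountainside. Golden hour sunlight, epic cinematic composition."
```

Only the camera, action, and scene text change per shot; everything describing the character stays byte-for-byte the same.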

Step 4: Image-to-Video

After keyframe approval, each static image was converted to a 5-second video clip.

Model Selection

Model	Credits/clip	Quality	Use Case
Kling O1	48	⭐⭐⭐⭐⭐	Final version (quality first)
Wan 2.6	40	⭐⭐⭐⭐	Initial test round

We tested with Wan 2.6 first, then switched everything to Kling O1 for the final version because its facial detail and motion were noticeably more natural.

Cost Breakdown

Item	Quantity	Unit Cost	Total
Kling O1 clips (final)	12	48 credits	576
Wan 2.6 clips (test)	12	40 credits	480
Remakes/fixes	~8	48 credits	384
Video subtotal			~1,440 credits
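
The subtotal can be checked with plain shell arithmetic. This is a small sketch using only the figures in the table; the variable names are ours, and the remake count is the approximate "~8" from the table.

```shell
#!/bin/sh
# Figures from the cost table above.
KLING_CREDITS=48   # Kling O1, credits per clip
WAN_CREDITS=40     # Wan 2.6, credits per clip
SHOTS=12
REMAKES=8          # the table says "~8", so this is approximate

final=$((SHOTS * KLING_CREDITS))     # final Kling O1 pass
trial=$((SHOTS * WAN_CREDITS))       # Wan 2.6 test round
fixes=$((REMAKES * KLING_CREDITS))   # remakes on Kling O1
total=$((final + trial + fixes))

echo "final=$final trial=$trial fixes=$fixes total=$total"
# Credits saved per clip by testing on the cheaper model
echo "wan_saving_per_clip=$((KLING_CREDITS - WAN_CREDITS))"
```

The per-clip saving of 8 credits is worth comparing against the 480 credits spent on the whole test round.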

Step 5: AI Voiceover

Narration is the soul of a historical short film. We generated 12 segments of narration using AI TTS.

Technical Setup

Model: Gemini TTS (gemini-2.5-flash-preview-tts), Orus voice
Style: Slow, deliberate, documentary gravitas
Output: Raw PCM (s16le, 24kHz, mono) → converted to MP3 via ffmpeg
Total duration: 144 seconds across 12 segments

Pitfall: We first tried seed-tts-1.1 for voice cloning, but its ref_audio_url parameter was silently ignored in our tests. Gemini TTS worked well as the alternative.
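
Because raw PCM has no header, ffmpeg must be told the sample format, rate, and channel count explicitly. The sketch below shows one way the conversion could look, plus a helper that derives a segment's duration from its file size. The function names and file names are hypothetical, and the MP3 encoder settings (`libmp3lame`, `-q:a 2`) are our assumption rather than the exact options used in production; `libmp3lame` also has to be enabled in your ffmpeg build.

```shell
#!/bin/sh
# pcm_seconds FILE: duration of headerless s16le / 24 kHz / mono audio.
# 2 bytes per sample * 24000 samples per second * 1 channel = 48000 bytes/s.
pcm_seconds() {
  bytes=$(wc -c < "$1")
  awk -v b="$bytes" 'BEGIN { printf "%.2f\n", b / (24000 * 2 * 1) }'
}

# pcm_to_mp3 IN.pcm OUT.mp3: the input options (-f, -ar, -ac) must come
# before -i so that ffmpeg applies them to the raw input file.
pcm_to_mp3() {
  ffmpeg -y -f s16le -ar 24000 -ac 1 -i "$1" -c:a libmp3lame -q:a 2 "$2"
}

# Usage (hypothetical file names):
#   pcm_seconds narration-A.pcm
#   pcm_to_mp3 narration-A.pcm narration-A.mp3
```

Knowing each segment's duration up front is useful later, since the video clips are stretched to match the narration.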

Step 6: AI Music Score

The original score was composed using DouBao BGM (ByteDance music generation).

Prompt Design

Ancient Chinese war epic soundtrack. 
NO orchestra, NO strings, NO violin. 
Use only: massive war drums (taiko), bronze bells, erhu, 
guzheng, dizi flute, powerful male choir. 
Fierce, aggressive, triumphant. 150 seconds.

We went through 3 iterations:

v1: Orchestral — too Western, didn't match the period
v2: Mixed — better but still had strings
v3: Pure traditional instruments + war drums — perfect ✅

Cost: 30 credits × 3 iterations = 90 credits

Step 7: Final Assembly

The Slow-Motion Solution

Each AI video clip is only 5 seconds, but the corresponding narration is 10–14 seconds. Our solution:

Slow each clip to match its narration duration using ffmpeg's setpts filter. For a historical film, this actually enhanced the epic quality — slow motion adds gravitas.
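
The slowdown factor for setpts is simply the target duration divided by the source duration. A minimal sketch, assuming every source clip is 5 seconds long; `setpts_factor` is a hypothetical helper name.

```shell
#!/bin/sh
# setpts_factor NARRATION_SECONDS [CLIP_SECONDS]
# Prints the multiplier for ffmpeg's setpts filter:
# target length / source length. CLIP_SECONDS defaults to 5.
setpts_factor() {
  awk -v n="$1" -v c="${2:-5}" 'BEGIN { printf "%.2f\n", n / c }'
}

setpts_factor 10   # shortest narration in the range, prints 2.00
setpts_factor 14   # longest narration in the range, prints 2.80
```

A factor of 2.43 therefore corresponds to a narration segment of about 12.15 seconds.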

Assembly Pipeline

# 1. Slow each clip to match its narration
#    (factor = narration length / clip length; 2.43 stretches a 5 s clip to 12.15 s)
ffmpeg -i clip-A.mp4 -vf "setpts=2.43*PTS,scale=1080:1920" -an seg-A.mp4

# 2. Concatenate all 12 segments
#    (concat-list.txt holds one line per segment, e.g. file 'seg-A.mp4')
ffmpeg -f concat -i concat-list.txt -c copy video.mp4

# 3. Mix narration + BGM (BGM at 20% volume)
ffmpeg -i narration.mp3 -i bgm.mp3 \
  -filter_complex "[1:a]volume=0.2[bgm];[0:a][bgm]amix=inputs=2" mixed.mp3

# 4. Combine video + audio; -shortest ends the output with the shorter stream
ffmpeg -i video.mp4 -i mixed.mp3 -c:v copy -c:a aac -shortest final.mp4
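
The concat list used in step 2 can be generated rather than typed by hand. This is a sketch under the assumption that the slowed segments are named seg-A.mp4 through seg-L.mp4, following the seg-A.mp4 example; `make_concat_list` is a hypothetical helper.

```shell
#!/bin/sh
# make_concat_list: print one "file" directive per shot, in story order A to L,
# in the format expected by ffmpeg's concat demuxer.
make_concat_list() {
  for shot in A B C D E F G H I J K L; do
    printf "file 'seg-%s.mp4'\n" "$shot"
  done
}

# Usage:
#   make_concat_list > concat-list.txt
#   ffmpeg -f concat -i concat-list.txt -c copy video.mp4
```

Stream copy (`-c copy`) only works cleanly when all segments share the same codec and resolution, which is why step 1 scales every clip to 1080:1920.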

Total Cost

Generation Cost (IMA Credits)

Item	Tool	Credits	~USD
Keyframe images	Nano Banana Pro	~200	$2
Video generation	Kling O1 + Wan 2.6	~1,440	$14
Voiceover	Gemini TTS	Free	$0
Music score	DouBao BGM × 3 versions	90	$1
Generation subtotal		~1,730	~$17

AI Conversation Cost (LLM Tokens)

Item	Description	Est. Cost
Script writing & iteration	12-shot breakdown, narrative arc, two revision rounds	$2-3
Prompt engineering	Generation prompts for 12 shots, character descriptions	$1-2
Feedback & adjustments	Multi-round approvals, visual fixes, music style iteration	$2-4
Conversation subtotal		~$5-9

Grand Total

Category	Cost
Generation (images + video + music)	~$17
AI conversation (script + prompts + feedback)	~$7
Total	Under $25

Traditional production of a comparable historical short (actors, costumes, locations, crew) would likely cost tens of thousands of dollars at minimum. Ours took one person plus AI, under $25, and half a day.
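
For readers who want to check the totals, here is the arithmetic using only the figures from the tables above. The variable names are ours, and the credits-per-dollar rate is merely implied by the table (about 1,730 credits for about $17), not an official price.

```shell
#!/bin/sh
# Figures from the tables above (USD).
GENERATION=17   # images + video + music
CONV_LOW=5      # conversation subtotal, low end
CONV_HIGH=9     # conversation subtotal, high end

low=$((GENERATION + CONV_LOW))
high=$((GENERATION + CONV_HIGH))
mid=$(( (low + high) / 2 ))

echo "total: \$$low to \$$high (midpoint \$$mid)"
# Implied exchange rate from the generation subtotal (integer division)
echo "credits per USD: $((1730 / 17))"
```

The midpoint lands at $24; at the high end of the conversation estimate the total would edge slightly past $25.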

Lessons Learned

What Worked

Script first — The 12-shot narrative arc was locked before any generation began
Keyframes before video — Approving static images first saved hundreds of credits in video generation
Model tiering — Test with cheaper models, finalize with premium
Negative prompts — Telling AI what NOT to do is as important as telling it what to do
Slow motion = epic — Slowing 5s clips to 12s actually enhanced the historical gravitas

What We'd Do Differently

Character consistency: Still the hardest problem; faces varied slightly across shots
Model selection earlier: Wan 2.6 was only 8 credits per clip cheaper than Kling O1, so the 480-credit test round saved little; we should have committed to Kling O1 from the start
Music iteration: We should have specified "no Western instruments" in the first prompt

Tools Used

Purpose	Tool	Via Ima Claw
Keyframe generation	Gemini 3 Pro Image	✅
Video generation	Kling O1 / Wan 2.6	✅
Voiceover	Gemini TTS	✅
Music score	DouBao BGM	✅
Video editing	ffmpeg	✅ (CLI)

All tools are accessible through Ima Claw, with no separate accounts or API keys needed.

A coffee break to write the script. Half a day to generate the footage. A handful of ffmpeg commands to assemble the final cut. A 2,000-year-old legend, brought back to life by AI. 🐎
