| name | clone-ad |
| description | Clone an existing video ad for a different product or offer. Analyzes the source video's style, pacing, camera work, dialogue, and tone, then adapts and generates a new Seedance 2.0 video customized for the user's product. End-to-end workflow: input video → analysis → adapted prompt → generation → delivery. Use when someone says "clone this ad", "make this ad but for my product", "recreate this video for my brand", or provides a video ad and a product image asking for a similar video.
|
Clone ad — Seedance 2.0
Clone an existing video ad for a different product or offer. The agent analyzes the
source video frame-by-frame, transcribes dialogue, extracts the visual style and
beat structure, then generates a new Seedance 2.0 video adapted for the user's product.
How this differs from analyze-video:
- analyze-video → output is a reusable markdown template saved to
prompt-library/
- clone-ad → output is a generated Seedance 2.0 video delivered to the user
Prerequisites
Before starting, verify:
which ffmpeg || echo "MISSING — run: brew install ffmpeg"
python3 -c "import whisper; print('whisper OK')" 2>/dev/null || echo "MISSING — run: pip3 install openai-whisper"
Both extract-frames.sh and whisper depend on ffmpeg. If missing, install via brew install ffmpeg before proceeding.
Workflow
Step 0: Gather inputs
Collect from the user:
| Input | Required | Notes |
|---|
| Source video | yes | The video ad to clone. File path to .mp4, .mov, .webm |
| Product image | recommended | Reference photo of the user's product. Becomes referenceImages / @(img1) in the prompt. Without this, Seedance invents its own product design. |
| Product/offer description | if no image | Text description of the product, its features, target audience, and key selling points. Used to rewrite dialogue and product references. |
| Brand voice | optional | Check MASTER_CONTEXT.md for brand blocks. If empty, ask the user for tone/audience preferences. |
If the user only provides a video and says "clone this for my product," ask them for
at least a product image or a text description before proceeding.
Step 1: Extract frames and audio
Reuse the analyze-video extraction script — do NOT duplicate it.
bash "skills/arcads-external-api/prompting/analyze-video/scripts/extract-frames.sh" \
"<source_video_path>" "/tmp/clone-ad-analysis" <num_frames>
Frame count by duration:
| Source duration | Frames |
|---|
| Under 10s | 8 |
| 10–20s | 12 |
| 20–30s | 16 |
| Over 30s | 20 |
Outputs:
frame_001.jpg through frame_NNN.jpg
audio.wav (16 kHz mono, whisper-ready)
metadata.txt (duration, resolution, fps, frame count)
Read metadata.txt to get the source video duration — you'll need it for step 6.
Step 2: Transcribe audio
Use whisper to get the exact dialogue. This is critical — the dialogue pattern is what
gets adapted for the user's product.
import whisper
model = whisper.load_model("base")
result = model.transcribe("/tmp/clone-ad-analysis/audio.wav")
Record:
- Full transcript text
- Per-segment timestamps and text (
result["segments"])
- Total word count
- Language detected
If the video is silent (no speech detected), note that and skip the dialogue
adaptation in step 7. The clone will be a visual-style clone only.
Step 3: Compressed analysis
Read ALL extracted frames visually. For each frame, note:
Structure and pacing:
- How many distinct beats/shots are there?
- What's the narrative arc? (hook → demo → verdict? reveal → detail → CTA?)
- How long does each beat last? (map to segment timestamps)
Camera and framing:
- POV style: selfie/handheld, tripod, propped phone, over-the-shoulder?
- Framing per beat: wide, medium, close-up, macro?
- Camera movement: static, pan, dolly, handheld shake?
- Signature framing moves (e.g., "leans into camera," "tilts product toward lens")
Edit style:
- Transition type: jump cuts, dissolves, match cuts?
- Visual rhythm: fast cuts vs held shots?
- Any recurring motif (e.g., "every other beat is an extreme close-up")?
Dialogue and script structure:
- Hook format: question, statement, exclamation, reaction?
- Speech pattern: casual/formal, filler words, trailing thoughts, mid-sentence cuts?
- How many spoken lines? How many silent beats?
- CTA style: direct ("link in bio"), soft ("you need to try this"), none?
Tone and energy:
- Emotion words that describe the speaker/mood
- Energy arc: starts calm → builds excitement? Flat? Burst then settle?
- Speaker's relationship to viewer: friend, expert, skeptic, fan?
Lighting and technical quality:
- Light source: natural/artificial, direction, quality
- Camera quality: phone/DSLR/cinema, intentional flaws?
- Audio quality: phone mic, studio, car, outdoor?
Product references:
- How is the product physically shown? (held up, worn, applied, on a surface)
- What specific claims or features are called out?
- Brand mentions, labels visible, text overlays?
What makes this ad distinctive (2–3 defining traits):
- The unique combination of elements that makes this ad recognizable
- These are the traits that MUST transfer to the clone
Store this analysis internally — it does NOT get saved as a template file.
Step 4: Present analysis summary
Show the user a structured breakdown before proceeding:
📋 Source video analysis
Duration: Xs | Beats: N | Dialogue: Y words | Style: [style name]
Beat map:
[00:00–00:03] HOOK — close-up, excited expression, "opening line"
[00:03–00:07] SHOW — tilts product to camera, "feature call-out"
[00:07–00:10] DEMO — (silent) applies/uses product, close-up on texture
[00:10–00:15] VERDICT — back to camera, "closing line + CTA"
Defining traits:
1. [trait 1]
2. [trait 2]
3. [trait 3]
What transfers to your product:
✅ Beat structure, pacing, camera angles, edit style, tone, energy
✅ Dialogue pattern (adapted for your product)
✅ Lighting and technical quality cues
What gets swapped:
🔄 Product references → your product
🔄 Specific claims → your product's features
🔄 Brand mentions → your brand (if provided)
Proceed with adaptation? (yes / adjust)
Wait for user confirmation before continuing.
Step 5: Decide generation mode
Walk through this decision tree:
┌─ Source video ≤ 15s?
│ YES → Single-clip generation
│ NO → Multi-clip split at natural beat boundaries
│ Each clip ≤ 15s (Seedance max)
│ Identify best split points from beat map
│ Use the CHAINED MULTI-CLIP PIPELINE below
│
├─ User provided a product IMAGE?
│ YES → Image-to-video mode (referenceImages with @(img1) in prompt)
│ For multi-clip: use i2v for clip 1 ONLY, then chain v2v for clips 2+
│ NO → Text-only mode (describe product in prompt text only)
│ OR v2v if:
│ - Source video has NO human faces
│ - AND user wants to preserve exact visual style
│ - (v2v with faces → content checker rejection + billed)
│
├─ Source video has person SPEAKING?
│ YES → audioEnabled: true (confirm with user)
│ Dialogue confirmation gate REQUIRED (step 7)
│ NO → audioEnabled: false (or ask user preference)
│ Skip dialogue gate
│
└─ User wants voice clone from source audio?
YES → Upload source audio as referenceAudios
(check audio+image regression: run sanity probe first)
NO → Seedance generates its own voice from text
Chained multi-clip pipeline (confirmed 2026-04-10)
When the source ad is longer than 15s, use this hybrid i2v→v2v chaining pattern for visual continuity:
Clip 1: i2v mode
- referenceImages: [product image] ← establishes brand fidelity
- audioEnabled: true (if speech)
- Generate → poll → download output
Clip 2: v2v mode
- referenceVideos: [clip 1 output] ← inherits hands, surface, lighting, product
- NO referenceImages (mutually exclusive)
- audioEnabled: true (if speech)
- Upload clip 1 output via fresh presigned URL
- Generate → poll → download output
Clip 3: v2v mode
- referenceVideos: [clip 2 output] ← chain from MOST RECENT clip, not clip 1
- Upload clip 2 output via fresh presigned URL
- Generate → poll → download output
...continue for clips 4+
Critical rules for chaining:
- Always chain from the most recent clip — do not reuse earlier uploads. Presigned URLs expire and stale uploads may fail silently.
- Upload each clip output fresh via
POST /v1/file-upload/get-presigned-url immediately before using it as a reference. Do not reuse filePath values from previous uploads.
- Wait for each clip to reach
generated status before uploading it as a reference for the next clip. Do not fire clips in parallel — they must be sequential.
- Clip 1 uses i2v for brand fidelity (product image as reference). All subsequent clips use v2v (previous clip as reference) for visual continuity.
- After all clips are generated, stitch with ffmpeg:
ffmpeg -f concat -safe 0 -i list.txt -c copy output.mp4 (use absolute paths in the list file).
Why chaining works: Seedance v2v inherits the visual style, hands, surface, lighting, and product appearance from the reference video. By chaining clip N → clip N+1, each subsequent clip maintains continuity with the one before it. The first clip's i2v reference image establishes the product identity; v2v propagates it through the series.
Cost note: Clip 1 costs the i2v rate (~0.06/sec, 0.9 cr at 15s). Clips 2+ cost the v2v rate (~0.1/sec, 1.5 cr at 15s). A 3-clip series costs ~0.9 + 1.5 + 1.5 = ~3.9 credits total.
Important constraints to check:
referenceImages and referenceVideos are mutually exclusive — pick one per call
- v2v with human-containing reference videos → content checker rejection (credits burned)
audioEnabled: true + referenceImages may 500 (intermittent server regression) — sanity probe first
referenceVideos count > 1 fails — only 1 ref video works
- If using v2v: only use product-only/abstract/hands-only videos (no faces)
- Hands-only clips (no face visible) pass the v2v content checker — confirmed 2026-04-10
Tell the user which mode you're using and why.
Step 6: Adapt for user's product
This is the creative core. Using the analysis from step 3:
Dialogue adaptation (if source has speech):
- Keep the same conversational pattern: if the source uses a question hook, use a question hook. If it uses filler words ("like," "okay so"), keep filler words.
- Keep the same number of spoken lines and same silent beat placement
- Keep the same energy arc (excited → calm, or flat, or building)
- Replace product-specific references with the user's product name, features, and claims
- Match the word count of each line closely (±3 words per beat) to preserve pacing
- Read the adapted dialogue out loud at natural pace — it must fit the target duration
Visual adaptation:
- Keep the analyzed camera work, framing per beat, and edit style
- Replace the product description with the user's product (physical appearance, colors, materials, label details)
- Keep the setting, lighting, and atmosphere
- Keep the person description (or adapt if user specifies a different persona)
- Keep the technical flaw cues (phone quality, mic type, lighting imperfections)
Prompt composition:
- Read seedance-2.md for platform rules before composing
- Read the closest matching style template (e.g., seedance-2-ugc.md for UGC-style sources) for structural guidance
- Follow the Subject + Action + Camera + Style + Constraints order
- Stay within 100–260 words (Seedance sweet spot)
- Include
@(img1) token if user provided a product image
- Add consistency anchors: "The product from @(img1) must remain visually unchanged in every shot"
- Add pacing cues in the tone direction paragraph
- Use timestamps
[00:00], [00:04], etc. for multi-beat sequences
- No forbidden words: cinematic, professional, stunning, 8k, studio, perfect
Duration selection:
- If source ≤ 15s: match source duration (or round to nearest second in 4–15 range)
- If source > 15s: split into clips, each ≤ 15s
- Use dialogue word count to validate (see main SKILL.md duration table: ~2.5 words/sec)
Step 7: Dialogue confirmation gate
MANDATORY for any clone with spoken dialogue. Follow the exact format from the
main SKILL.md:
📝 Dialogue script (please confirm before I generate)
1. [HOOK] "adapted line matching original pattern"
2. [SHOW] "adapted feature call-out for user's product"
3. [DEMO] (silent beat — physical demonstration, no dialogue)
4. [VERDICT] "adapted closing line / CTA"
Total spoken words: ~N | Target duration: Xs | Fits at natural pace: ✅/❌
Approve this dialogue? (yes / edit / rewrite)
Rules:
- This gate is separate from the credit cost confirmation — both must be satisfied
- Never assume approval from earlier confirmations (tone, analysis, credit cost)
- If user says "edit" or proposes changes, revise and re-present until approved
- Skip ONLY if the source video is entirely silent (no speech detected in step 2)
Step 8: Audio decision
Ask the user:
- Enable audio output? (
audioEnabled: true / false)
- Default to
true if source video has speech
- Default to
false if source video is silent
- Supply reference audio for voice cloning?
- Offer to extract the source video's audio and use it as
referenceAudios
- Or user can provide their own voice clip
- Upload via presigned URL if provided
- Sanity probe (if using
audioEnabled: true + referenceImages):
- This combo has a known regression that returns 500
- Before the full call, fire a minimal test to check if the regression is still active
- If still broken, offer fallbacks:
- Drop audio (
audioEnabled: false)
- Drop reference image (text-only, lose brand fidelity)
- Use v2v workaround (generate silent i2v first, then v2v with audio on top)
Step 9: Credit cost estimation
Follow the main SKILL.md's mandatory estimation flow:
- Check
logs/arcads-api.jsonl for matching model + similar config
- Fall back to
MASTER_CONTEXT.md rate table
- For multi-clip: show per-clip and total
- Present with source citation and estimate-only disclosure:
Estimated credit cost:
Seedance 2.0 (15s i2v) × 1 clip × 1 variation = ~0.9 credits
(from logs/arcads-api.jsonl 2026-04-09)
─────────────────────────────────────
Estimated total: ~0.9 credits
⚠️ Estimate only — confirm exact cost in the Arcads platform.
Proceed? (yes/no)
Do NOT generate until the user confirms.
Step 10: Session setup and upload
Follow the main SKILL.md session folder checklist:
GET /v1/products → resolve productId (default from MASTER_CONTEXT.md)
- Check for existing "Arcads API - YYYY-MM-DD" folder → create if missing
- Create project inside the folder → store
projectId
- Upload references via
POST /v1/file-upload/get-presigned-url:
- Product image →
fileType: "image/jpeg" → auto-upscale if longest side < 1024px
- Source video (if v2v mode) →
fileType: "video/mp4"
- Reference audio (if voice clone) →
fileType: "audio/mpeg" or appropriate type
- Store all
filePath values for the generation payload
Step 11: Generate
-
Compose the CreateVideoDto JSON:
model: "seedance-2.0"
productId, projectId
prompt (from step 6)
aspectRatio: match source video (9:16 or 16:9)
duration: from step 6
resolution: 720p (default)
audioEnabled: from step 8
referenceImages: product image filePath (if i2v mode)
referenceVideos: source video filePath (if v2v mode, product-only, no faces)
referenceAudios: voice clip filePath (if voice clone)
-
Ask generation count (how many variations? default 1)
-
Single-clip: Fire N parallel POST /v2/videos/generate calls.
Multi-clip (chained): Fire clips sequentially per the chaining pipeline in step 5.
Each clip depends on the previous clip's output — do not fire in parallel.
-
Log immediately to logs/arcads-api.jsonl:
{
"timestamp": "...",
"endpoint": "POST /v2/videos/generate",
"model": "seedance-2.0",
"assetId": "...",
"request": { "duration": ..., "resolution": ..., ... },
"response": { "status": "pending", "creditsCharged": ... },
"session": { "folderName": "Arcads API - YYYY-MM-DD", "notes": "clone-ad: ..." }
}
-
Poll GET /v1/assets/{id} until generated or failed
- Single-clip: poll all variation IDs concurrently
- Multi-clip: poll each clip individually, wait for
generated before proceeding to the next
- Update log entry with final status,
creditsCharged, generationTimeSec, URLs
-
For multi-clip: upload each completed clip as a fresh presigned URL reference for the next clip (see chaining pipeline in step 5)
Step 12: Present results
- Assign all assets to session project via
POST /v1/assets/add-to-project
- Save all videos to
outputs/clone-ad-tests/ (or a descriptive subfolder)
- Open the output folder on the user's machine so they can immediately review:
open "outputs/clone-ad-tests/"
- Present watch/download URLs
- For multiple variations: numbered list for comparison
- For multi-clip:
- Show credit summary (total
creditsCharged across all clips/variations)
Seedance 2.0 constraints (quick reference)
Check reference.md for full details. These are the ones most
likely to bite during clone-ad:
| Constraint | Impact |
|---|
referenceVideos + referenceImages mutually exclusive | Cannot combine in same request (500) |
| v2v with human faces in reference video | Content checker rejects, credits still charged |
audioEnabled: true + referenceImages regression | May return 500 — sanity probe first |
referenceVideos count > 1 fails | Only 1 reference video accepted despite docs saying 3 |
| Content check bills before checking | Credits charged at create time, not refunded on rejection |
endFrame non-functional on Seedance 2.0 | Do not use |
| Prompt length | 100–260 words (Seedance sweet spot) |
| Duration | 4–15 seconds (continuous integer) |
| Aspect ratio | 9:16 or 16:9 only (no 1:1) |
| Forbidden words | cinematic, professional, stunning, 8k, studio, perfect |
Error recovery
| Error | Recovery |
|---|
| Content checker rejects prompt | Do NOT retry same payload. Remove potentially flagged language. Tighten motion descriptions. Check for forbidden words. |
500 on audioEnabled: true + referenceImages | Audio+image regression is active. Fallback options: (a) drop audio, (b) drop image and go text-only, (c) v2v workaround: generate silent i2v first, then run v2v with audio on top using the i2v output as reference video |
| v2v face rejection | Source video has humans — switch to i2v mode with user's product image |
| Prompt too long (> 260 words) | Trim: cut filler from tone direction, compress setting details, shorten consistency anchors. Prioritize beat structure and dialogue. |
| Source video > 15s | Split into clips at natural beat boundaries. Generate each separately. Offer to stitch. |
Generation fails (status: failed) | Check data.error.message. If content-related, rewrite prompt. If server error, wait and retry once. |
Related files