workflow-ai-generation

Name: Workflow AI Generation
Author: damionrashford

// Generate media from scratch with 2026 open-source AI — TTS voiceover (Kokoro / OpenVoice / Piper), image gen (FLUX-schnell / Kolors / Sana / ComfyUI), video gen (LTX-Video / CogVideoX / Mochi / Wan), music (Riffusion / YuE), lipsync talking heads (LivePortrait / LatentSync), OCR (PaddleOCR / Tesseract 5 / TrOCR), zero-shot tagging (CLIP / SigLIP / BLIP-2 / LLaVA). Strict commercial-safe license filter. Use when the user says "generate a video", "TTS voiceover", "AI explainer video", "clone my voice", "generate music", "AI image", "digital human", or anything about from-scratch AI media.

$ git log --oneline --stat

stars:7

forks:2

updated:April 18, 2026 at 05:55

SKILL.md


name	workflow-ai-generation
description	Generate media from scratch with 2026 open-source AI — TTS voiceover (Kokoro / OpenVoice / Piper), image gen (FLUX-schnell / Kolors / Sana / ComfyUI), video gen (LTX-Video / CogVideoX / Mochi / Wan), music (Riffusion / YuE), lipsync talking heads (LivePortrait / LatentSync), OCR (PaddleOCR / Tesseract 5 / TrOCR), zero-shot tagging (CLIP / SigLIP / BLIP-2 / LLaVA). Strict commercial-safe license filter. Use when the user says "generate a video", "TTS voiceover", "AI explainer video", "clone my voice", "generate music", "AI image", "digital human", or anything about from-scratch AI media.
argument-hint	["prompt"]

Workflow — AI Generation

What: Synthesize new media with open-source, commercial-safe AI models. Strict license filter: Apache-2 / MIT / BSD / GPL. Non-commercial (NC) and research-only models are always dropped.
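
The license filter can be sketched as an allowlist check. This is a minimal illustration, not the skill's actual implementation: `ALLOWED`, `BLOCK_TOKENS`, and `is_commercial_safe` are hypothetical names, and the license strings are the short labels used in the tool matrix.

```python
import re

# Hypothetical allowlist mirroring the filter described above.
ALLOWED = {"APACHE-2", "MIT", "BSD", "GPL", "GPL-3"}
# Labels containing any of these tokens are rejected outright.
BLOCK_TOKENS = {"NC", "RESEARCH"}


def is_commercial_safe(label: str) -> bool:
    """Return True when a short license label passes the allowlist."""
    upper = label.strip().upper()
    # Split on separators so "CC-BY-NC" yields the token "NC".
    tokens = set(re.split(r"[\s\-/()]+", upper))
    if tokens & BLOCK_TOKENS:
        return False
    return upper in ALLOWED


candidates = {
    "Kokoro": "Apache-2",
    "XTTS-v2": "CPML NC",
    "ComfyUI": "GPL-3",
    "Meta MusicGen": "CC-BY-NC",
    "F5-TTS": "research",
}
kept = [name for name, lic in candidates.items() if is_commercial_safe(lic)]
print(kept)
```

An unknown label fails closed, which matches the "always dropped" rule: a model is only kept when its license is explicitly on the allowlist.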

Skills used

media-tts-ai, media-whisper, media-sd, media-svd, media-musicgen, media-lipsync, media-depth, media-ocr-ai, media-tag, media-demucs, media-denoise-ai.

Tool matrix

TTS (`media-tts-ai`)

Model	License	Best for
Kokoro	Apache-2	general default
OpenVoice	MIT	voice cloning
CosyVoice	Apache-2	Chinese
Chatterbox	MIT	expressive
Piper	MIT	embedded / offline
StyleTTS2	MIT	single-voice quality
Parler	Apache-2	style prompting
Bark	MIT	creative / SFX
Orpheus	Apache-2	modern expressive

DROPPED: XTTS-v2 (CPML NC), F5-TTS (research).

Image (`media-sd`)

Model	License	Best for
FLUX-schnell	Apache-2	default (4-step distilled)
Kolors	Apache-2	bilingual EN/ZH
Sana	Apache-2	fast 4K
ComfyUI	GPL-3	node-graph workflows

DROPPED: FLUX-dev (NC), SDXL / SD3 base (restrictive).

Video (`media-svd`)

Model	License	Best for
LTX-Video	Apache-2	fastest
CogVideoX	Apache-2	high quality (slower)
Mochi	Apache-2	cinematic
Wan	Apache-2	versatile

DROPPED: Stable Video Diffusion (NC research).

Music / SFX (`media-musicgen`)

Model	License	Best for
Riffusion	Apache-2	spectrogram-diffusion
YuE	Apache-2	long-form structured

DROPPED: Meta MusicGen (CC-BY-NC).

Lipsync (`media-lipsync`)

Model	License
LivePortrait	MIT
LatentSync	Apache-2

DROPPED: Wav2Lip (research), SadTalker (NC).

OCR (`media-ocr-ai`)

Model	License	Best for
PaddleOCR	Apache-2	Latin + Chinese default
Tesseract 5	Apache-2	fastest classical
TrOCR	MIT	handwriting
EasyOCR	Apache-2	multilingual

DROPPED: Surya (commercial restriction).

Tagging / captioning (`media-tag`)

CLIP (MIT), SigLIP (Apache-2), BLIP-2 (BSD), LLaVA (Apache-2 but needs Llama-2 backbone — check license).
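
Zero-shot tagging with a CLIP-style model comes down to comparing one image embedding against one text embedding per candidate label. The sketch below shows only that scoring step, using toy 3-dimensional vectors in place of real model outputs; `zero_shot_scores` is a hypothetical helper, and the logit scale of 100 is an assumption taken from CLIP's usual setting. SigLIP scores each label independently with a sigmoid instead of a softmax.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_scores(image_emb, label_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities, one score per label."""
    logits = {
        label: logit_scale * cosine(image_emb, emb)
        for label, emb in label_embs.items()
    }
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(v - peak) for label, v in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Toy embeddings standing in for real encoder outputs.
image = [0.9, 0.1, 0.0]
labels = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
}
scores = zero_shot_scores(image, labels)
best = max(scores, key=scores.get)
print(best)
```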

Stem separation (`media-demucs`)

htdemucs 4-stem, htdemucs_6s 6-stem (adds guitar + piano).
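
A small sketch of how the two models differ in practice, assuming the stock `demucs` CLI with its `-n` (model) and `-o` (output directory) options and its default `<out>/<model>/<track>/<stem>.wav` layout. The helper names are hypothetical and the command is only built, not executed.

```python
from pathlib import Path

# Stems produced by each model; htdemucs_6s adds guitar and piano.
STEMS = {
    "htdemucs": ["drums", "bass", "other", "vocals"],
    "htdemucs_6s": ["drums", "bass", "other", "vocals", "guitar", "piano"],
}


def demucs_command(track, model="htdemucs", out_dir="separated"):
    """Build the argument list for a demucs run (does not execute it)."""
    if model not in STEMS:
        raise ValueError(f"unknown model: {model}")
    return ["demucs", "-n", model, "-o", out_dir, track]


def expected_stems(track, model="htdemucs", out_dir="separated"):
    """Paths where the separated stems are expected to appear."""
    base = Path(out_dir) / model / Path(track).stem
    return [base / f"{stem}.wav" for stem in STEMS[model]]


print(demucs_command("song.mp3", "htdemucs_6s"))
print(expected_stems("song.mp3", "htdemucs_6s")[-1])
```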

Speech-to-text (`media-whisper`)

whisper.cpp (MIT, fastest on CPU), faster-whisper (MIT, CUDA-accelerated).

Example composite workflows

Explainer video from script

script → TTS voiceover (media-tts-ai Kokoro) → slide images (media-sd FLUX-schnell) → B-roll clips (media-svd LTX-Video) → host animation (media-lipsync LivePortrait) → background music (media-musicgen Riffusion) → assemble (ffmpeg-cut-concat) → mix (ffmpeg-audio-filter + sidechain ducking) → loudness normalize (media-ffmpeg-normalize) → auto-burn subtitles (media-whisper + ffmpeg-subtitles) → transcode H.264 → YouTube upload (media-cloud-upload).
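
The mix step with sidechain ducking could be sketched as a single ffmpeg filter graph. This is an illustration under assumptions, not the ffmpeg-audio-filter skill's own command: the function name, file names, and compressor settings are placeholders. The voiceover is split so one copy drives `sidechaincompress` on the music and the other is mixed back in with `amix`.

```python
import shlex


def ducking_command(music, voice, out,
                    threshold=0.05, ratio=8, attack=5, release=300):
    """Build an ffmpeg command that ducks `music` under `voice`."""
    graph = (
        # Split the voiceover: one copy to mix, one as the sidechain key.
        "[1:a]asplit=2[vo][sc];"
        # Compress the music whenever the sidechain (voice) is loud.
        f"[0:a][sc]sidechaincompress=threshold={threshold}:ratio={ratio}"
        f":attack={attack}:release={release}[duck];"
        # Sum the ducked music with the untouched voiceover.
        "[duck][vo]amix=inputs=2:duration=longest[mix]"
    )
    return [
        "ffmpeg", "-y", "-i", music, "-i", voice,
        "-filter_complex", graph, "-map", "[mix]",
        "-ar", "48000", out,
    ]


cmd = ducking_command("music.wav", "voice.wav", "mix.wav")
print(shlex.join(cmd))
```

Attack and release are in milliseconds; a longer release keeps the music from pumping back up between words.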

Multilingual voice clone

30 s clean reference → DeepFilterNet denoise → OpenVoice clone to EN/ES/FR/JP → per-track denoise → mux with localized subs.
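
The final mux step could look like the sketch below: one video, one audio track and one subtitle file per language, each tagged with an ISO 639-2 language code. The helper name, file names, and the choice of MKV with AAC audio and SRT subtitles are assumptions made for illustration; the command is built but not run.

```python
# Map the workflow's language labels to ISO 639-2 codes for stream metadata.
LANG_CODES = {"EN": "eng", "ES": "spa", "FR": "fra", "JP": "jpn"}


def mux_command(video, tracks, out):
    """tracks: list of (language_label, audio_path, subtitle_path)."""
    cmd = ["ffmpeg", "-y", "-i", video]
    for _, audio, subs in tracks:
        cmd += ["-i", audio, "-i", subs]
    cmd += ["-map", "0:v"]
    # Input 0 is the video; audio sits at inputs 1, 3, 5... and subs at 2, 4, 6...
    for i in range(len(tracks)):
        cmd += ["-map", f"{1 + 2 * i}:a"]
    for i in range(len(tracks)):
        cmd += ["-map", f"{2 + 2 * i}:s"]
    for i, (lang, _, _) in enumerate(tracks):
        code = LANG_CODES[lang]
        cmd += [f"-metadata:s:a:{i}", f"language={code}"]
        cmd += [f"-metadata:s:s:{i}", f"language={code}"]
    cmd += ["-c:v", "copy", "-c:a", "aac", "-c:s", "srt", out]
    return cmd


cmd = mux_command(
    "talk.mp4",
    [("EN", "en.wav", "en.srt"), ("ES", "es.wav", "es.srt")],
    "talk_multilang.mkv",
)
print(" ".join(cmd))
```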

Book cover + trailer

FLUX-schnell cover variations → LLaVA auto-describe for alt-text → Mochi cinematic trailer → StyleTTS2 narrator → YuE cinematic score → assemble.

Automated podcast chapter thumbnails

scenedetect chapter boundaries → extract frame per chapter → BLIP-2 caption → Kolors generate thumbnail from caption → embed chapter metadata.
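
The frame-extraction and chapter-embedding ends of this chain can be sketched with plain ffmpeg. The example below assumes chapter boundaries are already available as seconds (for instance from scenedetect output) and writes them in ffmpeg's FFMETADATA chapter format; the helper names and file names are hypothetical.

```python
def _escape(value):
    """Backslash-escape characters that are special in FFMETADATA files."""
    for ch in ("\\", "=", ";", "#", "\n"):
        value = value.replace(ch, "\\" + ch)
    return value


def chapters_to_ffmetadata(boundaries, titles):
    """boundaries: N+1 sorted times in seconds delimiting N chapters."""
    if len(boundaries) != len(titles) + 1:
        raise ValueError("need one more boundary than titles")
    lines = [";FFMETADATA1"]
    for start, end, title in zip(boundaries, boundaries[1:], titles):
        lines += [
            "[CHAPTER]",
            "TIMEBASE=1/1000",  # START/END are in milliseconds
            f"START={round(start * 1000)}",
            f"END={round(end * 1000)}",
            f"title={_escape(title)}",
        ]
    return "\n".join(lines) + "\n"


def frame_command(video, at_seconds, out_png):
    """Build an ffmpeg command that grabs a single frame at a timestamp."""
    return ["ffmpeg", "-y", "-ss", f"{at_seconds:.3f}", "-i", video,
            "-frames:v", "1", out_png]


metadata = chapters_to_ffmetadata([0, 90.5, 200], ["Intro", "Q&A; wrap-up"])
print(metadata)
print(frame_command("episode.mp4", 45.25, "chapter1.png"))
```

The metadata file can then be passed to ffmpeg as a second input with `-map_metadata 1` to embed the chapters.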

Digital human with cloned voice

OpenVoice clone → LivePortrait drives portrait → FLUX-schnell branded background → chromakey + RVM matte composite → Riffusion audio bed → transcode + upload.
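
The chromakey composite in this chain could be sketched as below, assuming the animated portrait was rendered over a solid green background and that the generated background image already matches its resolution. The key color, similarity, and blend values are placeholders to tune per shot, and the helper name is hypothetical. An RVM matte would replace the `chromakey` stage when no green screen is available.

```python
def composite_command(background, presenter, out,
                      key_color="0x00FF00", similarity=0.12, blend=0.05):
    """Build an ffmpeg command keying `presenter` over a still `background`."""
    graph = (
        # Turn the key color in the presenter video transparent.
        f"[1:v]chromakey={key_color}:{similarity}:{blend}[fg];"
        # Overlay it on the looped background; stop when the presenter ends.
        "[0:v][fg]overlay=shortest=1[out]"
    )
    return [
        "ffmpeg", "-y",
        "-loop", "1", "-i", background,  # still image repeated as video
        "-i", presenter,
        "-filter_complex", graph,
        "-map", "[out]", "-map", "1:a?",  # keep presenter audio if present
        "-c:v", "libx264", "-pix_fmt", "yuv420p", out,
    ]


cmd = composite_command("branded_bg.png", "host.mp4", "composite.mp4")
print(" ".join(cmd))
```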

Gotchas

License discipline. Always check the skill's references/LICENSES.md BEFORE adopting a new model. HuggingFace license field is NOT authoritative. Pin model weights to specific commit hashes.
GPL-3 in ComfyUI / RVM: source must be released if you distribute a modified or bundled copy. Call them as separate tools in the pipeline rather than bundling them into a proprietary product.
All Layer 9 skills need GPU for reasonable throughput (10–50× slower on CPU).
TTS sample rates vary: Kokoro 24 kHz, Piper 22.05 kHz, StyleTTS2 24 kHz, OpenVoice 24 kHz. Resample ALL to 48 kHz before mixing.
Voice cloning needs CLEAN reference. Noisy sample → noisy clone. DeepFilterNet the reference first.
FLUX-schnell is 4-step distilled. Keep --steps at 4 (8 at most); more steps waste compute with no quality gain.
ComfyUI workflow JSONs pin specific node versions. Use ComfyUI Manager to install matching nodes.
LTX-Video prompt adherence is phrasing-sensitive. "cat running" ≠ "running cat".
CogVideoX-5b needs ~20 GB VRAM. 2b variant runs on 8 GB at lower quality.
Riffusion produces 5.11-second clips natively. Chain with crossfade for longer, or use YuE for structured long-form.
LivePortrait expects clean frontal portrait. Angled faces, glasses, occluded mouths degrade output.
LivePortrait outputs 512×512 by default. Upscale with media-upscale for larger.
Whisper large-v3 is 3 GB. Test with base.en (140 MB) first; on clean audio its quality gap to medium (1.5 GB) is small.
Whisper hallucinates on silence. Trim leading/trailing silence with the ffmpeg silenceremove filter.
Whisper word-level timestamps require word_timestamps=True in the faster-whisper transcribe() call (--word_timestamps True is the openai-whisper CLI form) or --max-len 1 --split-on-word (whisper.cpp).
Whisper --language auto is fragile on accented speech. Specify language explicitly.
Diarization is external to Whisper. Use pyannote.audio (MIT) or simple-diarizer.
OCR selection: PaddleOCR for Latin+Chinese, TrOCR for handwriting, Tesseract for speed on clean docs.
CLIP / SigLIP / BLIP-2 need fixed-resolution inputs (224/336/384/448). The tagctl.py script resizes internally.
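LLaVA needs an LLM backbone (Vicuna / Llama-2 7B+). The Llama-2 community license is not OSI-open: it carries an acceptable-use policy and requires a separate license above 700 million monthly active users.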
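
Two of the audio gotchas above (resampling TTS output to 48 kHz and trimming silence before Whisper) reduce to short ffmpeg invocations. The sketch below only builds the commands; the helper names, file names, and the -50 dB threshold are assumptions. Trailing silence is handled by reversing the audio, trimming the now-leading silence, and reversing back.

```python
# Native output rates quoted in the gotchas list.
NATIVE_RATES_HZ = {
    "Kokoro": 24000,
    "Piper": 22050,
    "StyleTTS2": 24000,
    "OpenVoice": 24000,
}
MIX_RATE_HZ = 48000


def resample_command(src, dst, rate=MIX_RATE_HZ):
    """Build an ffmpeg command that resamples `src` to the mix rate."""
    return ["ffmpeg", "-y", "-i", src, "-ar", str(rate), dst]


def trim_silence_command(src, dst, threshold_db=-50):
    """Build an ffmpeg command that trims leading and trailing silence."""
    one_pass = f"silenceremove=start_periods=1:start_threshold={threshold_db}dB"
    # Trim the head, reverse, trim the (now leading) tail, reverse back.
    chain = ",".join([one_pass, "areverse", one_pass, "areverse"])
    return ["ffmpeg", "-y", "-i", src, "-af", chain, dst]


print(resample_command("kokoro.wav", "kokoro_48k.wav"))
print(trim_silence_command("interview.wav", "interview_trimmed.wav"))
```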

Related

workflow-ai-enhancement: for enhancing EXISTING footage (not generating new).
workflow-podcast-pipeline: for podcast-specific AI workflows.
workflow-analysis-quality: VMAF + QC on AI-generated output.

related-skills.json

same repository

ffmpeg-cut-concat.md

from "damionrashford/media-os"

Trim, cut, split, segment, and concatenate media with ffmpeg (stream copy when possible, re-encode across cut boundaries). Use when the user asks to trim a video, cut a clip, extract a segment by timestamps, remove a section, split into parts, join/merge videos, concatenate files, or build a segmented HLS-style playlist.

2026-05-17 · stars: 7

obs-config.md

from "damionrashford/media-os"

Install and configure OBS Studio programmatically: install via brew cask / winget / Flatpak / apt, author profiles (basic.ini, streamEncoder.json, recordEncoder.json), author scene collections (scenes JSON), manage global.ini, set defaults for encoder / output / audio / hotkeys, cross-platform config paths. Use when the user asks to install OBS, set up an OBS profile, create a scene collection from code, configure OBS defaults without the GUI, edit basic.ini, manage multiple OBS profiles, or script a fresh OBS install with known-good settings.

2026-05-17 · stars: 7

workflow-analysis-quality.md

from "damionrashford/media-os"

Deep media inspection + automated QC — ffprobe stream details, MediaInfo diagnostics, VMAF/PSNR/SSIM quality metrics, PySceneDetect scene cuts, crop/silence/black-frame/interlacing detection, ffplay scope debugging, NAL/SEI bitstream forensics, metadata audits, loudness compliance against Spotify/Apple/ATSC/EBU specs, and automated CI QC gates. Use when the user says "QC this file", "run VMAF", "compare encoders", "detect scene cuts", "check loudness compliance", "validate delivery spec", "automated QC pipeline", or any deep inspection / quality gating.

2026-05-17 · stars: 7

media-pipeline-router.md

from "damionrashford/media-os"

Routes media production requests to the right Media OS specialist subagent. ALWAYS use this skill when the user expresses ANY media production intent — "go live", "start streaming", "OBS broadcast", "wire up live rig", "NDI to stream", "PTZ camera setup", "DeckLink capture stream", "HLS deliver", "DASH package", "encode for streaming", "CDN upload", "make a HLS manifest", "multi-bitrate ladder", "CMAF package", "LL-HLS", "low-latency stream", "Widevine package", "DRM package", "package for streaming", "broadcast deliver", "MXF master", "IMF package", "ProRes master", "DPP deliver", "AS-11 deliver", "Netflix deliver", "broadcast spec deliver", "deliver for air", "create IMF", "Premiere to Resolve", "round-trip", "OTIO export", "editorial conform", "XML round-trip", "FCPXML", "EDL export", "AAF export", "convert timeline", "Avid to Premiere", "Resolve to Premiere", "upscale this", "interpolate frames", "denoise AI", "remove background", "rotoscope", "matte", "depth estimate", "AI upscale", "RIFE", "Real-ESRGAN"

2026-05-17 · stars: 7

workflow-acquisition-archive.md

from "damionrashford/media-os"

Ingest from every source — web (yt-dlp), screen / webcam / mic capture, SDI (DeckLink), DSLR tether (gphoto2), NDI network sources, RTSP / IP cameras via MediaMTX, PTZ — then verify integrity (SHA-256 + full-decode pass), preserve metadata (EXIF / XMP / IPTC), normalize to MKV or FFV1 / J2K / ProRes archival containers, and push to cold-storage cloud (Glacier / B2 / Archive.org). Use when the user says "download YouTube playlist", "capture SDI for 24 hours", "archive IP cameras", "preserve VHS rips", "tether DSLR timelapse", "cold-storage upload", or any ingest-to-archive workflow.

2026-04-18 · stars: 7

workflow-ai-enhancement.md

from "damionrashford/media-os"

Restore, upscale, and enhance existing footage using 2026 open-source AI models — Real-ESRGAN/SwinIR/HAT super-resolution, RIFE/FILM interpolation, DeepFilterNet/RNNoise audio denoise, rembg/BiRefNet/RVM matting, Depth-Anything v2 depth — with strict OSI-open commercial-safe license filter. Use when the user says "upscale old footage", "remaster", "enhance quality", "30 to 60fps", "AI denoise", "restore VHS", "remove background from video", or anything about AI-driven footage restoration.

2026-04-18 · stars: 7

package.json

"author": "damionrashford"

"repository": "damionrashford/media-os"


