talking-head-production

Stars: 559

Forks: 88

Updated: May 16, 2026, 12:29

Talking head video production with AI avatars, lipsync, and voiceover. Recommended: P-Video-Avatar (fastest, cheapest, built-in TTS). Also covers OmniHuman, PixVerse, Fabric. Portrait requirements, audio quality, production workflows. Use for: spokesperson videos, course content, social media, presentations, demos. Triggers: talking head, avatar video, lipsync, lip sync, ai spokesperson, virtual presenter, ai presenter, omnihuman, talking avatar, video presenter, ai talking head, presenter video, ai face video, p-video-avatar

Source: inference-sh/skills (creator: inference-sh)

name	talking-head-production
description	Talking head video production with AI avatars, lipsync, and voiceover. Recommended: P-Video-Avatar (fastest, cheapest, built-in TTS). Also covers OmniHuman, PixVerse, Fabric. Portrait requirements, audio quality, production workflows. Use for: spokesperson videos, course content, social media, presentations, demos. Triggers: talking head, avatar video, lipsync, lip sync, ai spokesperson, virtual presenter, ai presenter, omnihuman, talking avatar, video presenter, ai talking head, presenter video, ai face video, p-video-avatar
allowed-tools	Bash(belt *)

Install the belt CLI skill: npx skills add belt-sh/cli

Talking Head Production

Create talking head videos with AI avatars and lipsync via the inference.sh CLI.

Quick Start

Requires the inference.sh CLI (belt); see the install instructions.

belt login

# Recommended: P-Video-Avatar (built-in TTS, fastest, cheapest)
belt app run pruna/p-video-avatar --input '{
  "image": "https://portrait.jpg",
  "voice_script": "Welcome to our product tour. Today I will show you three features that will save you hours every week.",
  "voice": "Zephyr (Female)"
}'

Portrait Requirements

The source portrait image is critical: a poor portrait produces poor video output.

Must Have

Requirement	Why	Spec
Center-framed	Avatar needs face in predictable position	Face centered in frame
Head and shoulders	Body visible for natural gestures	Crop below chest
Eyes to camera	Creates connection with viewer	Direct frontal gaze
Neutral expression	Starting point for animation	Slight smile OK, not laughing/frowning
Clear face	Model needs to detect features	No sunglasses, heavy shadows, or obstructions
High resolution	Detail preservation	Min 512x512 face region, ideally 1024x1024+
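
The resolution row above can be turned into a quick pre-flight check. This is a minimal sketch, assuming you have already measured the face region in pixels yourself; `face_region_grade` is a hypothetical helper name, not part of belt.

```shell
# Hypothetical helper: grade a face region (width, height in pixels)
# against the table above: below 512 is too small, 1024+ is ideal.
face_region_grade() {
  w=$1
  h=$2
  if [ "$w" -lt 512 ] || [ "$h" -lt 512 ]; then
    echo "too small"
  elif [ "$w" -ge 1024 ] && [ "$h" -ge 1024 ]; then
    echo "ideal"
  else
    echo "acceptable"
  fi
}

face_region_grade 400 600    # too small
face_region_grade 768 768    # acceptable
face_region_grade 1200 1100  # ideal
```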

Generate a Portrait

# Generate a professional portrait with P-Image
belt app run pruna/p-image --input '{
  "prompt": "professional headshot portrait of a friendly business person, soft studio lighting, clean grey background, head and shoulders, direct eye contact, neutral pleasant expression, photorealistic",
  "aspect_ratio": "9:16"
}'

Background Options

Type	When to Use
Solid color	Professional, clean, easy to composite
Soft bokeh	Natural, lifestyle feel
Office/studio	Business context
Dynamic (P-Video-Avatar)	Use `video_prompt` to set background

Model Selection

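Start with P-Video-Avatar: in the cost and speed comparison below it generates roughly 15x to 18x faster and costs about 6x less per second than OmniHuman 1.5 and Fabric 1.0, and it has built-in TTS.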

Model	App ID	Built-in TTS	Best For
P-Video-Avatar	`pruna/p-video-avatar`	Yes (30 voices, 10 langs)	Best overall: speed, cost, quality
OmniHuman 1.5	`bytedance/omnihuman-1-5`	No	Multi-character, gestures
OmniHuman 1.0	`bytedance/omnihuman-1-0`	No	Single character
Fabric 1.0	`falai/fabric-1-0`	Yes	Image talks with lipsync
PixVerse Lipsync	`falai/pixverse-lipsync`	No	Realistic lipsync

Cost & Speed Comparison

Model	Generation time (per second of video)	Cost per second
P-Video-Avatar	~1.83s/s	$0.025
OmniHuman 1.5	~28s/s (15x slower)	$0.16 (6.4x more)
Fabric 1.0	~34s/s (18x slower)	$0.14 (5.6x more)
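
To see what these per-second figures mean for a whole clip, the arithmetic can be sketched as below. `estimate_clip` is a hypothetical helper, not part of belt. It assumes cost and generation time scale linearly with clip length, which the table suggests but does not guarantee.

```shell
# Hypothetical helper: estimate cost (USD) and generation time (seconds)
# for a clip, using the per-second figures from the table above.
# Usage: estimate_clip <video_seconds> <cost_per_second> <gen_seconds_per_video_second>
estimate_clip() {
  awk -v d="$1" -v cost="$2" -v speed="$3" \
    'BEGIN { printf "$%.2f, ~%.0f s to generate\n", d * cost, d * speed }'
}

estimate_clip 60 0.025 1.83   # P-Video-Avatar
estimate_clip 60 0.16  28     # OmniHuman 1.5
estimate_clip 60 0.14  34     # Fabric 1.0
```

Under that assumption, a 60-second clip comes to about $1.50 on P-Video-Avatar versus $9.60 on OmniHuman 1.5 and $8.40 on Fabric 1.0.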

Production Workflows

Simple: Text Script -> Video (P-Video-Avatar)

No separate TTS step needed — P-Video-Avatar has built-in voices:

belt app run pruna/p-video-avatar --input '{
  "image": "https://portrait.jpg",
  "voice_script": "Hi there! I am excited to share something with you today.",
  "voice": "Puck (Male)",
  "voice_language": "English (US)",
  "resolution": "720p"
}'
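
The examples in this guide wrap the JSON input in single quotes, so a script containing an apostrophe or a double quote would break the command. Here is a minimal sketch of one way around that, assuming a POSIX shell with `sed`. `json_escape` and `build_avatar_input` are hypothetical helper names, not part of belt, and the escaping covers only backslashes and double quotes (not newlines or other control characters).

```shell
# Hypothetical helpers (not part of the belt CLI).

# Escape backslashes and double quotes so a string can sit inside JSON quotes.
# Newlines and other control characters are NOT handled in this sketch.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Print the input JSON for pruna/p-video-avatar from image URL, script, voice.
build_avatar_input() {
  printf '{"image": "%s", "voice_script": "%s", "voice": "%s"}\n' \
    "$(json_escape "$1")" "$(json_escape "$2")" "$(json_escape "$3")"
}

payload=$(build_avatar_input "https://portrait.jpg" \
  "Don't miss our \"launch week\" demo." "Zephyr (Female)")
printf '%s\n' "$payload"

# Pass the payload in double quotes so the shell keeps it as one argument:
#   belt app run pruna/p-video-avatar --input "$payload"
```

For scripts with line breaks or unusual characters, a real JSON tool is the safer choice. The helper above is only meant to show where the quoting problem comes from.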

With Style Control

belt app run pruna/p-video-avatar --input '{
  "image": "https://portrait.jpg",
  "voice_script": "This is exciting news for our community!",
  "voice": "Aoede (Female)",
  "voice_prompt": "Enthusiastic and energetic tone, slightly faster pace",
  "video_prompt": "The person is presenting on stage with dramatic lighting",
  "resolution": "1080p"
}'

Audio-Driven (Any Model)

Provide your own audio file:

# P-Video-Avatar with custom audio
belt app run pruna/p-video-avatar --input '{
  "image": "https://portrait.jpg",
  "audio": "https://speech.mp3"
}'

# OmniHuman with custom audio
belt app run bytedance/omnihuman-1-5 --input '{
  "image_url": "https://portrait.jpg",
  "audio_url": "https://speech.mp3"
}'

Full Workflow: Generate Portrait + Avatar

# 1. Generate a portrait image
belt app run pruna/p-image --input '{
  "prompt": "professional headshot portrait of a young woman, neutral background, looking at camera, studio lighting, photorealistic",
  "aspect_ratio": "9:16"
}'

# 2. Create avatar video with built-in TTS
belt app run pruna/p-video-avatar --input '{
  "image": "<image-url-from-step-1>",
  "voice_script": "Hi there! Let me walk you through our latest features.",
  "voice": "Zephyr (Female)"
}'

With Separate TTS (for non-TTS models)

# 1. Generate speech
belt app run falai/dia-tts --input '{
  "prompt": "[S1] Your narration script here."
}'

# 2. Create talking head
belt app run bytedance/omnihuman-1-5 --input '{
  "image_url": "https://portrait.jpg",
  "audio_url": "<audio-url-from-step-1>"
}'

Multi-Character Conversation

OmniHuman 1.5 supports up to 2 characters:

# 1. Generate dialogue with two speakers
belt app run falai/dia-tts --input '{
  "prompt": "[S1] So tell me about the new feature. [S2] Sure! We built a dashboard that shows real-time analytics. [S1] That sounds great. How long did it take? [S2] About two weeks from concept to launch."
}'

# 2. Create video with two characters
belt app run bytedance/omnihuman-1-5 --input '{
  "image_url": "https://two-person-portrait.png",
  "audio_url": "<audio-url>"
}'

Long-Form (Stitched Clips)

For content longer than ~60 seconds, split into segments:

# Generate clips with same portrait for consistency
belt app run pruna/p-video-avatar --input '{"image": "https://portrait.jpg", "voice_script": "Segment one..."}' --no-wait
belt app run pruna/p-video-avatar --input '{"image": "https://portrait.jpg", "voice_script": "Segment two..."}' --no-wait
belt app run pruna/p-video-avatar --input '{"image": "https://portrait.jpg", "voice_script": "Segment three..."}' --no-wait

# Merge all segments once every clip has finished (replace the placeholders with the clips' output URLs)
belt app run infsh/media-merger --input '{
  "media": ["segment1.mp4", "segment2.mp4", "segment3.mp4"]
}'
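
Splitting a long script by hand is tedious, so the segmenting step can be sketched as below. `split_script` is a hypothetical helper, not part of belt. The word budget is an assumption: at a rule-of-thumb speaking rate of roughly 150 words per minute, 150 words is about 60 seconds, but the built-in voices may pace differently, so tune it against real output.

```shell
# Hypothetical helper: split a script (stdin) into segments of at most
# <max_words> words, breaking only at sentence ends (. ! ?).
# A single sentence longer than the budget is kept whole.
split_script() {
  awk -v max="$1" '
    function emit() { if (seg != "") { print seg; seg = ""; segn = 0 } }
    function add(s, n) {
      if (segn + n > max) emit()
      seg = (seg == "" ? s : seg " " s); segn += n
    }
    {
      for (i = 1; i <= NF; i++) {
        sent = (sent == "" ? $i : sent " " $i); sentn++
        if ($i ~ /[.!?]$/) { add(sent, sentn); sent = ""; sentn = 0 }
      }
    }
    END { if (sent != "") add(sent, sentn); emit() }
  '
}

# Tiny demo with a 5-word budget; prints one segment per line.
printf '%s\n' "One two three. Four five. Six seven eight nine." | split_script 5
```

Each output line can then be used as the `voice_script` of one clip. Abbreviations such as "Dr." count as sentence ends in this sketch, so check the splits before generating.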

Multilingual Content

P-Video-Avatar supports 10 languages with built-in TTS:

# Spanish
belt app run pruna/p-video-avatar --input '{
  "image": "https://portrait.jpg",
  "voice_script": "Bienvenidos a nuestra demostración de producto.",
  "voice": "Kore (Female)",
  "voice_language": "Spanish"
}'

# Japanese
belt app run pruna/p-video-avatar --input '{
  "image": "https://portrait.jpg",
  "voice_script": "こんにちは、製品デモへようこそ。",
  "voice": "Leda (Female)",
  "voice_language": "Japanese"
}'

Dub Existing Video

# 1. Transcribe original video
belt app run infsh/fast-whisper-large-v3 --input '{"audio_url": "https://video.mp4"}'

# 2. Translate text (manually or with an LLM)

# 3. Generate speech in new language
belt app run infsh/kokoro-tts --input '{"text": "<translated-text>"}'

# 4. Lipsync original video with new audio
belt app run infsh/latentsync-1-6 --input '{
  "video_url": "https://original-video.mp4",
  "audio_url": "<new-audio-url>"
}'

Audio Quality (for non-TTS workflows)

When providing your own audio, quality directly impacts lipsync accuracy.

Parameter	Target	Why
Background noise	None/minimal	Noise confuses lipsync timing
Volume	Consistent throughout	Prevents sync drift
Sample rate	44.1kHz or 48kHz	Standard quality
Format	MP3 128kbps+ or WAV	Compatible with all tools
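
The sample-rate row can be checked before uploading a WAV file. This is a minimal sketch using only `od`, assuming a canonical WAV header in which the `fmt ` chunk directly follows the RIFF/WAVE header, so the sample rate sits at byte offset 24. Files with extra chunks before `fmt `, and MP3 files, are not covered. `wav_sample_rate` and `is_recommended_rate` are hypothetical helper names.

```shell
# Hypothetical helper: read the sample rate from a canonical WAV header
# (little-endian 32-bit value at byte offset 24).
wav_sample_rate() {
  set -- $(od -An -tu1 -j24 -N4 "$1")
  echo $(( $1 + ($2 << 8) + ($3 << 16) + ($4 << 24) ))
}

# Succeeds only for the rates recommended in the table above.
is_recommended_rate() {
  case "$1" in
    44100|48000) return 0 ;;
    *) return 1 ;;
  esac
}

# Demo: write a 28-byte stub header declaring 44100 Hz, then read it back.
stub=$(mktemp)
printf 'RIFFxxxxWAVEfmt xxxxxxxx\104\254\000\000' > "$stub"
rate=$(wav_sample_rate "$stub")
if is_recommended_rate "$rate"; then
  echo "$rate Hz: OK"
else
  echo "$rate Hz: resample first"
fi
rm -f "$stub"
```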

Available Voices (P-Video-Avatar)

Female: Zephyr, Kore, Leda, Aoede, Callirrhoe, Autonoe, Despina, Erinome, Laomedeia, Achernar, Gacrux, Pulcherrima, Vindemiatrix, Sulafat

Male: Puck, Charon, Fenrir, Orus, Enceladus, Iapetus, Umbriel, Algenib, Algieba, Schedar, Achird, Zubenelgenubi, Sadachbia, Sadaltager, Alnilam, Rasalgethi

Languages: English (US), English (UK), Spanish, French, German, Italian, Portuguese (Brazil), Japanese, Korean, Hindi
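
The examples in this guide pass voices as a name plus gender label, such as `Zephyr (Female)`. A small lookup can catch typos before a job is submitted. This is a sketch built from the lists above. `voice_label` is a hypothetical helper, and it assumes the CLI expects exactly the label format shown in the examples.

```shell
# Hypothetical helper: turn a bare voice name into the "Name (Gender)" label
# used by the "voice" field in the examples. Returns non-zero if the name
# is not in the voice lists above.
voice_label() {
  case "$1" in
    Zephyr|Kore|Leda|Aoede|Callirrhoe|Autonoe|Despina|Erinome|Laomedeia|Achernar|Gacrux|Pulcherrima|Vindemiatrix|Sulafat)
      echo "$1 (Female)" ;;
    Puck|Charon|Fenrir|Orus|Enceladus|Iapetus|Umbriel|Algenib|Algieba|Schedar|Achird|Zubenelgenubi|Sadachbia|Sadaltager|Alnilam|Rasalgethi)
      echo "$1 (Male)" ;;
    *)
      echo "unknown voice: $1" >&2
      return 1 ;;
  esac
}

voice_label Puck    # Puck (Male)
voice_label Aoede   # Aoede (Female)
```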

Framing Guidelines

┌─────────────────────────────────┐
│          Headroom (minimal)     │
│  ┌───────────────────────────┐  │
│  │                           │  │
│  │     ● ─ ─ Eyes at 1/3 ─ ─│─ │ ← Eyes at top 1/3 line
│  │    /|\                    │  │
│  │     |   Head & shoulders  │  │
│  │    / \  visible           │  │
│  │                           │  │
│  └───────────────────────────┘  │
│       Crop below chest          │
└─────────────────────────────────┘
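
When cropping an existing photo, the eyes-at-one-third rule reduces to a short calculation. This is a minimal sketch, assuming you already know the eye line's vertical position in pixels and the crop height you want. `crop_top` is a hypothetical helper and uses integer arithmetic only.

```shell
# Hypothetical helper: top edge (in pixels) of a crop that places the eyes on
# the top-third line, clamped so the crop stays inside the image.
# Usage: crop_top <eye_y> <crop_height> <image_height>
crop_top() {
  top=$(( $1 - $2 / 3 ))
  max_top=$(( $3 - $2 ))
  if [ "$top" -gt "$max_top" ]; then top=$max_top; fi
  if [ "$top" -lt 0 ]; then top=0; fi
  echo "$top"
}

crop_top 400 900 1600    # 100
crop_top 100 900 1600    # 0 (clamped at the top edge)
crop_top 1500 900 1600   # 700 (clamped at the bottom edge)
```

When the result is clamped, the eyes no longer sit on the one-third line, which is a hint that the photo needs more room above or below the subject.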

Common Mistakes

Mistake	Problem	Fix
Low-res portrait	Blurry face, poor lipsync	Use 1024x1024+ face region
Profile/side angle	Lipsync can't track mouth well	Use frontal or near-frontal
Noisy audio	Lipsync drifts, looks unnatural	Use built-in TTS or record clean audio
Too-long clips	Quality degrades	Split into segments, stitch
Sunglasses/obstruction	Face features hidden	Clear face required
Inconsistent lighting	Uncanny when animated	Even, soft lighting

Related Skills

# Dedicated P-Video-Avatar skill
npx skills add inference-sh/skills@p-video-avatar

# All avatar models
npx skills add inference-sh/skills@ai-avatar-video

# All video generation models
npx skills add inference-sh/skills@ai-video-generation

# Text-to-speech
npx skills add inference-sh/skills@text-to-speech

# Image generation (for portraits)
npx skills add inference-sh/skills@ai-image-generation

Browse all apps: belt app store