gemini-multimodal

This skill should be used when the user asks to analyze a video, process images, transcribe audio, read or summarize a PDF, extract text from a screenshot, convert a diagram to code, or perform any visual analysis. Relevant when the user says "transcribe this audio file," "what's in this video," or "turn this diagram into code."

Stars: 3

Forks: 0

Last updated: April 2, 2026 at 17:01

Source

NaluForge

NaluForge/geminicli-cc-plugin

Useful for (SOC)

Software Developers · Computer and Mathematical Occupations · 15-1252 · L4

SKILL.md

name	gemini-multimodal
description	This skill should be used when the user asks to analyze a video, process images, transcribe audio, read or summarize a PDF, extract text from a screenshot, convert a diagram to code, or perform any visual analysis. Relevant when the user says "transcribe this audio file," "what's in this video," or "turn this diagram into code."

Multimodal Processing with Gemini

Invoke via the /gemini-media command or the mcp__gemini__gemini_execute tool, including the file paths in the prompt.
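A minimal sketch of what "file paths in the prompt" can look like: the helper below appends each file as an absolute path under the task description. The name `build_media_prompt` and the prompt layout are hypothetical, and the argument schema of `mcp__gemini__gemini_execute` is not described here, so treat the returned string only as the prompt text to pass along.

```python
from pathlib import Path


def build_media_prompt(task: str, paths: list[str]) -> str:
    """Hypothetical helper: list each file as an absolute path below the task text."""
    if not paths:
        raise ValueError("at least one file path is required")
    lines = [task.strip(), "", "Files:"]
    # Resolve to absolute paths so the prompt does not depend on the working directory.
    lines += [f"- {Path(p).resolve()}" for p in paths]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_media_prompt("Summarize this video with timestamps.", ["demo.mp4"]))
```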

Supported Media Types

Gemini can directly process:

Video: MP4, WebM, MOV — analysis, summarization, scene detection
Audio: MP3, WAV, FLAC — transcription, speaker detection, analysis
Images: PNG, JPG, WebP, GIF — OCR, analysis, diagram interpretation
PDFs: Multi-page document analysis, extraction, summarization
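The list above can be expressed as an extension lookup, which is handy for routing a file before choosing parameters. This is a sketch: `MEDIA_KINDS` and `media_kind` are hypothetical names, and `.jpeg` is included as an assumed alias for JPG.

```python
from pathlib import Path
from typing import Optional

# Extension -> media kind, transcribed from the "Supported Media Types" list.
MEDIA_KINDS = {
    ".mp4": "video", ".webm": "video", ".mov": "video",
    ".mp3": "audio", ".wav": "audio", ".flac": "audio",
    ".png": "image", ".jpg": "image", ".jpeg": "image",  # .jpeg assumed alias
    ".webp": "image", ".gif": "image",
    ".pdf": "pdf",
}


def media_kind(path: str) -> Optional[str]:
    """Return 'video', 'audio', 'image', 'pdf', or None if the extension is not listed."""
    # Extensions are compared case-insensitively, so "clip.MOV" counts as video.
    return MEDIA_KINDS.get(Path(path).suffix.lower())


if __name__ == "__main__":
    for name in ["talk.MOV", "call.flac", "scan.jpeg", "spec.pdf", "notes.txt"]:
        print(name, "->", media_kind(name))
```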

Pre-Flight Validation

Before sending files to Gemini:

Verify files exist: Use Glob or Read to confirm all paths are valid
Check file sizes: Very large files (>1GB video) may need segmenting
Confirm file type: Verify the extension matches expected content
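The three checks can be combined into one hypothetical `preflight` function. It assumes that "confirm file type" means comparing the extension with the file's leading signature bytes, and it reads ">1GB" as 1 GiB. The signature table is a heuristic: `ftyp` also appears in M4A audio and HEIC images, and some older MOV files lack it.

```python
from pathlib import Path
from typing import Optional

# ">1GB" from the guidance above, read here as 1 GiB (assumption).
LARGE_VIDEO_BYTES = 1 << 30

# Extension -> expected media kind, from the "Supported Media Types" list.
EXPECTED_KIND = {
    ".mp4": "video", ".mov": "video", ".webm": "video",
    ".mp3": "audio", ".wav": "audio", ".flac": "audio",
    ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".gif": "image", ".webp": "image", ".pdf": "pdf",
}


def sniff_kind(head: bytes) -> Optional[str]:
    """Guess the media kind from well-known leading signature bytes (heuristic)."""
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith((b"\x89PNG\r\n\x1a\n", b"\xff\xd8\xff", b"GIF87a", b"GIF89a")):
        return "image"  # PNG, JPEG, GIF
    if head[:4] == b"RIFF" and head[8:12] == b"WEBP":
        return "image"  # WebP
    if head[:4] == b"RIFF" and head[8:12] == b"WAVE":
        return "audio"  # WAV
    if head.startswith((b"fLaC", b"ID3")):
        return "audio"  # FLAC, or MP3 with an ID3 tag
    if len(head) >= 2 and head[0] == 0xFF and (head[1] & 0xE0) == 0xE0:
        return "audio"  # bare MPEG audio frame sync (MP3 without ID3)
    if head[4:8] == b"ftyp":
        return "video"  # ISO base media (MP4/MOV); M4A and HEIC also match
    if head.startswith(b"\x1a\x45\xdf\xa3"):
        return "video"  # EBML header used by WebM
    return None


def preflight(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file passed these checks."""
    p = Path(path)
    if not p.is_file():
        return [f"{path}: file not found"]
    problems = []
    expected = EXPECTED_KIND.get(p.suffix.lower())
    if expected is None:
        problems.append(f"{path}: unsupported extension {p.suffix!r}")
    with p.open("rb") as f:
        actual = sniff_kind(f.read(16))
    # Only flag a definite mismatch; unknown signatures are let through.
    if expected and actual and actual != expected:
        problems.append(f"{path}: extension says {expected}, content looks like {actual}")
    if expected == "video" and p.stat().st_size > LARGE_VIDEO_BYTES:
        problems.append(f"{path}: larger than 1 GiB, consider splitting into segments")
    return problems
```

Because unrecognized signatures are not reported, the check errs on the side of letting a file through rather than blocking an unusual but valid one.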

Parameter Selection by Media Type

Media	Model	Timeout (ms)	Notes
Video (long)	pro	2400000	Complex temporal analysis
Video (short)	flash	300000	Quick extraction
Audio (long)	pro	2400000	Full transcription
Audio (short)	flash	300000	Quick transcription
Images	flash	300000	Most image tasks are fast
Complex diagrams	pro	300000	Architecture, flowcharts
PDFs (long)	pro	2400000	Multi-page analysis
PDFs (short)	flash	300000	Quick extraction
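The table can be encoded as a lookup so the choice is made consistently. This is a sketch: it assumes the timeout values are milliseconds (2,400,000 ms is 40 minutes and 300,000 ms is 5 minutes), and it leaves the definition of "long" to the caller because the table gives no threshold. `GeminiParams` and `select_params` are hypothetical names.

```python
from typing import NamedTuple


class GeminiParams(NamedTuple):
    model: str       # "pro" or "flash", as named in the table
    timeout_ms: int  # table values read as milliseconds (assumption)


# (media, is_long) -> parameters, transcribed from the table.
PARAMS = {
    ("video", True): GeminiParams("pro", 2_400_000),
    ("video", False): GeminiParams("flash", 300_000),
    ("audio", True): GeminiParams("pro", 2_400_000),
    ("audio", False): GeminiParams("flash", 300_000),
    ("pdf", True): GeminiParams("pro", 2_400_000),
    ("pdf", False): GeminiParams("flash", 300_000),
}


def select_params(media: str, is_long: bool = False, complex_diagram: bool = False) -> GeminiParams:
    """Pick model and timeout for a media kind; the caller decides what counts as long."""
    if media == "image":
        # Complex diagrams (architecture, flowcharts) use pro; other images use flash.
        return GeminiParams("pro" if complex_diagram else "flash", 300_000)
    try:
        return PARAMS[(media, is_long)]
    except KeyError:
        raise ValueError(f"unsupported media kind: {media!r}") from None
```

Keeping images outside the `(media, is_long)` table mirrors the source, where image rows are split by complexity rather than length.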

Output Structure by Media Type

Video

Include timestamps: "At 2:34, the speaker discusses..."
Reference visual elements: "The diagram shown at 5:12 illustrates..."
For long videos, provide a timeline summary first, then details
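Timestamps in the "At 2:34" style come from converting second offsets into minutes and seconds. The sketch below uses hypothetical `format_timestamp` and `timeline` helpers; it switches to H:MM:SS past the hour mark (an assumed convention) and sorts events so a timeline summary can precede the details.

```python
def format_timestamp(seconds: float) -> str:
    """Format an offset as M:SS, or H:MM:SS once it reaches an hour."""
    total = int(seconds)  # fractional seconds are truncated
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


def timeline(events: list[tuple[float, str]]) -> str:
    """Render (offset_seconds, description) pairs as a timeline, earliest first."""
    return "\n".join(f"{format_timestamp(t)}  {text}" for t, text in sorted(events))
```

For example, an event at 154 seconds renders as `2:34` and one at 312 seconds as `5:12`, matching the sample phrasing above.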

Audio

Include timestamps for key moments
Attribute speakers when possible: "Speaker A (likely the interviewer)..."
Note audio quality issues that may affect accuracy

Images

Use spatial references: "In the top-right corner...", "The second row..."
For diagrams, describe the structure before details
For screenshots, identify UI elements and their state

PDFs

Reference page numbers: "On page 3, section 2.1..."
For tables, describe structure and key data points
For forms, list fields and their values

Combining with Code Context

Multimodal analysis often feeds into code work:

Screenshot → identify UI components → generate code
Architecture diagram → map to file structure → verify alignment
Error screenshot → identify the error → find relevant code
PDF spec → extract requirements → plan implementation

More from this repository

Same repository

gemini-code-review

NaluForge/geminicli-cc-plugin

This skill should be used when the user asks to review code changes, review a PR, get a second opinion on a diff, cross-check modifications, or perform a security review using Gemini. Relevant when the user says "review this PR with Gemini," "check my diff for bugs," or "get a second opinion on these changes."

2026-04-02 · Stars: 3

gemini-large-analysis

NaluForge/geminicli-cc-plugin

This skill should be used when the user asks to analyze a large codebase, audit a full repository, review architecture across many files, map dependencies, or plan a migration involving more code than fits in Claude's context. Relevant when the user says "analyze this entire repo," "audit the architecture," or "this codebase is too big to review."

2026-04-02 · Stars: 3

gemini-research

NaluForge/geminicli-cc-plugin

This skill should be used when the user asks to search for current information, find the latest version of a package, check for security advisories, or look up recent documentation. Relevant when the user says "what's the latest version of X," "are there any CVEs for Y," or "search for the current docs on Z."

2026-04-02 · Stars: 3
