一键在 Manus 中运行任何 Skill

kimicode-vision-bridge

星标20

分支6

更新时间2026年5月11日 07:06

Use when the current Agent LLM cannot process images directly and visual analysis is needed — bridges images through KimiCode CLI print mode to a multimodal Kimi model for text description

安装

用 Codex 或 Claude 帮你安装复制这段 Prompt，粘贴到 Codex、Claude 或其他助手里，让它检查 Skill 页面并帮你完成安装。

在 Manus 中运行

来源

Dqz00116

Dqz00116/skill-lib

打开 GitHub 仓库查看创作者相关仓库

下载

在 Manus 中运行

KimiCode Vision Bridge

Overview

Routes images through KimiCode CLI's print mode (kimi --print) to give a non-vision Agent LLM visual understanding. Kimi is a multimodal model; print mode provides a non-interactive, programmatic pipe — the Agent submits an image + question via CLI, Kimi returns a text description on stdout.

Agent (no vision)                KimiCode CLI                  Kimi multimodal model
     │                                │                              │
     │  image + question              │                              │
     ├───────────────────────────────→│                              │
     │  kimi --print -p "..."         │                              │
     │                                ├─────────────────────────────→│
     │                                │     API call with image      │
     │                                │←─────────────────────────────┤
     │                                │     text response            │
     │←───────────────────────────────┤                              │
     │  text on stdout                │                              │
     ▼                                ▼                              ▼

Print mode docs: https://www.kimi-cli.com/en/customization/print-mode.html

When to Use

Use this skill when:

The current Agent LLM cannot process images directly (no vision capability)
An image needs to be read or analyzed — screenshots, photos, diagrams, charts, document scans, UI mockups
KimiCode CLI is installed and configured with a vision-capable model on the system

Do NOT use when:

The Agent already has native image input — this skill adds an unnecessary hop
The task is pure text analysis with no visual component
KimiCode is not installed or lacks a vision model

Prerequisites (MUST verify before proceeding)

Both checks must pass. If either fails, stop and report the failure.

Check 1: KimiCode CLI (`kimi`) is installed

Windows (PowerShell):

Get-Command kimi -ErrorAction SilentlyContinue | Select-Object -ExpandProperty Source

macOS / Linux:

which kimi || command -v kimi

Result	Action
Found	Pass
Not found	Fail. "KimiCode CLI not found. Install from https://www.kimi-cli.com"

Check 2: A multimodal (vision-capable) model is configured

Locate the config file and verify an API key and a vision model are set.

Config locations (try in order):

Windows	macOS	Linux
`%APPDATA%\kimi\config.json`	`~/Library/Application Support/kimi/config.json`	`~/.config/kimi/config.json`
`%USERPROFILE%\.kimi\config.json`	`~/.kimi/config.json`	`~/.kimi/config.json`

Verify:

A non-empty API key (fields: apiKey, token, api_key, or under providers)
A model name (fields: model, defaultModel) — must be vision-capable
If config uses env vars (e.g., $KIMI_API_KEY), verify those are set

Result	Action
API key + vision model set	Pass
API key missing	Fail. "No API key configured. Run `kimi config set` or edit the config file."
Model missing or text-only	Fail. "No vision model configured. Set a multimodal model via `kimi config set model <name>` (e.g., moonshot-v1-vision, gpt-4o, claude-3.5-sonnet)."

The Pipe: Core Workflow

Once prerequisites pass, the bridge has two steps.

Step A: Submit image + question to Kimi via print mode

The Agent already has an image (from the user, from a file, from a prior tool call). Combine it with a clear instruction and pipe it through kimi --print.

Approach 1 — File reference (recommended)

Point Kimi at the image file on disk. KimiCode's file-read tool loads and analyzes it.

kimi --quiet -p "Describe every visible element in this image in detail: /path/to/image.png"

Approach 2 — Inline base64 via JSONL stdin (fully programmatic)

Pipe the image as base64 directly — no temp file needed.

BASE64=$(base64 -w 0 /path/to/image.png)
echo "{\"role\":\"user\",\"content\":[{\"type\":\"image_url\",\"image_url\":{\"url\":\"data:image/png;base64,$BASE64\"}},{\"type\":\"text\",\"text\":\"Describe every visible element in this image in detail.\"}]}" \
  | kimi --print --input-format=stream-json --output-format=stream-json

Windows PowerShell equivalent:

$base64 = [Convert]::ToBase64String([IO.File]::ReadAllBytes("C:\path\to\image.png"))
$msg = '{"role":"user","content":[{"type":"image_url","image_url":{"url":"data:image/png;base64,' + $base64 + '"}},{"type":"text","text":"Describe every visible element in this image in detail."}]}'
$msg | kimi --print --input-format=stream-json --output-format=stream-json

Effective prompts for the image question:

Scenario	Prompt
Generic	"Describe every visible element — text, layout, colors, positions, any errors or warnings."
UI/dialog	"Read every label, button text, input field value, and error message in this screenshot."
Code	"Read all visible code including line numbers, syntax highlighting, and any squiggly/error indicators."
Chart/graph	"Describe the chart type, axes labels, data ranges, trend direction, and any annotations."
Document	"Transcribe the visible text exactly. Note any formatting (bold, headers, tables)."

Step B: Read the text result and feed into Agent context

Flags	Output format	How to parse
`--quiet`	Plain text, final message only	Read directly
`--output-format=stream-json`	JSONL, one JSON per line	`grep '"role":"assistant"' \| tail -1 \| jq -r '.content'`

Inject into Agent context:

[Vision bridge: KimiCode print mode]
<text from stdout>
[End vision bridge]

Complete pipeline example

IMAGE="/path/to/image.png"
RESULT=$(kimi --quiet -p "Describe every visible element in $IMAGE in detail" 2>/dev/null)
echo "[Vision bridge: KimiCode print mode]"
echo "$RESULT"
echo "[End vision bridge]"

Image Sources

The bridge works with any image the Agent can reference. Common sources:

User-provided file path — kimi --quiet -p "Describe ~/Downloads/screenshot.png"
Screenshot captured on-the-fly — capture with OS tool first, then feed the saved file (macOS: screencapture, Windows: PowerShell GDI+, Linux: ImageMagick/gnome-screenshot)
Image from a prior tool call — pass the path from a previous download/generation step
Clipboard image — save to temp file first, then bridge

Screenshot capture is not part of the bridge skill itself — use platform-native tools.

Iterative Refinement

Narrow the question: "Focus only on reading every text string in the dialog box."
Compare states: Submit two images and ask "Compare these two screenshots and describe what changed."
Crop and retry: Crop to a sub-region with OS tools and re-submit

Common Mistakes

Mistake	Fix
Using a text-only model	Run `kimi config set model <vision-model>` to switch
Image path contains spaces or special chars	Wrap path in quotes or use a temp file with a simple name
Expecting KimiCode to take screenshots	Use OS tools; KimiCode is the analysis pipe, not the capture tool
Forgetting to verify prerequisites first	Always run Check 1 and Check 2 before attempting the bridge
Not checking exit codes	0=success, 1=permanent error (auth/config), 75=transient (rate limit, retry)

Error Recovery

Symptom	Likely cause	Action
`kimi: command not found`	CLI not on PATH	Install from https://www.kimi-cli.com
Exit code 75	Rate limit / transient	Wait 10s, retry
Exit code 1	Auth or config error	Run `kimi config` to check API key and model
Empty or nonsensical output	Model may not be vision-capable	Verify the model supports multimodal input
"Image input not supported" in output	Model is text-only	Switch to a vision model in KimiCode config
Base64 too large for stdin	Image too big	Use file-reference approach (Approach 1)
Output appears truncated	Long response	Use `--output-format=stream-json` for reliable capture

name	kimicode-vision-bridge
description	Use when the current Agent LLM cannot process images directly and visual analysis is needed — bridges images through KimiCode CLI print mode to a multimodal Kimi model for text description
version	1

kimicode-vision-bridge

同仓库更多 Skills

KimiCode Vision Bridge

Overview

When to Use

Prerequisites (MUST verify before proceeding)

Check 1: KimiCode CLI (kimi) is installed

Check 2: A multimodal (vision-capable) model is configured

The Pipe: Core Workflow

Step A: Submit image + question to Kimi via print mode

Step B: Read the text result and feed into Agent context

Complete pipeline example

Image Sources

Iterative Refinement

Common Mistakes

Error Recovery

KimiCode Vision Bridge

Overview

When to Use

Prerequisites (MUST verify before proceeding)

Check 1: KimiCode CLI (kimi) is installed

Check 2: A multimodal (vision-capable) model is configured

The Pipe: Core Workflow

Step A: Submit image + question to Kimi via print mode

Step B: Read the text result and feed into Agent context

Complete pipeline example

Image Sources

Iterative Refinement

Common Mistakes

Error Recovery

同仓库更多 Skills

Check 1: KimiCode CLI (`kimi`) is installed

Check 1: KimiCode CLI (`kimi`) is installed