Run any Skill in Manus with one click

transcribe-audio

Transcribe a local audio (or video) file to text with timestamps using Whisper. Defaults to local whisper.cpp (free, offline, word/segment-level timestamps) and automatically falls back to the OpenAI Whisper API (whisper-1) when the local engine is not installed. Any agent can call it for meetings, podcasts, voice notes, interviews, or video audio tracks.

Run Skill in Manus

Stars77

Forks21

UpdatedJune 7, 2026 at 00:28

Source

stevehuang0115

stevehuang0115/crewly

View GitHub Repository View Creator Repositories

Install command

Download

Run Skill in Manus

File Explorer

4 files

SKILL.md

readonly

name	transcribe-audio
description	Transcribe a local audio (or video) file to text with timestamps using Whisper. Defaults to local whisper.cpp (free, offline, word/segment-level timestamps) and automatically falls back to the OpenAI Whisper API (whisper-1) when the local engine is not installed. Any agent can call it for meetings, podcasts, voice notes, interviews, or video audio tracks.
category	content
assignableRoles	["*"]
version	1.0.0
tags	["audio","transcribe","transcription","whisper","whisper.cpp","openai","speech-to-text","timestamps"]

Transcribe Audio (Whisper)

Transcribe a local audio or video file into text with timestamps. This is the general-purpose Whisper-based speech-to-text skill — distinct from xiaoyuzhoufm-transcript (which is the Gemini-based podcast/video research skill; this skill does not replace it).

Engines

Engine	When used	Cost	Network	Timestamps
whisper.cpp (local, `large-v3-turbo`)	Default — used whenever the binary + model are present	Free	Offline	Segment + word-level
OpenAI Whisper API (`whisper-1`)	Fallback — used when local is unavailable, or forced via `engine:"openai"`	~$0.006/min	Required	Segment + word-level

Default = auto: prefer local whisper.cpp (free, private, offline, no per-use cost — ideal for unattended agents), and transparently fall back to the OpenAI API when the local engine is not installed. This gives zero-cost transcription where the local model exists and zero-install reliability everywhere else. Force a specific engine with engine:"local" or engine:"openai".

Usage

# Auto engine (local if available, else OpenAI). Prints JSON to stdout.
bash execute.sh '{"audioFile":"/path/to/recording.m4a"}'

# Save as Markdown (timestamped transcript)
bash execute.sh '{"audioFile":"/path/to/recording.mp3","outputFile":"./transcript.md"}'

# Save as JSON (full structured segments)
bash execute.sh '{"audioFile":"/path/to/recording.wav","outputFile":"./transcript.json"}'

# Language hint (ISO-639-1, e.g. en / zh / ja). Default: auto-detect
bash execute.sh '{"audioFile":"/path/to/recording.m4a","language":"zh"}'

# Force a specific engine
bash execute.sh '{"audioFile":"/path/to/x.wav","engine":"local"}'
bash execute.sh '{"audioFile":"/path/to/x.wav","engine":"openai"}'

# A video file works too — the audio track is extracted automatically
bash execute.sh '{"audioFile":"/path/to/clip.mp4"}'

Parameters

Parameter	Required	Description
`audioFile`	Yes	Path to a local audio or video file (m4a, mp3, wav, aac, ogg, flac, mp4, mov…)
`outputFile`	No	Save the transcript to this path. `.json` → structured JSON; anything else → Markdown
`language`	No	Language hint (ISO-639-1). Default: auto-detect
`engine`	No	`auto` (default), `local`, or `openai`

Output (stdout JSON)

{
  "success": true,
  "engine": "whisper.cpp",
  "language": "en",
  "durationSec": 11.0,
  "text": "Full transcript as one block of text...",
  "segments": [
    { "start": 0.0, "end": 2.5, "text": "Hello everyone", "speaker": "Unknown" }
  ],
  "segmentCount": 1,
  "outputFile": "/abs/path/transcript.md"
}

speaker is "Unknown" unless the chosen engine returns diarization (neither default engine performs speaker separation today; the field is reserved so downstream consumers have a stable shape).

Dependencies & Setup

ffmpeg — required for both engines (audio is normalized to 16 kHz mono WAV). Install: brew install ffmpeg.
Local engine (whisper.cpp) — needs the whisper-cli binary and a model file:
- Binary: brew install whisper-cpp (provides whisper-cli). Override with FLOPOST_WHISPER_BIN.
- Model: ggml-large-v3-turbo-q5_0.bin in ~/.flopost/whisper/ or ~/.cache/whisper-models/. Override with FLOPOST_WHISPER_MODEL.
- If the binary or model is missing, the skill falls back to OpenAI (or reports a clear hint when engine:"local" is forced).
OpenAI engine — needs an OpenAI API key. Resolution order:
1. OPENAI_API_KEY environment variable (injected by Crewly secrets).
2. Crewly Settings → API Keys (GET http://localhost:8787/api/settings → data.apiKeys.global.openai). No key is ever hard-coded.

Notes

Engine detection and key resolution lift the proven logic from Flopost (desktop/server/whisperModule.ts and service/lib/video/transcriptionService.ts).
Long files: whisper.cpp handles arbitrary length locally. The OpenAI API enforces a 25 MB upload limit; for larger files prefer the local engine.

More from this repository

same repository

accept-task

stevehuang0115/crewly

Accept and take the next available task from the task queue. Use when an agent is idle and ready to pick up the highest-priority unassigned task. For assigning tasks to specific agents, use assign-task instead.

2026-06-0677

stevehuang0115/crewly

Register the agent as active with the Crewly backend on startup. Use when an agent first launches and needs to join the team and become visible to the orchestrator. For confirming responsiveness after registration, use heartbeat instead.

2026-06-0677

code-review

stevehuang0115/crewly

Analyze git changes and produce a structured code review with automated checks for missing tests, debug statements, potential secrets, large changes, and dependency modifications. Use when reviewing staged changes, unstaged diffs, recent commits, or branch comparisons before submitting a pull request. For task-level quality gates, use check-quality-gates instead.

2026-06-0677

assign-task

stevehuang0115/crewly

Assign a task to a specific agent via the task management system. Use when the orchestrator needs to target a particular agent with a known task ID. For agents self-assigning the next available task, use accept-task instead.

2026-06-0677

broadcast

stevehuang0115/crewly

Send a message to all active agent sessions, excluding the orchestrator. Use when the orchestrator needs to announce information or coordinate the entire team at once. For messaging a single agent, use send-message instead.

2026-06-0677

record-learning

stevehuang0115/crewly

Record a learning or insight gained during task execution for team knowledge sharing.

2026-06-0677

name	transcribe-audio
description	Transcribe a local audio (or video) file to text with timestamps using Whisper. Defaults to local whisper.cpp (free, offline, word/segment-level timestamps) and automatically falls back to the OpenAI Whisper API (whisper-1) when the local engine is not installed. Any agent can call it for meetings, podcasts, voice notes, interviews, or video audio tracks.
category	content
assignableRoles	["*"]
version	1.0.0
tags	["audio","transcribe","transcription","whisper","whisper.cpp","openai","speech-to-text","timestamps"]

Transcribe Audio (Whisper)

Engines

Engine	When used	Cost	Network	Timestamps
whisper.cpp (local, `large-v3-turbo`)	Default — used whenever the binary + model are present	Free	Offline	Segment + word-level
OpenAI Whisper API (`whisper-1`)	Fallback — used when local is unavailable, or forced via `engine:"openai"`	~$0.006/min	Required	Segment + word-level

Usage

# Auto engine (local if available, else OpenAI). Prints JSON to stdout.
bash execute.sh '{"audioFile":"/path/to/recording.m4a"}'

# Save as Markdown (timestamped transcript)
bash execute.sh '{"audioFile":"/path/to/recording.mp3","outputFile":"./transcript.md"}'

# Save as JSON (full structured segments)
bash execute.sh '{"audioFile":"/path/to/recording.wav","outputFile":"./transcript.json"}'

# Language hint (ISO-639-1, e.g. en / zh / ja). Default: auto-detect
bash execute.sh '{"audioFile":"/path/to/recording.m4a","language":"zh"}'

# Force a specific engine
bash execute.sh '{"audioFile":"/path/to/x.wav","engine":"local"}'
bash execute.sh '{"audioFile":"/path/to/x.wav","engine":"openai"}'

# A video file works too — the audio track is extracted automatically
bash execute.sh '{"audioFile":"/path/to/clip.mp4"}'

Parameters

Parameter	Required	Description
`audioFile`	Yes	Path to a local audio or video file (m4a, mp3, wav, aac, ogg, flac, mp4, mov…)
`outputFile`	No	Save the transcript to this path. `.json` → structured JSON; anything else → Markdown
`language`	No	Language hint (ISO-639-1). Default: auto-detect
`engine`	No	`auto` (default), `local`, or `openai`

Output (stdout JSON)

{
  "success": true,
  "engine": "whisper.cpp",
  "language": "en",
  "durationSec": 11.0,
  "text": "Full transcript as one block of text...",
  "segments": [
    { "start": 0.0, "end": 2.5, "text": "Hello everyone", "speaker": "Unknown" }
  ],
  "segmentCount": 1,
  "outputFile": "/abs/path/transcript.md"
}

speaker is "Unknown" unless the chosen engine returns diarization (neither default engine performs speaker separation today; the field is reserved so downstream consumers have a stable shape).

Dependencies & Setup

ffmpeg — required for both engines (audio is normalized to 16 kHz mono WAV). Install: brew install ffmpeg.
Local engine (whisper.cpp) — needs the whisper-cli binary and a model file:
- Binary: brew install whisper-cpp (provides whisper-cli). Override with FLOPOST_WHISPER_BIN.
- Model: ggml-large-v3-turbo-q5_0.bin in ~/.flopost/whisper/ or ~/.cache/whisper-models/. Override with FLOPOST_WHISPER_MODEL.
- If the binary or model is missing, the skill falls back to OpenAI (or reports a clear hint when engine:"local" is forced).
OpenAI engine — needs an OpenAI API key. Resolution order:
1. OPENAI_API_KEY environment variable (injected by Crewly secrets).
2. Crewly Settings → API Keys (GET http://localhost:8787/api/settings → data.apiKeys.global.openai). No key is ever hard-coded.

Notes

Engine detection and key resolution lift the proven logic from Flopost (desktop/server/whisperModule.ts and service/lib/video/transcriptionService.ts).
Long files: whisper.cpp handles arbitrary length locally. The OpenAI API enforces a 25 MB upload limit; for larger files prefer the local engine.