doubao-asr

Transcribe audio files to text using Volcengine Doubao (豆包) Big-Model ASR 2.0 with word-level timestamps


Install command

npx skills add https://github.com/agentrix-ai/skills --skill doubao-asr

Copy and paste this command into Claude Code to install the skill

Source: agentrix-ai/skills
Stars: 1
Forks: 0
Updated: March 17, 2026 at 16:43


name: doubao-asr
description: Transcribe audio files to text using Volcengine Doubao (豆包) Big-Model ASR 2.0 with word-level timestamps
metadata: {"openclaw":{"emoji":"👂","requires":{"bins":["python3"],"env":["VOLCENGINE_TTS_APPID","VOLCENGINE_TTS_TOKEN"]},"primaryEnv":"VOLCENGINE_TTS_TOKEN"}}

Doubao ASR (豆包录音文件识别 2.0)

Transcribe audio files to text using Volcengine Doubao Big-Model ASR. Supports word-level timestamps, punctuation, ITN (inverse text normalization), speaker diarization, and channel splitting.

Workflow

Submit — Upload audio (file or URL) to start transcription
Query — Poll for results (small files return instantly; large files are async)
Get Results — Full text + word-level timestamps + optional speaker info
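The submit-then-poll flow above can be sketched as a small loop. This is a minimal illustration, not the logic of `scripts/asr_transcribe.py`: `transcribe`, `submit`, `query`, and `interval` are hypothetical names, and the two callables stand in for the real HTTP requests.

```python
import time
from typing import Any, Callable, Optional


def transcribe(
    submit: Callable[[], str],
    query: Callable[[str], Optional[Any]],
    max_wait: float = 120.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Submit once, then poll until a result arrives or max_wait is used up.

    submit() returns a task id; query(task_id) returns None while the task
    is still pending and the result once it is ready. Both are placeholders
    (assumptions) for the real HTTP calls.
    """
    task_id = submit()
    waited = 0.0
    while True:
        result = query(task_id)
        if result is not None:
            return result
        if waited >= max_wait:
            raise TimeoutError(f"task {task_id} not finished after {max_wait}s")
        sleep(interval)
        waited += interval
```

Passing `sleep` as a parameter keeps the loop easy to exercise without real delays; a small file would simply return on the first `query` call.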

Quick Start

# Transcribe a local audio file
python3 scripts/asr_transcribe.py --audio recording.mp3

# Transcribe from URL
python3 scripts/asr_transcribe.py --url "https://example.com/audio.wav"

# Save transcription to file
python3 scripts/asr_transcribe.py --audio meeting.mp3 --output transcript.txt

# With speaker diarization
python3 scripts/asr_transcribe.py --audio meeting.mp3 --speakers

API

Transcribe

Submit: POST https://openspeech.bytedance.com/api/v3/auc/bigmodel/submit
Query: POST https://openspeech.bytedance.com/api/v3/auc/bigmodel/query

Script: scripts/asr_transcribe.py

Parameters:

| Param | Flag | Required | Default | Description |
| --- | --- | --- | --- | --- |
| audio | `--audio` | Yes* | - | Path to local audio file |
| url | `--url` | Yes* | - | URL to audio file |
| output | `--output` | No | - | Save transcript to text file |
| format | `--format` | No | auto-detect | Audio format: `mp3`, `wav`, `ogg`, `m4a`, `flac`, `aac`, `amr`, `pcm` |
| speakers | `--speakers` | No | false | Enable speaker diarization |
| channels | `--channels` | No | false | Enable channel splitting |
| words | `--words` | No | false | Show word-level timestamps |
| max-wait | `--max-wait` | No | `120` | Max seconds to wait for async tasks |

*One of `--audio` or `--url` is required.
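For readers who want to see how these flags fit together, here is a sketch of an equivalent `argparse` definition. It is a reconstruction from the table, not the script's actual source; `build_parser` is a hypothetical name, and treating `--audio` and `--url` as mutually exclusive is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="asr_transcribe.py")
    # Assumption: exactly one input source is accepted per run.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--audio", help="Path to local audio file")
    source.add_argument("--url", help="URL to audio file")
    parser.add_argument("--output", help="Save transcript to text file")
    parser.add_argument("--format", help="Audio format; auto-detected if omitted")
    parser.add_argument("--speakers", action="store_true", help="Enable speaker diarization")
    parser.add_argument("--channels", action="store_true", help="Enable channel splitting")
    parser.add_argument("--words", action="store_true", help="Show word-level timestamps")
    parser.add_argument("--max-wait", type=int, default=120, help="Max seconds to wait for async tasks")
    return parser
```

With this definition, `build_parser().parse_args(["--audio", "meeting.mp3", "--speakers"])` yields `speakers=True` and leaves `max_wait` at its default of 120.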

Examples:

# Basic transcription
python3 scripts/asr_transcribe.py --audio podcast.mp3

# Show word-level timestamps
python3 scripts/asr_transcribe.py --audio speech.wav --words

# Multi-speaker meeting with output file
python3 scripts/asr_transcribe.py --audio meeting.mp3 --speakers --output meeting.txt

# From URL with channel splitting
python3 scripts/asr_transcribe.py --url "https://example.com/call.wav" --channels

# Custom audio format
python3 scripts/asr_transcribe.py --audio recording.raw --format pcm

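Success output (音频时长 is the audio duration and 识别结果 is the recognized text; the sample sentence reads "Hello, I am Doubao speech synthesis. The weather is lovely today, let's go out for a walk."):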

音频时长: 5496ms
识别结果: 你好，我是豆包语音合成，今天天气真不错，一起出去走走吧。

With `--words` (逐字时间戳 is the per-character timestamp list):

音频时长: 5496ms
识别结果: 你好，我是豆包语音合成，今天天气真不错，一起出去走走吧。

逐字时间戳:
  [0.19s-0.35s] 你
  [0.35s-0.71s] 好
  [0.71s-0.99s] 我
  ...
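If the printed timestamps need to be consumed by another tool, the lines can be parsed back into numbers. The sketch below assumes the exact `[start-end] text` layout shown in the sample output; `TimedToken` and `parse_timestamp_line` are hypothetical names, not part of the skill.

```python
import re
from typing import NamedTuple, Optional


class TimedToken(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    text: str


# Matches lines such as "  [0.19s-0.35s] 你".
_LINE = re.compile(r"^\s*\[(\d+(?:\.\d+)?)s-(\d+(?:\.\d+)?)s\]\s*(.*)$")


def parse_timestamp_line(line: str) -> Optional[TimedToken]:
    """Return the parsed token, or None if the line is not a timestamp line."""
    match = _LINE.match(line)
    if match is None:
        return None
    start, end, text = match.groups()
    return TimedToken(float(start), float(end), text)
```

Lines that do not match, such as the duration header or the `...` elision, come back as `None`, so a caller can filter the whole output in one pass.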

Request Parameters (Advanced)

These can be set in the request object when calling the API directly:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model_name` | string | `bigmodel` | Model name |
| `model_version` | string | `400` | Model version |
| `enable_itn` | bool | `true` | Inverse text normalization (numbers → digits) |
| `enable_punc` | bool | `true` | Automatic punctuation |
| `enable_ddc` | bool | `true` | Spoken language normalization |
| `show_utterances` | bool | `true` | Return sentence-level segments |
| `enable_channel_split` | bool | `false` | Split audio channels (for stereo) |
| `enable_speaker_info` | bool | `false` | Speaker diarization |
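As an illustration of overriding these defaults, the sketch below builds only the `request` object. How that object nests inside the full JSON body is not shown in this document, so that part is left out; `DEFAULT_REQUEST` and `build_request` are hypothetical names.

```python
from typing import Any, Dict

# Defaults copied from the table above.
DEFAULT_REQUEST: Dict[str, Any] = {
    "model_name": "bigmodel",
    "model_version": "400",
    "enable_itn": True,
    "enable_punc": True,
    "enable_ddc": True,
    "show_utterances": True,
    "enable_channel_split": False,
    "enable_speaker_info": False,
}


def build_request(**overrides: Any) -> Dict[str, Any]:
    """Return the documented defaults with any overrides applied."""
    unknown = set(overrides) - set(DEFAULT_REQUEST)
    if unknown:
        # Reject typos early instead of sending an unrecognized field.
        raise ValueError(f"unknown request parameters: {sorted(unknown)}")
    return {**DEFAULT_REQUEST, **overrides}
```

For example, `build_request(enable_speaker_info=True)` turns on diarization and leaves every other field at its documented default.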

Auth Headers

The ASR API authenticates with request headers, unlike the TTS API, which uses a Bearer token:

| Header | Value |
| --- | --- |
| `X-Api-App-Key` | AppID |
| `X-Api-Access-Key` | Access Token |
| `X-Api-Resource-Id` | `volc.bigasr.auc` |
| `X-Api-Request-Id` | Unique task UUID |
| `X-Api-Sequence` | `-1` (single request) |
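A minimal sketch of assembling these headers follows. `build_headers` is a hypothetical helper; the optional `request_id` argument assumes that a later query reuses the UUID sent at submit time, which this document implies ("task UUID") but does not state outright.

```python
import uuid
from typing import Dict, Optional


def build_headers(
    app_id: str,
    access_token: str,
    request_id: Optional[str] = None,
) -> Dict[str, str]:
    """Return the five auth headers listed in the table above."""
    return {
        "X-Api-App-Key": app_id,
        "X-Api-Access-Key": access_token,
        "X-Api-Resource-Id": "volc.bigasr.auc",
        # A fresh UUID per task unless the caller supplies one to reuse.
        "X-Api-Request-Id": request_id or str(uuid.uuid4()),
        "X-Api-Sequence": "-1",
    }
```

In practice the two credentials would likely come from `os.environ["VOLCENGINE_TTS_APPID"]` and `os.environ["VOLCENGINE_TTS_TOKEN"]`; that mapping of variables to headers is inferred from the names rather than confirmed here.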

Response Status Codes

| Code | Status | Action |
| --- | --- | --- |
| `20000000` | Completed | Results available |
| `20000001` | Queued | Wait and re-query |
| `20000002` | Processing | Wait and re-query |
| Other | Failed | Check error message |
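The table translates directly into a lookup. In this sketch `TaskState`, `classify`, and `should_retry` are hypothetical names, and the code is assumed to have already been read from wherever the response carries it.

```python
from enum import Enum


class TaskState(Enum):
    COMPLETED = "completed"
    QUEUED = "queued"
    PROCESSING = "processing"
    FAILED = "failed"


# The three codes documented above; every other code counts as a failure.
_KNOWN = {
    "20000000": TaskState.COMPLETED,
    "20000001": TaskState.QUEUED,
    "20000002": TaskState.PROCESSING,
}


def classify(code) -> TaskState:
    """Map a status code (int or str) to a TaskState."""
    return _KNOWN.get(str(code).strip(), TaskState.FAILED)


def should_retry(code) -> bool:
    """True when the documented action is 'Wait and re-query'."""
    return classify(code) in (TaskState.QUEUED, TaskState.PROCESSING)
```

Normalizing through `str(code)` lets the same function accept the code whether it was parsed as an integer or left as text.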

Supported Audio Formats

mp3, wav, ogg, m4a, flac, aac, amr, pcm (16 kHz, 16-bit, mono)
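This document does not describe how the `auto-detect` default for `--format` works. One plausible approach, shown here purely as an assumption, is to read the file extension and fall back to an explicit `--format` when it is not a supported format; `SUPPORTED` and `guess_format` are hypothetical names.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Formats listed under "Supported Audio Formats".
SUPPORTED = {"mp3", "wav", "ogg", "m4a", "flac", "aac", "amr", "pcm"}


def guess_format(source: str) -> Optional[str]:
    """Guess the audio format from a path or URL extension, or return None."""
    # For URLs, ignore the query string and look only at the path component.
    path = urlparse(source).path if "://" in source else source
    ext = PurePosixPath(path).suffix.lower().lstrip(".")
    return ext if ext in SUPPORTED else None
```

Under this scheme a file such as `recording.raw` yields `None`, which matches the earlier example where raw PCM audio is passed with an explicit `--format pcm`.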

Environment Variables

| Variable | Required | Description |
| --- | --- | --- |
| `VOLCENGINE_TTS_APPID` | Yes | Application ID |
| `VOLCENGINE_TTS_TOKEN` | Yes | Access Token |

