audio-memory

Updated April 15, 2026 at 17:25

Store and manage voice samples for TTS cloning — portable, version-controlled audio references


Source

fabioc-aloha

fabioc-aloha/Alex_Plug_In


SKILL.md


Audio Memory Skill

Domain: AI Audio / Voice Cloning
Version: 1.0.0
Last Updated: 2026-04-15
Author: Alex (Master Alex)
Related: text-to-speech (generation), visual-memory (photos)

Overview

Store voice samples for TTS voice cloning in a portable, version-controlled format. Unlike the visual-memory skill, which embeds images inline as base64, audio-memory keeps each sample as a separate file described by JSON metadata, because audio is too large to embed sensibly.

Voice Sample Specifications

Spec	Value
Duration	5-15 seconds of clear speech
Format	WAV (preferred) or MP3
Sample rate	16kHz+ (22kHz+ recommended)
Content	Natural speech, varied intonation
Background	No music, no background noise
File size	~100KB-500KB per sample
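
The duration and sample-rate rows above can be checked automatically before a sample is committed. The following is a minimal sketch, assuming an uncompressed PCM WAV file; `inspectWav` and `checkVoiceSample` are hypothetical helper names, not part of this skill, and MP3 samples are not handled.

```javascript
// Hypothetical helper: read sample rate, channel count, and duration
// from the header of a PCM WAV file held in a Node.js Buffer.
function inspectWav(buf) {
  if (buf.length < 12 ||
      buf.toString("ascii", 0, 4) !== "RIFF" ||
      buf.toString("ascii", 8, 12) !== "WAVE") {
    throw new Error("Not a RIFF/WAVE file");
  }
  let sampleRate = 0, channels = 0, byteRate = 0, dataBytes = 0;
  let offset = 12;
  // Each chunk is a 4-byte id, a 4-byte little-endian size, then the payload.
  while (offset + 8 <= buf.length) {
    const id = buf.toString("ascii", offset, offset + 4);
    const size = buf.readUInt32LE(offset + 4);
    if (id === "fmt " && size >= 16 && offset + 24 <= buf.length) {
      channels = buf.readUInt16LE(offset + 10);
      sampleRate = buf.readUInt32LE(offset + 12);
      byteRate = buf.readUInt32LE(offset + 16);
    } else if (id === "data") {
      // Clamp in case the header declares more data than the buffer holds.
      dataBytes = Math.min(size, buf.length - offset - 8);
    }
    offset += 8 + size + (size % 2); // chunks are padded to an even length
  }
  if (!byteRate) throw new Error("Missing or invalid fmt chunk");
  return { sampleRate, channels, durationSeconds: dataBytes / byteRate };
}

// Hypothetical helper: compare a sample against the specification table.
function checkVoiceSample(buf) {
  const info = inspectWav(buf);
  const problems = [];
  if (info.durationSeconds < 5 || info.durationSeconds > 15) {
    problems.push("duration should be 5-15 seconds");
  }
  if (info.sampleRate < 16000) {
    problems.push("sample rate should be at least 16 kHz");
  }
  return { ...info, problems };
}
```

In practice the buffer would come from something like `readFileSync("voices/alex-sample.wav")`; an empty `problems` array means the sample meets the duration and sample-rate rows. Content and background-noise requirements still need a human listener.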

Storage Structure

.github/skills/<skill-name>/audio-memory/
├── index.json              # Metadata registry
├── voices/
│   ├── alex-sample.wav     # Voice sample files
│   ├── narrator-sample.wav
│   └── ...
└── README.md               # Usage notes (optional)

index.json Schema

{
  "version": "1.0",
  "updated": "2026-04-15",
  "voices": {
    "alex": {
      "description": "Natural conversational voice, warm and friendly",
      "audioFile": "voices/alex-sample.wav",
      "duration": "10s",
      "sampleRate": "22050",
      "language": "en-US",
      "preferredModel": "chatterbox-turbo",
      "notes": "Best for narration and documentation reads"
    },
    "narrator": {
      "description": "Professional narration voice",
      "audioFile": "voices/narrator-sample.wav",
      "duration": "12s",
      "sampleRate": "44100",
      "language": "en-US",
      "preferredModel": "qwen/qwen3-tts"
    }
  }
}
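
A registry like this is easy to break with a typo, so a small check can be useful before committing. The sketch below is illustrative only: `validateAudioIndex` is a hypothetical name, and the list of required fields is an assumption inferred from the keys that both example entries share (`notes` appears in only one entry, so it is treated as optional).

```javascript
// Assumed required fields: the keys present in both example voice entries.
const REQUIRED_FIELDS = [
  "description", "audioFile", "duration",
  "sampleRate", "language", "preferredModel",
];

// Hypothetical validator: takes the parsed index.json object and
// returns a list of human-readable problems (empty when none are found).
function validateAudioIndex(index) {
  if (index === null || typeof index !== "object") {
    return ["index must be an object"];
  }
  const errors = [];
  if (typeof index.version !== "string") {
    errors.push("version must be a string");
  }
  if (index.voices === null || typeof index.voices !== "object") {
    errors.push("voices must be an object keyed by voice name");
    return errors;
  }
  for (const [name, voice] of Object.entries(index.voices)) {
    for (const field of REQUIRED_FIELDS) {
      if (typeof voice[field] !== "string" || voice[field] === "") {
        errors.push(`${name}: missing or empty "${field}"`);
      }
    }
    // Samples are expected to live under voices/ per the storage structure.
    if (typeof voice.audioFile === "string" &&
        !voice.audioFile.startsWith("voices/")) {
      errors.push(`${name}: audioFile should be under voices/`);
    }
  }
  return errors;
}
```

Calling it on the result of `JSON.parse` for index.json and failing a pre-commit step when the returned array is non-empty is one possible way to use it.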

Compatible TTS Models

Model	Replicate ID	Voice Cloning	Cost
Chatterbox Turbo	`resemble-ai/chatterbox-turbo`	✅ 5s sample	$0.025/1k chars
Qwen TTS	`qwen/qwen3-tts`	✅ 3 modes	$0.02/1k chars
MiniMax Speech	`minimax/speech-2.8-turbo`	❌ Presets	$0.06/1k tokens

Note: MiniMax doesn't support cloning but has 40+ voice presets.
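
For budgeting, the per-character prices in the table translate directly into a rough estimate. This is a sketch under stated assumptions: `estimateCostUsd` is a hypothetical helper, prices are copied from the table above and may change, and billing is assumed to count characters the way JavaScript's `String.length` does. MiniMax is left out because its price is quoted per token rather than per character.

```javascript
// Prices in USD per 1,000 characters, copied from the table above.
const PRICE_PER_1K_CHARS = new Map([
  ["resemble-ai/chatterbox-turbo", 0.025],
  ["qwen/qwen3-tts", 0.02],
]);

// Hypothetical helper: rough cost of synthesizing `text` with a model.
function estimateCostUsd(modelId, text) {
  const price = PRICE_PER_1K_CHARS.get(modelId);
  if (price === undefined) {
    throw new Error(`No per-character price listed for ${modelId}`);
  }
  return (text.length / 1000) * price;
}
```

For example, a 2,000-character script costs roughly $0.05 with Chatterbox Turbo and $0.04 with Qwen TTS under these prices.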

Recording Voice Samples

Requirements

Duration: 5-15 seconds (longer samples within this range generally give better cloning quality)
Content: Natural speech with varied intonation (not monotone reading)
Quality: Clear audio, no background noise, no music
Format: WAV 16kHz+ or MP3

Recording Tips

Use a quiet room with minimal echo
Speak naturally — include some pauses, varied pitch
Avoid reading monotonously — conversational tone works best
Keep microphone at consistent distance (~6-12 inches)
Include a variety of sounds (different vowels, consonants)

Example Recording Script

"Hello, I'm [Name]. Today I want to share some thoughts about technology and how it shapes our daily lives. The key is finding balance — embracing innovation while staying grounded in what matters most."

Adding a Voice Sample

Step 1: Record the Sample

# Recommended: Use Audacity, Voice Memos (macOS), or Windows Voice Recorder
# Export as WAV, 22kHz or 44.1kHz, mono

Step 2: Place in Audio Memory

# Create directory structure
New-Item -ItemType Directory -Path ".github/skills/<skill>/audio-memory/voices" -Force

# Copy voice sample
Copy-Item "my-recording.wav" ".github/skills/<skill>/audio-memory/voices/<name>-sample.wav"

Step 3: Update index.json

{
  "voices": {
    "<name>": {
      "description": "Brief description of the voice character",
      "audioFile": "voices/<name>-sample.wav",
      "duration": "10s",
      "sampleRate": "22050",
      "language": "en-US",
      "preferredModel": "chatterbox-turbo"
    }
  }
}
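
Instead of editing the file by hand, the entry can be merged in programmatically. The following is a minimal sketch, assuming the `voices/<name>-sample.wav` naming convention from Step 2; `addVoice` is a hypothetical helper that works on the parsed JSON object and leaves reading and writing the file to the caller.

```javascript
// Hypothetical helper: return a copy of the index with one voice added
// and the top-level "updated" date refreshed (YYYY-MM-DD, in UTC).
function addVoice(index, name, entry, today = new Date()) {
  const voices = index.voices ?? {};
  if (Object.hasOwn(voices, name)) {
    throw new Error(`Voice "${name}" already exists in the index`);
  }
  return {
    ...index,
    updated: today.toISOString().slice(0, 10),
    voices: {
      ...voices,
      // audioFile follows the <name>-sample.wav convention used in Step 2.
      [name]: { ...entry, audioFile: `voices/${name}-sample.wav` },
    },
  };
}
```

A caller would typically parse index.json, pass the result through `addVoice`, and write it back with `JSON.stringify(updated, null, 2)`. Returning a new object rather than mutating the input keeps the original available if the write fails.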

Step 4: Test the Clone

import { readFileSync } from "fs";
import Replicate from "replicate";

// The client reads its API token from the REPLICATE_API_TOKEN environment variable
const replicate = new Replicate();

const output = await replicate.run("resemble-ai/chatterbox-turbo", {
  input: {
    text: "Testing the voice clone. This should sound like the reference sample.",
    // Path is relative to the audio-memory directory
    audio_prompt: readFileSync("voices/<name>-sample.wav"),
  },
});

console.log("Generated audio:", output);

Using Audio Memory in Generation

With Chatterbox Turbo

import { readFileSync } from "fs";
import Replicate from "replicate";

// Load audio memory
const audioMemory = JSON.parse(
  readFileSync(".github/skills/<skill>/audio-memory/index.json", "utf8")
);
const voice = audioMemory.voices["alex"];

// Generate speech with cloned voice
const replicate = new Replicate();
const output = await replicate.run("resemble-ai/chatterbox-turbo", {
  input: {
    text: "Content to speak in the cloned voice",
    audio_prompt: readFileSync(
      `.github/skills/<skill>/audio-memory/${voice.audioFile}`
    ),
  },
});

With Qwen TTS (Clone Mode)

// Reuses `replicate`, `readFileSync`, and `voice` from the Chatterbox Turbo example
const output = await replicate.run("qwen/qwen3-tts", {
  input: {
    text: "Content to speak",
    tts_mode: "voice_clone",
    audio_input: readFileSync(
      `.github/skills/<skill>/audio-memory/${voice.audioFile}`
    ),
  },
});
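
The two examples differ only in how the reference audio is passed, so the choice can be driven by a voice's `preferredModel`. The sketch below is one possible way to do that: `buildCloneInput` is a hypothetical helper, and the input field names (`audio_prompt`, `tts_mode`, `audio_input`) are simply those used in the two examples here, not verified against each model's published input schema.

```javascript
// Hypothetical helper: build the `input` object for replicate.run()
// from a model identifier, the text to speak, and the sample audio.
function buildCloneInput(modelId, text, audio) {
  switch (modelId) {
    // Accept the short name as well as the full Replicate ID.
    case "chatterbox-turbo":
    case "resemble-ai/chatterbox-turbo":
      return { text, audio_prompt: audio };
    case "qwen/qwen3-tts":
      return { text, tts_mode: "voice_clone", audio_input: audio };
    default:
      throw new Error(`No voice-cloning input mapping for ${modelId}`);
  }
}
```

With this in place, a call such as `replicate.run(modelId, { input: buildCloneInput(voice.preferredModel, text, sampleBuffer) })` would cover both models, and an unsupported model such as MiniMax fails early with a clear error instead of producing an uncloned voice.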

Quality Guidelines

Element	Recommendation
Sample duration	10s optimal (5s minimum, 15s maximum)
Varied speech	Include questions, statements, exclamations
Distinct voice	Clear enunciation, consistent microphone setup
File format	WAV preferred (lossless), MP3 acceptable
Sample rate	22kHz+ (44.1kHz for premium)

Benefits vs External Storage

Without Audio Memory	With Audio Memory
External folder required	Version-controlled with code
Breaks on different machines	Works anywhere
Manual path management	Structured JSON metadata
No documentation	Self-describing with index.json
Ad-hoc organization	Consistent skill-scoped storage

Integration with text-to-speech Skill

This skill stores voice samples. Use the text-to-speech skill for:

Generating speech from text
Model selection (MiniMax, Chatterbox, Qwen)
Emotion control
Voice design from descriptions (no sample needed)

Workflow:

audio-memory: Store and manage voice samples
text-to-speech: Generate speech using those samples

name	audio-memory
description	Store and manage voice samples for TTS cloning — portable, version-controlled audio references
tier	standard
applyTo	*/voice,/audiomemory,*/clonevoice
$schema	../SKILL-SCHEMA.json
