heartmula

Stars: 568

Forks: 61

Updated: May 15, 2026, 07:09

Set up and run HeartMuLa, the open-source music generation model family (Suno-like). Generates full songs from lyrics + tags with multilingual support.

Source: agentic-in/elephant-agent

name	HeartMula
skill_id	heartmula
description	Set up and run HeartMuLa, the open-source music generation model family (Suno-like). Generates full songs from lyrics + tags with multilingual support.
version	1.0.0
source_kind	elephant-builtin
category	media
default_enabled	true

HeartMuLa - Open-Source Music Generation

Overview

HeartMuLa is a family of open-source music foundation models (Apache-2.0) that generate music conditioned on lyrics and tags. It is an open-source counterpart to Suno. The family includes:

HeartMuLa - Music language model (3B/7B) for generation from lyrics + tags
HeartCodec - 12.5Hz music codec for high-fidelity audio reconstruction
HeartTranscriptor - Whisper-based lyrics transcription
HeartCLAP - Audio-text alignment model

When to Use

User wants to generate music/songs from text descriptions
User wants an open-source Suno alternative
User wants local/offline music generation
User asks about HeartMuLa, heartlib, or AI music generation

Hardware Requirements

Minimum: 8GB VRAM with --lazy_load true (loads/unloads models sequentially)
Recommended: 16GB+ VRAM for comfortable single-GPU usage
Multi-GPU: Use --mula_device cuda:0 --codec_device cuda:1 to split across GPUs
3B model with lazy_load peaks at ~6.2GB VRAM
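
The guidance above can be turned into a small flag picker. This is a minimal sketch, not part of heartlib: `suggest_flags` is a hypothetical helper, and treating the 8GB and 16GB figures as hard cut-offs (with a CPU fallback below 8GB) is an assumption made for illustration.

```python
def suggest_flags(vram_gb_per_gpu: float, gpu_count: int = 1) -> list[str]:
    """Map available VRAM to device flags named in this guide (hypothetical helper)."""
    if gpu_count < 1 or vram_gb_per_gpu < 8:
        # Below the stated 8GB minimum: assume CPU mode is the only option.
        return ["--mula_device", "cpu", "--codec_device", "cpu"]
    if gpu_count >= 2:
        # Split the language model and the codec across two GPUs.
        return ["--mula_device", "cuda:0", "--codec_device", "cuda:1"]
    if vram_gb_per_gpu < 16:
        # Single GPU under the recommended 16GB: load/unload models sequentially.
        return ["--lazy_load", "true"]
    # 16GB+ on a single GPU: the defaults should be comfortable.
    return []


print(" ".join(suggest_flags(12)))  # --lazy_load true
```

The returned list can be appended to the generation command; adjust the thresholds if your own measurements differ.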

Installation Steps

1. Clone Repository

cd ~/  # or desired directory
git clone https://github.com/HeartMuLa/heartlib.git
cd heartlib

2. Create Virtual Environment (Python 3.10 required)

uv venv --python 3.10 .venv
. .venv/bin/activate
uv pip install -e .

3. Fix Dependency Compatibility Issues

IMPORTANT: As of Feb 2026, the pinned dependencies have conflicts with newer packages. Apply these fixes:

# Upgrade datasets (old version incompatible with current pyarrow)
uv pip install --upgrade datasets

# Upgrade transformers (needed for huggingface-hub 1.x compatibility)
uv pip install --upgrade transformers

4. Patch Source Code (Required for transformers 5.x)

Patch 1 - RoPE cache fix in src/heartlib/heartmula/modeling_heartmula.py:

In the setup_caches method of the HeartMuLa class, add RoPE reinitialization after the reset_caches try/except block and before the with device: block:

# Re-initialize RoPE caches that were skipped during meta-device loading
from torchtune.models.llama3_1._position_embeddings import Llama3ScaledRoPE
for module in self.modules():
    if isinstance(module, Llama3ScaledRoPE) and not module.is_cache_built:
        module.rope_init()
        module.to(device)

Why: from_pretrained creates the model on the meta device first. Llama3ScaledRoPE.rope_init() skips cache building on meta tensors, and the caches are never rebuilt after the weights are loaded onto the real device.
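
The failure mode can be reproduced without PyTorch. The toy below is only an illustration of the pattern: `ToyRoPE` and `rebuild_missing_caches` are hypothetical stand-ins, devices are plain strings, and the cache values are arbitrary. The rebuild loop mirrors the shape of Patch 1 (check `is_cache_built`, call `rope_init()`, then `to(device)`).

```python
class ToyRoPE:
    """Stand-in for a module whose cache cannot be built on the meta device."""

    def __init__(self, device: str) -> None:
        self.device = device
        self.is_cache_built = False
        self.cache: list[float] = []
        self.rope_init()

    def rope_init(self) -> None:
        if self.device == "meta":
            return  # placeholder tensors hold no data, so building is skipped
        self.cache = [1.0 / (10000 ** (i / 4)) for i in range(4)]
        self.is_cache_built = True

    def to(self, device: str) -> "ToyRoPE":
        self.device = device  # moving the module does not rebuild the cache
        return self


def rebuild_missing_caches(modules, device: str) -> int:
    """Re-run rope_init() on every module that was skipped; return the count."""
    rebuilt = 0
    for module in modules:
        if isinstance(module, ToyRoPE) and not module.is_cache_built:
            module.rope_init()
            module.to(device)
            rebuilt += 1
    return rebuilt


layers = [ToyRoPE("meta"), ToyRoPE("meta")]
for layer in layers:
    layer.to("cuda")  # simulates weights being materialized on a real device
print([layer.is_cache_built for layer in layers])  # [False, False]
print(rebuild_missing_caches(layers, "cuda"))      # 2
```

Running the rebuild a second time touches nothing, which is why the real patch is safe to leave in place even when the caches were built normally.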

Patch 2 - HeartCodec loading fix in src/heartlib/pipelines/music_generation.py:

Add ignore_mismatched_sizes=True to ALL HeartCodec.from_pretrained() calls (there are 2: the eager load in __init__ and the lazy load in the codec property).

Why: The VQ codebook's initted buffers have shape [1] in the checkpoint but [] in the model. The data is the same; only the shape differs (a one-element 1-D tensor versus a 0-d scalar), so the mismatch is safe to ignore.
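
If you prefer not to edit the file by hand, Patch 2 can be applied as a text transform. This is a rough sketch, not an official tool: `add_kwarg` is a hypothetical helper that matches parentheses naively (it assumes no unbalanced parentheses inside string literals or comments within the call), so review the resulting diff before relying on it.

```python
CALL = "HeartCodec.from_pretrained("
KWARG = "ignore_mismatched_sizes=True"


def add_kwarg(source: str) -> str:
    """Append KWARG to every HeartCodec.from_pretrained(...) call in source."""
    out = []
    pos = 0
    while True:
        start = source.find(CALL, pos)
        if start == -1:
            out.append(source[pos:])
            return "".join(out)
        # Walk forward to the parenthesis that closes this call.
        i = start + len(CALL)
        depth = 1
        while i < len(source) and depth:
            if source[i] == "(":
                depth += 1
            elif source[i] == ")":
                depth -= 1
            i += 1
        if depth:  # unbalanced call: leave the remainder untouched
            out.append(source[pos:])
            return "".join(out)
        close = i - 1
        args = source[start + len(CALL):close]
        stripped = args.rstrip()
        if KWARG in args:
            new_args = args  # already patched
        elif not stripped:
            new_args = KWARG + args
        elif stripped.endswith(","):
            new_args = stripped + " " + KWARG + args[len(stripped):]
        else:
            new_args = stripped + ", " + KWARG + args[len(stripped):]
        out.append(source[pos:start + len(CALL)])
        out.append(new_args)
        out.append(")")
        pos = close + 1
```

A possible usage is to read src/heartlib/pipelines/music_generation.py with `pathlib.Path.read_text()`, pass the text through `add_kwarg`, and write it back. The function is idempotent, so running it twice does not add the argument twice.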

5. Download Model Checkpoints

cd heartlib  # project root
hf download --local-dir './ckpt' 'HeartMuLa/HeartMuLaGen'
hf download --local-dir './ckpt/HeartMuLa-oss-3B' 'HeartMuLa/HeartMuLa-oss-3B-happy-new-year'
hf download --local-dir './ckpt/HeartCodec-oss' 'HeartMuLa/HeartCodec-oss-20260123'

All 3 can be downloaded in parallel. Total size is several GB.
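
One way to run the three downloads in parallel is to launch the same `hf download` commands from a thread pool. This is a minimal sketch: `build_commands` and `download_all` are hypothetical helper names, and it assumes the `hf` CLI is on your PATH and that you run it from the project root.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# (repository id, subdirectory under the checkpoint root); "." means the root itself.
CHECKPOINTS = [
    ("HeartMuLa/HeartMuLaGen", "."),
    ("HeartMuLa/HeartMuLa-oss-3B-happy-new-year", "HeartMuLa-oss-3B"),
    ("HeartMuLa/HeartCodec-oss-20260123", "HeartCodec-oss"),
]


def build_commands(ckpt_root: str = "./ckpt") -> list[list[str]]:
    """Return one `hf download` argument list per checkpoint."""
    commands = []
    for repo_id, subdir in CHECKPOINTS:
        local_dir = ckpt_root if subdir == "." else f"{ckpt_root}/{subdir}"
        commands.append(["hf", "download", "--local-dir", local_dir, repo_id])
    return commands


def download_all(ckpt_root: str = "./ckpt") -> list[int]:
    """Run all downloads concurrently and return their exit codes."""
    commands = build_commands(ckpt_root)
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        results = list(pool.map(lambda cmd: subprocess.run(cmd, check=False), commands))
    return [result.returncode for result in results]
```

Call `download_all()` and check that every returned exit code is 0 before moving on to generation.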

GPU / CUDA

HeartMuLa uses CUDA by default (--mula_device cuda --codec_device cuda). No extra setup needed if the user has an NVIDIA GPU with PyTorch CUDA support installed.

The installed torch==2.4.1 includes CUDA 12.1 support out of the box
torchtune may report version 0.4.0+cpu — this is just package metadata, it still uses CUDA via PyTorch
To verify GPU is being used, look for "CUDA memory" lines in the output (e.g. "CUDA memory before unloading: 6.20 GB")
No GPU? You can run on CPU with --mula_device cpu --codec_device cpu, but expect generation to be extremely slow (potentially 30-60+ minutes for a single song vs ~4 minutes on GPU). CPU mode also requires significant RAM (~12GB+ free). If the user has no NVIDIA GPU, recommend using a cloud GPU service (Google Colab free tier with T4, Lambda Labs, etc.) or the online demo at https://heartmula.github.io/ instead.
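
To check GPU use programmatically, you can scan captured output for the "CUDA memory" lines mentioned above. This sketch assumes every such line follows the shape of the one quoted example ("CUDA memory <label>: <number> GB"); `cuda_memory_readings` and `used_gpu` are hypothetical helper names.

```python
import re

# Matches e.g. "CUDA memory before unloading: 6.20 GB" and captures the number.
CUDA_LINE = re.compile(r"CUDA memory[^:\n]*:\s*([0-9]+(?:\.[0-9]+)?)\s*GB")


def cuda_memory_readings(output: str) -> list[float]:
    """Return every CUDA memory figure (in GB) found in the output text."""
    return [float(match.group(1)) for match in CUDA_LINE.finditer(output)]


def used_gpu(output: str) -> bool:
    """True if the run printed at least one CUDA memory line."""
    return bool(cuda_memory_readings(output))


sample = "loading model\nCUDA memory before unloading: 6.20 GB\ndone\n"
print(cuda_memory_readings(sample))  # [6.2]
```

Taking `max()` of the readings gives a rough peak figure to compare against your card's VRAM.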

Usage

Basic Generation

cd heartlib
. .venv/bin/activate
python ./examples/run_music_generation.py \
  --model_path=./ckpt \
  --version="3B" \
  --lyrics="./assets/lyrics.txt" \
  --tags="./assets/tags.txt" \
  --save_path="./assets/output.mp3" \
  --lazy_load true
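
When generation is driven from another script, the same command can be assembled as an argument list. This is a sketch under stated assumptions: `build_generation_command` is a hypothetical wrapper, it only uses the flags shown in the command above, and it omits `--lazy_load` entirely when lazy loading is off rather than guessing how the script parses other values.

```python
def build_generation_command(
    lyrics_path: str,
    tags_path: str,
    save_path: str,
    model_path: str = "./ckpt",
    version: str = "3B",
    lazy_load: bool = True,
) -> list[str]:
    """Assemble the run_music_generation.py invocation as an argv list."""
    command = [
        "python",
        "./examples/run_music_generation.py",
        f"--model_path={model_path}",
        f"--version={version}",
        f"--lyrics={lyrics_path}",
        f"--tags={tags_path}",
        f"--save_path={save_path}",
    ]
    if lazy_load:
        command += ["--lazy_load", "true"]
    return command


print(" ".join(build_generation_command(
    "./assets/lyrics.txt", "./assets/tags.txt", "./assets/output.mp3"
)))
```

Pass the list to `subprocess.run(command, cwd="heartlib", check=True)` with the virtual environment active so that `python` resolves to the interpreter that has heartlib installed.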

Input Formatting

Tags (comma-separated, no spaces):

piano,happy,wedding,synthesizer,romantic

rock,energetic,guitar,drums,male-vocal

Lyrics (use bracketed structural tags):

[Intro]

[Verse]
Your lyrics here...

[Chorus]
Chorus lyrics...

[Bridge]
Bridge lyrics...

[Outro]
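
Both input files are easy to generate from structured data. The helpers below are a minimal sketch (`format_tags` and `format_lyrics` are hypothetical names). Replacing inner spaces with hyphens is an assumption based on the `male-vocal` example, not a documented rule.

```python
def format_tags(tags: list[str]) -> str:
    """Join tags with commas and no spaces, as the tags file expects."""
    cleaned = [tag.strip().replace(" ", "-") for tag in tags if tag.strip()]
    return ",".join(cleaned)


def format_lyrics(sections: list[tuple[str, str]]) -> str:
    """Render (section name, text) pairs with bracketed structural tags."""
    blocks = []
    for name, text in sections:
        body = text.strip()
        # Sections such as [Intro] may have no lyrics at all.
        blocks.append(f"[{name}]" + (f"\n{body}" if body else ""))
    return "\n\n".join(blocks) + "\n"


print(format_tags(["rock", "energetic", "male vocal"]))  # rock,energetic,male-vocal
print(format_lyrics([("Intro", ""), ("Verse", "Your lyrics here...")]))
```

Write the two strings to the files passed via `--tags` and `--lyrics`.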

Key Parameters

Parameter	Default	Description
`--max_audio_length_ms`	240000	Max length in ms (240s = 4 min)
`--topk`	50	Top-k sampling
`--temperature`	1.0	Sampling temperature
`--cfg_scale`	1.5	Classifier-free guidance scale
`--lazy_load`	false	Load/unload models on demand (saves VRAM)
`--mula_dtype`	bfloat16	Dtype for HeartMuLa (bf16 recommended)
`--codec_dtype`	float32	Dtype for HeartCodec (fp32 recommended for quality)
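
To keep command lines short, you can emit flags only for values that differ from the defaults in the table. This is a sketch: `override_flags` is a hypothetical helper, the defaults are copied from the table above, and the `--name value` form is assumed to be accepted for every parameter (it is the form shown for `--lazy_load`).

```python
# Defaults copied from the Key Parameters table.
DEFAULTS = {
    "max_audio_length_ms": 240000,
    "topk": 50,
    "temperature": 1.0,
    "cfg_scale": 1.5,
    "lazy_load": "false",
    "mula_dtype": "bfloat16",
    "codec_dtype": "float32",
}


def override_flags(**overrides) -> list[str]:
    """Return CLI flags for every override that differs from its default."""
    flags = []
    for name, value in overrides.items():
        if name not in DEFAULTS:
            raise KeyError(f"unknown parameter: {name}")
        if isinstance(value, bool):
            value = "true" if value else "false"  # booleans are passed as words
        if value == DEFAULTS[name]:
            continue  # unchanged: no flag needed
        flags += [f"--{name}", str(value)]
    return flags


print(override_flags(max_audio_length_ms=60000, temperature=1.0, lazy_load=True))
# ['--max_audio_length_ms', '60000', '--lazy_load', 'true']
```

Rejecting unknown names early catches typos such as `top_k` before a long generation run starts.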

Performance

RTF (Real-Time Factor) ≈ 1.0 — a 4-minute song takes ~4 minutes to generate
Output: MP3, 48kHz stereo, 128kbps
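
These two figures allow a quick back-of-the-envelope estimate of run time and file size. The sketch assumes a constant bitrate with 1 kbps = 1000 bits per second and ignores MP3 headers and metadata; `estimate` is a hypothetical helper.

```python
def estimate(duration_ms: int, rtf: float = 1.0, bitrate_kbps: int = 128) -> dict:
    """Estimate generation time and MP3 size for a target audio length."""
    seconds = duration_ms / 1000
    return {
        # RTF is generation time divided by audio duration.
        "generation_seconds": seconds * rtf,
        # bits per second * seconds / 8 bits per byte
        "mp3_bytes": int(seconds * bitrate_kbps * 1000 / 8),
    }


print(estimate(240000))
# {'generation_seconds': 240.0, 'mp3_bytes': 3840000}
```

For the default 240000 ms limit this gives about 4 minutes of generation and a file of roughly 3.84 MB.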

Pitfalls

Do NOT use bf16 for HeartCodec — degrades audio quality. Use fp32 (default).
Tags may be ignored — known issue (#90). Lyrics tend to dominate; experiment with tag ordering.
Triton not available on macOS — Linux/CUDA only for GPU acceleration.
RTX 5080 incompatibility reported in upstream issues.
The dependency pin conflicts require the manual upgrades and patches described above.
