$pwd:

optimizing-performance

Name: Optimizing Performance
Author: inference-sh

// Optimize inference.sh app performance. Use when handling memory, devices, model loading, mixed precision, or flash attention.

$ git log --oneline --stat

stars:26

forks:2

updated:May 21, 2026, 08:16

SKILL.md

name	optimizing-performance
description	Optimize inference.sh app performance. Use when handling memory, devices, model loading, mixed precision, or flash attention.

Optimizing Performance

Best practices for inference.sh apps.

Device Detection

Never hardcode "cuda":

from accelerate import Accelerator
self.device = Accelerator().device
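
Where accelerate is not installed, the same idea can be sketched by hand. The helper below, `pick_device`, is a hypothetical name and not part of inference.sh or accelerate. It assumes only PyTorch, probes CUDA and then Apple MPS, and falls back to "cpu" (including when PyTorch itself is missing).

```python
try:
    import torch
except ImportError:  # keeps the sketch usable on machines without PyTorch
    torch = None


def pick_device() -> str:
    """Return the best available device name: "cuda", "mps", or "cpu"."""
    if torch is None:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is absent in older PyTorch releases, so guard the lookup
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
```

The returned string can be passed to `model.to(device)` in place of a hardcoded "cuda".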

Model Loading

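import os

# Set the flag before importing huggingface_hub, which reads it at import time.
# It also requires the hf_transfer package to be installed.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

# Recent huggingface_hub releases resume interrupted downloads on their own
# and deprecate the resume_download argument.
model_path = snapshot_download(repo_id="org/model")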

Memory Cleanup

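import gc
import torch

def cleanup():
    # Collect unreachable Python objects first so their tensors are freed,
    # then release the cached CUDA memory those tensors were holding.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()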
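
One way to make sure cleanup always runs is to wrap the work in a context manager. This is a minimal sketch, not an inference.sh API: `released_after` is a hypothetical name, and the `sum(...)` call is a stand-in for a real model call. The cleanup helper here collects garbage before emptying the CUDA cache, and skips the CUDA step when PyTorch or a GPU is unavailable.

```python
import gc
from contextlib import contextmanager

try:
    import torch
except ImportError:  # allows the sketch to run without PyTorch
    torch = None


def cleanup() -> None:
    """Free unreachable Python objects, then release cached CUDA memory."""
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()


@contextmanager
def released_after():
    """Run a block of work and call cleanup() afterwards, even if it raises."""
    try:
        yield
    finally:
        cleanup()


with released_after():
    result = sum(range(10))  # stand-in for a model call
```

Because the cleanup sits in a `finally` clause, memory is released on the error path too, which matters for long-running workers that survive a failed request.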

Mixed Precision

model = model.to(dtype=torch.bfloat16)

# Or with autocast. Pass dtype explicitly: CUDA autocast defaults to float16.
from torch.amp import autocast
with autocast("cuda", dtype=torch.bfloat16):
    output = model(inputs)
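
Not every GPU supports bfloat16, so the dtype can be chosen at startup instead of fixed in code. The sketch below is an assumption about how one might do that. `pick_dtype_name` is a hypothetical helper, and it relies only on `torch.cuda.is_available()` and `torch.cuda.is_bf16_supported()`.

```python
try:
    import torch
except ImportError:  # allows the sketch to run without PyTorch
    torch = None


def pick_dtype_name() -> str:
    """Return "bfloat16" or "float16" on CUDA, and "float32" otherwise."""
    if torch is None or not torch.cuda.is_available():
        return "float32"
    if torch.cuda.is_bf16_supported():
        return "bfloat16"
    return "float16"


dtype_name = pick_dtype_name()
# With PyTorch installed, resolve the name to a real dtype object
dtype = getattr(torch, dtype_name) if torch is not None else None
```

The resulting `dtype` can then be passed to `model.to(dtype=dtype)` or to `autocast(..., dtype=dtype)`.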

Flash Attention

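import torch
from transformers import AutoModel

# Requires the flash-attn package and a GPU that it supports.
model = AutoModel.from_pretrained(
    "model-name",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)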
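
Requesting "flash_attention_2" typically raises an error when the flash-attn package is not installed, so a fallback is worth considering. The sketch below is an assumption, not the documented inference.sh approach. `pick_attn_implementation` is a hypothetical helper that checks whether `flash_attn` is importable and otherwise selects "sdpa", PyTorch's built-in scaled dot-product attention. An importable package does not guarantee that the GPU is supported.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash_attn is importable, else "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn_implementation = pick_attn_implementation()
# Hypothetical usage with transformers:
# model = AutoModel.from_pretrained(
#     "model-name",
#     attn_implementation=attn_implementation,
#     torch_dtype=torch.bfloat16,
# )
```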

Pre-deploy Checklist

📖 Full docs: inference.sh/docs/extend/best-practices

related-skills.json

same repository

configuring-resources.md

from "inference-sh/grid"

Configure inf.yml for inference.sh apps. Use when setting GPU, VRAM, RAM, categories, environment variables, packages.txt, or resource requirements.

2026-05-21 (26 stars)

debugging-issues.md

from "inference-sh/grid"

Debug and troubleshoot inference.sh apps. Use when facing import errors, CUDA issues, memory problems, or deployment failures.

2026-05-21 (26 stars)

handling-cancellation.md

from "inference-sh/grid"

Handle graceful cancellation in inference.sh apps. Use when implementing long-running tasks that users might cancel.

2026-05-21 (26 stars)

managing-secrets.md

from "inference-sh/grid"

Handle API keys and sensitive values in inference.sh apps. Use when adding secrets, accessing environment variables, or securing credentials.

2026-05-21 (26 stars)

building-inferencesh-apps.md

from "inference-sh/grid"

Build and deploy applications on inference.sh. Use when getting started, understanding the platform, or needing an overview of inference.sh development.

2026-05-21 (26 stars)

tracking-usage.md

from "inference-sh/grid"

Track usage with output metadata in inference.sh apps. Use when implementing billing, counting tokens, or reporting image/video/audio generation metrics.

2026-05-21 (26 stars)

package.json

"author": "inference-sh"

"repository": "inference-sh/grid"

$ useful --forSOC

Software Developers, Computer and Mathematical Occupations, 15-1252, L4
