optimizing-performance
Optimize inference.sh app performance. Use when handling memory, devices, model loading, mixed precision, or flash attention.
| Field | Value |
| --- | --- |
| name | optimizing-performance |
| description | Optimize inference.sh app performance. Use when handling memory, devices, model loading, mixed precision, or flash attention. |
Best practices for inference.sh apps.
Never hardcode "cuda". Let accelerate select the available device instead:
from accelerate import Accelerator
self.device = Accelerator().device
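Where accelerate is not installed, a similar fallback can be written against torch directly. The following is a minimal sketch, not part of inference.sh or accelerate: `pick_device` is a hypothetical helper name, and it returns a plain device string so that it also runs where torch is absent.

```python
def pick_device() -> str:
    """Return the name of the best available device: "cuda", "mps", or "cpu"."""
    try:
        import torch  # optional dependency; fall back to CPU if it is missing
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Older torch versions have no torch.backends.mps, so look it up defensively
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(f"Using device: {device}")
```

The returned string can be passed to `model.to(device)` in the same place a hardcoded "cuda" would otherwise go.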
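
Enable hf_transfer for faster model downloads. Set the variable before importing huggingface_hub, which reads it at import time:
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # requires the hf_transfer package
from huggingface_hub import snapshot_download
# Recent huggingface_hub releases resume interrupted downloads automatically;
# the resume_download argument is deprecated there.
model_path = snapshot_download(repo_id="org/model")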
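The download step can be made a little more defensive. The sketch below assumes two things that are not in the original: the flag should only be enabled when the `hf_transfer` package is actually installed (downloads fail if the flag is set without it), and the download is wrapped in a hypothetical helper, `fetch_model`, whose `downloader` parameter exists only so the function can be exercised without network access.

```python
import importlib.util
import os

# Enable hf_transfer only when the package is installed. The flag is set
# before huggingface_hub is imported, because the library reads it at import.
if importlib.util.find_spec("hf_transfer") is not None:
    os.environ.setdefault("HF_HUB_ENABLE_HF_TRANSFER", "1")


def fetch_model(repo_id: str, downloader=None) -> str:
    """Download a model snapshot and return its local path."""
    if downloader is None:
        # Imported lazily, after the environment variable has been handled
        from huggingface_hub import snapshot_download as downloader
    return downloader(repo_id=repo_id)
```

Calling `fetch_model("org/model")` inside setup() would use `snapshot_download`. Passing a stub as `downloader` keeps unit tests offline.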
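
Free unused memory after heavy work:
import gc
import torch

def cleanup():
    # Collect unreachable Python objects first so their tensors can be released
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # release cached, unused GPU memory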
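One way to make sure cleanup always runs, even when a request fails, is to wrap the work in a context manager. This is a sketch under assumptions: `cleanup_after` is a hypothetical name, torch is treated as optional so the example runs anywhere, and the `sum(...)` call stands in for a real model invocation.

```python
import gc
from contextlib import contextmanager


def cleanup() -> None:
    """Collect garbage, then release cached GPU memory if torch and CUDA are present."""
    gc.collect()
    try:
        import torch
    except ImportError:
        return
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


@contextmanager
def cleanup_after():
    """Run cleanup() when the block exits, whether it succeeds or raises."""
    try:
        yield
    finally:
        cleanup()


with cleanup_after():
    result = sum(range(10))  # stand-in for a model call
```

Because the cleanup sits in a `finally` clause, an exception inside the block still propagates to the caller after memory has been released.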
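
Use mixed precision to reduce memory use:
model = model.to(dtype=torch.bfloat16)
# Or with autocast (defaults to float16 on CUDA; pass dtype=torch.bfloat16 for bf16)
from torch.amp import autocast
with autocast("cuda"):
    output = model(inputs)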
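Not every GPU supports bfloat16, so the dtype can be chosen at runtime rather than fixed. The sketch below is one possible policy, not something the original prescribes: `pick_dtype_name` is a hypothetical helper, and falling back to float32 on CPU is an assumption made here for simplicity.

```python
def pick_dtype_name() -> str:
    """Return "bfloat16", "float16", or "float32" depending on the hardware."""
    try:
        import torch  # optional dependency; use full precision if it is missing
    except ImportError:
        return "float32"
    if not torch.cuda.is_available():
        return "float32"  # assumption: keep full precision on CPU
    if torch.cuda.is_bf16_supported():
        return "bfloat16"
    return "float16"


dtype_name = pick_dtype_name()
print(f"Selected dtype: {dtype_name}")
```

With torch installed, `getattr(torch, dtype_name)` turns the name into the dtype object expected by `model.to(dtype=...)`.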
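
Enable flash attention when loading a transformers model (requires the flash-attn package and half precision):
from transformers import AutoModel
model = AutoModel.from_pretrained(
    "model-name",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)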
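Loading with `attn_implementation="flash_attention_2"` fails when the flash-attn package is missing, so a fallback can be selected first. The sketch below only checks whether `flash_attn` is importable. `pick_attn_implementation` is a hypothetical helper, and falling back to "sdpa" (PyTorch's scaled dot product attention) is an assumption that may not suit every model architecture.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash_attn is importable, else "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn = pick_attn_implementation()
print(f"Attention implementation: {attn}")
```

The result can be passed as the `attn_implementation` argument to `from_pretrained`. A GPU capability check is left out here and may also be needed.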

- setup() loads models
- run() processes test input

📖 Full docs: inference.sh/docs/extend/best-practices