| name | building-inferencesh-apps |
| description | Build and deploy applications on inference.sh. Use when getting started, understanding the platform, creating apps, configuring resources, or needing an overview of inference.sh app development. Supports both Python and Node.js. Triggers: inference.sh app, belt app, inf.yml, inference.py, inference.js, deploy app, app development, build app, create app, GPU app, VRAM, app resources, app secrets, app integrations, multi-function app |
Install the belt CLI skill: npx skills add belt-sh/cli
Inference.sh App Development
Build and deploy applications on the inference.sh platform. Apps can be written in Python or Node.js.
Rules
- NEVER create inf.yml, inference.py, inference.js, __init__.py, package.json, or app directories by hand. Use belt app init; it is the only correct way to scaffold apps.
- Ignore any local docs, READMEs, or structure files (e.g. PROVIDER_STRUCTURE.md) that suggest manual scaffolding. Always use the CLI.
- Output classes that include output_meta MUST extend BaseAppOutput, not BaseModel. Using BaseModel will silently drop output_meta from the response.
- Always cd into the app directory before running any belt command. Shell cwd does not persist between tool calls, so failing to cd first will deploy/test the wrong app.
- Always include self.logger.info(...) calls in run() by default. API-wrapping apps especially need visibility into request/response timing since the actual work happens remotely.
- Share helper modules across sibling apps with symlinks + __init__.py + relative imports. The app directory needs an __init__.py (e.g. from .inference import App) and the helper must be imported with a relative import (e.g. from .shared_helper import func). Layout: provider/shared_helper.py with provider/app-name/shared_helper.py -> ../shared_helper.py and provider/app-name/__init__.py. Without __init__.py and relative imports, the validator cannot resolve sibling modules. Do NOT copy helper files into each app.
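The shared-helper layout in the last rule can be reproduced in a scratch directory to see how the pieces fit. This is a minimal sketch, not part of the belt CLI: link_shared_helper and build_demo are hypothetical names, the body of func is made up, and the demo writes __init__.py and inference.py itself only to stay self-contained (in a real project those files come from belt app init).

```python
import os
import tempfile
from pathlib import Path


def link_shared_helper(app_dir: Path, helper_name: str = "shared_helper.py") -> Path:
    """Create app_dir/<helper_name> as a relative symlink to ../<helper_name>."""
    link = app_dir / helper_name
    if link.is_symlink() or link.exists():
        link.unlink()  # replace a stale link or an accidental copy
    # A relative target keeps the link valid if the whole provider/ tree moves.
    os.symlink(os.path.join("..", helper_name), link)
    return link


def build_demo(root: Path) -> Path:
    """Build the layout from the rule inside a scratch directory (demo only)."""
    provider = root / "provider"
    app_dir = provider / "app-name"
    app_dir.mkdir(parents=True)
    (provider / "shared_helper.py").write_text("def func():\n    return 'shared'\n")
    # Stand-ins for files that belt app init would normally scaffold.
    (app_dir / "__init__.py").write_text("from .inference import App\n")
    (app_dir / "inference.py").write_text(
        "from .shared_helper import func\n\nclass App:\n    pass\n"
    )
    return link_shared_helper(app_dir)


demo_root = Path(tempfile.mkdtemp())
demo_link = build_demo(demo_root)
print(os.readlink(demo_link))  # the relative target stored in the symlink
```

Because the link is relative, every sibling app can point at the same ../shared_helper.py and a fix to the helper reaches all of them without copying.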
CLI Installation
curl -fsSL https://cli.inference.sh | sh
belt update
belt login
belt me
Quick Start
Scaffold new apps with belt app init (see Rules above). It generates the correct project structure, inf.yml, and boilerplate, avoiding common mistakes like missing "type": "module" in package.json or incorrect kernel names.
belt app init my-app
belt app init my-app --lang node
Development Workflow (mandatory)
Every app MUST go through this full cycle. Do not skip steps.
1. Scaffold
belt app init my-app
2. Implement
Edit the scaffolded inference.py (or inference.js), inf.yml, and requirements.txt (or package.json).
3. Test Locally
cd my-app
belt app test --save-example
belt app test
belt app test --input '{"prompt": "hello"}'
4. Deploy
cd my-app
belt app deploy --dry-run
belt app deploy
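Since shell cwd does not persist between tool calls, one option when driving belt from a script is to pass the working directory explicitly instead of relying on an earlier cd. The sketch below assumes nothing about belt beyond the commands shown above; run_in_app_dir is a hypothetical helper, and the demo runs the Python interpreter as a stand-in command so it works without the CLI installed.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_app_dir(args: list[str], app_dir: str) -> subprocess.CompletedProcess:
    """Run a command with app_dir as its working directory.

    Raises if the directory has no inf.yml, so a wrong path fails before
    anything is deployed or tested.
    """
    app_path = Path(app_dir).resolve()
    if not (app_path / "inf.yml").is_file():
        raise FileNotFoundError(f"no inf.yml in {app_path}")
    return subprocess.run(args, cwd=app_path, capture_output=True, text=True, check=True)


# Demo with a stand-in command; real use would look like
# run_in_app_dir(["belt", "app", "deploy", "--dry-run"], "my-app")
demo_app = Path(tempfile.mkdtemp()) / "my-app"
demo_app.mkdir()
(demo_app / "inf.yml").write_text("name: my-app\n")
demo_result = run_in_app_dir(
    [sys.executable, "-c", "import os; print(os.getcwd())"], str(demo_app)
)
print(demo_result.stdout.strip())  # the child process's working directory
```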
5. Cloud Test & Verify
After deploying, test the live version and verify output_meta is present in the response:
belt app run user/app --json --input '{"prompt": "hello"}'
Check the JSON response for output_meta. If it's missing, the output class is likely extending BaseModel instead of BaseAppOutput.
belt app run user/app --input input.json
belt app sample user/app
belt app sample user/app --save input.json
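The output_meta check in this step can be automated once the JSON response is captured. This is a rough sketch under one stated assumption: the exact shape of the --json output is not documented here, so the helper searches the whole parsed document for the key rather than expecting it at a fixed path. Both sample payloads are hypothetical.

```python
import json


def has_output_meta(payload) -> bool:
    """Return True if a non-empty 'output_meta' key appears anywhere in the parsed JSON."""
    if isinstance(payload, dict):
        if payload.get("output_meta"):
            return True
        return any(has_output_meta(v) for v in payload.values())
    if isinstance(payload, list):
        return any(has_output_meta(v) for v in payload)
    return False


# Hypothetical response bodies; the real output of `belt app run --json` may differ.
good = json.loads('{"output": {"image": "out.png", "output_meta": {"outputs": [{"width": 512}]}}}')
bad = json.loads('{"output": {"image": "out.png"}}')
print(has_output_meta(good), has_output_meta(bad))  # prints: True False
```

If the CLI prints a single JSON document to stdout (an assumption), the same helper can be fed from json.load(sys.stdin) in a pipeline.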
App Structure
Python
from inferencesh import BaseApp, BaseAppInput, BaseAppOutput
from pydantic import Field


class AppSetup(BaseAppInput):
    """Setup parameters; changing them triggers re-init"""
    model_id: str = Field(default="gpt2", description="Model to load")


class AppInput(BaseAppInput):
    prompt: str = Field(description="Input prompt")


class AppOutput(BaseAppOutput):
    result: str = Field(description="Output result")


class App(BaseApp):
    async def setup(self, config: AppSetup):
        """Runs once when worker starts or config changes"""
        # load_model is a placeholder for your own model-loading code
        self.model = load_model(config.model_id)

    async def run(self, input_data: AppInput) -> AppOutput:
        """Default function; runs for each request"""
        self.logger.info(f"Processing prompt: {input_data.prompt[:50]}")
        result = self.model.generate(input_data.prompt)
        self.logger.info("Generation complete")
        return AppOutput(result=result)

    async def unload(self):
        """Cleanup on shutdown"""
        pass

    async def on_cancel(self):
        """Called when user cancels; for long-running tasks"""
        return True
Node.js
import { z } from "zod";

export const AppSetup = z.object({
  modelId: z.string().default("gpt2").describe("Model to load"),
});

export const RunInput = z.object({
  prompt: z.string().describe("Input prompt"),
});

export const RunOutput = z.object({
  result: z.string().describe("Output result"),
});

export class App {
  async setup(config) {
    // loadModel is a placeholder for your own model-loading code
    this.model = loadModel(config.modelId);
  }

  async run(inputData) {
    return { result: "done" };
  }

  async unload() {}

  async onCancel() {
    return true;
  }
}
Multi-Function Apps
Apps can expose multiple functions with different input/output schemas. Functions are auto-discovered.
Python: Add methods with type-hinted Pydantic input/output models.
Node.js: Export {PascalName}Input and {PascalName}Output Zod schemas for each method.
Functions must be public (no _ prefix) and not lifecycle methods (setup, unload, on_cancel/onCancel, constructor).
Call via API with "function": "method_name" in the request body. Set default_function in inf.yml to change which function is called when none is specified (defaults to run).
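The discovery rules above (public name, not a lifecycle method) can be pictured as a filter over a class's methods. The sketch below is not the platform's actual discovery code: discover_functions, resolve_function, and DemoApp are hypothetical, and a plain class stands in for a BaseApp subclass so the example runs without the inferencesh package.

```python
import inspect

LIFECYCLE = {"setup", "unload", "on_cancel"}


def discover_functions(app_cls) -> list[str]:
    """List methods that would count as callable functions under the rules above."""
    names = []
    for name, _member in inspect.getmembers(app_cls, predicate=inspect.isfunction):
        if name.startswith("_"):  # private helpers and dunders such as __init__
            continue
        if name in LIFECYCLE:  # lifecycle hooks are never exposed
            continue
        names.append(name)
    return names


def resolve_function(app_cls, requested=None, default_function="run") -> str:
    """Pick the function for a request, falling back to the default."""
    name = requested or default_function
    if name not in discover_functions(app_cls):
        raise ValueError(f"unknown function: {name}")
    return name


class DemoApp:
    """Plain class standing in for a BaseApp subclass."""

    async def setup(self, config): ...
    async def run(self, input_data): ...
    async def upscale(self, input_data): ...
    async def _resize(self, image): ...
    async def unload(self): ...


print(discover_functions(DemoApp))  # prints: ['run', 'upscale']
```

Here resolve_function(DemoApp) falls back to run, mirroring what happens when a request carries no "function" field and default_function is left unset.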
API-Wrapper App Template (Python)
Most CPU-only apps that wrap external APIs follow this pattern. Use this as a starting point:
import os

import httpx
from inferencesh import BaseApp, BaseAppInput, BaseAppOutput, File
from inferencesh.models.usage import OutputMeta, ImageMeta
from PIL import Image  # provided by the Pillow package
from pydantic import Field


class AppInput(BaseAppInput):
    prompt: str = Field(description="Input prompt")


class AppOutput(BaseAppOutput):
    image: File = Field(description="Generated image")


class App(BaseApp):
    async def setup(self, config):
        self.api_key = os.environ["API_KEY"]
        self.client = httpx.AsyncClient(timeout=120)

    async def run(self, input_data: AppInput) -> AppOutput:
        self.logger.info(f"Calling API with prompt: {input_data.prompt[:80]}")
        response = await self.client.post(
            "https://api.example.com/generate",
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={"prompt": input_data.prompt},
        )
        response.raise_for_status()

        # Save the returned bytes to disk so they can be wrapped in a File
        output_path = "/tmp/output.png"
        with open(output_path, "wb") as f:
            f.write(response.content)

        # Read the image dimensions for usage metadata
        with Image.open(output_path) as img:
            width, height = img.size
        self.logger.info(f"Generated {width}x{height} image")

        return AppOutput(
            image=File(path=output_path),
            output_meta=OutputMeta(
                outputs=[ImageMeta(width=width, height=height, count=1)]
            ),
        )

    async def unload(self):
        await self.client.aclose()
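The Rules section asks for visibility into request/response timing in API-wrapping apps. One way to get it, sketched here with the standard library only, is a small async context manager around the remote call. The names log_timing and fake_remote_call are hypothetical; in an app the logger would be self.logger and the awaited call would be self.client.post(...).

```python
import asyncio
import logging
import time
from contextlib import asynccontextmanager

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("app")  # stand-in for self.logger


@asynccontextmanager
async def log_timing(log: logging.Logger, label: str):
    """Log the start, duration, and failure of an awaited block."""
    start = time.perf_counter()
    log.info("%s: started", label)
    try:
        yield
    except Exception:
        log.exception("%s: failed after %.2fs", label, time.perf_counter() - start)
        raise
    else:
        log.info("%s: finished in %.2fs", label, time.perf_counter() - start)


async def fake_remote_call(prompt: str) -> bytes:
    """Placeholder for the real HTTP request; sleeps instead of using the network."""
    await asyncio.sleep(0.05)
    return prompt.encode()


async def demo() -> bytes:
    async with log_timing(logger, "generate request"):
        return await fake_remote_call("hello")


demo_bytes = asyncio.run(demo())
```

Logging the failure path as well as the success path matters for wrappers, because a timeout on the remote side is otherwise indistinguishable from a slow request.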
Configuring Resources (inf.yml)
Project Structure
Python:
my-app/
├── inf.yml # Configuration
├── inference.py # App logic
├── requirements.txt # Python packages (pip)
└── packages.txt # System packages (apt), optional
Node.js:
my-app/
├── inf.yml # Configuration
├── src/
│   └── inference.js # App logic
├── package.json # Node.js packages (npm/pnpm)
└── packages.txt # System packages (apt), optional
inf.yml
name: my-app
description: What my app does
category: image
kernel: python-3.11
resources:
  gpu:
    count: 1
    vram: 24
    type: any
  ram: 32
env:
  MODEL_NAME: gpt-4
secrets:
  - key: HF_TOKEN
    description: HuggingFace token for gated models
    optional: false
integrations:
  - key: google.sheets
    description: Access to Google Sheets
    optional: true
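A setup() method can fail early with a clear message when a required secret is absent. The sketch below rests on two assumptions that this document only implies: that declared secrets reach the app as environment variables (the API-wrapper template reads os.environ["API_KEY"]), and that an entry without an optional field is required. missing_required_secrets and WANDB_KEY are hypothetical.

```python
import os


def missing_required_secrets(declared: list[dict], env=None) -> list[str]:
    """Return keys of non-optional secrets that are absent or empty in env."""
    env = os.environ if env is None else env
    return [
        item["key"]
        for item in declared
        if not item.get("optional", False) and not env.get(item["key"])
    ]


# Mirrors the shape of a secrets block in inf.yml, written as Python data
# to avoid a YAML dependency.
declared_secrets = [
    {"key": "HF_TOKEN", "description": "HuggingFace token for gated models", "optional": False},
    {"key": "WANDB_KEY", "description": "hypothetical optional secret", "optional": True},
]

print(missing_required_secrets(declared_secrets, env={}))  # prints: ['HF_TOKEN']
print(missing_required_secrets(declared_secrets, env={"HF_TOKEN": "hf_xxx"}))  # prints: []
```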
Resource Units
CLI auto-converts human-friendly values:
- < 1000 → GB (e.g., 80 = 80GB)
- 1000 to 1B → MB
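The two ranges above can be written as a small lookup, which makes the boundary at 1000 explicit. This is an illustration of the stated rules, not the CLI's implementation: interpret_resource_value is a hypothetical name, treating both ends of "1000 to 1B" as inclusive is an assumption, and values outside the listed ranges are rejected rather than guessed.

```python
def interpret_resource_value(value: int) -> tuple[int, str]:
    """Apply the two documented ranges; anything else is rejected rather than guessed."""
    if value < 0:
        raise ValueError("resource values must be non-negative")
    if value < 1000:
        return value, "GB"
    if value <= 1_000_000_000:
        return value, "MB"
    raise ValueError("values above 1B are not covered by the rules listed here")


for v in (24, 80, 24000):
    print(v, interpret_resource_value(v))
```

Under this reading, vram: 24 and vram: 24000 describe roughly the same amount of memory, one given in GB and the other in MB.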
GPU Types
any | nvidia | amd | apple | none
Note: Currently only NVIDIA CUDA GPUs are supported.
Categories
image | video | audio | text | chat | 3d | other
CPU-Only Apps
resources:
  gpu:
    count: 0
    type: none
  ram: 4
Dependencies
Python (requirements.txt):
torch>=2.0
transformers
accelerate
Node.js (package.json):
{
  "type": "module",
  "dependencies": {
    "zod": "^3.23.0",
    "sharp": "^0.33.0"
  }
}
System packages (packages.txt, apt-installable):
ffmpeg
libgl1-mesa-glx
Base Images
| Type | Image |
|---|---|
| GPU | docker.inference.sh/gpu:latest-cuda |
| CPU | docker.inference.sh/cpu:latest |
GPU Apps
Always use accelerate for device detection, because torch.cuda.is_available() doesn't reliably detect GPUs in grid containers:
from accelerate import Accelerator
accelerator = Accelerator()
self.device = accelerator.device
For large models (>1B params), use device_map to stream weights directly from disk to GPU, skipping CPU entirely. This is 7x faster than from_pretrained followed by .to():
# Fast: weights stream straight to the GPU
self.model = AutoModel.from_pretrained("org/model", dtype=torch.bfloat16, device_map=str(self.device))
# Slow: loads to CPU first, then copies to the GPU
self.model = SomeModel.from_pretrained("org/model")
self.model = self.model.to(device=self.device, dtype=torch.float16)
Remember to add accelerate to requirements.txt.
Reference Files
Load the appropriate reference file based on the language and topic:
App Logic & Schemas
Debugging, Optimization & Cancellation
Secrets & OAuth
Usage Tracking
CLI
Resources