Run any Skill in Manus with one click

$pwd:

fal-serverless-guide

Name: Fal Serverless Guide
Author: JosiahSiegel

// Complete fal.ai serverless deployment system. PROACTIVELY activate for: (1) Creating fal.App class, (2) GPU machine selection (T4/A10G/A100/H100), (3) setup() for model loading, (4) @fal.endpoint decorators, (5) Persistent volumes for weights, (6) Secrets management, (7) Scaling configuration (min/max concurrency), (8) Multi-GPU deployment, (9) fal deploy commands, (10) Local development with fal run. Provides: App structure, Dockerfile patterns, deployment commands, scaling config. Ensures production-ready serverless ML deployment.

Run Skill in Manus

$ git log --oneline --stat

stars:39

forks:7

updated:May 20, 2026 at 18:07

SKILL.md

readonly

related-skills.json

same repository

ml-azureml-adf-automation.md

from "JosiahSiegel/claude-plugin-marketplace"

This skill should be used when the user asks to automate Azure ML and Azure Data Factory production workflows. PROACTIVELY activate for: (1) Azure ML code asset registration, azure-ai-ml SDK, AML code versions, `result.version`, requested-vs-actual versions, (2) ADF to Azure ML orchestration, ADF WebActivity, managed identity blob reads, `connectVia`, managed VNet IR, (3) code asset version pointer blobs, latest.json contracts, training/scoring code version propagation, (4) private storage firewalls, Microsoft-hosted CI agents, temporary network rules, storage data-plane smoke tests, (5) marshmallow<4 pinning, AML SDK import failures, runtime validation for Azure ML infrastructure. Provides: operationally safe Azure ML + ADF automation patterns.

2026-05-2839

ml-cloud-deployment.md

from "JosiahSiegel/claude-plugin-marketplace"

This skill should be used when the user asks to deploy, scale, or cost-optimize ML workloads on cloud platforms. PROACTIVELY activate for: (1) AWS SageMaker Studio, Training, Processing, Pipelines, Endpoints, Model Monitor, Feature Store, Clarify, Ground Truth, EC2 GPUs, EKS, Lambda, Inferentia, Trainium, (2) GCP Vertex AI Training, Pipelines, Endpoints, Feature Store, Model Monitoring, AutoML, Matching Engine, TPU, GKE, Cloud Run, (3) Azure ML workspaces, pipelines, managed endpoints, AutoML, Responsible ML, AKS/ACI, (4) Databricks, Modal, Replicate, RunPod, Lambda Labs, Anyscale. Provides: cloud ML architecture, autoscaling, hardware, security, and cost guidance.

2026-05-2839

ml-data-pipeline.md

from "JosiahSiegel/claude-plugin-marketplace"

This skill should be used when the user asks to ingest, clean, validate, transform, version, monitor, or serve ML data and features. PROACTIVELY activate for: (1) data ingestion, preprocessing, feature engineering, leakage prevention, train/serving skew, (2) Spark, Dask, Polars, pandas, Ray Data, streaming pipelines, (3) Great Expectations, TFDV, Deequ, data quality and validation, (4) DVC, lakehouse tables, dataset versioning, lineage, reproducibility, (5) Feast, Tecton, Hopsworks feature stores, point-in-time joins, online/offline features. Provides: scalable, reproducible, leakage-safe ML data pipeline design.

2026-05-2839

ml-mlops.md

from "JosiahSiegel/claude-plugin-marketplace"

This skill should be used when the user asks to productionize, track, version, govern, monitor, or automate ML systems. PROACTIVELY activate for: (1) MLflow, Weights & Biases, Neptune, Comet, ClearML experiment tracking, (2) model registry, model versioning, artifact lineage, reproducibility, (3) Kubeflow, SageMaker Pipelines, Vertex AI Pipelines, Azure ML pipelines, Databricks workflows, (4) CI/CD, continuous training/evaluation, A/B tests, canary/shadow deployments, (5) drift detection, model monitoring, data validation, responsible AI governance. Provides: end-to-end MLOps architecture and operational safeguards.

2026-05-2839

competition-workflows.md

from "JosiahSiegel/claude-plugin-marketplace"

Kaggle competition notebook workflows and submissions. PROACTIVELY activate for: (1) submitting notebook outputs to competitions, (2) `kaggle competitions submit -k`, (3) downloading competition data, (4) validating submission.csv format, (5) leakage review, (6) cross-validation split design, (7) public leaderboard overfitting concerns, (8) competition rule compliance, (9) reproducible top-to-bottom notebook execution, (10) fold-aware preprocessing for ML pipelines. Provides: submission commands, validation checklist, leakage controls, and competition-ready notebook guidance.

2026-05-2739

datasets-models-sources.md

from "JosiahSiegel/claude-plugin-marketplace"

Kaggle datasets, models, sources, and kagglehub workflows. PROACTIVELY activate for: (1) downloading datasets with kagglehub, (2) uploading datasets, (3) downloading or uploading Kaggle models, (4) competition_download, (5) notebook_output_download, (6) choosing Kaggle CLI vs kagglehub, (7) attaching dataset_sources, competition_sources, kernel_sources, or model_sources, (8) model artifact transfer, (9) source dependency cleanup, (10) kagglehub limitations for notebooks. Provides: dataset/model transfer patterns, source attachment guidance, tool selection, and limitation checks.

2026-05-2739

package.json

"author": "JosiahSiegel"

"repository": "JosiahSiegel/claude-plugin-marketplace"

View GitHub Repository View Creator Repositories

$ install --global

$ download --local

Run Skill in Manus

$ useful --forSOC

Software DevelopersComputer and Mathematical Occupations15-1252L4

name

fal-serverless-guide

description

Complete fal.ai serverless deployment system. PROACTIVELY activate for: (1) Creating fal.App class, (2) GPU machine selection (T4/A10G/A100/H100), (3) setup() for model loading, (4) @fal.endpoint decorators, (5) Persistent volumes for weights, (6) Secrets management, (7) Scaling configuration (min/max concurrency), (8) Multi-GPU deployment, (9) fal deploy commands, (10) Local development with fal run. Provides: App structure, Dockerfile patterns, deployment commands, scaling config. Ensures production-ready serverless ML deployment.

Quick Reference

Machine Type	GPU	VRAM	Use Case
`GPU-T4`	T4	16GB	Dev, small models
`GPU-A10G`	A10G	24GB	7B-13B models
`GPU-A100`	A100	40/80GB	13B-70B models
`GPU-H100`	H100	80GB	Cutting-edge

App Attribute	Purpose	Example
`machine_type`	GPU selection	`"GPU-A100"`
`requirements`	Dependencies	`["torch", "transformers"]`
`keep_alive`	Warm duration	`300` (5 min)
`min_concurrency`	Min instances	`0` (scale to zero)
`max_concurrency`	Max parallel	`4`

Command	Purpose
`fal deploy app.py::MyApp`	Deploy to fal
`fal run app.py::MyApp`	Run locally
`fal logs <app-id>`	View logs
`fal secrets set KEY=value`	Set secrets

When to Use This Skill

Use for custom model deployment:

Deploying custom ML models on fal infrastructure
Configuring GPU instances and scaling
Setting up persistent storage for model weights
Creating multi-endpoint apps
Managing secrets and environment variables

Related skills:

For API integration: see fal-api-reference
For optimization: see fal-optimization
For using hosted models: see fal-model-guide

fal.ai Serverless Deployment Guide

Complete guide to deploying custom ML models on fal.ai's serverless infrastructure.

Overview

fal serverless provides:

Automatic scaling from zero to thousands of instances
GPU support (T4, A10G, A100, H100, H200, B200)
Persistent storage for model weights
Secrets management
Real-time logs and monitoring
Pay-per-use pricing

Installation

pip install fal

Authentication

# Login to fal
fal auth login

# Or set API key
export FAL_KEY="your-api-key"

Basic App Structure

import fal
from pydantic import BaseModel

class RequestModel(BaseModel):
    """Input schema for your endpoint"""
    prompt: str
    max_tokens: int = 100

class ResponseModel(BaseModel):
    """Output schema for your endpoint"""
    text: str
    tokens: int

class MyApp(fal.App):
    # Machine configuration
    machine_type = "GPU-A100"
    num_gpus = 1

    # Dependencies
    requirements = [
        "torch>=2.0.0",
        "transformers>=4.35.0",
        "accelerate"
    ]

    # Scaling configuration
    keep_alive = 300        # Keep instance warm (seconds)
    min_concurrency = 0     # Scale to zero when idle
    max_concurrency = 4     # Max concurrent requests

    def setup(self):
        """
        Called once when container starts.
        Load models and heavy resources here.
        """
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.tokenizer = AutoTokenizer.from_pretrained("model-name")
        self.model = AutoModelForCausalLM.from_pretrained(
            "model-name",
            torch_dtype=torch.float16
        ).to(self.device)

    @fal.endpoint("/predict")
    def predict(self, request: RequestModel) -> ResponseModel:
        """
        Main inference endpoint.
        Called for each request.
        """
        inputs = self.tokenizer(request.prompt, return_tensors="pt")
        inputs = inputs.to(self.device)

        outputs = self.model.generate(
            **inputs,
            max_new_tokens=request.max_tokens
        )

        text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)

        return ResponseModel(text=text, tokens=len(outputs[0]))

    @fal.endpoint("/health")
    def health(self):
        """Health check endpoint"""
        return {"status": "healthy", "device": self.device}

    def teardown(self):
        """Called when container shuts down (optional)"""
        if hasattr(self, 'model'):
            del self.model
        import torch
        torch.cuda.empty_cache()

Machine Types

Type	GPU	VRAM	Use Case
`CPU`	None	-	Preprocessing, lightweight
`GPU-T4`	NVIDIA T4	16GB	Development, small models
`GPU-A10G`	NVIDIA A10G	24GB	Medium models (7B-13B)
`GPU-A100`	NVIDIA A100	40/80GB	Large models (13B-70B)
`GPU-H100`	NVIDIA H100	80GB	Cutting-edge performance
`GPU-H200`	NVIDIA H200	141GB	Very large models
`GPU-B200`	NVIDIA B200	192GB	Frontier models (100B+)

Multi-GPU Configuration

class MultiGPUApp(fal.App):
    machine_type = "GPU-H100"
    num_gpus = 4  # Use 4 H100s

    def setup(self):
        import torch
        from transformers import AutoModelForCausalLM

        self.model = AutoModelForCausalLM.from_pretrained(
            "meta-llama/Llama-2-70b-hf",
            torch_dtype=torch.float16,
            device_map="auto"  # Distribute across GPUs
        )

Persistent Storage

Use volumes to persist data across restarts:

class AppWithStorage(fal.App):
    machine_type = "GPU-A100"
    requirements = ["torch", "transformers"]

    # Define persistent volumes
    volumes = {
        "/data": fal.Volume("model-cache"),
        "/outputs": fal.Volume("generated-outputs")
    }

    def setup(self):
        import os
        from transformers import AutoModel

        cache_dir = "/data/models"
        os.makedirs(cache_dir, exist_ok=True)

        # Model weights persist across cold starts
        self.model = AutoModel.from_pretrained(
            "large-model",
            cache_dir=cache_dir
        )

    @fal.endpoint("/generate")
    def generate(self, request):
        output_path = "/outputs/result.png"
        # Save to persistent storage
        return {"path": output_path}

Secrets Management

# Set secrets via CLI
fal secrets set HF_TOKEN=hf_xxx API_KEY=sk_xxx

# List secrets
fal secrets list

# Delete secret
fal secrets delete HF_TOKEN

import os

class SecureApp(fal.App):
    def setup(self):
        # Access secrets as environment variables
        hf_token = os.environ["HF_TOKEN"]

        from huggingface_hub import login
        login(token=hf_token)

        # Now can access gated models
        self.model = load_gated_model()

Deployment Commands

# Deploy application
fal deploy app.py::MyApp

# Deploy with options
fal deploy app.py::MyApp \
  --machine-type GPU-A100 \
  --num-gpus 2 \
  --min-concurrency 1 \
  --max-concurrency 8

# View deployments
fal list

# View logs
fal logs <app-id>

# View real-time logs
fal logs <app-id> --follow

# Delete deployment
fal delete <app-id>

# Run locally for testing
fal run app.py::MyApp

Advanced Patterns

Image Generation App

import fal
from pydantic import BaseModel
from typing import Optional
import io

class ImageRequest(BaseModel):
    prompt: str
    negative_prompt: Optional[str] = None
    width: int = 1024
    height: int = 1024
    steps: int = 28
    seed: Optional[int] = None

class ImageResponse(BaseModel):
    image_url: str
    seed: int

class ImageGenerator(fal.App):
    machine_type = "GPU-A100"
    requirements = [
        "torch",
        "diffusers",
        "transformers",
        "accelerate",
        "safetensors"
    ]
    keep_alive = 600
    max_concurrency = 2

    volumes = {
        "/data": fal.Volume("diffusion-models")
    }

    def setup(self):
        import torch
        from diffusers import StableDiffusionXLPipeline

        self.pipe = StableDiffusionXLPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0",
            torch_dtype=torch.float16,
            cache_dir="/data/models"
        ).to("cuda")

        # Optimize
        self.pipe.enable_model_cpu_offload()

    @fal.endpoint("/generate")
    def generate(self, request: ImageRequest) -> ImageResponse:
        import torch
        import random

        seed = request.seed or random.randint(0, 2**32 - 1)
        generator = torch.Generator("cuda").manual_seed(seed)

        image = self.pipe(
            prompt=request.prompt,
            negative_prompt=request.negative_prompt,
            width=request.width,
            height=request.height,
            num_inference_steps=request.steps,
            generator=generator
        ).images[0]

        # Save and upload to CDN
        path = f"/tmp/output_{seed}.png"
        image.save(path)
        url = fal.upload_file(path)

        return ImageResponse(image_url=url, seed=seed)

Streaming Response

import fal
from typing import Generator

class StreamingApp(fal.App):
    machine_type = "GPU-A100"
    requirements = ["torch", "transformers"]

    def setup(self):
        from transformers import AutoModelForCausalLM, AutoTokenizer
        self.tokenizer = AutoTokenizer.from_pretrained("model")
        self.model = AutoModelForCausalLM.from_pretrained("model")

    @fal.endpoint("/stream")
    def stream(self, prompt: str) -> Generator[str, None, None]:
        from transformers import TextIteratorStreamer
        from threading import Thread

        inputs = self.tokenizer(prompt, return_tensors="pt").to("cuda")
        streamer = TextIteratorStreamer(self.tokenizer, skip_prompt=True)

        thread = Thread(
            target=self.model.generate,
            kwargs={**inputs, "streamer": streamer, "max_new_tokens": 256}
        )
        thread.start()

        for text in streamer:
            yield text

Background Tasks

import fal
from typing import Optional
import asyncio

class BackgroundApp(fal.App):
    machine_type = "GPU-A100"

    @fal.endpoint("/process")
    async def process(self, data: str) -> dict:
        # Submit background work
        task_id = await self.start_background_task(data)
        return {"task_id": task_id, "status": "processing"}

    async def start_background_task(self, data: str) -> str:
        # Implement your background logic
        import uuid
        task_id = str(uuid.uuid4())
        # Save task to queue/database
        return task_id

Multiple Endpoints

import fal
from pydantic import BaseModel

class TextRequest(BaseModel):
    text: str

class ImageRequest(BaseModel):
    image_url: str

class MultiModalApp(fal.App):
    machine_type = "GPU-A100"
    requirements = ["torch", "transformers", "Pillow"]

    def setup(self):
        self.text_model = self.load_text_model()
        self.vision_model = self.load_vision_model()

    @fal.endpoint("/analyze-text")
    def analyze_text(self, request: TextRequest) -> dict:
        result = self.text_model(request.text)
        return {"analysis": result}

    @fal.endpoint("/analyze-image")
    def analyze_image(self, request: ImageRequest) -> dict:
        result = self.vision_model(request.image_url)
        return {"analysis": result}

    @fal.endpoint("/")
    def info(self) -> dict:
        return {
            "name": "MultiModal Analyzer",
            "endpoints": ["/analyze-text", "/analyze-image"]
        }

    @fal.endpoint("/health")
    def health(self) -> dict:
        return {"status": "healthy"}

Scaling Configuration

class ScaledApp(fal.App):
    machine_type = "GPU-A100"

    # Scaling options
    min_concurrency = 0     # Scale to zero (cost savings)
    max_concurrency = 10    # Max parallel requests
    keep_alive = 300        # Keep warm for 5 minutes

    # For always-on endpoints
    # min_concurrency = 1   # Always have one instance ready

Concurrency Guidelines

GPU Memory per Request	Suggested max_concurrency
< 4GB	8-10
4-8GB	4-6
8-16GB	2-4
16-40GB	1-2
> 40GB	1

Error Handling

import fal
from pydantic import BaseModel

class MyApp(fal.App):
    @fal.endpoint("/predict")
    def predict(self, request: dict):
        try:
            result = self.process(request)
            return {"result": result}
        except ValueError as e:
            # Client error
            raise fal.HTTPException(400, f"Invalid input: {e}")
        except RuntimeError as e:
            # Server error
            raise fal.HTTPException(500, f"Processing failed: {e}")
        except Exception as e:
            # Unexpected error
            raise fal.HTTPException(500, "Internal error")

Local Development

# Run locally
fal run app.py::MyApp

# Test endpoint
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"prompt": "test"}'

# Run with environment variables
FAL_KEY=xxx HF_TOKEN=yyy fal run app.py::MyApp

Calling Deployed Apps

JavaScript/TypeScript

import { fal } from "@fal-ai/client";

fal.config({ credentials: process.env.FAL_KEY });

const result = await fal.subscribe("your-username/your-app/predict", {
  input: {
    prompt: "Hello world",
    max_tokens: 100
  }
});

Python

import fal_client

result = fal_client.subscribe(
    "your-username/your-app/predict",
    arguments={
        "prompt": "Hello world",
        "max_tokens": 100
    }
)

REST API

curl -X POST "https://queue.fal.run/your-username/your-app/predict" \
  -H "Authorization: Key $FAL_KEY" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello world", "max_tokens": 100}'

Best Practices

Load models in setup()
- Heavy initialization once, not per request
- Use persistent volumes for large weights
Use appropriate machine type
- Match GPU memory to model size
- Consider cost vs performance trade-offs
Handle cold starts
- Use keep_alive for frequently accessed endpoints
- Use min_concurrency=1 for latency-critical apps
Optimize memory
- Use fp16/bf16 where possible
- Enable memory-efficient attention
- Clear GPU cache in teardown
Monitor and debug
- Check logs regularly: fal logs <app-id> --follow
- Implement health checks
- Use structured logging
Security
- Use secrets for API keys
- Validate all inputs
- Don't expose internal errors

Pricing

Pay per second of compute used
Different rates for different GPU types
No charge when scaled to zero
Check https://fal.ai/pricing for current rates

fal-serverless-guide

More from this repository

More from this repository

Quick Reference

When to Use This Skill

fal.ai Serverless Deployment Guide

Overview

Installation

Authentication

Basic App Structure

Machine Types

Multi-GPU Configuration

Persistent Storage

Secrets Management

Deployment Commands

Advanced Patterns

Image Generation App

Streaming Response

Background Tasks

Multiple Endpoints

Scaling Configuration

Concurrency Guidelines

Error Handling

Local Development

Calling Deployed Apps

JavaScript/TypeScript

Python

REST API

Best Practices

Pricing

Quick Reference

When to Use This Skill

fal.ai Serverless Deployment Guide

Overview

Installation

Authentication

Basic App Structure

Machine Types

Multi-GPU Configuration

Persistent Storage

Secrets Management

Deployment Commands

Advanced Patterns

Image Generation App

Streaming Response

Background Tasks

Multiple Endpoints

Scaling Configuration

Concurrency Guidelines

Error Handling

Local Development

Calling Deployed Apps

JavaScript/TypeScript

Python

REST API

Best Practices

Pricing