Name: Add Inference Backend
Author: intel

// Add a new hardware inference backend to AutoRound for deploying quantized models (e.g., CUDA/Marlin, Triton, CPU, HPU, ARK). Use when implementing QuantLinear kernels, registering backend capabilities, or enabling quantized model inference on a new hardware platform.


stars:1,425

forks:134

updated:May 11, 2026 at 02:30


package.json

"author": "intel"

"repository": "intel/auto-round"


name	add-inference-backend
description	Add a new hardware inference backend to AutoRound for deploying quantized models (e.g., CUDA/Marlin, Triton, CPU, HPU, ARK). Use when implementing QuantLinear kernels, registering backend capabilities, or enabling quantized model inference on a new hardware platform.

Adding a New Inference Backend to AutoRound

Overview

This skill guides you through adding a new inference backend for running quantized models on a specific hardware platform. A backend defines how quantized weights are unpacked and computed at inference time. AutoRound automatically selects the best available backend based on hardware, quantization config, and priority.
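The selection rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not AutoRound's actual selection code: `ToyBackend`, `pick_backend`, and the registry entries with their priorities are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyBackend:
    """Hypothetical stand-in for a registered backend entry."""

    name: str
    devices: tuple
    bits: tuple
    priority: int


def pick_backend(backends, device, bits):
    """Return the highest-priority backend that supports the request, or None."""
    candidates = [b for b in backends if device in b.devices and bits in b.bits]
    if not candidates:
        return None
    return max(candidates, key=lambda b: b.priority)


# Illustrative registry; the names and priorities are made up.
REGISTRY = [
    ToyBackend("toy:torch", ("cpu", "cuda"), (2, 4, 8), priority=0),
    ToyBackend("toy:triton", ("cuda",), (2, 4, 8), priority=2),
    ToyBackend("toy:fast_kernel", ("cuda",), (4,), priority=3),
]

chosen = pick_backend(REGISTRY, device="cuda", bits=4)
fallback = pick_backend(REGISTRY, device="cpu", bits=8)
```

Here all three entries can serve 4-bit weights on CUDA, so the one with the highest priority is chosen, while the CPU request falls back to the only entry that lists `"cpu"`.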

Prerequisites

Before starting, determine:

Target hardware: CPU (Intel/AMD), CUDA GPU, Intel XPU, Habana HPU, etc.
Supported quantization configs: Which bit-widths, group sizes, and data types your backend handles
Kernel implementation: Triton, CUDA C++, PyTorch native, or external library (e.g., GPTQModel Marlin)
Packing format: How quantized weights are stored in memory
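To make "packing format" concrete, here is a minimal sketch of one common layout: several low-bit values stored in a single 32-bit word, lowest bits first. The bit order and the helper names `pack_int32` and `unpack_int32` are assumptions for illustration; your backend's format may order or interleave values differently.

```python
def pack_int32(values, bits):
    """Pack unsigned `bits`-wide integers into 32-bit words, lowest bits first."""
    pack_factor = 32 // bits
    if len(values) % pack_factor != 0:
        raise ValueError("number of values must be a multiple of 32 // bits")
    mask = (1 << bits) - 1
    words = []
    for start in range(0, len(values), pack_factor):
        word = 0
        for i, v in enumerate(values[start : start + pack_factor]):
            if not 0 <= v <= mask:
                raise ValueError(f"value {v} does not fit in {bits} bits")
            word |= v << (i * bits)
        words.append(word)
    return words


def unpack_int32(words, bits):
    """Inverse of pack_int32: recover the individual values from each word."""
    pack_factor = 32 // bits
    mask = (1 << bits) - 1
    return [(w >> (i * bits)) & mask for w in words for i in range(pack_factor)]


# Sixteen 4-bit values fit into two 32-bit words (pack_factor = 8).
values = [1, 2, 3, 4, 5, 6, 7, 8, 15, 0, 15, 0, 15, 0, 15, 0]
packed = pack_int32(values, bits=4)
restored = unpack_int32(packed, bits=4)
```

The words here are plain unsigned Python integers. A tensor with a signed 32-bit dtype stores the same bit pattern, but words with the top bit set read back as negative numbers, so a tensor-based unpacker has to mask after shifting.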

Step 1: Register Backend Info

Edit auto_round/inference/backend.py to register your backend's capabilities:

BackendInfos["auto_round:your_backend"] = BackendInfo(
    device=["cuda"],  # Supported devices
    sym=[True, False],  # Symmetric and/or asymmetric
    packing_format=["auto_round"],  # Compatible packing formats
    bits=[2, 4, 8],  # Supported bit-widths
    group_size=[32, 64, 128, -1],  # Supported group sizes (-1 = per-channel)
    compute_dtype=["float16", "bfloat16"],  # Compute precision
    data_type=["int"],  # Quantization data types
    act_bits=[16, 32],  # Activation bit-widths (16 = WxA16)
    priority=2,  # Higher = preferred (0-5 typical range)
    checkers=[your_feature_checker],  # Validation functions (optional)
    alias=["your_backend_short"],  # Alternative names (optional)
    requirements=["some_package>=1.0"],  # Required packages (optional)
    systems=["linux"],  # OS restriction (optional)
)

BackendInfo Fields Reference

| Field | Type | Description |
| --- | --- | --- |
| `device` | `list[str]` | Hardware targets: `"cpu"`, `"cuda"`, `"xpu"`, `"hpu"` |
| `sym` | `list[bool]` | `True` for symmetric, `False` for asymmetric |
| `packing_format` | `list[str]` | How weights are packed: `"auto_round"`, `"auto_gptq"`, etc. |
| `bits` | `list[int]` | Supported weight bit-widths |
| `group_size` | `list[int]` | Group sizes; `-1` means per-channel |
| `compute_dtype` | `list[str]` | Compute precision during inference |
| `data_type` | `list[str]` | Quantization data types: `"int"`, `"nv_fp"`, `"mx_fp"` |
| `act_bits` | `list[int]` | Activation bits: `[16, 32]` for weight-only, `[8]` for W8A8 |
| `priority` | `int` | Selection priority (higher wins when multiple backends match) |
| `checkers` | `list[Callable]` | Functions to validate layer compatibility |
| `alias` | `list[str]` | Alternative names for CLI/API usage |
| `requirements` | `list[str]` | pip-installable dependency specifications |
| `systems` | `list[str]` | OS names: `"linux"`, `"windows"`, `"darwin"` |
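Most of these fields are lists of allowed values, so a layer's settings are compatible when each one appears in the matching list. The sketch below illustrates that reading with plain dictionaries. The `supports` helper and the dictionary layout are hypothetical and are not how AutoRound stores or checks this information internally.

```python
def supports(info, layer_config):
    """Return True if every requested setting appears in the backend's allowed list."""
    for field in ("bits", "group_size", "sym", "data_type", "act_bits"):
        if layer_config[field] not in info[field]:
            return False
    return True


# Allowed values, mirroring the list-valued fields in the table above.
toy_info = {
    "bits": [2, 4, 8],
    "group_size": [32, 64, 128, -1],
    "sym": [True, False],
    "data_type": ["int"],
    "act_bits": [16, 32],
}

# Weight-only 4-bit config: every setting is in the allowed lists.
w4a16 = {"bits": 4, "group_size": 128, "sym": True, "data_type": "int", "act_bits": 16}
# W8A8 config: act_bits=8 is not listed, so this backend should be skipped.
w8a8 = {"bits": 8, "group_size": -1, "sym": True, "data_type": "int", "act_bits": 8}

w4a16_ok = supports(toy_info, w4a16)
w8a8_ok = supports(toy_info, w8a8)
```

Listing only what your kernel really handles keeps unsupported configurations away from it, so they fall through to another backend instead of failing at run time.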

Checker Functions

Use these pre-built checkers or create your own:

# Require in_features and out_features divisible by 32
from auto_round.inference.backend import feature_multiply_checker_32

# Require in_features divisible by group_size
from auto_round.inference.backend import in_feature_checker_group_size


# Custom checker
def your_feature_checker(in_feature, out_feature, config):
    """Check if layer dimensions are compatible with your backend."""
    return in_feature % 64 == 0 and out_feature % 64 == 0 and config["group_size"] in [64, 128]
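If you need several checkers that differ only in their constants, `functools.partial` lets you derive them from one generic function while keeping the `(in_feature, out_feature, config)` call shape shown above. The names `multiple_of_checker`, `multiple_of_64_checker`, and `in_128_out_16_checker` are hypothetical and only illustrate the pattern.

```python
import functools


def multiple_of_checker(in_feature, out_feature, config, in_multiple, out_multiple=None):
    """Check that the layer dimensions are multiples of the given values."""
    if out_multiple is None:
        out_multiple = in_multiple
    return in_feature % in_multiple == 0 and out_feature % out_multiple == 0


# Specialised checkers keep the (in_feature, out_feature, config) signature.
multiple_of_64_checker = functools.partial(multiple_of_checker, in_multiple=64)
in_128_out_16_checker = functools.partial(multiple_of_checker, in_multiple=128, out_multiple=16)

config = {"bits": 4, "group_size": 128}
accepted = multiple_of_64_checker(256, 512, config)
rejected = multiple_of_64_checker(96, 512, config)
```

The resulting callables can be listed in `checkers=[...]` like any hand-written checker function.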

Step 2: Implement QuantLinear Module

Create auto_round_extension/your_device/qlinear_your_backend.py:

import torch
import torch.nn as nn

QUANT_TYPE = "your_backend"


class QuantLinear(nn.Module):
    """Quantized linear layer for your backend.

    Stores packed quantized weights and performs dequantize-then-matmul
    (or fused quantized matmul) at inference time.
    """

    QUANT_TYPE = QUANT_TYPE

    def __init__(self, bits, group_size, in_features, out_features, bias=True, sym=True, **kwargs):
        super().__init__()
        self.bits = bits
        self.group_size = group_size
        self.in_features = in_features
        self.out_features = out_features
        self.sym = sym

        # Register packed weight buffers
        # Example: INT4 packed into INT32
        pack_factor = 32 // bits
        self.register_buffer(
            "qweight",
            torch.zeros(in_features // pack_factor, out_features, dtype=torch.int32),
        )
        self.register_buffer(
            "scales",
            torch.zeros(
                (in_features // group_size, out_features),
                dtype=torch.float16,
            ),
        )
        if not sym:
            self.register_buffer(
                "qzeros",
                torch.zeros(
                    (in_features // group_size, out_features // pack_factor),
                    dtype=torch.int32,
                ),
            )
        if bias:
            self.register_buffer("bias", torch.zeros(out_features, dtype=torch.float16))
        else:
            self.bias = None

    def forward(self, x):
        """Dequantize weights and compute linear transformation."""
        # _dequantize() must return a tensor of shape (out_features, in_features)
        weight = self._dequantize()
        out = torch.matmul(x, weight.T)
        if self.bias is not None:
            out += self.bias
        return out

    def _dequantize(self):
        """Unpack and dequantize weights."""
        # Implement your dequantization kernel here
        # Can use Triton, CUDA, or PyTorch operations
        raise NotImplementedError("implement dequantization for your backend")

    @classmethod
    def pack(cls, linear, scales, zeros, bias=None):
        """Pack a standard nn.Linear into this quantized format.

        Called during export to convert calibrated weights into packed format.
        """
        ...
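The arithmetic that `_dequantize` has to reproduce is usually small, even when the kernel itself is not. The sketch below shows group-wise integer dequantization for a single output column in plain Python so each step is visible. It assumes the values are already unpacked and uses the common `scale * (q - zero)` convention; `dequantize_column` is a hypothetical helper, and a real backend would vectorize this in PyTorch, Triton, or CUDA.

```python
def dequantize_column(q_values, scales, zeros, group_size):
    """Dequantize one output column: w[i] = scales[g] * (q[i] - zeros[g]), g = i // group_size."""
    if len(q_values) % group_size != 0:
        raise ValueError("in_features must be divisible by group_size")
    weights = []
    for i, q in enumerate(q_values):
        g = i // group_size  # index of the group this input row belongs to
        weights.append(scales[g] * (q - zeros[g]))
    return weights


# Two groups of four unsigned 4-bit values, each with zero point 8 (mid-range).
q_values = [8, 9, 7, 15, 0, 8, 12, 4]
scales = [0.5, 0.25]
zeros = [8, 8]
weights = dequantize_column(q_values, scales, zeros, group_size=4)
```

Each group of `group_size` input rows shares one scale and one zero point, which is why the `scales` buffer in the template has `in_features // group_size` rows.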

Step 3: Wire Up QuantLinear Import Logic

Register your backend in the explicit import logic in auto_round/inference/backend.py. In this repository, backend loading is not a generic directory scan; dynamic_import_inference_linear() maps backend keys to specific QuantLinear implementations.

Add a new backend key in BackendInfos[...] if needed, and make sure dynamic_import_inference_linear() returns your QuantLinear class for that backend:

if backend == "auto_round:your_backend":
    from auto_round_extension.your_device.qlinear_your_backend import QuantLinear

    return QuantLinear

If your backend fits an existing branch pattern, you can also reuse that logic, but contributors should update the explicit import mapping rather than rely on implicit auto-discovery.
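The idea behind explicit wiring is that each backend key resolves to exactly one class, and the module is imported only when that backend is requested. The sketch below shows this with a lookup table rather than `if` branches. It is a hypothetical variant for illustration, not the repository's `dynamic_import_inference_linear()`, and it points at standard-library classes only so that it runs anywhere.

```python
import importlib

# Hypothetical mapping: backend key -> (module path, class name).
# Standard-library targets stand in for real QuantLinear modules.
TOY_BACKEND_IMPORTS = {
    "toy:ordered": ("collections", "OrderedDict"),
    "toy:fraction": ("fractions", "Fraction"),
}


def import_backend_class(backend):
    """Resolve a backend key to its class, importing the module only on request."""
    try:
        module_path, class_name = TOY_BACKEND_IMPORTS[backend]
    except KeyError:
        raise ValueError(f"no import wiring for backend {backend!r}") from None
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


resolved = import_backend_class("toy:ordered")
```

Deferring the import this way means a missing optional dependency for one backend does not break loading of the others, and an unknown key fails with a clear error instead of silently picking something else.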

Step 4: Add Extension `__init__.py`

Create auto_round_extension/your_device/__init__.py if the directory is new:

# Auto-Round extension for YourDevice backend

Step 5: Test

Unit test for the QuantLinear

import torch


def test_your_backend_qlinear():
    from auto_round_extension.your_device.qlinear_your_backend import QuantLinear

    # Move the layer to the same device as the input to avoid a device mismatch
    ql = QuantLinear(bits=4, group_size=128, in_features=256, out_features=512).to("cuda")
    x = torch.randn(1, 256, dtype=torch.float16, device="cuda")
    out = ql(x)
    assert out.shape == (1, 512)
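A shape assertion passes even when a kernel is numerically wrong, so it can help to add a round-trip check against known weights. The sketch below shows the idea in plain Python for one group with symmetric round-to-nearest quantization. `quantize_symmetric` and `dequantize_symmetric` are hypothetical helpers, not AutoRound functions. In a real test you would compare your `QuantLinear` output against a float reference with a tolerance suited to your data type.

```python
def quantize_symmetric(weights, bits):
    """Round-to-nearest symmetric quantization of one group; returns (q, scale)."""
    qmax = (1 << (bits - 1)) - 1  # 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize_symmetric(q, scale):
    """Map integer codes back to floats."""
    return [scale * v for v in q]


reference = [0.7, -0.35, 0.1, -0.7, 0.02, 0.49]
q, scale = quantize_symmetric(reference, bits=4)
restored = dequantize_symmetric(q, scale)
max_error = max(abs(a - b) for a, b in zip(reference, restored))

# Rounding to the nearest step can be off by at most half a step.
assert max_error <= scale / 2 + 1e-9
```

The half-step bound holds for round-to-nearest on a single group. A tuned quantizer may place values differently, so treat the bound as a sanity check on the dequantization path rather than a statement about model accuracy.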

End-to-end test

from auto_round import AutoRound


def test_your_backend_e2e(tiny_opt_model_path, dataloader):
    ar = AutoRound(
        tiny_opt_model_path,
        bits=4,
        group_size=128,
        dataset=dataloader,
        iters=2,
        nsamples=2,
    )
    compressed_model, _ = ar.quantize()
    ar.save_quantized(output_dir="./tmp_backend_test", format="auto_round")

    # Load and verify inference with your backend
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load onto the GPU so the model and the inputs share a device
    model = AutoModelForCausalLM.from_pretrained("./tmp_backend_test", device_map="cuda")
    tokenizer = AutoTokenizer.from_pretrained("./tmp_backend_test")
    inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=10)
    assert outputs.shape[1] > inputs["input_ids"].shape[1]

Reference: Existing Backend Implementations

| Backend Key | Device | Extension Dir | Key Patterns |
| --- | --- | --- | --- |
| `auto_gptq:exllamav2` | CUDA | `cuda/` | Marlin kernels via GPTQModel, priority=3 |
| `auto_round:triton_*` | CUDA | `triton/` | Triton JIT-compiled kernels |
| `auto_round:torch_*` | CPU/CUDA | `torch/` | Pure PyTorch fallback |
| `auto_round:ark` | ARK | `ark/` | ARK accelerator kernels |
| HPU backends | HPU | `hpu/` | Habana Gaudi optimized |

Key Registration Points

| What | Where | Mechanism |
| --- | --- | --- |
| Backend capabilities | `auto_round/inference/backend.py` | `BackendInfos["name"]` dict |
| QuantLinear module | `auto_round_extension/<device>/qlinear_*.py` | `QUANT_TYPE` class attr |
| QuantLinear import wiring | `auto_round/inference/backend.py` | `dynamic_import_inference_linear()` |
| Feature checkers | `auto_round/inference/backend.py` | `functools.partial` wrappers |
