تشغيل أي مهارة في Manus بنقرة واحدة

$pwd:

adapt-new-llm

Name: Adapt New Llm
Author: intel

// Adapt AutoRound to support a new LLM architecture that doesn't work out-of-the-box. Use when quantization fails for a new model type, block detection doesn't find layers, MoE models need unfusing, custom forward passes are needed, or non-standard linear layer types need handling.

تشغيل في Manus

$ git log --oneline --stat

stars:١٬٤٢٥

forks:١٣٤

updated:١٤ مايو ٢٠٢٦ في ٠١:٤٠

SKILL.md

readonly

related-skills.json

نفس المستودع

adapt-new-diffusion-model.md

from "intel/auto-round"

Adapt AutoRound to support a new diffusion model architecture (DiT, UNet, hybrid AR+DiT). Use when a new diffusion model fails quantization, needs custom output configs, requires a custom pipeline function, or is a hybrid architecture with both autoregressive and diffusion components.

2026-05-141.4k

add-vlm-model.md

from "intel/auto-round"

Add support for a new Vision-Language Model (VLM) to AutoRound, including multimodal block handler, calibration dataset template, and special model handling. Use when integrating a new VLM like LLaVA, Qwen2-VL, GLM-Image, Phi-Vision, or similar multi-modal models for quantization.

2026-05-141.4k

add-inference-backend.md

from "intel/auto-round"

Add a new hardware inference backend to AutoRound for deploying quantized models (e.g., CUDA/Marlin, Triton, CPU, HPU, ARK). Use when implementing QuantLinear kernels, registering backend capabilities, or enabling quantized model inference on a new hardware platform.

2026-05-111.4k

add-export-format.md

from "intel/auto-round"

Add a new model export format to AutoRound (e.g., auto_round, auto_gptq, auto_awq, gguf, llm_compressor). Use when implementing a new quantized model serialization format, adding a new packing method, or extending export compatibility for deployment frameworks like vLLM, SGLang, or llama.cpp.

2026-04-171.4k

add-quantization-datatype.md

from "intel/auto-round"

Add a new quantization data type to AutoRound (e.g., INT, FP8, MXFP, NVFP, GGUF variants). Use when implementing a new weight/activation quantization scheme, registering a new quant function, or extending the data_type registry.

2026-04-171.4k

review-pr.md

from "intel/auto-round"

Review a pull request for the AutoRound repository with a structured checklist covering code quality, test coverage, documentation, Chinese translations, and quantization-specific concerns. Use when reviewing or preparing to submit a PR.

2026-04-171.4k

package.json

"author": "intel"

"repository": "intel/auto-round"

فتح مستودع GitHub عرض مستودعات المنشئ

$ install --global

$ download --local

تشغيل في Manus

$ useful --forSOC

مطوّرو البرمجياتمهن الحاسوب والرياضيات15-1252L4

name	adapt-new-llm
description	Adapt AutoRound to support a new LLM architecture that doesn't work out-of-the-box. Use when quantization fails for a new model type, block detection doesn't find layers, MoE models need unfusing, custom forward passes are needed, or non-standard linear layer types need handling.

Adapting AutoRound for a New LLM Architecture

Overview

Most standard Transformers-based LLMs work with AutoRound out-of-the-box. This skill covers what to do when a new model architecture requires code changes. The need for adaptation typically arises from:

Non-standard layer hierarchy (block detection fails)
Fused Mixture-of-Experts (MoE) weights
Non-standard linear layer types (not nn.Linear or Conv1D)
Complex multi-component architectures (multimodal routing)
Shared cache keys or position embeddings

Step 0: Diagnose the Problem

Try quantizing the model first:

from auto_round import AutoRound

ar = AutoRound("your-org/your-model", scheme="W4A16", iters=2, nsamples=2)
ar.quantize_and_save(output_dir="./test_output", format="auto_round")

Common failure modes and their fixes:

Error / Symptom	Root Cause	Fix Section
"No quantizable layers found"	Block detection failed	Step 1
"Quantized 0/N layers"	Layers not `nn.Linear`/`Conv1D`	Step 4
Shape mismatch in MoE layers	Fused expert weights	Step 2
Wrong outputs / calibration diverges	Forward pass not exercised correctly	Step 3
Cache key errors (Gemma3-style)	Shared position embeddings	Step 5

Step 1: Fix Block Detection

AutoRound discovers quantizable blocks via get_block_names() which searches recursively for nn.ModuleList instances. If your model has a non-standard layer hierarchy, block detection may fail.

Check current detection

from auto_round.utils import get_block_names

model = ...  # loaded model
print(get_block_names(model))

Option A: Use `to_quant_block_names` parameter

For simple cases, override block names without code changes:

ar = AutoRound(
    model,
    to_quant_block_names="model.decoder.layers",  # explicit path
)

Option B: Register in `SPECIAL_MULTIMODAL_BLOCK`

For multimodal or multi-component models, add a custom block handler in auto_round/special_model_handler.py:

def _get_your_model_multimodal_block(model, quant_vision=False):
    """Get block names for YourModel.

    YourModel structure:
    - encoder.layers: encoder blocks
    - decoder.layers: decoder blocks
    """
    block_names = []

    if quant_vision and hasattr(model, "encoder"):
        block_names.append([f"encoder.layers.{i}" for i in range(len(model.encoder.layers))])

    block_names.append([f"decoder.layers.{i}" for i in range(len(model.decoder.layers))])

    return block_names


# Register: key must match model.config.model_type
SPECIAL_MULTIMODAL_BLOCK["your_model_type"] = _get_your_model_multimodal_block

Also add to support lists if applicable:

# If text-only calibration works for this multimodal model:
SUPPORT_ONLY_TEXT_MODELS.append("your_model_type")

# If batch_size must be limited:
mllms_with_limited_bs = (..., "your_model_type")

Step 2: Handle MoE (Mixture-of-Experts) Models

MoE models often have fused 3D expert weights (shape [num_experts, hidden, intermediate]) that must be "unfused" into per-expert nn.Linear layers for quantization.

Check if auto-handled

Transformers >= 5.0 has a linear_loop experts interface that auto-unfuses most MoE models. Test first — it may just work.

Register custom unfusing

If auto-unfusing fails, create a custom module in auto_round/modeling/fused_moe/:

1. Create auto_round/modeling/fused_moe/your_moe.py:

"""Unfuse fused MoE weights for YourModel."""

import torch
import torch.nn as nn
from auto_round.modeling.fused_moe.replace_modules import register_replacement


@register_replacement("YourMoELayer")
def replace_your_moe_layer(module, name, model):
    """Replace FusedMoE with per-expert nn.Linear layers."""
    experts = nn.ModuleList()
    for i in range(module.num_experts):
        linear = nn.Linear(module.hidden_size, module.intermediate_size, bias=False)
        linear.weight.data = module.weight[i].clone()
        experts.append(linear)
    return experts

2. Register in BUILTIN_MODULES:

Edit auto_round/modeling/fused_moe/replace_modules.py:

BUILTIN_MODULES["your_model_type"] = LazyImport("auto_round.modeling.fused_moe.your_moe")

Existing MoE implementations

Model Type	File	Pattern
`llama4`	`fused_moe/llama4.py`	Custom replacement for no `use_experts_implementation`
`deepseek_v2`	`fused_moe/deepseek_v2.py`	q_scale calibration for Gaudi
`qwen3_5_moe`	`fused_moe/qwen3_5_moe.py`	Transformers >= 5.0 support
`step3p5`	`fused_moe/step3_5_moe.py`	Splits fused MoELinear
`qwen3_omni_moe`	`fused_moe/qwen3_omni.py`	Thinker + talker MoE

Step 3: Add Custom Forward Pass

Some models have non-standard forward passes that don't get calibrated correctly with the default model.forward(). This is common for multi-component architectures.

Edit _handle_special_model() in auto_round/special_model_handler.py:

def _your_model_forward(model, **kwargs):
    """Custom forward that routes through all quantizable components."""
    # Example: route through both encoder and decoder
    encoder_output = model.encoder(**kwargs)
    decoder_output = model.decoder(encoder_output, **kwargs)
    return decoder_output


def _handle_special_model(model):
    ...
    if hasattr(model, "config") and model.config.model_type == "your_model_type":
        from functools import partial

        model.forward = partial(_your_model_forward, model)
    return model

When is this needed?

Model has multiple sub-models (thinker/talker, encoder/decoder)
Default forward doesn't exercise all quantizable layers
Model needs special input preprocessing during calibration

Existing examples

Model	Custom Forward	Purpose
`deepseek_vl_v2`	`_deepseek_vl2_forward`	Route through language component
`qwen2_5_omni`	`_qwen2_5_omni_forward`	Route through thinker → talker
`qwen3_omni_moe`	`_qwen3_omni_moe_forward`	Handle MoE routing in omni model

Step 4: Handle Non-Standard Linear Layers

AutoRound quantizes these layer types by default:

# auto_round/utils/common.py
SUPPORTED_LAYER_TYPES = (torch.nn.Linear, transformers.pytorch_utils.Conv1D)
INNER_SUPPORTED_LAYER_TYPES = ("FP8Linear",)  # matched by class name string

If your model uses a custom linear type (e.g., QuantizedLinear, FP8Linear), it won't be quantized unless registered.

Option A: String-based matching

INNER_SUPPORTED_LAYER_TYPES matches by class name string — useful for external classes that can't be imported directly:

INNER_SUPPORTED_LAYER_TYPES = ("FP8Linear", "YourCustomLinear")

Option B: Type-based registration

If you can import the class:

from your_library import YourLinear

SUPPORTED_LAYER_TYPES = SUPPORTED_LAYER_TYPES + (YourLinear,)

Step 5: Handle Shared Cache Keys

Some models share tensors across blocks during inference (e.g., Gemma3's rotary position embeddings). These must be declared so the calibration cache doesn't duplicate or corrupt them.

Edit SPECIAL_SHARED_CACHE_KEYS in auto_round/special_model_handler.py:

SPECIAL_SHARED_CACHE_KEYS["YourModelForCausalLM"] = ("shared_position_embeddings", "shared_rope")

The key is the class name of the model (not model_type).

Step 6: Test

def test_your_model_quantization():
    ar = AutoRound(
        "your-org/your-model",
        scheme="W4A16",
        iters=2,
        nsamples=2,
        batch_size=2,
    )
    compressed_model, layer_config = ar.quantize()
    # Verify layers were quantized
    assert len(layer_config) > 0, "No layers were quantized"

    ar.save_quantized(output_dir="./tmp_your_model", format="auto_round")

    # Verify inference works
    from auto_round.utils import model_infer

    output = model_infer(compressed_model, tokenizer, "Hello world")
    assert output is not None

Step 7: Update Documentation

Add model to supported list in README.md
Update README_CN.md with equivalent Chinese content
Add example script if the model has notable differences

Checklist

get_block_names() finds all quantizable blocks
MoE layers (if any) are unfused correctly
calib() runs without shape errors
All target layers are quantized (check "Quantized X/Y layers" log)
Forward pass exercises all quantizable components
Quantized model produces valid outputs
Export to target format works
README.md + README_CN.md updated

Key Files

File	Purpose
`auto_round/special_model_handler.py`	Block handlers, custom forwards, shared cache keys
`auto_round/modeling/fused_moe/replace_modules.py`	MoE unfusing registry (`BUILTIN_MODULES`)
`auto_round/utils/common.py`	`SUPPORTED_LAYER_TYPES`, `INNER_SUPPORTED_LAYER_TYPES`
`auto_round/utils/model.py`	`get_block_names()`, `is_mllm_model()`, model loading
`auto_round/compressors/data_driven.py`	New-architecture quantization loop and block scheduling
`auto_round/algorithms/quantization/base.py`	Quantizer block execution, sampling, and diffusion output configs
`auto_round/calibration/llm.py`	LLM calibration data collection and `calib()` flow
`auto_round/autoround.py`	`AutoRound` factory — model type routing logic

adapt-new-llm

المزيد من هذا المستودع

المزيد من هذا المستودع

Adapting AutoRound for a New LLM Architecture

Overview

Step 0: Diagnose the Problem

Step 1: Fix Block Detection

Check current detection

Option A: Use to_quant_block_names parameter

Option B: Register in SPECIAL_MULTIMODAL_BLOCK

Step 2: Handle MoE (Mixture-of-Experts) Models

Check if auto-handled

Register custom unfusing

Existing MoE implementations

Step 3: Add Custom Forward Pass

When is this needed?

Existing examples

Step 4: Handle Non-Standard Linear Layers

Option A: String-based matching

Option B: Type-based registration

Step 5: Handle Shared Cache Keys

Step 6: Test

Step 7: Update Documentation

Checklist

Key Files

Adapting AutoRound for a New LLM Architecture

Overview

Step 0: Diagnose the Problem

Step 1: Fix Block Detection

Check current detection

Option A: Use to_quant_block_names parameter

Option B: Register in SPECIAL_MULTIMODAL_BLOCK

Step 2: Handle MoE (Mixture-of-Experts) Models

Check if auto-handled

Register custom unfusing

Existing MoE implementations

Step 3: Add Custom Forward Pass

When is this needed?

Existing examples

Step 4: Handle Non-Standard Linear Layers

Option A: String-based matching

Option B: Type-based registration

Step 5: Handle Shared Cache Keys

Step 6: Test

Step 7: Update Documentation

Checklist

Key Files

Option A: Use `to_quant_block_names` parameter

Option B: Register in `SPECIAL_MULTIMODAL_BLOCK`

Option A: Use `to_quant_block_names` parameter

Option B: Register in `SPECIAL_MULTIMODAL_BLOCK`