litert-cli
// LiteRT CLI tool to download, convert, quantize, run, benchmark, and visualize LiteRT models.
| name | litert-cli |
| description | LiteRT CLI tool to download, convert, quantize, run, benchmark, and visualize LiteRT models. |
This skill allows the agent to download, convert, quantize, run, benchmark, and
visualize LiteRT models using the litert command on desktop, device, or Google
Cloud.
Before running any litert commands, you must ensure a Python virtual environment is active and the litert-cli package is correctly installed.
Use this method if you are developing inside a clone of the repository:
uv (Recommended - Super Fast)
# Create a venv with seed packages (critical for dynamic deps.py auto-installers)
uv venv --clear --python=3.13 --seed
source .venv/bin/activate
# Install local clone
uv pip install -e .
pip
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip setuptools wheel
pip install -e .
Use this method if you are installing the published package:
uv
uv venv --clear --python=3.13 --seed
source .venv/bin/activate
uv pip install litert-cli-nightly
pip
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip setuptools wheel
pip install litert-cli-nightly
[!TIP] If you encounter package resolution or network errors with uv, set the standard PyPI index URL first: `export UV_INDEX_URL=https://pypi.org/simple`
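The prerequisite above (an active virtual environment with the `litert` command installed) could be verified before an agent issues any command. The following is a minimal sketch, not part of litert-cli: `check_litert_environment` is a hypothetical helper name, and it only checks that the interpreter is running inside a venv and that a `litert` executable is on `PATH`.

```python
import shutil
import sys


def check_litert_environment() -> dict:
    """Report whether a venv is active and whether `litert` is on PATH.

    Hypothetical helper for illustration; it is not shipped with litert-cli.
    """
    # Inside a venv, sys.prefix points at the venv while sys.base_prefix
    # points at the interpreter the venv was created from.
    in_venv = sys.prefix != getattr(sys, "base_prefix", sys.prefix)
    # shutil.which returns the executable's path, or None if it is not found.
    litert_path = shutil.which("litert")
    return {"in_venv": in_venv, "litert_path": litert_path}


if __name__ == "__main__":
    print(check_litert_environment())
```

If `in_venv` is false or `litert_path` is `None`, repeat the installation steps above before continuing.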
Model Reference (model-ref) System
To avoid handling complex and fragile absolute filesystem paths, the LiteRT CLI uses a centralized Model Reference (model-ref) catalog.
When you download or import a model to the centralized cache, you can assign it a reference alias (and optional sub-references):
* Format: `<alias_name>` or `<alias_name>:<sub_reference>` (e.g., `mobilenet`, `resnet18:gpu`, `efficientnet:int8`).
* If `--model-ref` is omitted, the CLI automatically assigns a flattened repository ID (e.g., `litert-community__MobileNet-v3-large`) as the default alias.

Once a model is registered, all CLI commands (including `run`, `benchmark`, `compile`, `delete`, and `list`) accept this `<model_ref>` directly instead of a file path. The CLI automatically resolves it to the correct absolute cache file path on the fly.
Here are some examples of using model references in action:
# Run inference using the central alias directly
litert run mobilenet --android --cpu
# Benchmark using a specific sub-reference GPU file
litert benchmark resnet18:gpu --android --gpu
# Compile for NPU directly using the reference alias
litert compile efficientnet --target sm8750
# Delete from the central cache
litert delete mobilenet
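To make the reference syntax concrete, here is a small sketch of how a `<alias_name>:<sub_reference>` string might be split, and how a repository ID might be flattened into a default alias. This illustrates the documented format only and is not the CLI's actual resolver: `ModelRef`, `parse_model_ref`, and `default_alias` are hypothetical names, and the `/` to `__` rule is inferred from the single example `litert-community__MobileNet-v3-large`.

```python
from typing import NamedTuple, Optional


class ModelRef(NamedTuple):
    """Hypothetical container for a parsed model reference."""
    alias: str
    sub_reference: Optional[str]


def parse_model_ref(ref: str) -> ModelRef:
    """Split '<alias_name>' or '<alias_name>:<sub_reference>' into its parts."""
    alias, sep, sub = ref.partition(":")
    # Reject an empty alias ("", ":gpu") and a dangling separator ("resnet18:").
    if not alias or (sep and not sub):
        raise ValueError(f"malformed model reference: {ref!r}")
    return ModelRef(alias, sub or None)


def default_alias(repo_id: str) -> str:
    """Flatten a repository ID (assumption: '/' becomes '__')."""
    return repo_id.replace("/", "__")


if __name__ == "__main__":
    print(parse_model_ref("resnet18:gpu"))
    print(default_alias("litert-community/MobileNet-v3-large"))
```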
Download public LiteRT models from HuggingFace Hub or a direct URL.
# Download public models using repo path
litert download repo_id_or_url --output ./models
# Download only .tflite files from HuggingFace
litert download litert-community/MobileNet-v3-large --file "*.tflite" --output ./models
# Download models with a custom model-ref alias
litert download litert-community/MobileNet-v3-large --model-ref my_model_ref
[!NOTE] If `--output` is omitted during HuggingFace downloads, the model is downloaded to `~/.cache/litert-cli/models/` and cataloged automatically via `metadata.json` (associating it with the repo ID as the `model-ref`). If `--output` is provided, it is treated as a standalone folder and is not cataloged.
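The rule in the note above can be summarized as a two-way decision. The sketch below only restates that rule in code and does not call the CLI. `resolve_download_target` is a hypothetical name, and the default directory is taken from the note.

```python
import os
from pathlib import Path
from typing import Optional, Tuple

# Default cache location as stated in the note above.
DEFAULT_CACHE = Path(os.path.expanduser("~/.cache/litert-cli/models"))


def resolve_download_target(output: Optional[str]) -> Tuple[Path, bool]:
    """Return (directory, is_cataloged) for a HuggingFace download.

    Hypothetical helper that mirrors the documented behavior:
    no --output means central cache plus catalog entry; an explicit
    --output means a standalone folder with no catalog entry.
    """
    if output is None:
        return DEFAULT_CACHE, True
    return Path(output), False


if __name__ == "__main__":
    print(resolve_download_target(None))
    print(resolve_download_target("./models"))
```

In practice this means a model downloaded with `--output` cannot be addressed by a `model-ref` until it is brought into the cache with `litert import`.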
Convert a PyTorch or HuggingFace model into a TFLite model.
# Automated HF model conversion
litert convert Qwen/Qwen1.5-0.5B-Chat --output /tmp/qwen
# Conversion with INT8 Weight-Only Quantization
litert convert Qwen/Qwen1.5-0.5B-Chat --quantize-recipe weight_only_wi8_afp32 --output /tmp/qwen_w8
# From a custom local PyTorch script
litert convert my_model.py --quantize-recipe dynamic_wi8_afp32 --output /tmp/mymodel
[!NOTE] Custom Python Script Interface (`my_model.py`): To convert from a custom Python script, the file must expose functions that return the instantiated PyTorch model and generate sample inputs for tracing the graph:
* `--model-func`: Function name returning the model (`torch.nn.Module`). Default: `get_model`.
* `--input-func`: Function name returning sample trace inputs (tuple/dict). Default: `get_args`.

Minimal Script Example:
```python
import torch

# MyPyTorchModel stands for your own torch.nn.Module subclass,
# defined or imported elsewhere in this file.
def get_model() -> torch.nn.Module:
    return MyPyTorchModel()

def get_args() -> tuple:
    return (torch.randn(1, 3, 224, 224),)
```
Quantize a TFLite model using optimized recipes.
# Dynamic Range Quantization (dynamic_wi8_afp32) (Default)
litert quantize model.tflite --recipe dynamic_wi8_afp32 --output dynamic.tflite
# Weight-Only Quantization
litert quantize model.tflite --recipe weight_only_wi8_afp32 --output weight_only.tflite
# Static Range Quantization (requires calibration data)
litert quantize model.tflite --recipe static_wi8_ai8 --calibration-data calib_data.py --output static.tflite
# Custom JSON Recipe
litert quantize model.tflite --custom-recipe recipe.json --output recipe.tflite
Apply Ahead-of-Time (AOT) offline compilation to a standard TFLite model for edge SoC target NPUs.
[!NOTE] Currently only supported on Linux hosts for Qualcomm targets. Other targets are coming soon!
# Basic offline compile for Qualcomm sm8750 NPU
litert compile model.tflite --target sm8750
# Compile for multiple NPUs and export as Android AI Pack
litert compile model.tflite --target sm8750 --target mt6989 --export-aipack my_npu_models
# Download the latest SoC metadata target maps from GitHub
litert compile --update-targets main
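When an agent assembles a compile invocation for several SoCs, the `--target` flag is repeated once per target, as in the second example above. A minimal sketch of building that argument list follows. `build_compile_command` is a hypothetical helper; it uses only the flags shown in this section and does not execute anything.

```python
from typing import List, Optional, Sequence


def build_compile_command(
    model: str,
    targets: Sequence[str],
    export_aipack: Optional[str] = None,
) -> List[str]:
    """Build an argv list for `litert compile` (hypothetical helper)."""
    if not targets:
        raise ValueError("at least one --target is required")
    cmd = ["litert", "compile", model]
    # One --target flag per SoC, mirroring the multi-NPU example.
    for target in targets:
        cmd += ["--target", target]
    if export_aipack:
        cmd += ["--export-aipack", export_aipack]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_compile_command(
        "model.tflite", ["sm8750", "mt6989"], "my_npu_models")))
```

Passing the resulting list to `subprocess.run` avoids shell quoting issues with model paths that contain spaces.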
Run a TFLite model locally on desktop or on an adb-connected Android device.
Desktop Execution:
```bash
litert run model.tflite --desktop --cpu
litert run my_model_ref --desktop --cpu
litert run model.tflite --desktop --accelerator gpu,cpu
```
* Output logs are clean by default. To enable C++ verbose debug logs, set the environment variable: `export LITERT_VERBOSE=1`.
Android Execution (CPU, GPU, or NPU):
```bash
litert run model.tflite --android --cpu
litert run model.tflite --android --accelerator gpu,cpu
litert run standard_model.tflite --android --accelerator npu,cpu
```
* **NPU Ahead-Of-Time (AOT) execution mode**: Pass an already NPU-compiled TFLite model (compiled offline via `litert compile`). The on-device runtime loads the compiled binary block directly, avoiding graph-compilation warmup overhead:
```bash
litert run resnet18_compiled_sm8750.tflite --android --npu
```
Custom Inputs and Formats:
```bash
litert run model.tflite --iterations 5 --print-tensors
litert run model.tflite --desktop --input inputs="[0.5, 0.5, 0.5]"
litert run model.tflite --desktop --input "image.png" --print-tensors
```
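The run examples above combine a platform flag, either a single accelerator flag or an `--accelerator` fallback list, and an optional `--input`. The sketch below shows one way to assemble such a command from Python values. It rests on several assumptions: `build_run_command` is a hypothetical helper, the `name=[...]` value syntax is inferred from the single `inputs="[0.5, 0.5, 0.5]"` example, and combining an accelerator flag with `--input` in one call is not shown in the source.

```python
import json
from typing import List, Optional, Sequence


def build_run_command(
    model: str,
    platform: str = "desktop",
    accelerators: Sequence[str] = ("cpu",),
    input_name: Optional[str] = None,
    input_values: Optional[Sequence[float]] = None,
) -> List[str]:
    """Build an argv list for `litert run` (hypothetical helper)."""
    if platform not in ("desktop", "android"):
        raise ValueError(f"unsupported platform: {platform!r}")
    if not accelerators:
        raise ValueError("at least one accelerator is required")
    cmd = ["litert", "run", model, f"--{platform}"]
    if len(accelerators) == 1:
        # Single accelerator: use the short form shown above (--cpu, --gpu, --npu).
        cmd.append(f"--{accelerators[0]}")
    else:
        # Several accelerators: comma-separated list in preference order.
        cmd += ["--accelerator", ",".join(accelerators)]
    if input_name is not None and input_values is not None:
        # json.dumps renders [0.5, 0.5, 0.5] exactly as in the documented example.
        cmd += ["--input", f"{input_name}={json.dumps(list(input_values))}"]
    return cmd


if __name__ == "__main__":
    print(build_run_command("model.tflite", "android", ["gpu", "cpu"]))
```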
Benchmark LiteRT models on different platforms (Android, Google Cloud, or Desktop).
# Benchmark on connected Android device via CPU
litert benchmark model.tflite --android --cpu
# Benchmark on Android GPU using OpenCL/OpenGL delegates
litert benchmark model.tflite --android --gpu
# Benchmark on Android NPU (JIT compilation mode)
litert benchmark model.tflite --android --npu
# Benchmark compiled AOT model on Android NPU
litert benchmark model_compiled.tflite --android --npu
# Benchmark on macOS CPU
litert benchmark my_model_ref --desktop --cpu
# Benchmark on Google AI Edge Portal (Google Cloud)
# Note: You must first join the EAP program, authenticate using 'gcloud auth login',
# and configure --gcp-project and --gcp-bucket.
litert benchmark model.tflite --gcp --device "pixel 7" --gcp-project "your-gcp-project-id" --gcp-bucket "your-gcp-bucket"
Interact with LLM generative models (like Qwen 1.5 or Gemma 4) using native litert-lm utilities.
[!TIP] Non-interactive / Background Execution: When running LLM inference in scripts or background tasks, the process will block waiting for chat prompts on stdin. To prevent hanging, always redirect stdin from `/dev/null` (i.e., append `< /dev/null` to the end of the command).
# Run a generative model loading from Hugging Face
litert lm run --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm --prompt="What is the capital of France?" < /dev/null
# Run using a local compiled LLM (.litertlm) file
litert lm run gemma-4-E2B-it.litertlm --prompt "Hello, how are you?" < /dev/null
# Benchmark a generative LLM model
litert lm benchmark gemma-4-E2B-it.litertlm
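When the command is launched from Python rather than a shell, the same effect as `< /dev/null` can be obtained with `subprocess.DEVNULL`. The following is a minimal sketch under that assumption. `run_non_interactive` is a hypothetical wrapper, and the demonstration uses a stand-in child process that reads stdin instead of a real `litert lm run` call.

```python
import subprocess
import sys
from typing import List


def run_non_interactive(cmd: List[str], timeout: float = 600.0) -> subprocess.CompletedProcess:
    """Run a command with stdin attached to the null device."""
    return subprocess.run(
        cmd,
        stdin=subprocess.DEVNULL,  # same effect as appending `< /dev/null`
        capture_output=True,
        text=True,
        timeout=timeout,
    )


if __name__ == "__main__":
    # Stand-in child: reads stdin to EOF and reports how many characters it got.
    # With stdin on the null device it returns immediately instead of blocking.
    probe = [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"]
    print(run_non_interactive(probe).stdout.strip())
```

For a real invocation, the argument list would be something like `["litert", "lm", "run", "gemma-4-E2B-it.litertlm", "--prompt", "Hello, how are you?"]`, taken from the examples above.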
Launch Model Explorer to visualize the model structure.
# Open model structure graph in web browser
litert visualize model.tflite
# Clean up and stop all background visualization servers
litert visualize --stop-all
Import a local model file or folder into the centralized cache.
# Import a local file into the cache
litert import my_model.tflite --model-ref my_model
# Import a directory and associate with a Hugging Face repository ID
litert import ./my_model_dir --model-ref my_model --hf-id my_org/my_model
List managed model references or view details of a specific model.
# List all models in the catalog
litert list
# Show detailed configurations of a specific model
litert list my_model
Delete a managed model reference from the centralized catalog.
# Delete a model reference from the cache
litert delete my_model
Clean up local caches, downloads, and temporary directories.
# Clean up local caches, downloaded files, and temporary on-device directories
litert clean
Agents should run tests after modifying code to ensure no regressions.
To run unit tests locally:
python litert_cli/litert_test.py
python litert_cli/litert_help_test.py
To run comprehensive end-to-end regression tests:
./examples/run_smoke_tests.sh
./examples/run_commands.sh
./examples/run_models.sh
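An agent that runs these checks after each code change may want to stop at the first failure and report which command failed. A minimal sketch of that pattern follows. `run_until_failure` is a hypothetical helper, and the script paths in `UNIT_TESTS` are the ones listed above, assumed to be run from the repository root.

```python
import subprocess
import sys
from typing import List, Optional, Sequence


def run_until_failure(commands: Sequence[List[str]]) -> Optional[List[str]]:
    """Run commands in order; return the first failing one, or None if all pass."""
    for cmd in commands:
        # A non-zero exit status is treated as a test failure.
        if subprocess.run(cmd).returncode != 0:
            return cmd
    return None


# Unit test entry points from this section (paths relative to the repo root).
UNIT_TESTS = [
    [sys.executable, "litert_cli/litert_test.py"],
    [sys.executable, "litert_cli/litert_help_test.py"],
]
```

Calling `run_until_failure(UNIT_TESTS)` from the repository root returns `None` when both test files pass, so the slower end-to-end scripts can be gated on that result.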
* When running a `litert lm run` command in a script or in the background, always append `< /dev/null` to redirect standard input. Otherwise, the process will block indefinitely waiting on stdin.
* See the `examples/` directory to explore comprehensive per-command demos (under `examples/commands/`) and model-specific demos (under `examples/models/`) for complete automation patterns.
* See the `README.md` file's Troubleshooting & Tips section for platform-specific environmental setup guides, adb port recoveries, and NPU offline compiler clang version requirements.

These prompts demonstrate how developers can leverage this skill. You can copy and use them directly in your agent queries:
* "… litert-community/efficientnet_b1 and run it on CPU"
* "… litert-community/efficientnet_b1 on my Android GPU"
* "… litert-community/efficientnet_b1 for NPU target sm8750"
* "… litert-community/efficientnet_b1"
* "… litert-community/efficientnet_b1 from HuggingFace. Quantize it to INT8 dynamic range (--recipe dynamic_wi8_afp32), then benchmark both the FP32 and INT8 models on my Android GPU, comparing the throughput speedup."
* "… Qwen/Qwen1.5-0.5B-Chat from HuggingFace Hub to LiteRT format, and run it locally with the prompt 'Explain edge machine learning in one sentence'."
* "… litert-community/efficientnet_b1, offline compile (AOT) it for the sm8750 target NPU into ./models/compiled, then run on-device inference and benchmark on the NPU, confirming zero JIT warmup latency."