Ejecuta cualquier Skill en Manus
con un clic

Ejecuta cualquier Skill en Manus con un clic

tensorrt-llm

Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.

Ejecutar en Manus

Estrellas189.253

Forks32.680

Actualizado8 de mayo de 2026, 21:27

Fuente

NousResearch

NousResearch/hermes-agent

Abrir repositorio de GitHub Ver repositorios del creador

Comando de instalación

Descarga

Ejecutar en Manus

Útil paraSOC

Científicos de datosOcupaciones informáticas y matemáticas15-2051L4

Explorador de archivos

4 archivos

SKILL.md

readonly

name	tensorrt-llm
description	Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.
version	1.0.0
author	Orchestra Research
license	MIT
dependencies	["tensorrt-llm","torch"]
platforms	["linux","macos"]
metadata	{"hermes":{"tags":["Inference Serving","TensorRT-LLM","NVIDIA","Inference Optimization","High Throughput","Low Latency","Production","FP8","INT4","In-Flight Batching","Multi-GPU"]}}

TensorRT-LLM

NVIDIA's open-source library for optimizing LLM inference with state-of-the-art performance on NVIDIA GPUs.

When to use TensorRT-LLM

Use TensorRT-LLM when:

Deploying on NVIDIA GPUs (A100, H100, GB200)
Need maximum throughput (24,000+ tokens/sec on Llama 3)
Require low latency for real-time applications
Working with quantized models (FP8, INT4, FP4)
Scaling across multiple GPUs or nodes

Use vLLM instead when:

Need simpler setup and Python-first API
Want PagedAttention without TensorRT compilation
Working with AMD GPUs or non-NVIDIA hardware

Use llama.cpp instead when:

Deploying on CPU or Apple Silicon
Need edge deployment without NVIDIA GPUs
Want simpler GGUF quantization format

Quick start

Installation

# Docker (recommended)
docker pull nvidia/tensorrt_llm:latest

# pip install
pip install tensorrt_llm==1.2.0rc3

# Requires CUDA 13.0.0, TensorRT 10.13.2, Python 3.10-3.12

Basic inference

from tensorrt_llm import LLM, SamplingParams

# Initialize model
llm = LLM(model="meta-llama/Meta-Llama-3-8B")

# Configure sampling
sampling_params = SamplingParams(
    max_tokens=100,
    temperature=0.7,
    top_p=0.9
)

# Generate
prompts = ["Explain quantum computing"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.text)

Serving with trtllm-serve

# Start server (automatic model download and compilation)
trtllm-serve meta-llama/Meta-Llama-3-8B \
    --tp_size 4 \              # Tensor parallelism (4 GPUs)
    --max_batch_size 256 \
    --max_num_tokens 4096

# Client request
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'

Key features

Performance optimizations

In-flight batching: Dynamic batching during generation
Paged KV cache: Efficient memory management
Flash Attention: Optimized attention kernels
Quantization: FP8, INT4, FP4 for 2-4× faster inference
CUDA graphs: Reduced kernel launch overhead

Parallelism

Tensor parallelism (TP): Split model across GPUs
Pipeline parallelism (PP): Layer-wise distribution
Expert parallelism: For Mixture-of-Experts models
Multi-node: Scale beyond single machine

Advanced features

Speculative decoding: Faster generation with draft models
LoRA serving: Efficient multi-adapter deployment
Disaggregated serving: Separate prefill and generation

Common patterns

Quantized model (FP8)

from tensorrt_llm import LLM

# Load FP8 quantized model (2× faster, 50% memory)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    dtype="fp8",
    max_num_tokens=8192
)

# Inference same as before
outputs = llm.generate(["Summarize this article..."])

Multi-GPU deployment

# Tensor parallelism across 8 GPUs
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    tensor_parallel_size=8,
    dtype="fp8"
)

Batch inference

# Process 100 prompts efficiently
prompts = [f"Question {i}: ..." for i in range(100)]

outputs = llm.generate(
    prompts,
    sampling_params=SamplingParams(max_tokens=200)
)

# Automatic in-flight batching for maximum throughput

Performance benchmarks

Meta Llama 3-8B (H100 GPU):

Throughput: 24,000 tokens/sec
Latency: ~10ms per token
vs PyTorch: 100× faster

Llama 3-70B (8× A100 80GB):

FP8 quantization: 2× faster than FP16
Memory: 50% reduction with FP8

Supported models

LLaMA family: Llama 2, Llama 3, CodeLlama
GPT family: GPT-2, GPT-J, GPT-NeoX
Qwen: Qwen, Qwen2, QwQ
DeepSeek: DeepSeek-V2, DeepSeek-V3
Mixtral: Mixtral-8x7B, Mixtral-8x22B
Vision: LLaVA, Phi-3-vision
100+ models on HuggingFace

References

Optimization Guide - Quantization, batching, KV cache tuning
Multi-GPU Setup - Tensor/pipeline parallelism, multi-node
Serving Guide - Production deployment, monitoring, autoscaling

Resources

Docs: https://nvidia.github.io/TensorRT-LLM/
GitHub: https://github.com/NVIDIA/TensorRT-LLM
Models: https://huggingface.co/models?library=tensorrt_llm

Más de este repositorio

mismo repositorio

google-meet

NousResearch/hermes-agent

Join a Google Meet call, transcribe live captions, optionally speak in realtime, and do the followup work afterwards. Use when the user asks the agent to sit in on a meeting, take notes, summarize, respond in-call, or action items from it.

2026-06-09189.3k

simplify-code

NousResearch/hermes-agent

Parallel 3-agent cleanup of recent code changes.

2026-06-08189.3k

codex

NousResearch/hermes-agent

Delegate coding to OpenAI Codex CLI (features, PRs).

2026-06-08189.3k

google-workspace

NousResearch/hermes-agent

Gmail, Calendar, Drive, Docs, Sheets via gws CLI or Python.

2026-06-05189.3k

hermes-agent

NousResearch/hermes-agent

Configure, extend, or contribute to Hermes Agent.

2026-06-05189.3k

hermes-s6-container-supervision

NousResearch/hermes-agent

Modify, debug, or extend the s6-overlay supervision tree inside the Hermes Agent Docker image — adding new services, debugging profile gateways, understanding the Architecture B main-program pattern.

2026-06-04189.3k

name	tensorrt-llm
description	Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.
version	1.0.0
author	Orchestra Research
license	MIT
dependencies	["tensorrt-llm","torch"]
platforms	["linux","macos"]
metadata	{"hermes":{"tags":["Inference Serving","TensorRT-LLM","NVIDIA","Inference Optimization","High Throughput","Low Latency","Production","FP8","INT4","In-Flight Batching","Multi-GPU"]}}

TensorRT-LLM

NVIDIA's open-source library for optimizing LLM inference with state-of-the-art performance on NVIDIA GPUs.

When to use TensorRT-LLM

Use TensorRT-LLM when:

Deploying on NVIDIA GPUs (A100, H100, GB200)
Need maximum throughput (24,000+ tokens/sec on Llama 3)
Require low latency for real-time applications
Working with quantized models (FP8, INT4, FP4)
Scaling across multiple GPUs or nodes

Use vLLM instead when:

Need simpler setup and Python-first API
Want PagedAttention without TensorRT compilation
Working with AMD GPUs or non-NVIDIA hardware

Use llama.cpp instead when:

Deploying on CPU or Apple Silicon
Need edge deployment without NVIDIA GPUs
Want simpler GGUF quantization format

Quick start

Installation

# Docker (recommended)
docker pull nvidia/tensorrt_llm:latest

# pip install
pip install tensorrt_llm==1.2.0rc3

# Requires CUDA 13.0.0, TensorRT 10.13.2, Python 3.10-3.12

Basic inference

from tensorrt_llm import LLM, SamplingParams

# Initialize model
llm = LLM(model="meta-llama/Meta-Llama-3-8B")

# Configure sampling
sampling_params = SamplingParams(
    max_tokens=100,
    temperature=0.7,
    top_p=0.9
)

# Generate
prompts = ["Explain quantum computing"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.text)

Serving with trtllm-serve

# Start server (automatic model download and compilation)
trtllm-serve meta-llama/Meta-Llama-3-8B \
    --tp_size 4 \              # Tensor parallelism (4 GPUs)
    --max_batch_size 256 \
    --max_num_tokens 4096

# Client request
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'

Key features

Performance optimizations

In-flight batching: Dynamic batching during generation
Paged KV cache: Efficient memory management
Flash Attention: Optimized attention kernels
Quantization: FP8, INT4, FP4 for 2-4× faster inference
CUDA graphs: Reduced kernel launch overhead

Parallelism

Tensor parallelism (TP): Split model across GPUs
Pipeline parallelism (PP): Layer-wise distribution
Expert parallelism: For Mixture-of-Experts models
Multi-node: Scale beyond single machine

Advanced features

Speculative decoding: Faster generation with draft models
LoRA serving: Efficient multi-adapter deployment
Disaggregated serving: Separate prefill and generation

Common patterns

Quantized model (FP8)

from tensorrt_llm import LLM

# Load FP8 quantized model (2× faster, 50% memory)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    dtype="fp8",
    max_num_tokens=8192
)

# Inference same as before
outputs = llm.generate(["Summarize this article..."])

Multi-GPU deployment

# Tensor parallelism across 8 GPUs
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    tensor_parallel_size=8,
    dtype="fp8"
)

Batch inference

# Process 100 prompts efficiently
prompts = [f"Question {i}: ..." for i in range(100)]

outputs = llm.generate(
    prompts,
    sampling_params=SamplingParams(max_tokens=200)
)

# Automatic in-flight batching for maximum throughput

Performance benchmarks

Meta Llama 3-8B (H100 GPU):

Throughput: 24,000 tokens/sec
Latency: ~10ms per token
vs PyTorch: 100× faster

Llama 3-70B (8× A100 80GB):

FP8 quantization: 2× faster than FP16
Memory: 50% reduction with FP8

Supported models

LLaMA family: Llama 2, Llama 3, CodeLlama
GPT family: GPT-2, GPT-J, GPT-NeoX
Qwen: Qwen, Qwen2, QwQ
DeepSeek: DeepSeek-V2, DeepSeek-V3
Mixtral: Mixtral-8x7B, Mixtral-8x22B
Vision: LLaVA, Phi-3-vision
100+ models on HuggingFace

References

Optimization Guide - Quantization, batching, KV cache tuning
Multi-GPU Setup - Tensor/pipeline parallelism, multi-node
Serving Guide - Production deployment, monitoring, autoscaling

Resources

Docs: https://nvidia.github.io/TensorRT-LLM/
GitHub: https://github.com/NVIDIA/TensorRT-LLM
Models: https://huggingface.co/models?library=tensorrt_llm