nvidia-nim
NVIDIA NIM (NVIDIA Inference Microservices) for deploying and managing AI models. Use for NIM microservices, model inference, API integration, and building AI applications with NVIDIA's inference infrastructure.
| name | nvidia-nim |
| description | NVIDIA NIM (NVIDIA Inference Microservices) for deploying and managing AI models. Use for NIM microservices, model inference, API integration, and building AI applications with NVIDIA's inference infrastructure. |
Comprehensive guide for deploying GPU-accelerated AI inference microservices with NVIDIA NIM™. NIM provides containers to self-host pretrained and customized AI models across clouds, data centers, and RTX™ AI PCs with industry-standard APIs.
This skill should be triggered when you:
Deployment & Infrastructure:
Model Integration:
Performance Optimization:
Specific Use Cases:
NIM (NVIDIA Inference Microservices) are containerized microservices that provide:
# Deploy a Llama 3 inference microservice
docker run --gpus all \
    -e NGC_API_KEY=$NGC_API_KEY \
    -p 8000:8000 \
    nvcr.io/nvidia/nim/llama-3-8b-instruct:latest
Launches a GPU-accelerated Llama 3 inference service on port 8000
# Use NIM API like OpenAI
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-used",  # NIM handles auth differently
)
response = client.chat.completions.create(
    model="llama-3-8b-instruct",
    messages=[
        {"role": "user", "content": "Explain quantum computing"}
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
Standard OpenAI SDK works seamlessly with NIM endpoints
# Add NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# Deploy NIM microservice
helm install my-nim nvidia/nim \
    --set image.repository=nvcr.io/nvidia/nim/llama-3-8b-instruct \
    --set image.tag=latest \
    --set replicaCount=3 \
    --set 'resources.limits.nvidia\.com/gpu=1'  # escape the dot so Helm reads nvidia.com/gpu as a single key
Scale NIM inference across Kubernetes cluster with GPU allocation
# Direct REST API call to NIM
curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama-3-8b-instruct",
        "messages": [{"role": "user", "content": "Hello, world!"}],
        "temperature": 0.5,
        "max_tokens": 100
    }'
REST endpoint for language-agnostic integration
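The same request can be assembled from Python with only the standard library, which may help when the OpenAI SDK is not available. This is a minimal sketch: `build_chat_request` and `send_chat_request` are hypothetical helper names, and the endpoint, model name, and parameters simply mirror the curl call above.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message, temperature=0.5, max_tokens=100):
    """Build (but do not send) a POST request for an OpenAI-style chat completion."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=30):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


request = build_chat_request(
    "http://localhost:8000/v1", "llama-3-8b-instruct", "Hello, world!"
)
# print(send_chat_request(request))  # uncomment once a NIM container is listening
```

Building the request separately from sending it keeps the payload easy to inspect and log before any network traffic happens.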
# Deploy your custom fine-tuned model with NIM
docker run --gpus all \
    -e NGC_API_KEY=$NGC_API_KEY \
    -e MODEL_PATH=/models/my-custom-model \
    -v /path/to/models:/models \
    -p 8000:8000 \
    nvcr.io/nvidia/nim/base-llm:latest
NIM supports custom models fine-tuned on your data
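# Building retrieval-augmented generation with NIM
# Imports assume LangChain 0.2/0.3 with langchain-openai and langchain-community installed
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI

# Point LangChain to NIM endpoint (a chat model, since requests go to /v1/chat/completions)
llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-used",
    model="llama-3-8b-instruct",
)

# `vectorstore` is assumed to be a FAISS index built or loaded elsewhere,
# for example with FAISS.from_documents(...) or FAISS.load_local(...)
# Create RAG chain with NIM-powered LLM
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(),
    chain_type="stuff",
)
answer = qa_chain.invoke({"query": "What are the key features of NIM?"})["result"]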
Integrate NIM into RAG workflows with frameworks like LangChain
# docker-compose.yml for multi-GPU NIM deployment
version: '3'
services:
  nim-service:
    image: nvcr.io/nvidia/nim/llama-3-70b-instruct:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 4  # Use 4 GPUs
              capabilities: [gpu]
    environment:
      - NGC_API_KEY=${NGC_API_KEY}
      - TENSOR_PARALLEL_SIZE=4
    ports:
      - "8000:8000"
Leverage multiple GPUs for large model inference
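A rough back-of-the-envelope calculation shows why a 70B model is split across several GPUs. The sketch below assumes 16-bit weights (2 bytes per parameter), counts 1 GB as 10^9 bytes, and covers weights only; KV cache, activations, and runtime overhead need additional memory. `weight_memory_gb` is a hypothetical helper, not part of NIM.

```python
def weight_memory_gb(num_params_billion, bytes_per_param=2, tensor_parallel=1):
    """Approximate weight memory per GPU in GB (1 GB = 1e9 bytes).

    Assumes the weights are split evenly across `tensor_parallel` GPUs.
    """
    total_gb = num_params_billion * bytes_per_param  # 1e9 params * bytes / 1e9
    return total_gb / tensor_parallel


total = weight_memory_gb(70)                     # whole model on one device
per_gpu = weight_memory_gb(70, tensor_parallel=4)  # split across 4 GPUs
print(f"total: {total:.0f} GB, per GPU with 4-way tensor parallelism: {per_gpu:.0f} GB")
```

Under these assumptions the weights alone come to about 140 GB, or about 35 GB per GPU when sharded four ways, which is why the compose file reserves four devices.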
# Access built-in metrics from NIM
import requests
metrics = requests.get("http://localhost:8000/metrics")
print(metrics.text)
# Prometheus-compatible metrics include:
# - nim_inference_requests_total
# - nim_inference_duration_seconds
# - nim_gpu_utilization_percent
# - nim_throughput_tokens_per_second
Built-in Prometheus metrics for dashboarding and monitoring
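To act on these metrics from a script rather than a dashboard, the Prometheus text format can be parsed with a few lines of Python. This is a simplified sketch, not a full parser: `parse_prometheus_text` is a hypothetical helper, the sample payload and its values are made up for illustration, and the labels shown are assumptions about how a container might tag its series.

```python
def parse_prometheus_text(text):
    """Parse Prometheus text exposition format into {series: value}."""
    samples = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        if "}" in line:
            # Series with labels: everything up to the last closing brace is the key.
            head, _, tail = line.rpartition("}")
            series = head + "}"
            fields = tail.split()
        else:
            series, *fields = line.split()
        if fields:
            samples[series] = float(fields[0])  # an optional timestamp may follow
    return samples


# Made-up sample payload for illustration only; these are not real measurements.
SAMPLE = """\
# HELP nim_inference_requests_total Total inference requests
# TYPE nim_inference_requests_total counter
nim_inference_requests_total 42
nim_inference_duration_seconds{quantile="0.5"} 0.25
nim_gpu_utilization_percent{gpu="0"} 87.5
"""

parsed = parse_prometheus_text(SAMPLE)
for series, value in parsed.items():
    print(series, value)
```

In practice the `text` argument would be `metrics.text` from the request above. For anything beyond quick checks, a maintained Prometheus client library is the safer choice.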
# Use NVIDIA AI Blueprints with NIM
from nvidia_blueprints import AgentBlueprint

# Initialize blueprint with NIM endpoint
agent = AgentBlueprint(
    nim_endpoint="http://localhost:8000/v1",
    model="llama-3-8b-instruct",
    tools=["web_search", "calculator", "code_executor"],
)

# Execute agentic workflow
result = agent.run(
    task="Research and summarize recent AI developments"
)
Predefined AI workflows using NIM as inference backend
# Alternative: Use Hugging Face dedicated endpoints
import os

from huggingface_hub import InferenceClient

client = InferenceClient(
    model="nvidia/llama-3-8b-nim",
    token=os.environ["HF_TOKEN"],  # read the access token from the environment
)
response = client.text_generation(
    "Explain NVIDIA NIM",
    max_new_tokens=200,
)
Managed NIM deployment via Hugging Face cloud infrastructure
This skill includes comprehensive documentation in references/:
Primary documentation for NVIDIA NIM architecture and capabilities:
Best for:
Additional resources and references:
Best for:
Start here:
references/microservices.md "How It Works" section
First Steps:
Focus on:
Key Skills:
Advanced topics:
Production Considerations:
Custom Models:
Integration Path:
1. Choose model from NVIDIA catalog → references/microservices.md
2. Deploy with Kubernetes Helm chart → Quick Reference #3
3. Configure multi-GPU if needed → Quick Reference #7
4. Set up monitoring → Quick Reference #8
5. Test with OpenAI-compatible client → Quick Reference #2
6. Scale based on metrics
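The last step, scaling based on metrics, can be sketched with the core formula used by the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric). The function below is an illustration only; the choice of metric (for example average GPU utilization) and the replica bounds are assumptions, and the real autoscaler also applies a tolerance band and stabilization windows that are omitted here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Core HPA-style scaling rule, clamped to [min_replicas, max_replicas]."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical readings: 3 replicas averaging 90% utilization against a 60% target.
print(desired_replicas(3, current_metric=90, target_metric=60))
```

With these example numbers the rule asks for 5 replicas, since 3 × 90 / 60 = 4.5 rounds up.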
1. Deploy NIM inference endpoint → Quick Reference #1
2. Integrate with vector database (Milvus, Pinecone)
3. Connect via LangChain → Quick Reference #6
4. Implement retrieval pipeline
5. Add observability and error handling
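The retrieval step in this workflow can be sketched without any external service. The example below ranks passages by cosine similarity and assembles a prompt for the NIM endpoint; the three-dimensional vectors are hand-written toy values standing in for real embeddings, and `top_k` and `build_prompt` are hypothetical helpers. A vector database such as Milvus or Pinecone would replace the in-memory list.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, docs, k=2):
    """Return the k passage texts most similar to the query vector."""
    ranked = sorted(docs, key=lambda doc: cosine(query_vec, doc[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, passages):
    """Combine retrieved passages and the question into a single prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Toy corpus: (text, embedding) pairs with made-up vectors.
DOCS = [
    ("NIM exposes OpenAI-compatible APIs", [1.0, 0.0, 0.0]),
    ("Helm installs charts on Kubernetes", [0.0, 1.0, 0.0]),
    ("NIM containers run on NVIDIA GPUs", [0.9, 0.1, 0.0]),
]

passages = top_k([1.0, 0.0, 0.0], DOCS, k=2)
prompt = build_prompt("What are the key features of NIM?", passages)
print(prompt)
```

The resulting `prompt` string would be sent as the user message to the chat completions endpoint.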
1. Use NVIDIA AI Blueprint → Quick Reference #9
2. Deploy NIM for reasoning engine
3. Configure tool integrations (search, APIs)
4. Implement agentic workflow
5. Monitor agent performance
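The tool-integration step can be illustrated with a small dispatcher that maps a model's tool request to a Python function. This sketch is independent of any Blueprint API: the JSON shape (`name` plus `arguments`) is an assumption loosely modeled on OpenAI-style function calling, and `calculator`, `TOOLS`, and `dispatch` are hypothetical names.

```python
import json
import operator

_OPS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "div": operator.truediv,
}


def calculator(a, b, op):
    """A deliberately tiny tool: apply one arithmetic operation."""
    return _OPS[op](a, b)


TOOLS = {"calculator": calculator}


def dispatch(tool_call_json, registry=TOOLS):
    """Run the tool named in a JSON tool call and wrap the outcome."""
    call = json.loads(tool_call_json)
    name = call.get("name")
    if name not in registry:
        return {"error": f"unknown tool: {name}"}
    try:
        return {"result": registry[name](**call.get("arguments", {}))}
    except Exception as exc:  # report tool failures back to the model
        return {"error": f"{name} failed: {exc}"}


outcome = dispatch('{"name": "calculator", "arguments": {"a": 2, "b": 3, "op": "add"}}')
print(outcome)
```

Returning errors as data rather than raising lets the agent loop feed the failure back to the model and continue, which also gives a natural place to record per-tool metrics.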
max_batch_size for your workload
Container won't start:
nvidia-smi
Out of memory errors:
Slow inference:
API compatibility issues:
/v1/chat/completions)