| name | aqua-deployment |
| description | Deploy LLM models on OCI using AI Quick Actions (AQUA) - single model, multi-model, stacked (LoRA), with GPU shape selection, vLLM configuration, streaming, and tool calling. Triggered when user wants to deploy, update, or manage model deployments. |
| user-invocable | true |
| disable-model-invocation | false |
AQUA Model Deployment
Use this skill when the user wants to deploy, manage, or configure LLM model deployments on OCI Data Science using AI Quick Actions.
Deployment Types
| Type | Description |
|---|
| Single Model | One model per deployment (most common) |
| Multi-Model | Multiple LLMs on one instance via LiteLLM routing |
| Stacked | Base model + multiple LoRA fine-tuned weights sharing inference |
Python SDK Usage
Import
from ads.aqua.modeldeployment import AquaDeploymentApp
deployment_app = AquaDeploymentApp()
Create Single Model Deployment
from ads.aqua.modeldeployment.entities import CreateModelDeploymentDetails
details = CreateModelDeploymentDetails(
model_id="ocid1.datasciencemodel.oc1.iad.xxx",
instance_shape="VM.GPU.A10.2",
display_name="llama-3.1-8b-deployment",
compartment_id="ocid1.compartment.oc1..xxx",
project_id="ocid1.datascienceproject.oc1.iad.xxx",
log_group_id="ocid1.loggroup.oc1.iad.xxx",
log_id="ocid1.log.oc1.iad.xxx",
env_var={
"MODEL_DEPLOY_PREDICT_ENDPOINT": "/v1/completions",
"PARAMS": "--max-model-len 4096",
},
)
deployment = deployment_app.create(create_deployment_details=details)
print(f"Deployment: {deployment.id} | State: {deployment.state}")
Create with Chat Completions Endpoint
details = CreateModelDeploymentDetails(
model_id="ocid1.datasciencemodel.oc1.iad.xxx",
instance_shape="VM.GPU.A10.2",
display_name="llama-3.1-8b-chat",
env_var={
"MODEL_DEPLOY_PREDICT_ENDPOINT": "/v1/chat/completions",
"PARAMS": "--max-model-len 4096",
},
)
Create Multi-Model Deployment
from ads.aqua.common.entities import AquaMultiModelRef
details = CreateModelDeploymentDetails(
models=[
AquaMultiModelRef(
model_id="ocid1.datasciencemodel.oc1.iad.model1",
model_name="llama-3.1-8b",
gpu_count=1,
),
AquaMultiModelRef(
model_id="ocid1.datasciencemodel.oc1.iad.model2",
model_name="mistral-7b",
gpu_count=1,
),
],
instance_shape="VM.GPU.A10.2",
display_name="multi-model-deployment",
compartment_id="ocid1.compartment.oc1..xxx",
project_id="ocid1.datascienceproject.oc1.iad.xxx",
)
deployment = deployment_app.create(create_deployment_details=details)
Create Stacked Deployment (Base + LoRA Fine-Tunes)
from ads.aqua.common.entities import AquaMultiModelRef, LoraModuleSpec
details = CreateModelDeploymentDetails(
models=[
AquaMultiModelRef(
model_id="ocid1.datasciencemodel.oc1.iad.base_model",
model_name="llama-3.1-8b",
fine_tune_weights=[
LoraModuleSpec(
model_id="ocid1.datasciencemodel.oc1.iad.ft1",
model_name="llama-3.1-8b-customer-support",
),
LoraModuleSpec(
model_id="ocid1.datasciencemodel.oc1.iad.ft2",
model_name="llama-3.1-8b-summarization",
),
],
),
],
instance_shape="VM.GPU.A10.2",
display_name="stacked-llama-deployment",
deployment_type="STACKED",
)
deployment = deployment_app.create(create_deployment_details=details)
List Deployments
deployments = deployment_app.list(compartment_id="ocid1.compartment.oc1..xxx")
for d in deployments:
print(f"{d.display_name} | {d.state} | {d.endpoint}")
Get Deployment Details
deployment = deployment_app.get(model_deployment_id="ocid1.datasciencemodeldeployment.oc1.iad.xxx")
Get Deployment Config (Recommended Shapes)
config = deployment_app.get_deployment_config(model_id="ocid1.datasciencemodel.oc1.iad.xxx")
List Available Shapes
shapes = deployment_app.list_shapes(compartment_id="ocid1.compartment.oc1..xxx")
Shape Recommendation
recommendation = deployment_app.recommend_shape(model_id="ocid1.datasciencemodel.oc1.iad.xxx")
CLI Usage
Create Deployment
ads aqua deployment create \
--model_id "ocid1.datasciencemodel.oc1.iad.xxx" \
--instance_shape "VM.GPU.A10.2" \
--display_name "llama-3.1-8b-deployment" \
--compartment_id "ocid1.compartment.oc1..xxx" \
--project_id "ocid1.datascienceproject.oc1.iad.xxx" \
--log_group_id "ocid1.loggroup.oc1.iad.xxx" \
--log_id "ocid1.log.oc1.iad.xxx"
Create Multi-Model Deployment
ads aqua deployment create \
--models '[{"model_id":"ocid1...model1","model_name":"llama-8b","gpu_count":1},{"model_id":"ocid1...model2","model_name":"mistral-7b","gpu_count":1}]' \
--instance_shape "VM.GPU.A10.2" \
--display_name "multi-model"
Create Stacked Deployment
ads aqua deployment create \
--models '[{"model_id":"ocid1...base","model_name":"llama-8b","fine_tune_weights":[{"model_id":"ocid1...ft1","model_name":"ft-support"}]}]' \
--instance_shape "VM.GPU.A10.2" \
--display_name "stacked-deployment" \
--deployment_type "STACKED"
List / Get
ads aqua deployment list --compartment_id "ocid1.compartment.oc1..xxx"
ads aqua deployment get --model_deployment_id "ocid1.datasciencemodeldeployment.oc1.iad.xxx"
Invoking a Deployed Model
Python SDK (Streaming)
import ads
import oci
import requests
ads.set_auth("resource_principal")
endpoint = "https://modeldeployment.us-ashburn-1.oci.customer-oci.com/ocid1.datasciencemodeldeployment.oc1.iad.xxx"
response = requests.post(
f"{endpoint}/predict",
json={
"model": "odsc-llm",
"prompt": "Write a haiku about clouds",
"max_tokens": 256,
"temperature": 0.7,
},
auth=oci.auth.signers.get_resource_principals_signer(),
)
print(response.json())
OpenAI-Compatible Client (ADS)
from ads.aqua.client.openai_client import OpenAI
client = OpenAI(
model_deployment_url="https://modeldeployment.us-ashburn-1.oci.customer-oci.com/ocid1.datasciencemodeldeployment.oc1.iad.xxx",
auth={"signer": oci.auth.signers.get_resource_principals_signer()},
)
response = client.chat.completions.create(
model="odsc-llm",
messages=[{"role": "user", "content": "Hello!"}],
max_tokens=500,
)
print(response.choices[0].message.content)
GPU Shape Reference
Quick sizing rule: GPU_memory_GB = num_params_billions × 2 for FP16/BF16, plus ~20% for KV cache.
| Shape | GPUs | GPU Memory | Fits (FP16) |
|---|
| VM.GPU.A10.1 | 1 | 24 GB | ≤ 7B |
| VM.GPU.A10.2 | 2 | 48 GB | ≤ 13B |
| BM.GPU.A10.4 | 4 | 96 GB | ≤ 34B, or 70B quantized |
| BM.GPU.A100-v2.8 | 8 | 640 GB | ≤ 70B |
| BM.GPU.H100.8 | 8 | 640 GB | ≤ 70B (faster) |
| BM.GPU.H200.8 | 8 | 1128 GB | 405B+ |
For the full shape table, per-model recommendations, multi-model GPU count constraints, and quantization options, see references/shapes.md.
vLLM Configuration Parameters
Set via PARAMS environment variable or --params CLI flag:
| Parameter | Description | Example |
|---|
--max-model-len | Maximum context length | 4096, 8192, 32768 |
--gpu-memory-utilization | Fraction of GPU memory for model | 0.9 (default), 0.95 |
--max-num-seqs | Max concurrent sequences | 256 |
--quantization | Quantization method | fp8, bitsandbytes |
--tensor-parallel-size | Number of GPUs for tensor parallelism | 2, 4, 8 |
--trust-remote-code | Allow custom model code from HF | (no value needed) |
--enable-auto-tool-choice | Enable function/tool calling | (no value needed) |
--tool-call-parser | Parser for tool calls | llama3_json, granite, hermes |
--limit-mm-per-prompt | Limit multimodal inputs | '{"image": 1}' |
--task | Model task override | embedding, transcribe |
--enforce-eager | Disable CUDA graphs | (no value needed) |
Tool Calling / Function Calling
Enable during deployment:
env_var={
"MODEL_DEPLOY_PREDICT_ENDPOINT": "/v1/chat/completions",
"PARAMS": "--enable-auto-tool-choice --tool-call-parser llama3_json --max-model-len 4096",
}
Supported parsers: llama3_json, llama4_json, granite, hermes, mistral, jamba, pythonic, internlm.
Advanced Topics
| Topic | Reference |
|---|
| Shape recommender CLI + JSON output | references/shapes.md → Shape Recommendation Tool section |
| LMCache (KV cache persistence for multi-turn) | references/lmcache.md |
| Private endpoints (no public internet) | references/private-endpoints.md |
| Batch inferencing (offline Job-based) | references/batch-inferencing.md |
Key Source Files
ads/aqua/modeldeployment/deployment.py — AquaDeploymentApp (create, list, get, update)
ads/aqua/modeldeployment/entities.py — CreateModelDeploymentDetails, AquaDeployment
ads/aqua/common/entities.py — AquaMultiModelRef, LoraModuleSpec
ads/aqua/client/openai_client.py — OpenAI-compatible client
ads/aqua/shaperecommend/recommend.py — GPU shape recommendation engine