| name | aqua-troubleshooting |
| description | Diagnose and fix OCI AI Quick Actions (AQUA) issues including deployment failures, OOM errors, authorization problems, capacity issues, container errors, and policy misconfigurations. Triggered when user encounters errors or needs help debugging AQUA workflows. |
| user-invocable | true |
| disable-model-invocation | false |
AQUA Troubleshooting Guide
Use this skill when the user encounters errors or needs help diagnosing issues with OCI AI Quick Actions deployments, fine-tuning, evaluation, or model registration.
Step 1: Check Logs
Always check logs first. Logging must be enabled during deployment creation.
# Stream logs for a model deployment
ads opctl watch <model_deployment_ocid> --auth resource_principal
# Stream logs for a job run (fine-tuning or evaluation)
ads opctl watch <job_run_ocid> --auth resource_principal
To get the OCID: AQUA > Model Deployments tab > click deployment > copy OCID from details.
Common Deployment Errors
1. Service Timeout Error
Symptom: The model deployment fails during startup because the model could not be loaded in time.
Diagnosis: Check logs via ads opctl watch.
Solutions:
- The model may be too large for the selected shape; try a larger GPU shape
- Reduce --max-model-len to decrease memory requirements
2. Out of Memory (OOM) Error
Case A: Model Too Large for GPU
Symptom: CUDA OOM error during model loading.
Solutions (try in order):
- Use a bigger shape (more GPU memory)
- Try FP8 quantization: Add --quantization fp8 to PARAMS
- Try 4-bit quantization: Add --quantization bitsandbytes --load-format bitsandbytes to PARAMS
- Reduce context length: Add --max-model-len <smaller_value> to PARAMS
env_var = {
    "PARAMS": "--quantization fp8 --max-model-len 4096",
}
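As a rough guide to why a larger shape or quantization helps, weight memory scales with the parameter count times the bytes stored per parameter. The sketch below is a back-of-the-envelope estimate only: it ignores KV cache, activations, and runtime overhead, and both the helper name `estimate_weight_gib` and the 7B example are illustrative assumptions, not part of AQUA.

```python
# Back-of-the-envelope weight-memory estimate (assumption: weights dominate;
# KV cache, activations, and runtime overhead are NOT included).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "4bit": 0.5}


def estimate_weight_gib(num_params: float, precision: str) -> float:
    """Return the approximate GiB needed just to hold the model weights."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30


if __name__ == "__main__":
    # Hypothetical 7-billion-parameter model at each precision.
    for precision in BYTES_PER_PARAM:
        gib = estimate_weight_gib(7e9, precision)
        print(f"7B weights @ {precision}: {gib:.1f} GiB")
```

If the estimate is already close to the GPU memory of the shape, reducing --max-model-len alone is unlikely to be enough, since the KV cache needs room on top of the weights.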
Case B: KV Cache Too Small
Symptom: Error says "max seq len is larger than maximum tokens in KV cache".
Solution: The error log contains a hint for the max supported --max-model-len. Set it to that value:
env_var = {
    "PARAMS": "--max-model-len <value_from_log>",
}
3. Trust Remote Code Error
Symptom: Error mentions trust_remote_code=True is required.
Solution: Add --trust-remote-code to PARAMS (leave value blank):
env_var = {
    "PARAMS": "--trust-remote-code --max-model-len 4096",
}
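Since PARAMS is a single string that mixes valueless flags (such as --trust-remote-code) with key/value options, it can help to assemble it programmatically. The following is a minimal sketch; `build_params` is a hypothetical helper written for this guide, not an ADS or AQUA function.

```python
def build_params(*flags: str, **options) -> str:
    """Join valueless flags and key/value options into one PARAMS string.

    Hypothetical helper: flags are emitted as `--flag`, options as
    `--key value`, with underscores in option names turned into hyphens.
    """
    parts = [f"--{flag}" for flag in flags]
    for key, value in options.items():
        parts += [f"--{key.replace('_', '-')}", str(value)]
    return " ".join(parts)


# Reproduces the PARAMS value used in the example above.
env_var = {"PARAMS": build_params("trust-remote-code", max_model_len=4096)}
```

Passing `quantization="fp8", max_model_len=4096` instead yields the PARAMS string from the OOM section.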
4. Architecture Not Supported
Symptom: ValueError: Model architectures ['<NAME>'] are not supported for the current vLLM instance.
Solutions:
- Check vLLM supported models
- If not supported by vLLM, use the BYOC (Bring Your Own Container) approach
- For some models, add --trust-remote-code
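To look a model up in the vLLM supported-models list, you first need the architecture name from the error. The sketch below extracts it from a log line, assuming the message has the shape quoted in the symptom above; the exact wording may differ between vLLM versions, and `ExampleForCausalLM` is a made-up name.

```python
import re

# Assumption: the log line follows the format quoted in the symptom.
ARCH_ERROR = re.compile(r"Model architectures \[(.*?)\] are not supported")


def unsupported_architectures(log_text: str) -> list[str]:
    """Return the architecture names listed in the error, or [] if absent."""
    match = ARCH_ERROR.search(log_text)
    if not match:
        return []
    # Names are comma-separated and quoted, e.g. ['A', 'B'].
    return [name.strip().strip("'\"") for name in match.group(1).split(",")]


if __name__ == "__main__":
    line = (
        "ValueError: Model architectures ['ExampleForCausalLM'] are not "
        "supported for the current vLLM instance."
    )
    print(unsupported_architectures(line))
```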
5. Capacity Issues
Symptom: "No capacity for the specified shape" or "Out of host capacity".
Solutions:
- Try a different availability domain
- Try a different GPU shape
- Use capacity reservations
- Wait and retry (capacity is dynamic)
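The "different shape" and "wait and retry" options can be combined into one loop. This is a generic sketch, not AQUA code: `CapacityError` and the `create` callable are hypothetical stand-ins for whatever call creates the deployment and however it reports a capacity failure.

```python
import time


class CapacityError(Exception):
    """Hypothetical stand-in for a 'no capacity' failure."""


def retry_on_capacity(create, shapes, attempts_per_shape=3,
                      base_delay=60.0, sleep=time.sleep):
    """Try each shape in order, backing off between capacity failures.

    `create` is any callable that takes a shape name and either returns a
    result or raises CapacityError.
    """
    for shape in shapes:
        for attempt in range(attempts_per_shape):
            try:
                return create(shape)
            except CapacityError:
                # Wait before retrying the same shape; skip the wait after
                # the last attempt and move on to the next shape instead.
                if attempt < attempts_per_shape - 1:
                    sleep(base_delay * 2**attempt)
    raise CapacityError(f"no capacity for any of: {', '.join(shapes)}")
```

The `sleep` parameter is injectable so the loop can be exercised without real delays.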
Authorization Errors
Root Causes
Authorization errors arise from:
- Missing OCI IAM policies
- Object Storage bucket without versioning enabled
- Notebook session not matched by the dynamic group (for example, it sits in a compartment the matching rule does not cover)
Required Policies
Set up policies via Oracle Resource Manager (ORM) - recommended:
# Go to: AQUA > Policies > Setup via ORM
Or verify with the AQUA Policy Verification tool:
from ads.aqua.verify_policies import AquaVerifyPoliciesApp
verify_app = AquaVerifyPoliciesApp()
result = verify_app.verify()
Policy-to-Operation Mapping
| Operation | Required Policy |
|---|---|
| Create/List Models | manage data-science-models in compartment |
| Create/List Deployments | manage data-science-model-deployments in compartment |
| Create/List Model Version Sets | manage data-science-modelversionsets in compartment |
| Create/List Jobs (FT/Eval) | manage data-science-job-runs in compartment |
| Read Object Storage | read buckets + read objectstorage-namespaces in compartment |
| Write Object Storage | manage object-family in compartment |
| List Log Groups | use logging-family in compartment |
| Use Private Endpoints | use virtual-network-family in compartment |
| Tag Resources | use tag-namespaces in tenancy |
| Evaluation/Fine-Tuning | manage data-science-models + read resource-availability + use virtual-network-family |
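When an operation fails with an authorization error, the table can be turned into candidate policy statements to compare against what is configured. The sketch below covers only a few rows and assumes the policy subject is a dynamic group; the names `aqua-dg` and `ml-compartment` are placeholders, and `policy_statements` is a hypothetical helper.

```python
# Subset of the table above: operation -> "<verb> <resource-type>" grants.
REQUIRED_POLICIES = {
    "Create/List Models": ["manage data-science-models"],
    "Create/List Deployments": ["manage data-science-model-deployments"],
    "Read Object Storage": ["read buckets", "read objectstorage-namespaces"],
    "Write Object Storage": ["manage object-family"],
}


def policy_statements(operation: str, dynamic_group: str, compartment: str) -> list[str]:
    """Render OCI policy statements for one operation.

    Assumption: the subject is a dynamic group; adjust the subject clause
    if your tenancy grants access to user groups instead.
    """
    return [
        f"Allow dynamic-group {dynamic_group} to {grant} in compartment {compartment}"
        for grant in REQUIRED_POLICIES[operation]
    ]


if __name__ == "__main__":
    for statement in policy_statements("Read Object Storage", "aqua-dg", "ml-compartment"):
        print(statement)
```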
Bucket Versioning
The Object Storage bucket must have versioning enabled. Check the current setting (expected value: "Enabled"):
oci os bucket get -bn <bucket-name> --auth resource_principal | jq ".data.versioning"
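If jq is not available, the same check can be done in Python on the captured CLI output. This is a small sketch that assumes the JSON has the `data.versioning` field used in the command above; the sample payload is trimmed and illustrative.

```python
import json


def versioning_enabled(cli_output: str) -> bool:
    """Return True if the `oci os bucket get` JSON reports versioning as Enabled."""
    data = json.loads(cli_output).get("data", {})
    return data.get("versioning") == "Enabled"


if __name__ == "__main__":
    # Trimmed, illustrative payload; real output contains many more fields.
    sample = '{"data": {"name": "my-bucket", "versioning": "Disabled"}}'
    print(versioning_enabled(sample))
```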
Environment Setup Issues
Authentication
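import ads
# Call set_auth once, with the mode that matches where the code runs:
ads.set_auth("resource_principal")  # inside an OCI notebook session or job
ads.set_auth("api_key")  # local machine using an API key from the OCI config file
ads.set_auth("security_token")  # local machine using a session token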
Required Environment Variables (for local/internal development)
export OCI_IAM_TYPE="security_token"
export OCI_CONFIG_PROFILE=<your-profile>
export OCI_ODSC_SERVICE_ENDPOINT="https://datascience.us-ashburn-1.oci.oraclecloud.com"
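A quick way to confirm the variables are visible to the Python process is to check for them before importing anything else. This is a minimal sketch; `missing_env_vars` is a hypothetical helper, and the variable names are the three listed above.

```python
import os

REQUIRED_VARS = ("OCI_IAM_TYPE", "OCI_CONFIG_PROFILE", "OCI_ODSC_SERVICE_ENDPOINT")


def missing_env_vars(environ=os.environ) -> list[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
```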
HuggingFace Gated Models
export HF_TOKEN=<your_hf_read_token>
huggingface-cli login
Fine-Tuning Specific Issues
Dataset Format Errors
- Ensure JSONL format (one valid JSON object per line)
- All rows must have the same schema
- For instruction format: prompt and completion keys are required
- For conversational format: a messages key holding objects with role and content
- Verify there are no trailing commas or other invalid JSON
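The checks above can be run locally before uploading a dataset. The following is a minimal sketch of such a validator, not the validation AQUA itself performs; `validate_jsonl` is a hypothetical helper, and "same schema" is approximated here as "same top-level keys as the first row".

```python
import json


def validate_jsonl(lines):
    """Return (line_number, problem) tuples; an empty list means no problems found."""
    problems = []
    first_keys = None
    for number, raw in enumerate(lines, start=1):
        try:
            row = json.loads(raw)  # rejects blank lines and trailing commas
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(row, dict):
            problems.append((number, "row is not a JSON object"))
            continue
        keys = set(row)
        if "messages" in keys:
            # Conversational format: every message needs role and content.
            messages = row["messages"]
            if not isinstance(messages, list) or not all(
                isinstance(m, dict) and {"role", "content"} <= set(m) for m in messages
            ):
                problems.append((number, "messages entries need role and content"))
        elif not {"prompt", "completion"} <= keys:
            problems.append((number, "needs prompt and completion, or messages"))
        if first_keys is None:
            first_keys = keys
        elif keys != first_keys:
            problems.append((number, "keys differ from the first row"))
    return problems


if __name__ == "__main__":
    sample = ['{"prompt": "Hi", "completion": "Hello"}', '{"prompt": "Bye",}']
    for number, problem in validate_jsonl(sample):
        print(f"line {number}: {problem}")
```

For a file on disk, pass the open file object directly, since iterating it yields one line at a time.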
Distributed Training Failures
- A VCN and subnet are required when replica > 1
- Logging is required for distributed training
- Multi-node overhead is significant; prefer a single replica on a multi-GPU shape
- Check that all nodes can communicate (security lists / NSGs allow traffic)
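The first two requirements are easy to verify before submitting a job. The sketch below encodes them as a pre-flight check; the function and its parameter names (`subnet_id`, `log_group_id`, `log_id`) are hypothetical and chosen for illustration, not taken from the AQUA API.

```python
def check_distributed_config(replica, subnet_id=None, log_group_id=None, log_id=None):
    """Return a list of problems for a fine-tuning config; empty means OK.

    Hypothetical pre-flight check: it only enforces the rules stated above
    and does not test network connectivity between nodes.
    """
    problems = []
    if replica > 1:
        if not subnet_id:
            problems.append("replica > 1 requires a VCN subnet")
        if not (log_group_id and log_id):
            problems.append("replica > 1 requires logging to be configured")
    return problems


if __name__ == "__main__":
    for problem in check_distributed_config(replica=2):
        print(problem)
```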
Evaluation Specific Issues
Evaluation Job Fails
- Ensure the deployment is in the ACTIVE state before running the evaluation
- The dataset must be JSONL with prompt and completion keys
- The report path must be a writable Object Storage location
- Block storage size must be sufficient (default: 50 GB)
BERTScore Issues
- BERTScore is not suitable for evaluating code generation tasks
- Consider ROUGE for summarization-focused evaluations
- The evaluation model endpoint must be reachable from the evaluation job
Diagnostic Commands
ads aqua deployment get --model_deployment_id <ocid>
ads aqua deployment list --compartment_id <compartment_ocid>
ads aqua model get --model_id <model_ocid>
ads aqua verify_policies
Key Source Files
ads/aqua/verify_policies/ — Policy verification app
ads/aqua/common/errors.py — Error hierarchy (AquaValueError, AquaRuntimeError, etc.)
ads/aqua/training/exceptions.py — Training job exit code mappings
ads/aqua/extension/errors.py — HTTP error message templates