| name | agent-platform-inference |
| description | Connects to and performs inference with Google Cloud Agent Platform GenAI models, including First-Party Gemini models and Third-Party OpenMaaS models (Llama, DeepSeek, Qwen, etc.). Use when you need to generate code for calling Gemini or OpenMaaS models, authenticate with GenAI SDK, OpenAI SDK, or legacy Agent Platform SDK, configure base URLs and global/regional endpoints, or troubleshoot 429 Resource Exhausted (DSQ), 400 User Validation, or 404 Not Found errors. Don't use for deploying models to endpoints or for running model evaluations. |
Agent Platform GenAI Inference Skill
This skill provides instructions for authenticating and connecting to Google
Cloud Agent Platform to use Generative AI models. It covers both First-Party
(Gemini) and Third-Party (OpenMaaS) models.
Phase 0: Environment Setup
CRITICAL: Before running any of the Python sample scripts in the scripts/
directory (e.g., scripts/openmaas_openai_sdk.py), you MUST ensure the
environment is correctly initialized by following these steps:
- Google Cloud Authentication: Authenticate with your Google Cloud
credentials and configure active Application Default Credentials (ADC) for
Agent Platform access:
gcloud auth login
gcloud auth application-default login
- Enable API (if not already enabled):
gcloud services enable aiplatform.googleapis.com
- Virtual Environment: Create and activate a dedicated local virtual
environment:
python3 -m venv .venv
source .venv/bin/activate
- Install Dependencies: Install the required SDKs:
pip install -r scripts/requirements.txt
- Verify Setup (Optional): Run all sample scripts at once to verify the
environment is working end-to-end:
./scripts/verify_all.sh
- Execution: Advise the user that every time they execute a Python
snippet from this skill, they must ensure this virtual environment is
activated first.
[!IMPORTANT]
CRITICAL: Model IDs & Availability
Workflow Decision Tree
-
Model Family Identification: Has the user specified whether they want
to call a Gemini (First-Party) model or an OpenMaaS (Third-Party,
e.g. Llama, DeepSeek, Qwen) model?
- No -> Ask the user which model family they want to use. If they
provide a specific model name, infer the family from the name.
- Yes -> Proceed to Step 2.
-
SDK Choice: Which SDK does the user want to use?
- Gemini + GenAI SDK (preferred for Gemini) -> Proceed to
[1. Gemini Models].
- Gemini + legacy Vertex AI SDK -> Proceed to [1. Gemini Models].
- OpenMaaS + OpenAI SDK (preferred for OpenMaaS) -> Proceed to
[2. OpenMaaS Models].
- OpenMaaS + GenAI SDK -> Proceed to [2. OpenMaaS Models].
- Unsure -> Default to the preferred SDK for the chosen family.
-
Troubleshooting: Is the user reporting an error (429 Resource
Exhausted, 400 User Validation, 404 Not Found, etc.)?
- Yes -> Proceed to [3. Troubleshooting & Common Error Codes].
- No -> Proceed with the SDK choice from Step 2.
1. Gemini Models
For Gemini models (e.g., gemini-2.5-pro, gemini-3-flash-preview), the
GenAI SDK (google-genai) is the PREFERRED method. The legacy
vertexai SDK is still supported but GenAI SDK is recommended for new projects.
[!IMPORTANT]
Preview Models (including Gemini 3.1) are often ONLY available in the
global region. Stable models are available in us-central1 and other
regions.
Choosing the Right SDK
- Gemini Models: GenAI SDK (
google-genai) is PREFERRED. Use OpenAI SDK for compatibility, or Legacy SDK (vertexai) if needed.
- OpenMaaS Models: OpenAI SDK is HIGHLY RECOMMENDED. Use GenAI SDK or Legacy SDK if you have specific infrastructure requirements.
Installation
pip install google-genai
Python Example (GenAI SDK - Preferred)
See scripts/gemini_genai_sdk.py for the
complete code.
Alternative: OpenAI SDK (Chat Completions)
Use the standard OpenAI SDK with the Agent Platform endpoint. This is great for
cross-compatibility.
See scripts/gemini_openai_sdk.py for the
complete code.
Legacy: Agent Platform SDK
The legacy vertexai SDK is still widely used but google-genai is preferred
for new Gemini projects.
See scripts/gemini_vertexai_sdk.py for the
complete code.
Documentation: Google GenAI SDK
Documentation: Agent Platform Gemini Models
2. OpenMaaS Models (Llama, DeepSeek, Qwen, etc.)
For OpenMaaS (Model-as-a-Service) models, the HIGHLY RECOMMENDED approach is
to use the standard OpenAI SDK with a specific Vertex AI endpoint.
[!WARNING]
While GenerativeModel can support some OpenMaaS models, it is
discouraged. Use the OpenAI SDK for best compatibility (especially for Chat
Completions).
Installation
pip install openai google-auth
Authentication for OpenAI SDK
You MUST use a Google Cloud OAuth access token as the API key for the OpenAI
SDK.
import google.auth
from google.auth.transport.requests import Request
def get_gcp_access_token():
creds, _ = google.auth.default()
creds.refresh(Request())
return creds.token
[!NOTE]
Google Cloud access tokens typically expire after 1 hour. The
get_gcp_access_token() function above retrieves a fresh token at the time
it is called.
For long-running applications, you implement a refresh mechanism. See Refresh the access token for details.
Configuration (Base URL)
- Global Endpoint (Recommended for most models requiring global
availability):
https://aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/global/endpoints/openapi
- Regional Endpoint:
https://{REGION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{REGION}/endpoints/openapi
Python Example (OpenMaaS - Chat Completions)
See scripts/openmaas_openai_sdk.py for the
complete code.
[!TIP]
Alternative: Environment Variables
You can set environment variables in your shell instead of updating the code.
export OPENAI_BASE_URL="https://aiplatform.googleapis.com/v1/projects/YOUR_PROJECT_ID/locations/global/endpoints/openapi"
export OPENAI_API_KEY="$(gcloud auth print-access-token)"
Then initialize the client without arguments: client = OpenAI()
Python Example (OpenMaaS - Completions API)
The following models support the legacy Completions API: zai-org/glm-5-maas,
moonshotai/kimi-k2-thinking-maas, minimaxai/minimax-m2-maas,
deepseek-ai/deepseek-v3.1-maas, and deepseek-ai/deepseek-v3.2-maas.
response = client.completions.create(
model="deepseek-ai/deepseek-v3.2-maas",
prompt="Once upon a time",
max_tokens=100
)
print(response.choices[0].text)
Python Example (OpenMaaS - Embeddings)
response = client.embeddings.create(
model="intfloat/multilingual-e5-large-maas",
input="The quick brown fox jumps over the lazy dog",
)
print(response.data[0].embedding)
Alternative: GenAI SDK
The google-genai SDK can also access OpenMaaS models via the vertexai
backend.
See scripts/openmaas_genai_sdk.py for the
complete code.
[!IMPORTANT]
Model ID Format: For GenAI SDK with OpenMaaS, you MUST use the full
path: publishers/PUBLISHER/models/MODEL (e.g.,
publishers/zai-org/models/glm-5-maas).
Legacy: Agent Platform SDK (OpenMaaS)
For OpenMaaS, you can also use GenerativeModel (if supported).
See scripts/openmaas_vertexai_sdk.py for
the complete code.
[!IMPORTANT]
Model ID Format: For Agent Platform SDK with OpenMaaS, you MUST use the
full path: publishers/PUBLISHER/models/MODEL.
Model Reference & Availability
Documentation: Use Open Models on Agent Platform
[!TIP]
Self-Deployment for Control: If you need dedicated hardware
(GPUs/TPUs), guaranteed capacity, or specific regional placement not
offered by MaaS, you can Self-Deploy these models to Agent Platform
Endpoints. Search for the model in Model Garden and click "Deploy" to select
your machine type.
[!IMPORTANT]
Finding Inference Examples: The list above is a starting point. For the
definitive inference snippets (especially for Chat Completions payload
structure):
- Consult the Use Open Models on Agent Platform
list.
- Click the link for your specific model (e.g., "DeepSeek-V3") to visit its
Model Garden page.
- Look for the "Sample Code" or "Use this model" button on the Model
Garden page to get the exact
curl or Python code for that specific model
version.
[!NOTE]
This list is INCOMPLETE. See [Use Open Models on Agent Platform]
(https://docs.cloud.google.com/gemini-enterprise-agent-platform/models/maas/use-open-models)
for the full list of supported models.
| Model Family | Model ID Examples | Location | Notes |
|---|
| Llama 4 | meta/llama-4-maverick-17b-128e-instruct-maas | us-east5 | |
| Llama 4 | meta/llama-4-scout-17b-16e-instruct-maas | us-east5 | |
| Llama 3.3 | meta/llama-3.3-70b-instruct-maas | us-central1 | |
| DeepSeek | deepseek-ai/deepseek-v3.2-maas | global | Global ONLY |
| DeepSeek | deepseek-ai/deepseek-v3.1-maas | us-west2 | US-West2 ONLY |
| DeepSeek | deepseek-ai/deepseek-r1-0528-maas | us-central1 | |
| Qwen 3 | qwen/qwen3-coder-480b-a35b-instruct-maas | global | |
| Qwen 3 | qwen/qwen3-next-80b-a3b-instruct-maas | global | |
| Kimi | moonshotai/kimi-k2-thinking-maas | global | |
| MiniMax | minimaxai/minimax-m2-maas | global | |
| GLM | zai-org/glm-4.7-maas, zai-org/glm-5-maas | global | |
3. Troubleshooting & Common Error Codes
429: Resource Exhausted
- Cause: OpenMaaS and Gemini models use Dynamic Shared Quota (DSQ).
Resources are pooled and allocated dynamically based on availability. A 429
error indicates the shared pool is temporarily exhausted, not necessarily
that your specific project quota is hit (though it can be).
- Solution: Implement strict exponential backoff and retry strategies.
- High Throughput: For production workloads requiring high throughput or guaranteed capacity, consider Provisioned Throughput (PT).
- Important: Quota increases through normal cloud processes (Cloud Console) are NOT applicable for DSQ constraints.
- Documentation: Quotas and limits (DSQ)
400: User Validation Error
- Cause: Invalid request format, unsupported parameter, or incorrect Model ID.
- Action: Double-check your request payload and parameters. Verify the Model ID and Region are correct.
404: Not Found / Model Not Available
- Cause: The model is not enabled, or not available in the specified project or region.
- Action:
- Check Location Availability:
- OpenMaaS: Verify the model is available in your region. See Model Availability by Location.
- Gemini:
- Preview Models: All Preview models (e.g., Gemini 3.1, experimental versions) are often ONLY available in the
us-central1 or global regions.
- Stable Models: (e.g., Gemini 2.5 Pro) Available in
us-central1, europe-west4, and many other regions.
- Important: If you get a 404/400 error, try switching your client location to
us-central1 or global.
- Enable Llama Models: For Llama 3.3 and Llama 4, you MUST
enable the model in Model Garden before use. Go to the [Model Garden]
(https://console.cloud.google.com/agent-platform/model-garden), search
for the model card (e.g., "Llama 3.3 API Service"), and click
Enable. Only then can you make inference requests.