azure-openai-patterns
Azure OpenAI API patterns for rate limiting, function calling, error handling, and token optimization
| name | azure-openai-patterns |
| description | Azure OpenAI API patterns for rate limiting, function calling, error handling, and token optimization |
| tier | standard |
| applyTo | **/*openai*,**/*chat*,**/*llm*,**/*gpt* |
Rate limiting, function calling, error handling, and token optimization for Azure OpenAI API.
Version: 1.0.0
Azure OpenAI uses dual rate limits: Tokens Per Minute (TPM) and Requests Per Minute (RPM). The ratio is typically 6 RPM per 1000 TPM.
| Model | Tier | TPM | RPM | Ratio |
|---|---|---|---|---|
| gpt-4o | Default | 450K | 2.7K | 6 RPM/1K TPM |
| gpt-4o-mini | Default | 2M | 12K | 6 RPM/1K TPM |
| gpt-4o | Enterprise | 30M | 180K | 6 RPM/1K TPM |
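
The RPM column can be checked against the TPM column with a few lines of arithmetic. This is a minimal sketch, assuming the 6 RPM per 1,000 TPM ratio stated above holds for the deployment; `rpmFromTpm` is a hypothetical helper name, not part of any SDK.

```typescript
// Assumption: 6 RPM are granted per 1,000 TPM, as in the table above.
const RPM_PER_1K_TPM = 6;

// Hypothetical helper: derive the RPM limit implied by a TPM quota.
function rpmFromTpm(tpm: number): number {
  return (tpm / 1000) * RPM_PER_1K_TPM;
}

console.log(rpmFromTpm(450_000));    // gpt-4o, Default tier
console.log(rpmFromTpm(2_000_000));  // gpt-4o-mini, Default tier
console.log(rpmFromTpm(30_000_000)); // gpt-4o, Enterprise tier
```
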
TPM is estimated before processing, based on:
- the max_tokens parameter setting
- the best_of parameter setting (if used)

The rate limit estimate is NOT the same as actual token consumption for billing.
RPM is enforced over small time windows (1-10 seconds):
600 RPM deployment = max 10 requests per second
Sending 15 requests within 1 second → 429 error,
even though 15 requests is far below the 600 allowed per minute
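
The arithmetic behind that example can be sketched as follows. The helper names are hypothetical, and the one-second window is an illustrative assumption (the text above only says the window is somewhere between 1 and 10 seconds).

```typescript
// Hypothetical helper: requests allowed in a one-second window for a given RPM.
function maxRequestsPerSecond(rpm: number): number {
  return Math.floor(rpm / 60);
}

// Hypothetical helper: minimum spacing between requests that keeps
// traffic evenly spread across the minute.
function minSpacingMs(rpm: number): number {
  return Math.ceil(60_000 / rpm);
}

// True when a burst sent within one second would exceed the per-second share.
function wouldExceedBurst(requestsInOneSecond: number, rpm: number): boolean {
  return requestsInOneSecond > maxRequestsPerSecond(rpm);
}

console.log(maxRequestsPerSecond(600)); // per-second share of a 600 RPM deployment
console.log(minSpacingMs(600));         // spacing in milliseconds
console.log(wouldExceedBurst(15, 600)); // the 15-requests-in-1-second case
```

Spacing requests by at least `minSpacingMs(rpm)` is one simple way to stay inside the window without tracking timestamps.
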
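// Send a chat completion request with tools, retrying on 429 with exponential backoff.
// apiUrl and token are assumed to be defined elsewhere in the module.
async function chatWithTools(
  messages: ChatCompletionMessage[],
  tools: Tool[],
  onStatusUpdate?: (status: string) => void
): Promise<ChatCompletionResponse> {
  const maxRetries = 5;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    const response = await fetch(apiUrl, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${token}`,
      },
      body: JSON.stringify({ messages, tools, tool_choice: 'auto' }),
    });
    if (response.ok) {
      return response.json();
    }
    if (response.status === 429 && attempt < maxRetries) {
      // Prefer the server's Retry-After (seconds); otherwise back off 2s, 4s, 8s, 16s
      const retryAfter = Number(response.headers.get('Retry-After'));
      const waitTime = retryAfter > 0 ? retryAfter : Math.pow(2, attempt);
      onStatusUpdate?.(`Rate limited. Waiting ${waitTime}s...`);
      await new Promise(resolve => setTimeout(resolve, waitTime * 1000));
      continue;
    }
    // Non-429 errors, or a 429 on the final attempt, are not retried
    throw new Error(`API error: ${response.status}`);
  }
  // Unreachable in practice; gives the compiler an explicit end-of-function path
  throw new Error('Retries exhausted');
}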
// Bad: an unnecessarily high max_tokens counts fully against the TPM estimate, even for short responses
const badRequest = { messages, max_tokens: 4000 };
// Good: size max_tokens to the expected response length
const goodRequest = { messages, max_tokens: 500 };
// Bad: Send one request per tool result (consumes RPM quota)
for (const toolCall of toolCalls) {
  const result = await executeFunction(toolCall);
  await sendToolResult(result);
}
// Good: Collect all results and send once
const results = await Promise.all(
  toolCalls.map(tc => executeFunction(tc))
);
await sendToolResults(results); // Single request
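
One way to realize the "collect all results and send once" pattern is to turn every tool call into a `tool` message and append them all to a single follow-up request. The sketch below assumes the Chat Completions tool-call shape (`id`, `function.name`, `function.arguments`, and replies with `role: 'tool'` plus `tool_call_id`). The local types and the `buildToolMessages` helper are hypothetical, and the executor is passed in rather than taken from the original `executeFunction`.

```typescript
// Minimal local types (assumptions, modeled on the Chat Completions tool-call shape).
interface ToolCall {
  id: string;
  function: { name: string; arguments: string };
}

interface ToolMessage {
  role: 'tool';
  tool_call_id: string;
  content: string;
}

// Hypothetical helper: run every tool call concurrently and return one
// tool message per call, ready to append to a single follow-up request.
async function buildToolMessages(
  toolCalls: ToolCall[],
  execute: (call: ToolCall) => Promise<unknown>
): Promise<ToolMessage[]> {
  return Promise.all(
    toolCalls.map(async (call): Promise<ToolMessage> => {
      let content: string;
      try {
        const result = await execute(call);
        content = JSON.stringify(result === undefined ? null : result);
      } catch (e) {
        // Report the failure to the model instead of dropping the call.
        content = JSON.stringify({ error: String(e) });
      }
      return { role: 'tool', tool_call_id: call.id, content };
    })
  );
}
```

Catching errors per call keeps one failing tool from rejecting the whole `Promise.all`, so the other results still reach the model in the same request.
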
const headers = response.headers;
const remainingRequests = headers.get('x-ratelimit-remaining-requests');
const remainingTokens = headers.get('x-ratelimit-remaining-tokens');
const resetRequests = headers.get('x-ratelimit-reset-requests');
const resetTokens = headers.get('x-ratelimit-reset-tokens');
const retryAfter = headers.get('Retry-After'); // Only on 429
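
Header values arrive as strings, or `null` when absent, so they need parsing before use. The sketch below is one possible approach: `readRateLimits` and `shouldThrottle` are hypothetical helpers, the threshold of 5 is an arbitrary illustrative value, and only a numeric `Retry-After` (seconds) is handled, not the HTTP-date form.

```typescript
// Anything with a get(name) method works here, including the fetch API's Headers.
interface HeaderSource {
  get(name: string): string | null;
}

interface RateLimitSnapshot {
  remainingRequests: number | null;
  remainingTokens: number | null;
  retryAfterSeconds: number | null;
}

// Parse a numeric header; returns null when absent or not a number.
function numericHeader(headers: HeaderSource, name: string): number | null {
  const raw = headers.get(name);
  if (raw === null || raw.trim() === '') return null;
  const value = Number(raw);
  return Number.isFinite(value) ? value : null;
}

// Hypothetical helper: collect the rate-limit headers into one object.
function readRateLimits(headers: HeaderSource): RateLimitSnapshot {
  return {
    remainingRequests: numericHeader(headers, 'x-ratelimit-remaining-requests'),
    remainingTokens: numericHeader(headers, 'x-ratelimit-remaining-tokens'),
    retryAfterSeconds: numericHeader(headers, 'Retry-After'),
  };
}

// Hypothetical policy: slow down when fewer than `threshold` requests remain.
function shouldThrottle(snapshot: RateLimitSnapshot, threshold = 5): boolean {
  return snapshot.remainingRequests !== null && snapshot.remainingRequests < threshold;
}
```

Calling `readRateLimits(response.headers)` after each response lets a client slow down before the first 429 arrives.
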
// Bad: the tool returns entire resource objects (huge JSON with every property)
const verboseTool = { name: 'get_resources', description: 'Get all Azure resources' };
// Good: the tool returns only the necessary fields: { name, type, status }
const summaryTool = { name: 'get_resources', description: 'Get Azure resource summary' };
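
Trimming a tool result down to those three fields can be sketched as a simple projection applied before the result is serialized for the model. The `AzureResource` shape and the `summarizeResources` helper are hypothetical; real resource records carry many more properties.

```typescript
// Hypothetical shape of a full resource record; extra properties are allowed.
interface AzureResource {
  name: string;
  type: string;
  status: string;
  [extra: string]: unknown;
}

interface ResourceSummary {
  name: string;
  type: string;
  status: string;
}

// Keep only the fields the model needs; everything else is dropped
// before the tool result is sent back, which saves prompt tokens.
function summarizeResources(resources: AzureResource[]): ResourceSummary[] {
  return resources.map(({ name, type, status }) => ({ name, type, status }));
}

const full: AzureResource[] = [
  { name: 'vm-01', type: 'virtualMachine', status: 'running', tags: { env: 'prod' }, sku: 'D4s' },
];
console.log(JSON.stringify(summarizeResources(full)));
```
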
const request = {
  messages,
  tools,
  parallel_tool_calls: true, // Default: true in newer models
};
// Model may call multiple tools in one response, reducing round trips
// Runs requests one at a time with a minimum gap between them,
// which keeps bursts under the per-second RPM window.
class RequestQueue {
  private queue: Array<() => Promise<void>> = [];
  private processing = false;
  private minDelayMs = 100; // 10 req/sec max

  // Queue a request; the returned promise settles with the request's own result.
  async enqueue<T>(request: () => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      this.queue.push(async () => {
        try { resolve(await request()); }
        catch (e) { reject(e); }
        await this.delay(this.minDelayMs);
      });
      void this.process();
    });
  }

  // Drain the queue in order; the flag prevents two drains running at once.
  private async process() {
    if (this.processing) return;
    this.processing = true;
    while (this.queue.length > 0) {
      const next = this.queue.shift();
      await next?.();
    }
    this.processing = false;
  }

  private delay(ms: number) {
    return new Promise(r => setTimeout(r, ms));
  }
}
| Code | Meaning | Action |
|---|---|---|
| 429 | Rate limited | Exponential backoff, check Retry-After |
| 400 | Invalid request | Check request format, content filter |
| 401 | Authentication error | Refresh token |
| 403 | Quota exceeded | Wait or upgrade tier |
| 500 | Server error | Retry with backoff |
| 503 | Service unavailable | Retry with longer backoff |
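
The table can be encoded as a small lookup so that retry logic and error reporting share one source of truth. This is a sketch only: the `ErrorAction` labels and both helper names are hypothetical, and the mapping simply mirrors the table above.

```typescript
// Hypothetical action labels, one per row group in the table above.
type ErrorAction = 'retry' | 'fix-request' | 'reauthenticate' | 'wait-or-upgrade' | 'fail';

// Map an HTTP status code to the action suggested by the table.
function actionForStatus(status: number): ErrorAction {
  switch (status) {
    case 429: // rate limited
    case 500: // server error
    case 503: // service unavailable
      return 'retry';
    case 400:
      return 'fix-request';
    case 401:
      return 'reauthenticate';
    case 403:
      return 'wait-or-upgrade';
    default:
      return 'fail'; // codes not covered by the table
  }
}

// Only statuses mapped to 'retry' should re-enter the backoff loop.
function isRetryable(status: number): boolean {
  return actionForStatus(status) === 'retry';
}
```
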
if (response.status === 400) {
  const error = await response.json();
  if (error.error?.code === 'content_filter') {
    return { message: 'Content was filtered by safety policy.', filtered: true };
  }
}
| Setting | Value | Rationale |
|---|---|---|
| max_tokens | 500-2000 | Sized for expected response |
| temperature | 0.3-0.7 | Lower for tool calling, higher for creative |
| retry attempts | 5 | Handles transient rate limits |
| base delay | 2000ms | Start at 2s for backoff |
| max delay | 60000ms | Cap at 1 minute |
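
The base delay and max delay settings combine into a capped exponential schedule. The sketch below assumes the delay doubles on each attempt, starting from the base; `backoffDelayMs` and `withJitter` are hypothetical helpers, and jitter is an optional addition that the table does not mention.

```typescript
// Defaults taken from the settings table above.
const BASE_DELAY_MS = 2000;
const MAX_DELAY_MS = 60_000;

// Delay before retry number `attempt` (1-based): base * 2^(attempt - 1), capped at maxMs.
function backoffDelayMs(
  attempt: number,
  baseMs: number = BASE_DELAY_MS,
  maxMs: number = MAX_DELAY_MS
): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt - 1));
}

// Optional "full jitter" (an addition, not from the table): pick a value in
// [0, delayMs) so that many clients do not all retry at the same instant.
function withJitter(delayMs: number, random: () => number = Math.random): number {
  return Math.floor(random() * delayMs);
}

for (let attempt = 1; attempt <= 6; attempt++) {
  console.log(attempt, backoffDelayMs(attempt));
}
```

With the defaults, the cap only takes effect from the sixth attempt onward, so five retries never reach it.
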
| Trigger | Response |
|---|---|
| "azure openai", "rate limit", "429" | Full skill activation |
| "function calling", "tool calling" | Function Calling Patterns section |
| "token optimization", "max_tokens" | Pattern 2 + Recommended Settings |
| "retry", "backoff" | Pattern 1 + Error Codes |
| "request queue", "high volume" | Pattern 3 |