ml-cv-specialist
| name | ml-cv-specialist |
| description | Deep expertise in ML/CV model selection, training pipelines, and inference architecture. Use when designing machine learning systems, computer vision pipelines, or AI-powered features. |
Provides specialized guidance for machine learning and computer vision system design, model selection, and production deployment.
Use Case Identified
│
├─► Text/Language Tasks
│   ├─► Classification → BERT, DistilBERT, or API (OpenAI, Claude)
│   ├─► Generation → GPT-4, Claude, Llama (self-hosted)
│   ├─► Embeddings → OpenAI Ada, sentence-transformers
│   └─► Search/RAG → Vector DB + Embeddings + LLM
│
├─► Computer Vision Tasks
│   ├─► Classification → ResNet, EfficientNet, ViT
│   ├─► Object Detection → YOLOv8, DETR, Faster R-CNN
│   ├─► Segmentation → SAM, Mask R-CNN, U-Net
│   ├─► OCR → Tesseract, PaddleOCR, Cloud Vision API
│   └─► Face Recognition → InsightFace, DeepFace
│
├─► Audio Tasks
│   ├─► Speech-to-Text → Whisper, DeepSpeech, Cloud APIs
│   ├─► Text-to-Speech → ElevenLabs, Coqui TTS
│   └─► Audio Classification → PANNs, AudioSet models
│
└─► Structured Data
    ├─► Tabular → XGBoost, LightGBM, CatBoost
    ├─► Time Series → Prophet, ARIMA, Transformer-based
    └─► Recommendations → Two-tower, matrix factorization
| Factor | API Preferred | Self-Hosted Preferred |
|---|---|---|
| Volume | < 10K requests/month | > 100K requests/month |
| Latency | > 500ms acceptable | < 100ms required |
| Customization | General use case | Domain-specific fine-tuning |
| Data Privacy | Non-sensitive data | PII, HIPAA, financial |
| Team Expertise | No ML engineers | ML team available |
| Budget | Predictable per-call costs | High volume justifies infra |
## API Costs (Example: OpenAI GPT-4)
- Input: $0.03/1K tokens
- Output: $0.06/1K tokens
- Average request: 500 input + 200 output tokens
- Cost per request: $0.027
- 100K requests/month: $2,700
## Self-Hosted Costs (Example: Llama 70B)
- GPU instance: $3/hour (A100 40GB; a 70B model needs about 35 GB for weights even at 4-bit precision)
- Throughput: ~50 requests/minute = 3K/hour
- Cost per request: $0.001 (at full GPU utilization)
- 100K requests/month: $100 GPU time + $500 engineering time
## Break-even Analysis
- < 50K requests/month: API likely cheaper
- > 50K requests/month: Self-hosted may be cheaper
- Factor in: engineering time, ops burden, model quality, idle GPU hours
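The cost comparison above can be reproduced with a few lines of arithmetic. The sketch below plugs in the example prices from this section. The function names, the 730-hour month, and the always-on scenario are assumptions added for illustration, and real prices change often.

```python
def api_cost_per_request(input_tokens, output_tokens,
                         input_price_per_1k, output_price_per_1k):
    """Per-request API cost from token counts and per-1K-token prices."""
    return ((input_tokens / 1000) * input_price_per_1k
            + (output_tokens / 1000) * output_price_per_1k)


def self_hosted_cost_per_request(gpu_price_per_hour, requests_per_hour):
    """Per-request GPU cost, assuming the instance is fully utilized."""
    return gpu_price_per_hour / requests_per_hour


def break_even_requests(api_per_request, hosted_per_request, fixed_monthly):
    """Monthly volume above which self-hosting is cheaper than the API."""
    return fixed_monthly / (api_per_request - hosted_per_request)


api = api_cost_per_request(500, 200, 0.03, 0.06)     # example prices above
hosted = self_hosted_cost_per_request(3.0, 50 * 60)  # $3/hour, 50 req/min

# Case A (assumption): GPU billed only while busy, plus $500/month overhead.
on_demand = break_even_requests(api, hosted, fixed_monthly=500)

# Case B (assumption): one always-on instance, about 730 hours/month, plus the
# same overhead. The GPU is then a fixed cost, so marginal cost is taken as 0.
always_on = break_even_requests(api, 0.0, fixed_monthly=3.0 * 730 + 500)

print(f"API: ${api:.3f}/request, self-hosted: ${hosted:.3f}/request")
print(f"Break-even (GPU billed per use): {on_demand:,.0f} requests/month")
print(f"Break-even (always-on GPU):      {always_on:,.0f} requests/month")
```

With these inputs the two scenarios land at roughly 19K and 100K requests per month, which brackets the 50K rule of thumb; where a given system falls depends mostly on how many GPU hours sit idle.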
┌─────────────────────────────────────────────────────────────┐
│                         DATA LAYER                          │
├─────────────────────────────────────────────────────────────┤
│  Data Sources → ETL       → Feature Store → Training Data   │
│  (S3, DBs)      (Airflow)   (Feast)         (Versioned)     │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                       TRAINING LAYER                        │
├─────────────────────────────────────────────────────────────┤
│  Experiment Tracking → Training Jobs → Model Registry       │
│  (MLflow, W&B)         (SageMaker)     (MLflow, S3)         │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                        SERVING LAYER                        │
├─────────────────────────────────────────────────────────────┤
│  Model Server → Load Balancer → Monitoring                  │
│  (TorchServe)   (K8s/ELB)       (Prometheus)                │
└─────────────────────────────────────────────────────────────┘
| Component | Options | Recommendation |
|---|---|---|
| Feature Store | Feast, Tecton, SageMaker | Feast (open source), Tecton (enterprise) |
| Experiment Tracking | MLflow, Weights & Biases, Neptune | MLflow (free), W&B (best UX) |
| Training Orchestration | Kubeflow, SageMaker, Vertex AI | SageMaker (AWS), Vertex (GCP) |
| Model Registry | MLflow, SageMaker, custom S3 | MLflow (standard) |
| Model Serving | TorchServe, TFServing, Triton | Triton (multi-framework) |
Best for: Low-latency requirements, simple integration
Client → API Gateway → Model Server → Response
                             │
                       Load Balancer
                             │
                      ┌──────┴──────┐
                      │             │
                  Model Pod     Model Pod
Latency targets:
Best for: Long-running inference, batch processing
Client → API → Queue (SQS) → Worker → Result Store → Webhook/Poll
                                            │
                                        S3/Redis
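The queue-based flow above can be sketched with the standard library alone. In this illustration an `asyncio.Queue` stands in for SQS and a plain dictionary stands in for the S3/Redis result store; `submit`, `worker`, and `fake_predict` are hypothetical names, not part of any real service API.

```python
import asyncio
import uuid

results = {}  # stand-in for the result store (S3/Redis in the diagram)


async def submit(queue, payload):
    """API side: enqueue the job and return an ID the client can poll."""
    job_id = str(uuid.uuid4())
    results[job_id] = {"status": "pending"}
    await queue.put((job_id, payload))
    return job_id


async def worker(queue, predict):
    """Worker side: pull jobs, run inference, write to the result store."""
    while True:
        job_id, payload = await queue.get()
        try:
            results[job_id] = {"status": "done", "output": await predict(payload)}
        except Exception as exc:  # record the failure, keep the worker alive
            results[job_id] = {"status": "failed", "error": str(exc)}
        finally:
            queue.task_done()


async def fake_predict(payload):
    """Hypothetical slow model call."""
    await asyncio.sleep(0.01)
    return payload.upper()


async def main():
    queue = asyncio.Queue()  # stand-in for SQS
    task = asyncio.create_task(worker(queue, fake_predict))
    job_id = await submit(queue, "hello")
    await queue.join()       # a real client would poll or receive a webhook
    task.cancel()
    return results[job_id]


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The client only ever sees the job ID and the stored status, so the worker pool can be scaled or restarted without holding client connections open.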
Use when:
Best for: Privacy, offline capability, ultra-low latency
┌─────────────────────────────────────────┐
│               EDGE DEVICE               │
│  ┌─────────┐    ┌─────────────────────┐ │
│  │ Camera  │───▶│   Optimized Model   │ │
│  └─────────┘    │   (ONNX, TFLite)    │ │
│                 └─────────────────────┘ │
│                            │            │
│                      Local Result       │
└─────────────────────────────────────────┘
                     │
               Sync to Cloud
              (non-blocking)
Model optimization for edge:
Camera Stream → Frame Extraction → Preprocessing → Model → Postprocessing → Output
      │                │                 │           │            │
    RTSP/           1-30 FPS          Resize,    Batch or  NMS, tracking,
    WebRTC                            normalize  single    annotation
Performance optimization:
## Pipeline Components
1. **Input Processing**
- Video decode: FFmpeg, OpenCV
- Frame buffer: Ring buffer for temporal context
- Preprocessing: NVIDIA DALI (GPU), OpenCV (CPU)
2. **Detection**
- Model: YOLOv8 (speed), DETR (accuracy)
- Batch size: 1-8 depending on latency requirements
- Confidence threshold: 0.5-0.7 typical
3. **Post-processing**
- NMS (Non-Maximum Suppression)
- Tracking: SORT, DeepSORT, ByteTrack
- Smoothing: Kalman filter for stable boxes
4. **Output**
- Annotations: Bounding boxes, labels, confidence
- Events: Trigger on detection (webhook, queue)
- Storage: Frame + metadata to S3/DB
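The NMS step in the post-processing stage can be illustrated in plain Python. This is a minimal greedy, single-class sketch: boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the `iou_threshold` of 0.5 is an assumed default (the list above only specifies the confidence threshold). Production pipelines normally use the vectorized NMS that ships with the detection framework.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, conf_threshold=0.5, iou_threshold=0.5):
    """Greedy NMS over a list of (box, score) pairs for a single class."""
    # Drop low-confidence boxes, then visit the rest from highest score down.
    candidates = sorted(
        (d for d in detections if d[1] >= conf_threshold),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for box, score in candidates:
        # Keep a box only if it does not overlap a higher-scoring kept box.
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


detections = [
    ((10, 10, 50, 50), 0.90),
    ((12, 12, 52, 52), 0.80),      # heavy overlap with the first box
    ((100, 100, 140, 140), 0.75),
    ((200, 200, 220, 220), 0.30),  # below the confidence threshold
]
kept = nms(detections)
print(kept)
```

Here the second box is suppressed by the first, and the fourth never enters NMS because it fails the confidence filter.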
User Query → Embedding → Vector Search → Context Retrieval → LLM → Response
                               │
                           Vector DB
                     (Pinecone, Weaviate,
                       Chroma, pgvector)
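The retrieval half of this flow can be sketched without any external service. The example below uses a toy word-count `embed` function and a brute-force cosine search in place of a real embedding model and vector database; all function names and the sample documents are made up for illustration.

```python
import math


def embed(text, vocabulary):
    """Toy embedding: word counts over a fixed vocabulary (not a real model)."""
    words = text.lower().split()
    return [float(words.count(term)) for term in vocabulary]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query, documents, vocabulary, top_k=2):
    """Brute-force vector search: rank documents by similarity to the query."""
    q = embed(query, vocabulary)
    scored = [(cosine(q, embed(doc, vocabulary)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]


def build_prompt(query, context_docs):
    """Assemble retrieved context and the question into one LLM prompt."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


vocabulary = ["gpu", "latency", "cost", "drift", "batch"]
documents = [
    "batch requests to raise gpu utilization",
    "monitor drift with a population stability index",
    "api cost grows linearly with request volume",
]
question = "how do I reduce gpu cost"
top = retrieve(question, documents, vocabulary)
prompt = build_prompt(question, top)
print(prompt)
```

A vector database replaces the linear scan in `retrieve` with an approximate nearest-neighbor index; the surrounding steps stay the same.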
Vector DB Selection:
| Database | Best For | Limitations |
|---|---|---|
| Pinecone | Managed, scale | Cost at scale |
| Weaviate | Self-hosted, features | Operational overhead |
| Chroma | Simple, local dev | Not for production scale |
| pgvector | PostgreSQL users | Performance at >1M vectors |
| Qdrant | Performance | Newer, smaller community |
┌─────────────────────────────────────────────────────────────┐
│                         API GATEWAY                         │
│            Rate limiting, auth, request routing             │
└─────────────────────────────────────────────────────────────┘
                               │
                 ┌─────────────┼─────────────┐
                 │             │             │
                 ▼             ▼             ▼
             ┌────────┐    ┌────────┐    ┌────────┐
             │ GPT-4  │    │ Claude │    │ Local  │
             │  API   │    │  API   │    │ Llama  │
             └────────┘    └────────┘    └────────┘
                               │
                         Model Router
                   (cost/latency/capability)
Multi-model strategy:
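One way to picture a router that weighs cost, latency, and capability is a filter-then-pick-cheapest rule. The sketch below is only an illustration: the catalogue entries use generic names, and every number in it is a placeholder, not a measured price or latency for any real model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str
    cost_per_1k_tokens: float  # USD; placeholder value
    typical_latency_ms: int    # placeholder value
    capability: int            # 1 = basic, 3 = strongest (made-up scale)


# Hypothetical catalogue: one self-hosted model and two hosted APIs.
CATALOGUE = [
    ModelOption("local-model", 0.001, 300, 1),
    ModelOption("hosted-model-a", 0.015, 1500, 3),
    ModelOption("hosted-model-b", 0.030, 2000, 3),
]


def route(min_capability, max_latency_ms=None, catalogue=CATALOGUE):
    """Return the cheapest model that meets the capability and latency limits."""
    eligible = [
        m for m in catalogue
        if m.capability >= min_capability
        and (max_latency_ms is None or m.typical_latency_ms <= max_latency_ms)
    ]
    if not eligible:
        raise LookupError("no model satisfies the request constraints")
    return min(eligible, key=lambda m: m.cost_per_1k_tokens)


print(route(min_capability=1).name)  # simple task: cheapest option wins
print(route(min_capability=3).name)  # hard task: cheapest capable option
```

Raising an error when nothing qualifies makes the fallback decision explicit; a real router might instead relax the latency limit or queue the request.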
| Technique | Memory Reduction (vs. FP32) | Speed Impact |
|---|---|---|
| FP16 (Half Precision) | 50% | Neutral to faster |
| INT8 Quantization | 75% | 10-20% slower |
| INT4 Quantization | 87.5% | 20-40% slower |
| Gradient Checkpointing (training only) | 60-80% of activation memory | 20-30% slower |
| Model Sharding | Distributed across devices | Communication overhead |
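The precision rows in the table follow directly from bytes per parameter. The sketch below shows that arithmetic for model weights only; it ignores activations, KV cache, and quantization metadata, so treat the results as lower bounds. The 7B parameter count is just an example value.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, precision):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def reduction_vs_fp32(precision):
    """Fractional saving relative to FP32 weights."""
    return 1.0 - BYTES_PER_PARAM[precision] / BYTES_PER_PARAM["fp32"]


for precision in ("fp32", "fp16", "int8", "int4"):
    gb = weight_memory_gb(7e9, precision)
    saving = reduction_vs_fp32(precision)
    print(f"7B params @ {precision}: {gb:5.1f} GB ({saving:.1%} smaller than FP32)")
```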
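# Dynamic batching: flush when the batch is full or max_wait_ms after the first queued request
import asyncio

class DynamicBatcher:
    def __init__(self, model, max_batch=32, max_wait_ms=50):
        self.model = model              # must expose: async predict_batch(list) -> list
        self.queue = []                 # pending (request, future) pairs
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000   # seconds
        self._timer = None              # flush task for the current partial batch

    async def add_request(self, request):
        future = asyncio.get_running_loop().create_future()
        self.queue.append((request, future))
        if len(self.queue) >= self.max_batch:
            await self.process_batch()  # batch is full: flush now
        elif self._timer is None:
            self._timer = asyncio.create_task(self._flush_after_wait())
        return await future             # each caller gets only its own result

    async def _flush_after_wait(self):
        await asyncio.sleep(self.max_wait)
        self._timer = None
        await self.process_batch()

    async def process_batch(self):
        batch, self.queue = self.queue[:self.max_batch], self.queue[self.max_batch:]
        if not batch:
            return
        results = await self.model.predict_batch([req for req, _ in batch])
        for (_, future), result in zip(batch, results):
            future.set_result(result)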
| Metric | What It Measures | Alert Threshold |
|---|---|---|
| Latency (P95) | Response time | > 2x baseline |
| Throughput | Requests/second | < 80% capacity |
| Error Rate | Failed predictions | > 1% |
| Model Drift | Distribution shift | PSI > 0.2 |
| Data Quality | Input anomalies | > 5% anomalies |
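The thresholds in the table translate into a simple check over a monitoring window. The sketch below covers latency, error rate, drift, and data quality; throughput is left out because the alert depends on how capacity is defined for a given deployment. The function and alert names are hypothetical.

```python
def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # ceil(pct/100 * n) in integers
    return ordered[rank - 1]


def check_alerts(latencies_ms, baseline_p95_ms, errors, total, psi, anomaly_rate):
    """Return the names of the metrics that crossed the table's thresholds."""
    alerts = []
    if percentile(latencies_ms, 95) > 2 * baseline_p95_ms:
        alerts.append("latency_p95")    # P95 above 2x baseline
    if total and errors / total > 0.01:
        alerts.append("error_rate")     # more than 1% failed predictions
    if psi > 0.2:
        alerts.append("model_drift")    # PSI above 0.2
    if anomaly_rate > 0.05:
        alerts.append("data_quality")   # more than 5% anomalous inputs
    return alerts


window = list(range(1, 101))  # synthetic latencies of 1..100 ms
fired = check_alerts(window, baseline_p95_ms=40, errors=2, total=100,
                     psi=0.1, anomaly_rate=0.02)
print(fired)
```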
Training Distribution ────┐
                          ├──► Statistical Test ──► Alert
Production Distribution ──┘
                         (PSI, KS test, JS divergence)
Population Stability Index (PSI):
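PSI compares the binned training (expected) and production (actual) distributions of a feature or score: PSI = Σ (a_i − e_i) · ln(a_i / e_i), where e_i and a_i are the share of samples in bin i. The sketch below is a minimal version that assumes equal-width bins over the training range and a small `eps` floor to avoid log(0); quantile bins are a common alternative, and the sample data is synthetic.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """PSI between two numeric samples, binned on the expected sample's range."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        return [max(c / len(values), eps) for c in counts]  # eps avoids log(0)

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


training = [i / 100 for i in range(1000)]  # synthetic, uniform on [0, 10)
shifted = [v + 3 for v in training]        # same shape, shifted by 3

print(f"PSI, identical samples: {psi(training, training):.3f}")
print(f"PSI, shifted sample:    {psi(training, shifted):.3f}")
```

Identical samples give a PSI of zero, while the shifted sample lands well above the 0.2 alert threshold used in the monitoring table.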
| Use Case | Recommended Model | Latency | Cost |
|---|---|---|---|
| Text Classification | DistilBERT | 10ms | Low |
| Text Generation | GPT-4 / Claude | 1-5s | Medium |
| Image Classification | EfficientNet-B0 | 5ms | Low |
| Object Detection | YOLOv8-n | 10ms | Low |
| Object Detection (Accurate) | YOLOv8-x | 50ms | Medium |
| Semantic Segmentation | SAM | 100ms | Medium |
| Speech-to-Text | Whisper-base | Real-time | Low |
| Embeddings | text-embedding-ada-002 | 50ms | Low |
| Scale | GPU | Model Size | Throughput |
|---|---|---|---|
| Development | T4 (16GB) | < 7B params | 10-50 req/s |
| Production Small | A10G (24GB) | < 13B params | 50-100 req/s |
| Production Medium | A100 (40GB) | < 70B params | 100-500 req/s |
| Production Large | A100 (80GB) x 2+ | > 70B params | 500+ req/s |