Use when evaluating agent performance, building test frameworks, measuring quality, or asking about "agent evaluation", "LLM-as-judge", "agent testing", "quality metrics", "evaluation rubrics", "agent benchmarks"
Use when the user mentions "codex", "use gpt", "gpt-5", "openai codex", "let openai", "full-auto", or "autonomous code generation"
Use when compressing agent context, implementing conversation summarization, reducing token usage in long sessions, or asking about "context compression", "conversation history", "token optimization", "context limits", "summarization strategies"
Use when diagnosing agent failures, debugging lost-in-middle issues, understanding context poisoning, or asking about "context degradation", "lost in middle", "context poisoning", "attention patterns", "context clash", "agent performance drops"
Use when optimizing agent context, reducing token costs, implementing KV-cache optimization, or asking about "context optimization", "token reduction", "context limits", "observation masking", "context budgeting", "context partitioning"
Use when the user mentions "CrewAI", "multi-agent systems", "agent orchestration", or "AI crews", or asks about "autonomous agents", "agent collaboration", "role-based agents", "agent workflows", "AI team coordination"
Use when facing 2+ independent tasks that can be worked on without shared state or sequential dependencies
Use when the user mentions "DSPy", "declarative prompting", "automatic prompt optimization", or "Stanford NLP", or asks about "optimizing prompts", "prompt compilation", "modular LLM programming", "chain of thought", "few-shot learning"