| name | llm-buster |
| description | Use this skill when the user asks to audit a codebase for LLM API calls (OpenAI, Anthropic: Python and TS/JS SDKs plus equivalent raw HTTP) and decide which calls can be replaced with deterministic logic. Activates on phrases like 'audit LLM usage', 'bust LLM calls', 'bust the LLM here', 'llm-bust this', 'do I actually need an LLM here', 'replace LLM with deterministic', 'de-LLM this', 'find LLM calls', 'cut LLM costs', 'is this prompt necessary'; on grep hits for `openai.`, `client.chat.completions.create`, `from anthropic import`, `client.messages.create`, `api.openai.com/v1/chat/completions`, `api.anthropic.com/v1/messages`; and on questions about whether a specific LLM call could be a regex, parser, classifier, lookup, or rule. Produces a per-call audit report with 🟢/🟡/🔴 verdicts and concrete deterministic alternatives. Does NOT rewrite code unless the user explicitly asks after seeing the report. |
| metadata | {"version":"1.0.0","scope":"openai-anthropic-audit","file_policy":"markdown-only"} |
llm-buster skill
Use this skill when the user wants to know which LLM calls in an existing codebase are doing work a regex, parser, classifier, or lookup table could do for free. The skill is an audit tool, not a refactoring tool: it surfaces call sites, classifies the task, applies a rubric, proposes a deterministic replacement, and estimates impact. The user decides what to actually swap.
Core stance
LLMs are paid wizards. They earn their keep when the input is open-ended, the output requires reasoning or generation, or the task surface is too broad to specify. They are wildly overused for tasks a 20-line parser would nail.
Three load-bearing consequences:
- The prompt is the spec. If the prompt names a fixed set of fields, a finite set of categories, or a format conversion, that spec can usually be written in code. Read the prompt before reading the SDK call.
- Variability of input is the deciding factor. A "classify support ticket" call over emails from 12 known templates is different from the same call over free-form user chat. Same SDK, opposite verdicts.
- The LLM was probably right when added. Do not assume the original author was lazy. Sometimes the LLM is absorbing edge-case complexity that a regex would silently corrupt. When in doubt, mark 🟡 and recommend a shadow-mode A/B, not a swap.
What the report looks like
One block per call site, grouped 🟢 → 🟡 → 🔴. Example of a single finding (full template in `references/report.md`):
🟢 `services/intent.py:42`
Task: classification (3 fixed labels: refund / shipping / other).
Prompt is a system message listing the labels with 4 hardcoded few-shot examples; output is parsed with `label = resp.choices[0].message.content.strip().lower()`.
Replacement: `rapidfuzz.process.extractOne(text, LABELS, scorer=fuzz.token_set_ratio)` with a 60-score floor, fallback to "other". ~15 lines.
Risks: loses ability to handle paraphrases the few-shots didn't cover ("money back" → refund). Mitigate by adding aliases to a dict, or shadow-mode for a week against the live LLM and grow the alias list from disagreements.
Per-call cost: ~280 input + ~3 output tokens on gpt-4o-mini ≈ $0.00005. × requests/month = $TBD.
The audit ends with one closing line: "Want me to apply the 🟢 replacements?"
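The finding above names rapidfuzz as the replacement. As a dependency-free illustration of the same idea, the sketch below swaps in the standard library's `difflib` plus a hand-written alias dict. The alias entries, the `classify_intent` name, and the 0.8 cutoff are all hypothetical and not taken from any audited code.

```python
import difflib

# Hypothetical alias table: phrase -> canonical label. In a real audit, seed it
# from the prompt's few-shot examples and grow it from shadow-mode disagreements.
ALIASES = {
    "refund": "refund",
    "money back": "refund",
    "reimbursement": "refund",
    "shipping": "shipping",
    "delivery": "shipping",
    "tracking": "shipping",
}


def classify_intent(text: str, cutoff: float = 0.8) -> str:
    """Return 'refund', 'shipping', or 'other' without calling an LLM."""
    lowered = text.lower()
    # 1) Exact phrase hit (covers multi-word aliases such as "money back").
    for alias, label in ALIASES.items():
        if alias in lowered:
            return label
    # 2) Typo-tolerant match on single tokens.
    for token in lowered.split():
        close = difflib.get_close_matches(
            token.strip(".,!?"), list(ALIASES), n=1, cutoff=cutoff
        )
        if close:
            return ALIASES[close[0]]
    # 3) Nothing matched: fall back to "other", as the finding recommends.
    return "other"
```

The substring check is deliberately crude; it is the part a shadow-mode run against the live LLM would be expected to stress.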
When to activate
Activate on any of:
- user asks to "audit / inventory / find / list" LLM calls in a project;
- user asks "do I need an LLM here", "can I drop this prompt", "is this overkill";
- user asks to "cut LLM costs" or "reduce token spend" via deterministic replacement;
- working directory has `import openai`, `from anthropic`, `@anthropic-ai/sdk`, `openai` (npm), or HTTP to `api.openai.com` / `api.anthropic.com`;
- working directory has imports of common multi-provider wrappers: `litellm`, `google.genai` / Vertex Gemini SDK, `boto3` with `bedrock-runtime`, `ollama`. The skill still activates here but operates in detection-only mode for those providers (see `references/detection.md` § "Detection-only providers"). Replacement recipes still target OpenAI/Anthropic call sites in the same repo;
- user explicitly invokes `/llm-buster` or names this skill.
Do not activate for:
- adding new LLM features (out of scope: this skill is removal-oriented);
- migrating between providers (OpenAI ↔ Anthropic);
- prompt engineering or quality improvement of an existing LLM call;
- generic "should I use AI for X" greenfield design questions.
First action: scope check, then inventory
Scope sanity check (10 seconds)
Before running grep, confirm you're actually looking at a project and not a one-shot scratch file. Check three things:
ls -la
find . -type f -not -path '*/\.*' | wc -l
If the answer is "one loose .py file" or "no project metadata", abort the per-call report machinery and just inspect the file conversationally: a 200-line script doesn't need a verdict table, a per-file pivot, or a duplication pass. Tell the user: "This looks like a single script, not a project, so I'll skip the audit format and just walk through the LLM calls inline. OK?"
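For environments without a shell, the same sanity check can be approximated in Python. This is a minimal sketch: the `scope_check` name and the `PROJECT_MARKERS` list are assumptions chosen for illustration, and the marker list is not exhaustive.

```python
from pathlib import Path

# Assumption: these filenames are a reasonable, non-exhaustive signal of "a project".
PROJECT_MARKERS = (
    "pyproject.toml", "setup.py", "requirements.txt",
    "package.json", "go.mod", "Cargo.toml",
)


def scope_check(root: str) -> dict:
    """Summarise whether `root` looks like a project or a loose script."""
    base = Path(root)
    # Count regular files, skipping hidden files and anything under a hidden
    # directory (.git, .venv, ...), like the find command above.
    files = [
        p for p in base.rglob("*")
        if p.is_file()
        and not any(part.startswith(".") for part in p.relative_to(base).parts)
    ]
    markers = [name for name in PROJECT_MARKERS if (base / name).exists()]
    # "One loose file" or "no project metadata" both mean: skip the audit format.
    is_project = bool(markers) and len(files) > 1
    return {"file_count": len(files), "markers": markers, "is_project": is_project}
```

When `is_project` is false, fall back to the inline walkthrough described above rather than the full report.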
Inventory grep
From the project root. Two passes: pass 1 (the first two commands, Python then TS/JS) covers OpenAI and Anthropic; pass 2 (the third command) covers the broader provider surface so the user sees their full LLM cost footprint (calls there enter detection-only mode; see next section).
rg -n --no-heading -e 'from openai' -e 'import openai' -e 'openai\.' \
-e 'from anthropic' -e 'import anthropic' -e 'anthropic\.' \
-e 'client\.chat\.completions\.create' -e 'client\.messages\.create' \
-e 'api\.openai\.com' -e 'api\.anthropic\.com' \
--glob '!**/node_modules/**' --glob '!**/.venv/**' --glob '!**/dist/**'
rg -n --no-heading -e "from ['\"]openai['\"]" -e "from ['\"]@anthropic-ai/sdk['\"]" \
-e 'new OpenAI\(' -e 'new Anthropic\(' \
--glob '!**/node_modules/**' --glob '!**/dist/**'
rg -n --no-heading \
-e 'import ollama' -e 'from ollama' -e 'ollama\.(chat|generate)\(' \
-e 'from litellm' -e 'litellm\.' \
-e 'from google import genai' -e 'google\.generativeai' -e 'from vertexai' -e 'GenerativeModel\(' -e 'generate_content\(' \
-e "boto3\.client\(['\"]bedrock-runtime" -e 'invoke_model\(' -e '\.converse\(' \
-e 'from langchain_(google_genai|aws|ollama|mistralai|cohere)' \
-e 'from llama_cpp' -e 'localhost:(11434|8000)/v1' \
--glob '!**/node_modules/**' --glob '!**/.venv/**' --glob '!**/dist/**'
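If `rg` is unavailable, `grep -rnE` with the same `-e` patterns works (replace the `--glob` exclusions with `--exclude-dir`; `-E` is needed because the patterns use `\(` as a literal parenthesis and `(a|b)` as alternation). If pass 1 returns hits, you usually don't need to chase wrapper frameworks (LangChain `ChatOpenAI`, instructor, LiteLLM to OpenAI), since most projects also call the SDK directly somewhere. See `references/detection.md` for AST queries and the full wrapper pattern list when grep misses.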
This produces the call-site list. Everything downstream operates on that list.
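If neither `rg` nor `grep` is available, pass 1 can be approximated with the standard library. The sketch below reuses the pass-1 patterns from the commands above; the `inventory` name and the extension filter are assumptions, and it makes no attempt at the AST-level detection that `references/detection.md` covers.

```python
import os
import re

# Pass-1 patterns copied from the rg commands above (Python and TS/JS).
PATTERNS = [
    r"from openai", r"import openai", r"openai\.",
    r"from anthropic", r"import anthropic", r"anthropic\.",
    r"client\.chat\.completions\.create", r"client\.messages\.create",
    r"api\.openai\.com", r"api\.anthropic\.com",
    r"from ['\"]openai['\"]", r"from ['\"]@anthropic-ai/sdk['\"]",
    r"new OpenAI\(", r"new Anthropic\(",
]
PASS1 = re.compile("|".join(PATTERNS))
SKIP_DIRS = {"node_modules", ".venv", "dist"}
# Assumption: only source files with these extensions are scanned.
EXTENSIONS = (".py", ".ts", ".tsx", ".js", ".jsx", ".mjs")


def inventory(root: str) -> list:
    """Return sorted (relative_path, line_number, line_text) call-site hits."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded directories in place so os.walk never descends into them.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            if not name.endswith(EXTENSIONS):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as fh:
                for lineno, line in enumerate(fh, start=1):
                    if PASS1.search(line):
                        hits.append((os.path.relpath(path, root), lineno, line.strip()))
    return sorted(hits)
```

Like the grep version, this over-matches (for example, a comment mentioning `openai.`), so treat the output as a candidate list to read, not a verdict.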
If only detection-only calls are found
When pass 1 is empty but pass 2 has hits, the project is non-OpenAI/Anthropic (Gemini, Bedrock, Ollama, etc.). The audit still runs (same scope check, same per-call loop, same verdict rubric), but each finding omits the proposed replacement code block and instead writes:
Replacement recipe out of scope for this skill version (provider: <gemini|bedrock|ollama|...>); the rubric still applies and this call is a candidate for deterministic replacement using the same task category.
Also: do not report dollar savings for local runtimes (Ollama, vLLM, llama-cpp); substitute "latency / GPU-time saved" qualitatively. See `references/detection.md` § "Detection-only providers" for the per-provider grep patterns and edge cases.
How to use this skill
Load references on demand, not all up front:
- `references/detection.md`: grep patterns, AST queries, wrapper detection (LangChain, LiteLLM, instructor). Load when the basic inventory grep misses calls or returns false positives.
- `references/taxonomy.md`: the 10 task categories (extraction, classification, routing, validation, normalization, summarization, rephrasing, generation, reasoning, agentic). Load when classifying a specific call.
- `references/rubric.md`: the 🟢 / 🟡 / 🔴 decision rules with disqualifiers. Load when deciding the verdict for a call.
- `references/replacements.md`: concrete code recipes per task category (regex, pydantic, dateutil, rapidfuzz, sklearn TF-IDF + LogReg, email-validator, sentence-transformers). Load when writing the proposed replacement for a 🟢 or 🟡 call.
- `references/report.md`: the report template the audit emits. Load when assembling the final output.
Per-call analysis loop
For each call site found in inventory, run these five steps in order. Do not skip step 2.
1. Read the prompt template, not just the SDK call. The classification depends almost entirely on what the prompt asks the model to do. Resolve f-strings, template literals, message arrays; find the actual instruction text. If the prompt is dynamically assembled from a database or config, note that and ask the user for a sample.
2. Identify the output contract. Is `response_format={"type":"json_object"}` set? Is there a `tools=` / `tool_choice=` argument forcing a function call? Is the assistant text parsed with `json.loads`, a regex, or fed verbatim to a user? The contract tells you whether the call is structured (replaceable candidate) or freeform (probably not).
3. Classify the task. Map to one of the 10 categories in `references/taxonomy.md`. If two fit, pick the dominant one; note the secondary.
4. Apply the rubric. From `references/rubric.md`: 🟢 / 🟡 / 🔴. Apply disqualifiers (free-form input, open-ended output, multi-step reasoning, unbounded category space); any one disqualifier downgrades the verdict.
5. For 🟢 and 🟡 only: propose a replacement. Pull the recipe from `references/replacements.md`, adapt to the call's actual inputs/outputs, and write the snippet inline in the report. For 🔴, write one sentence on why it stays.
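The output-contract step lends itself to a quick first-pass heuristic over the call's source text. A minimal sketch follows; the `output_contract` name and its four return labels are hypothetical, and a text match like this only narrows where to look, it does not replace reading the prompt.

```python
import re


def output_contract(call_source: str) -> str:
    """Guess the output contract of one SDK call from its source text."""
    # Normalise spacing and quote style so both {"type": "x"} and {'type':'x'} match.
    src = call_source.replace(" ", "").replace("'", '"')
    if re.search(r"tools=|tool_choice=", src) or '"type":"json_schema"' in src:
        return "schema-bound"    # forced function call or explicit JSON Schema
    if '"type":"json_object"' in src:
        return "json-any-shape"  # valid JSON guaranteed, shape is not
    if re.search(r"json\.loads\(|re\.(search|match|findall)\(", src):
        return "parsed-text"     # assistant text is post-processed by code
    return "freeform"            # text goes somewhere unparsed, probably to a human
```

Pass in the call plus the few lines that consume its response, since the parsing often happens just after the SDK call.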
Verdict rubric โ short version
Full table in references/rubric.md. Five verdict modes:
- 🟢 / 🟡 / 🔴: standard colour rubric for in-scope local calls.
- ⛔ Out of scope: call exists for benchmarking, training-data generation, or as a baseline being evaluated. Replacing it defeats the purpose of the code.
- 🔵 Delegated: repo is a thin client to a remote service; all LLM work runs server-side. Audit pivots to architecture-level recommendations.
Decision shortcut for the colour rubric:
| Signal | Verdict |
|---|---|
| Fixed output schema + structured/semi-structured input + finite domain | 🟢 |
| Fixed schema but messy input (OCR, free-form chat, mixed languages) | 🟡 |
| Finite category set (≤ ~20 labels) + ≥ 100 labeled examples available | 🟡 |
| Deterministic sub-task buried inside an otherwise-LLM prompt (heuristic, pick-from-list, ratio bucketing) | 🟡 (sub-task extraction) |
| Prompt contains "explain", "summarize", "rephrase", "write", "answer the user's question" | 🔴 |
| Output is shown verbatim to a human end-user as natural language | 🔴 |
| Multi-step reasoning, chain-of-thought, or agentic tool-loop | 🔴 |
| Stateful conversation context required | 🔴 |
| Reasoning over free-form prose produced by another LLM upstream | 🔴 (hard; see H3 in `references/rubric.md`) |
When two rows fight, the harder verdict wins (i.e. 🔴 trumps 🟢). Structural verdicts (⛔, 🔵) replace the colour, not extend it.
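The shortcut table and its tie-break rule can be sketched as a small function. The `Signals` fields and the `shortcut_verdict` name are hypothetical; the sub-task extraction row and the structural verdicts are left out, and the fallback when no row matches follows the "when in doubt, mark 🟡" stance stated earlier.

```python
from dataclasses import dataclass
from typing import Optional

GREEN, YELLOW, RED = "🟢", "🟡", "🔴"
SEVERITY = {GREEN: 0, YELLOW: 1, RED: 2}


@dataclass
class Signals:
    """Hypothetical per-call observations gathered while reading the prompt."""
    fixed_schema: bool = False
    structured_input: bool = False
    finite_domain: bool = False
    messy_input: bool = False
    label_count: Optional[int] = None
    labeled_examples: int = 0
    generative_prompt: bool = False  # "explain", "summarize", "rephrase", ...
    shown_verbatim: bool = False
    multi_step: bool = False
    stateful: bool = False


def shortcut_verdict(s: Signals) -> str:
    """Collect every matching row, then let the harder verdict win."""
    matched = []
    if s.fixed_schema and s.structured_input and s.finite_domain:
        matched.append(GREEN)
    if s.fixed_schema and s.messy_input:
        matched.append(YELLOW)
    if s.label_count is not None and s.label_count <= 20 and s.labeled_examples >= 100:
        matched.append(YELLOW)
    if s.generative_prompt or s.shown_verbatim or s.multi_step or s.stateful:
        matched.append(RED)
    if not matched:
        return YELLOW  # no row matched: when in doubt, mark yellow
    return max(matched, key=SEVERITY.__getitem__)
```

This covers only the shortcut; the full rubric and its disqualifiers live in the rubric reference file.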
Non-negotiables
- Do not rewrite code on a first pass. Output is the audit report. Only swap code when the user explicitly says "apply the 🟢 replacements" or names specific call sites after reading the report.
- Do not downgrade an LLM call without reading the prompt. SDK call detection is not classification. Two `client.chat.completions.create` calls can be 🟢 and 🔴 sitting in adjacent files.
- Do not propose regex for free-form natural language. Regex against user-generated content is a footgun. If the prompt's input includes "user's message", "support ticket body", "tweet", "chat history", that's a disqualifier toward 🔴 unless the task is hard validation (length, format).
- Do not estimate dollar savings without call volume. If the user hasn't provided traffic numbers, show the per-call token cost only (input tokens × price + output tokens × price), and label the monthly figure as `× requests/month = $TBD`. Do not invent traffic.
- Do not claim a replacement is "drop-in" without listing what changes. Latency profile, error modes, and edge-case coverage all shift when you replace an LLM with a parser. The report's "Risks" field is mandatory for 🟢 and 🟡.
- Do not count tokens by character length. Use this fallback order and stop at the first that works:
  1. `tiktoken` for OpenAI models (`pip install tiktoken`, then `tiktoken.encoding_for_model("gpt-4o-mini").encode(text)`).
  2. Anthropic's `client.messages.count_tokens(...)` API for Anthropic models; it costs nothing to call and returns the provider's own count.
  3. `anthropic-tokenizer` (TS) / `gpt-tokenizer` (TS) for JS projects without Python tooling.
  4. If none of the above is reachable in the audit environment, mark the per-call savings as `tokens: estimate pending (install tiktoken)` and proceed. Do not substitute `len(text) / 4` or any other char-based heuristic; it's off by 30-60% on code and structured data and silently inflates the savings table.
- Do not treat `tools=` / `tool_choice=` calls as freeform. A forced function call with a fixed JSON schema is structured output: it is a 🟢/🟡 candidate, not 🔴.
- Do not skip the prompt because it's long. Long system prompts often encode the entire deterministic spec verbatim (a 2000-token system message describing 8 fields is screaming "I am a parser"). Read it.
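The cost and token-counting rules above can be sketched together. This assumes `tiktoken` may or may not be installed and treats prices as caller-supplied inputs (USD per million tokens, taken from the provider's current pricing page); the function names are hypothetical.

```python
from typing import Optional


def count_openai_tokens(text: str, model: str = "gpt-4o-mini") -> Optional[int]:
    """Token count via tiktoken, or None when tiktoken is not usable here."""
    try:
        import tiktoken  # third-party: pip install tiktoken
        return len(tiktoken.encoding_for_model(model).encode(text))
    except Exception:
        # Missing package, unknown model name, or encoding files unreachable.
        # Deliberately no len(text) / 4 fallback.
        return None


def per_call_cost(input_tokens: int, output_tokens: int,
                  usd_per_m_input: float, usd_per_m_output: float) -> float:
    """input tokens x price + output tokens x price, prices in USD per 1M tokens."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000


def cost_line(input_tokens: Optional[int], output_tokens: Optional[int],
              usd_per_m_input: float, usd_per_m_output: float) -> str:
    """Render the report's per-call cost line without inventing traffic."""
    if input_tokens is None or output_tokens is None:
        return "tokens: estimate pending (install tiktoken)"
    cost = per_call_cost(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output)
    return (f"Per-call cost: ~{input_tokens} input + ~{output_tokens} output tokens "
            f"≈ ${cost:.5f}. × requests/month = $TBD.")
```

The monthly figure stays `$TBD` until the user supplies call volume; multiplying by a guessed request count would break the rule above.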
Default output structure
When the user asks "audit this project" or "is this LLM call necessary", produce the report from references/report.md:
- Header line: `N calls found across M files.`
- Verdict table: counts per 🟢 / 🟡 / 🔴 with estimated savings.
- One block per call, ordered 🟢 → 🟡 → 🔴, each with:
  - location (`file:line`);
  - current snippet (the SDK call + the prompt template);
  - detected task category;
  - verdict + one-line justification;
  - proposed replacement code (🟢 / 🟡 only);
  - risks / edge cases the LLM was likely catching;
  - per-call token cost estimate.
- Closing line: what to do next, either "Want me to apply the 🟢 replacements?" or "Sample real inputs for the 🟡 calls before deciding."
If the project has > ~15 call sites, group by file rather than listing all inline, and offer to deep-dive a subset.
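The ordering and grouping rules above can be sketched as follows. The `Finding` fields and the `render_outline` name are hypothetical, and the real layout is whatever `references/report.md` specifies; this only illustrates the header line, the verdict counts, the 🟢 → 🟡 → 🔴 ordering, and the switch to per-file grouping past roughly 15 call sites.

```python
from collections import Counter
from dataclasses import dataclass

ORDER = {"🟢": 0, "🟡": 1, "🔴": 2}


@dataclass
class Finding:
    path: str
    line: int
    verdict: str
    category: str


def render_outline(findings: list, inline_limit: int = 15) -> str:
    """Build the skeleton of the report: header, counts, per-call or per-file rows."""
    files = {f.path for f in findings}
    out = [f"{len(findings)} calls found across {len(files)} files."]
    counts = Counter(f.verdict for f in findings)
    out.append(" | ".join(f"{v} {counts.get(v, 0)}" for v in ORDER))
    if len(findings) > inline_limit:
        # Too many call sites to list inline: group by file instead.
        per_file = Counter(f.path for f in findings)
        out += [f"{path}: {n} call(s)" for path, n in sorted(per_file.items())]
    else:
        ordered = sorted(findings, key=lambda f: (ORDER.get(f.verdict, 3), f.path, f.line))
        out += [f"{f.verdict} {f.path}:{f.line} ({f.category})" for f in ordered]
    if counts.get("🟢"):
        out.append("Want me to apply the 🟢 replacements?")
    else:
        out.append("Sample real inputs for the 🟡 calls before deciding.")
    return "\n".join(out)
```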
Gotchas
- Wrapper frameworks hide the prompt. LangChain's `LLMChain(prompt=PromptTemplate(...))` and instructor's `client.chat.completions.create(response_model=MyModel)` both wrap the underlying call. Find the template / pydantic model; that's the spec. See `references/detection.md`.
- Jupyter notebooks need cell extraction first. Naive `rg` over `.ipynb` returns JSON-escaped fragments (`" response = requests.post(...)\n"`). Convert with `jupyter nbconvert --to script` or the stdlib snippet in `references/detection.md` before classifying.
- Umbrella / multi-subproject repos blow up per-call mode. When the repo has ≥ 3 subprojects, lead with a per-subproject table and deep-dive only the top 2-3 with actionable findings. See `references/report.md` § "Umbrella-repo mode".
- Same call in N files is one finding, not N. Five sites running the same FinBERT-replaceable sentiment call should be reported as one consolidated 🟡 with a cross-reference table; see `references/report.md` § "Duplication pass".
- Research/benchmark code is ⛔, not 🔴. Calls inside `Evaluation_methods/`, `data_gen/`, `prepare_data/`, or LLM-as-judge harnesses cannot be replaced with deterministic code by definition. Tag as ⛔ and skip the colour rubric; see `references/rubric.md` § "⛔ Out of scope".
- Thin clients to remote agentic services are 🔵. When the standard grep returns zero direct LLM calls but the README describes an LLM product, the LLMs are server-side. Pivot to architecture-level recommendations; see `references/rubric.md` § "🔵 Delegated".
- A `temperature=0` call is not automatically deterministic-replaceable. Temperature 0 means low randomness, not low complexity. The task might still be open-ended.
- `response_format={"type":"json_object"}` ≠ schema-bound. Without a JSON Schema (via `tools=` or `response_format={"type":"json_schema",...}`), the model can return any JSON shape. Calls with a true schema are stronger 🟢 candidates than calls with just `json_object`.
- Few-shot prompts are red herrings. A prompt with 5 hardcoded `input → output` examples is usually 🟢: those examples are the lookup table. Build a dict from the examples and pattern-match.
- `gpt-4o-mini` / `claude-haiku` calls feel cheap, so people don't audit them. Volume × cheap-per-call still adds up; cheap models also have worse JSON adherence, meaning a deterministic replacement often improves reliability while cutting cost. Include cheap-model calls in the audit.
- Streaming calls (`stream=True`) are usually generation. If output is streamed to a UI, it's almost always 🔴: the user is reading text as it arrives. Don't propose replacing those.
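For the notebook gotcha above, cell extraction needs only the standard library. This is an illustrative sketch and not the snippet from `references/detection.md`; it assumes the nbformat 4 layout (a top-level `cells` list), and the `notebook_code` name is hypothetical.

```python
import json


def notebook_code(ipynb_text: str) -> str:
    """Concatenate the code cells of an nbformat-4 notebook into plain source."""
    nb = json.loads(ipynb_text)
    chunks = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", [])
        # nbformat stores source either as a list of lines or as one string.
        chunks.append("".join(source) if isinstance(source, list) else source)
    return "\n\n".join(chunks)
```

Run the inventory patterns over the returned text instead of the raw `.ipynb` file; line numbers then refer to the extracted source, so report the cell rather than a file line.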