| name | content-cannibalization |
| description | Detect content cannibalization on a site — multiple URLs ranking (or impressing) for the same query, splitting click-through, diluting authority, and confusing Google about which URL is canonical for the intent. Pulls GSC query × page data, identifies cannibalized clusters (≥ 2 pages with ≥ 50 impressions for the same query within position 1-50), classifies the underlying issue (true duplicate / intent overlap / template confusion / accidental same H1 / variant page leakage), and recommends one of five resolutions: consolidate-and-301, canonicalize, differentiate-intent, noindex-the-weaker, or update-internal-links-to-disambiguate. Generates the actual redirect rules, canonical tags, content briefs for differentiation, and a recovery model showing expected click consolidation. TRIGGER on "cannibalization", "content cannibalization", "keyword cannibalization", "ranking conflicts", "multiple pages for same query", "consolidate content", "GSC cannibalization". |
Content Cannibalization Detector & Resolver
You find cannibalization and recommend the specific fix. Cannibalization is a common cause of "we have lots of content but rankings are stuck", and it stays invisible unless you join GSC query-level data with the site's page intents.
============================================================
=== PRE-FLIGHT ===
Recovery:
- No backlink data: use internal PageRank from /internal-link-graph as authority proxy.
- No intent map: auto-derive via top GSC query per page (with a "MUST_VERIFY" flag on results).
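The intent-map fallback above could be sketched as follows. This is a minimal illustration, not a fixed implementation: the `GscRow` shape and its `page` field name are assumptions, and "top query" is taken to mean the query with the most impressions for that page.

```python
from typing import NamedTuple

class GscRow(NamedTuple):
    # Hypothetical row shape for a GSC query x page export.
    query: str
    page: str
    impressions: int
    clicks: int
    position: float

def derive_intent_map(gsc_rows):
    """Map each page to its highest-impression query, flagged for review."""
    best = {}
    for row in gsc_rows:
        current = best.get(row.page)
        if current is None or row.impressions > current.impressions:
            best[row.page] = row
    # Every auto-derived intent carries the MUST_VERIFY flag.
    return {
        page: {"intent_query": row.query, "flag": "MUST_VERIFY"}
        for page, row in best.items()
    }
```

The flag is attached to every entry on purpose: a top query is only a proxy for intent, so a human should confirm it before any redirect decision depends on it.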
============================================================
=== PHASE 1: CANNIBALIZATION DETECTION ===
For each unique query in GSC (filter to impressions ≥ 50 in window):
from collections import defaultdict

def detect_cannibalization(gsc_rows, min_impressions=50, position_max=50):
    """
    Returns clusters where a single query has 2+ URLs, each within
    position 1..position_max and with at least min_impressions.
    """
    # Group qualifying rows (one row per query x page pair) by query.
    by_query = defaultdict(list)
    for row in gsc_rows:
        if row.impressions >= min_impressions and row.position <= position_max:
            by_query[row.query].append(row)
    # Keep only queries with 2+ competing URLs, highest-click URL first.
    clusters = {}
    for query, rows in by_query.items():
        if len(rows) >= 2:
            clusters[query] = sorted(rows, key=lambda r: -r.clicks)
    return clusters
Output cannibalization_clusters.csv:
| Query | URL | Impressions | Clicks | Avg Position | CTR | Cluster Size |
|---|---|---|---|---|---|---|
| best CRM | /blog/best-crm | 8400 | 240 | 8.2 | 2.9% | 3 |
| best CRM | /pricing | 1100 | 12 | 22.4 | 1.1% | 3 |
| best CRM | /reviews | 600 | 5 | 31.8 | 0.8% | 3 |
VALIDATION: Detection is expected to produce non-zero clusters on a site with > 500 pages and > 100 ranking queries; treat an empty result as a signal to re-check the GSC pull and thresholds.
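One way to turn the detected clusters into the cannibalization_clusters.csv layout is sketched below. It assumes clusters is a dict of query to row list (as returned by the detection function) and that each row exposes a `page` attribute for the URL, which is a hypothetical field name.

```python
import csv
import io
from typing import NamedTuple

class GscRow(NamedTuple):
    # Hypothetical row shape; `page` holds the URL.
    query: str
    page: str
    impressions: int
    clicks: int
    position: float

HEADER = ["Query", "URL", "Impressions", "Clicks", "Avg Position", "CTR", "Cluster Size"]

def clusters_to_csv(clusters):
    """Flatten {query: [rows]} into CSV text, one line per competing URL."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(HEADER)
    for query, rows in clusters.items():
        for r in rows:
            ctr = r.clicks / r.impressions if r.impressions else 0.0
            writer.writerow([
                query, r.page, r.impressions, r.clicks,
                f"{r.position:.1f}", f"{ctr:.1%}", len(rows),
            ])
    return buf.getvalue()
```

Writing to a string buffer keeps the sketch side-effect free; swapping `io.StringIO()` for an opened file handle would write the CSV to disk instead.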
============================================================
=== PHASE 2: ROOT-CAUSE CLASSIFICATION ===
For each cluster, classify the cannibalization type:
| Class | Signal | Typical fix |
|---|---|---|
| True duplicate | Embedding similarity > 0.95 between competing pages | Consolidate: 301 weaker → stronger, then remove the weaker page |
| Intent overlap | Same query, different intents (e.g., transactional + informational) | Differentiate H1 / content; do NOT canonicalize one page to the other |
| Template confusion | Multiple variant/filter pages indexable, same query | Canonicalize to parent; or noindex variants |
| Accidental same H1 | Different topics but same H1 / title | Rewrite the H1 / title of one |
| Variant page leakage | UTM / session ID / filter query strings indexed | Set <link rel="canonical"> to clean URL; avoid a robots.txt Disallow on these URLs, since a blocked URL cannot be crawled and its canonical is never seen |
| Author overlap | Same author bio repeated, body content overlaps in intro paragraphs | Update author template to be lighter |
Detection per class:
- True duplicate: embedding cosine ≥ 0.95 across competing URLs.
- Intent overlap: the SERP layout (Google AI Overview vs ten blue links vs shopping ads) suggests a single dominant intent; the competing pages target different intents, but Google ranks them for the same query.
- Template confusion: URL pattern suggests filter/variant (?color=, ?sort=, /category/page/2/).
- Variant leakage: URL has query string or session ID.
VALIDATION: Each cluster has a class label + confidence (0-1).
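The URL-based and embedding-based checks above might be combined along these lines. This is a sketch under assumptions: embeddings arrive as plain float lists from whatever model is in use, the parameter lists are illustrative rather than exhaustive, and the confidence values are placeholders, not calibrated scores.

```python
import math
import re
from urllib.parse import urlsplit, parse_qsl

# Illustrative parameter lists; extend per site.
TRACKING_PARAM = re.compile(r"^(utm_[a-z]+|sessionid|sid|gclid|fbclid)$", re.I)
FILTER_PARAMS = {"color", "sort", "size"}
PAGINATION_PATH = re.compile(r"/page/\d+/?$")

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def classify_pair(url_a, url_b, emb_a, emb_b):
    """Return (class_label, confidence) for two competing URLs."""
    for url in (url_a, url_b):
        parts = urlsplit(url)
        keys = [k for k, _ in parse_qsl(parts.query, keep_blank_values=True)]
        if any(TRACKING_PARAM.match(k) for k in keys):
            return ("variant_leakage", 0.9)      # placeholder confidence
        if any(k.lower() in FILTER_PARAMS for k in keys) or PAGINATION_PATH.search(parts.path):
            return ("template_confusion", 0.8)   # placeholder confidence
    sim = cosine(emb_a, emb_b)
    if sim >= 0.95:
        return ("true_duplicate", round(sim, 2))
    # Intent overlap and same-H1 cases need SERP and on-page data not modeled here.
    return ("needs_review", 0.5)
```

URL checks run first because they are cheap and unambiguous; the embedding comparison only decides between a true duplicate and a case that needs SERP or on-page inspection.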
============================================================
=== PHASE 3: RESOLUTION RECOMMENDATIONS ===
Per cluster, produce one of five resolutions:
A. Consolidate-and-301 (best for true duplicates):
- Identify "winner" (highest authority + best historical clicks + best target intent fit).
- All losers 301-redirect → winner.
- Merge unique content from losers into winner (don't lose value).
- Update internal links to point to winner.
B. Canonicalize (best for variant leakage):
- Set <link rel="canonical" href="https://example.com/canonical"> on variants.
- Don't 301 (preserve UX for filtered nav).
C. Differentiate-intent (best for intent overlap):
- Keep both pages. Rewrite each to target distinct intent.
- Page A → transactional ("buy X"); Page B → informational ("what is X").
- Update internal links to disambiguate.
D. Noindex-the-weaker (best for template / pagination):
- Add <meta name="robots" content="noindex,follow"> on weaker pages.
- For paginated series, rel="next" / rel="prev" can stay in the markup, but Google no longer uses it as an indexing signal, so do not rely on it as the fix.
E. Update-internal-links-only (lightest touch, valid when both pages should stay live):
- Reroute internal links so the intended page gets all the link equity.
- Don't change URL structure.
Per cluster, output resolution_{cluster_id}.md with:
- Recommended resolution (A-E)
- Reasoning (which signals drove it)
- Exact code/redirects/canonical tags to add
- Internal link updates needed
- Expected click recovery (first-pass estimate: sum of cluster clicks × consolidation multiplier 1.2-1.8; replace with the Phase 5 CTR-curve estimate before reporting)
VALIDATION: Recommendation per cluster is concrete and tied to evidence.
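A possible mapping from Phase 2 class to resolution letter, plus a simple winner pick, is sketched below. The class label strings, the `Candidate` fields, the fallback to E for unmapped classes, and the tie-break order (intent fit, then authority, then clicks) are all assumptions made for illustration.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    # Hypothetical per-URL scores; scales are up to the caller.
    url: str
    authority: float
    clicks: int
    intent_fit: float

RESOLUTION_BY_CLASS = {
    "true_duplicate": "A",      # consolidate-and-301
    "variant_leakage": "B",     # canonicalize
    "intent_overlap": "C",      # differentiate-intent
    "template_confusion": "D",  # noindex-the-weaker
}

def recommend(cluster_class, same_primary_intent):
    """Pick a resolution letter; never 301 across differing intents."""
    letter = RESOLUTION_BY_CLASS.get(cluster_class, "E")
    if letter == "A" and not same_primary_intent:
        letter = "C"
    return letter

def pick_winner(candidates):
    """Choose the URL to keep, comparing intent fit, then authority, then clicks."""
    return max(candidates, key=lambda c: (c.intent_fit, c.authority, c.clicks))
```

The intent guard in `recommend` downgrades a would-be 301 to differentiation whenever the pages do not share a primary intent, which mirrors the first strict rule of this skill.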
============================================================
=== PHASE 4: REDIRECT RULES & CANONICAL PATCHES ===
Generate platform-specific implementation:
Nginx:
location = /blog/old-url { return 301 /blog/winner-url; }
Apache (.htaccess):
RewriteEngine On
RewriteRule ^blog/old-url$ /blog/winner-url [R=301,L]
Next.js (next.config.js):
module.exports = {
  async redirects() {
    return [
      // permanent: true responds with 308, which Google treats like a 301
      { source: '/blog/old-url', destination: '/blog/winner-url', permanent: true },
      // ...
    ]
  },
}
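Netlify / Cloudflare Pages _redirects:
/blog/old-url /blog/winner-url 301
Vercel (vercel.json) takes a JSON "redirects" array with the same source / destination / permanent fields as Next.js; Cloudflare Workers need the redirect written in the Worker script itself.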
WordPress (Redirection plugin export):
source,target,type
/blog/old-url,/blog/winner-url,301
Canonical tag patches for "B. Canonicalize":
<link rel="canonical" href="https://example.com/canonical-url">
VALIDATION: Generated redirects parse correctly in their target stack.
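The per-platform rules above can be generated from a single mapping of old URL to winner URL. The sketch below covers only the Nginx form and the plain `source destination status` form; the function names are hypothetical.

```python
def _check(redirects):
    """Reject self-redirects before emitting any rule."""
    for src, dst in redirects.items():
        if src == dst:
            raise ValueError(f"self-redirect: {src}")

def nginx_rules(redirects):
    """Emit one exact-match Nginx location block per redirect."""
    _check(redirects)
    return [
        f"location = {src} {{ return 301 {dst}; }}"
        for src, dst in sorted(redirects.items())
    ]

def redirects_file_rules(redirects):
    """Emit 'source destination 301' lines for a _redirects-style file."""
    _check(redirects)
    return [f"{src} {dst} 301" for src, dst in sorted(redirects.items())]
```

Keeping one source mapping and rendering it per stack avoids drift between platforms; sorting makes the output stable across runs, which helps when the rules are reviewed as a diff.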
============================================================
=== PHASE 5: RECOVERY MODEL ===
Estimate post-fix click recovery:
Expected clicks per cluster after fix =
(sum of impressions in cluster) × (CTR at winner's new expected position)
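Position lift from consolidation:
if total cluster clicks > 100, assume the winner moves up 1 to 3 positions (closer to the top)
then read CTR from a CTR-by-position curve (approximate averages for Google organic results: pos 1=27%, pos 2=15%, pos 3=11%, pos 4=8%, pos 5=7%, etc.); prefer a curve fitted to the site's own GSC data when available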
For each cluster:
- Current cluster total clicks/week
- Expected clicks/week after fix
- Net gain
- Confidence (high if true duplicate, medium for intent split, low for soft variants)
Aggregate: total expected click gain across all cannibalization fixes. Prioritize highest-gain × lowest-effort first.
VALIDATION: Recovery model uses real CTR-by-position curves, not made-up multipliers.
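A worked version of the recovery estimate might look like this. It is a sketch under assumptions: the `ClusterRow` shape is hypothetical, the lift defaults to 1 position, and `default_ctr` is a placeholder for positions the quoted curve does not cover. Summing per-URL impressions may overstate the total when several URLs appear for the same search, so the result is better read as an upper bound.

```python
from typing import NamedTuple

class ClusterRow(NamedTuple):
    # Hypothetical minimal row: one competing URL in a cluster.
    impressions: int
    clicks: int

# Curve values quoted in this section; positions beyond 5 use default_ctr.
CTR_BY_POSITION = {1: 0.27, 2: 0.15, 3: 0.11, 4: 0.08, 5: 0.07}

def expected_clicks(rows, winner_position, lift=1, default_ctr=0.02):
    """Return (current_clicks, expected_clicks, net_gain) for one cluster."""
    total_impressions = sum(r.impressions for r in rows)
    current = sum(r.clicks for r in rows)
    # Apply the position lift only to clusters with more than 100 clicks.
    applied_lift = lift if current > 100 else 0
    new_position = max(1, round(winner_position) - applied_lift)
    ctr = CTR_BY_POSITION.get(new_position, default_ctr)
    expected = total_impressions * ctr
    return current, expected, expected - current
```

For example, a cluster with 1,500 impressions and 120 clicks whose winner sits at position 4 is modeled at position 3, giving roughly 1,500 × 0.11 = 165 expected clicks and a net gain of about 45.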
============================================================
=== PHASE 6: ACTION QUEUE ===
Generate action_queue.md ordered by impact × inverse effort:
# Cannibalization Action Queue — {site}
## P0 — High impact, low effort
1. [Consolidate] "best CRM" cluster: 301 /reviews → /blog/best-crm; leave /pricing live (transactional intent, so no 301). Expected: +180 clicks/week. Effort: 30 min. Risk: low.
## P1 — High impact, medium effort
2. [Differentiate] "API rate limits" cluster: rewrite /docs/api-limits to focus on technical reference vs /blog/api-rate-limits which keeps tutorial focus. Expected: +90 clicks/week. Effort: 4 hours.
## P2 — Polish
3. [Canonicalize] "/?utm_source=*" variants: add canonical pointing to clean URL. Expected: +5 clicks/week (de-duped indexing). Effort: 1 hour.
VALIDATION: Action queue has per-item expected gain + effort.
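The "impact × inverse effort" ordering can be read as expected gain per hour of effort. A minimal sketch, assuming each action is a dict with hypothetical `gain_per_week` and `effort_hours` keys and that effort is always greater than zero:

```python
def prioritize(actions):
    """Sort actions by expected weekly click gain per hour of effort, best first."""
    return sorted(
        actions,
        key=lambda a: a["gain_per_week"] / a["effort_hours"],
        reverse=True,
    )

queue = prioritize([
    {"name": "utm canonical", "gain_per_week": 5, "effort_hours": 1.0},
    {"name": "best CRM consolidate", "gain_per_week": 180, "effort_hours": 0.5},
    {"name": "API rate limits differentiate", "gain_per_week": 90, "effort_hours": 4.0},
])
```

With the three sample items from the queue above, the ratios are 360, 22.5, and 5, which reproduces the P0 / P1 / P2 order.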
============================================================
=== SELF-REVIEW ===
- Complete: Detection + classification + 5 resolution types + redirect patches + recovery model?
- Robust: Handles variant leakage with query strings? Avoids over-consolidating (some pairs SHOULD stay separate)?
- Clean: Recommendations are platform-specific (Nginx / Next.js / WordPress)?
- SEO-credible: Would an Ahrefs power-user accept the analysis as production-ready?
Common gap: recommending 301 on a page that's still gaining traffic. Always check trend before consolidating — growing-but-second-best may overtake.
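The trend check could be as simple as comparing the recent half of a weekly click series with the earlier half. The 20% growth threshold below is an arbitrary assumption, and the function name is hypothetical.

```python
def is_growing(weekly_clicks, min_growth=0.2):
    """True if the recent half of the series beats the earlier half by min_growth."""
    half = len(weekly_clicks) // 2
    if half == 0:
        return False  # not enough history to judge
    earlier = sum(weekly_clicks[:half])
    recent = sum(weekly_clicks[-half:])
    if earlier == 0:
        return recent > 0
    return (recent - earlier) / earlier >= min_growth
```

A page for which this returns True is a candidate to hold back from a 301 and re-check at the next GSC pull.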
============================================================
=== LEARNINGS CAPTURE ===
Append learnings from each run to ~/.claude/skills/content-cannibalization/LEARNINGS.md.
============================================================
=== STRICT RULES ===
- Never recommend a 301 without first checking that both pages have the same primary intent. Intent splits don't consolidate, they differentiate.
- Never propose a redirect that creates a chain (A→B→C). Always squash to A→C directly.
- Never assume a cluster of 2+ URLs is always bad. Sometimes Google legitimately surfaces multiple URLs (e.g., a subdirectory index plus one of its sub-pages).
- Always update internal links AFTER 301-ing. Otherwise every internal link keeps passing through the redirect, adding a hop per crawl and burning crawl budget.
- Always re-pull GSC after 4-8 weeks to verify recovery model was correct. Refine assumptions over time.
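The no-chains rule can be enforced mechanically before any redirect file is generated. A minimal sketch, assuming redirects are held as a dict of source to destination; the function name is hypothetical:

```python
def squash_chains(redirects):
    """Rewrite every source to point at its final destination (A→B→C becomes A→C)."""
    flat = {}
    for src in redirects:
        seen = {src}
        dst = redirects[src]
        # Follow the chain until the destination is not itself redirected.
        while dst in redirects:
            if dst in seen:
                raise ValueError(f"redirect loop involving {dst}")
            seen.add(dst)
            dst = redirects[dst]
        flat[src] = dst
    return flat
```

For instance, `squash_chains({"/a": "/b", "/b": "/c"})` maps both `/a` and `/b` straight to `/c`, and a loop such as `/a` → `/b` → `/a` raises instead of being emitted.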