duplicate-detection
// Use when user needs to find duplicate or similar content. Triggers on: duplicate, deduplication, plagiarism detection, similar content, near-duplicate, similarity detection, content dedup, find copies.
| Field | Value |
|---|---|
| name | duplicate-detection |
| description | Use when user needs to find duplicate or similar content. Triggers on: duplicate, deduplication, plagiarism detection, similar content, near-duplicate, similarity detection, content dedup, find copies. |
Batch detection of duplicate or highly similar content using vector similarity.
Activate this skill when:
Do NOT activate when a related skill is the better fit: semantic-search, clustering, rec-system.
"What type of duplicates are you looking for?"
A) Exact duplicates (100% identical)
B) Near-duplicates (semantically similar)
C) Both
Which do you need? (A/B/C)
"How similar should content be to be considered duplicate?"
| Threshold | Interpretation | Use Case |
|---|---|---|
| 0.95+ | Near identical | Strict deduplication |
| 0.85-0.95 | Very similar | Plagiarism detection |
| 0.75-0.85 | Related | FAQ merging |
| 0.65-0.75 | Loosely related | Topic clustering |
What threshold fits your needs?
"Based on your requirements:
Proceed? (yes / adjust [what])"
Think of duplicate detection as finding twins in a crowd:
┌─────────────────────────────────────────────────────────┐
│                   Duplicate Detection                   │
│                                                         │
│  New Content: "Machine learning requires lots of data"  │
│                            │                            │
│             ┌──────────────┴──────────────┐             │
│             │                             │             │
│             ▼                             ▼             │
│     ┌───────────────┐             ┌───────────────┐     │
│     │  Hash Check   │             │ Vector Search │     │
│     │    (Exact)    │             │  (Semantic)   │     │
│     └───────┬───────┘             └───────┬───────┘     │
│             │                             │             │
│             ▼                             ▼             │
│  "Content hash matches       "Similar to: 'ML needs     │
│   doc_123" → EXACT DUP        big datasets' (0.92)"     │
│                               → NEAR DUPLICATE          │
│                                                         │
│  Result: {"is_duplicate": true, "type": "near",         │
│           "similarity": 0.92, "match": "doc_123"}       │
└─────────────────────────────────────────────────────────┘
| Method | Speed | Finds | Misses |
|---|---|---|---|
| Hash | ★★★★★ | Exact copies | Paraphrases |
| Vector | ★★★ | Semantic similarity | Texts that share wording but differ in meaning |
| Both | ★★★ | Exact copies and paraphrases | Fewest (best coverage) |
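The hash-then-vector flow summarized in the table can be tried without Milvus or an embedding model. The following is a minimal sketch under loud assumptions: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding, so it measures word overlap rather than meaning; `check` and `library` are illustrative names; and the 0.75 default threshold is chosen only so the toy example behaves sensibly.

```python
import hashlib
import math
from collections import Counter


def content_hash(text: str) -> str:
    """Hash of the text with case and whitespace differences removed."""
    normalized = "".join(text.lower().split())
    return hashlib.md5(normalized.encode()).hexdigest()


def toy_embed(text: str) -> Counter:
    """ASSUMPTION: bag-of-words counts stand in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def check(text: str, library: dict, threshold: float = 0.75) -> dict:
    """library maps a document id to its stored text."""
    # Stage 1: exact match on the normalized hash.
    wanted = content_hash(text)
    for doc_id, stored in library.items():
        if content_hash(stored) == wanted:
            return {"is_duplicate": True, "type": "exact",
                    "match": doc_id, "similarity": 1.0}

    # Stage 2: nearest neighbour by cosine similarity.
    query = toy_embed(text)
    best_id, best_score = None, 0.0
    for doc_id, stored in library.items():
        score = cosine(query, toy_embed(stored))
        if score > best_score:
            best_id, best_score = doc_id, score

    if best_id is not None and best_score >= threshold:
        return {"is_duplicate": True, "type": "near",
                "match": best_id, "similarity": best_score}
    return {"is_duplicate": False, "similarity": best_score}
```

The cheap hash check runs first and short-circuits, so the more expensive similarity scan only happens when no exact copy exists.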
from pymilvus import MilvusClient, DataType
from sentence_transformers import SentenceTransformer
import hashlib


class DuplicateDetector:
    def __init__(self, uri: str = "./milvus.db", threshold: float = 0.9):
        self.client = MilvusClient(uri=uri)
        self.model = SentenceTransformer('BAAI/bge-large-en-v1.5')
        self.threshold = threshold
        self.collection_name = "duplicate_detection"
        self._init_collection()

    def _init_collection(self):
        if self.client.has_collection(self.collection_name):
            return

        schema = self.client.create_schema()
        schema.add_field("id", DataType.VARCHAR, is_primary=True, max_length=64)
        schema.add_field("content_hash", DataType.VARCHAR, max_length=64)
        schema.add_field("content", DataType.VARCHAR, max_length=65535)
        schema.add_field("source", DataType.VARCHAR, max_length=512)
        schema.add_field("embedding", DataType.FLOAT_VECTOR, dim=1024)

        index_params = self.client.prepare_index_params()
        index_params.add_index(field_name="embedding", index_type="AUTOINDEX", metric_type="COSINE")
        index_params.add_index(field_name="content_hash", index_type="TRIE")

        self.client.create_collection(
            collection_name=self.collection_name,
            schema=schema,
            index_params=index_params
        )

    def _hash_content(self, content: str) -> str:
        """Calculate content hash (normalized)"""
        normalized = ''.join(content.lower().split())
        return hashlib.md5(normalized.encode()).hexdigest()

    def check_duplicate(self, content: str, source: str = "") -> dict:
        """Check if content is duplicate"""
        content_hash = self._hash_content(content)

        # 1. Exact match (hash)
        exact_match = self.client.query(
            collection_name=self.collection_name,
            filter=f'content_hash == "{content_hash}"',
            output_fields=["id", "source"],
            limit=1
        )

        if exact_match:
            return {
                "is_duplicate": True,
                "type": "exact",
                "match_id": exact_match[0]["id"],
                "match_source": exact_match[0]["source"],
                "similarity": 1.0
            }

        # 2. Semantic similarity (vector)
        embedding = self.model.encode(content).tolist()
        results = self.client.search(
            collection_name=self.collection_name,
            data=[embedding],
            limit=1,
            output_fields=["id", "content", "source"]
        )

        if results[0] and results[0][0]["distance"] >= self.threshold:
            return {
                "is_duplicate": True,
                "type": "near",
                "match_id": results[0][0]["entity"]["id"],
                "match_source": results[0][0]["entity"]["source"],
                "similarity": results[0][0]["distance"],
                "match_preview": results[0][0]["entity"]["content"][:200] + "..."
            }

        return {
            "is_duplicate": False,
            "similarity": results[0][0]["distance"] if results[0] else 0
        }

    def add_content(self, content_id: str, content: str, source: str = ""):
        """Add content to library"""
        content_hash = self._hash_content(content)
        embedding = self.model.encode(content).tolist()

        self.client.insert(
            collection_name=self.collection_name,
            data=[{
                "id": content_id,
                "content_hash": content_hash,
                "content": content,
                "source": source,
                "embedding": embedding
            }]
        )

    def batch_dedup(self, items: list) -> dict:
        """Batch check and deduplicate

        items: [{"id": "...", "content": "...", "source": "..."}]
        Returns: {"unique": [...], "duplicates": [...]}
        """
        unique = []
        duplicates = []

        for item in items:
            result = self.check_duplicate(item["content"], item.get("source", ""))
            result["id"] = item["id"]

            if result["is_duplicate"]:
                duplicates.append(result)
            else:
                unique.append(item)
                self.add_content(item["id"], item["content"], item.get("source", ""))

        return {
            "unique": unique,
            "duplicates": duplicates,
            "unique_count": len(unique),
            "duplicate_count": len(duplicates),
            "duplicate_ratio": len(duplicates) / len(items) if items else 0
        }


# Usage
detector = DuplicateDetector(threshold=0.85)

# Check single content
result = detector.check_duplicate("Machine learning requires a lot of data")
if result["is_duplicate"]:
    print(f"Duplicate found! Similarity: {result['similarity']:.2f}")
else:
    detector.add_content("doc001", "Machine learning requires a lot of data", "blog.md")

# Batch deduplication
results = detector.batch_dedup([
    {"id": "1", "content": "Python is a programming language", "source": "a.txt"},
    {"id": "2", "content": "Python is a coding language", "source": "b.txt"},  # Near-duplicate
    {"id": "3", "content": "The weather is nice today", "source": "c.txt"},
])
print(f"Unique: {results['unique_count']}, Duplicates: {results['duplicate_count']}")
| Use Case | Threshold | Rationale |
|---|---|---|
| Exact dedup (data cleaning) | 0.95+ | Only near-identical |
| Plagiarism detection | 0.85-0.90 | Allow rewording |
| FAQ merging | 0.80-0.85 | Same question, different words |
| Similar content grouping | 0.70-0.80 | Topically related |
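To see how the same similarity score leads to different decisions per use case, the table can be turned into a small lookup. This is only a sketch: the dictionary keys are hypothetical labels, and treating the lower end of each range as the minimum threshold is an assumption, not something the table prescribes.

```python
# Lower bound of each range in the table above.
# ASSUMPTION: the keys are made-up labels, and the lower bound is used as the cutoff.
MIN_THRESHOLD = {
    "exact_dedup": 0.95,
    "plagiarism": 0.85,
    "faq_merge": 0.80,
    "grouping": 0.70,
}


def use_cases_flagging(similarity: float) -> list:
    """Return the use cases that would treat this similarity score as a duplicate."""
    return [name for name, minimum in MIN_THRESHOLD.items() if similarity >= minimum]


print(use_cases_flagging(0.92))  # flagged for plagiarism, FAQ merging and grouping
print(use_cases_flagging(0.60))  # flagged by none of the use cases
```

A pair scoring 0.92 is left alone by strict data cleaning but merged by every looser use case, which is why the threshold should be chosen per task rather than once globally.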
Problem: Everything is marked as duplicate
Why: A threshold set too low flags even loosely related content as duplicate
Fix: Start with 0.90 and adjust down if needed
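One way to "start high and adjust down" is to score a small hand-labeled sample and count what each candidate threshold would flag. The sketch below assumes you already have similarity scores for labeled pairs; the `sweep` helper is hypothetical and the sample values are made up purely for illustration.

```python
def sweep(pairs, thresholds):
    """pairs: list of (similarity, is_true_duplicate). Returns one summary row per threshold."""
    rows = []
    for t in thresholds:
        flagged = [(score, dup) for score, dup in pairs if score >= t]
        true_pos = sum(1 for _, dup in flagged if dup)
        false_pos = len(flagged) - true_pos
        missed = sum(1 for score, dup in pairs if dup and score < t)
        rows.append({"threshold": t, "true_pos": true_pos,
                     "false_pos": false_pos, "missed": missed})
    return rows


# Made-up scores and labels, for illustration only.
sample = [(0.97, True), (0.91, True), (0.86, True),
          (0.83, False), (0.78, False), (0.72, False)]

for row in sweep(sample, [0.90, 0.85, 0.80, 0.75]):
    print(row)
```

On this toy sample, lowering the threshold from 0.90 to 0.85 recovers a missed duplicate without adding false positives, while going below 0.85 only adds false positives. Real data will rarely separate this cleanly.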
Problem: Paraphrased duplicates not detected
Why: Hash only catches exact matches
Fix: Always use vector similarity for semantic duplicates
Problem: Same content with different whitespace not detected
Why: Hash is sensitive to exact characters
Fix: Normalize content before hashing
normalized = ''.join(content.lower().split())
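If the content may also contain Unicode variants (full-width letters, ligatures, non-breaking spaces), the normalization can be extended. The following is a sketch, not the skill's own method: `normalized_hash` is a hypothetical helper, NFKC folding and `casefold()` are added assumptions, and SHA-256 is swapped in for MD5 (its 64-character hex digest still fits a `max_length=64` field).

```python
import hashlib
import unicodedata


def normalized_hash(content: str) -> str:
    """Hash content after Unicode, case and whitespace normalization."""
    # NFKC folds compatibility forms (full-width letters, ligatures) into plain ones.
    text = unicodedata.normalize("NFKC", content)
    # casefold() is a stronger lower() meant for caseless comparison.
    text = text.casefold()
    # Remove all whitespace, as the one-line normalization above does.
    text = "".join(text.split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Whitespace, case and full-width variants collapse to the same hash.
print(normalized_hash("Hello  World") == normalized_hash("hello\tworld\n"))
print(normalized_hash("ＡＢＣ") == normalized_hash("abc"))
```

Punctuation is deliberately left untouched here; whether "Hello, world" and "Hello world" should count as exact duplicates is a policy decision for your data.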
Problem: Different "original" detected depending on order
Why: First item becomes the reference
Fix: Process items by timestamp (oldest first), or define a clear policy for which copy counts as the original
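An oldest-first policy can be sketched by sorting the batch before deduplicating. The `created_at` field (an ISO 8601 string) and both helper names are assumptions for illustration; the item format described earlier only has `id`, `content`, and `source`. Exact matching on normalized text stands in for the full check.

```python
from datetime import datetime


def oldest_first(items):
    """Sort items so the earliest 'created_at' is processed first."""
    return sorted(items, key=lambda item: datetime.fromisoformat(item["created_at"]))


def keep_oldest(items):
    """Exact-match dedup in which the oldest copy becomes the reference."""
    seen = {}  # normalized content -> id of the oldest copy
    unique, duplicates = [], []
    for item in oldest_first(items):
        key = "".join(item["content"].lower().split())
        if key in seen:
            duplicates.append({"id": item["id"], "match_id": seen[key]})
        else:
            seen[key] = item["id"]
            unique.append(item)
    return unique, duplicates


items = [
    {"id": "b", "content": "Hello World", "created_at": "2024-03-01T10:00:00"},
    {"id": "a", "content": "hello   world", "created_at": "2024-01-15T09:30:00"},
]
unique, duplicates = keep_oldest(items)
print([item["id"] for item in unique], duplicates)
```

The same ordering applies to the vector-based detector: passing `oldest_first(items)` to `batch_dedup` makes the result independent of the order in which the batch happened to arrive.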
| Need | Upgrade To |
|---|---|
| Group similar content | clustering |
| Plagiarism with source matching | verticals/plagiarism.md |
| Large-scale batch processing | Add core:ray |
verticals/plagiarism.md, verticals/content-dedup.md, verticals/faq-merge.md