clustering
| name | clustering |
| description | Use when user needs to group similar items together. Triggers on: clustering, group similar, topic modeling, user segmentation, categorization, automatic classification, unsupervised grouping. |
Automatically group similar content into clusters using vector embeddings — discover hidden patterns and categories in your data.
Activate this skill when:
Do NOT activate when:
duplicate-detection, filtered-search, rec-system

"What do you want to achieve with clustering?"
A) Topic discovery (documents, articles)
B) User segmentation (behavioral data)
C) Anomaly detection
D) Content organization
Which describes your goal? (A/B/C/D)
"Do you know how many clusters you want?"
| If You Know | Algorithm | Configuration |
|---|---|---|
| Yes, exactly N | KMeans | n_clusters=N |
| Roughly N | KMeans + silhouette | Find best K around N |
| No idea | DBSCAN/HDBSCAN | Auto-discovers |
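
The table above can be read as a small decision function. The sketch below is illustrative only: the name `choose_algorithm`, the `knowledge` labels, and the search window of N ± 3 are assumptions made for this example, not part of the skill.

```python
from typing import Optional


def choose_algorithm(knowledge: str, n: Optional[int] = None) -> dict:
    """Map what the user knows about the cluster count to an algorithm.

    knowledge: "exact", "rough", or "unknown" (hypothetical labels).
    n: expected cluster count, required for "exact" and "rough".
    """
    if knowledge in ("exact", "rough") and (n is None or n < 2):
        raise ValueError("n must be an integer >= 2 when a cluster count is given")
    if knowledge == "exact":
        return {"algorithm": "KMeans", "params": {"n_clusters": n}}
    if knowledge == "rough":
        # Search a window around N (assumed +/- 3) and keep the K with
        # the best silhouette score.
        return {
            "algorithm": "KMeans + silhouette",
            "params": {"min_k": max(2, n - 3), "max_k": n + 3},
        }
    if knowledge == "unknown":
        # Density-based algorithms discover the number of clusters themselves.
        return {"algorithm": "DBSCAN/HDBSCAN", "params": {}}
    raise ValueError(f"unknown knowledge level: {knowledge!r}")
```

The `min_k` and `max_k` values returned for the "rough" case are meant to be passed on to a silhouette search over K.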
"Based on your requirements:
Proceed? (yes / adjust [what])"
Think of clustering as a librarian organizing books without labels:
┌─────────────────────────────────────────────────────────┐
│                   Clustering Pipeline                   │
│                                                         │
│  Unlabeled Documents                                    │
│  ┌─────┬─────┬─────┬─────┬─────┐                        │
│  │Doc1 │Doc2 │Doc3 │Doc4 │Doc5 │ ...                    │
│  └──┬──┴──┬──┴──┬──┴──┬──┴──┬──┘                        │
│     │     │     │     │     │                           │
│     ▼     ▼     ▼     ▼     ▼                           │
│  ┌─────────────────────────────┐                        │
│  │    Embedding Model (BGE)    │                        │
│  │        Text → Vector        │                        │
│  └──────────────┬──────────────┘                        │
│                 │                                       │
│                 ▼                                       │
│  [vec1] [vec2] [vec3] [vec4] [vec5] ...                 │
│                 │                                       │
│                 ▼                                       │
│  ┌─────────────────────────────┐                        │
│  │    Clustering Algorithm     │                        │
│  │      (KMeans / DBSCAN)      │                        │
│  └──────────────┬──────────────┘                        │
│                 │                                       │
│     ┌───────────┼───────────┐                           │
│     │           │           │                           │
│     ▼           ▼           ▼                           │
│  ┌─────┐     ┌─────┐     ┌─────┐                        │
│  │ C1  │     │ C2  │     │ C3  │   (Clusters)           │
│  │Tech │     │Sport│     │Food │   (Named by LLM)       │
│  └─────┘     └─────┘     └─────┘                        │
└─────────────────────────────────────────────────────────┘
| Algorithm | Pros | Cons | Best For |
|---|---|---|---|
| KMeans | Fast, predictable clusters | Must specify K | Known cluster count |
| DBSCAN | Auto-discovers K, finds outliers | Sensitive to eps | Unknown clusters |
| HDBSCAN | More robust than DBSCAN, handles varying densities | Slower | Unknown clusters with uneven density |
from pymilvus import MilvusClient, DataType
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans, DBSCAN
import numpy as np
class VectorClustering:
    def __init__(self, uri: str = "./milvus.db"):
        self.client = MilvusClient(uri=uri)
        self.model = SentenceTransformer('BAAI/bge-large-en-v1.5')
        self.collection_name = "clustering"
        self._init_collection()

    def _init_collection(self):
        if self.client.has_collection(self.collection_name):
            return
        schema = self.client.create_schema()
        schema.add_field("id", DataType.VARCHAR, is_primary=True, max_length=64)
        schema.add_field("content", DataType.VARCHAR, max_length=65535)
        schema.add_field("cluster_id", DataType.INT32)
        schema.add_field("embedding", DataType.FLOAT_VECTOR, dim=1024)
        index_params = self.client.prepare_index_params()
        index_params.add_index(field_name="embedding", index_type="AUTOINDEX", metric_type="COSINE")
        index_params.add_index(field_name="cluster_id", index_type="STL_SORT")
        self.client.create_collection(
            collection_name=self.collection_name,
            schema=schema,
            index_params=index_params
        )

    def add_data(self, items: list):
        """Add data (unclustered)
        items: [{"id": "...", "content": "..."}]
        """
        contents = [item["content"] for item in items]
        embeddings = self.model.encode(contents).tolist()
        data = [{"id": item["id"], "content": item["content"],
                 "cluster_id": -1, "embedding": emb}
                for item, emb in zip(items, embeddings)]
        self.client.insert(collection_name=self.collection_name, data=data)

    def cluster_kmeans(self, n_clusters: int = 10) -> dict:
        """KMeans clustering - use when you know the number of clusters"""
        all_data = self.client.query(
            collection_name=self.collection_name,
            filter="",
            output_fields=["id", "content", "embedding"],
            # Milvus caps offset + limit at 16384 per query; page through
            # larger collections (for example with query_iterator).
            limit=16384
        )
        if len(all_data) < n_clusters:
            raise ValueError(f"Data count {len(all_data)} < cluster count {n_clusters}")
        embeddings = np.array([item["embedding"] for item in all_data])
        kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
        labels = kmeans.fit_predict(embeddings)
        # Update cluster labels in Milvus. Upsert rewrites whole rows by default,
        # so every field is resent; rows go out in batches instead of one call each.
        rows = [{"id": item["id"], "content": item["content"],
                 "cluster_id": int(label), "embedding": emb}
                for item, emb, label in zip(all_data, embeddings.tolist(), labels)]
        for start in range(0, len(rows), 1000):
            self.client.upsert(
                collection_name=self.collection_name,
                data=rows[start:start + 1000]
            )
        # Organize results
        clusters = {}
        for item, label in zip(all_data, labels):
            label = int(label)
            if label not in clusters:
                clusters[label] = []
            clusters[label].append({"id": item["id"], "content": item["content"]})
        return {
            "n_clusters": n_clusters,
            "clusters": clusters,
            "cluster_sizes": {k: len(v) for k, v in clusters.items()}
        }

    def cluster_dbscan(self, eps: float = 0.3, min_samples: int = 5) -> dict:
        """DBSCAN clustering - use when cluster count is unknown"""
        all_data = self.client.query(
            collection_name=self.collection_name,
            filter="",
            output_fields=["id", "content", "embedding"],
            # Milvus caps offset + limit at 16384 per query; page through
            # larger collections (for example with query_iterator).
            limit=16384
        )
        embeddings = np.array([item["embedding"] for item in all_data])
        dbscan = DBSCAN(eps=eps, min_samples=min_samples, metric='cosine')
        labels = dbscan.fit_predict(embeddings)
        # Update labels. Upsert rewrites whole rows by default, so every field
        # is resent; rows go out in batches instead of one call each.
        rows = [{"id": item["id"], "content": item["content"],
                 "cluster_id": int(label), "embedding": emb}
                for item, emb, label in zip(all_data, embeddings.tolist(), labels)]
        for start in range(0, len(rows), 1000):
            self.client.upsert(
                collection_name=self.collection_name,
                data=rows[start:start + 1000]
            )
        # Organize results; DBSCAN marks outliers with label -1
        clusters = {}
        noise_count = 0
        for item, label in zip(all_data, labels):
            label = int(label)
            if label == -1:
                noise_count += 1
                continue
            if label not in clusters:
                clusters[label] = []
            clusters[label].append({"id": item["id"], "content": item["content"]})
        return {
            "n_clusters": len(clusters),
            "clusters": clusters,
            "cluster_sizes": {k: len(v) for k, v in clusters.items()},
            "noise_count": noise_count  # Outliers
        }

    def find_optimal_k(self, min_k: int = 2, max_k: int = 20) -> int:
        """Find optimal K using silhouette score"""
        from sklearn.metrics import silhouette_score
        all_data = self.client.query(
            collection_name=self.collection_name,
            filter="",
            output_fields=["embedding"],
            # Milvus caps offset + limit at 16384 per query
            limit=16384
        )
        embeddings = np.array([item["embedding"] for item in all_data])
        scores = []
        # silhouette_score needs 2 <= K <= n_samples - 1, hence the upper bound
        for k in range(min_k, min(max_k + 1, len(embeddings))):
            kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
            labels = kmeans.fit_predict(embeddings)
            score = silhouette_score(embeddings, labels)
            scores.append((k, score))
        if not scores:
            raise ValueError("Not enough data (or max_k < min_k) to evaluate any K")
        best_k = max(scores, key=lambda x: x[1])[0]
        return best_k

    def assign_cluster(self, content: str) -> dict:
        """Assign new content to existing cluster"""
        embedding = self.model.encode(content).tolist()
        results = self.client.search(
            collection_name=self.collection_name,
            data=[embedding],
            limit=5,
            output_fields=["cluster_id"]
        )
        # Vote for cluster
        cluster_votes = {}
        for hit in results[0]:
            cid = hit["entity"]["cluster_id"]
            if cid != -1:  # Ignore noise
                cluster_votes[cid] = cluster_votes.get(cid, 0) + 1
        if not cluster_votes:
            return {"cluster_id": -1, "confidence": 0}
        best_cluster = max(cluster_votes, key=cluster_votes.get)
        return {
            "cluster_id": best_cluster,
            "confidence": cluster_votes[best_cluster] / len(results[0])
        }

# Usage
clustering = VectorClustering()
clustering.add_data([
    {"id": "1", "content": "Python is great for data science"},
    {"id": "2", "content": "Machine learning needs lots of data"},
    {"id": "3", "content": "The weather is nice today"},
    {"id": "4", "content": "Deep learning revolutionized AI"},
    {"id": "5", "content": "Going hiking on weekends is relaxing"},
])
# Find optimal K
best_k = clustering.find_optimal_k(min_k=2, max_k=5)
print(f"Optimal K: {best_k}")
# Cluster
result = clustering.cluster_kmeans(n_clusters=best_k)
for cid, items in result['clusters'].items():
    print(f"\nCluster {cid} ({len(items)} items):")
    for item in items[:3]:
        print(f"  - {item['content'][:50]}...")
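
The voting step inside `assign_cluster` can be exercised without Milvus or an embedding model. The sketch below isolates it as a standalone function; the name `vote_cluster` and its `noise_label` parameter are hypothetical and exist only for this illustration.

```python
from collections import Counter


def vote_cluster(neighbor_cluster_ids, noise_label=-1):
    """Majority vote over the cluster IDs of the nearest neighbors.

    Noise neighbors are ignored when voting but still count toward the
    denominator, so confidence drops when many neighbors are outliers.
    """
    votes = Counter(cid for cid in neighbor_cluster_ids if cid != noise_label)
    if not votes:
        return {"cluster_id": noise_label, "confidence": 0.0}
    best, count = votes.most_common(1)[0]
    return {"cluster_id": best, "confidence": count / len(neighbor_cluster_ids)}


# Five neighbors: three in cluster 2, one in cluster 0, one noise point.
print(vote_cluster([2, 2, 0, -1, 2]))
```

With three of the five neighbors in cluster 2, the call returns cluster 2 with a confidence of 0.6.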
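# Silhouette analysis: score each candidate K and keep the best.
# Assumes `embeddings` is the (n_samples, dim) array of vectors to cluster,
# with more than 19 samples so that every K below is valid.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
k_values = range(2, 20)
scores = []
for k in k_values:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = kmeans.fit_predict(embeddings)
    scores.append(silhouette_score(embeddings, labels))
# Pick K with highest silhouette score
best_k = k_values[scores.index(max(scores))]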
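
To make the selection criterion concrete, here is a dependency-free sketch of the mean silhouette coefficient. It uses Euclidean distance and a hypothetical helper name, `silhouette`; it is meant to show what the score measures, not to replace `sklearn.metrics.silhouette_score`.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance (pure Python)."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least 2 clusters")

    total = 0.0
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # singleton clusters contribute a score of 0
        # a: mean distance to the other members of the same cluster
        # (the sum includes dist(p, p) == 0, hence the len(own) - 1)
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette(points, [0, 0, 1, 1])  # groups match the two tight pairs
bad = silhouette(points, [0, 1, 0, 1])   # groups cut across the pairs
print(good > bad)
```

Scores near 1 mean points sit much closer to their own cluster than to the next one; negative scores mean many points would fit better elsewhere.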
| eps Value | Effect |
|---|---|
| Too small | Many tiny clusters; most points become noise |
| Too large | Everything in one cluster |
| Just right | Meaningful groups + outliers |
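
The three rows above can be reproduced on a toy dataset. The following is a minimal DBSCAN over 1-D values, written only to show the effect of `eps`; the function name `dbscan_1d` and the sample values are assumptions for this sketch, and real workloads should use `sklearn.cluster.DBSCAN`.

```python
def dbscan_1d(values, eps, min_samples):
    """Minimal DBSCAN over 1-D values; returns one label per value (-1 = noise)."""
    n = len(values)
    labels = [None] * n

    def region(i):
        # Indices within eps of values[i], including i itself
        return [j for j in range(n) if abs(values[i] - values[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = region(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point joins as border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            neighbors = region(j)
            if len(neighbors) >= min_samples:
                seeds.extend(neighbors)  # j is a core point: keep expanding
    return labels


values = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 20.0]
for eps in (0.01, 0.15, 100.0):
    print(eps, dbscan_1d(values, eps, min_samples=2))
```

With `min_samples=2`, the smallest `eps` labels every point as noise, the middle value yields two clusters plus one outlier, and the largest merges everything into a single cluster.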
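# Start with distance analysis (assumes `embeddings` is the array of vectors)
import numpy as np
from sklearn.neighbors import NearestNeighbors
# Use the same metric as DBSCAN; n_neighbors should roughly match min_samples
neighbors = NearestNeighbors(n_neighbors=5, metric="cosine")
neighbors.fit(embeddings)
# kneighbors() with no argument excludes each point from its own neighbor list
distances, _ = neighbors.kneighbors()
k_distances = np.sort(distances[:, -1])
# Plot k_distances and look for the "elbow" to set eps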
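
If plotting is not convenient, the elbow can be approximated numerically. The sketch below is a dependency-free illustration with hypothetical helper names (`cosine_distance`, `k_distances`, `suggest_eps`). The "largest jump" rule is a crude heuristic assumed for this example, so treat its output as a starting point for manual tuning.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def k_distances(vectors, k):
    """Sorted distance from each vector to its k-th nearest neighbor (self excluded)."""
    result = []
    for i, u in enumerate(vectors):
        dists = sorted(cosine_distance(u, v) for j, v in enumerate(vectors) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


def suggest_eps(sorted_kdist):
    """Crude elbow: the value just before the largest jump (needs >= 2 values)."""
    gaps = [b - a for a, b in zip(sorted_kdist, sorted_kdist[1:])]
    return sorted_kdist[gaps.index(max(gaps))]


# Two tight groups of directions plus one vector pointing between them.
vectors = [(1, 0), (1, 0.01), (1, 0.02), (0, 1), (0.01, 1), (0.02, 1), (1, 1)]
kd = k_distances(vectors, k=2)
eps = suggest_eps(kd)
print(eps, kd[-1])
```

Here the six grouped vectors have very small k-distances and the lone vector has a much larger one, so the suggested `eps` sits just below the jump and leaves that vector as an outlier.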
Problem: Clusters don't make sense
Why: Arbitrary K choice
Fix: Use silhouette score or domain knowledge
Problem: Small eps change dramatically changes results
Why: Density-based algorithm, data-dependent
Fix: Try HDBSCAN (more robust) or normalize embeddings
Problem: Forcing outliers into clusters degrades quality
Why: Not all data belongs to a cluster
Fix: Use DBSCAN to identify noise (label=-1)
Problem: Cluster IDs meaningless to users
Fix: Use LLM to name clusters based on samples
def name_cluster(samples, llm):
    """Ask an LLM to name a cluster; `llm` is any client exposing generate(prompt)."""
    prompt = f"Name this group based on samples: {samples}"
    return llm.generate(prompt)
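
Which samples go into the prompt matters: items near the cluster centre tend to be more typical than arbitrary ones. The sketch below picks the items closest to the centroid and builds the prompt text. The helper names and the item layout (`content` plus `embedding`) are assumptions for illustration, and no LLM is called.

```python
import math


def representative_samples(items, n=3):
    """Return the contents of the n items nearest to the cluster centroid.

    items: [{"content": str, "embedding": [float, ...]}, ...] (assumed layout)
    """
    dim = len(items[0]["embedding"])
    centroid = [sum(it["embedding"][d] for it in items) / len(items) for d in range(dim)]
    ranked = sorted(items, key=lambda it: math.dist(it["embedding"], centroid))
    return [it["content"] for it in ranked[:n]]


def build_naming_prompt(samples):
    """Format the chosen samples into a naming prompt for any LLM client."""
    bullet_list = "\n".join(f"- {s}" for s in samples)
    return f"Name this group in a few words based on these samples:\n{bullet_list}"


cluster_items = [
    {"content": "a", "embedding": [0.0, 0.0]},
    {"content": "b", "embedding": [2.0, 0.0]},
    {"content": "c", "embedding": [1.0, 0.0]},
    {"content": "d", "embedding": [10.0, 0.0]},
]
picked = representative_samples(cluster_items, n=2)
print(build_naming_prompt(picked))
```

The centroid of the four example embeddings is at 3.25 on the first axis, so "b" and "c" are selected and the distant item "d" is left out of the prompt.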
| Need | Upgrade To |
|---|---|
| Find duplicates | duplicate-detection |
| Hierarchical clusters | Use HDBSCAN |
| Real-time clustering | Add incremental clustering |
| Large scale | Add core:ray for distributed |
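
For the "Real-time clustering" row, one possible reading of incremental clustering is an online nearest-centroid scheme: each new vector joins the closest cluster if it is near enough, otherwise it starts a new one. The class name `IncrementalClusterer` and its `threshold` parameter are hypothetical, and Euclidean distance is assumed for simplicity.

```python
import math


class IncrementalClusterer:
    """Assign each new vector to the nearest centroid, or open a new cluster."""

    def __init__(self, threshold: float):
        self.threshold = threshold  # max Euclidean distance to join a cluster
        self.centroids = []         # one centroid per cluster
        self.counts = []            # number of vectors absorbed per cluster

    def add(self, vector):
        """Return the cluster index assigned to `vector`."""
        if self.centroids:
            dists = [math.dist(vector, c) for c in self.centroids]
            best = min(range(len(dists)), key=dists.__getitem__)
            if dists[best] <= self.threshold:
                # Running mean update: c_new = c_old + (x - c_old) / n
                self.counts[best] += 1
                n = self.counts[best]
                self.centroids[best] = [
                    c + (x - c) / n for c, x in zip(self.centroids[best], vector)
                ]
                return best
        # No cluster is close enough: start a new one at this vector.
        self.centroids.append(list(vector))
        self.counts.append(1)
        return len(self.centroids) - 1


clusterer = IncrementalClusterer(threshold=1.0)
stream = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.0, 10.5)]
assigned = [clusterer.add(v) for v in stream]
print(assigned, clusterer.centroids)
```

The result depends on arrival order and on the threshold, so a periodic full re-clustering with KMeans or DBSCAN is still a reasonable safeguard.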
verticals/topic.md
verticals/user-segmentation.md
verticals/anomaly.md