Name: Clustering
Author: zilliztech

Description: Use when user needs to group similar items together. Triggers on: clustering, group similar, topic modeling, user segmentation, categorization, automatic classification, unsupervised grouping.


updated: February 4, 2026 at 02:42


related-skills.json (same repository)

duplicate-detection.md

from "zilliztech/milvus-marketplace"

Use when user needs to find duplicate or similar content. Triggers on: duplicate, deduplication, plagiarism detection, similar content, near-duplicate, similarity detection, content dedup, find copies.

2026-02-04

chat-memory.md

from "zilliztech/milvus-marketplace"

Use when user needs long-term memory for chatbots. Triggers on: chat memory, conversation history, long-term memory, chatbot memory, memory retrieval, persistent memory, remember conversations.

2026-02-04

image-search.md

from "zilliztech/milvus-marketplace"

Use when user wants to build image search or similar image finding. Triggers on: image search, similar image, visual search, image retrieval, CLIP, reverse image search, image matching, find similar photos.

2026-02-04

multimodal-rag.md

from "zilliztech/milvus-marketplace"

Use when user needs RAG on documents with images and text. Triggers on: multimodal RAG, image-text mixed, document with images, PDF with charts, visual RAG, visual Q&A, documents with figures.

2026-02-04

text-to-image-search.md

from "zilliztech/milvus-marketplace"

Use when user needs to search images using natural language descriptions. Triggers on: text to image, describe and find, natural language image search, image caption search, find image by description, describe to find.

2026-02-04

video-search.md

from "zilliztech/milvus-marketplace"

Use when user needs to search video content by text or image. Triggers on: video search, video retrieval, video clips, meeting recordings, tutorial videos, surveillance playback, find moment in video.

2026-02-04


name	clustering
description	Use when user needs to group similar items together. Triggers on: clustering, group similar, topic modeling, user segmentation, categorization, automatic classification, unsupervised grouping.

Clustering

Automatically group similar content into clusters using vector embeddings — discover hidden patterns and categories in your data.

When to Activate

Activate this skill when:

User wants to group similar items automatically
User mentions "clustering", "segmentation", "topic modeling"
User needs to discover categories in unlabeled data
User wants to organize content without predefined labels

Do NOT activate when:

User needs to find duplicates → use duplicate-detection
User has predefined categories → use filtered-search
User needs recommendations → use rec-system

Interactive Flow

Step 1: Understand Clustering Goal

"What do you want to achieve with clustering?"

A) Topic discovery (documents, articles)

Find themes in text corpus
Group by subject matter

B) User segmentation (behavioral data)

Group users by behavior
Marketing personas

C) Anomaly detection

Find outliers
Fraud detection

D) Content organization

Auto-categorization
Product grouping

Which describes your goal? (A/B/C/D)

Step 2: Determine Number of Clusters

"Do you know how many clusters you want?"

If You Know	Algorithm	Configuration
Yes, exactly N	KMeans	`n_clusters=N`
Roughly N	KMeans + silhouette	Find best K around N
No idea	DBSCAN/HDBSCAN	Auto-discovers
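
The decision table above can be sketched as a small helper. This is only an illustration: the function name `choose_algorithm`, the `exact` flag, and the size of the search window around N are assumptions, not part of the skill itself.

```python
from typing import Optional


def choose_algorithm(known_k: Optional[int], exact: bool = True) -> dict:
    """Map the user's answer about cluster count to a clustering setup.

    known_k: the cluster count the user has in mind, or None for "no idea".
    exact:   True for "exactly N", False for "roughly N".
    """
    if known_k is None:
        # No idea: a density-based algorithm discovers the count itself.
        return {"algorithm": "DBSCAN/HDBSCAN", "params": {}}
    if known_k < 2:
        raise ValueError("known_k must be at least 2")
    if exact:
        return {"algorithm": "KMeans", "params": {"n_clusters": known_k}}
    # Roughly N: search a window around N with the silhouette score.
    # The window size (half of N) is an arbitrary choice for this sketch.
    window = max(1, known_k // 2)
    return {
        "algorithm": "KMeans + silhouette",
        "params": {"min_k": max(2, known_k - window), "max_k": known_k + window},
    }


print(choose_algorithm(5))
print(choose_algorithm(8, exact=False))
print(choose_algorithm(None))
```

The `min_k` and `max_k` values returned for the "roughly N" case are meant to be passed to a silhouette search such as `find_optimal_k` in the Implementation section.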

Step 3: Confirm Configuration

"Based on your requirements:

Algorithm: KMeans (you specified 5 clusters)
Embedding: BGE-large
Metric: COSINE similarity

Proceed? (yes / adjust [what])"
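
One way to produce the confirmation message and read the reply is sketched below. The `ClusteringConfig` dataclass and `parse_reply` function are hypothetical names introduced for illustration; the defaults mirror the example configuration above.

```python
from dataclasses import dataclass


@dataclass
class ClusteringConfig:
    algorithm: str = "KMeans"
    n_clusters: int = 5
    embedding: str = "BGE-large"
    metric: str = "COSINE"

    def summary(self) -> str:
        """Render the confirmation prompt shown to the user."""
        return (
            "Based on your requirements:\n"
            f"Algorithm: {self.algorithm} (you specified {self.n_clusters} clusters)\n"
            f"Embedding: {self.embedding}\n"
            f"Metric: {self.metric} similarity\n"
            "Proceed? (yes / adjust [what])"
        )


def parse_reply(reply: str):
    """Return ("yes", None) or ("adjust", <what>) for a confirmation reply."""
    text = reply.strip()
    if text.lower() in {"yes", "y"}:
        return ("yes", None)
    if text.lower().startswith("adjust"):
        # Whatever follows "adjust" names the setting to change, if given.
        return ("adjust", text[len("adjust"):].strip() or None)
    raise ValueError(f"Unrecognized reply: {reply!r}")


config = ClusteringConfig()
print(config.summary())
print(parse_reply("adjust metric"))
```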

Core Concepts

Mental Model: Sorting a Library

Think of clustering as a librarian organizing books without labels:

Look at each book's content
Group similar topics together
Name each section after grouping

┌─────────────────────────────────────────────────────────┐
│                    Clustering Pipeline                   │
│                                                          │
│  Unlabeled Documents                                     │
│  ┌─────┬─────┬─────┬─────┬─────┐                       │
│  │Doc1 │Doc2 │Doc3 │Doc4 │Doc5 │ ...                   │
│  └──┬──┴──┬──┴──┬──┴──┬──┴──┬──┘                       │
│     │     │     │     │     │                           │
│     ▼     ▼     ▼     ▼     ▼                           │
│  ┌─────────────────────────────┐                        │
│  │    Embedding Model (BGE)     │                        │
│  │    Text → Vector             │                        │
│  └──────────────┬──────────────┘                        │
│                 │                                        │
│                 ▼                                        │
│  [vec1] [vec2] [vec3] [vec4] [vec5] ...                 │
│                 │                                        │
│                 ▼                                        │
│  ┌─────────────────────────────┐                        │
│  │   Clustering Algorithm       │                        │
│  │   (KMeans / DBSCAN)         │                        │
│  └──────────────┬──────────────┘                        │
│                 │                                        │
│     ┌───────────┼───────────┐                           │
│     │           │           │                           │
│     ▼           ▼           ▼                           │
│  ┌─────┐    ┌─────┐    ┌─────┐                         │
│  │ C1  │    │ C2  │    │ C3  │  (Clusters)             │
│  │Tech │    │Sport│    │Food │  (Named by LLM)         │
│  └─────┘    └─────┘    └─────┘                         │
└─────────────────────────────────────────────────────────┘

KMeans vs DBSCAN

Algorithm	Pros	Cons	Best For
KMeans	Fast, predictable clusters	Must specify K	Known cluster count
DBSCAN	Auto-discovers K, finds outliers	Sensitive to eps	Unknown clusters
HDBSCAN	More robust than DBSCAN	Slower	Clusters of varying density

Implementation

from pymilvus import MilvusClient, DataType
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans, DBSCAN
import numpy as np

class VectorClustering:
    def __init__(self, uri: str = "./milvus.db"):
        self.client = MilvusClient(uri=uri)
        self.model = SentenceTransformer('BAAI/bge-large-en-v1.5')
        self.collection_name = "clustering"
        self._init_collection()

    def _init_collection(self):
        if self.client.has_collection(self.collection_name):
            return

        schema = self.client.create_schema()
        schema.add_field("id", DataType.VARCHAR, is_primary=True, max_length=64)
        schema.add_field("content", DataType.VARCHAR, max_length=65535)
        schema.add_field("cluster_id", DataType.INT32)
        schema.add_field("embedding", DataType.FLOAT_VECTOR, dim=1024)

        index_params = self.client.prepare_index_params()
        index_params.add_index(field_name="embedding", index_type="AUTOINDEX", metric_type="COSINE")
        index_params.add_index(field_name="cluster_id", index_type="STL_SORT")

        self.client.create_collection(
            collection_name=self.collection_name,
            schema=schema,
            index_params=index_params
        )

    def add_data(self, items: list):
        """Add data (unclustered)
        items: [{"id": "...", "content": "..."}]
        """
        contents = [item["content"] for item in items]
        embeddings = self.model.encode(contents).tolist()

        data = [{"id": item["id"], "content": item["content"],
                 "cluster_id": -1, "embedding": emb}
                for item, emb in zip(items, embeddings)]

        self.client.insert(collection_name=self.collection_name, data=data)

    def cluster_kmeans(self, n_clusters: int = 10) -> dict:
        """KMeans clustering - use when you know the number of clusters"""
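        # Milvus caps offset + limit at 16384 per query by default; for larger
        # collections, page through the data (e.g. with query_iterator).
        all_data = self.client.query(
            collection_name=self.collection_name,
            filter="",
            output_fields=["id", "content", "embedding"],
            limit=16384
        )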

        if len(all_data) < n_clusters:
            raise ValueError(f"Data count {len(all_data)} < cluster count {n_clusters}")

        ids = [item["id"] for item in all_data]
        embeddings = np.array([item["embedding"] for item in all_data])

        kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
        labels = kmeans.fit_predict(embeddings)

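        # Update cluster labels in Milvus. Upsert replaces whole rows, so send
        # every field; write in batches rather than one call per row.
        rows = [{"id": item["id"], "content": item["content"],
                 "cluster_id": int(label), "embedding": item["embedding"]}
                for item, label in zip(all_data, labels)]
        for start in range(0, len(rows), 1000):
            self.client.upsert(
                collection_name=self.collection_name,
                data=rows[start:start + 1000]
            )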

        # Organize results
        clusters = {}
        for item, label in zip(all_data, labels):
            label = int(label)
            if label not in clusters:
                clusters[label] = []
            clusters[label].append({"id": item["id"], "content": item["content"]})

        return {
            "n_clusters": n_clusters,
            "clusters": clusters,
            "cluster_sizes": {k: len(v) for k, v in clusters.items()}
        }

    def cluster_dbscan(self, eps: float = 0.3, min_samples: int = 5) -> dict:
        """DBSCAN clustering - use when cluster count is unknown"""
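        # Milvus caps offset + limit at 16384 per query by default; for larger
        # collections, page through the data (e.g. with query_iterator).
        all_data = self.client.query(
            collection_name=self.collection_name,
            filter="",
            output_fields=["id", "content", "embedding"],
            limit=16384
        )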

        embeddings = np.array([item["embedding"] for item in all_data])

        dbscan = DBSCAN(eps=eps, min_samples=min_samples, metric='cosine')
        labels = dbscan.fit_predict(embeddings)

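        # Update labels. Upsert replaces whole rows, so send every field;
        # write in batches rather than one call per row.
        rows = [{"id": item["id"], "content": item["content"],
                 "cluster_id": int(label), "embedding": item["embedding"]}
                for item, label in zip(all_data, labels)]
        for start in range(0, len(rows), 1000):
            self.client.upsert(
                collection_name=self.collection_name,
                data=rows[start:start + 1000]
            )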

        # Organize results
        clusters = {}
        noise_count = 0
        for item, label in zip(all_data, labels):
            label = int(label)
            if label == -1:
                noise_count += 1
                continue
            if label not in clusters:
                clusters[label] = []
            clusters[label].append({"id": item["id"], "content": item["content"]})

        return {
            "n_clusters": len(clusters),
            "clusters": clusters,
            "cluster_sizes": {k: len(v) for k, v in clusters.items()},
            "noise_count": noise_count  # Outliers
        }

    def find_optimal_k(self, min_k: int = 2, max_k: int = 20) -> int:
        """Find optimal K using silhouette score"""
        from sklearn.metrics import silhouette_score

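        # Milvus caps offset + limit at 16384 per query by default; for larger
        # collections, page through the data (e.g. with query_iterator).
        all_data = self.client.query(
            collection_name=self.collection_name,
            filter="",
            output_fields=["embedding"],
            limit=16384
        )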
        embeddings = np.array([item["embedding"] for item in all_data])

        scores = []
        for k in range(min_k, min(max_k + 1, len(embeddings))):
            kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
            labels = kmeans.fit_predict(embeddings)
            score = silhouette_score(embeddings, labels)
            scores.append((k, score))

        best_k = max(scores, key=lambda x: x[1])[0]
        return best_k

    def assign_cluster(self, content: str) -> dict:
        """Assign new content to existing cluster"""
        embedding = self.model.encode(content).tolist()

        results = self.client.search(
            collection_name=self.collection_name,
            data=[embedding],
            limit=5,
            output_fields=["cluster_id"]
        )

        # Vote for cluster
        cluster_votes = {}
        for hit in results[0]:
            cid = hit["entity"]["cluster_id"]
            if cid != -1:  # Ignore noise
                cluster_votes[cid] = cluster_votes.get(cid, 0) + 1

        if not cluster_votes:
            return {"cluster_id": -1, "confidence": 0}

        best_cluster = max(cluster_votes, key=cluster_votes.get)
        return {
            "cluster_id": best_cluster,
            "confidence": cluster_votes[best_cluster] / len(results[0])
        }

# Usage
clustering = VectorClustering()

clustering.add_data([
    {"id": "1", "content": "Python is great for data science"},
    {"id": "2", "content": "Machine learning needs lots of data"},
    {"id": "3", "content": "The weather is nice today"},
    {"id": "4", "content": "Deep learning revolutionized AI"},
    {"id": "5", "content": "Going hiking on weekends is relaxing"},
])

# Find optimal K
best_k = clustering.find_optimal_k(min_k=2, max_k=5)
print(f"Optimal K: {best_k}")

# Cluster
result = clustering.cluster_kmeans(n_clusters=best_k)
for cid, items in result['clusters'].items():
    print(f"\nCluster {cid} ({len(items)} items):")
    for item in items[:3]:
        print(f"  - {item['content'][:50]}...")

Parameter Tuning

KMeans: Choosing K

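# Silhouette-based selection of K
# (assumes `embeddings` is an (n_samples, dim) array with more than 20 rows)
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

scores = []
for k in range(2, 20):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(embeddings)
    scores.append(silhouette_score(embeddings, labels))

# Pick K with highest silhouette score; index 0 corresponds to K=2
best_k = scores.index(max(scores)) + 2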

DBSCAN: Tuning eps

eps Value	Effect
Too small	Too many tiny clusters
Too large	Everything in one cluster
Just right	Meaningful groups + outliers

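# Start with distance analysis (assumes `embeddings` is an (n_samples, dim) array)
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Match DBSCAN's metric and min_samples. Each point's nearest neighbor in the
# fitted data is itself, so the last column is the distance to the 5th point
# counting the point itself.
neighbors = NearestNeighbors(n_neighbors=5, metric="cosine")
neighbors.fit(embeddings)
distances, _ = neighbors.kneighbors(embeddings)
k_distances = np.sort(distances[:, -1])
# Plot k_distances and look for the "elbow" to set eps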
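
If plotting is not convenient, the elbow can also be estimated numerically. The sketch below is one possible heuristic, not the skill's own method: `suggest_eps` is a hypothetical helper that sorts the k-th nearest-neighbor cosine distances and picks the point farthest from the straight line joining the two ends of the curve. It builds a dense pairwise distance matrix, so it is only suitable for a sample of a few thousand vectors.

```python
import numpy as np


def suggest_eps(embeddings: np.ndarray, k: int = 5) -> float:
    """Suggest a DBSCAN eps from sorted k-th nearest-neighbor cosine distances."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dist = 1.0 - unit @ unit.T          # pairwise cosine distance
    np.fill_diagonal(dist, np.inf)      # exclude each point itself
    # Distance from every point to its k-th nearest neighbor, ascending.
    k_dist = np.sort(np.sort(dist, axis=1)[:, k - 1])

    # Knee heuristic: rescale the curve to the unit square and take the point
    # that lies farthest below the diagonal joining its first and last points.
    n = len(k_dist)
    x = np.linspace(0.0, 1.0, n)
    span = k_dist[-1] - k_dist[0]
    y = (k_dist - k_dist[0]) / span if span > 0 else np.zeros(n)
    knee = int(np.argmax(x - y))
    return float(k_dist[knee])


# Toy data: two tight groups of directions plus a few random outliers.
rng = np.random.default_rng(0)
dim = 16
cluster_a = np.eye(dim)[0] + 0.01 * rng.normal(size=(30, dim))
cluster_b = np.eye(dim)[1] + 0.01 * rng.normal(size=(30, dim))
outliers = rng.normal(size=(5, dim))
eps = suggest_eps(np.vstack([cluster_a, cluster_b, outliers]), k=5)
print(f"suggested eps: {eps:.4f}")
```

Treat the returned value as a starting point and check the resulting cluster sizes and noise count before relying on it.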

Common Pitfalls

❌ Pitfall 1: Wrong K

Problem: Clusters don't make sense

Why: Arbitrary K choice

Fix: Use silhouette score or domain knowledge

❌ Pitfall 2: DBSCAN eps Too Sensitive

Problem: A small change in eps dramatically changes the results

Why: Density-based algorithm, data-dependent

Fix: Try HDBSCAN (more robust) or normalize embeddings
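
Normalizing helps because, for unit-length vectors, squared Euclidean distance and cosine distance differ only by a constant factor, so an eps chosen on one scale maps predictably to the other. A minimal sketch with random vectors standing in for real embeddings (`l2_normalize` is a helper name chosen for this example):

```python
import numpy as np


def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length; all-zero rows are left unchanged."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.where(norms == 0, 1.0, norms)


rng = np.random.default_rng(0)
# Random rows with very different lengths, standing in for raw embeddings.
raw = rng.normal(size=(4, 8)) * rng.uniform(0.5, 5.0, size=(4, 1))
unit = l2_normalize(raw)

a, b = unit[0], unit[1]
cosine_distance = 1.0 - float(a @ b)
squared_euclidean = float(np.sum((a - b) ** 2))

# For unit vectors: ||a - b||^2 = 2 * (1 - cos(a, b))
print(np.isclose(squared_euclidean, 2.0 * cosine_distance))
```

The same identity is why KMeans, which minimizes Euclidean distance, behaves consistently with a COSINE index once the embeddings are normalized.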

❌ Pitfall 3: Ignoring Outliers

Problem: Forcing outliers into clusters degrades quality

Why: Not all data belongs to a cluster

Fix: Use DBSCAN to identify noise (label=-1)

❌ Pitfall 4: Clusters Without Names

Problem: Cluster IDs are meaningless to users

Fix: Use LLM to name clusters based on samples

def name_cluster(samples, llm):
    # `llm` is any client object exposing generate(prompt) -> str
    prompt = f"Name this group based on samples: {samples}"
    return llm.generate(prompt)
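
Which samples go into the prompt affects the quality of the name. One option, sketched here as an assumption and not as the skill's prescribed approach, is to pick the items whose embeddings lie closest to the cluster centroid. `representative_samples` is a hypothetical helper.

```python
import numpy as np


def representative_samples(embeddings, contents, labels, cluster_id, top_n=5):
    """Return up to top_n contents whose embeddings are closest to the centroid."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    member_idx = np.flatnonzero(labels == cluster_id)
    if member_idx.size == 0:
        return []
    members = embeddings[member_idx]
    centroid = members.mean(axis=0)
    # Cosine similarity of each member to the centroid (epsilon avoids 0/0).
    sims = (members @ centroid) / (
        np.linalg.norm(members, axis=1) * np.linalg.norm(centroid) + 1e-12
    )
    order = np.argsort(-sims)[:top_n]
    return [contents[member_idx[i]] for i in order]


# Toy 2-D vectors standing in for real embeddings.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [0.0, 1.0]]
toy_contents = ["a", "b", "c", "d"]
toy_labels = [0, 0, 0, 1]
picked = representative_samples(toy_embeddings, toy_contents, toy_labels, 0, top_n=2)
print(picked)
```

The returned list can be passed as the `samples` argument of `name_cluster`.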

When to Level Up

Need	Upgrade To
Find duplicates	`duplicate-detection`
Hierarchical clusters	Use HDBSCAN
Real-time clustering	Add incremental clustering
Large scale	Add `core:ray` for distributed
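
As a rough idea of what incremental clustering could look like, the sketch below applies a sequential k-means update: each new vector is assigned to its nearest centroid, and that centroid is moved by a running mean. The class name `OnlineCentroids` is hypothetical. It assumes the initial centroids and counts come from an earlier batch run, for example `kmeans.cluster_centers_` and the cluster sizes.

```python
import numpy as np


class OnlineCentroids:
    """Sequential k-means style update over a fixed set of centroids."""

    def __init__(self, centroids, counts):
        self.centroids = np.asarray(centroids, dtype=float).copy()
        self.counts = np.asarray(counts, dtype=int).copy()

    def add(self, vector) -> int:
        """Assign `vector` to its nearest centroid and update that centroid."""
        vector = np.asarray(vector, dtype=float)
        distances = np.linalg.norm(self.centroids - vector, axis=1)
        cid = int(np.argmin(distances))
        self.counts[cid] += 1
        # Running mean: new_mean = old_mean + (x - old_mean) / n
        self.centroids[cid] += (vector - self.centroids[cid]) / self.counts[cid]
        return cid


tracker = OnlineCentroids(centroids=[[0.0, 0.0], [10.0, 10.0]], counts=[1, 1])
assigned = tracker.add([2.0, 0.0])
print(assigned, tracker.centroids[assigned])
```

This variant never creates or merges clusters, so a periodic full re-clustering is still needed when the data distribution drifts.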

References

Topic modeling: verticals/topic.md
User segmentation: verticals/user-segmentation.md
Anomaly detection: verticals/anomaly.md
