con un clic
troubleshooting
// Diagnose and resolve ClickHouse issues: OOM, slow merges, stuck mutations, query failures with error codes, and systematic error clustering.
// Diagnose and resolve ClickHouse issues: OOM, slow merges, stuck mutations, query failures with error codes, and systematic error clustering.
Exact column names for the system tables the agent queries most (processes, query_log, parts, merges, mutations, replicas, replication_queue, disks, settings, zookeeper, users/grants, metrics) plus rules for choosing dedicated tools over raw SQL. Load before hand-writing SQL against system tables.
Cluster management: distributed tables, ON CLUSTER DDL, node lifecycle, resharding, load balancing, and Keeper migration.
Schema migrations: ALTER patterns, engine changes, zero-downtime swaps, clickhouse-local offline migrations, and lightweight UPDATE/DELETE strategies.
ReplicatedMergeTree operations, failover procedures, lag diagnosis, quorum writes, and Keeper management.
RBAC configuration, row policies, quotas, network security, audit logging, and access control best practices.
Compression codecs, TTL policies, tiered storage, part management, and disk space optimization.
| name | troubleshooting |
| description | Diagnose and resolve ClickHouse issues: OOM, slow merges, stuck mutations, query failures with error codes, and systematic error clustering. |
Diagnosis: system.query_log WHERE memory_usage is high. system.metrics WHERE metric = 'MemoryTracking'.
Solutions:
max_memory_usage per query (e.g., 10GB), max_memory_usage_for_user per usermax_bytes_before_external_group_by / max_bytes_before_external_sort for spill-to-diskDiagnosis: system.merges — check elapsed, progress, total_size_bytes_compressed. system.part_log WHERE event_type = 'MergeParts' for throughput. Too many parts: max_parts_count_for_partition in system.asynchronous_metrics.
Solutions:
background_pool_size (default: 16)system.asynchronous_metrics (ReadBufferFromFileDescriptorReadBytes)OPTIMIZE TABLE ... FINAL to force merge (expensive, off-peak only)Diagnosis: system.mutations WHERE is_done = 0 — check parts_to_do, latest_fail_reason. Mutations block new merges on affected parts.
Solutions:
KILL MUTATION WHERE mutation_id = '...' to cancelDiagnosis: system.query_log WHERE type = 'ExceptionWhileProcessing'. Use error clustering to find patterns:
SELECT exception_code, count(), topK(10)(exception)
FROM system.query_log
WHERE type = 'ExceptionWhileProcessing'
GROUP BY exception_code ORDER BY count() DESC
For persistent errors, check system.error_log (requires error logging enabled).
Common error codes:
get_table_schema to check column namesDiagnosis: Check inter-server connectivity for replication or distributed queries. Verify interserver_http_port is reachable between nodes. Test with curl http://<peer>:9009. Check firewall rules and DNS resolution.
replication-guide skill for replication lag diagnosis and recovery.storage-optimization skill for disk recovery and tiered storage management.