Hot Partitions

Within a single topic, partitions should carry roughly even load. When one partition runs much hotter than its peers, usually from a skewed partition key, it becomes a bottleneck. Klag detects these statistical outliers.

The following metrics are reported only when an outlier exists, so they stay quiet on healthy topics:

| Metric | Description |
| --- | --- |
| klag.hot_partition | Partition throughput × 100 when statistically high. |
| klag.hot_partition.lag | Partition lag on a hot partition specifically. |

klag.hot_partition has only topic and partition tags, because throughput is partition-level and independent of any consumer.

For each topic, Klag computes per-partition throughput over a rolling sample buffer and flags partitions whose throughput exceeds the mean by more than HOT_PARTITION_SIGMA_MULTIPLIER standard deviations.
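
The flagging rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not Klag's implementation: the function name `find_hot_partitions`, the use of the population standard deviation, and the sample rates are all assumptions made for the example.

```python
from statistics import mean, pstdev


def find_hot_partitions(throughput, sigma_multiplier=2.0):
    """Return the partitions whose throughput exceeds mean + k * stddev.

    `throughput` maps partition id -> estimated throughput for one topic.
    Hypothetical helper: the population standard deviation is an assumption,
    the source does not say which estimator Klag uses.
    """
    values = list(throughput.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Perfectly even load: nothing can be an outlier.
        return []
    threshold = mu + sigma_multiplier * sigma
    return sorted(p for p, rate in throughput.items() if rate > threshold)


# Eight partitions, one of which receives far more traffic than its peers.
rates = {0: 100.0, 1: 105.0, 2: 98.0, 3: 102.0,
         4: 99.0, 5: 101.0, 6: 100.0, 7: 900.0}
hot = find_hot_partitions(rates)
print(hot)
```

Raising the multiplier makes the check stricter; lowering it flags milder skew at the cost of more noise.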

Detection runs only when there is enough data to be meaningful. The following variables control it:

| Variable | Default | Role |
| --- | --- | --- |
| HOT_PARTITION_ENABLED | true | Master switch. |
| HOT_PARTITION_SIGMA_MULTIPLIER | 2.0 | Std-devs above mean to flag an outlier. |
| HOT_PARTITION_MIN_PARTITIONS | 3 | Min partitions per topic before detection runs. |
| HOT_PARTITION_MIN_SAMPLES | 3 | Min samples needed for a throughput estimate. |
| HOT_PARTITION_BUFFER_SIZE | 20 | Samples retained per partition. |
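
To make the roles of these variables concrete, here is a minimal sketch of how they might gate detection. The variable names and defaults come from the table; everything else (the `HotPartitionConfig` class, `load_config`, `should_detect`, the boolean parsing, and the choice to count only partitions that already have enough samples) is a hypothetical illustration, not Klag's actual code.

```python
import os
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class HotPartitionConfig:
    # Defaults mirror the documented defaults.
    enabled: bool = True
    sigma_multiplier: float = 2.0
    min_partitions: int = 3
    min_samples: int = 3
    buffer_size: int = 20


def load_config(env=None):
    """Build a config from environment variables (hypothetical loader)."""
    env = os.environ if env is None else env
    return HotPartitionConfig(
        # Assumption: only the literal string "true" enables detection.
        enabled=env.get("HOT_PARTITION_ENABLED", "true").strip().lower() == "true",
        sigma_multiplier=float(env.get("HOT_PARTITION_SIGMA_MULTIPLIER", "2.0")),
        min_partitions=int(env.get("HOT_PARTITION_MIN_PARTITIONS", "3")),
        min_samples=int(env.get("HOT_PARTITION_MIN_SAMPLES", "3")),
        buffer_size=int(env.get("HOT_PARTITION_BUFFER_SIZE", "20")),
    )


def should_detect(cfg, samples_by_partition):
    """Return True when a topic has enough data for detection to run.

    Assumption: a partition counts toward the minimum only once it has
    at least `min_samples` samples in its buffer.
    """
    if not cfg.enabled:
        return False
    ready = [p for p, s in samples_by_partition.items()
             if len(s) >= cfg.min_samples]
    return len(ready) >= cfg.min_partitions


cfg = load_config({})  # empty mapping, so every default applies

# A bounded deque drops the oldest sample once buffer_size is reached.
buffers = {p: deque(maxlen=cfg.buffer_size) for p in range(3)}
for p, buf in buffers.items():
    buf.extend([10.0, 11.0])      # two samples: below min_samples
not_ready = should_detect(cfg, buffers)
for buf in buffers.values():
    buf.append(12.0)              # third sample: threshold reached
ready = should_detect(cfg, buffers)
```

A larger buffer smooths the throughput estimate but reacts more slowly to a sudden change in key distribution.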

A hot partition usually points to a partitioning-key problem in the producer. Use the hot-partition panels on the Grafana dashboard to identify which topic and partition is skewed, then choose a more evenly distributed key or repartition the topic.