Data Loss Prevention

The most dangerous kind of lag is the kind that crosses your retention window. Once a consumer falls further behind than Kafka retains, the oldest unread messages are deleted, gone, unrecoverable. Klag warns you before that happens.

Metric | Description
klag.consumer.lag.retention_percent | Percentage of the retention window consumed by lag (value × 100 for precision).

Tagged with consumer_group and topic. Empty partitions are excluded.

retention_percent = (lag / (logEndOffset - logStartOffset)) * 100
  • A rising value means the consumer is eating into its safety margin.
  • 100% means the consumer is at or behind logStartOffset: the safety margin is gone, and any messages deleted past its position are already lost.
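
As a minimal sketch of the formula above (not Klag's actual implementation), the calculation for a single partition might look like this in Python. The function name, the clamping of negative lag to zero, and the choice to return None for an empty partition are assumptions made for illustration; lag is taken to be logEndOffset minus the consumer's committed offset.

```python
from typing import Optional


def retention_percent(log_start_offset: int,
                      log_end_offset: int,
                      committed_offset: int) -> Optional[float]:
    """Share of the retained offset span that the consumer has not yet read.

    Hypothetical helper: it illustrates the formula only.
    """
    span = log_end_offset - log_start_offset
    if span <= 0:
        # Empty partition: nothing is retained, so the ratio is undefined.
        return None
    # Assumed definition of lag; clamp so a consumer at the end reports 0.
    lag = max(0, log_end_offset - committed_offset)
    return (lag / span) * 100


# Partition retains offsets 1000..5000; the consumer has committed offset 4000.
print(retention_percent(1000, 5000, 4000))  # 25.0
```

Under this definition, a consumer whose committed offset equals logStartOffset yields exactly 100, and one that has fallen behind logStartOffset yields more than 100.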

Retention in Kafka is enforced by the broker deleting old segments. Comparing lag to the actual span of available offsets (logEndOffset − logStartOffset) measures the real, current safety margin, which is more reliable than assuming a fixed time-based retention.

Alert when retention_percent crosses a threshold well below 100 (e.g. 70–80%) to give operators time to scale consumers or intervene before messages are lost. The Grafana dashboard includes retention-risk panels and an at-risk topics table.
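
A threshold check along those lines could be sketched as follows. The 75% warning level, the function name, and the sample readings are illustrative assumptions, not Klag defaults; in practice the same comparison would usually live in an alerting rule against the exported metric.

```python
from typing import Dict, List, Tuple

WARN_THRESHOLD = 75.0  # assumed value inside the 70-80% range suggested above


def at_risk(readings: Dict[Tuple[str, str], float],
            threshold: float = WARN_THRESHOLD) -> List[Tuple[str, str, float]]:
    """Return (consumer_group, topic, percent) rows at or above threshold, worst first."""
    flagged = [(group, topic, pct)
               for (group, topic), pct in readings.items()
               if pct >= threshold]
    return sorted(flagged, key=lambda row: row[2], reverse=True)


# Made-up sample readings keyed by (consumer_group, topic).
sample = {
    ("billing", "invoices"): 82.5,
    ("search", "clicks"): 12.0,
    ("audit", "events"): 76.0,
}
for group, topic, pct in at_risk(sample):
    print(f"{group}/{topic}: {pct:.1f}% of retained offsets unread")
```

Sorting worst first mirrors the idea of an at-risk table: the group and topic pairs closest to losing data appear at the top.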