Introduction

Klag is a Kafka consumer lag exporter built with Vert.x and Micrometer. It continuously monitors consumer lag and consumer-group state across your cluster and exposes the data to Prometheus, Datadog, or any OTLP-compatible backend.

Klag is inspired by kafka-lag-exporter (archived in 2024) and rebuilds the idea on a modern reactive stack.

Consumer lag is the gap between what Kafka has produced and what your consumers have processed. Left unmonitored, growing lag leads to:

  • Stale data in downstream systems.
  • Memory pressure as consumers struggle to catch up.
  • Silent failures when consumer groups die without alerts.

Klag surfaces these problems early, with enough signal to act before users notice.
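To make the definition of lag concrete, here is a minimal sketch of the per-partition arithmetic: the log end offset (latest produced) minus the group's committed offset (latest processed). The function name and the plain-dictionary inputs are illustrative assumptions, not Klag's internal API.

```python
from typing import Dict, Tuple

# (topic, partition) pair; a hypothetical stand-in for a Kafka TopicPartition.
TopicPartition = Tuple[str, int]


def partition_lag(
    log_end_offsets: Dict[TopicPartition, int],
    committed_offsets: Dict[TopicPartition, int],
) -> Dict[TopicPartition, int]:
    """Return lag per partition: log end offset minus committed offset."""
    lag = {}
    for tp, end_offset in log_end_offsets.items():
        committed = committed_offsets.get(tp)
        if committed is None:
            # No committed offset for this partition: lag is undefined here,
            # so this sketch simply skips it.
            continue
        # Clamp at zero in case the two offsets were read at slightly
        # different moments and the committed offset appears ahead.
        lag[tp] = max(end_offset - committed, 0)
    return lag


end = {("orders", 0): 1500, ("orders", 1): 900}
committed = {("orders", 0): 1200, ("orders", 1): 900}
lag = partition_lag(end, committed)
total_lag = sum(lag.values())  # 300 messages behind across the topic
```

Summing the per-partition values gives the group's total lag for a topic; keeping them separate is what makes uneven partitions visible.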

| Feature | Why it matters |
| --- | --- |
| Lag velocity | Know if lag is growing or shrinking, to catch problems before they escalate. |
| Time-based lag estimation | See lag in seconds/minutes, beyond raw message counts. |
| Hot partition detection | Find partitions with uneven load causing bottlenecks. |
| Consumer group state tracking | Alert on Rebalancing, Dead, or Empty states. |
| Request batching | Safely monitor large clusters without overwhelming brokers. |
| Stale group cleanup | Automatically stops reporting deleted/inactive groups. |
| Data loss prevention | Catch the case where lag exceeds retention and data is lost. |
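Several of these signals reduce to simple arithmetic on offsets sampled over time. The sketch below shows one common way to derive them; it is an illustration under stated assumptions, not a description of how Klag computes its metrics. All function names and parameters are hypothetical.

```python
from typing import Optional


def lag_velocity(prev_lag: int, curr_lag: int, interval_s: float) -> float:
    """Change in lag per second; positive means the consumer is falling behind."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return (curr_lag - prev_lag) / interval_s


def estimated_time_lag_s(lag: int, consume_rate: float) -> Optional[float]:
    """Rough seconds of lag, assuming the consumer keeps its current rate.

    consume_rate is messages per second. Returns None when the consumer is
    not making progress, because no finite estimate exists.
    """
    if consume_rate <= 0:
        return None
    return lag / consume_rate


def offsets_lost_to_retention(committed_offset: int, log_start_offset: int) -> int:
    """Messages deleted by retention before the group read them.

    If the committed offset is older than the earliest offset still on the
    broker, the messages in between can no longer be consumed.
    """
    return max(log_start_offset - committed_offset, 0)


velocity = lag_velocity(prev_lag=1000, curr_lag=1600, interval_s=30)  # 20.0 msg/s, growing
seconds_behind = estimated_time_lag_s(lag=1600, consume_rate=80.0)    # 20.0 seconds
lost = offsets_lost_to_retention(committed_offset=500, log_start_offset=750)  # 250 messages
```

A positive velocity sustained over several samples is usually a more useful alert condition than a fixed lag threshold, since it fires while the backlog is still small.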

Klag monitors thousands of consumer groups in about 50 MB of heap. Request batching with configurable delays lets it fetch offsets for 500+ groups without spiking broker CPU. See Group Filtering and the KAFKA_CHUNK_COUNT / KAFKA_CHUNK_DELAY_MS settings in the Configuration Reference.
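The general idea behind request batching can be sketched as follows: split the group list into chunks, fetch each chunk, and pause between chunks so the brokers never see one large burst. This is a minimal illustration, not Klag's implementation. The parameters chunk_size and delay_s are hypothetical, and how they map onto KAFKA_CHUNK_COUNT and KAFKA_CHUNK_DELAY_MS is an assumption; the Configuration Reference defines the exact semantics.

```python
import time
from typing import Callable, Dict, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def fetch_in_chunks(
    group_ids: Sequence[str],
    chunk_size: int,   # hypothetical: groups per request batch
    delay_s: float,    # hypothetical: pause between batches, in seconds
    fetch: Callable[[List[str]], Dict[str, int]],
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, int]:
    """Call `fetch` once per chunk, sleeping between chunks to spread load."""
    results: Dict[str, int] = {}
    batches = list(chunked(group_ids, chunk_size))
    for index, batch in enumerate(batches):
        results.update(fetch(batch))
        # No pause after the final batch: there is nothing left to throttle.
        if index < len(batches) - 1:
            sleep(delay_s)
    return results


def fake_fetch(batch: List[str]) -> Dict[str, int]:
    """Stand-in for a real offset lookup; reports zero lag for every group."""
    return {group: 0 for group in batch}


groups = [f"group-{i}" for i in range(5)]
lags = fetch_in_chunks(groups, chunk_size=2, delay_s=0.01, fetch=fake_fetch)
```

Smaller chunks with longer delays trade freshness for a flatter load on the brokers, which is the trade-off the two settings expose.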

Klag requires only read-only (DESCRIBE) access to Kafka; it needs no write or alter permissions. See ACL Permissions for the exact grants on self-managed Kafka and Confluent Cloud.