
kafka: Add support for time lag #17735

Open · wants to merge 8 commits into master

Conversation


@adithyachakilam (Contributor) commented Feb 17, 2025

Description

Record lag alone doesn't tell us how long it would take to clear a given Kafka lag, since processing time differs for every stream. To measure SLAs correctly, we need to express the lag in terms of time. This PR adds support for calculating the lag of a Kafka stream in terms of time.
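As a rough illustration of the general idea (all names below are hypothetical, not this PR's actual code), per-partition time lag is the gap between the timestamp of the newest record available in the stream and the timestamp of the newest record the ingestion tasks have already read:

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the time-lag computation; all names are illustrative.
public class TimeLagSketch
{
  /** Both maps are partition id -> record timestamp in epoch millis. */
  public static Map<Integer, Long> computeTimeLag(
      Map<Integer, Long> latestTimestampInStream,
      Map<Integer, Long> highestIngestedTimestamp
  )
  {
    final Map<Integer, Long> timeLagMillis = new HashMap<>();
    latestTimestampInStream.forEach((partition, latest) -> {
      final Long ingested = highestIngestedTimestamp.get(partition);
      if (ingested != null) {
        // Clamp at zero so out-of-order timestamps never report negative lag.
        timeLagMillis.put(partition, Math.max(0L, latest - ingested));
      }
    });
    return timeLagMillis;
  }
}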

Release note

The Kafka supervisor now emits an additional lag metric that reports how many minutes of data ingestion is falling behind.


Key changed/added classes in this PR
  • KafkaSupervisor

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@kfaraz (Contributor) left a comment


@adithyachakilam , thanks for the changes.
The approach makes sense to me.
But we can probably simplify it as follows:

  • Add field long timestamp to OrderedPartitionableRecord.
  • Update KafkaSupervisor.poll() to also set the timestamp in the records.
  • Add a config (maybe in KafkaSupervisorIOConfig) to enable emitting the new metric, mostly because I am not too sure of the impact of the additional polls.
  • Update method updatePartitionLagFromStream() as suggested below. With this approach, we can avoid additional seeks and polls and hold the lock for a shorter period of time. Kafka topics may sometimes have several partitions, and polling each one separately can be inefficient.
  • Compute the difference between highestIngestedTimestamps and latestTimestampsFromStream in method getPartitionTimeLag() (a sketch of this step follows the code block below).
  • I would also advise adding a metric ingest/%s/updateOffsets/time in SeekableStreamSupervisor.updateCurrentAndLatestOffsets(), which measures the total time spent in that method.
  • Please include results of any cluster testing in the PR description.

Please let me know what you think.

@Override
protected void updatePartitionLagFromStream()
{
  // "isEmitTimeLagMetrics" is a placeholder name for the proposed config flag
  if (getIoConfig().isEmitTimeLagMetrics()) {
    updatePartitionTimeAndRecordLagFromStream();
    return;
  }

  // existing code flow
}

/**
 * This method is similar to updatePartitionLagFromStream()
 * but also determines time lag. Once this method has been
 * tested, we can remove the older one.
 */
private void updatePartitionTimeAndRecordLagFromStream()
{
  // NEW CODE - determine the highest of the current offsets across all tasks
  final Map<KafkaTopicPartition, Long> highestCurrentOffsets = getHighestCurrentOffsets();

  getRecordSupplierLock().lock();
  try {
    Set<KafkaTopicPartition> partitionIds;
    try {
      partitionIds = recordSupplier.getPartitionIds(getIoConfig().getStream());
    }
    catch (Exception e) {
      log.warn("Could not fetch partitions for topic/stream [%s]", getIoConfig().getStream());
      throw new StreamException(e);
    }

    // NEW CODE - seek all partitions to the highest current offsets
    for (Map.Entry<KafkaTopicPartition, Long> entry : highestCurrentOffsets.entrySet()) {
      if (partitionIds.contains(entry.getKey())) {
        recordSupplier.seek(new StreamPartition<>(getIoConfig().getStream(), entry.getKey()), entry.getValue());
      }
    }

    final Set<StreamPartition<KafkaTopicPartition>> partitions = partitionIds
        .stream()
        .map(e -> new StreamPartition<>(getIoConfig().getStream(), e))
        .collect(Collectors.toSet());

    // NEW CODE - poll records for all partitions at the highest current offsets
    // ("timeout" is a placeholder for the poll timeout in millis)
    final List<OrderedPartitionableRecord<KafkaTopicPartition, Long, KafkaRecordEntity>> lastIngestedRecords =
        recordSupplier.poll(timeout);

    // NEW CODE - determine the max ingested timestamp for each partition using lastIngestedRecords.
    // Make sure to keep only the relevant records, as the consumer may have returned records
    // with a higher offset than the one requested.

    recordSupplier.seekToLatest(partitions);

    // This isn't actually computing the lag, just fetching the latest offsets from the stream.
    // This is because we currently only have record lag for Kafka, which can be lazily computed
    // by subtracting the highest task offsets from the latest offsets from the stream when needed.

    // NEW CODE - poll records for all partitions at the latest offsets
    final List<OrderedPartitionableRecord<KafkaTopicPartition, Long, KafkaRecordEntity>> latestRecordsInStream =
        recordSupplier.poll(timeout);

    // NEW CODE - iterate over latestRecordsInStream to determine latestTimestampsFromStream
    // and latestSequencesFromStream
  }
  catch (InterruptedException e) {
    throw new StreamException(e);
  }
  finally {
    getRecordSupplierLock().unlock();
  }
}
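For the getPartitionTimeLag() step mentioned in the list above, a minimal sketch could look like the following. The field names latestTimestampsFromStream and highestIngestedTimestamps are taken from this comment; everything else is an assumption, and both maps are assumed to hold per-partition record timestamps in epoch millis, populated by the method above:

// Sketch only, not this PR's actual implementation.
private Map<KafkaTopicPartition, Long> getPartitionTimeLag()
{
  final Map<KafkaTopicPartition, Long> timeLag = new HashMap<>();
  for (Map.Entry<KafkaTopicPartition, Long> entry : latestTimestampsFromStream.entrySet()) {
    final Long ingestedTimestamp = highestIngestedTimestamps.get(entry.getKey());
    if (ingestedTimestamp != null) {
      // Clamp at zero so empty polls or clock skew never yield negative lag.
      timeLag.put(entry.getKey(), Math.max(0L, entry.getValue() - ingestedTimestamp));
    }
  }
  return timeLag;
}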


@kfaraz (Contributor) left a comment


Done a partial review; I still need to take another look at the new method updatePartitionTimeAndRecordLagFromStream().
