Kafka printer improvements #555

AndreKurait · 2024-04-05T19:55:41Z

Description

Updates the KafkaPrinter so it can take, per partition, a commit offset to start printing from.
Updates the KafkaExport script so it can pass commit offsets and record limits per partition.
Updates the KafkaExport script to compress the KafkaPrinter output directly instead of first writing the uncompressed output to disk.
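
For readers skimming the thread, the inline compression described above could look roughly like the sketch below. This is not the actual kafkaExport.sh code: `produce_records` is a hypothetical stand-in for the KafkaPrinter invocation, and `export_compressed` and the temporary archive path are assumed names used only for illustration.

```shell
# Hypothetical stand-in for the KafkaPrinter invocation: writes records to stdout.
produce_records() {
  local i
  for i in 1 2 3; do
    printf 'record-%s\n' "$i"
  done
}

# Stream the producer's stdout straight into gzip, so only the compressed
# archive is written to disk and no uncompressed intermediate file exists.
export_compressed() {
  local archive_path="$1"
  produce_records | gzip -c > "$archive_path"
}

archive="$(mktemp)"
export_compressed "$archive"
gzip -dc < "$archive"   # decompress to stdout to check the round trip
rm -f "$archive"
```

Piping into `gzip -c` keeps peak disk usage at the size of the compressed archive, which is presumably where the "less disk space" benefit comes from.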

Note: Unit tests are currently lacking for the KafkaPrinter. We will focus on usability enhancements first, then add testing once we know what we need.

Category: Enhancement
Why are these changes required? Improved usability of the KafkaExport script.
What is the old behavior before changes and new behavior after changes? The KafkaExport script can now process more data with less disk space and gains support for commit offsets and partition limits.

Issues Resolved

https://opensearch.atlassian.net/browse/MIGRATIONS-1642

Is this a backport? If so, please add backport PR # and/or commits #

Testing

Ran the script with various configurations in an ECS container and verified correct publishing through s3

Check List

- New functionality includes testing
  - All tests pass, including unit test, integration test and doctest
- New functionality has been documented
- [x] Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Andre Kurait <[email protected]>

mikaylathompson · 2024-04-05T22:19:18Z

TrafficCapture/dockerSolution/src/main/docker/migrationConsole/kafkaExport.sh

+else
+    partition_limits_with_topic=""
+fi
+
 partition_offsets=$(./kafka/bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list "$broker_endpoints" --topic "$topic" --time -1 $(echo "$kafka_command_settings"))
 comma_sep_partition_offsets=$(echo $partition_offsets | sed 's/ /,/g')


As far as I can tell, we're now ignoring these partition offsets in favor of partition_offsets_with_topic.
First of all, is it correct that we're now ignoring them? If so, let's remove this code.

AndreKurait · 2024-04-09

These have always been just a print statement that lets a user see what the offset on the replayer may be. I've updated the variable names to make that clearer.
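
A small sketch of that informational print, for anyone unfamiliar with it. It assumes GetOffsetShell emits one `topic:partition:offset` entry per line. The sample data and the `join_offsets` helper are hypothetical, and `paste` is used here in place of the script's `echo | sed` pipeline so the join does not rely on unquoted word splitting.

```shell
# Hypothetical sample of GetOffsetShell output, assumed to be one
# 'topic:partition:offset' entry per line.
sample_offsets='example-topic:0:42
example-topic:1:17'

# Join the per-partition entries into a single comma-separated string.
join_offsets() {
  printf '%s\n' "$1" | paste -sd, -
}

# Purely informational: nothing downstream consumes this value.
echo "Latest offsets per partition: $(join_offsets "$sample_offsets")"
```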

mikaylathompson · 2024-04-05T22:24:10Z

TrafficCapture/dockerSolution/src/main/docker/migrationConsole/kafkaExport.sh

-  echo "  --s3-bucket-name                            Option to specify a given S3 bucket to store archive on".
+  echo "  --s3-bucket-name                            Option to specify a given S3 bucket to store archive on."
+  echo "  --partition-offsets                         Option to specify partition offsets in the format 'partition_id:offset,partition_id:offset'."
+  echo "  --partition-limits                          Option to specify partition limits in the format 'partition_id:num_records,partition_id:num_records'."


Can you elaborate on how partition offsets and partition limits work together? If I provide both for a given partition_id, does that mean it will print messages from offset->offset+num_records? What if I only provide one of them?

AndreKurait · 2024-04-09

That's correct. I've updated the description to hopefully make that clearer.

When partition-offsets is not set for a partition, it defaults to the beginning of the partition in Kafka.
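
To make the interaction concrete, here is a minimal sketch of how the two options could combine for a single partition. It is not the script's implementation: `lookup_partition_value` and `describe_range` are hypothetical helpers, and the "no record limit" behavior when only an offset is given is an assumption, since the thread confirms only the both-set case and the default start position.

```shell
# Look up one partition's value in a 'partition_id:value,partition_id:value' spec.
# Prints the value and returns 0 if found; returns 1 otherwise.
lookup_partition_value() {
  local spec="$1" wanted="$2" entry
  local IFS=','
  for entry in $spec; do
    if [ "${entry%%:*}" = "$wanted" ]; then
      printf '%s\n' "${entry#*:}"
      return 0
    fi
  done
  return 1
}

# Describe which records would be printed for one partition.
describe_range() {
  local partition="$1" offsets="$2" limits="$3" start limit end
  start="$(lookup_partition_value "$offsets" "$partition")" || start=""
  limit="$(lookup_partition_value "$limits" "$partition")" || limit=""
  if [ -n "$start" ] && [ -n "$limit" ]; then
    # Both set: read from offset up to (but excluding) offset + num_records.
    end=$((start + limit))
    echo "partition $partition: offsets [$start, $end)"
  elif [ -n "$start" ]; then
    # Assumption: with no limit, printing continues without a record cap.
    echo "partition $partition: from offset $start, no record limit"
  elif [ -n "$limit" ]; then
    # No offset given: start at the beginning of the partition.
    echo "partition $partition: first $limit records from the beginning"
  else
    echo "partition $partition: from the beginning, no record limit"
  fi
}

offsets='0:100,1:5'
limits='0:50,2:10'
for p in 0 1 2 3; do
  describe_range "$p" "$offsets" "$limits"
done
```

Note that "the beginning" is deliberately not printed as offset 0, because the earliest retained offset of a partition can be greater than zero.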

codecov · 2024-04-09T04:06:24Z

Codecov Report

Attention: Patch coverage is 0% with 32 lines in your changes are missing coverage. Please review.

Project coverage is 76.29%. Comparing base (44434ae) to head (e330325).
Report is 3 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| ...org/opensearch/migrations/replay/KafkaPrinter.java | 0.00% | 32 Missing ⚠️ |

Additional details and impacted files

@@             Coverage Diff              @@
##               main     #555      +/-   ##
============================================
- Coverage     76.54%   76.29%   -0.25%     
- Complexity     1408     1409       +1     
============================================
  Files           155      155              
  Lines          6033     6063      +30     
  Branches        543      548       +5     
============================================
+ Hits           4618     4626       +8     
- Misses         1049     1070      +21     
- Partials        366      367       +1

| Flag | Coverage Δ | |
|---|---|---|
| unittests | `76.29% <0.00%> (-0.25%)` | ⬇️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

AndreKurait · 2024-04-09T15:06:50Z

TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/KafkaPrinter.java

@@ -142,7 +146,6 @@ public static void main(String[] args) throws FileNotFoundException {
        Properties properties = new Properties();
        properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
-        properties.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");


Changed this to explicitly seek to the beginning the first time a partition is assigned, to handle the case where a groupId is reused.

peternied · 2024-04-09T20:48:41Z

TrafficCapture/dockerSolution/src/main/docker/migrationConsole/kafkaExport.sh

@@ -25,7 +27,9 @@ usage() {
  echo "Options:"


Naïve question: how do we know this change works and isn't broken by a future change?

AndreKurait added 2 commits April 5, 2024 14:08

Add support for initial partition offset in KafkaPrinter

ad0cad0

Signed-off-by: Andre Kurait <[email protected]>

Add partition-offsets and partition-limits to kafkaExport script and …

25e5527

…inline gzip from standard out Signed-off-by: Andre Kurait <[email protected]>

AndreKurait marked this pull request as ready for review April 5, 2024 19:55

AndreKurait requested review from chelma, gregschohn, kartg, lewijacn, mikaylathompson, okhasawn and sumobrian as code owners April 5, 2024 19:55

mikaylathompson reviewed Apr 5, 2024

Update KafkaExport to be more clear on expected behavior with partiti…

c3383bc

…on offsets Signed-off-by: Andre Kurait <[email protected]>

Update KafkaPrinter to seek to beginning at start

16a6128

Signed-off-by: Andre Kurait <[email protected]>

AndreKurait force-pushed the KafkaPrinterCommitOffset branch from 734953f to 16a6128 Compare April 9, 2024 15:03

Merge branch 'main' into KafkaPrinterCommitOffset

adc85d1

AndreKurait commented Apr 9, 2024

Add comment on existing offset print in kafkaExport

e330325

Signed-off-by: Andre Kurait <[email protected]>

mikaylathompson approved these changes Apr 9, 2024

AndreKurait merged commit 8c13003 into opensearch-project:main Apr 9, 2024
5 of 7 checks passed

AndreKurait deleted the KafkaPrinterCommitOffset branch April 9, 2024 21:00

peternied reviewed Apr 9, 2024
