
FEATURE: Add ability to export Kafka records #488

Merged
lewijacn merged 8 commits into opensearch-project:main from allow-kafka-export on Feb 13, 2024

Conversation

lewijacn
Collaborator

@lewijacn commented Jan 18, 2024

Description

Introduces a kafkaExport.sh script on the migration console for exporting all initially detected records for a given topic, across its partitions, to a gzip archive file. The script uses the packaged Kafka console scripts to detect all partitions for the topic and the current number of records in each partition, then feeds this information to the KafkaPrinter, which has been updated to accept per-partition limits and to complete once all limits have been reached.
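For illustration only (the PR's script shells out to the packaged Kafka console tools rather than doing this): a minimal Java sketch of how the initial per-partition record counts could be gathered with the consumer API. The broker address, topic, and class name are placeholders.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class PartitionLimitProbe {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"); // placeholder broker
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

        String topic = "logging-traffic-topic"; // placeholder topic
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            // Discover every partition for the topic, then ask the broker for the earliest
            // and latest offsets; the difference is the record count available to export.
            List<TopicPartition> partitions = consumer.partitionsFor(topic).stream()
                    .map(p -> new TopicPartition(topic, p.partition()))
                    .collect(Collectors.toList());
            Map<TopicPartition, Long> begin = consumer.beginningOffsets(partitions);
            Map<TopicPartition, Long> end = consumer.endOffsets(partitions);
            // Print "topic:partition:num_records" entries, the shape the PR's
            // --partition-limit option expects.
            partitions.forEach(tp -> System.out.println(
                    tp.topic() + ":" + tp.partition() + ":" + (end.get(tp) - begin.get(tp))));
        }
    }
}
```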


Also included is the ability to export this archive to S3, mainly for the AWS use case but also applicable to a Docker setup with the proper AWS credentials. To accommodate this, a migration-artifacts S3 bucket is now created in the CDK deployment and linked to the migration console so that the permissions needed for exports are available. This S3 bucket should also be useful for future Replayer log/tuple exports and other use cases.
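For illustration only (the script performs the upload from the shell): a minimal sketch of an equivalent upload with the AWS SDK for Java v2, assuming placeholder bucket, key, and file names.

```java
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

import java.nio.file.Path;
import java.nio.file.Paths;

public class ArchiveUploader {
    public static void main(String[] args) {
        Path archive = Paths.get("kafka_export_1705600000.proto.tar.gz"); // placeholder file name
        // Credentials come from the default provider chain (task role in ECS,
        // environment variables or ~/.aws/credentials in the Docker case).
        try (S3Client s3 = S3Client.create()) {
            s3.putObject(PutObjectRequest.builder()
                            .bucket("migration-artifacts-example")          // placeholder bucket
                            .key("kafka-exports/" + archive.getFileName())  // placeholder key prefix
                            .build(),
                    RequestBody.fromFile(archive));
        }
    }
}
```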


Minor additional changes:

  • Added the ability to specify a docker build context other than the directory of the Dockerfile for CDK ECS services. This is not currently used by any of the ECS services, but was used during testing with a build context at the root of the project.
  • Upgraded OSB to 1.2, which resolves a conflict between awscli and osb previously encountered on the migration console.

Issues Resolved

https://opensearch.atlassian.net/browse/MIGRATIONS-1458
https://opensearch.atlassian.net/browse/MIGRATIONS-1459

Testing

Local/Cloud testing and unit testing for KafkaPrinter

Check List

  • New functionality includes testing
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


codecov bot commented Jan 18, 2024

Codecov Report

Attention: 54 lines in your changes are missing coverage. Please review.

Comparison is base (ffbb46e) 76.06% compared to head (9a2b755) 75.49%.

Files Patch % Lines
...org/opensearch/migrations/replay/KafkaPrinter.java 34.93% 52 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #488      +/-   ##
============================================
- Coverage     76.06%   75.49%   -0.57%     
- Complexity     1356     1360       +4     
============================================
  Files           158      158              
  Lines          6011     6085      +74     
  Branches        509      530      +21     
============================================
+ Hits           4572     4594      +22     
- Misses         1088     1138      +50     
- Partials        351      353       +2     
Flag Coverage Δ
unittests 75.49% <34.93%> (-0.57%) ⬇️

Flags with carried forward coverage won't be shown.


@lewijacn marked this pull request as ready for review January 19, 2024 21:47
Collaborator

@gregschohn left a comment

There are a number of changes that I'd like to see to this PR, but the risk and tech-debt being introduced in the grand scheme of the codebase is minor, so I'll defer to your judgment about what should or needs to be addressed now and what can wait.

Since we don't know what the long-term needs are for using kafka topics exported in the format supported here, I'm fine with a more incremental approach. Do we have any tests on the replayer to confirm that it works (well) with files of this format? We DO have tests that integrate kafka and the replayer, but I cannot recall any for delimited stdio streams.

Comment on lines +29 to +33
if (projectName == "migrationConsole") {
    def destDir = "src/main/docker/${projectName}/build/jars"
    CommonUtils.copyArtifact(project, "trafficReplayer", projectName, destDir)
    dependsOn "copyArtifact_${projectName}"
}
Collaborator

This is fine for now, but we should look into a way to do this through extension or dependency injection rather than special casing.

-pip3 install urllib3==1.25.11 opensearch-benchmark==1.1.0 awscurl tqdm
-# TODO upon the next release of opensearch-benchmark the awscli package should be installed by pip3, with the expected boto3 version upgrade resolving the current conflicts between opensearch-benchmark and awscli
-RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && unzip awscliv2.zip && ./aws/install && rm -rf aws awscliv2.zip
+pip3 install urllib3 opensearch-benchmark==1.2.0 awscurl tqdm awscli
Collaborator

THANK YOU! Now this will work on arm cpus too!

project.task("copyArtifact_${destProjectName}", type: Copy) {
    dependsOn ":${artifactProjectName}:build"
    dependsOn ":${artifactProjectName}:jar"
    if (destProjectName == "trafficCaptureProxyServerTest") {
Collaborator

This doesn't seem like it should be a special case. We should just kill the test project altogether.

Collaborator Author

Should we just remove this if block, or are you saying the entire trafficCaptureProxyServerTest module should be removed? Not sure what our future plans are for it or if we use it today

# Add Traffic Replayer jars for running KafkaPrinter from this container
COPY build/jars /root/jars
RUN printf "#!/bin/sh\njava -cp `echo /root/jars/*.jar | tr \ :` \"\$@\" " > /root/runJavaWithClasspath.sh
RUN chmod +x /root/runJavaWithClasspath.sh
Collaborator

for later (and again from code predating these changes), we should use the gradle-provided wrappers instead of making our own.

Collaborator Author

Agreed 👍

group="export_$epoch_ts"

set -o xtrace
./runJavaWithClasspath.sh org.opensearch.migrations.replay.KafkaPrinter --kafka-traffic-brokers "$broker_endpoints" --kafka-traffic-topic "$topic" --kafka-traffic-group-id "$group" $(echo "$msk_auth_settings") --timeout-seconds "$timeout_seconds" --partition-limit "$comma_sep_partition_offsets" >> "$file_name"
Collaborator

why do you run 'echo' rather than just use the msk_auth_settings variable directly?

Collaborator

Why do you need a partition limit? I can understand if we have 100 partitions across 10TB that you don't want to try to pull all of that in a single process to a single file, but there could be a default with a warning for those doing quick and dirty tests.

Collaborator Author

For the first question: I had to use echo to handle some wonkiness around passing an argument name without a value, e.g. --kafka-traffic-enable-msk-auth.

Second question: Maybe "partition limit" is a bit confusing here. It is more or less just a means to tell our KafkaPrinter when to stop; otherwise it would continue to listen indefinitely. The user cannot currently specify these limits through the script; by default we try to capture as many records as are detected at the start of the script.
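A rough sketch of what those limits do (hypothetical names, not the actual KafkaPrinter code): each partition carries a remaining-record counter, and polling stops once every counter hits zero.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

final class PartitionLimitLoop {
    /** Poll until every partition has produced the number of records it held at startup. */
    static void consumeUpToLimits(KafkaConsumer<String, byte[]> consumer,
                                  Map<TopicPartition, Long> initialLimits,
                                  java.util.function.Consumer<ConsumerRecord<String, byte[]>> sink) {
        Map<TopicPartition, Long> remaining = new HashMap<>(initialLimits);
        // Real code would also honor a timeout, as the script's --timeout-seconds does.
        while (remaining.values().stream().anyMatch(left -> left > 0)) {
            ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, byte[]> record : records) {
                TopicPartition tp = new TopicPartition(record.topic(), record.partition());
                long left = remaining.getOrDefault(tp, 0L);
                if (left > 0) {
                    sink.accept(record);            // hand the record to the output writer
                    remaining.put(tp, left - 1);    // one fewer record owed for this partition
                }
            }
        }
    }
}
```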

Comment on lines +59 to +61
comma_sep_partition_offsets=$(echo $partition_offsets | sed 's/ /,/g')
echo "Collected offsets from current Kafka topic: "
echo $comma_sep_partition_offsets
Collaborator

This can only be one line, right?

Collaborator Author

Yes this becomes a line of topic:partition:num_records topic:partition:num_records ...

./runJavaWithClasspath.sh org.opensearch.migrations.replay.KafkaPrinter --kafka-traffic-brokers "$broker_endpoints" --kafka-traffic-topic "$topic" --kafka-traffic-group-id "$group" $(echo "$msk_auth_settings") --timeout-seconds "$timeout_seconds" --partition-limit "$comma_sep_partition_offsets" >> "$file_name"
set +o xtrace

tar -czvf "$archive_name" "$file_name" && rm "$file_name"
Collaborator

I presume that you're keeping open the option of having multiple files in one tar.gz file? It doesn't seem unreasonable to require clients to deal with that.

Collaborator

Actually, I think that you'll want to be doing this for multiple partitions. If we have a topic with 2 (or 200) partitions, we don't want to fold all of those into a single file, losing the partition assignments. We can always glue them back together later. While it's true that we can repartition as we'd like, maybe using the original scheme - or maybe not - it feels like it would be more scalable and easier to manage if we didn't do merges of partitions within this script.

I'm fine with this staying as-is for now, but we'll need to start thinking about how we're partitioning as we horizontally scale replayers.

epoch_ts=$(date +%s)
file_name="kafka_export_$epoch_ts.proto"
archive_name="$file_name.tar.gz"
group="export_$epoch_ts"
Collaborator

I'd like to see a bit more uniqueness. Can you use a random value in here, or the hostname and pid as well?

Collaborator

Can you make "export" a bit less generic? I know that this is a simple utility, but somebody could mistake this group for something else. You could make this something like exportFromMigrationConsole_.

Collaborator Author

Have made this more unique now

 var records = kafkaConsumer.poll(CONSUMER_POLL_TIMEOUT);
-binaryReceiver.accept(StreamSupport.stream(records.spliterator(), false)
-        .map(ConsumerRecord::value));
+binaryReceiver.accept(StreamSupport.stream(records.spliterator(), false));
Collaborator

do you want to flatten records from across different partitions into just one stream? See my comments about preserving partition boundaries.

Collaborator Author

Have added a default that preserves partition boundaries and creates separate files.
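Roughly what that change amounts to (a sketch with hypothetical names, not the merged code): each partition gets its own CodedOutputStream, so records from different partitions never share a file. The separate files can always be glued back together later, as noted above.

```java
import com.google.protobuf.CodedOutputStream;
import org.apache.kafka.clients.consumer.ConsumerRecord;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

final class PerPartitionWriter {
    private final Map<Integer, CodedOutputStream> streamsByPartition = new HashMap<>();
    private final String outputDirectory;
    private final String topic;

    PerPartitionWriter(String outputDirectory, String topic) {
        this.outputDirectory = outputDirectory;
        this.topic = topic;
    }

    /** Write each record's value to the file belonging to its source partition. */
    void accept(Stream<ConsumerRecord<String, byte[]>> records) {
        records.forEach(record -> {
            CodedOutputStream out = streamsByPartition.computeIfAbsent(record.partition(), p -> {
                try {
                    // One file per partition, e.g. <dir>/<topic>_<partition>.proto
                    return CodedOutputStream.newInstance(
                            new FileOutputStream(outputDirectory + "/" + topic + "_" + p + ".proto"));
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            try {
                byte[] value = record.value();
                out.writeUInt32NoTag(value.length);  // length-delimited framing
                out.writeRawBytes(value);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }

    /** Flush the buffered protobuf writers so all bytes reach their files before archiving. */
    void flushAll() throws IOException {
        for (CodedOutputStream out : streamsByPartition.values()) {
            out.flush();
        }
    }
}
```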

private static final Logger log = LoggerFactory.getLogger(KafkaPrinter.class);
public static final Duration CONSUMER_POLL_TIMEOUT = Duration.ofSeconds(1);

static class Partition {
Collaborator

There's already a tuple class for topics and partitions in org.apache.kafka.common (TopicPartition)

Collaborator Author

Thanks, I was unaware; have adjusted to use it now 👍

@lewijacn requested a review from AndreKurait as a code owner February 9, 2024 20:15
Collaborator

@gregschohn left a comment

There are some things that would be nice to change, but they aren't critical, especially given the support nature of this code.

Comment on lines +187 to +188
OutputStream os = params.outputDirectoryPath == null ? System.out : new FileOutputStream(String.format("%s%s_%s_%s.proto", baseOutputPath, params.kafkaTrafficTopic, "all", uuid));
partitionOutputStreams.put(0, CodedOutputStream.newInstance(os));
Collaborator

This block can roll into the next block - excise and make the first if statement include captureRecords.isEmpty() ||...

 set +o xtrace

-tar -czvf "$archive_name" "$file_name" && rm "$file_name"
+cd $dir_name
+tar -czvf "$archive_name" *.proto && rm *.proto
Collaborator

I'm fine doing this later, but this whole script would be much faster and easier to manage if you wrote compressed streams first, then tarred them together. As it is, you'll have a huge number of bytes going to disk (which might be remote) and back. Since the java program uses multiple files, you'd have to manage it within the java program (with a GZipInputStream).
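A sketch of that suggestion (hypothetical names, not part of the merged change): wrapping each partition's FileOutputStream in a GZIPOutputStream before handing it to CodedOutputStream means only compressed bytes ever reach disk, and the tar step no longer rereads and recompresses raw protobuf.

```java
import com.google.protobuf.CodedOutputStream;

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

final class CompressedPartitionStreams {
    /** Open a gzip-compressed, length-delimited protobuf stream for one partition. */
    static CodedOutputStream openForPartition(String dir, String topic, int partition) throws IOException {
        // Data is compressed as it is written; the export script would then only need
        // to tar the already-compressed .proto.gz files together.
        GZIPOutputStream gzip = new GZIPOutputStream(
                new FileOutputStream(dir + "/" + topic + "_" + partition + ".proto.gz"));
        // Caller must flush the CodedOutputStream and close the gzip stream so the
        // gzip trailer gets written.
        return CodedOutputStream.newInstance(gzip);
    }
}
```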

-static java.util.function.Consumer<Stream<ConsumerRecord<String, byte[]>>> getDelimitedProtoBufOutputter(OutputStream outputStream, Map<Partition, PartitionTracker> capturedRecords) {
-    CodedOutputStream codedOutputStream = CodedOutputStream.newInstance(outputStream);
+static java.util.function.Consumer<Stream<ConsumerRecord<String, byte[]>>> getDelimitedProtoBufOutputter(Map<TopicPartition, PartitionTracker> capturedRecords,
+        Map<Integer, CodedOutputStream> partitionOutputStreams, boolean separatePartitionOutputs) {
Collaborator

I should have asked before, but why are you using a CodedOutputStream (full disclosure, I wrote it with a CodedOutputStream) here instead of writeDelimitedTo? That would let you pass in a compressed output stream and you wouldn't need to worry about the uncompressed buffering problem.

Collaborator Author

I like the idea of using this writeDelimitedTo function, but it seems to be on the Message interface, which means we would have to take the byte[], have our TrafficStream protobuf object parse it, and only then call writeDelimitedTo. Given those requirements, I wasn't sure there would be much benefit.

Collaborator

Ahhh - good point. The current code is probably more efficient, so there isn't a compelling reason to make the change now.
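To make the trade-off concrete (illustrative only; the TrafficStream import path is an assumption and the method names are placeholders): writeDelimitedTo requires parsing each byte[] back into a Message, whereas CodedOutputStream can frame the raw bytes directly.

```java
import com.google.protobuf.CodedOutputStream;
import org.opensearch.migrations.trafficcapture.protos.TrafficStream; // assumed import path

import java.io.IOException;
import java.io.OutputStream;

final class DelimitedWriteComparison {
    // Option discussed in review: parse the record value back into the TrafficStream
    // message just so Message.writeDelimitedTo can emit the length prefix for us.
    static void withWriteDelimitedTo(byte[] recordValue, OutputStream out) throws IOException {
        TrafficStream.parseFrom(recordValue).writeDelimitedTo(out);
    }

    // The CodedOutputStream route the PR keeps: write the length prefix and the raw
    // bytes directly, skipping the parse/re-serialize round trip.
    static void withCodedOutputStream(byte[] recordValue, CodedOutputStream out) throws IOException {
        out.writeUInt32NoTag(recordValue.length);
        out.writeRawBytes(recordValue);
    }
}
```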

@lewijacn merged commit 8c6fa53 into opensearch-project:main Feb 13, 2024
5 of 7 checks passed
gregschohn pushed a commit to gregschohn/opensearch-migrations that referenced this pull request Feb 20, 2024
* MIGRATIONS-1459: Add ability to export Kafka records

Signed-off-by: Tanner Lewis <[email protected]>
@lewijacn deleted the allow-kafka-export branch March 29, 2024 17:19