
[HUDI-4445] S3 Incremental source improvements #6176

Closed

Conversation

@vamshigv (Contributor) commented Jul 22, 2022

What is the purpose of the pull request

S3 Incremental source improvements:

  • Decode the file resource URL before operating on it (see the sketch just below this list).
  • Auto-discovery of partition columns in S3EventSource.
  • Fix serializability of the Hadoop configuration.
  • Improve the efficiency and performance of the filtering conditions.
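
A minimal sketch of the first item, with names that are illustrative rather than taken from the PR: S3 event notifications URL-encode object keys (a space arrives as '+', for example), so a raw key must be decoded before it is used as a file path.

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class S3KeyDecoding {
  // S3 event notifications URL-encode the object key, so decode it
  // before composing a file path from it.
  static String toDecodedPath(String bucket, String encodedKey) throws UnsupportedEncodingException {
    String decodedKey = URLDecoder.decode(encodedKey, StandardCharsets.UTF_8.name());
    return "s3://" + bucket + "/" + decodedKey;
  }
}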

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@vamshigv changed the title from "Initial commit for s3 source improvements" to "S3 Incremental source improvements" Jul 22, 2022
@vamshigv closed this Jul 22, 2022
@vamshigv reopened this Jul 22, 2022
@vamshigv force-pushed the HUDI-4445-Improve-S3-Incremental-Source branch from 305c9e0 to e434c6b on July 22, 2022 02:28
@vamshigv changed the title from "S3 Incremental source improvements" to "[HUDI-4445] S3 Incremental source improvements" Jul 22, 2022
@xushiyan self-assigned this Jul 22, 2022
.rdd().toJavaRDD().mapPartitions(fileListIterator -> {
List<String> cloudFilesPerPartition = new ArrayList<>();
fileListIterator.forEachRemaining(row -> {
final Configuration configuration = serializableConfiguration.newCopy();
Member:

Why create a copy again? I don't see any config modification happening within the executor. Why not simply pass serializableConfiguration?
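
A minimal sketch of what this comment suggests, assuming Hudi's SerializableConfiguration and its get() accessor; the surrounding names follow the quoted snippet:

.rdd().toJavaRDD().mapPartitions(fileListIterator -> {
  // Unwrap the serializable wrapper once per partition; no newCopy() needed
  // when the configuration is only read (never modified) inside the executor.
  final Configuration configuration = serializableConfiguration.get();
  List<String> cloudFilesPerPartition = new ArrayList<>();
  fileListIterator.forEachRemaining(row -> {
    // ... read-only use of configuration per row ...
  });
  return cloudFilesPerPartition.iterator();
})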

}
LOG.warn("Extracted distinct files " + cloudFiles.size()
Member:

I assume this was for testing; change the log to debug level?
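
The suggested change as a sketch, with the usual guard so the message string is only built when debug logging is enabled:

if (LOG.isDebugEnabled()) {
  LOG.debug("Extracted distinct files " + cloudFiles.size());
}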

@@ -81,6 +94,12 @@ static class Config {
* - --hoodie-conf hoodie.deltastreamer.source.s3incr.spark.datasource.options={"header":"true","encoding":"UTF-8"}
*/
static final String SPARK_DATASOURCE_OPTIONS = "hoodie.deltastreamer.source.s3incr.spark.datasource.options";

// ToDo make it a list of extensions
static final String S3_ACTUAL_FILE_EXTENSIONS = "hoodie.deltastreamer.source.s3incr.file.extensions";
Member:

Is this a list of supported source data file extensions, e.g. .json, .parquet, .avro?

Member:

Suggested change
static final String S3_ACTUAL_FILE_EXTENSIONS = "hoodie.deltastreamer.source.s3incr.file.extensions";
static final String S3INCR_FILE_EXTENSIONS_OPTIONS = "hoodie.deltastreamer.source.s3incr.file.extensions";

The constant name should align with the actual key, and it should carry the OPTIONS suffix, since it holds a config key, not the extensions themselves.

// ToDo make it a list of extensions
static final String S3_ACTUAL_FILE_EXTENSIONS = "hoodie.deltastreamer.source.s3incr.file.extensions";

static final String ATTACH_SOURCE_PARTITION_COLUMN = "hoodie.deltastreamer.source.s3incr.source.partition.exists";
Member:

Suggested change
static final String ATTACH_SOURCE_PARTITION_COLUMN = "hoodie.deltastreamer.source.s3incr.source.partition.exists";
// Add a comment on the purpose of this config and rename as below
static final String ADD_SOURCE_PARTITION_COLUMN = "hoodie.deltastreamer.source.s3incr.add.source.partition.column";

Member:

Ditto; the same naming rules apply here.

static final String S3_ACTUAL_FILE_EXTENSIONS = "hoodie.deltastreamer.source.s3incr.file.extensions";

static final String ATTACH_SOURCE_PARTITION_COLUMN = "hoodie.deltastreamer.source.s3incr.source.partition.exists";
static final Boolean DEFAULT_ATTACH_SOURCE_PARTITION_COLUMN = true;
Member:

Have we fully tested this change? If not, I would suggest keeping the default false for now.

private Dataset addPartitionColumn(Dataset ds, List<String> cloudFiles) {
if (props.getBoolean(Config.ATTACH_SOURCE_PARTITION_COLUMN, Config.DEFAULT_ATTACH_SOURCE_PARTITION_COLUMN)
&& !StringUtils.isNullOrEmpty(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME.key())) {
String partitionKey = props.getString(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME.key()).split(":")[0];
Member:

Return early, or log an error/warning, if partitionKey is null or empty?
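
A sketch of the suggested guard, assuming the getString(key, default) overload on the properties object; the warning text is illustrative:

String partitionField = props.getString(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME.key(), null);
if (StringUtils.isNullOrEmpty(partitionField)) {
  // Bail out before splitting when the partition-path field is not configured.
  LOG.warn("Partition path field is not configured; skipping source partition column.");
  return ds;
}
String partitionKey = partitionField.split(":")[0];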

List<String> nestedPartition = Arrays.stream(filePath.split("/"))
.filter(level -> level.contains(partitionPathPattern)).collect(Collectors.toList());
if (nestedPartition.size() > 1) {
throw new HoodieException("More than one level of partitioning exists");
Member:

Is this planned to be supported sometime in the future? If yes, let's create a tracking JIRA for it.

Member:

Multi-level partitioning is very common, so isn't this a major limitation? If we ship this as-is, how would it affect existing users?
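
Purely illustrative of the direction being asked about, not the PR's code: the matched levels could be joined into a relative partition path instead of throwing.

List<String> nestedPartition = Arrays.stream(filePath.split("/"))
    .filter(level -> level.contains(partitionPathPattern)).collect(Collectors.toList());
// Hypothetical multi-level handling: join all matched levels rather than rejecting them.
String partitionPath = String.join("/", nestedPartition);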

return ds;
}

private Column s3EventsColumnFilter(String fileFormat) {
Member:

A minor suggestion: extract methods like this into a separate util class to keep this class plain and simple. Or, if you prefer to keep them in this class for readability, move them to the bottom (i.e., after the call site) for a linear flow.

@xushiyan (Member) left a comment:

Quickly skimmed the code; I have some comments.


if (checkExists) {
FileSystem fs = FSUtils.getFs(s3Prefix + bucket, configuration);
try {
if (fs.exists(new Path(decodeUrl))) {
Member:

Creating a Hadoop Path incurs much more memory overhead than normal instantiation. If this is just for an existence check, let's find a better way.
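
One alternative in the spirit of this comment, sketched on the assumption that the AWS SDK v1 client is available and its credentials/region are configured elsewhere:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

class S3ExistenceCheck {
  // Ask S3 directly whether the object exists, avoiding a new Hadoop Path
  // (and FileSystem lookup) per row.
  static boolean objectExists(String bucket, String decodedKey) {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    return s3.doesObjectExist(bucket, decodedKey);
  }
}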

.filter(filterColumn)
.select("s3.bucket.name", "s3.object.key")
.distinct()
.rdd().toJavaRDD().mapPartitions(fileListIterator -> {
Member:

Why convert to an RDD? You should be able to do mapPartitions with a Dataset too.
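
A sketch of staying in the Dataset API via Dataset.mapPartitions(MapPartitionsFunction, Encoder); the source Dataset name and the per-row logic are placeholders:

import org.apache.spark.api.java.function.MapPartitionsFunction;
import org.apache.spark.sql.Encoders;

Dataset<String> cloudFiles = source
    .filter(filterColumn)
    .select("s3.bucket.name", "s3.object.key")
    .distinct()
    .mapPartitions((MapPartitionsFunction<Row, String>) rows -> {
      List<String> filesPerPartition = new ArrayList<>();
      rows.forEachRemaining(row -> {
        // ... build the decoded, fully qualified file path from the row ...
      });
      return filesPerPartition.iterator();
    }, Encoders.STRING());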

@vamshigv (Contributor, Author):

Priority is to land #6228 ahead of this one; this can make it into the next release.

@codope added the priority:critical (production down; pipelines stalled; need help asap) label and removed the priority:blocker label Aug 5, 2022
@yihua added the hudistreamer (issues related to Hudi streamer, formerly DeltaStreamer) label Sep 12, 2022
@nsivabalan added the pr:wip (Work in Progress PRs) label Nov 2, 2022
@@ -274,6 +277,11 @@ public RawTripTestPayload generatePayloadForShortTripSchema(HoodieKey key, Strin
return new RawTripTestPayload(rec.toString(), key.getRecordKey(), key.getPartitionPath(), SHORT_TRIP_SCHEMA);
}

public RawTripTestPayload generatePayloadForS3EventsSchema(HoodieKey key, String commitTime) throws IOException {
Member:

RawTripTestPayload assumes some form of the trips schema. If you look at its constructor, we don't actually use the schema, and its APIs assume a few things about the schema. Should we keep all of this out of HoodieTestDataGenerator?

Member:

This is an old comment. Please check if it's still valid.

import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

// Utility for the schema of S3 events listed here (https://docs.aws.amazon.com/AmazonS3/latest/userguide/notification-content-structure.html)
Member:

This should be a multi-line comment.
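
i.e., something like:

/**
 * Utility for the schema of S3 events listed at
 * https://docs.aws.amazon.com/AmazonS3/latest/userguide/notification-content-structure.html
 */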

Comment on lines 53 to 55
.requiredString("eventSource")
.requiredString("eventName")
.name("s3")
Member:

Let's extract all these strings to constants.

dos.write(jsonData.getBytes());
} finally {
dos.flush();
dos.close();
Member:

Will this close the ByteArrayOutputStream too?
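
For context: closing the outermost stream closes the wrapped one, since FilterOutputStream.close() closes the underlying stream (and ByteArrayOutputStream.close() is a no-op anyway). A try-with-resources sketch, assuming a deflater over a byte-array stream as in similar test payloads:

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;

ByteArrayOutputStream baos = new ByteArrayOutputStream();
try (DeflaterOutputStream dos = new DeflaterOutputStream(baos)) {
  dos.write(jsonData.getBytes(StandardCharsets.UTF_8));
} // close() finishes the deflater and closes baos as well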

/**
* Generic class for specific payload implementations to inherit from.
*/
public abstract class GenericTestPayload {
Member:

Maybe rename it to AbstractJsonTestPayload? It's essentially for JSON data, right?

import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

// Utility for the schema of S3 events listed here (https://docs.aws.amazon.com/AmazonS3/latest/userguide/notification-content-structure.html)
Member:

Should be a multi-line comment.

.requiredString("eventSource")
.requiredString("eventName")
.requiredString("_row_key")
.name("s3")
Member:

Preferably extract all these as static string constants.

@Override
public Pair<Option<Dataset<Row>>, String> fetchNextBatch(Option<String> lastCkptStr, long sourceLimit) {
Pair<Option<Dataset<Row>>, String> sourceMetadata = fetchMetadata(lastCkptStr, sourceLimit);
if (!sourceMetadata.getKey().isPresent()) {
Member:

sourceMetadata.getLeft()?

fileListIterator.forEachRemaining(row -> {
// TODO: configuration is updated in the getFs call. Check if a new copy is needed w.r.t. getFs.
Member:

Is this still required?

}

@Test
public void testHoodieIncrSource() throws IOException {
Member:

Maybe rename to testS3EventsHoodieIncrSource?

this.dataSize = jsonData.length();
Map<String, Object> jsonRecordMap = OBJECT_MAPPER.readValue(jsonData, Map.class);
this.rowKey = jsonRecordMap.get("_row_key").toString();
this.partitionPath = jsonRecordMap.get("time").toString().split("T")[0].replace("-", "/");
Member:

I recall this logic has been refactored in the current RawTripTestPayload.

Comment on lines -192 to +200
.mapPartitions((MapPartitionsFunction<Row, String>) fileListIterator -> {
.rdd()
// JavaRDD simplifies coding with collect and suitable mapPartitions signature. check if this can be avoided.
.toJavaRDD()
.mapPartitions(fileListIterator -> {
Member:

We usually prefer the high-level DataFrame APIs. How is it actually beneficial to convert to an RDD here? I don't quite get the comment.

Comment on lines -229 to +237
}
}
Member:

We should have an EOL (newline at end of file) here.

/**
* Test payload for S3 event here (https://docs.aws.amazon.com/AmazonS3/latest/userguide/notification-content-structure.html).
*/
public class S3EventTestPayload extends GenericTestPayload implements HoodieRecordPayload<S3EventTestPayload> {
Member:

I'd suggest just testing with DefaultHoodieRecordPayload and a specific S3 event schema, instead of creating a new test payload, since we want to test as close to the real scenario as possible. Besides, we don't couple payloads with schemas; a payload is only responsible for how records merge.
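
A sketch of that suggestion, using DefaultHoodieRecordPayload's (GenericRecord, Comparable orderingVal) constructor; the ordering value 0 is illustrative:

import org.apache.avro.generic.GenericRecord;
import org.apache.hudi.common.model.DefaultHoodieRecordPayload;

class S3EventPayloads {
  // Wrap a record built against the S3 events schema in the stock payload,
  // rather than defining a schema-specific test payload class.
  static DefaultHoodieRecordPayload toPayload(GenericRecord s3EventRecord) {
    return new DefaultHoodieRecordPayload(s3EventRecord, 0);
  }
}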

Member:

There is a lot of existing misuse of RawTripTestPayload; see https://issues.apache.org/jira/browse/HUDI-6164.

So you may want to decouple the improvement changes from the payload changes.

@xushiyan (Member) left a comment:

Whatever improvements we make to the S3 incremental source, should we make the same ones to the GCS incremental source?

@vamshigv (Contributor, Author):

Closing this as not needed.

@vamshigv closed this May 17, 2023