
NIFI-11129 - adding PutMongoBulk processor to use the bulkWrite API #6918

Closed

Conversation

sebastianrothbucher

NIFI-11129 - adding PutMongoBulk processor to use the bulkWrite API for way more efficient mass updates or inserts

Summary

NIFI-11129

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull Request commit message starts with Apache NiFi Jira issue number, such as NIFI-00000

Pull Request Formatting

  • Pull Request based on current revision of the main branch
  • Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

Licensing

  • n/a New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • n/a New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • n/a (but tested additionalDetails.html) Documentation formatting appears as expected in rendered files

@InputRequirement(Requirement.INPUT_REQUIRED)
@CapabilityDescription("Writes the contents of a FlowFile to MongoDB as bulk-update")
@SystemResourceConsideration(resource = SystemResource.MEMORY)
public class PutMongoBulk extends AbstractMongoProcessor {
Contributor

Two questions:

  1. Do we need a minimum driver version to use this?
  2. Have you considered breaking this functionality up, so it can efficiently use the bulkWrite API with the record API for bulk ingestion as well as in a generic bulk write operation processor?

Author

Hi Mike,
thanks for reviewing!
re 1.: I tested with the Mongo driver bundled with 1.13 at the time and it worked; I just had to make some adjustments for the driver bundled with 1.20-SNAPSHOT. Concerning operations: all Mongo 4.2+ (i.e. all active versions) support bulkWrite. I'd say we should be fine.
re 2.: I deliberately did not use records, to allow freeform updates with $set, $inc, etc. as operations. Of course, you could define an Avro schema for this, but I think it's a stretch. The original intention was to use it when you need more fine-grained control of updates vis-a-vis PutMongoRecord. Does that make sense? I could of course expand the additional description HTML...
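
For illustration, a minimal sketch of the kind of freeform operations the MongoDB Java driver's bulkWrite API accepts; the field names and the collection variable are placeholders, not taken from this PR:

import com.mongodb.bulk.BulkWriteResult;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.InsertOneModel;
import com.mongodb.client.model.UpdateOneModel;
import com.mongodb.client.model.Updates;
import com.mongodb.client.model.WriteModel;
import org.bson.Document;
import java.util.List;

// Sketch only: assumes a MongoCollection<Document> named "collection"
List<WriteModel<Document>> ops = List.of(
        new InsertOneModel<>(new Document("sku", "abc").append("qty", 10)),
        new UpdateOneModel<>(
                Filters.eq("sku", "def"),
                Updates.combine(Updates.set("status", "active"), Updates.inc("qty", 5))));
// all operations go to the server in a single call
BulkWriteResult result = collection.bulkWrite(ops);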

Contributor

#2 makes sense. Is the bulkWrite API supposed to be faster and more flexible, or just more flexible? Based on your response, if it's the former, then that has implications for PutMongoRecord too.

Author

PutMongoRecord uses the same bulkWrite API under the hood, so PutMongoBulk would be more flexible and equally fast compared to PutMongoRecord.

@sebastianrothbucher
Author

anything I can do to get the ball rolling again?

@MikeThomsen
Contributor

@sebastianrothbucher yeah, I'm adding this to my TODO to get a review going.

@exceptionfactory left a comment (Contributor)

Thanks for the contribution @sebastianrothbucher. I noticed some implementation concerns, but before proceeding to review the code details, it would be helpful to evaluate the scope of this Processor.

Although the documentation makes it clear that the input JSON needs to match MongoDB requirements, the generalized nature of the Processor seems like it could be confusing. The Put prefix generally signifies some kind of insert operation, but the implementation allows insert, update, replace, or delete. Although having a general Processor can be helpful, the scope seems too open-ended in this situation. One option is renaming the Processor to something like ExecuteMongo, along the lines of ExecuteSQL. On the other hand, the conditionals inside the onTrigger method perform significantly different operations. Is the goal to support a combination of insert, update, and delete operations in a single execution? This is generally counter to most other Processors.

The Mongo Client Service should also be the preferred configuration method.

With that background, I think it would be helpful to refine the scope and target use case for this Processor.

@MikeThomsen left a comment (Contributor)

Sorry it took a while to get back to you. Left you with a bunch of feedback. Good first pass at this processor.

@MikeThomsen
Contributor

@exceptionfactory

> On the other hand, the conditionals inside the onTrigger method perform significantly different operations. Is the goal to support a combination of insert, update, and delete operations in a single execution? This is generally counter to most other Processors.

We already have something similar in the Elasticsearch package where you can do a very complex set of bulk operations from a single flowfile with PutElasticsearchRecord.

@exceptionfactory
Contributor

> @exceptionfactory
>
> > On the other hand, the conditionals inside the onTrigger method perform significantly different operations. Is the goal to support a combination of insert, update, and delete operations in a single execution? This is generally counter to most other Processors.
>
> We already have something similar in the Elasticsearch package where you can do a very complex set of bulk operations from a single flowfile with PutElasticsearchRecord.

Thanks for the reference @MikeThomsen, that is a helpful comparison. With that in mind, does it make sense to align this Processor to something more record-oriented, as opposed to requiring JSON?

@MikeThomsen
Contributor

> Thanks for the reference @MikeThomsen, that is a helpful comparison. With that in mind, does it make sense to align this Processor to something more record-oriented, as opposed to requiring JSON?

No, I think it's good the way it is, and it doesn't bill itself as a record-aware processor. Maybe add "Operations" to the end of the name to further clarify that it is calling the bulkWrite API, but as-is I think it fills a useful Mongo niche.

@exceptionfactory left a comment (Contributor)

Including Operations in the Processor name sounds like it would be helpful.

The comments about issues with the MongoDB Client Service need to be resolved, in addition to the JUnit 4 test references. If these issues can be addressed, we should be able to move forward with the review.

@sebastianrothbucher
Author

thanks both of you, I'll take care of it

@sebastianrothbucher force-pushed the nifi-11129 branch 2 times, most recently from 440a232 to 0949341 on July 11, 2023 18:04
@sebastianrothbucher
Author

Different processors are failing on otherwise identical code (I did rebase before pushing); unfortunately I have no Windows box to debug on.

@exceptionfactory
Contributor

> Different processors are failing on otherwise identical code (I did rebase before pushing); unfortunately I have no Windows box to debug on.

Thanks @sebastianrothbucher, the test failure on Windows is unrelated, I restarted the workflow to give it another try.

@joewitt
Contributor

joewitt commented Oct 1, 2023

@exceptionfactory Seems like the build checks out fine. Was that the last concern?

@exceptionfactory left a comment (Contributor)

@joewitt and @sebastianrothbucher I went through the code in more detail and noted several recommendations. After addressing these issues, I think this should be ready to go forward.

@sebastianrothbucher
Author

Thanks! I addressed them all (or at least left an excuse in the comments ;-) )

@sebastianrothbucher
Author

The test failure is due to a newer Mongo version - working on it.

@exceptionfactory left a comment (Contributor)

Thanks for your patience on this pull request @sebastianrothbucher. Unfortunately there are still some implementation problems that I noted in the latest set of comments. They appear to have been missed in the course of refactoring, which can happen.

To make it easier to review subsequent changes, please put new changes in a new commit, even when rebasing, so that it is easier to follow the changes from one set of comments to another.

Comment on lines 112 to 121
List<PropertyDescriptor> _propertyDescriptors = new ArrayList<>();
_propertyDescriptors.addAll(descriptors);
_propertyDescriptors.add(ORDERED);
_propertyDescriptors.add(TRANSACTIONS_ENABLED);
_propertyDescriptors.add(CHARACTER_SET);
propertyDescriptors = Collections.unmodifiableList(_propertyDescriptors);

final Set<Relationship> _relationships = new HashSet<>();
_relationships.add(REL_SUCCESS);
_relationships.add(REL_FAILURE);
relationships = Collections.unmodifiableSet(_relationships);
Contributor

These declarations can be streamlined and replaced with List.of() and Set.of()

Author

I did what PutMongo also does, to stay roughly in sync; for one, the shorthand does not work directly because of line 113; and to me, it's even uglier to change style midway. Up to you, I can do it either way.

Contributor

Thanks for the response. As this is a new Processor, this is a good opportunity to improve the convention, so moving to List.of() and Set.of() declarations, as opposed to the static initializer and underscored collection variables, would be helpful.
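
To illustrate the suggestion, one possible streamlined form of the declarations (a sketch only; it assumes the parent descriptors list and the constants shown in the snippet above, plus java.util.stream imports):

// Sketch only: immutable collections instead of the underscored temporaries
static final List<PropertyDescriptor> PROPERTY_DESCRIPTORS = Stream.concat(
        descriptors.stream(),                                    // inherited from the abstract processor
        Stream.of(ORDERED, TRANSACTIONS_ENABLED, CHARACTER_SET)  // processor-specific properties
).collect(Collectors.toUnmodifiableList());

static final Set<Relationship> RELATIONSHIPS = Set.of(REL_SUCCESS, REL_FAILURE);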

.defaultValue("true")
.build();

static final PropertyDescriptor TRANSACTIONS_ENABLED = new PropertyDescriptor.Builder()
Contributor

Reviewing the implementation, is there value in making this a configurable property? On the one hand, removing the property avoids changing the client service interface and introducing the new method. On the other hand, should transactions always be used?

Author

Not really, I almost never use them, to be honest, because most of the time the document is the scope of work. I tried to stick to the driver here as well.
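
For reference, a minimal sketch of what the transactional path looks like at the driver level; mongoClient, collection, and writeModels are placeholders, not names from this PR:

// Sketch only: wraps a bulkWrite in a client session transaction
try (ClientSession session = mongoClient.startSession()) {
    session.startTransaction();
    try {
        collection.bulkWrite(session, writeModels);  // apply the whole batch in one transaction
        session.commitTransaction();
    } catch (MongoException e) {
        session.abortTransaction();  // roll back the whole batch on failure
        throw e;
    }
}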

Contributor

Thanks for the reply. If you don't think it is necessary to support transactions for now, it seems like it would be better to remove this property, and the method from MongoDBClientService. Removing the option would simplify both the configuration and implementation. It looks like it wouldn't be difficult to add later if needed, but if you don't see a common need, then this might be a case of "less is more" for the initial version.

…rite API for way more efficient mass updates or inserts
@sebastianrothbucher
Author

thanks for checking!

…rite API for way more efficient mass updates or inserts - finishing touches
@sebastianrothbucher
Author

2nd failure is a Heisenbug ;-)
think we're good for this Processor

@exceptionfactory
Contributor

exceptionfactory commented Nov 7, 2023

> 2nd failure is a Heisenbug ;-) think we're good for this Processor

Recent build changes introduced automated running of integration tests, and it looks like it is failing on the new IT:

Error:  Failures: 
Error:    PutMongoBulkOperationsIT.testBulkWriteInsert:65 expected: <3> but was: <0>

A number of the existing integration tests are being skipped because they are not in a position to run in an automated workflow, or they are otherwise unreliable.

In this case, the new integration test needs to be reliable, or it should not be included. Does it work for you on a local build? GitHub runners are less powerful, which can introduce certain unexpected behaviors, but this is something that will need to be addressed in this pull request, one way or the other, for the integration test.

@sebastianrothbucher
Author

I still have to look. I ran the tests before each push - and they just worked. Also, from the error messages it seems like something gets stuck due to load. Let me get back to you.

…rite API for way more efficient mass updates or inserts - test stability
@sebastianrothbucher
Author

I don't think it was something I did specifically; it's just that Mongo might not yet have committed the deletion of the DB in question. I nonetheless did two things: 1.) use different test docs in different tests (which should not be necessary, but gives additional safety), and 2.) add more deletion and an explicit close to the Mongo write tests - which hopefully got the failure rate from 0.1% down to 0.001% or so.

I re-ran the full test suite - and I got some other spurious errors. One in GridFS I cannot explain at all (unless the CI somehow overrides the Mongo Docker image and we use a different MongoDB on GitHub CI; I just didn't see it).

The other issue is in GetMongoIT. We seem to overwrite FlowFile attributes with "environment variables". It looks to me like the following change makes it a little more stable. I didn't include it in the commit; maybe it's still useful:

diff --git a/nifi-nar-bundles/nifi-mongodb-bundle/nifi-mongodb-processors/src/test/java/org/apache/nifi/processors/mongodb/GetMongoIT.java b/nifi-nar-bundles/nifi-mongodb-bundle/nifi-mongodb-processors/src/test/java/org/apache/nifi/processors/mongodb/GetMongoIT.java
index 87a0ca3d56..58e9adea6f 100644
--- a/nifi-nar-bundles/nifi-mongodb-bundle/nifi-mongodb-processors/src/test/java/org/apache/nifi/processors/mongodb/GetMongoIT.java
+++ b/nifi-nar-bundles/nifi-mongodb-bundle/nifi-mongodb-processors/src/test/java/org/apache/nifi/processors/mongodb/GetMongoIT.java
@@ -502,10 +502,9 @@ public class GetMongoIT extends AbstractMongoIT {
             db.getCollection(collections[x])
                 .insertOne(new Document().append("msg", "Hello, World"));
 
-            Map<String, String> attrs = new HashMap<>();
-            attrs.put("db", dbs[x]);
-            attrs.put("collection", collections[x]);
-            runner.enqueue(query, attrs);
+            runner.setEnvironmentVariableValue("db", dbs[x]);
+            runner.setEnvironmentVariableValue("collection", collections[x]);
+            runner.enqueue(query);
             runner.run();
 
             db.drop();

@exceptionfactory left a comment (Contributor)

Thanks for the latest round of updates, I noted two more comments, and based on your response and updates, I think this should be ready to go.


…rite API for way more efficient mass updates or inserts - final review feedback
@sebastianrothbucher
Author

all right - addressed both final issues

@exceptionfactory left a comment (Contributor)

Thanks again for your patience and perseverance on this new Processor @sebastianrothbucher. After one more review, I noticed that the Ordered property can be set as required because it should always have a value, but I will make that adjustment when merging. The latest version looks good, thanks again! +1 merging

@sebastianrothbucher
Author

+1 as well, thanks!
