SingleStoreIO #23535

AdalbertMemSQL · 2022-10-07T10:40:36Z

Implemented SingleStoreIO according to this design doc:
https://docs.google.com/document/d/1WU-hkoZ93SaGXyOz_UtX0jXzIRl194hCId_IdmEV9jw/edit?usp=sharing
addresses #22617

Future pull requests are planned to make it possible to use the Read, ReadWithPartitions, and Write PTransforms without setting RowMapper and UserDataMapper, and to implement SchemaTransforms for them.

codecov · 2022-10-07T11:09:04Z

Codecov Report

Merging #23535 (ca842ff) into master (48c70cc) will increase coverage by 9.70%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #23535      +/-   ##
==========================================
+ Coverage   73.47%   83.18%   +9.70%     
==========================================
  Files         714      473     -241     
  Lines       96403    66952   -29451     
==========================================
- Hits        70828    55691   -15137     
+ Misses      24252    11261   -12991     
+ Partials     1323        0    -1323

Flag	Coverage Δ
go	`?`
python	`83.18% <ø> (-0.03%)`	⬇️

Impacted Files	Coverage Δ
...hon/apache_beam/runners/worker/worker_pool_main.py	`56.32% <0.00%> (-2.94%)`	⬇️
.../python/apache_beam/transforms/periodicsequence.py	`97.01% <0.00%> (-1.50%)`	⬇️
...apache_beam/typehints/native_type_compatibility.py	`85.16% <0.00%> (-0.37%)`	⬇️
sdks/python/apache_beam/typehints/typehints.py	`93.05% <0.00%> (-0.33%)`	⬇️
...hon/apache_beam/runners/worker/bundle_processor.py	`93.54% <0.00%> (-0.13%)`	⬇️
sdks/python/apache_beam/ml/inference/__init__.py	`100.00% <0.00%> (ø)`
...thon/apache_beam/ml/inference/pytorch_inference.py	`0.00% <0.00%> (ø)`
...pache_beam/typehints/pytorch_type_compatibility.py	`0.00% <0.00%> (ø)`
sdks/go/pkg/beam/core/graph/coder/bool.go
sdks/go/pkg/beam/runners/direct/buffer.go
... and 251 more

github-actions · 2022-10-26T14:26:09Z

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @kileys for label java.
R: @Abacn for label build.
R: @pabloem for label io.

Available commands:

stop reviewer notifications - opt out of the automated review tooling
remind me after tests pass - tag the comment author after tests pass
waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

AdalbertMemSQL · 2022-10-28T12:01:12Z

R: @kennknowles

github-actions · 2022-10-28T12:02:25Z

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

AdalbertMemSQL · 2022-10-28T12:03:43Z

R: @johnjcasey

AdalbertMemSQL · 2022-10-28T12:04:35Z

R: @TheNeuralBit

AdalbertMemSQL · 2022-10-31T10:20:44Z

retest this please

AdalbertMemSQL · 2022-10-31T10:35:06Z

Run CommunityMetrics PreCommit

AdalbertMemSQL · 2022-10-31T14:20:29Z

Run Python PreCommit

AdalbertMemSQL · 2022-10-31T14:33:35Z

Run Java PreCommit

AdalbertMemSQL · 2022-11-07T11:48:41Z

Run Java PreCommit

johnjcasey

Reviewed the core code. Looks really solid so far.

johnjcasey · 2022-10-28T18:22:33Z

sdks/java/io/singlestore/build.gradle

+    implementation project(path: ":sdks:java:core", configuration: "shadow")
+    implementation group: 'com.singlestore', name: 'singlestore-jdbc-client', version: '1.1.4'
+    implementation library.java.slf4j_api
+    implementation "org.apache.commons:commons-dbcp2:2.8.0"


Can we move these dependencies to BeamModulePlugin.groovy?

Do I understand correctly that you are proposing to add these libs to this map

// A map of maps containing common libraries used per language. To use:
// dependencies {
//   compile library.java.slf4j_api
// }
project.ext.library = [
  java : [

and use it here?

...io/singlestore/src/main/java/org/apache/beam/sdk/io/singlestore/DataSourceConfiguration.java

...java/io/singlestore/src/main/java/org/apache/beam/sdk/io/singlestore/ReadWithPartitions.java

johnjcasey · 2022-11-08T15:05:34Z

...java/io/singlestore/src/main/java/org/apache/beam/sdk/io/singlestore/ReadWithPartitions.java

+      DataSource dataSource = dataSourceConfiguration.getDataSource();
+      Connection conn = dataSource.getConnection();
+      try {
+        for (long partition = tracker.currentRestriction().getFrom();


This loop should probably not use partition++. In principle, it will always cause an error, because at some point tryClaim will be called on an invalid partition, either because it is outside the tracker's range or because it is beyond the total partition count.

The same loop appears in this blog post: https://beam.apache.org/blog/splittable-do-fn/

public void process(ProcessContext c, OffsetRangeTracker tracker) {
  for (long i = tracker.currentRestriction().getFrom(); tracker.tryClaim(i); ++i) {
    c.output(KV.of(c.element().getKey(), i));
  }
}

I thought that tryClaim would just return false when an invalid partition is provided.
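The claim loop under discussion can be illustrated without Beam on the classpath. The sketch below uses a hypothetical, simplified SimpleRangeTracker (not Beam's OffsetRangeTracker) and assumes the behavior the author describes: tryClaim returns false, rather than throwing, once the offset reaches the exclusive end of the range. Under that assumption the increment in the for-loop header is harmless, because the failed claim is what ends the loop.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for an offset-range tracker; not Beam's class. */
class SimpleRangeTracker {
  private final long from;
  private final long to; // exclusive upper bound
  private long lastAttempted;

  SimpleRangeTracker(long from, long to) {
    this.from = from;
    this.to = to;
    this.lastAttempted = from - 1;
  }

  long getFrom() {
    return from;
  }

  /** Returns true if the offset lies inside [from, to); false once the range is exhausted. */
  boolean tryClaim(long offset) {
    if (offset <= lastAttempted) {
      throw new IllegalArgumentException("claims must be strictly increasing");
    }
    lastAttempted = offset;
    return offset < to;
  }
}

class ClaimLoopSketch {
  /** Claims every offset in [from, to) with the same loop shape as the blog example. */
  static List<Long> claimAll(long from, long to) {
    SimpleRangeTracker tracker = new SimpleRangeTracker(from, to);
    List<Long> claimed = new ArrayList<>();
    // The final iteration attempts to claim `to`; tryClaim returns false and the loop exits.
    for (long partition = tracker.getFrom(); tracker.tryClaim(partition); partition++) {
      claimed.add(partition);
    }
    return claimed;
  }
}
```

With this stand-in, claimAll(2, 5) collects the offsets 2, 3, and 4, and an empty range such as claimAll(3, 3) collects nothing.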

sdks/java/io/singlestore/src/main/java/org/apache/beam/sdk/io/singlestore/Read.java

sdks/java/io/singlestore/src/main/java/org/apache/beam/sdk/io/singlestore/Util.java

sdks/java/io/singlestore/src/main/java/org/apache/beam/sdk/io/singlestore/Write.java

johnjcasey · 2022-11-08T15:13:07Z

sdks/java/io/singlestore/src/main/java/org/apache/beam/sdk/io/singlestore/Write.java

+
+          final Exception[] writeException = new Exception[1];
+
+          Thread dataWritingThread =


Avoid creating threads inside of dofns. In Beam, asynchronous and multithreading behavior is best handled by the Beam framework itself.

Hmm....
I don't know how to implement this asynchronous task using the Beam framework.
Here we have two parts - statement execution and writing data to the stream.
They are tightly coupled (writing will get stuck if the statement is not executed and the buffer is full and the statement execution won't finish until writing is finished).
And also these parts share a common PipedOutputStream which is not serializable.

Does the Beam framework have features that will allow something like this?

This is somewhat unusual for me as well. @lukecwik do we have a pattern for something like this?
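The coupling described in this thread (a producer that blocks when the buffer is full, and a consumer that only finishes once the producer closes the stream) can be sketched with plain JDK classes. This is an illustration of the pattern under discussion, not the PR's code: the class and method names are hypothetical, the reader loop stands in for statement execution, and the 16-byte pipe buffer is chosen only to force back-pressure.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class PipedWriteSketch {
  /** Streams rows through a pipe: one thread writes while the caller's thread consumes. */
  static List<String> transfer(List<String> rows) {
    // One-element array so the writer thread can hand its failure back to the caller.
    final Exception[] writeException = new Exception[1];
    List<String> received = new ArrayList<>();
    try {
      PipedOutputStream out = new PipedOutputStream();
      // Tiny buffer (assumption, for illustration): the writer blocks until the reader drains it.
      PipedInputStream in = new PipedInputStream(out, 16);

      Thread dataWritingThread =
          new Thread(
              () -> {
                // Closing the output stream is what signals end-of-data to the reader.
                try (PipedOutputStream o = out) {
                  for (String row : rows) {
                    o.write((row + "\n").getBytes(StandardCharsets.UTF_8));
                  }
                } catch (IOException e) {
                  writeException[0] = e;
                }
              });
      dataWritingThread.start();

      // Stand-in for statement execution: it cannot finish until the writer closes the pipe.
      try (BufferedReader reader =
          new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
          received.add(line);
        }
      }
      // join() also makes the writer's update to writeException visible to this thread.
      dataWritingThread.join();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    if (writeException[0] != null) {
      throw new IllegalStateException("data writing thread failed", writeException[0]);
    }
    return received;
  }
}
```

Because the two sides share a pipe that is not serializable and each side blocks on the other, both have to live inside the same call, which is why the pattern does not map cleanly onto separate pipeline steps.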

johnjcasey · 2022-11-11T20:26:21Z

This looks good to me outside of the threading, which I'm unsure about.

@chamikaramj and @Abacn, can you take a second look? As this is an entire IO, I'd like more pairs of eyes.

Abacn · 2022-11-11T20:37:57Z

Thanks @AdalbertMemSQL! Will take a look. For now, could you please integrate Read.java and Write.java into SingleStoreIO.java, following the pattern used by other IO connectors (e.g. JdbcIO)? Otherwise users may not be able to do something like

import org.apache.beam.sdk.io.singlestore.Read;
import org.apache.beam.sdk.io.singlestore.Write;

since there may be naming conflicts.
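A minimal sketch of the nesting pattern being requested, using hypothetical names (ExampleIO, withQuery, withTable, describe) rather than the PR's actual API: the transforms become static nested classes of a single connector class, so callers refer to ExampleIO.Read and ExampleIO.Write and never need to import a bare Read or Write that could collide with another connector's class.

```java
/** Hypothetical connector facade; names are illustrative and not taken from the PR. */
final class ExampleIO {
  private ExampleIO() {} // not instantiable; only a namespace for the nested transforms

  /** Factory for a read configuration, qualified by the connector name at the call site. */
  static Read read() {
    return new Read(null);
  }

  /** Factory for a write configuration. */
  static Write write() {
    return new Write(null);
  }

  /** Immutable read configuration built up through with* methods. */
  static final class Read {
    private final String query;

    private Read(String query) {
      this.query = query;
    }

    Read withQuery(String query) {
      return new Read(query);
    }

    String describe() {
      return "Read[" + query + "]";
    }
  }

  /** Immutable write configuration built up through with* methods. */
  static final class Write {
    private final String table;

    private Write(String table) {
      this.table = table;
    }

    Write withTable(String table) {
      return new Write(table);
    }

    String describe() {
      return "Write[" + table + "]";
    }
  }
}
```

A caller would then write ExampleIO.read().withQuery("SELECT 1") or ExampleIO.write().withTable("t"), with a single import for the connector class.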

Abacn · 2022-11-14T15:38:25Z

run seed job

Abacn · 2022-11-14T16:29:54Z

Run SQL PreCommit

Abacn · 2022-11-14T16:30:09Z

Run Seed Job

Abacn · 2022-11-14T16:34:38Z

Hi @AdalbertMemSQL, would you mind rebasing the branch onto the latest master so that I can run a seed job (to make beam_PerformanceTests_SingleStoreIO and Java_SingleStore_IO_Direct work)? Thanks.

Also, please consider adding an entry to https://github.com/apache/beam/blob/master/CHANGES.md

Abacn · 2022-11-14T22:43:33Z

Run Java_Spark3_Versions PreCommit

AdalbertMemSQL · 2022-11-14T22:48:09Z

I thought that SingleStore_IO_Direct would run unit tests.
But after your comment, I realized that it is running the same integration test.
Deleted it for now.

Abacn · 2022-11-14T22:57:21Z

Run Java SingleStoreIO Performance Test

Abacn · 2022-11-14T23:11:50Z

Note that the SingleStoreIO Performance Test actually runs on the direct runner. The direct runner is intended only for local debugging; it adds redundant validation work and is not performant. Could you please consider running the test on the Dataflow runner? There is a switch for this, switches("-DintegrationTestRunner=dataflow"); you can refer to other performance tests for examples.

AdalbertMemSQL · 2022-11-14T23:17:15Z

Sure :)

AdalbertMemSQL · 2022-11-15T05:56:29Z

@Abacn Can you now run the perf test?

Abacn · 2022-11-15T16:55:07Z

Run Seed Job

Abacn · 2022-11-15T18:31:17Z

Run Java SingleStoreIO Performance Test

Abacn

Besides the performance test, left some other comments here.

Abacn · 2022-11-15T18:34:14Z

sdks/java/io/singlestore/src/main/java/org/apache/beam/sdk/io/singlestore/SingleStoreUtil.java

+import org.slf4j.Logger;
+
+/** Provides utility functions for working with {@link SingleStoreIO}. */
+public class SingleStoreUtil {


Suggested change

public class SingleStoreUtil {

final class SingleStoreUtil {

We probably should not expose this as public API, following the practice of other IOs.

Abacn · 2022-11-15T18:35:26Z

CHANGES.md

@@ -60,6 +60,7 @@

 * Support for Bigtable sink (Write and WriteBatch) added (Go) ([#23324](https://github.com/apache/beam/issues/23324)).
 * S3 implementation of the Beam filesystem (Go) ([#23991](https://github.com/apache/beam/issues/23991)).
+* Support for SingleStoreDB source added (Java) ([#22617](https://github.com/apache/beam/issues/22617)).


Do you mean "SingleStoreDB Source and Sink"?

Abacn · 2022-11-15T18:46:54Z

.test-infra/jenkins/job_PerformanceTests_SingleStoreIO.groovy

+    singleStoreUsername : "admin",
+    singleStorePassword : "secretpass",
+    singleStorePort: "3306",
+    numberOfRecords: "100000",


Looks like the performance test is passing. The Write and then Read steps finish within seconds. To get a more meaningful metric, the pipeline should ideally run for several minutes. How about setting numberOfRecords to 10000000?

Sorry, I meant 10M records (100 times "100000").

The metrics shown in https://ci-beam.apache.org/view/PerformanceTests/job/beam_PerformanceTests_SingleStoreIO/6/console

13:54:32 Load test results for test (ID): 403b18ab-b3dc-475a-b4d9-0a77866159c8 and timestamp: 2022-11-15T18:54:32.477000000Z:
13:54:32 Metric: Value:
13:54:32 write_time 0.628
13:54:32 Load test results for test (ID): 403b18ab-b3dc-475a-b4d9-0a77866159c8 and timestamp: 2022-11-15T18:54:32.477000000Z:
13:54:33 Metric: Value:
13:54:33 read_time 4.433
13:54:33 Load test results for test (ID): 403b18ab-b3dc-475a-b4d9-0a77866159c8 and timestamp: 2022-11-15T18:54:32.477000000Z:
13:54:33 Metric: Value:
13:54:33 read_with_partitions_time 1.87

If performance is linear, 1M records would still take only seconds to write.

SingleStoreDB is pretty performant!

Wow, it is really pretty fast :)
Increased the number of rows to 10M.

Ah, sorry. The test row class has predefined hashes for validation, so 10000000 does not work, but 5000000 will do:

private static final ImmutableMap<Integer, String> EXPECTED_HASHES =
    ImmutableMap.of(
        1000, "7d94d63a41164be058a9680002914358",
        100_000, "c7cbddb319209e200f1c5eebef8fe960",
        600_000, "e2add2f680de9024e9bc46cd3912545e",
        5_000_000, "c44f8a5648cd9207c9c6f77395a998dc");
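To make the constraint concrete, here is a small JDK-only sketch of why only certain record counts are usable. The map values are copied from the snippet quoted above; the class name, the expectedHashFor helper, and the choice of exception are assumptions made for illustration and are not taken from the Beam source.

```java
import java.util.Map;

class ExpectedHashSketch {
  // Row count -> expected content hash, copied from the map quoted in this thread.
  private static final Map<Integer, String> EXPECTED_HASHES =
      Map.of(
          1000, "7d94d63a41164be058a9680002914358",
          100_000, "c7cbddb319209e200f1c5eebef8fe960",
          600_000, "e2add2f680de9024e9bc46cd3912545e",
          5_000_000, "c44f8a5648cd9207c9c6f77395a998dc");

  /** Hypothetical helper: returns the hash for a supported row count, else fails loudly. */
  static String expectedHashFor(int rowCount) {
    String hash = EXPECTED_HASHES.get(rowCount);
    if (hash == null) {
      throw new UnsupportedOperationException(
          "No expected hash is defined for row count " + rowCount);
    }
    return hash;
  }
}
```

With this lookup, a count of 5_000_000 resolves to a hash, while 10_000_000 fails because no hash was precomputed for it.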

Changed SingleStoreUtil to be final class. Changed CHANGES.md file. Increased number of rows in the integration test.

Abacn

One last comment from my side. Thanks!

Abacn · 2022-11-15T23:43:23Z

sdks/java/io/singlestore/src/test/java/org/apache/beam/sdk/io/singlestore/SingleStoreIOIT.java

+      PipelineResult writeResult = runWrite();
+      writeResult.waitUntilFinish();
+      PipelineResult readResult = runRead();
+      readResult.waitUntilFinish();


We should assert that the pipeline status for both write and read is DONE; we may also assert that the number of records read matches the number of records written (some basic validation). It looks like the test currently still passes if the pipeline fails remotely.

The number of records is checked here:

private void testReadResult(PCollection<TestRow> namesAndIds) {
  PAssert.thatSingleton(namesAndIds.apply("Count All", Count.globally()))
      .isEqualTo((long) numberOfRows);

(lines 248-250)

Will add asserts for pipeline status. Thanks!
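The status check requested here can be sketched as follows. In Beam, PipelineResult.waitUntilFinish() returns the terminal PipelineResult.State; to keep the example runnable without Beam, the State enum and Result interface below are hypothetical stand-ins that only mirror that shape, and requireDone is an illustrative helper name rather than anything from the PR.

```java
class PipelineStateSketch {
  /** Stand-in for a pipeline's terminal states (assumption: simplified for illustration). */
  enum State {
    RUNNING,
    DONE,
    FAILED,
    CANCELLED
  }

  /** Stand-in for a pipeline result whose waitUntilFinish() reports the terminal state. */
  interface Result {
    State waitUntilFinish();
  }

  /** Blocks until the pipeline finishes and fails unless it ended in DONE. */
  static void requireDone(String step, Result result) {
    State state = result.waitUntilFinish();
    if (state != State.DONE) {
      throw new AssertionError(step + " pipeline finished in state " + state);
    }
  }
}
```

Calling such a helper for the write result and again for the read result turns a remote pipeline failure into a test failure instead of a silent pass.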

Abacn · 2022-11-16T15:46:49Z

Run Java SingleStoreIO Performance Test

Abacn · 2022-11-16T15:47:30Z

Run Java_Examples_Dataflow_Java17 PreCommit

Abacn · 2022-11-16T16:58:14Z

Run Java SingleStoreIO Performance Test

Abacn · 2022-11-16T17:03:45Z

Run Java SingleStoreIO Performance Test

Abacn · 2022-11-16T18:12:46Z

Succeeded: https://ci-beam.apache.org/view/PerformanceTests/job/beam_PerformanceTests_SingleStoreIO/9/console

metrics:

13:10:34 org.apache.beam.sdk.io.singlestore.SingleStoreIOIT > testWriteThenRead STANDARD_OUT
13:10:34     Load test results for test (ID): fef7d33b-660b-4c77-a441-22fd9d6d548a and timestamp: 2022-11-16T18:10:34.119000000Z:
13:10:34                      Metric:                    Value:
13:10:34                   write_time                    25.008
13:10:34     Load test results for test (ID): fef7d33b-660b-4c77-a441-22fd9d6d548a and timestamp: 2022-11-16T18:10:34.119000000Z:
13:10:34                      Metric:                    Value:
13:10:34                    read_time                    74.979
13:10:34     Load test results for test (ID): fef7d33b-660b-4c77-a441-22fd9d6d548a and timestamp: 2022-11-16T18:10:34.119000000Z:
13:10:34                      Metric:                    Value:
13:10:34     read_with_partitions_time                    28.704

Abacn · 2022-11-16T18:17:20Z

sdks/java/io/singlestore/src/main/java/org/apache/beam/sdk/io/singlestore/SingleStoreIO.java

+ *    );
+ * }</pre>
+ */
+final class SingleStoreIO {


Oops, I believe this is a mistake. It should be final class SingleStoreUtil and public class SingleStoreIO.

Abacn

LGTM once the last typo gets resolved (see above)

Some followups about tests (in separate PRs)

Visualize the metrics in Grafana dashboard: http://104.154.241.245/d/bnlHKP3Wz/java-io-it-tests-dataflow?orgId=1
Set up integration tests with a local SingleStoreDB client (either a test client provided by SingleStore or a container)

Abacn · 2022-11-16T19:26:28Z

Let us ping @lukecwik and @chamikaramj again, who were mentioned in this thread, and see if they have any input. This will be a nice feature for Beam 2.44.0.

Abacn · 2022-11-18T20:25:01Z

ok let's get this in so tests are continuously exercised

Summary: Implemented SingleStoreIO according to this design doc: https://docs.google.com/document/d/1WU-hkoZ93SaGXyOz_UtX0jXzIRl194hCId_IdmEV9jw/edit?usp=sharing Changed SingleStoreUtil to be final class. Changed CHANGES.md file. Increased number of rows in the integration test.

github-actions bot added build infra io java jdbc labels Oct 7, 2022

AdalbertMemSQL force-pushed the SingleStoreIO branch from 5c85399 to 4952064 Compare October 7, 2022 10:42

github-actions bot removed the jdbc label Oct 7, 2022

AdalbertMemSQL marked this pull request as ready for review October 26, 2022 14:24

github-actions bot added the Next Action: Reviewers label Oct 26, 2022

AdalbertMemSQL force-pushed the SingleStoreIO branch from 1de0def to 50b669e Compare October 31, 2022 12:41

johnjcasey requested changes Nov 8, 2022

AdalbertMemSQL requested a review from johnjcasey November 10, 2022 10:02

Deleted SingleStore_IO_Direct test

d0364bf

Changed to use dataflow runner instead of direct runner

aa6f893

AdalbertMemSQL added 2 commits November 15, 2022 01:20

Added runner in the pipelineOptions to be consistent with other tests

ef37828

Added tempRoot and project to pipelineOptions

9db5423

Abacn reviewed Nov 15, 2022

AdalbertMemSQL added 2 commits November 16, 2022 00:29

Resolved comments

a8bdd02

Changed SingleStoreUtil to be final class. Changed CHANGES.md file. Increased number of rows in the integration test.

Increased number of records in the integration test to 10M

05a4b6a

Abacn reviewed Nov 15, 2022

Added asserts that pipeline status is DONE

b25f096

Set supported numberOfRecords for hash

3680dd4

Abacn reviewed Nov 16, 2022

Abacn approved these changes Nov 16, 2022

Change SingleStoreIO to be public and SingleStoreUtil to be final

ca842ff

Abacn merged commit 0265634 into apache:master Nov 18, 2022

