Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Delta synchronizer for GCS #12523

Conversation

findinpath
Copy link
Contributor

@findinpath findinpath commented May 24, 2022

Description

The implementation instantiates explicitly Storage
Cloud Storage API client in a similar fashion as it is
done withing hadoop-connectors library.

The client is used for fine-tuning an API Insert call
in order to be able to create the blob corresponding to
the transaction log file.
the creation of the blob will succeed only
The creation of the blob will succeed if and only if
there are no live versions of the object.

This approach is a workaround for the limitations of the
hadoop-connectors GCS library which is exposing only
output streams when in need to create a blob.
In case of dealing mid-write of the blob with I/O exceptions,
the output stream can't be closed because this action would
expose the blob in a corrupted state on GCS. Effectively in
such a situation, the output stream would need to be intentionally
leaked - not closed, which can lead to multiple unforeseen
consequences.

Is this change a fix, improvement, new feature, refactoring, or other?

Improvement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Delta Lake connector

How would you describe this change to a non-technical end user or system administrator?

Add the ability to perform DML statements of tables backed by Google Cloud Storage

Related issues, pull requests, and links

PR where the Storage instance is retrieved via reflection from hadoop-connectors library: #12309

Fixes: #12264

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Delta Lake
* Add support for DML statements on tables backed by GCS

@cla-bot cla-bot bot added the cla-signed label May 24, 2022
@findinpath findinpath requested a review from findepi May 24, 2022 06:21
@findinpath findinpath requested a review from alexjo2144 May 24, 2022 06:21
}

private static final class PassthroughGoogleCredential
extends GoogleCredential
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GoogleCredential is obsolete (see https://cloud.google.com/java/docs/reference/google-api-client/latest/com.google.api.client.googleapis.auth.oauth2.GoogleCredential).

The current version of hadoop-connectors library depends however on Credential for instantiating the RetryHttpInitializer.
With an upgrade of hadoop-connectors to a more recent version containing GoogleCloudDataproc/hadoop-connectors#704 the implementation wouldn't need to extend from an obsolete class.

@findinpath findinpath requested a review from ebyhr May 24, 2022 06:33
@findinpath findinpath force-pushed the delta-gcs-log-synchronizer-explicit-storage-instantiation branch from 7d1c2ba to c56056e Compare May 25, 2022 10:06
@findinpath findinpath force-pushed the delta-gcs-log-synchronizer-explicit-storage-instantiation branch from c56056e to d4423ba Compare June 13, 2022 14:27
@findinpath findinpath force-pushed the delta-gcs-log-synchronizer-explicit-storage-instantiation branch 2 times, most recently from 4c8f31a to f53717b Compare June 24, 2022 14:05
@findinpath findinpath force-pushed the delta-gcs-log-synchronizer-explicit-storage-instantiation branch from f53717b to 0dbd94f Compare August 16, 2022 13:35
@findinpath findinpath changed the title Add Delta synchronizer for GCS - Storage is initiated explicitly Add Delta synchronizer for GCS Aug 16, 2022
@findinpath findinpath force-pushed the delta-gcs-log-synchronizer-explicit-storage-instantiation branch from 0dbd94f to 0d29de7 Compare August 16, 2022 13:57
@findinpath findinpath force-pushed the delta-gcs-log-synchronizer-explicit-storage-instantiation branch from 0d29de7 to b9eb669 Compare August 29, 2022 09:44
@findinpath
Copy link
Contributor Author

@findepi / @ebyhr could you please run the PR with secrets?

@ebyhr
Copy link
Member

ebyhr commented Aug 29, 2022

Sent #13895 within the repository.

@findinpath findinpath force-pushed the delta-gcs-log-synchronizer-explicit-storage-instantiation branch from b9eb669 to 934e3a4 Compare August 29, 2022 10:28
@findinpath
Copy link
Contributor Author

I've discovered locally a few overridden tests failing because DML was unsupported on GS. I've removed them so that TestDeltaLakeGcsConnectorSmokeTest is hopefully green.

cc: @ebyhr

@findinpath findinpath force-pushed the delta-gcs-log-synchronizer-explicit-storage-instantiation branch from 934e3a4 to e09d8dd Compare August 30, 2022 14:46
Copy link
Member

@ebyhr ebyhr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just skimmed.

@findinpath findinpath force-pushed the delta-gcs-log-synchronizer-explicit-storage-instantiation branch 3 times, most recently from f7f7f78 to 1ae12f6 Compare August 31, 2022 10:20
this.gcsStorageFactory = requireNonNull(gcsStorageFactory, "gcsStorageFactory is null");
}

// This approach should be compatible with OSS Delta Lake.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it "should be". Is it?
Either remove the comment, or replace with one stating the fact.
You can also link to a specific file in Delta project that implements same logic there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the comment
This is a left-over from the initial draft which copied partly code from AzureTransactionLogSynchronizer

testing/trino-faulttolerant-tests/pom.xml Outdated Show resolved Hide resolved
@findinpath findinpath force-pushed the delta-gcs-log-synchronizer-explicit-storage-instantiation branch from 1ae12f6 to 86968a7 Compare September 5, 2022 15:20
@findinpath
Copy link
Contributor Author

I changed slightly the way the httpTransport is being created in io.trino.plugin.deltalake.GcsStorageFactory#create in order to depend (similarly to the hadoop-connectors implementation) on the Configuration content in order to allow eventually its customization.

@findinpath findinpath force-pushed the delta-gcs-log-synchronizer-explicit-storage-instantiation branch from 9d60ca1 to beeb9b4 Compare September 27, 2022 09:50
@findinpath
Copy link
Contributor Author

@ebyhr / @findepi I made the necessary modifications to run the Delta GCS tests in a separate job in order to reduce the risk of running into an eventual timeout caused by, assumingly, cross cloud Azure -> GCS operations. Could you please run update the corresponding trino PR #13895 ? Thank you in advance.

@ebyhr
Copy link
Member

ebyhr commented Sep 28, 2022

Could you fix below failure?

Error:  Parameter 'testing.gcp-storage-bucket' is required by @Parameters on method public io.trino.plugin.deltalake.TestDeltaLakeGcsConnectorSmokeTest(java.lang.String,java.lang.String) but has not been marked @Optional or defined

@findinpath findinpath force-pushed the delta-gcs-log-synchronizer-explicit-storage-instantiation branch 8 times, most recently from 44f6b3c to 5e7eb20 Compare October 3, 2022 15:27
@findinpath
Copy link
Contributor Author

Given that the Delta Lake tests on GCS failed with OOM

Error:  java.lang.OutOfMemoryError: Java heap space
Error:  Terminating due to java.lang.OutOfMemoryError: Java heap space
...
Error:  Command was /bin/sh -c cd /home/runner/work/trino/trino/plugin/trino-delta-lake && /opt/hostedtoolcache/Java_Zulu_jdk/17.0.4-8/x64/bin/java -Duser.timezone=America/Bahia_Banderas -Duser.language=en -Duser.region=US -Dfile.encoding=UTF-8 -Xmx3g -Xms3g -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:-OmitStackTraceInFastThrow -jar /home/runner/work/trino/trino/plugin/trino-delta-lake/target/surefire/surefirebooter18199502927712579772.jar /home/runner/work/trino/trino/plugin/trino-delta-lake/target/surefire 2022-10-04T13-11-47_658-jvmRun1 surefire6324101022156948381tmp surefire_08238049524145815535tmp
Error:  Error occurred in starting fork, check output in log
Error:  Process Exit Code: 3

https://github.com/trinodb/trino/actions/runs/3180848245/jobs/5187102248

let's try running for a few days stress tests (running delta-lake tests with the profile gcs-tests) on GCS infrastructure and gather the results.

@findinpath findinpath force-pushed the delta-gcs-log-synchronizer-explicit-storage-instantiation branch 2 times, most recently from 5eed40c to dfd6c1b Compare October 13, 2022 19:09
The implementation instantiates explicitly `Storage`
Cloud Storage API client in a similar fashion as it is
done withing `hadoop-connectors` library.

The client is used for fine-tuning an API `Insert` call
in order to be able to create the blob corresponding to
the transaction log file.
The creation of the blob will succeed if and only if
there are no live versions of the object.

This approach is a workaround for the limitations of the
`hadoop-connectors` GCS library which is exposing only
output streams when in need to create a blob.
In case of dealing mid-write of the blob with I/O exceptions,
the output stream can't be closed because this action would
expose the blob in a corrupted state on GCS. Effectively in
such a situation, the output stream would need to be intentionally
leaked - not closed, which can lead to multiple unforeseen
consequences.
The GitHub runner is running on the Azure cloud and
doing cross cloud operations against a Google Cloud Storage
bucket may incur from time to time increased latency.

Run the GCS Delta Lake tests in a dedicated job in order
to have enough buffer to run the GCS integration tests
and avoid an eventual job timeout.
Use `trino-ci-test` multi region GCS bucket
in testing GCS functionality of the
`trino-delta-lake` connector.
@findinpath findinpath force-pushed the delta-gcs-log-synchronizer-explicit-storage-instantiation branch from dfd6c1b to cf3fc1c Compare October 14, 2022 12:32
@findepi
Copy link
Member

findepi commented Oct 17, 2022

@findepi findepi force-pushed the delta-gcs-log-synchronizer-explicit-storage-instantiation branch from cf3fc1c to dbeed72 Compare October 17, 2022 10:58
@findepi
Copy link
Member

findepi commented Oct 17, 2022

Removed the temporary change !temporary Run only delta lake related tests

@findinpath
Copy link
Contributor Author

Unrelated CI hit for trino-kudu

TestKuduConnectorTest.testAggregation may relate to #14706

@findepi findepi merged commit 0ccc025 into trinodb:master Oct 25, 2022
@findepi findepi mentioned this pull request Oct 25, 2022
@github-actions github-actions bot added this to the 401 milestone Oct 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

Support Delta DML operations on Google Cloud Storage
6 participants