Add Delta synchronizer for GCS #12523
Conversation
}

private static final class PassthroughGoogleCredential
        extends GoogleCredential
GoogleCredential is obsolete (see https://cloud.google.com/java/docs/reference/google-api-client/latest/com.google.api.client.googleapis.auth.oauth2.GoogleCredential). However, the current version of the hadoop-connectors library still depends on Credential for instantiating the RetryHttpInitializer. With an upgrade of hadoop-connectors to a more recent version containing GoogleCloudDataproc/hadoop-connectors#704, the implementation would no longer need to extend an obsolete class.
Force-pushed from 7d1c2ba to c56056e
Force-pushed from c56056e to d4423ba
Force-pushed from 4c8f31a to f53717b
Force-pushed from f53717b to 0dbd94f
Force-pushed from 0dbd94f to 0d29de7
Force-pushed from 0d29de7 to b9eb669
Sent #13895 within the repository.
Force-pushed from b9eb669 to 934e3a4
I've discovered locally that a few overridden tests were failing because DML was unsupported on GCS. I've removed them. cc: @ebyhr
Force-pushed from 934e3a4 to e09d8dd
Just skimmed.
Force-pushed from f7f7f78 to 1ae12f6
this.gcsStorageFactory = requireNonNull(gcsStorageFactory, "gcsStorageFactory is null");
}

// This approach should be compatible with OSS Delta Lake.
Yes, it "should be". Is it?
Either remove the comment, or replace with one stating the fact.
You can also link to a specific file in the Delta project that implements the same logic there.
Removing the comment. This is a left-over from the initial draft, which partly copied code from AzureTransactionLogSynchronizer.
Force-pushed from 1ae12f6 to 86968a7
I slightly changed the way the …
Force-pushed from 9d60ca1 to beeb9b4
@ebyhr / @findepi I made the necessary modifications to run the Delta GCS tests in a separate job, in order to reduce the risk of running into a timeout caused, presumably, by cross-cloud Azure -> GCS operations. Could you please update the corresponding trino PR #13895? Thank you in advance.
Could you fix the failure below?
Force-pushed from 44f6b3c to 5e7eb20
Given that the Delta Lake tests on GCS failed with OOM (https://github.com/trinodb/trino/actions/runs/3180848245/jobs/5187102248), let's try running stress tests for a few days (running the delta-lake tests with the profile …
Force-pushed from 5eed40c to dfd6c1b
The implementation explicitly instantiates a `Storage` Cloud Storage API client in a similar fashion as is done within the `hadoop-connectors` library. The client is used for fine-tuning an API `Insert` call so that the blob corresponding to the transaction log file can be created. The creation of the blob succeeds if and only if there are no live versions of the object. This approach is a workaround for a limitation of the `hadoop-connectors` GCS library, which exposes only output streams for creating a blob. When an I/O exception occurs mid-write, the output stream can't be closed, because doing so would expose the blob in a corrupted state on GCS. Effectively, in such a situation the output stream would need to be intentionally leaked, i.e. not closed, which can lead to multiple unforeseen consequences.
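The contract described above (in GCS terms, an insert that carries the `ifGenerationMatch=0` precondition, so it succeeds only when no live version of the object exists) can be illustrated with a self-contained sketch. The `GcsLikeStore` class and all names in it are hypothetical stand-ins for the real Cloud Storage API, not code from this PR:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical in-memory stand-in for a GCS bucket, illustrating the
// precondition the synchronizer relies on: the insert succeeds only if
// no live version of the object exists (ifGenerationMatch = 0).
class GcsLikeStore {
    private final ConcurrentMap<String, byte[]> liveObjects = new ConcurrentHashMap<>();

    // Atomic create-if-absent; returns false when a live version already
    // exists, which a transaction log synchronizer treats as "this log
    // entry was already written by a concurrent writer".
    boolean insertIfNoLiveVersion(String blobName, byte[] content) {
        return liveObjects.putIfAbsent(blobName, content) == null;
    }
}

class Demo {
    static boolean firstWrite;
    static boolean conflictingWrite;

    public static void main(String[] args) {
        GcsLikeStore store = new GcsLikeStore();
        String logEntry = "_delta_log/00000000000000000001.json";
        firstWrite = store.insertIfNoLiveVersion(logEntry, new byte[] {1});
        // A second writer racing on the same transaction log entry loses:
        conflictingWrite = store.insertIfNoLiveVersion(logEntry, new byte[] {2});
        System.out.println(firstWrite + " " + conflictingWrite); // prints "true false"
    }
}
```

The key property is that the conflict check and the creation are a single atomic operation on the server side, so two concurrent writers of the same log entry can never both succeed.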
The GitHub runner runs on the Azure cloud, and cross-cloud operations against a Google Cloud Storage bucket may occasionally incur increased latency. Run the GCS Delta Lake tests in a dedicated job in order to have enough buffer for the GCS integration tests and avoid a job timeout.
Use the `trino-ci-test` multi-region GCS bucket for testing the GCS functionality of the `trino-delta-lake` connector.
Force-pushed from dfd6c1b to cf3fc1c
Force-pushed from cf3fc1c to dbeed72
Removed the temporary change.
Unrelated CI hit for …
Description
The implementation explicitly instantiates a Storage Cloud Storage API client in a similar fashion as is done within the hadoop-connectors library. The client is used for fine-tuning an API Insert call so that the blob corresponding to the transaction log file can be created. The creation of the blob succeeds if and only if there are no live versions of the object. This approach is a workaround for a limitation of the hadoop-connectors GCS library, which exposes only output streams for creating a blob. When an I/O exception occurs mid-write, the output stream can't be closed, because doing so would expose the blob in a corrupted state on GCS. Effectively, in such a situation the output stream would need to be intentionally leaked, i.e. not closed, which can lead to multiple unforeseen consequences.
Improvement
Delta Lake connector
Add the ability to perform DML statements on tables backed by Google Cloud Storage
Related issues, pull requests, and links
PR where the Storage instance is retrieved via reflection from the hadoop-connectors library: #12309
Fixes: #12264
Documentation
(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
( ) No release notes entries required.
( ) Release notes entries required with the following suggested text: