GCP: Support resumable download and custom buffering size in pipe cli #475

Merged 3 commits on Jul 2, 2019

@tcibinan tcibinan commented Jul 1, 2019

Relates to #435.

Add support for resumable blob downloads and a custom buffering size in pipe cli for Google Cloud Storage. See the original issue for more details.

Blob downloads from Google Cloud Storage are now resumable: if a low-level connection error occurs, the interrupted download is resumed from the current position rather than restarted.
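The resume behaviour can be sketched roughly as follows. This is a minimal illustration, not the code from this pull request: `FlakySource`, `read_range` and `resumable_download` are hypothetical names, and the remote blob is simulated in memory so the example runs without network access. It assumes the attempt counter is reset whenever progress is made, which matches the "sequential resumes" wording of the limit.

```python
import io


class FlakySource:
    """Hypothetical stand-in for a remote blob that sometimes drops the connection."""

    def __init__(self, data, fail_on_calls=()):
        self._data = data
        self._fail_on_calls = set(fail_on_calls)
        self._calls = 0

    def read_range(self, start, end):
        """Return bytes [start, end) or raise ConnectionError to simulate a reset."""
        self._calls += 1
        if self._calls in self._fail_on_calls:
            raise ConnectionError("connection reset by peer")
        return self._data[start:end]


def resumable_download(source, sink, total_size, chunk_size=4, max_attempts=5):
    """Copy total_size bytes from source to sink, resuming after connection errors."""
    position = 0
    sequential_failures = 0
    while position < total_size:
        end = min(position + chunk_size, total_size)
        try:
            chunk = source.read_range(position, end)
        except ConnectionError:
            sequential_failures += 1
            if sequential_failures > max_attempts:
                raise  # too many resumes in a row, give up
            continue  # resume from the same position; nothing is re-downloaded
        if not chunk:
            raise IOError("source ended before total_size bytes were read")
        sink.write(chunk)
        position += len(chunk)
        sequential_failures = 0  # progress was made, so the counter starts over
    return position


if __name__ == "__main__":
    payload = b"0123456789"
    target = io.BytesIO()
    # The second and third requests fail; the download still completes.
    resumable_download(FlakySource(payload, fail_on_calls=[2, 3]), target, len(payload))
    print(target.getvalue() == payload)
```

With a real storage client the `read_range` call would be a ranged request starting at the number of bytes already written.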

The buffer size used when writing downloaded files has been increased from the system default (usually 8192 B) to 1 MB.
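As an illustration of what a larger write buffer means in Python, the sketch below passes an explicit `buffering` value to the built-in `open`. The helper name `write_chunks` and the constant are hypothetical, and whether pipe cli sets the buffer this way is an assumption.

```python
import os
import tempfile

# 1 MB, the new default described above (assumed here to mean 1024 * 1024 bytes).
BUFFERING_SIZE = 1024 * 1024


def write_chunks(path, chunks, buffering=BUFFERING_SIZE):
    """Write an iterable of byte chunks to path through a buffer of the given size."""
    written = 0
    with open(path, "wb", buffering=buffering) as f:
        for chunk in chunks:
            # Data accumulates in the buffer and reaches the file system
            # in larger writes instead of one small write per chunk.
            written += f.write(chunk)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "blob.bin")
        total = write_chunks(path, (b"x" * 8192 for _ in range(256)))
        print(total, os.path.getsize(path))
```

Fewer, larger writes tend to matter most on network-backed file systems, which is consistent with the shared file system row in the download benchmark.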

Two environment variables were added. Both can be used to override the default pipe cli configuration:

  • CP_CLI_DOWNLOAD_BUFFERING_SIZE - buffering size for file system flushing. Defaults to 1 MB.
  • CP_CLI_RESUMABLE_DOWNLOAD_ATTEMPTS - maximum number of sequential download resumes before failing. Defaults to 5 attempts.
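A sketch of how such overrides might be read, using the two variable names listed above. The helper names are hypothetical, and two details are assumptions not stated in this pull request: that the buffering size is given in bytes, and that missing or invalid values fall back to the defaults.

```python
import os

DEFAULT_BUFFERING_SIZE = 1024 * 1024  # 1 MB; unit of the variable assumed to be bytes
DEFAULT_RESUME_ATTEMPTS = 5


def read_positive_int(name, default, environ=None):
    """Return the variable as a positive int, or default if unset or invalid."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None or not raw.strip():
        return default
    try:
        value = int(raw)
    except ValueError:
        return default  # assumption: invalid values are ignored, not fatal
    return value if value > 0 else default


def load_download_settings(environ=None):
    """Collect both download settings into a plain dictionary."""
    return {
        "buffering_size": read_positive_int(
            "CP_CLI_DOWNLOAD_BUFFERING_SIZE", DEFAULT_BUFFERING_SIZE, environ),
        "resume_attempts": read_positive_int(
            "CP_CLI_RESUMABLE_DOWNLOAD_ATTEMPTS", DEFAULT_RESUME_ATTEMPTS, environ),
    }


if __name__ == "__main__":
    print(load_download_settings())
```

Passing a dictionary as `environ` makes the lookup easy to exercise without touching the real process environment.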

Performance benchmarks

Instance type: n1-standard-32
Instance disk: 500 GB
Instance and storage region: europe-west1

Download

| File system | Size | Before, s | After, s | gsutil, s |
|---|---|---|---|---|
| Local | 10 GB | 74 | 78 | 71 |
| Shared | 10 GB | 435 | 102 | 405 |

Upload

| File system | Size | Before, s | After, s | gsutil, s |
|---|---|---|---|---|
| Local | 10 GB | 154 | 134 | 95 |
| Shared | 10 GB | 144 | 130 | 98 |
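To read the timings above as ratios, the following sketch divides each "Before" time by the matching "After" time. The numbers are copied from the two tables; the dictionary layout and the `speedup` helper are illustrative only.

```python
# (before_s, after_s) pairs taken from the benchmark tables above.
TIMINGS = {
    ("download", "local"): (74, 78),
    ("download", "shared"): (435, 102),
    ("upload", "local"): (154, 134),
    ("upload", "shared"): (144, 130),
}


def speedup(before_s, after_s):
    """Ratio of old to new duration; values above 1 mean the new code is faster."""
    return before_s / after_s


if __name__ == "__main__":
    for (operation, file_system), (before_s, after_s) in TIMINGS.items():
        print(f"{operation:8} {file_system:6} {speedup(before_s, after_s):.2f}x")
```

The shared file system download is the case that changes most, at roughly 4.3 times faster, while the local download is slightly slower than before.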

tcibinan added 3 commits July 1, 2019 17:33
Resumable downloading of Google storage blobs resolves the connection
reset issue, and custom buffering improves overall download
performance. Both changes are described in detail in the corresponding
issue: #435.
@tcibinan tcibinan marked this pull request as ready for review July 2, 2019 12:48
@tcibinan tcibinan requested a review from mzueva July 2, 2019 12:48
@mzueva mzueva merged commit 094016d into develop Jul 2, 2019
@tcibinan tcibinan deleted the issue_435-gcp-cli-shared-fs-performance branch July 2, 2019 13:41
evgeniimv pushed a commit to evgeniimv/cloud-pipeline that referenced this pull request Oct 14, 2019
…epam#475)

* Add resumable downloads and custom buffering size for GCP blobs.

Resumable downloading of Google storage blobs resolves the connection
reset issue, and custom buffering improves overall download
performance. Both changes are described in detail in the corresponding
issue: epam#435.

* Add checksum validation for GCP resumable downloads

* Fix and refactor google storage downloading classes