GCP: CP CLI poor download performance #435

Closed
tcibinan opened this issue Jun 24, 2019 · 3 comments · Fixed by #438
Labels
cloud/gcp Issues related to the GCP integration kind/enhancement New feature or request sys/cli Issues related to the pipe cli

@tcibinan
Contributor

tcibinan commented Jun 24, 2019

Version: 0.16.0.1477.64ff1d341960a21d1839a378253e963de023aebc

Originally, Cloud Pipeline used the simple download strategy for all file downloads, since it is the default in the google-cloud-storage library. Later, a chunked download strategy was introduced in #253.

It turned out that the chunked download strategy severely degrades download performance and therefore has to be replaced with the original simple download strategy.

Example copy times with the simple and chunked download strategies, compared with gsutil:

| Size | Simple | Chunked | gsutil |
|---|---|---|---|
| ~8 GB | 1m 28s | 5m 8s | 1m 59s |

@tcibinan tcibinan added kind/enhancement New feature or request cloud/gcp Issues related to the GCP integration sys/cli Issues related to the pipe cli state/underway Issues that are currently being solved/implemented labels Jun 24, 2019
@tcibinan tcibinan self-assigned this Jun 24, 2019
@sidoruka sidoruka removed priority/high Issues to implement before others labels Jun 24, 2019
@sidoruka sidoruka added this to the v0.16 milestone Jun 24, 2019
@mzueva mzueva reopened this Jun 25, 2019
@mzueva mzueva added state/verify Issues that are already addressed and require validation and removed state/underway Issues that are currently being solved/implemented labels Jun 25, 2019
@tcibinan
Contributor Author

Version: 0.16.0.1501.07b4552a3d8dd1c5bf169f6b77c2165084589683

Performance of the downloading and uploading operations for the same amount of data before and after the changes is shown below.

| Size | Operation | Before | After |
|---|---|---|---|
| ~8 GB | Download | 5m 8s | 1m 28s |
| ~8 GB | Upload | 4m 51s | 2m 46s |
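
As a rough reading of the timings above, the speedup of each operation can be computed directly. The snippet is only arithmetic on the reported numbers; the helper names `to_seconds` and `speedup` are illustrative and are not part of pipe cli.

```python
def to_seconds(minutes, seconds):
    """Convert a 'Xm Ys' timing into plain seconds."""
    return minutes * 60 + seconds


def speedup(before_s, after_s):
    """How many times faster the operation became."""
    return before_s / after_s


# Timings reported in the comment for the same ~8 GB of data.
download_speedup = speedup(to_seconds(5, 8), to_seconds(1, 28))
upload_speedup = speedup(to_seconds(4, 51), to_seconds(2, 46))

print("download: %.2fx faster" % download_speedup)
print("upload: %.2fx faster" % upload_speedup)
```

By this measure the download became about 3.5 times faster and the upload about 1.75 times faster.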

Pipe cli integration tests were performed and no new failures were found.

@tcibinan tcibinan removed the state/verify Issues that are already addressed and require validation label Jun 25, 2019
@tcibinan tcibinan reopened this Jul 1, 2019
@tcibinan
Contributor Author

tcibinan commented Jul 1, 2019

Version: 0.16.0.1576.21a4f71dd58b9af5511545bc6e758af46486f3db

After further investigation, several issues with Google Cloud Storage support in pipe cli were found. It turned out that the google-cloud-storage library, which pipe cli uses to interact with Google Cloud Storage, requires additional tuning. Otherwise, overall performance remains poor in several specific cases.

Buffering size

First of all, the simple download operation uses an 8 KB buffer when writing blob content to the local file system. This means the file system is called for every 8 KB of downloaded data. This approach performs extremely poorly if the destination disk is slow, and even worse if the destination file system is shared.

To improve pipe cli performance, the buffer size can be increased. As a simple heuristic, a size of 1 MB can be used.
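
To illustrate why the buffer size matters, here is a minimal sketch that counts how many write calls a stream copy issues for a given buffer size. It does not use google-cloud-storage at all: `CountingWriter`, `copy_stream` and `count_writes` are hypothetical names introduced only for this example, and an in-memory stream stands in for the blob.

```python
import io

KB = 1024
MB = 1024 * KB


class CountingWriter:
    """Stand-in for a destination file that counts write() calls."""

    def __init__(self):
        self.calls = 0
        self.size = 0

    def write(self, data):
        self.calls += 1
        self.size += len(data)
        return len(data)


def copy_stream(source, destination, buffer_size):
    """Copy source to destination, one write per buffer_size bytes read."""
    while True:
        chunk = source.read(buffer_size)
        if not chunk:
            break
        destination.write(chunk)


def count_writes(payload_size, buffer_size):
    """Return the number of write calls needed to copy payload_size bytes."""
    sink = CountingWriter()
    copy_stream(io.BytesIO(bytes(payload_size)), sink, buffer_size)
    return sink.calls


small_buffer_calls = count_writes(4 * MB, 8 * KB)
large_buffer_calls = count_writes(4 * MB, 1 * MB)
print(small_buffer_calls, large_buffer_calls)
```

With an 8 KB buffer, a 4 MB payload takes 512 writes, while a 1 MB buffer takes 4. On a slow or shared file system, each of those calls carries its own latency, which is presumably where the difference in download time comes from.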

Connection resets

Nevertheless, the buffer size cannot simply be changed, because there is a deeper problem in google-cloud-storage that occurs far more often with a custom buffer size. There is a nonzero chance of getting a connection reset error on any download operation, and probably on upload operations too, with or without a custom buffer size. A similar issue was resolved in the Node.js client using a retry mechanism, but the Python client does not support such behavior yet.

To resolve the connection reset issue, a resumable download mechanism can be introduced.
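
The idea of a resumable download can be sketched as follows. This is not the pipe cli implementation: `resumable_download`, `read_range` and `FlakySource` are hypothetical names, and the range reader is simulated in memory rather than backed by google-cloud-storage. The sketch assumes the source can return an arbitrary byte range, so that after a connection reset the download continues from the last written offset instead of starting over.

```python
import io


def resumable_download(read_range, total_size, destination,
                       chunk_size=1024 * 1024, max_attempts=5):
    """Download total_size bytes via read_range(start, end), resuming on resets.

    read_range is a hypothetical callable returning the bytes in [start, end).
    max_attempts is the total number of connection resets tolerated.
    """
    offset = 0
    attempts_left = max_attempts
    while offset < total_size:
        end = min(offset + chunk_size, total_size)
        try:
            data = read_range(offset, end)
        except ConnectionResetError:
            attempts_left -= 1
            if attempts_left <= 0:
                raise
            continue  # retry the same range; already written bytes are kept
        if not data:
            raise IOError("source returned no data at offset %d" % offset)
        destination.write(data)
        offset += len(data)
    return offset


class FlakySource:
    """Simulated blob that resets the connection on selected calls."""

    def __init__(self, payload, fail_on_calls):
        self.payload = payload
        self.fail_on_calls = set(fail_on_calls)
        self.calls = 0

    def read_range(self, start, end):
        self.calls += 1
        if self.calls in self.fail_on_calls:
            raise ConnectionResetError("connection reset by peer")
        return self.payload[start:end]


payload = bytes(range(256)) * 40  # 10240 bytes of test data
source = FlakySource(payload, fail_on_calls={2, 5})
output = io.BytesIO()
downloaded = resumable_download(source.read_range, len(payload), output,
                                chunk_size=1024)
print(downloaded, source.calls)
```

In this run the ten 1024-byte ranges take twelve calls, because two of them hit a simulated reset and are retried from the same offset. A real implementation would also want to validate the result, for example with a checksum.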

tcibinan added a commit that referenced this issue Jul 1, 2019
Resumable downloading of the google storage blobs resolves connection
reset issue and custom buffering increases an overall download
performance. Both changes are described in details in the corresponding
issue: #435.
mzueva pushed a commit that referenced this issue Jul 2, 2019
…#475)

* Add resumable downloads and custom buffering size for GCP blobs.

Resumable downloading of the google storage blobs resolves connection
reset issue and custom buffering increases an overall download
performance. Both changes are described in details in the corresponding
issue: #435.

* Add checksum validation for GCP resumable downloads

* Fix and refactor google storage downloading classes
tcibinan added a commit that referenced this issue Jul 2, 2019
Use buffering size as download chunk size in order to improve overall
google storage blobs download performance. Also increase default
download resume attempts to bypass the connection reset issue described
in #435 for most of the possible cases.
mzueva pushed a commit that referenced this issue Jul 2, 2019
Use buffering size as download chunk size in order to improve overall
google storage blobs download performance. Also increase default
download resume attempts to bypass the connection reset issue described
in #435 for most of the possible cases.
@tcibinan
Contributor Author

tcibinan commented Jul 4, 2019

Download performance looks reasonable now.

Performance benchmarks

Version: 0.16.0.1665.e8e381b37222815f11edd0e428d1c44001d58fbd
Instance type: n1-standard-32
Instance disk: 500 GB
Instance and storage region: gcp-us-east

Download

| Size | File system | Download, s |
|---|---|---|
| 10 GB | Local | 54 |
| 10 GB | Shared | 79 |

@tcibinan tcibinan closed this as completed Jul 4, 2019
evgeniimv pushed a commit to evgeniimv/cloud-pipeline that referenced this issue Oct 14, 2019
…epam#475)

evgeniimv pushed a commit to evgeniimv/cloud-pipeline that referenced this issue Oct 14, 2019
…am#479)
