
Slow download performance for Storage API #3929

Closed
mcantrell opened this issue Nov 9, 2018 · 11 comments · Fixed by #5791
Assignees
Labels
api: storage Issues related to the Cloud Storage API. priority: p2 Moderately-important priority. Fix may not be included in next release. 🚨 This issue needs some love. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@mcantrell

The new version of the storage client (com.google.cloud:google-cloud-storage:1.52.0) appears to download storage content at a MUCH slower rate than the legacy client (com.google.apis:google-api-services-storage:v1-rev141-1.25.0).

Legacy client (~40MB/s):

@Test
public void download() throws Exception {
    Storage storage = buildStorage();
    Storage.Objects.Get get = storage.objects().get(BUCKET_NAME, BUCKET_PATH);
    StorageObject storageObject = get.execute();

    File tempFile = createTempFile();
    try (OutputStream out = new FileOutputStream(tempFile)) {
        Stopwatch stopwatch = Stopwatch.createStarted();
        get.getMediaHttpDownloader().setDirectDownloadEnabled(true);
        get.executeMediaAndDownloadTo(out);
        long elapsedSeconds = stopwatch.stop().elapsed(TimeUnit.SECONDS);
        double fileSizeMb = storageObject.getSize().doubleValue() / 1024 / 1024;
        double throughput = fileSizeMb / elapsedSeconds;
        log.info("Completed download: elapsed={}s,fileSize={}MB,throughput={}MB/s",
                elapsedSeconds, fileSizeMb, throughput);
        assertTrue("Expected at least 20MB/s", throughput > 20);
    }
}

New client (~10MB/s):

@Test
public void download() throws Exception {
    Storage storage = StorageOptions.getDefaultInstance().getService();
    Blob blob = storage.get(BLOB_ID);
    Path tempFile = createTempFile();

    Stopwatch stopwatch = Stopwatch.createStarted();
    blob.downloadTo(tempFile);
    long elapsedSeconds = stopwatch.stop().elapsed(TimeUnit.SECONDS);
    double fileSizeMb = blob.getSize().doubleValue() / 1024 / 1024;
    double throughput = fileSizeMb / elapsedSeconds;
    log.info("Completed download: elapsed={}s,fileSize={}MB,throughput={}MB/s",
            elapsedSeconds, fileSizeMb, throughput);
    assertTrue("Expected at least 20MB/s", throughput > 20);
}

I'm attaching a couple of test cases that I've run from a GCE instance (Ubuntu 16.04 with Java 1.8.0_191):

storage-performance-legacy.zip
storage-performance-new.zip

@JustinBeckwith JustinBeckwith added the triage me I really want to be triaged. label Nov 10, 2018
@mcantrell

It seems that a new API request is issued to fill each byte buffer for the channel reader (com.google.cloud.storage.BlobReadChannel#read). The byte buffer argument size is 2MB. The client is issuing ~500 service requests to download this file (~1GB). The time difference between the two clients is ~60s. This puts the overhead at ~120ms per request.

I was able to use a 50MB byte buffer to get closer to the legacy client (~33MB/s), but I'm not sure a byte buffer that large is healthy for our system, considering the number of concurrent downloads we handle.

Are the individual service requests needed to satisfy retry requirements?
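The arithmetic in that comment can be sketched as a back-of-the-envelope estimate (the ~1GB file size, 2MB buffer, and ~60s delta are the figures reported above; the class and method names are made up for illustration):

```java
// Back-of-the-envelope estimate of the per-request overhead described above.
public class OverheadEstimate {

    // Number of service requests when each read() fills one buffer (ceiling division).
    static long requestCount(long fileBytes, long bufferBytes) {
        return (fileBytes + bufferBytes - 1) / bufferBytes;
    }

    // Average extra latency per request, in milliseconds.
    static double overheadMsPerRequest(double extraSeconds, long requests) {
        return extraSeconds * 1000.0 / requests;
    }

    public static void main(String[] args) {
        long fileBytes = 1024L * 1024 * 1024;  // ~1 GB object
        long bufferBytes = 2L * 1024 * 1024;   // 2 MB ByteBuffer per read
        long requests = requestCount(fileBytes, bufferBytes);    // 512 requests
        double overhead = overheadMsPerRequest(60.0, requests);  // ~117 ms each
        System.out.printf("requests=%d, overhead=%.0f ms/request%n", requests, overhead);
    }
}
```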

@chingor13 chingor13 added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. labels Nov 12, 2018
@JustinBeckwith JustinBeckwith removed the triage me I really want to be triaged. label Nov 12, 2018
@mcantrell

mcantrell commented Nov 12, 2018

Just as a proof of concept, I created a quick-and-dirty downloader that uses the internal storage client's executeMediaAndDownloadTo method with retry support. It adjusts the media downloader to the last good byte offset written to the output stream:

public class BlobDownloader {
    protected final StorageOptions options;
    protected final Storage storage;
    protected final BlobId blobId;
    protected final Storage.Objects.Get getOperation;


    public BlobDownloader(StorageOptions options, BlobId blobId) {
        this.options = options;
        this.blobId = blobId;
        this.storage = buildStorage(options);
        try {
            this.getOperation = storage.objects().get(blobId.getBucket(), blobId.getName());
        } catch (IOException e) {
            throw new StorageException(e);
        }

        this.getOperation.getMediaHttpDownloader().setDirectDownloadEnabled(true);
    }

    public void download(OutputStream target) {
        CountingOutputStream out = new CountingOutputStream(target);
        try {
            Callable<Void> callable = () -> {
                try {
                    getOperation.getMediaHttpDownloader().setBytesDownloaded(out.getCount());
                    getOperation.executeMediaAndDownloadTo(out);
                    return null;
                } catch (IOException e) {
                    throw new StorageException(e);
                }
            };
            runWithRetries(callable, options.getRetrySettings(), BaseService.EXCEPTION_HANDLER, options.getClock());
        } catch (RetryHelper.RetryHelperException e) {
            throw StorageException.translateAndThrow(e);
        }
    }


    /**
     * Copied from HttpStorageRpc to get internal storage client. This allows easy access to media downloader.
     *
     * @see com.google.cloud.storage.spi.v1.HttpStorageRpc#HttpStorageRpc(com.google.cloud.storage.StorageOptions)
     */
    protected Storage buildStorage(StorageOptions options) {
        HttpTransportOptions transportOptions = (HttpTransportOptions) options.getTransportOptions();
        HttpTransport transport = transportOptions.getHttpTransportFactory().create();
        HttpRequestInitializer initializer = transportOptions.getHttpRequestInitializer(options);

        return new Storage.Builder(transport, new JacksonFactory(), initializer)
                .setRootUrl(options.getHost())
                .setApplicationName(options.getApplicationName())
                .build();
    }
}

It's not well implemented, but it demonstrates the concept.
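The resume-from-offset idea above can be exercised without GCS at all. Below is a hypothetical local simulation (FlakySource, ResumeDemo, and the failure point are all invented for illustration): each retry restarts the transfer at the number of bytes already written, mirroring how the snippet resets setBytesDownloaded to out.getCount():

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Local simulation of the resume-from-offset retry pattern: each attempt
// restarts the transfer at the count of bytes already written downstream.
public class ResumeDemo {

    // Simulates a source that fails once, after delivering `failAfter` bytes.
    static class FlakySource {
        private final byte[] data;
        private final int failAfter;
        private boolean failed = false;

        FlakySource(byte[] data, int failAfter) {
            this.data = data;
            this.failAfter = failAfter;
        }

        // Streams data starting at `offset`; throws once to simulate a dropped connection.
        void downloadFrom(long offset, ByteArrayOutputStream out) throws IOException {
            for (int i = (int) offset; i < data.length; i++) {
                if (!failed && i == failAfter) {
                    failed = true;
                    throw new IOException("connection reset");
                }
                out.write(data[i]);
            }
        }
    }

    // Retries the download, resuming each attempt at the last good offset (out.size()).
    static byte[] downloadWithResume(FlakySource source, int maxAttempts) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                source.downloadFrom(out.size(), out);
                return out.toByteArray();
            } catch (IOException retryable) {
                // fall through and retry from out.size()
            }
        }
        throw new IOException("retries exhausted");
    }
}
```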

@andrey-qlogic

@mcantrell, that is a good example, and I confirmed that performance differs between the legacy client and the new API from google-cloud-java. However, there is a reason the downloadTo method was added to Blob.java — see issue #2107.

The method uses the new Channel API:

    public void downloadTo(Path path, BlobSourceOption... options)

Your approach adds a dependency on the legacy client com.google.apis:google-api-services, which may not be a good option. It is also unclear to me, even if a BlobDownloader or BlobDownloadHelper were to exist, where the class should live.

I would suggest waiting to hear from @frankyn or @garrettjonesgoogle.

@mcantrell

The approach is tied to the legacy client but, from what I can tell, so is everything else. For instance, the reader created for the downloadTo method uses the legacy client:

    BlobReadChannel(StorageOptions serviceOptions, BlobId blob,
        Map<StorageRpc.Option, ?> requestOptions) {
      this.serviceOptions = serviceOptions;
      this.blob = blob;
      this.requestOptions = requestOptions;
      isOpen = true;
      storageRpc = serviceOptions.getStorageRpcV1();
      storageObject = blob.toPb();
    }

I'm not sure the referenced code is meant to be implemented exactly as-is. It was just an example to demonstrate that you don't need to issue so many service requests to achieve a resumable download.

@andrey-qlogic

@mcantrell

The approach is tied to an obsolete client, but the performance of the new downloadToPathWithMediaHttpDownloader method looks better than the existing downloadTo method. I would suggest offering this approach to users separately from the other method. If that seems reasonable, I will also add unit tests.

@mcantrell

I would suggest creating a method that is decoupled from java.nio.file.Path. I would assume that a lot of users (myself included) would not have a file to write to. For example, you may want to stream the data from the storage API to a browser via HTTP response.

e.g.

    downloadWithMediaHttpDownloaderTo(OutputStream out)

or

    downloadWithMediaHttpDownloaderTo(WritableChannel out)
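For illustration, a stream-based variant could reduce to a plain copy loop over whatever channel the client exposes. The sketch below is generic java.nio code, not the actual client API, and ChannelCopy/copyTo are invented names:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

// Generic helper: drain any ReadableByteChannel into an OutputStream.
// A stream-based download method could delegate to something like this.
public final class ChannelCopy {

    static long copyTo(ReadableByteChannel in, OutputStream out, int bufferSize)
            throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(bufferSize);
        long total = 0;
        while (in.read(buf) != -1) {
            buf.flip();
            // Heap buffer from allocate(): backing array starts at offset 0.
            out.write(buf.array(), 0, buf.limit());
            total += buf.limit();
            buf.clear();
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello storage".getBytes();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long n = copyTo(Channels.newChannel(new ByteArrayInputStream(data)), out, 4);
        System.out.println(n + " bytes copied");
    }
}
```

The same loop works unchanged whether `out` is a file, a servlet response stream, or an in-memory buffer, which is the decoupling being asked for here.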

@andrey-qlogic

The change to OutputStream is reasonable, but building a com.google.api.services.storage.Storage object from a Blob may not be a good idea.

@andrey-qlogic

andrey-qlogic commented Jan 25, 2019

@frankyn, it turned out that adding the new option USE_DIRECT_DOWNLOAD and setting the MediaHttpDownloader's directDownloadEnabled flag based on that option does not change the performance. The throughput is still ~10MB/s where the legacy client showed 40MB/s. At the same time, if I increase the buffer allocation size in the downloadTo method

    ByteBuffer bytes = ByteBuffer.allocate(DEFAULT_CHUNK_SIZE);

it improves performance up to 3x on files of ~1GB in size. What if, instead, we add a parameter to downloadTo so users can set the buffer size themselves?
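As a rough model of why the buffer size dominates (the model itself is an assumption for illustration, not the client's actual behavior; the ~120ms overhead, 2MB and 50MB buffers, and 40MB/s link speed are the figures reported earlier in this thread):

```java
// Hypothetical model: total time = fixed per-request overhead * request count
// + raw transfer time. Larger buffers cut the request count, shrinking the
// overhead term — consistent with the ~3x improvement reported above.
public class BufferSizeModel {

    static double downloadSeconds(long fileBytes, long bufferBytes,
                                  double overheadMsPerRequest, double linkMBps) {
        long requests = (fileBytes + bufferBytes - 1) / bufferBytes;
        double overheadSeconds = requests * overheadMsPerRequest / 1000.0;
        double transferSeconds = (fileBytes / (1024.0 * 1024.0)) / linkMBps;
        return overheadSeconds + transferSeconds;
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;
        double small = downloadSeconds(oneGb, 2L * 1024 * 1024, 120, 40);  // ~87s (~12MB/s)
        double large = downloadSeconds(oneGb, 50L * 1024 * 1024, 120, 40); // ~28s (~36MB/s)
        System.out.printf("2MB buffer: %.0fs, 50MB buffer: %.0fs%n", small, large);
    }
}
```

Under these assumed numbers the model lands close to the measured ~10MB/s and ~33MB/s figures, which supports exposing the buffer size as a parameter.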

@frankyn

frankyn commented Jan 31, 2019

I think that sounds like a better idea. Thanks for digging in, @andrey-qlogic!

@andrey-qlogic

@frankyn I added a commit to the in-progress PR, which has the same performance as the legacy client.

@andrey-qlogic

The PR (#4337) makes performance the same as the legacy client.
