
feat: Refactor upchunk to use readable streams for memory usage improvement #95

Merged: 8 commits merged into muxinc:master on Nov 28, 2022

Conversation

cjpillsbury (Contributor) commented Nov 16, 2022:

Overview

This PR is a rearchitecture of Upchunk to use ReadableStreams as the basis for reading bytes from a file. Unlike the current implementation, which relies on loading the entire file into the JavaScript runtime heap, this new architecture allows us to reduce the memory footprint to a given read() from the file's ReadableStream (plus any remaining bytes from the previous read that have yet to be uploaded).

Why a separate AsyncIterable class (ChunkedStreamIterable)?

While the code could have been entirely written inline as part of the UpChunk instance, creating a class that conforms to ECMA standards for async iteration allows us to:

  • Take advantage of constructs like iterators, making the asynchronous, serial process of chunked upload much simpler and easier to understand where it is used in UpChunk::sendChunks().
  • Have clear separation of concerns, leaving room for more granular/unit-level testing and isolating some of the more complicated parts when refactoring/reasoning about functionality, since ChunkedStreamIterable shouldn't need to be refactored as frequently.
  • Have a much easier path forward for plausible future features, like direct MediaRecorder support.
  • Have a much easier path forward for alternative applications of chunked uploads (e.g. WebSockets or WebRTC data channels, transforms, etc.).

Why a "pull"-based use of streams (instead of a "push")?

Using things like pipes with backpressure can often be a very clean way to chain together discrete transformations and side effects when working with streams. Unfortunately, the current APIs don't have sufficient ways of cleanly handling:

  • pauses (in a way that wouldn't result in aggregation of file bytes into memory)
  • dynamic queuing strategies to apply appropriate backpressure (e.g. with dynamicChunkSize enabled)

Additional notes

Given the scope of changes here, additional tests have been added to validate that the "uploaded" files are identical (in bytes) to the files provided to UpChunk. Also, even though the API has only been changed in an additive way (e.g. adding off() and once() methods) and should be fully backwards compatible, this will likely be released as a major version change, due to the scope of the refactor.

resolves: #89

@cjpillsbury cjpillsbury marked this pull request as ready for review November 17, 2022 16:00
import xhr, { type XhrUrlConfig, type XhrHeaders, type XhrResponse } from 'xhr';

const DEFAULT_CHUNK_SIZE = 30720; // in kB
const DEFAULT_MAX_CHUNK_SIZE = 512000; // in kB

512 MB seems like a lot for a chunk; will it ever use this much?

cjpillsbury (Contributor, Author) replied Nov 17, 2022:

Honestly, I'm not sure, but I think that can happen (since there are tradeoffs in overall time between one big chunk that takes a while and several smaller chunks that each finish faster but add overhead in total RTT). This just keeps the original value (originally defined inline in UpChunk::constructor()).

Another contributor replied:

There's a simple dynamic chunk sizing algorithm at the moment: the chunk size doubles when the last chunk's upload duration (lastChunkInterval) was less than 10 seconds, and it halves when the last chunk took more than 30 seconds to upload. So, as-is, we should only see a 512 MB chunk if the client was just able to upload a 256 MB chunk in under 10 seconds (25.6 MB/s, i.e. ~205 Mbit/s).

If we don't scale the chunk size up, then high bandwidth clients are kneecapped by round trip time and don't really have the opportunity to hit full throughput. However there's of course a tradeoff where it starts using too much memory.

Honestly 512MB is as large or larger than what I saw from any of the other sites I tested back in February. Google Drive, for example, caps out at 200MB. If there's concern that 512MB is too large, then 256MB is probably a more reasonable limit.

I also worry that 256 kB is too low as a minimum, but nobody will ever hit it unless it's taking them >30 seconds to upload 512 kB. Really, this setting is about balancing upload throughput vs. memory usage vs. the maximum amount of time lost if a chunk has to retry.
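
The rule described above can be sketched as a small pure function. This is an illustration of the described behavior, not UpChunk's actual internals; the 256 kB floor and 512000 kB cap are the bounds discussed in this thread.

```typescript
// Sketch of the dynamic chunk sizing rule from the discussion above:
// double when the last chunk uploaded in under 10s, halve when it took
// over 30s, clamped to [256, 512000] kB.
const MIN_CHUNK_SIZE_KB = 256;
const MAX_CHUNK_SIZE_KB = 512000;

function nextChunkSizeKb(currentKb: number, lastChunkIntervalSec: number): number {
  let next = currentKb;
  if (lastChunkIntervalSec < 10) next = currentKb * 2;
  else if (lastChunkIntervalSec > 30) next = currentKb / 2;
  return Math.min(MAX_CHUNK_SIZE_KB, Math.max(MIN_CHUNK_SIZE_KB, next));
}
```

For example, starting from the 30720 kB default, a fast chunk (5 s) grows the next chunk to 61440 kB, while a slow one (40 s) shrinks it to 15360 kB.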

return this.chunkSize * 1024;
}

async *[Symbol.asyncIterator](): AsyncIterator<Blob> {

cool!

let res: XhrResponse | undefined;
try {
this.attemptCount = this.attemptCount + 1;
this.lastChunkStart = new Date();

nit (non-blocking): using Date.now() is a little more direct

cjpillsbury (Contributor, Author) replied Nov 17, 2022:

This was another example of me trying to "leave the code the same as much as possible". Since this is how it was previously implemented, I was planning on keeping it as is.

// Side effects
const lastChunkEnd = new Date();
const lastChunkInterval =
(lastChunkEnd.getTime() - this.lastChunkStart.getTime()) / 1000;

If using Date.now(), update it here too; it also saves some characters.
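
The suggested Date.now() variant would compute the same interval from epoch milliseconds, with no Date objects (helper name here is illustrative):

```typescript
// Equivalent timing with Date.now(): epoch milliseconds in, seconds out.
function elapsedSeconds(startMs: number, endMs: number): number {
  return (endMs - startMs) / 1000;
}

// Usage sketch against the code above:
//   this.lastChunkStart = Date.now();
//   ...later...
//   const lastChunkInterval = elapsedSeconds(this.lastChunkStart, Date.now());
```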

cjpillsbury (Contributor, Author) replied:

same as above

"private-method-regex": "^\\*?\\[?[a-zA-Z][\\w\\d\\.]*\\]?$",
"protected-method-regex": "^\\*?\\[?[a-zA-Z][\\w\\d\\.]*\\]?$",
"static-method-regex": "^\\*?\\[?[a-zA-Z][\\w\\d\\.]*\\]?$",
"function-regex": "^\\*?\\[?[a-zA-Z][\\w\\d\\.]*\\]?$"

are these really needed? seems a lot of extra config

cjpillsbury (Contributor, Author) replied:

I'm not sure that they're all needed. This was based on a solution to the general problem of tslint not properly supporting symbols, taken from here: microsoft/tslint-microsoft-contrib#459. I can add a comment referencing that? Given that tslint is deprecated/unmaintained, I'd lean towards not changing this and instead having a separate effort to migrate from tslint to eslint (more like our setup in Open Elements).

@luwes left a comment:

LGTM. I'll leave it for others to chime in as well.

return this.chunkSize * 1024;
}

async *[Symbol.asyncIterator](): AsyncIterator<Blob> {

what is upchunk's expected browser support matrix? Does our usage of Symbols, generators, etc. inadvertently change it? Or does it compile away with regenerator runtime?

cjpillsbury (Contributor, Author) replied:

It's always been a bit squishy, but, given the hard dependency on ReadableStream (and the corresponding update to the File API) for this effort, we should be good. Good callout though.


Cool, and since you're targeting making this a major version, it should be fine. (I also wasn't sure what the browser-support overlap between generators/ReadableStream/Symbols was, but it's probably large enough not to worry about, except for IE.) Worth solidifying the support matrix somewhere, though.

@cjpillsbury cjpillsbury merged commit af6c85c into muxinc:master Nov 28, 2022
Closes issue: Browser memory usage (#89)
4 participants