
feat: Refactor upchunk to use readable streams for memory usage improvement #95

Merged: 8 commits merged into muxinc:master on Nov 28, 2022

Conversation

cjpillsbury (Contributor) commented Nov 16, 2022:

Overview

This PR is a rearchitecture of Upchunk to use ReadableStreams as the basis for reading bytes from a file. Unlike the current implementation, which relies on loading the entire file into the JavaScript runtime heap, this new architecture allows us to reduce the memory footprint to a given read() from the file's ReadableStream (plus any remaining bytes from the previous read that have yet to be uploaded).

Why a separate AsyncIterable class (ChunkedStreamIterable)?

While the code could have been entirely written inline as part of the UpChunk instance, creating a class that conforms to ECMA standards for async iteration allows us to:

  • Take advantage of constructs like iterators, making the asynchronous, serial process of chunked upload much simpler and easier to understand where it is used in UpChunk::sendChunks().
  • Have clear separation of concerns, leaving room for more granular/unit-level testing and isolating some of the more complicated parts when refactoring/reasoning about functionality, since ChunkedStreamIterable shouldn't need to be refactored as frequently.
  • Have a much easier path forward for plausible future features, like direct MediaRecorder support.
  • Have a much easier path forward for alternative applications of chunked uploads (e.g. WebSockets or WebRTC data channels, transforms, etc.).

Why a "pull"-based use of streams (instead of a "push")?

Using things like pipes with backpressure can often be a very clean way to chain together discrete transformations and side effects when working with streams. Unfortunately, the current APIs don't have sufficient ways of cleanly handling:

  • pauses (in a way that wouldn't result in aggregation of file bytes into memory)
  • dynamic queuing strategies to apply appropriate backpressure (e.g. with dynamicChunkSize enabled)

Additional notes

Given the scope of changes here, additional tests have been added to validate that the "uploaded" files are identical (in bytes) to the files provided to UpChunk. Also, even though the API has only been changed in an additive way (e.g. adding off() and once() methods) and should be fully backwards compatible, this will likely be released as a major version change, due to the scope of the refactor.

resolves: #89

@cjpillsbury cjpillsbury marked this pull request as ready for review November 17, 2022 16:00
import xhr, { type XhrUrlConfig, type XhrHeaders, type XhrResponse } from 'xhr';

const DEFAULT_CHUNK_SIZE = 30720; // in kB
const DEFAULT_MAX_CHUNK_SIZE = 512000; // in kB

512 MB seems like a lot for a chunk; will it ever use this much?

cjpillsbury (Contributor, Author) replied Nov 17, 2022:

Honestly, I'm not sure, but I think that can happen (since there are tradeoffs in overall time between one big chunk that takes a while and several smaller chunks that each finish faster but add overhead in total RTT). This just keeps the original value (originally defined inline in UpChunk::constructor()).

Another contributor replied:

There's a simple dynamic chunk sizing algorithm at the moment: the chunk size doubles when the last chunk's upload duration (lastChunkInterval) was less than 10 seconds, and it halves when the last chunk took more than 30 seconds to upload. So, as-is, we should only see a 512 MB chunk if the client was just able to upload a 256 MB chunk in under 10 seconds (25.6 MB/s, i.e. ~205 Mbit/s).

If we don't scale the chunk size up, then high bandwidth clients are kneecapped by round trip time and don't really have the opportunity to hit full throughput. However there's of course a tradeoff where it starts using too much memory.

Honestly 512MB is as large or larger than what I saw from any of the other sites I tested back in February. Google Drive, for example, caps out at 200MB. If there's concern that 512MB is too large, then 256MB is probably a more reasonable limit.

I also worry that 256 kB is too low as a minimum, but nobody will ever hit it unless it's taking them >30 seconds to upload 512 kB. Really, this setting is about balancing upload throughput vs. memory usage vs. the maximum amount of time lost if a chunk has to retry.
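
The rule described above can be sketched as a small pure function. This is an illustration of the described behavior, not UpChunk's actual internals; the 256 kB floor and 512000 kB cap are the bounds discussed in this thread.

```typescript
// Sketch of the dynamic chunk sizing rule from the discussion above:
// double when the last chunk uploaded in under 10s, halve when it took
// over 30s, clamped to [256, 512000] kB.
const MIN_CHUNK_SIZE_KB = 256;
const MAX_CHUNK_SIZE_KB = 512000;

function nextChunkSizeKb(currentKb: number, lastChunkIntervalSec: number): number {
  let next = currentKb;
  if (lastChunkIntervalSec < 10) next = currentKb * 2;
  else if (lastChunkIntervalSec > 30) next = currentKb / 2;
  return Math.min(MAX_CHUNK_SIZE_KB, Math.max(MIN_CHUNK_SIZE_KB, next));
}
```

For example, starting from the 30720 kB default, a fast chunk (5 s) grows the next chunk to 61440 kB, while a slow one (40 s) shrinks it to 15360 kB.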

return this.chunkSize * 1024;
}

async *[Symbol.asyncIterator](): AsyncIterator<Blob> {

cool!

let res: XhrResponse | undefined;
try {
this.attemptCount = this.attemptCount + 1;
this.lastChunkStart = new Date();

nit (non-blocking): using Date.now() is a little more direct

cjpillsbury (Contributor, Author) replied Nov 17, 2022:

This was another example of me trying to "leave the code the same as much as possible". Since this is how it was previously implemented, I was planning on keeping it as is.

// Side effects
const lastChunkEnd = new Date();
const lastChunkInterval =
(lastChunkEnd.getTime() - this.lastChunkStart.getTime()) / 1000;

If using Date.now(), update it here too; it also saves some characters.
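
The suggested Date.now() variant would compute the same interval from epoch milliseconds, with no Date objects (helper name here is illustrative):

```typescript
// Equivalent timing with Date.now(): epoch milliseconds in, seconds out.
function elapsedSeconds(startMs: number, endMs: number): number {
  return (endMs - startMs) / 1000;
}

// Usage sketch against the code above:
//   this.lastChunkStart = Date.now();
//   ...later...
//   const lastChunkInterval = elapsedSeconds(this.lastChunkStart, Date.now());
```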

cjpillsbury (Contributor, Author) replied:

same as above

"private-method-regex": "^\\*?\\[?[a-zA-Z][\\w\\d\\.]*\\]?$",
"protected-method-regex": "^\\*?\\[?[a-zA-Z][\\w\\d\\.]*\\]?$",
"static-method-regex": "^\\*?\\[?[a-zA-Z][\\w\\d\\.]*\\]?$",
"function-regex": "^\\*?\\[?[a-zA-Z][\\w\\d\\.]*\\]?$"

are these really needed? seems a lot of extra config

cjpillsbury (Contributor, Author) replied:

I'm not sure that they're all needed. This was based on a solution to the general problem of tslint not properly supporting symbols, taken from here: microsoft/tslint-microsoft-contrib#459. I can add a comment referencing that? Given that tslint is deprecated/unmaintained, I'd lean towards not changing this and instead having a separate effort to migrate from tslint to eslint (more like our setup in Open Elements).

@luwes left a comment:

LGTM. I'll leave it for others to chime in as well.

return this.chunkSize * 1024;
}

async *[Symbol.asyncIterator](): AsyncIterator<Blob> {

what is upchunk's expected browser support matrix? Does our usage of Symbols, generators, etc. inadvertently change it? Or does it compile away with regenerator runtime?

cjpillsbury (Contributor, Author) replied:

It's always been a bit squishy, but, given the hard dependency on ReadableStream (and the corresponding update to the File API) for this effort, we should be good. Good callout though.


Cool, and since you're targeting making this a major version, it should be fine. (I also wasn't sure what the browser-support overlap between generators/ReadableStream/Symbols was, but it's probably large enough not to worry about, except for IE.) Worth solidifying the support matrix somewhere, though.

@cjpillsbury cjpillsbury merged commit af6c85c into muxinc:master Nov 28, 2022
Closes issue: Browser memory usage (#89)
4 participants