Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stream performance improvement ideas #117

Open
benjamingr opened this issue Sep 21, 2023 · 32 comments
Open

Stream performance improvement ideas #117

benjamingr opened this issue Sep 21, 2023 · 32 comments

Comments

@benjamingr
Copy link
Member

cc @anonrig from our chat

  • Reduce memory overhead by using bit maps instead of many booleans for state
  • Use a deque instead of a linked list for the BufferList
  • Optimize for a single pipe rather than an array of pipes and fallback to array of pipes

More wild ideas:

  • Split Readable into several classes and "bailout" from fast to slow ones (e.g. objectMode and bytes, multiple readers vs. one, flowing vs. non flowing) based on most streams not changing their type during their lifetime.
  • Don't require a "full" event emitter necessarily for a stream unless the user is listening to an event and "fast-path" if a stream is piped

cc @ronag @nodejs/streams

@ronag
Copy link
Member

ronag commented Sep 21, 2023

cc @anonrig from our chat

  • Reduce memory overhead by using bit maps instead of many booleans for state

Not sure whether this is worth the effort. Is memory usage a problem? Number of streams in typical application * state size should be quite low, relatively speaking? I tried this once but the performance impact was negligible. I was thinking that being able to keep more of the state in a single cache line should make a impact but I was wrong. Might be different now...

  • Use a deque instead of a linked list for the BufferList

I think adding unshift to FixedQueue would be one way to go?

  • Optimize for a single pipe rather than an array of pipes and fallback to array of pipes

Not sure what this is? We already optimized for single pipe? What am I missing

More wild ideas:

  • Split Readable into several classes and "bailout" from fast to slow ones (e.g. objectMode and bytes, multiple readers vs. one, flowing vs. non flowing) based on most streams not changing their type during their lifetime.

Difficult.

  • Don't require a "full" event emitter necessarily for a stream unless the user is listening to an event and "fast-path" if a stream is piped

Skeptical if this is worth it.

Other stuff:

  • add support in pipeline for sync transformer function
  • Rewrite transform stream as a Readable which implements the Writable trait. Avoiding the double buffering from Duplex.
  • Try to avoid nextTick calls. I have a PR for this.

@ronag
Copy link
Member

ronag commented Sep 21, 2023

Do we have any actual profile data to back our efforts?

@ronag
Copy link
Member

ronag commented Sep 21, 2023

TBH. I think the biggest impact we can have is to try and increase the highwatermark across the board, e.g. increasing from 16k to 128k has a theoretical perf improvement of 8x in the context of streams overhead.

@benjamingr
Copy link
Member Author

Not sure whether this is worth the effort. Is memory usage a problem? Number of streams in typical application * state size should be quite low, relatively speaking? I tried this once but the performance impact was negligible. I was thinking that being able to keep more of the state in a single cache line should make a impact but I was wrong. Might be different now...

i'll open a PR and we can check but since streams are so omnipresent I would suspect so especially when people construct smaller streams (e.g. to use the iterator helpers). I feel like doing the work and seeing would be easier and I feel like I got sucked into a lot more talking than coding roles recently so that would be fun.

I think adding unshift to FixedQueue would be one way to go?

That would work

Not sure what this is? We already optimized for single pipe? What am I missing

We always store a .pipes array even though most streams only pipe to one place, we could make the default behavior (one pipe) faster and bail out on multiple pipes.

Do we have any actual profile data to back our efforts?

No, I went through the systems analyzer with --log-all and noticed some stuff (optimzing streams wasn't the meeting topic V8 was)

@ronag
Copy link
Member

ronag commented Sep 21, 2023

We always store a .pipes array even though most streams only pipe to one place, we could make the default behavior (one pipe) faster and bail out on multiple pipes.

I don't see how this affects any hot path.

@benjamingr
Copy link
Member Author

Basically all streams become slightly smaller and lookup for the pipe becomes slightly faster which would have very small impact in "many big streams" but measurable in "many small streams"

@benjamingr
Copy link
Member Author

Said PR nodejs/node#49745

@mcollina
Copy link
Member

The main performance bottleneck in streams comes from the cost of buffering at each step of a stream pipeline. From both an algorithmic and memory perspective, this is heavy. Worsely, we do this double buffering in http, because both the socket and IncomingRequests are Readable.

We should internalize a library such as syncthrough or minipass (or one built with the same principles) that does not perform any buffering. Look at the benchmarks.

@benjamingr
Copy link
Member Author

@mcollina we can explore/add a BYOB fast mode where different steps of the transform mutate the data in place instead of copying it which should ideally mean 0 copies

@mcollina
Copy link
Member

we can explore/add a BYOB fast mode where different steps of the transform mutate the data in place instead of copying it which should ideally mean 0 copies

That is impossible with the current API. We did a full design for this a long time ago, but there was no budget to make it happen. Using synchthrough/minipass where you'd use a Transform is a good intermediate step (you'd need this anyway for a BYOB solution).

@benjamingr
Copy link
Member Author

@mcollina why? We keep the same design and just emit the buffers that were passed to us, we keep a pointer to what we're currently using etc?

@benjamingr
Copy link
Member Author

@ronag trying FixedQueue suggestion in nodejs/node#49754

@mcollina
Copy link
Member

For a BYOB strategy to work, you need to "lock" the two streams sides so that others could not interject and inspect that. Our current stream implementation is designed as a one-to-many abstraction.

@benjamingr
Copy link
Member Author

For a BYOB strategy to work, you need to "lock" the two streams sides so that others could not interject and inspect that.

Correct, we could for example have a base class that is not an event emitter and only pipelines (and composes etc) and listening to events on that would throw or some such. Would need a lot of thought but can happen I think?

@mcollina
Copy link
Member

The design for that was already done in https://github.com/Fishrock123/bob. I think this was 2017-2018 work.

@Uzlopak
Copy link
Contributor

Uzlopak commented Sep 23, 2023

In Readable and ReadableState we do a stream instanceof DuplexStream. Same with Writable.

Is there a way to avoid the instanceof? maybe using an symbol attribute kIsDuplexStream set to true DuplexStream in constructor and checked by Writable and Readable

@benjamingr
Copy link
Member Author

@Uzlopak sure though that would need to be a "fast path" while we still support instanceof for ecosystem compat

@mcollina bob was a different thing than I had in mind and a complete overhaul though?

@Uzlopak
Copy link
Contributor

Uzlopak commented Sep 23, 2023

I checked it locally, and it would break for Writable.

@benjamingr
Copy link
Member Author

@Uzlopak why?

@Uzlopak
Copy link
Contributor

Uzlopak commented Sep 23, 2023

Idk. But if you check... Writable has also not the following check

  if (!(this instanceof Writable))
    return new Writable(options);

@vweevers
Copy link

Some object mode Readable implementations could benefit from a new pushv API. Those that get data from their source in bulk and currently do for (const chunk of chunks) this.push(chunk) which could instead be this.pushv(chunks) to skip repeated state checks.

@ronag
Copy link
Member

ronag commented Sep 23, 2023

Another thing we could look into is to merge the Readable from undici into core. It has some smart tricks to reduce overhead.

https://github.com/nodejs/undici/blob/main/lib/api/readable.js

@benjamingr
Copy link
Member Author

Sitting together with Ronag he pointed out some more important/useful ideas for the bitmap:

errored -> move to flags (and mutate shape on error - make non-error path faster)
decoder -> kHasDecoder and optimize for fast path
encoding -> optimize for utf-8 or buffer and if it's different then mutate shape
defaultEncoding -> similarly (utf8 or buffer or different)
pipes -> optimize for single pipe? Why the heck do we keep a ref to the pipe?
flowing -> convert to 2 bits (true/false/null)
kPaused -> can just be a flag in the bitmap since that's private

We should also optimize for non-legacy streams and keep compat in a slow path.

@Uzlopak
Copy link
Contributor

Uzlopak commented Sep 27, 2023

I know it sounds ignorant... Why do we call destroy in next tick? All the other operations are done in the current loop. What makes destroy so special?

@mcollina
Copy link
Member

I know it sounds ignorant... Why do we call destroy in next tick? All the other operations are done in the current loop. What makes destroy so special?

To support this flow:

const server = net.createServer((socket) => {
  socket.on('error', ...)
})

If we did not emit on the next tick, we could have streams that are errored before the user has a chance to attach an handler. Unfortunately this pattern is pervasive in our codebase and the ecosystem. A possible optimization in this regard is to not nextTick if there is a listener.

@Uzlopak
Copy link
Contributor

Uzlopak commented Sep 27, 2023

Then we should consider this. Actually nextTick is the sole reason, that destroy is slower than the other ops.

@benjamingr
Copy link
Member Author

nextTick isn't that expensive (though its queue should use a deque if it doesn't assuming we can't move to V8 queueMicrotask for timing reasons) but the nextTick is also there to provide consistent ordering so code isn't sometimes synchronous it always fires listeners asynchronously

@ronag
Copy link
Member

ronag commented Oct 4, 2023

Current state of recent improvements vs 20.7.0:

streams/readable-async-iterator.js sync='no' n=100000                                           ***      8.98 %       ±2.22%  ±3.09%  ±4.31%
streams/readable-async-iterator.js sync='yes' n=100000                                                   9.42 %       ±9.77% ±13.45% ±18.45%
streams/readable-bigread.js n=1000                                                                *      2.85 %       ±2.41%  ±3.44%  ±5.01%
streams/readable-bigunevenread.js n=1000                                                        ***     18.86 %       ±5.99%  ±8.44% ±12.03%
streams/readable-boundaryread.js type='buffer' n=2000                                            **     -2.81 %       ±1.83%  ±2.60%  ±3.78%
streams/readable-boundaryread.js type='string' n=2000                                             *     -1.24 %       ±1.04%  ±1.42%  ±1.94%
streams/readable-from.js n=10000000                                                             ***     -5.65 %       ±2.69%  ±3.69%  ±5.03%
streams/readable-readall.js n=5000                                                              ***      1.53 %       ±0.72%  ±1.01%  ±1.43%
streams/readable-unevenread.js n=1000                                                                    0.62 %       ±0.70%  ±0.96%  ±1.31%
streams/pipe-object-mode.js n=5000000                                                            **      7.60 %       ±4.26% ±5.85% ±7.97%
streams/pipe.js n=5000000                                                                         *     -3.14 %       ±2.61% ±3.58% ±4.88%
streams/writable-manywrites.js len=1024 callback='no' writev='no' sync='no' n=100000            ***      6.28 %       ±2.66%  ±3.72%  ±5.26%
streams/writable-manywrites.js len=1024 callback='no' writev='no' sync='yes' n=100000           ***     20.64 %       ±3.17%  ±4.37%  ±6.00%
streams/writable-manywrites.js len=1024 callback='no' writev='yes' sync='no' n=100000                    4.72 %       ±4.95%  ±6.96%  ±9.88%
streams/writable-manywrites.js len=1024 callback='no' writev='yes' sync='yes' n=100000          ***     31.86 %       ±3.09%  ±4.24%  ±5.81%
streams/writable-manywrites.js len=1024 callback='yes' writev='no' sync='no' n=100000                    5.64 %       ±5.93%  ±8.49% ±12.41%
streams/writable-manywrites.js len=1024 callback='yes' writev='no' sync='yes' n=100000          ***     11.81 %       ±3.08%  ±4.29%  ±5.98%
streams/writable-manywrites.js len=1024 callback='yes' writev='yes' sync='no' n=100000           **      5.64 %       ±3.65%  ±5.10%  ±7.15%
streams/writable-manywrites.js len=1024 callback='yes' writev='yes' sync='yes' n=100000                 20.52 %      ±25.07% ±34.52% ±47.41%
streams/writable-manywrites.js len=32768 callback='no' writev='no' sync='no' n=100000           ***      9.54 %       ±3.76%  ±5.28%  ±7.49%
streams/writable-manywrites.js len=32768 callback='no' writev='no' sync='yes' n=100000          ***    298.42 %       ±5.77%  ±7.97% ±11.00%
streams/writable-manywrites.js len=32768 callback='no' writev='yes' sync='no' n=100000           **     11.73 %       ±6.77%  ±9.53% ±13.57%
streams/writable-manywrites.js len=32768 callback='no' writev='yes' sync='yes' n=100000         ***    263.39 %       ±8.58% ±11.95% ±16.72%
streams/writable-manywrites.js len=32768 callback='yes' writev='no' sync='no' n=100000           **      7.93 %       ±5.03%  ±7.20% ±10.53%
streams/writable-manywrites.js len=32768 callback='yes' writev='no' sync='yes' n=100000         ***    258.94 %       ±8.28% ±11.86% ±17.33%
streams/writable-manywrites.js len=32768 callback='yes' writev='yes' sync='no' n=100000          **      9.24 %       ±5.66%  ±7.80% ±10.71%
streams/writable-manywrites.js len=32768 callback='yes' writev='yes' sync='yes' n=100000        ***    233.27 %       ±4.76%  ±6.74%  ±9.71%
streams/transform-manywrites.js len=1024 callback='no' sync='no' n=100000                       ***     13.15 %       ±2.51%  ±3.45%  ±4.74%
streams/transform-manywrites.js len=1024 callback='no' sync='yes' n=100000                      ***     22.76 %       ±4.67%  ±6.51%  ±9.09%
streams/transform-manywrites.js len=1024 callback='yes' sync='no' n=100000                              14.59 %      ±17.30% ±24.48% ±35.12%
streams/transform-manywrites.js len=1024 callback='yes' sync='yes' n=100000                             17.09 %      ±22.63% ±31.08% ±42.50%
streams/transform-manywrites.js len=32768 callback='no' sync='no' n=100000                      ***     14.05 %       ±2.87%  ±4.00%  ±5.61%
streams/transform-manywrites.js len=32768 callback='no' sync='yes' n=100000                     ***    188.51 %      ±27.29% ±39.20% ±57.63%
streams/transform-manywrites.js len=32768 callback='yes' sync='no' n=100000                     ***     11.64 %       ±5.04%  ±6.97%  ±9.64%
streams/transform-manywrites.js len=32768 callback='yes' sync='yes' n=100000                    ***    190.91 %       ±8.86% ±12.44% ±17.62%
streams/creation.js kind='duplex' n=5000000                                                     ***     29.24 %       ±5.16%  ±7.31% ±10.50%
streams/creation.js kind='readable' n=5000000                                                   ***     27.02 %       ±1.11%  ±1.53%  ±2.09%
streams/creation.js kind='transform' n=5000000                                                  ***     39.30 %       ±9.82% ±13.57% ±18.74%
streams/creation.js kind='writable' n=5000000                                                   ***     49.21 %       ±2.42%  ±3.33%  ±4.54%
streams/destroy.js kind='duplex' n=100000                                                       ***     34.21 %       ±4.67%  ±6.46%  ±8.93%
streams/destroy.js kind='readable' n=100000                                                     ***     21.11 %      ±10.58% ±14.68% ±20.41%
streams/destroy.js kind='transform' n=100000                                                    ***     29.88 %       ±7.28% ±10.02% ±13.76%
streams/destroy.js kind='writable' n=100000                                                              6.64 %       ±7.53% ±10.73% ±15.58%

@ronag
Copy link
Member

ronag commented Oct 4, 2023

@benjamingr @mcollina ☝️

@benjamingr
Copy link
Member Author

That's great

@anonrig
Copy link
Member

anonrig commented Oct 4, 2023

Good work!

@benjamingr
Copy link
Member Author

We can:

  • Still make the impl a lot more v8 friendly
  • Figure out how to buffer less for people who pipeline things
    • Figure out efficient allocation and memory layout more generally
    • Stop using linked lists for like anything ever
  • Figure out if we can remove legacy compatibility for certain things to userland
  • Ideally make our web streams as fast as regular streams (doable)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants