Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: using ArrayPool<byte> in MemoryStream with new APIs #22428

Open
Tracked by #64596
stephentoub opened this issue Jun 22, 2017 · 28 comments
Open
Tracked by #64596

Proposal: using ArrayPool<byte> in MemoryStream with new APIs #22428

stephentoub opened this issue Jun 22, 2017 · 28 comments
Labels
api-needs-work API needs work before it is approved, it is NOT ready for implementation area-System.IO partner-impact This issue impacts a partner who needs to be kept updated Team:Libraries
Milestone

Comments

@stephentoub
Copy link
Member

stephentoub commented Jun 22, 2017

AB#1244354
When not constructed with a specific byte[], MemoryStream allocates byte[]s every time it needs to grow. It would be tempting to just change the implementation to use ArrayPool<byte>.Shared.Rent to get that array, but this is problematic for a few reasons, mostly to do with existing code:

  1. It's fairly common to not dispose MemoryStreams, as currently the Dispose is effectively a nop. But with wrapping buffers from ArrayPool<byte>, it's important to release the currently used buffer back to the pool when the stream is disposed. And it would be expensive to make MemoryStream finalizable to deal with this.
  2. MemoryStream by default allows the buffers it creates to be exposed through its TryGetBuffer method. That means folks today could be taking and using these buffers, potentially even after the MemoryStream is disposed, which could lead to code using the array after the array is returned to the pool and potentially used by someone else.

One solution would just be to introduce a new Stream type, e.g.

namespace System.IO
{
    public sealed class ArrayPoolStream : Stream
    {
        public ArrayPoolStream() : this(ArrayPool<byte>.Shared) { }
        public ArrayPoolStream(ArrayPool<byte> pool);
        ~ArrayPoolStream();

        public ArraySegment<byte> DangerousGetArray();

        ... // override all the relevant members on Stream
    }
}

That has its own downsides, though. In particular, there's something nice about this just being a part of MemoryStream, being easily used in existing code that uses MemoryStreams, etc. Another option would be to just plumb this into MemoryStream, but in an opt-in manner, and accept some of the resulting limitations because they're opt-in:

  • Add a ctor: public MemoryStream(ArrayPool<byte> pool).
  • When that ctor is used, buffers come from the pool rather than being new'd up.
  • If you don't Dispose, you leak the buffer, but since this only happens when you use the new ctor, we're at least not impacting existing code, and it's little different than just using the pool directly and not returning a buffer.
  • We either make TryGetBuffer return false, or as with Dispose we accept that you had to opt-in to this by using the new ctor.

I'm leaning towards the second option: just add the new MemoryStream(ArrayPool<byte>) ctor, and allow TryGetBuffer to work; devs just need to know that when they create the stream with this ctor, they should dispose the stream, and they shouldn't use the array retrieved from TryGetBuffer after doing something else to the stream.

@svick
Copy link
Contributor

svick commented Jun 22, 2017

It's fairly common to not dispose MemoryStreams, as currently the Dispose is effectively a nop.

BTW, I have recently added a note to the documentation of MemoryStream that says that it doesn't have to be disposed. So changing that now would be breaking a promise the documentation makes. (And if it's not an appropriate promise, the documentation should be changed back.)

@omariom
Copy link
Contributor

omariom commented Jun 22, 2017

Would it be less dangerous to return ReadOnlySpan in another overload of TryGetBuffer ?

@stephentoub
Copy link
Member Author

Would it be less dangerous to return ReadOnlySpan in another overload of TryGetBuffer ?

A little maybe, but many of the same problems apply, e.g.

stream.TryGetBuffer(out ArraySegment<byte> buffer);
stream.Write(...);
Use(buffer);

vs

stream.TryGetBuffer(out ReadOnlySpan<byte> buffer);
stream.Write(...);
Use(buffer);

In both cases, the stream may have replaced the array referenced by buffer, in which case buffer is now pointing to something old.

Plus, many of the cases for which TryGetBuffer really need access to the array.

@madhub
Copy link

madhub commented Jul 21, 2018

We use MemoryStream all the time, it would be great if bytes array come from ArrayPool instead of allocating . Please plan this feature for your next milestone.

@Timovzl
Copy link

Timovzl commented Aug 16, 2018

@svick If a particular class that implements IDisposable does not need to be disposed, I would always consider that an implementation detail, never to be part of the public documentation. Clients should honor the IDisposable interface, or ignore it at their (our) own peril. The implementation must be able to change, in my opinion.

@ThatRendle
Copy link

How about having a MemoryStreamPool that provides a MemoryStream backed by a pooled array, and having very clear Return() or Dispose() requirements? The MemoryStreamPool.Shared instance could use arrays from ArrayPool<byte>.Shared, and constructed instances could create their own pool.

@svick
Copy link
Contributor

svick commented Aug 18, 2018

@Timovzl Are you saying you Dispose() all Tasks in your async code?

I think that in the cases where Dispose does not do anything, it's just unnecessary ceremony. But since it's hard to find out whether that's the case, and whether it will stay that way, I think documentation should make this clear.

@wokket
Copy link

wokket commented Aug 20, 2018

Just adding another voice to this, given the widespread use of MemoryStream in our apps the ability to reduce (generally quite large) allocs would be great. I'm comfortable with an opt-in approach if required, although the 2.1 release showed how awesome it is when 'things just get faster'.

re: Disposing memory streams: There's so much tooling, helpers, VS Add-Ins, guidance etc that all falls to a pit of 'always dispose your disposables'. We treat any undisposed (especially I/O) class as a red flag in code reviews, and feel dirty when forced to hand such things off with no concrete confirmation of cleanup ( MVC's FileResult() is a great example there)

@Timovzl
Copy link

Timovzl commented Aug 22, 2018

@Timovzl Are you saying you Dispose() all Tasks in your async code?

@svick I uh... always use ValueTask! :P Just kidding, you had me there.

Still, I maintain my position, exceptions notwithstanding. As this very issue demonstrates, allowing our class documentation to say, "hey, you may use this class without dispose, just for laziness", puts us in this awkward position when the implementation can no longer support that liberty.

Even if disposing is not necessary, why share this in the documentation at all? Why let people skip dispose on some Stream types? Just have them disposed as usual, even where it was not strictly necessary.

And then we can keep the exceptions minimal, such as for Task. The fewer exceptions, the less to remember. The fewer exceptions, the fewer questions raised by code reviewers, such as when they spot undisposed Streams. And our implementations are free to change.

@imanushin
Copy link

@markrendle , if you need MemoryStreamPool - you can just use
Microsoft.IO.RecyclableMemoryStream.

Looks like it has all that you want:

  • It uses chunks internally (which is nice to re-send huge files)
  • It re-uses byte arrays created

@Timovzl
Copy link

Timovzl commented Nov 13, 2018

@imanushin RecyclableMemoryStream looks interesting. Note that its GetBuffer() method returns the actual (current) buffer. It matches the docs' current documentation and gives you the power to do things - to do things wrong, too. Of course, in part, this was already the case with MemoryStream, but now you can retain access to a buffer that will be returned to the pool and used by something else. I'm not too fond of this potential for misuse. Granted, the alternative isn't great...

@tmds
Copy link
Member

tmds commented Nov 21, 2019

It uses chunks internally (which is nice to re-send huge files)

Maybe something that could be considered here also.

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 5.0 milestone Jan 31, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 23, 2020
@stephentoub stephentoub removed the untriaged New issue has not been triaged by the area owner label Feb 25, 2020
@stephentoub stephentoub modified the milestones: 5.0, Future Feb 25, 2020
@geoffkizer
Copy link
Contributor

It uses chunks internally (which is nice to re-send huge files)

Maybe something that could be considered here also.

Using chunks internally seems inconsistent with supporting GetBuffer().

I do think using chunks internally is interesting to consider, but I'm not sure how to square this with the GetBuffer() API; seems like we might need a separate type for this.

@saucecontrol
Copy link
Member

It gets a new larger contiguous buffer if necessary to present a single buffer with all content:

https://github.com/microsoft/Microsoft.IO.RecyclableMemoryStream/blob/master/src/RecyclableMemoryStream.cs#L54-L57

The only problem there is that the buffer might be from the pool, and now you have a second potential owner of said buffer. Would be safer to just treat GetBuffer() like ToArray() with a fresh non-pooled array.

@Timovzl
Copy link

Timovzl commented Nov 19, 2020

Would be safer to just treat GetBuffer() like ToArray() with a fresh non-pooled array.

In that case, at the very least an array from the array pool could be returned. That leaves the user with the option of returning it, making the code mostly allocation-free. As for users unaware of this... renting an array from the array pool and not returning it has no downsides over just allocating a new array, right?

@saucecontrol
Copy link
Member

Good point, as long as the MemoryStream doesn't return it when Disposed in that case. That's the tricky part of it... the buffer can't be owned by both the MemoryStream and the consumer because it's fine if neither returns it to the pool but bad if one returns it and the other continues using it.

@Timovzl
Copy link

Timovzl commented Nov 19, 2020

Good point, as long as the MemoryStream doesn't return it when Disposed in that case [...] because it's fine if neither returns it to the pool but bad if one returns it and the other continues using it.

Agreed. Once you call GetBuffer(), you become the owner of the array. Either you return it neatly with ArrayPool<byte>.Return() (best reuse) or you let it be garbage collected (nothing goes wrong, at least).

This does require some attention to the single-block case too. Do we implement a special case to relinquish ownership of the single block? That seems hard to keep track of. We might need to switch to "large buffers" as soon as GetBuffer() is invoked.

Any clever ideas? It would be so nice if the single-block case could stay copy-free...

@tmds
Copy link
Member

tmds commented Nov 19, 2020

Would be safer to just treat GetBuffer() like ToArray() with a fresh non-pooled array.

GetBuffer allows you to also change the data that is in the MemoryStream, which would not be the case if a fresh non-pooled array was returned.

Using chunks internally seems inconsistent with supporting GetBuffer().

  • When backed by a chunks, GetBuffer could be disallowed, similar as when MemoryStream is constructed with publiclyVisible to false using:
public MemoryStream (byte[] buffer, int index, int count, bool writable, bool publiclyVisible);
  • Or, the same parameter could be added to the new constructor:
public MemoryStream(ArrayPool<byte> pool, bool publiclyVisible)

When publiclyVisible is set to false, the implementation can use chunks.

  • Or, when someone calls GetBuffer, the implementation could automagically switch to non-chunked mode.

@stephentoub
Copy link
Member Author

stephentoub commented Nov 19, 2020

The more I've thought about this, the more I think it needs to be done separately from MemoryStream:

  • It's very common to not Dispose a MemoryStream; not doing so with pooled buffers will "leak" / drain the pool.
  • MemoryStream.ToArray() is explicitly usable after MemoryStream.Dispose(); that's called out in the docs and a non-trivial amount of code relies on that behavior. That breaks badly if Dispose returns the buffer to the pool, and there's no way for the MemoryStream to know whether ToArray will be called. Using a different constructor could change the behavior of ToArray to disallow that, but that breaks the MemoryStream contract.
  • Existing misuses of MemoryStream, such as concurrent accesses, could result in spooky-action-at-a-distance bugs, with arrays being returned to the pool while still being used, resulting in new very hard to diagnose failure modes. When we've introduced ArrayPool usage into other existing streams, we've generally only done so when there was either existing synchronization we could guard usage behind or where the overhead of adding such synchronization wasn't prohibitive, but I fear it would be in MemoryStream.
  • Changing the MemoryStream behavior from growing an array to maintaining a linked list of arrays has observable performance impact, both positive for some cases (avoiding copying for append-only writes) and negative for others (e.g. random access via seeking, consuming more buffers of varying sizes from the pool). It's hard to predict whether the impact will be net positive or negative across the vast usage of MemoryStream.

@geoffkizer
Copy link
Contributor

Changing the MemoryStream behavior from growing an array to maintaining a linked list of arrays has observable performance impact

Also the GetBuffer() issues mentioned above.

@Timovzl
Copy link

Timovzl commented Nov 21, 2020

@tmds

GetBuffer allows you to also change the data that is in the MemoryStream, which would not be the case if a fresh non-pooled array was returned.

Actually, with the above discussion about this, the intent was that the returned array becomes the stream's own buffer (at least until you write to the stream), so changing the data certainly works. Hopefully the discussion makes more sense with that in mind.

When backed by a chunks, GetBuffer could be disallowed [...]

I imagine a less ideal GetBuffer (that sometimes needs to do copying) is preferable to one that sometimes simply does not work. The caller doesn't necessarily know when it's using small vs. large buffers. Having the method throw sometimes, in a way that is nontransparent to the caller, would render the method a lot less usable.

Or, when someone calls GetBuffer, the implementation could automagically switch to non-chunked mode.

That would be the easiest. It is a bit disappointing that the simple case of a single chunk would then involve copying, so I'm still holding out hopes that we can come up with something more clever. But it might be acceptable.

@Timovzl
Copy link

Timovzl commented Nov 21, 2020

The more I've thought about this, the more I think it needs to be done separately from MemoryStream:

I think a separate, more suitable API would certainly be worth pursuing. I'm wondering, when we wish to supply a low-allocation implementation, in how many cases we must pass a MemoryStream. Can we practically always get away with passing any Stream? Or do we have example use cases where MemoryStream is a requirement?

If a MemoryStream tends to be the requirement, we should probably tackle both issues, but separately. If any Stream will do, then let's forget about MemoryStream and make something ideal.

I'm not fond of the idea of honoring API misuse. If you neglect to dispose an IDisposable object, invoke methods on a disposed object, or concurrently access an object that is not thread-safe... then there may be consequences - now, or when the implementation changes.

If a type unrelated to MemoryStream turns out to be the way to go anyway, then it's a plus that API misuse isn't punished. But should that consideration be a priority?

Admittedly, as for not disposing, the documentation says that's allowed, so that needs to be addressed. (Strange implementation detail to mention in the documentation, but I digress.) If we can make sure that not disposing merely degenerates to the equivalent of using non-pooled streams (analog to not returning arrays rented from an array pool), then this is a non-issue, right? Potential gains, with no potential losses.

@stephentoub
Copy link
Member Author

stephentoub commented Nov 21, 2020

If we can make sure that not disposing merely degenerates to the equivalent of using non-pooled streams (analog to not returning arrays rented from an array pool), then this is a non-issue, right?

Arrays from the pool are more "valuable" than others, as they're much more likely to have been around for longer, be in higher generations, etc. There's also been discussion about using pinned/aligned arrays in the pool. So taking arrays from the pool and not returning them is generally worse on a system than allocating new ones, degrading other unrelated components using the pool.

I'm not fond of the idea of honoring API misuse. If you neglect to dispose an IDisposable object, invoke methods on a disposed object, or concurrently access an object that is not thread-safe... then there may be consequences - now, or when the implementation changes.

Not disposing a MemoryStream is not misuse, nor is using ToArray after disposal.

Concurrent use is misuse, but at the same time, MemoryStream is used by millions of libraries and applications; there is guaranteed to be misuse. And we can't afford to introduce changes into such a type that could introduce a myriad of difficult to diagnose race conditions in entirely unrelated parts of the app, e.g. misuse of a MemoryStream over here causes sensitive data to be leaked to an http response over there. We need to be cognizant of such issues for new APIs, but it's a much bigger deal for existing ones, especially ones as core s MemoryStream.

@Timovzl
Copy link

Timovzl commented Nov 21, 2020

Arrays from the pool are more "valuable" than others, as they're much more likely to have been around for longer, be in higher generations, etc. There's also been discussion about using pinned/aligned arrays in the pool. So taking arrays from the pool and not returning them is generally worse on a system than allocating new ones, degrading other unrelated components using the pool.

I did not realize! That does limit our options... and support your proposition to not inherit from MemoryStream.

I'm still very curious if we even have examples where inheriting from MemoryStream poses an advantage. I don't seem to recall any APIs that take one as a parameter.

@stephentoub
Copy link
Member Author

I don't seem to recall any APIs that take one as a parameter.

To my knowledge there aren't any public APIs in the core libraries strongly-typed to accept a MemoryStream. No doubt there are cases outside of the core libraries.

@houseofcat
Copy link

houseofcat commented Jul 23, 2021

@stephentoub Would this possibly fall under Up For Grabs? Or have you made some traction internally?

I think ArrayPoolStream would work nicely, but add a little creative breathing room without need for backwards compatibility. Could also operate as a MemoryStream drop-in replacement for a large amount of use cases (those not misusing the Stream) to begin with.

@stephentoub
Copy link
Member Author

@houseofcat, sorry for my prolonged delay in responding; I missed the GitHub notification and only just saw the response.

It's up for grabs, but at this point that's about coming up with a design proposal rather than actually submitting a PR with the implementation.

@jeffhandley
Copy link
Member

Weighing the value of this against other Stream-related work, we've concluded that we will not pursue bringing this feature into .NET 8. We do expect to bring this feature into dotnet/runtime in the future, but for the time-being we encourage usage of the RecyclableMemoryStream implementation.

If folks encounter blockers for using RecyclableMemoryStream, the .NET team would consider contributing to that library to unblock those scenarios.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-needs-work API needs work before it is approved, it is NOT ready for implementation area-System.IO partner-impact This issue impacts a partner who needs to be kept updated Team:Libraries
Projects
None yet
Development

No branches or pull requests