Revisit DataContent and friends #5719
I get that the benefit of the subclasses is nonobvious, hence I was originally proposing to remove them too. However, the benefit of these subclasses isn't for the IChatClient implementations producing them; it's for the consumers. For example, if you want your app to support displaying image responses, you can check whether a content item is an ImageContent. I'd be fine with eliminating …
Sounds good
Agreed
Not sure; what would that be used for? The Stream contract implicitly relies on there being only one consumer (e.g., there's only one Position, and given asynchronous access you can't rely on being able to reset the Position back to an earlier value after access completes), so we'd be relying on the ecosystem to follow patterns where the Stream is only consumed in exactly one place. Is it so that, on successive …
I don't think so, but also not sure they should. With the OpenAI realtime APIs: …
Can you? What if the IChatClient just constructed the base DataContent for the image rather than the derived ImageContent?
Yes. Alternatively, we decide on some policy for how they can't be sent multiple times. Or maybe we could make it a …
What about scenarios where you're uploading a large amount of data to be analyzed, e.g. the video analysis example in https://aws.amazon.com/blogs/aws/introducing-amazon-nova-frontier-intelligence-and-industry-leading-price-performance/ (search for "Describe this video.")? Do we just say those require buffering in memory?
I think that would make it an unhelpful IChatClient, bordering on broken. It might as well represent the image as a URL inside a …
That's a good point, and I guess it is a strong reason to support them for outbound calls. However, it's not a reason to support re-sending the same stream multiple times just because it's in a preserved chat history. Maybe it's OK for the subsequent calls to fail with "stream was already disposed" or similar.
Should DataContent be abstract then?
If we also add …
DataContent is BinaryContent; we just renamed it so as not to conflict with the existing type of the same name.
What are the scenarios where you'd want to use DataContent/BinaryContent rather than a derived type?
Oh yes, I remember! So we'd need a different name for it.
This is why I was thinking maybe it could be a …
We want to support scenarios where there isn't a specialized derived type, right? For example, supplying a PDF or Word doc and asking "summarize this".
Yes, although I don't think any of the IChatClient implementations support that today. Do any of the services today accept arbitrary byte payloads as part of chat that aren't specific to image/video/audio?
It's an extra hoop to go through, but it could be named something like …
Answering my own question: yes, e.g. https://docs.aws.amazon.com/bedrock/latest/APIReference/API_runtime_DocumentBlock.html. Maybe we need a DocumentContent.
Also sometimes we've spoken of IChatClient as being more of a general API contract for arbitrary structured conversations.
Yeah, my concern is primarily that when you allow arbitrary DataContent to be created for arbitrary payloads and MIME types, you make it much more likely that content which "should" use derived types ends up using the base, at which point a consumer needs to handle both and you may as well not have the derived types. I think we either need to make it harder to use the base (by having it be abstract) and ensure we have derived types for all relevant categories, or do away with the derived types and make it easier to categorize based on media type.
If we think that's viable in practice, then it's the better option because it's more future-proof (when we add detection of a new category in the future, existing code that detects it manually still works, and there's no need for IChatClient implementations to change to use a new subtype). I'm not aware of existing parts of .NET like …
I'm now starting to take a look at this issue (I've self-assigned it), and before I start on the implementation, I want to summarize what I believe are the main conclusions from the discussion so far.
It seems like we landed on removing these derived types, replacing them with another mechanism that allows …
If we remove the derived types, then we won't be adding this. We'd probably instead take videos into account in the other logic determining the content type.
I was kind of struggling to see the benefit of this at first. The chat client implementations I'm aware of work via HTTP. So, when an HTTP-based chat client receives, for example, base64-encoded content from an HTTP response, it needs to load that content into memory anyway as part of parsing the JSON response, right? I guess you could still expose that data as a stream that reads the base64 and decodes it into the raw bytes, but this still requires the original representation to be loaded in memory.

Regarding sends: I was actually curious about whether it's possible to copy the content stream directly into the request stream as part of JSON serialization. I tried it out, but ran into dotnet/runtime#67337, which meant the content stream had to be read into memory before being written into the request JSON (see the sketch below). Funnily enough, a PR got merged just yesterday to address that issue. So I can see the potential benefit there. External libraries may need to include support for streams as well, though. For example, an …

Maybe the idea is that our abstractions should support streams as a way to enable individual chat client implementations to take advantage of them when applicable. If that's the reasoning, then it makes sense to me.
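For concreteness, here's a minimal sketch of the pre-fix buffering approach, assuming a custom System.Text.Json converter. The converter type and its name are illustrative only, not part of MEAI:

```csharp
using System;
using System.IO;
using System.Text.Json;
using System.Text.Json.Serialization;

// Sketch: serialize a Stream-valued payload as a base64 JSON string.
// Before the fix for dotnet/runtime#67337, Utf8JsonWriter offered no way to
// write base64 in chunks, so the stream must first be buffered in memory.
public sealed class StreamToBase64Converter : JsonConverter<Stream>
{
    public override Stream Read(
        ref Utf8JsonReader reader, Type typeToConvert, JsonSerializerOptions options)
        => new MemoryStream(reader.GetBytesFromBase64());

    public override void Write(
        Utf8JsonWriter writer, Stream value, JsonSerializerOptions options)
    {
        using var buffer = new MemoryStream();
        value.CopyTo(buffer); // the in-memory copy discussed above
        writer.WriteBase64StringValue(buffer.GetBuffer().AsSpan(0, (int)buffer.Length));
    }
}
```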
If we did the …
From @SteveSandersonMS's comment, it sounds like we wouldn't try to make …

Does this match your expectations, @stephentoub and @SteveSandersonMS?
Thanks, @MackinnonBuck. Once the paths forward are confirmed/refined, I think we'd be able to split the issue into two: one for the media type APIs and one for the stream handling.
If we did away with the derived types, I'd imagine we end up with something like this:
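Roughly, a sketch along these lines; member names are illustrative, not a committed API:

```csharp
// Illustrative sketch: DataContent as the single, non-abstract type carrying
// raw bytes (or a URI) plus a MIME type, with no Image/Audio subclasses.
public class DataContent : AIContent
{
    public DataContent(ReadOnlyMemory<byte> data, string? mediaType = null) { /* ... */ }
    public DataContent(Uri uri, string? mediaType = null) { /* ... */ }

    public string? MediaType { get; }
    public ReadOnlyMemory<byte>? Data { get; }

    // Hypothetical convenience helper for category checks by prefix:
    public bool MediaTypeStartsWith(string prefix) =>
        MediaType?.StartsWith(prefix, StringComparison.OrdinalIgnoreCase) ?? false;
}
```

Consumers would then branch on the media type (e.g. `content.MediaTypeStartsWith("image/")`) rather than on the CLR type.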
However, this would block us from adding strongly-typed information to … If the goal is to prevent customers from using …
After going back through this discussion, I'm inclined to change my view on the need for a first-class, built-in way to identify image-vs-audio-vs-video etc., and to consider instead what happens if we just go back to exposing a MIME type only.

Implementing MIME type checks may feel inconvenient to app developers, but in the end their UI technology only supports some particular set of image/audio/etc. formats. The app could use logic like … (something along the lines of the sketch below).

It's even more true in the case of documents, since what's the point of having MEAI identify that …

So my point is that, in the end, it doesn't make that much difference if we only expose the media type info and don't take on the responsibility of identifying image vs audio vs video vs document etc.
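For instance, a hypothetical app-side check; the helper name and format list here are made up for illustration:

```csharp
// Hypothetical app-side check: the UI can only render certain concrete
// formats, so it matches exact media types rather than broad categories.
static bool CanRenderInline(DataContent content) =>
    content.MediaType is "image/png" or "image/jpeg" or "image/gif" or "image/webp";
```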
As far as I know, there's no pressure for us to make … So my proposal simplifies down to: …
What do you think? It's been a few weeks since I thought about this, so maybe I missed something critical.
I think this definitely makes sense, especially from the perspective of the app developer. Chat clients may still need to perform a higher-level content type classification, though. For example, we need to at least distinguish between images and non-images in …
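For example, a hypothetical prefix-based classification that a chat client implementation could apply internally:

```csharp
// Hypothetical classification inside a chat client implementation: anything
// with an "image/*" media type maps to the provider's image block, while
// everything else maps to its document/binary block.
static bool IsImage(DataContent content) =>
    content.MediaType?.StartsWith("image/", StringComparison.OrdinalIgnoreCase) == true;
```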
That's true. I guess I was more concerned with not being able to do this within the framework. For example, I'm not sure we'd be able to implement #5524. Maybe in that specific case it would work to create an …

Edit: Looking into it more, I see there's an … So, I think this proposal sounds good 🙂
This was something I tried recently, but ran into dotnet/runtime#67337. However, a PR was just merged that addresses this, so I think this should be possible.
A few issues to sort out:
Derived types. We currently have DataContent (the base type), ImageContent (derives from DataContent), and AudioContent (derives from DataContent). It's not clear what significant value ImageContent/AudioContent add: they're just constructors plus the ability to apply stronger typing based on media type. But presumably chat client implementations should also handle cases where DataContent is used rather than the derived type (we're not currently), and once you're handling both, what's the benefit of having the derived types? (A simplified sketch of the current shape follows this list.)
Videos. We currently lack a VideoContent. If we're going to have ImageContent and AudioContent, we should also have VideoContent.
Non-in-memory data. We currently don't have any Stream-based support; everything needs to be loaded into memory. We should look at either adding a StreamingDataContent or augmenting DataContent with support for Stream. As these can be stored in chat message lists, we'll need to work through that. Should it require any Stream to be seekable? What should happen with JSON serialization employed as part of logging / caching / etc.?
Realtime. Do these content types work fine for representing partial data as might be sent or received as part of a streaming request / response?
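For reference, a simplified sketch of the current shape described above; this is illustrative and not the full API surface:

```csharp
// Simplified view of the current types (illustrative, not exhaustive):
public class DataContent : AIContent
{
    public DataContent(Uri uri, string? mediaType = null) { /* ... */ }
    public DataContent(ReadOnlyMemory<byte> data, string? mediaType = null) { /* ... */ }

    public string? MediaType { get; }          // e.g. "image/png"
    public ReadOnlyMemory<byte>? Data { get; } // null when only a URI was provided
}

// The derived types add no state or behavior beyond their constructors.
public class ImageContent : DataContent
{
    public ImageContent(Uri uri, string? mediaType = null) : base(uri, mediaType) { }
}

public class AudioContent : DataContent
{
    public AudioContent(Uri uri, string? mediaType = null) : base(uri, mediaType) { }
}

// There is no VideoContent today, which is one of the gaps noted above.
```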
cc: @SteveSandersonMS, @RogerBarreto, @eiriktsarpalis