Revisit DataContent and friends #5719

Open · stephentoub opened this issue Dec 3, 2024 · 21 comments

@stephentoub (Member)

A few issues to sort out:

  • Derived types. We currently have DataContent (base type), ImageContent (derives from DataContent), and AudioContent (derives from DataContent). It's not clear what significant value ImageContent / AudioContent add: they're just constructors plus stronger typing based on media type. But presumably chat client implementations should also handle cases where DataContent is used rather than the derived type (we currently don't), and once you're handling both, what's the benefit of having the derived types?

  • Videos. We currently lack a VideoContent. If we're going to have ImageContent and AudioContent, we should also have VideoContent.

  • Non-in-memory data. We currently don't have any Stream-based support; everything needs to be loaded into memory. We should look at either adding a StreamingDataContent or augmenting DataContent with support for Stream. As these can be stored in chat message lists, we'll need to work through that. Should it require any Stream to be seekable? What should happen with JSON serialization employed as part of logging / caching / etc.?

  • Realtime. Do these content types work fine for representing partial data as might be sent or received as part of a streaming request / response?

cc: @SteveSandersonMS, @RogerBarreto, @eiriktsarpalis

@SteveSandersonMS (Member)

It's not clear what significant value ImageContent / AudioContent add ... once you're handling both, what's the benefit of having the derived types

I get that the benefit of the subclasses is nonobvious, hence I was originally proposing to remove them too. However, the benefit of these subclasses isn't for IChatClient implementors - it's for app developers.

For example, if you want your app to support displaying image responses, you can check whether a content item is ImageContent and, if so, use its binary data in an <img> tag (in a web app) or the equivalent for native apps. If there were no ImageContent and all image/audio/video data was instead represented as DataContent, the app would need to do some MediaType parsing and detect a set of known MIME types, a layer of inconvenience that app devs likely don't want to deal with and don't think they should have to. Note that it's not sufficient just to check MediaType.StartsWith("image/") etc., given the existence of media types like multipart/voice-message or the application/* ones.
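
To make that concrete, here is a minimal sketch of this app-side dispatch, assuming the ImageContent/TextContent types as they exist today (member shapes were still in flux at the time, so treat the property names as illustrative):

```csharp
using System;
using Microsoft.Extensions.AI;

static void RenderContent(ChatMessage message)
{
    foreach (AIContent content in message.Contents)
    {
        switch (content)
        {
            case ImageContent image when image.Data is not null:
                // A web app could emit this as a data: URI in an <img> tag.
                Console.WriteLine(
                    $"<img src=\"data:{image.MediaType};base64,{Convert.ToBase64String(image.Data.Value.Span)}\" />");
                break;

            case TextContent text:
                Console.WriteLine(text.Text);
                break;

            // With only DataContent, the app would instead have to classify
            // MediaType itself, which prefix checks alone can't do reliably
            // (e.g. multipart/voice-message is audio).
        }
    }
}
```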

I'd be fine with eliminating ImageContent/AudioContent if we also add some way other than MIME type to identify "this is an image" vs audio etc. For example we could have a mandatory ContentType enum with values (image, audio, binary). That's almost equivalent to using subclasses as a discriminator.

Videos. We currently lack a VideoContent. If we're going to have ImageContent and AudioContent, we should also have VideoContent.

Sounds good

We should look at either adding a StreamingDataContent or augmenting DataContent with support for Stream.

Agreed

Should it require any Stream to be seekable?

Not sure, what would that be used for? The Stream contract implicitly relies on there being only one consumer (e.g., there's only one Position, and given asynchronous access you can't rely on being able to reset the Position back to an earlier value after access completes) so we'd be relying on the ecosystem to follow patterns where the Stream is only consumed in exactly one place.

Is it so that, on successive CompleteAsync calls, the same stream can be re-read from an existing message history to be re-sent? That implies the stream wasn't disposed, so I'm not sure how that would be orchestrated (ChatMessage itself isn't disposable). Re-sending streams may be catastrophic for perf/bandwidth anyway so we might want to actively prevent anyone seeking the stream, thereby intentionally limiting this to being used when reading results and not for outbound calls.

Do these content types work fine for representing partial data as might be sent or received as part of a streaming request / response?

I don't think so, but I'm also not sure they should. With the OpenAI realtime APIs:

  • For sending local microphone data to the service, it just expects you to give it a Stream (example). I'm not sure there needs to be a further content abstraction.
  • For receiving chunks of audio data to play back to the speakers, you get a BinaryData (example). Even if we wrapped this in AudioContent, that doesn't contain the information you need in order to do useful things with it. You need to know the audio format, sample rate, and number of channels, none of which are present on AudioContent.
    • It makes sense to differentiate "self-contained media files" from "streaming chunks" in that media files (as a combination of bytes and MediaType) already hold all the information needed to do something with them. Streaming chunks, on the other hand, don't repeat their metadata in every chunk and can only be used in conjunction with all the preceding chunks in the stream.

@stephentoub (Member Author) commented Dec 4, 2024

For example, if you want your app to support displaying image responses, you can check if a content item is ImageContent

Can you? What if the IChatClient just constructed the base DataContent for the image rather than the derived ImageContent?

Is it so that, on successive CompleteAsync calls, the same stream can be re-read from an existing message history to be re-sent?

Yes. Alternatively, we decide on some policy under which they can't be sent multiple times. Or maybe we could make it a Func<Stream> as a factory, with ownership transfer of the result?

Re-sending streams may be catastrophic for perf/bandwidth anyway so we might want to actively prevent anyone seeking the stream, thereby intentionally limiting this to being used when reading results and not for outbound calls.

What about scenarios where you're uploading a large amount of data to be analyzed, e.g. the video analysis example in https://aws.amazon.com/blogs/aws/introducing-amazon-nova-frontier-intelligence-and-industry-leading-price-performance/ (search for "Describe this video")? Do we just say those require buffering in memory?

@SteveSandersonMS (Member)

Can you? What if the IChatClient just constructed the base DataContent for the image rather than the derived ImageContent?

I think that would make it an unhelpful IChatClient, bordering on broken. It might as well represent the image as a URL inside a TextContent or some other unhelpful representation. IChatClient implementations shouldn't do that, and if they do, applications may not be able to work well with them.

What about scenarios where you're uploading a large amount of data to be analyzed

That's a good point, and I guess it is a strong reason to support them for outbound calls. However, it's not a reason to support re-sending the same stream multiple times just because it's in a preserved chat history. Maybe it's OK for the subsequent calls to fail with "stream was already disposed" or similar.

@stephentoub (Member Author)

I think that would make it an unhelpful IChatClient, bordering on broken

Should DataContent be abstract then?

@SteveSandersonMS (Member)

Should DataContent be abstract then?

If we also add BinaryContent then we can do that.

@stephentoub (Member Author)

DataContent is BinaryContent; we just renamed it so as not to conflict with the existing type of the same name.

@stephentoub (Member Author)

What are the scenarios where you'd want to use DataContent/BinaryContent rather than a derived type?

@SteveSandersonMS (Member)

Oh yes, I remember! So we'd need a different name for it.

@stephentoub (Member Author)

However, it's not a reason to support re-sending the same stream multiple times just because it's in a preserved chat history. Maybe it's OK for the subsequent calls to fail with "stream was already disposed" or similar.

This is why I was thinking maybe it could be a Func<Stream>-based factory rather than just a Stream. The semantics would dictate invoking the Func and disposing the Stream on each use. That's better for sends. Not sure it works well for receives, though.

@SteveSandersonMS (Member)

What are the scenarios where you'd want to use DataContent/BinaryContent rather than a derived type?

We want to support scenarios where there isn't a specialized derived type, right? For example, supplying a PDF or Word doc and asking "summarize this".

@stephentoub (Member Author)

What are the scenarios where you'd want to use DataContent/BinaryContent rather than a derived type?

We want to support scenarios where there isn't a specialized derived type, right? For example, supplying a PDF or Word doc and asking "summarize this".

Yes, although I don't think any of the IChatClient implementations support that today. Do any of the services today accept arbitrary byte payloads as part of chat that aren't specific to image/video/audio?

@SteveSandersonMS (Member)

Not sure it works well for receives, though.

It's an extra hoop to go through but it could be named something like TryTakeStream(..., out var stream) so (1) the ownership-transfer semantics are clear from "take", and (2) it can fail gracefully if you call it a second time.

@stephentoub (Member Author) commented Dec 4, 2024

What are the scenarios where you'd want to use DataContent/BinaryContent rather than a derived type?

We want to support scenarios where there isn't a specialized derived type, right? For example, supplying a PDF or Word doc and asking "summarize this".

Yes, although I don't think any of the IChatClient implementations support that today. Do any of the services today accept arbitrary byte payloads as part of chat that aren't specific to image/video/audio?

Answering my own question: yes, e.g. https://docs.aws.amazon.com/bedrock/latest/APIReference/API_runtime_DocumentBlock.html

Maybe we need a DocumentContent.

@SteveSandersonMS (Member) commented Dec 4, 2024

Answering my own question, yes

Also sometimes we've spoken of IChatClient as being more of a general API contract for arbitrary structured conversations.

@stephentoub (Member Author) commented Dec 4, 2024

Answering my own question, yes

Also sometimes we've spoken of IChatClient as being more of a general API contract for arbitrary structured conversations.

Yeah, my concern is primarily that when you allow arbitrary DataContent to be created for arbitrary payloads and MIME types, you make it much more likely that content which "should" use a derived type ends up using the base, at which point a consumer needs to handle both and you may as well not have the derived types. I think we either need to make it harder to use the base (by having it be abstract) and ensure we have derived types for all relevant categories, or do away with the derived types and make it easier to categorize based on media type.

@SteveSandersonMS (Member)

make it easier to categorize based on media type

If we think that's viable in practice then it's the better option because it's more future-proof (when we add detection of a new category in the future, existing code that detects it manually still works, and there's no need for IChatClient implementations to change to use a new subtype).

I'm not aware of existing parts of .NET like System.Net.Mime containing logic that categorizes media types in any opinionated way, but maybe it would suffice for us just to have some simple logic that checks for prefixes like image/ etc. for now, and retain the option to put in further special cases in the future. In any case where our logic fails to account for a special case that occurs in practice, app devs can still do their own categorization manually based on the MediaType.
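
For example, the kind of simple logic described here could look like the following (the helper and its category strings are illustrative, not a proposed API):

```csharp
using System;

// Illustrative only: prefix-based categorization, with room to append
// special cases over time without breaking existing callers.
static string ClassifyMediaType(string? mediaType) => mediaType switch
{
    null => "binary",
    "multipart/voice-message" => "audio", // example special case
    string m when m.StartsWith("image/", StringComparison.OrdinalIgnoreCase) => "image",
    string m when m.StartsWith("audio/", StringComparison.OrdinalIgnoreCase) => "audio",
    string m when m.StartsWith("video/", StringComparison.OrdinalIgnoreCase) => "video",
    _ => "binary",
};
```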

@MackinnonBuck (Member) commented Jan 16, 2025

I'm now starting to take a look at this issue (I've self-assigned it), and before I start on the implementation, I want to summarize what I believe are the main conclusions from the discussion so far.

  • Derived types. ... It's not clear what significant value ImageContent / AudioContent add... what's the benefit of having the derived types.

It seems like we landed on removing these derived types, replacing them with another mechanism that allows IChatClient implementations to determine the content type.

  • Videos. We currently lack a VideoContent....

If we remove the derived types, then we won't be adding this. We'd probably instead take videos into account in the other logic determining the content type.

  • Non-in-memory data. We currently don't have any Stream-based support; everything needs to be loaded into memory. We should look at either adding a StreamingDataContent or augmenting DataContent with support for Stream. As these can be stored in chat message lists, we'll need to work through that...

I was kind of struggling to see the benefit of this at first. The chat client implementations I'm aware of work via HTTP. So, when an HTTP-based chat client receives, for example, base64-encoded content in an HTTP response, it needs to load that content into memory anyway as part of parsing the JSON response, right? I guess you could still expose that data as a stream that reads the base64 and decodes it into the raw bytes, but this still requires the original representation to be loaded in memory.

Regarding sends: I was actually curious whether it's possible to copy the content stream directly into the request stream as part of JSON serialization. I tried it out, but ran into dotnet/runtime#67337, which meant the content stream had to be read into memory before being written into the request JSON. Funnily enough, a PR got merged just yesterday to address that issue. So I can see the potential benefit there. External libraries may need to include support for streams as well, though. For example, an Azure.AI.Inference.ChatMessageImageContentItem can be initialized from a Stream, but it just reads the stream into memory anyway.
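
To illustrate the limitation, a hypothetical converter for stream-valued content has to buffer, because Utf8JsonWriter.WriteBase64StringValue takes the complete payload as one contiguous span (StreamContentConverter is a made-up name here, not an existing API):

```csharp
using System;
using System.IO;
using System.Text.Json;
using System.Text.Json.Serialization;

sealed class StreamContentConverter : JsonConverter<Stream>
{
    public override void Write(Utf8JsonWriter writer, Stream value, JsonSerializerOptions options)
    {
        // The whole stream must be drained into memory before writing,
        // since WriteBase64StringValue needs a single contiguous span.
        using var buffer = new MemoryStream();
        value.CopyTo(buffer);
        writer.WriteBase64StringValue(buffer.GetBuffer().AsSpan(0, (int)buffer.Length));
    }

    public override Stream Read(ref Utf8JsonReader reader, Type typeToConvert, JsonSerializerOptions options)
        => new MemoryStream(reader.GetBytesFromBase64());
}
```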

Maybe the idea is that our abstractions should support streams as a way to enable individual chat client implementations to take advantage of them when applicable. If that's the reasoning, then it makes sense to me.

@stephentoub: This is why I was thinking maybe it could be a Func<Stream>-based factory rather than just a Stream. The semantics would dictate invoking the Func and disposing the Stream on each use. That's better for sends. Not sure it works well for receives, though.

@SteveSandersonMS: It's an extra hoop to go through but it could be named something like TryTakeStream(..., out var stream) so (1) the ownership-transfer semantics are clear from "take", and (2) it can fail gracefully if you call it a second time.

If we did the Func<Stream> approach, would we still need the limitation that you couldn't read the stream more than once? The default behavior could be that a new stream gets read whenever it's sent as part of the conversation. If the developer doesn't want that to happen, they could remove it from the message list. I suppose this is kind of a similar problem to the one that @SteveSandersonMS describes in #5747.

  • Realtime. Do these content types work fine for representing partial data as might be sent or received as part of a streaming request / response?

From @SteveSandersonMS's comment it sounds like we wouldn't try to make DataContent work for streaming content.

Does this match your expectations, @stephentoub and @SteveSandersonMS?

@jeffhandley (Member)

Thanks, @MackinnonBuck. Once the paths forward are confirmed/refined, I think we'd be able to split the issue into two: one for the media type APIs and one for the stream handling.

@MackinnonBuck (Member)

If we did away with the derived types, I'd imagine we end up with something like this:

  • Add a ContentKind enum to represent the various content types
  • Add a Kind property on DataContent whose type is ContentKind
  • As part of the media type validation (already implemented today), also infer the ContentKind and use that to initialize the Kind property value
  • Probably also add new constructors to DataContent that allow you to specify a ContentKind without specifying a media type (this would be equivalent to, e.g., creating an ImageContent instance without providing a mediaType).
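
Sketched out, that could look roughly like the following (named KindedDataContent here only to avoid colliding with the real DataContent; ContentKind, Kind, and the constructors are proposals from this thread, not shipped APIs):

```csharp
using System;
using Microsoft.Extensions.AI;

public enum ContentKind { Binary, Image, Audio, Video }

public class KindedDataContent : AIContent
{
    public ReadOnlyMemory<byte> Data { get; }
    public string? MediaType { get; }
    public ContentKind Kind { get; }

    // Kind is inferred alongside the existing media-type validation.
    public KindedDataContent(ReadOnlyMemory<byte> data, string? mediaType = null)
        : this(data, InferKind(mediaType)) => MediaType = mediaType;

    // Equivalent to creating an ImageContent without providing a mediaType.
    public KindedDataContent(ReadOnlyMemory<byte> data, ContentKind kind)
    {
        Data = data;
        Kind = kind;
    }

    private static ContentKind InferKind(string? mediaType) => mediaType switch
    {
        string m when m.StartsWith("image/", StringComparison.OrdinalIgnoreCase) => ContentKind.Image,
        string m when m.StartsWith("audio/", StringComparison.OrdinalIgnoreCase) => ContentKind.Audio,
        string m when m.StartsWith("video/", StringComparison.OrdinalIgnoreCase) => ContentKind.Video,
        _ => ContentKind.Binary,
    };
}
```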

However, this would block us from adding strongly-typed information to DataContent variants. The existence of #5524 suggests that there's a potential need for this.

If the goal is to prevent customers from using DataContent when they should be using one of its derivatives, then I'd prefer making DataContent abstract. We could add a BinaryContent/DocumentContent for content types that don't fit into one of the established categories, and we have the ability to add strongly-typed information to each type as we see fit.

@SteveSandersonMS (Member) commented Jan 20, 2025

After going back through this discussion, I'm inclined to change my view on the need to have a first-class built-in way to identify image-vs-audio-vs-video etc., and consider instead simply going back to exposing a MIME type only.

Implementing MIME type checks may feel inconvenient to app developers, but in the end their UI technology only supports some particular set of image/audio/etc formats. The app could use logic like if (content.MediaType.StartsWith("image/")) and hope that it only receives supported image formats, and that will basically work in practice at least for web apps. But if they need to be more explicit about which exact formats their UI tech supports, they still have to check against a list of supported media types because maybe their UI tech doesn't support image/super-obscure-format even if MEAI could identify it as an image.

It's even more true in the case of documents, since what's the point of having MEAI identify that application/vnd.mariobros-savegame is a "document"? Apps in general don't support all possible document formats.

So my point is that in the end, it doesn't make that much difference if we only expose the media type info and don't take on the responsibility of identifying image vs audio vs video vs document etc.

However, this would block us from adding strongly-typed information to DataContent variants.

As far as I know there's no pressure for us to make DataContent sealed. As of #5789 it's possible to add custom AIContent subclasses, so developers could use this to define DataContent subclasses with custom strongly-typed properties. I know that reopens the ambiguity about whether the consumer would receive that subclass or the base class, but that's the app dev's own choice and their own problem if they mix things up. As long as we don't create that problem by having both DataContent and ImageContent etc., we're not encouraging anyone to get into difficulty and most people won't.

So my proposal simplifies down to:

  • Remove the existing subclasses of DataContent, but leave DataContent unsealed
  • As a separate work item, work out what to do with streaming data.
    • The best current proposal seems to be adding another type StreamingDataContent which does not inherit from DataContent but rather inherits AIContent directly. The person instantiating this type would need to pass in a suitable factory/callback to the constructor that would return the stream. The person consuming this type would call content.TryTakeStream(out var stream), which may or may not give them a stream (depending on whether it's capable of supplying more than one reader) and if it does, hands responsibility for disposing that stream instance to the consumer.
      • I'm unclear on how this interacts with JSON serialization. Is there at least one practical case where this actually avoids buffering a large file in memory? I know it probably doesn't help with receiving any data from JSON-over-HTTP. But maybe it does help when sending JSON-over-HTTP. If there are no cases today where this actually avoids buffering in practice, then we might choose not to prioritize doing this now.
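
A minimal sketch of that shape, under the semantics described above (single-take shown for simplicity; a variant capable of supplying more than one reader could re-invoke the factory):

```csharp
using System;
using System.IO;
using System.Threading;
using Microsoft.Extensions.AI;

// Sketch only: the type name and TryTakeStream are proposals from this thread.
public class StreamingDataContent : AIContent
{
    private Func<Stream>? _streamFactory;

    public StreamingDataContent(Func<Stream> streamFactory)
        => _streamFactory = streamFactory ?? throw new ArgumentNullException(nameof(streamFactory));

    // "Take" signals ownership transfer: the caller disposes the stream,
    // and a second take fails gracefully by returning false.
    public bool TryTakeStream(out Stream? stream)
    {
        Func<Stream>? factory = Interlocked.Exchange(ref _streamFactory, null);
        stream = factory?.Invoke();
        return stream is not null;
    }
}
```

A consumer would then write if (content.TryTakeStream(out var stream)) { using (stream) { ... } }, keeping the disposal responsibility explicit.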

What do you think? It's been a few weeks since I thought about this so maybe I missed something critical.

@MackinnonBuck (Member) commented Jan 21, 2025

So my point is that in the end, it doesn't make that much difference if we only expose the media type info and don't take on the responsibility of identifying image vs audio vs video vs document etc.

I think this definitely makes sense, especially from the perspective of the app developer. Chat clients may still need to perform a higher-level content type classification, though. For example, we need to at least distinguish between images and non-images in AzureAIInferenceChatClient and OllamaChatClient, but this can probably just be done with an internal helper. For external chat client implementations, we could say it's their task to decide how to interpret various MIME types and determine whether they're supported by the underlying model, etc., similar to what the app developer would need to do.

However, this would block us from adding strongly-typed information to DataContent variants.

As of #5789 it's possible to add custom AIContent subclasses, so developers could use this to define DataContent subclasses with custom strongly-typed properties.

That's true. I guess I was more concerned with not being able to do this within the framework. For example, I'm not sure we'd be able to implement #5524. Maybe in that specific case it would work to create an internal subclass with that extra information (or maybe not; I don't really know the details there).

Edit: Looking into it more, I see there's an AIContent.AdditionalProperties that already lets us add information to any AIContent, and we already use it to store image detail. It's just not strongly-typed, but I think that's fine.
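
For instance, attaching a detail hint could look like this (the "detail" key mirrors how image detail is stored today, but treat the exact key and constructor shape as illustrative):

```csharp
using System;
using System.IO;
using Microsoft.Extensions.AI;

byte[] imageBytes = File.ReadAllBytes("photo.png"); // any image payload

var image = new DataContent(imageBytes, "image/png");
image.AdditionalProperties ??= new AdditionalPropertiesDictionary();
image.AdditionalProperties["detail"] = "high"; // extra info without a subclass

// A consumer reads it back the same way:
if (image.AdditionalProperties.TryGetValue("detail", out object? detail))
{
    Console.WriteLine($"Requested image detail: {detail}");
}
```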

So, I think this proposal sounds good 🙂

As a separate work item, work out what to do with streaming data... I'm unclear on how this interacts with JSON serialization. Is there at least one practical case where this actually avoids buffering a large file in memory?

This was something I tried recently, but ran into dotnet/runtime#67337. However, a PR was just merged that addresses this, so I think this should be possible.
