Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API Proposal]: UTF8 Support for System.Data.Common.DbDataReader #57262

Closed
aersam opened this issue Aug 12, 2021 · 11 comments
Closed

[API Proposal]: UTF8 Support for System.Data.Common.DbDataReader #57262

aersam opened this issue Aug 12, 2021 · 11 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Data

Comments

@aersam
Copy link

aersam commented Aug 12, 2021

Background and motivation

There is Utf8JsonWriter which allows us to write UTF8 Strings directly to the browser.
However my source is often some database which always returns data as common C# Strings, which are UTF16. This means that some encoding is needed. In fact, MS SQL Server and many other databases support (eg MySql, Postgres) storing text data in UTF8. It does not make sense to me to convert all that UTF8 data to UTF16 and then back to UTF8. Of course one cannot support every possible Charset but UTF8 is different - I mean, it has it's special JsonWriter.

API Proposal

namespace System.Data.Common
{
     public class DbDataReader
     {
          public virtual ReadOnlySpan<Byte> GetUtf8String(int ordinal)
          {//Default implementation needed

          }
     }
}
// TBD: Add to IDataRecord as well (as we now have default implementations for interfaces), but that seems overkill to me

API Usage

DbCommand cmd;
System.Text.Json.Utf8JsonWriter jsonWriter;
using var rdr = await cmd.ExecuteReaderAsync();
while(await rdr.ReadAsync())
{
    jsonWriter.WriteStartObject();
    for(int i=0; i<rdr.FieldCount; i++)
    {
        jsonWriter.WriteString(propertyName: rdr.GetName(i), utf8Value: rdr.GetUtf8String(i));
    }
    jsonWriter.WriteEndObject();
}

Risks

As long as db providers do not implement UTF8 in their drivers, the new API does not benefit. It would just mean to do the UTF8->UTF16 conversion earlier.
Also, it might be a rare usecase as it's quite low-level. And it's unclear to me whether ORM's like Entity Framework could benefit from this or not. If they did use this interface, they would expose a lot more UTF8-Strings in C# which is not friendly to newbies. I think UTF8 Strings should stay low-level

@aersam aersam added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Aug 12, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Text.Json untriaged New issue has not been triaged by the area owner labels Aug 12, 2021
@ghost
Copy link

ghost commented Aug 12, 2021

Tagging subscribers to this area: @eiriktsarpalis, @layomia
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

There is Utf8JsonWriter which allows us to write UTF8 Strings directly to the browser.
However my source is often some database which always returns data as common C# Strings, which are UTF16. This means that some encoding is needed. In fact, MS SQL Server and many other databases support (eg MySql, Postgres) storing text data in UTF8. It does not make sense to me to convert all that UTF8 data to UTF16 and then back to UTF8. Of course one cannot support every possible Charset but UTF8 is different - I mean, it has it's special JsonWriter.

API Proposal

namespace System.Data.Common
{
     public class DbDataReader
     {
          public virtual ReadOnlySpan<Byte> GetUtf8String(int ordinal)
          {//Default implementation needed

          }
     }
}
// TBD: Add to IDataRecord as well (as we now have default implementations for interfaces), but that seems overkill to me

API Usage

DbCommand cmd;
System.Text.Json.Utf8JsonWriter jsonWriter;
using var rdr = await cmd.ExecuteReaderAsync();
while(await rdr.ReadAsync())
{
    jsonWriter.WriteStartObject();
    for(int i=0; i<rdr.FieldCount; i++)
    {
        jsonWriter.WriteString(propertyName: rdr.GetName(i), utf8Value: rdr.GetUtf8String(i));
    }
    jsonWriter.WriteEndObject();
}

Risks

As long as db providers do not implement UTF8 in their drivers, the new API does not benefit. It would just mean to do the UTF8->UTF16 conversion earlier.
Also, it might be a rare usecase as it's quite low-level. And it's unclear to me whether ORM's like Entity Framework could benefit from this or not. If they did use this interface, they would expose a lot more UTF8-Strings in C# which is not friendly to newbies. I think UTF8 Strings should stay low-level

Author: aersamkull
Assignees: -
Labels:

api-suggestion, area-System.Text.Json, untriaged

Milestone: -

@aersam
Copy link
Author

aersam commented Aug 12, 2021

Hey msft-bot: It's acutally area-System.Data :) Therefore @ajcvickers @cheenamalhotra @David-Engel , please have a look

@aersam aersam changed the title [API Proposal]: UTF8 Support for DbDataReader [API Proposal]: UTF8 Support for System.Data.Common.DbDataReader Aug 12, 2021
@ghost
Copy link

ghost commented Aug 12, 2021

Tagging subscribers to this area: @roji, @ajcvickers
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

There is Utf8JsonWriter which allows us to write UTF8 Strings directly to the browser.
However my source is often some database which always returns data as common C# Strings, which are UTF16. This means that some encoding is needed. In fact, MS SQL Server and many other databases support (eg MySql, Postgres) storing text data in UTF8. It does not make sense to me to convert all that UTF8 data to UTF16 and then back to UTF8. Of course one cannot support every possible Charset but UTF8 is different - I mean, it has it's special JsonWriter.

API Proposal

namespace System.Data.Common
{
     public class DbDataReader
     {
          public virtual ReadOnlySpan<Byte> GetUtf8String(int ordinal)
          {//Default implementation needed

          }
     }
}
// TBD: Add to IDataRecord as well (as we now have default implementations for interfaces), but that seems overkill to me

API Usage

DbCommand cmd;
System.Text.Json.Utf8JsonWriter jsonWriter;
using var rdr = await cmd.ExecuteReaderAsync();
while(await rdr.ReadAsync())
{
    jsonWriter.WriteStartObject();
    for(int i=0; i<rdr.FieldCount; i++)
    {
        jsonWriter.WriteString(propertyName: rdr.GetName(i), utf8Value: rdr.GetUtf8String(i));
    }
    jsonWriter.WriteEndObject();
}

Risks

As long as db providers do not implement UTF8 in their drivers, the new API does not benefit. It would just mean to do the UTF8->UTF16 conversion earlier.
Also, it might be a rare usecase as it's quite low-level. And it's unclear to me whether ORM's like Entity Framework could benefit from this or not. If they did use this interface, they would expose a lot more UTF8-Strings in C# which is not friendly to newbies. I think UTF8 Strings should stay low-level

Author: aersamkull
Assignees: -
Labels:

api-suggestion, area-System.Data, untriaged

Milestone: -

@danmoseley
Copy link
Member

It must have triggered off mention of Utf8JsonWriter

@GrabYourPitchforks
Copy link
Member

GrabYourPitchforks commented Aug 12, 2021

@danmoseley Another use case for an first class Utf8String type so that we can avoid passing ROS<byte> around everywhere? :)

Though I should mention that UTF-8 <-> UTF-16 conversion is fast. Very, very fast. And JSON's WriteString(..., utf16) is faster than WriteString(..., utf8) when dealing with non-ASCII data.

@roji
Copy link
Member

roji commented Aug 12, 2021

Duplicate of #28135

@roji roji marked this as a duplicate of #28135 Aug 12, 2021
@roji
Copy link
Member

roji commented Aug 12, 2021

Returning a ReadOnlySpan from DbDataReader is problematic in various ways...

  • First, there's no guarantee that the entire column data is buffered. Specifically, passing CommandBehavior.SequentialAccess to ExecuteReader explicitly requests streaming row/column access, e.g. for reading large columns, so that wouldn't work there. Even without CommandBehavior.SequentialAccess there's nothing formally requiring providers to hold the entire column in memory, so it's not clear exactly when this method would be callable and when it wouldn't.
  • Related to the above, your method above is synchronous, but DbDataReader has async APIs for reading columns - that wouldn't work with ReadOnlySpan. These APIs are again relevant for large rows in SequentialAccess mode, where e.g. seeking to the target column would involve network I/O to read more data.
  • This API would be pretty unsafe, since reading anything else from the DbDataReader could invalidate the ReadOnlySpan, as it's pointing to internal driver memory.

Ideally, if one day we get a 1st-class Utf8String (@GrabYourPitchforks), providers could then add support for returning that. It would still probably mean copying data in-memory once, so that we can hand the user an independent Utf8String instance that avoids all the lifecycle/buffering issues above; but it would still avoid the UTF8<->UTF16 conversion.

Note that you can simply read into a byte[] today (via DbDataReader.GetBytes or just GetFieldValue<byte[]>()) and pass that to Utf8JsonReader as a ROS<byte>.

@aersam
Copy link
Author

aersam commented Aug 12, 2021

I get the point with the Lifecycle/Buffering, but I actually did not want to go that far: I can live with a copy and just thought the data type for Utf8 Strings is ROS<byte> today. I don't care where it's pointing to - I'm a C# dev, not a C++ dev ;) Still, it's using less memory and less CPU. Maybe something like GetUtf8Chars/GetUtf8Bytes would be interesting then

I did not know that GetBytes works with strings, however I had to base on metadata that would tell me if it's a utf8 column or a varchar column or an nvarchar column (in case of MS SQL Server), right? That sounds very error prone

@roji
Copy link
Member

roji commented Aug 12, 2021

@aersamkull if you're OK with getting a copy, then yeah, simply asking for a byte array with DbDataReader.GetFieldValue<byte[]> should be a sufficient API. Whether a particular provider supports getting a string that way is another question - Npgsql does, but SqlClient probably doesn't. However, there's nothing stopping them from doing so - this doesn't necessarily require an API change at the ADO.NET level.

You're right that with SQL Server you'd need to know whether the column is is varchar or nvarchar (that can be checked via DbDataReader.GetDataTypeName); in that sense it sounds like you're proposing a new API whose sole purpose would be to check whether the column indeed contains UTF8 data, and then return a byte[] as usual. That sounds like something that could be easily done as an extension method (call GetDataTypeName, followed by GetFieldValue<byte[]>). More importantly, the varchar/nvarchar situation is fairly SQL Server-specific - PostgreSQL, for example, doesn't have a "unicode column type" - so I'm not sure this makes sense in ADO.NET.

@aersam
Copy link
Author

aersam commented Aug 12, 2021

I think the perfect solution would be an Utf8String type and something like GetFieldValue<Utf8String>()
Yes, until then I could implement it as an extension method and ask SqlClient for implementing GetBytes support for receiving text. Not very nice but ok for such a low-level thing

@roji
Copy link
Member

roji commented Aug 12, 2021

Sounds like the right way forward - and thanks for the proposal, it's important to know that people are interested in UTF8 strings etc.

I'll keep #28135 to track the more extreme proposal (which avoids copying in addition to avoiding decoding), but otherwise I don't think there's anything left to do here at the runtime level.

@roji roji closed this as completed Aug 12, 2021
@roji roji removed the untriaged New issue has not been triaged by the area owner label Aug 12, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Sep 11, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Data
Projects
None yet
Development

No branches or pull requests

4 participants