Feature/storage/avro parser #11483
@kasobol-msft, @gapra-msft can you take a look?
{
    public static async Task<byte[]> ReadFixedBytesAsync(
        Stream stream,
        int length,
I believe the max size of an array in .NET is int.MaxValue, so we can't return more bytes than that from this method. Quick Query (QQ) records are limited to 4 MB, so I don't think this will be an issue.
https://docs.microsoft.com/en-us/dotnet/api/system.array.length?view=netstandard-2.0
This is internal, so I'm not worried about getting into another "we need to expose longs" problem. Especially since every stream read method in .NET uses int for the count.
4 MB sounds like a safe limit.
Would returning an Iterable ease the "we need to expose longs" effort here (should that ever happen)?
Maybe we could address this in the future, if this situation comes up. I'd prefer not to change this now, as it would require a significant refactor of the parser (which is currently working with Quick Query and Change Feed).
Can we at least leave a comment expressing this, so if anyone picks it up and enhances it, they can keep that in mind?
I added a comment mentioning this.
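To illustrate why an `int` length is a natural fit here, the following is a minimal sketch of reading exactly `length` bytes from a stream. It is not the PR's implementation: `ReadExactlyAsync` is a hypothetical name, and the snippet is written with C# top-level statements so it fits in a single file. Both the array allocation and `Stream.ReadAsync` take `int` counts, which is the constraint the thread describes.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not the PR's code): read exactly `length` bytes or throw.
static async Task<byte[]> ReadExactlyAsync(
    Stream stream,
    int length,
    CancellationToken cancellationToken = default)
{
    // The Avro spec uses a long for lengths, but a .NET array is indexed by int
    // and Stream.ReadAsync takes an int count, so int is used here.
    byte[] buffer = new byte[length];
    int offset = 0;

    // A single ReadAsync call may return fewer bytes than requested, so loop.
    while (offset < length)
    {
        int read = await stream
            .ReadAsync(buffer, offset, length - offset, cancellationToken)
            .ConfigureAwait(false);

        if (read == 0)
        {
            throw new EndOfStreamException(
                $"Expected {length} bytes but the stream ended after {offset}.");
        }

        offset += read;
    }

    return buffer;
}
```

If larger payloads ever need to be supported, a helper like this is the place where a long-based or chunked API would have to replace the single array.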
    return bytes[0];
}

// Stolen because the linked references in the Avro spec were subpar...
Might want to revisit that comment. At the end of the day, it's legit under their license.
Removed the comment.
// Validate codec
_metadata.TryGetValue(AvroConstants.CodecKey, out string codec);
if (codec == AvroConstants.DeflateCodec)
{
    throw new ArgumentException("Deflate codec is not supported");
}
I'd rather invert this validation, as there are more codecs available.
See https://github.com/apache/avro/blob/6b0d470a79b4c4e10d9890183d6c913608a2a225/lang/py3/avro/datafile.py#L94-L99
Good catch, fixed.
Fixed.
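The inverted validation suggested in this thread can be sketched as an allow-list: accept only the codecs the reader can decode and reject everything else. This is a hedged sketch, not the code that was merged. `ValidateCodec` is a hypothetical name, and the literal `"avro.codec"` key and `"null"` codec name are assumed from the Avro object container file format, where a missing codec entry means no compression.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not the PR's code): reject any codec the reader cannot decode.
static void ValidateCodec(IDictionary<string, string> metadata)
{
    // Assumption: a missing "avro.codec" entry means the "null" (uncompressed) codec.
    if (!metadata.TryGetValue("avro.codec", out string codec) || codec == null)
    {
        return;
    }

    // Allow-list: anything other than "null" is rejected, including codecs
    // that did not exist when this check was written.
    if (codec != "null")
    {
        throw new ArgumentException($"Codec '{codec}' is not supported");
    }
}
```

Compared with rejecting only deflate, this form fails fast on every other compressed file instead of misreading its blocks as raw data.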

namespace Azure.Storage.Internal.Avro
{
    internal class Record
what is this for?
Deleted, it was an artifact of a previous implementation attempt.
{
    List<TestCase> testCases = new List<TestCase>
    {
        new TestCase("test_null_0.avro", o => Assert.IsNull(o)), // null
deja vu :-)

namespace Azure.Storage.Internal.Avro.Tests
{
    public class AvroReaderTests
I'd love to see a test case where we skip some records, so that we exercise the option to pass two streams into it.
    CancellationToken cancellationToken)
{
    byte b = await ReadByteAsync(stream, async, cancellationToken).ConfigureAwait(false);
    return b != 0;
Can we add explicit checking here: if 0 return false, if 1 return true, otherwise throw? I find it weird that the original Apache code didn't do this check.
fixed.
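The explicit check requested above could look roughly like the following. This is a sketch rather than the merged change: `DecodeBoolean` is a hypothetical name and the exception type is an assumption. It relies on the Avro binary encoding, in which a boolean is a single byte with value 0 (false) or 1 (true).

```csharp
using System;

// Hypothetical helper (not the PR's code): decode an Avro boolean strictly.
static bool DecodeBoolean(byte b)
{
    switch (b)
    {
        case 0:
            return false;
        case 1:
            return true;
        default:
            // Any other value means the stream is corrupt or misaligned,
            // so surface it instead of treating every non-zero byte as true.
            throw new InvalidOperationException($"Invalid Avro boolean byte: {b}");
    }
}
```

Throwing on unexpected bytes makes a misaligned read visible at the first bad field rather than several records later.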
    Stream stream,
    bool async,
    CancellationToken cancellationToken) =>
    (int)(await ReadLongAsync(stream, async, cancellationToken).ConfigureAwait(false));
Do we want to do any checking here (make sure the number fits in an int), or can we just let the cast fail?
I think it's fine to let the cast fail.
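One caveat about this exchange: in C#, an explicit `(int)` cast from a `long` only throws in a checked context. In the default unchecked context it silently discards the high-order bits. The sketch below shows both behaviours; `ToInt32Checked` is a hypothetical helper name, and the snippet assumes the project does not enable overflow checking globally.

```csharp
using System;

// Hypothetical helper (not the PR's code): narrow a decoded Avro long to int,
// failing loudly when the value does not fit.
static int ToInt32Checked(long value)
{
    // checked(...) makes the narrowing conversion throw OverflowException
    // instead of silently truncating.
    return checked((int)value);
}

long tooBig = (long)int.MaxValue + 1;

// In an unchecked context the cast wraps around rather than failing.
int truncated = unchecked((int)tooBig);
Console.WriteLine(truncated == int.MinValue); // prints "True"

try
{
    ToInt32Checked(tooBig);
}
catch (OverflowException)
{
    Console.WriteLine("overflow detected");
}
```

`Convert.ToInt32(long)` is another standard way to get the throwing behaviour without relying on the compilation context.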
    bool async,
    CancellationToken cancellationToken)
{
    int size = await ReadIntAsync(stream, async, cancellationToken).ConfigureAwait(false);
Can you either leave a comment saying that this should be a long per the spec but we use int because array length is int, or just read a long and convert it to int explicitly?
Added a comment.
    bool async,
    CancellationToken cancellationToken);

public static AvroType FromSchema(JsonElement schema)
This may be neater if you split it further into FromStringSchema, FromArraySchema, and FromObjectSchema.
Do you think it would be helpful to add constants for each type?
Added constants for each type.
Split into FromStringSchema, FromArraySchema, FromObjectSchema.
Looks good. In general, I think it would be super helpful for a future dev (if they pick this up) if you documented some of the logic that's going on everywhere.
_initalized = false;
BlockOffset = currentBlockOffset;
ObjectIndex = indexWithinCurrentBlock;
_initalized = false;
duplicate with line 88?
    _initalized = false;
}

private async Task Initalize(bool async, CancellationToken cancellationToken = default)
typo: Initialize
/// <summary>
/// If this Avro Reader has been initalized.
/// </summary>
private bool _initalized;
typo
    }
}

public bool HasNext() => !_initalized || _itemsRemainingInBlock > 0;
Would it be better to call Initalize() here, in case the Avro file contains 0 records?