Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/storage/avro parser #11483

Conversation

seanmcc-msft
Copy link
Member

No description provided.

@seanmcc-msft seanmcc-msft marked this pull request as ready for review April 21, 2020 22:36
@seanmcc-msft
Copy link
Member Author

@kasobol-msft , @gapra-msft can you take a look?

{
public static async Task<byte[]> ReadFixedBytesAsync(
Stream stream,
int length,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worst case avro predicts is bytes that's as long as long can define. Would it be a problem for QQ ?
image

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the max size of an array in .Net is max int, so we can't return more bytes than that from this method. QQ records are limited to 4 MB, so I don't think this will be an issue.

https://docs.microsoft.com/en-us/dotnet/api/system.array.length?view=netstandard-2.0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is internal, so I'm not worried about getting into another "we need to expose longs" problem. Especially since every stream read method in .NET uses int for the count.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4MB sounds like safe limit.

Would returning Iterable ease "we need to expose longs" effort here? (should that ever happen).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could address this in the future, if this situation comes up. I'd prefer not to change this now, as it would require a significant refactor of the parser (which is currently working with Quick Query and Change Feed).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we at least leave a comment expressing this, so if anyone picks it up and enhances it, they can keep that in mind?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a comment mentioning this.

return bytes[0];
}

// Stolen because the linked references in the Avro spec were subpar...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might want to revisit that comment. End of day it's legit under their license.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the comment.

Comment on lines 107 to 112
// Validate codec
_metadata.TryGetValue(AvroConstants.CodecKey, out string codec);
if (codec == AvroConstants.DeflateCodec)
{
throw new ArgumentException("Deflate codec is not supported");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, fixed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


namespace Azure.Storage.Internal.Avro
{
internal class Record
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this for?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted, it was an artifact of a previous implementation attempt.

{
List<TestCase> testCases = new List<TestCase>
{
new TestCase("test_null_0.avro", o => Assert.IsNull(o)), // null
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deja vu :-)


namespace Azure.Storage.Internal.Avro.Tests
{
public class AvroReaderTests
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd love to see a test case where we're skipping some records, so that we leverage option to pass two streams into it.

CancellationToken cancellationToken)
{
byte b = await ReadByteAsync(stream, async, cancellationToken).ConfigureAwait(false);
return b != 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add explicit checking here - if 0 return false, if 1 return true, otherwise throw. I find it weird that the original Apache code didnt do this check

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

Stream stream,
bool async,
CancellationToken cancellationToken) =>
(int)(await ReadLongAsync(stream, async, cancellationToken).ConfigureAwait(false));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to do any checking here (make sure the num is int sized) or can we just let the cast fail?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its fine to let the cast fail.

bool async,
CancellationToken cancellationToken)
{
int size = await ReadIntAsync(stream, async, cancellationToken).ConfigureAwait(false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you either leave a comment saying this should be long as per spec, use int since array length is int or just make this read a long and convert that to int explicitly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment.

bool async,
CancellationToken cancellationToken);

public static AvroType FromSchema(JsonElement schema)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this may be neater if you split it further into FromStringSchema, FromArraySchema, FromObjectSchema

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it would be helpful to add constants for each type?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added constants for each type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split into FromStringSchema, FromArraySchema, FromObjectSchema.

Copy link
Member

@gapra-msft gapra-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I think in general it would be super helpful for a dev in the future(if they pick this up) if you documented some of the logic that's going on everywhere

@seanmcc-msft seanmcc-msft merged commit 6f8316b into Azure:feature/storage/stg73base Apr 23, 2020
@seanmcc-msft seanmcc-msft deleted the feature/storage/avroParser branch April 23, 2020 20:55
_initalized = false;
BlockOffset = currentBlockOffset;
ObjectIndex = indexWithinCurrentBlock;
_initalized = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplicate with line 88?

_initalized = false;
}

private async Task Initalize(bool async, CancellationToken cancellationToken = default)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: Initialize

/// <summary>
/// If this Avro Reader has been initalized.
/// </summary>
private bool _initalized;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo

}
}

public bool HasNext() => !_initalized || _itemsRemainingInBlock > 0;
Copy link
Member

@ljian3377 ljian3377 May 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it better we call Initalize() here in case the avro file contains 0 records?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants