-
Notifications
You must be signed in to change notification settings - Fork 244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] Optimize remote Avro reading for a PartitionFile #5304
Comments
so to clarify here, the current Avro reader is basically reading all of the blocks metadata (start/end/length) in the file and then is filtering out the ones it really needs, correct? So its reading a lot more then it really needs to, which is slow. Does the CPU version skip? From a brief look it appears it uses: So I am assuming it is only reading what is necessary. From a high level your proposal makes sense. I'd have to look further but could we just use the Avro reader like the CPU side and use that sync function? |
We are introducing this sync in our new AVRO reader, you can find the draft code here, and it is used in the multi-threaded reader. |
can you please answer my questions to confirm my understanding. |
Yes, the current What's more, the operation which really takes time, I think, could be the |
Is your feature request related to a problem? Please describe.
Avro doesn't have a file level header to let us quickly find the correct range to read for a PartitionFile of Avro.
In current Avro multithread reading, we use "filterBlocks" to get the block position information by reading through the avro file remotely. From the log, we found it's too expensive to get the information of all blocks remotely.
Describe the solution you'd like
We should read 3 parts of an Avro file for a PartitionFile.
First, to understand the schema, we always need to read out the header of Avro file. We should use multithread pool to download the header part of the Avro file at the beginning, which includes magic and metadata. We need a reasonable length to read the header to avoid downloading multiple times to get the whole metadata. So that it can download once to get the header.
Second, downloading the range(start, end from PartitionFile) from the Avro file to local buffer by multithread pool. Use the same way from the code DataFileReader.sync, to figure out the sync marker as the beginning of the necessary block buffer. Then parse the block information from this local buffer, and compare to the start, end in PartitionFile to cover the blocks we need to read in. We should ignore the partial block before start position and keep reading the unfinished block at the end position. Append the buffer from the sync marker byte before the first block to the header buffer.
Third, reading the remaining part of the last block according to its block size, and append to the previous buffer.
Ideally, with 3 times of downloading, we can get the blocks for a PartitionFile.
Some more optimization for special cases like,
Case1: the start position is 0(the beginning of the Avro file, we can download for 1 & 2 in one because the header should be included in the [start, end] buffer.
Case2: the end position is the end of the Avro file, we can skip the Block parsing, just append everything to the buffer.
Case3: If the PartitionFile is the whole file, the downloading should be only once.
The text was updated successfully, but these errors were encountered: