
[FEA] Optimize remote Avro reading for a PartitionFile #5304

Closed
GaryShen2008 opened this issue Apr 25, 2022 · 4 comments · Fixed by #5421
Labels
performance A performance related task/issue

Comments

@GaryShen2008
Collaborator

GaryShen2008 commented Apr 25, 2022

Is your feature request related to a problem? Please describe.
An Avro file has no file-level index of block positions that would let us quickly find the correct range to read for a PartitionFile.
In the current multithreaded Avro reading, we use "filterBlocks" to collect the position information of every block by reading through the Avro file remotely. From the logs, we found it is too expensive to gather the information of all blocks remotely.
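(For illustration only, this is not the actual filterBlocks code: a rough Scala sketch of that kind of remote block scan, where every block header is fetched over the network just to learn the block boundaries. `BlockInfo`, `scanAllBlocks`, and `firstBlockPos` are hypothetical names; the Avro decoder and Hadoop stream calls are standard APIs.)

```scala
import org.apache.avro.file.DataFileConstants
import org.apache.avro.io.DecoderFactory
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

case class BlockInfo(offset: Long, count: Long, dataSize: Long)   // hypothetical holder

def scanAllBlocks(path: Path, firstBlockPos: Long, fileLen: Long,
    conf: Configuration): Seq[BlockInfo] = {
  val in = path.getFileSystem(conf).open(path)
  try {
    val blocks = scala.collection.mutable.ArrayBuffer.empty[BlockInfo]
    var pos = firstBlockPos                       // first byte after the file header
    while (pos < fileLen) {
      in.seek(pos)                                // one remote seek per block
      // A direct (non-buffering) decoder keeps the stream position predictable.
      val dec = DecoderFactory.get().directBinaryDecoder(in, null)
      val count = dec.readLong()                  // number of objects in the block
      val size  = dec.readLong()                  // serialized size of the block data
      blocks += BlockInfo(pos, count, size)
      pos = in.getPos + size + DataFileConstants.SYNC_SIZE   // skip data + sync marker
    }
    blocks.toSeq
  } finally in.close()
}
```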

Describe the solution you'd like
For a PartitionFile, we should read three parts of the Avro file:

  1. The header of Avro file
  2. The range from PartitionFile with start, end
  3. The remaining block bytes after end position

First, we always need to read the header of the Avro file to get the schema. The multithreaded pool should download the header part, which contains the magic bytes and the metadata, at the beginning. We should choose a reasonable read length so that the whole metadata normally arrives in a single download rather than several.
Second, download the range (start, end from the PartitionFile) from the Avro file into a local buffer with the multithreaded pool. Using the same logic as DataFileReader.sync, locate the sync marker that marks the beginning of the needed block buffer. Then parse the block information from this local buffer and compare it against the start and end of the PartitionFile to determine which blocks to read: the partial block before the start position is ignored, while the block that is still unfinished at the end position is kept for further reading. Append the buffer, starting from the sync marker just before the first block, to the header buffer.
Third, read the remaining part of the last block according to its declared block size and append it to the previous buffer.
Ideally, three downloads are enough to get all the blocks for a PartitionFile (see the sketch after the special cases below).

There are further optimizations for special cases:
Case 1: the start position is 0 (the beginning of the Avro file), so downloads 1 and 2 can be combined, because the header is already included in the [start, end] buffer.
Case 2: the end position is the end of the Avro file, so the block parsing can be skipped and everything is simply appended to the buffer.
Case 3: the PartitionFile covers the whole file, so a single download is enough.
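Here is a minimal Scala sketch of the three-download flow. `downloadRange` is implemented with a plain Hadoop positioned read, while `parseHeader`, `parseBlocks`, and `indexOfSync` are hypothetical helpers (header parsing, block walking, sync-marker search); the real implementation would issue the downloads through the multithreaded pool:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

case class AvroHeader(bytes: Array[Byte], syncMarker: Array[Byte])   // magic + metadata + sync, and the 16-byte marker
case class BlockMeta(count: Long, dataSize: Long, totalSize: Long)   // totalSize: count/size longs + data + sync

// Read [offset, offset + len) of the remote file into a local buffer.
def downloadRange(fs: FileSystem, path: Path, offset: Long, len: Int): Array[Byte] = {
  val buf = new Array[Byte](len)
  val in = fs.open(path)
  try { in.readFully(offset, buf); buf } finally in.close()
}

def parseHeader(bytes: Array[Byte]): AvroHeader = ???                      // assumed helper: magic + metadata + sync
def parseBlocks(buf: Array[Byte], sync: Array[Byte]): Seq[BlockMeta] = ??? // assumed helper: walk count/size pairs
def indexOfSync(buf: Array[Byte], sync: Array[Byte]): Int = ???            // assumed helper: first sync marker, or -1

def readPartition(fs: FileSystem, path: Path, start: Long, end: Long, fileLen: Long): Array[Byte] = {
  val out = new java.io.ByteArrayOutputStream()

  // 1. Header download: one generously sized read so the whole metadata map
  //    (schema, codec, sync marker) normally arrives in a single round trip.
  //    (Case 1/3: if start == 0 this read and the range read can be combined.)
  val header = parseHeader(downloadRange(fs, path, 0, math.min(64L * 1024, fileLen).toInt))
  out.write(header.bytes)

  // 2. Range download: fetch [start, end), then locate the first sync marker in
  //    the local buffer (same idea as DataFileReader.sync). The partial block
  //    before it is dropped; since the header buffer already ends with a sync
  //    marker, everything after the located marker is appended.
  val range = downloadRange(fs, path, start, (end - start).toInt)
  val firstBlock = indexOfSync(range, header.syncMarker) + header.syncMarker.length
  val blocks = parseBlocks(range.drop(firstBlock), header.syncMarker)
  out.write(range, firstBlock, range.length - firstBlock)

  // 3. Tail download: the last block may extend past `end`; fetch its missing
  //    bytes according to the declared block size and append them.
  val declaredEnd = firstBlock + blocks.map(_.totalSize).sum
  val missing = declaredEnd - range.length
  if (missing > 0) out.write(downloadRange(fs, path, end, missing.toInt))

  out.toByteArray
}
```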

@GaryShen2008 GaryShen2008 added feature request New feature or request ? - Needs Triage Need team to review and classify labels Apr 25, 2022
@GaryShen2008 GaryShen2008 changed the title [FEA] Optimize Avro reading in Multithread for a PartitionFile [FEA] Optimize Avro reading for a PartitionFile Apr 25, 2022
@GaryShen2008 GaryShen2008 changed the title [FEA] Optimize Avro reading for a PartitionFile [FEA] Optimize remote Avro reading for a PartitionFile Apr 25, 2022
@sameerz sameerz added performance A performance related task/issue and removed feature request New feature or request ? - Needs Triage Need team to review and classify labels Apr 26, 2022
@tgravescs
Collaborator

So to clarify here: the current Avro reader is basically reading all of the blocks' metadata (start/end/length) in the file and then filtering down to the ones it really needs, correct? So it's reading a lot more than it really needs to, which is slow.

Does the CPU version skip the unneeded blocks? From a brief look it appears it uses:
reader.sync(partitionedFile.start) (which you linked to above) and
override val stopPosition = partitionedFile.start + partitionedFile.length

So I am assuming it is only reading what is necessary.

From a high level your proposal makes sense. I'd have to look further, but could we just use the Avro reader like the CPU side does and use that sync function?
DataFileReader.openReader(in, datumReader)
If you can't use their reader, then reusing the logic from the sync functionality makes sense.
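For reference, the CPU-style bounded read with the stock Avro reader looks roughly like the sketch below (a sketch assuming org.apache.avro.mapred.FsInput for opening the Hadoop path; error handling and the actual record handling are omitted):

```scala
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.mapred.FsInput
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Read only the blocks that overlap [start, start + length) of one partition.
def readRange(path: Path, start: Long, length: Long, conf: Configuration): Unit = {
  val reader = DataFileReader.openReader(
    new FsInput(path, conf), new GenericDatumReader[GenericRecord]())
  try {
    val stopPosition = start + length
    reader.sync(start)                         // jump to the first sync marker at or after `start`
    while (reader.hasNext && !reader.pastSync(stopPosition)) {
      val record = reader.next()               // block metadata and data are consumed together
      // ... hand `record` to downstream processing
    }
  } finally reader.close()
}
```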

@firestarman
Collaborator

firestarman commented Apr 29, 2022

We are introducing this sync in our new AVRO reader; you can find the draft code here, and it is used in the multi-threaded reader.

@tgravescs
Collaborator

Can you please answer my questions to confirm my understanding?

@firestarman
Collaborator

firestarman commented May 4, 2022

Yes, the current filterBlocks reads all of the blocks' metadata in a file first, then filters out the unnecessary blocks according to start and stopPosition. Later the reader opens the file again to read the necessary data according to the filtered blocks' metadata.
Yes, the CPU reader is only reading what is necessary. Besides, it does not collect the blocks' metadata first; instead it reads the block metadata and data together in the iterator pattern, so the CPU reader reads through only the necessary part of the file, and only once.

What's more, I think the operation that really takes time could be the seek for each block, since in our test case the perf is still bad even after introducing sync and pastSync to read only the necessary blocks.
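One way to avoid that per-block remote seek is the buffering idea from the proposal above: stitch the downloaded header bytes and block bytes together and let the stock Avro reader parse the blocks from memory. A hedged sketch, assuming `headerBytes` ends with the header's sync marker and `blockBytes` starts right after a sync marker and contains only complete blocks:

```scala
import org.apache.avro.file.{DataFileReader, SeekableByteArrayInput}
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}

// headerBytes: magic + metadata + trailing sync marker (from the header download).
// blockBytes: complete blocks starting right after a sync marker (range + tail downloads).
// Parsing happens entirely in memory, so there is no remote seek per block.
def readStitched(headerBytes: Array[Byte], blockBytes: Array[Byte]): Seq[GenericRecord] = {
  val stitched = headerBytes ++ blockBytes
  val reader = DataFileReader.openReader(
    new SeekableByteArrayInput(stitched), new GenericDatumReader[GenericRecord]())
  try {
    val records = scala.collection.mutable.ArrayBuffer.empty[GenericRecord]
    while (reader.hasNext) records += reader.next()
    records.toSeq
  } finally reader.close()
}
```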
