
[FEA] Optimize remote Avro reading for a PartitionFile #5304

Closed
GaryShen2008 opened this issue Apr 25, 2022 · 4 comments · Fixed by #5421
Labels
performance A performance related task/issue

Comments

@GaryShen2008
Collaborator

GaryShen2008 commented Apr 25, 2022

Is your feature request related to a problem? Please describe.
An Avro file has no file-level index of block positions that would let us quickly find the correct range to read for a PartitionFile.
In the current multithreaded Avro reading, we use "filterBlocks" to collect the position information of every block by reading through the Avro file remotely. From the logs, we found it is too expensive to gather the information of all blocks remotely.
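(For illustration only, this is not the actual filterBlocks code: a rough Scala sketch of that kind of remote block scan, where every block header is fetched over the network just to learn the block boundaries. `BlockInfo`, `scanAllBlocks`, and `firstBlockPos` are hypothetical names; the Avro decoder and Hadoop stream calls are standard APIs.)

```scala
import org.apache.avro.file.DataFileConstants
import org.apache.avro.io.DecoderFactory
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

case class BlockInfo(offset: Long, count: Long, dataSize: Long)   // hypothetical holder

def scanAllBlocks(path: Path, firstBlockPos: Long, fileLen: Long,
    conf: Configuration): Seq[BlockInfo] = {
  val in = path.getFileSystem(conf).open(path)
  try {
    val blocks = scala.collection.mutable.ArrayBuffer.empty[BlockInfo]
    var pos = firstBlockPos                       // first byte after the file header
    while (pos < fileLen) {
      in.seek(pos)                                // one remote seek per block
      // A direct (non-buffering) decoder keeps the stream position predictable.
      val dec = DecoderFactory.get().directBinaryDecoder(in, null)
      val count = dec.readLong()                  // number of objects in the block
      val size  = dec.readLong()                  // serialized size of the block data
      blocks += BlockInfo(pos, count, size)
      pos = in.getPos + size + DataFileConstants.SYNC_SIZE   // skip data + sync marker
    }
    blocks.toSeq
  } finally in.close()
}
```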

Describe the solution you'd like
For a PartitionFile, we should read three parts of the Avro file:

  1. The header of Avro file
  2. The range from PartitionFile with start, end
  3. The remaining block bytes after end position

First, we always need to read the header of the Avro file to get the schema. The multithreaded pool should download the header part, which contains the magic bytes and the metadata, at the beginning. We should choose a reasonable read length so that the whole metadata normally arrives in a single download rather than several.
Second, download the range (start, end from the PartitionFile) from the Avro file into a local buffer with the multithreaded pool. Using the same logic as DataFileReader.sync, locate the sync marker that marks the beginning of the needed block buffer. Then parse the block information from this local buffer and compare it against the start and end of the PartitionFile to determine which blocks to read: the partial block before the start position is ignored, while the block that is still unfinished at the end position is kept for further reading. Append the buffer, starting from the sync marker just before the first block, to the header buffer.
Third, read the remaining part of the last block according to its declared block size and append it to the previous buffer.
Ideally, three downloads are enough to get all the blocks for a PartitionFile (see the sketch after the special cases below).

There are further optimizations for special cases:
Case 1: the start position is 0 (the beginning of the Avro file), so downloads 1 and 2 can be combined, because the header is already included in the [start, end] buffer.
Case 2: the end position is the end of the Avro file, so the block parsing can be skipped and everything is simply appended to the buffer.
Case 3: the PartitionFile covers the whole file, so a single download is enough.
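Here is a minimal Scala sketch of the three-download flow. `downloadRange` is implemented with a plain Hadoop positioned read, while `parseHeader`, `parseBlocks`, and `indexOfSync` are hypothetical helpers (header parsing, block walking, sync-marker search); the real implementation would issue the downloads through the multithreaded pool:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

case class AvroHeader(bytes: Array[Byte], syncMarker: Array[Byte])   // magic + metadata + sync, and the 16-byte marker
case class BlockMeta(count: Long, dataSize: Long, totalSize: Long)   // totalSize: count/size longs + data + sync

// Read [offset, offset + len) of the remote file into a local buffer.
def downloadRange(fs: FileSystem, path: Path, offset: Long, len: Int): Array[Byte] = {
  val buf = new Array[Byte](len)
  val in = fs.open(path)
  try { in.readFully(offset, buf); buf } finally in.close()
}

def parseHeader(bytes: Array[Byte]): AvroHeader = ???                      // assumed helper: magic + metadata + sync
def parseBlocks(buf: Array[Byte], sync: Array[Byte]): Seq[BlockMeta] = ??? // assumed helper: walk count/size pairs
def indexOfSync(buf: Array[Byte], sync: Array[Byte]): Int = ???            // assumed helper: first sync marker, or -1

def readPartition(fs: FileSystem, path: Path, start: Long, end: Long, fileLen: Long): Array[Byte] = {
  val out = new java.io.ByteArrayOutputStream()

  // 1. Header download: one generously sized read so the whole metadata map
  //    (schema, codec, sync marker) normally arrives in a single round trip.
  //    (Case 1/3: if start == 0 this read and the range read can be combined.)
  val header = parseHeader(downloadRange(fs, path, 0, math.min(64L * 1024, fileLen).toInt))
  out.write(header.bytes)

  // 2. Range download: fetch [start, end), then locate the first sync marker in
  //    the local buffer (same idea as DataFileReader.sync). The partial block
  //    before it is dropped; since the header buffer already ends with a sync
  //    marker, everything after the located marker is appended.
  val range = downloadRange(fs, path, start, (end - start).toInt)
  val firstBlock = indexOfSync(range, header.syncMarker) + header.syncMarker.length
  val blocks = parseBlocks(range.drop(firstBlock), header.syncMarker)
  out.write(range, firstBlock, range.length - firstBlock)

  // 3. Tail download: the last block may extend past `end`; fetch its missing
  //    bytes according to the declared block size and append them.
  val declaredEnd = firstBlock + blocks.map(_.totalSize).sum
  val missing = declaredEnd - range.length
  if (missing > 0) out.write(downloadRange(fs, path, end, missing.toInt))

  out.toByteArray
}
```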

@GaryShen2008 GaryShen2008 added feature request New feature or request ? - Needs Triage Need team to review and classify labels Apr 25, 2022
@GaryShen2008 GaryShen2008 changed the title [FEA] Optimize Avro reading in Multithread for a PartitionFile [FEA] Optimize Avro reading for a PartitionFile Apr 25, 2022
@GaryShen2008 GaryShen2008 changed the title [FEA] Optimize Avro reading for a PartitionFile [FEA] Optimize remote Avro reading for a PartitionFile Apr 25, 2022
@sameerz sameerz added performance A performance related task/issue and removed feature request New feature or request ? - Needs Triage Need team to review and classify labels Apr 26, 2022
@tgravescs
Collaborator

So to clarify here: the current Avro reader is basically reading all of the blocks' metadata (start/end/length) in the file and then filtering down to the ones it really needs, correct? So it's reading a lot more than it really needs to, which is slow.

Does the CPU version skip the unneeded blocks? From a brief look it appears it uses:
reader.sync(partitionedFile.start) (which you linked to above) and
override val stopPosition = partitionedFile.start + partitionedFile.length

So I am assuming it is only reading what is necessary.

From a high level your proposal makes sense. I'd have to look further, but could we just use the Avro reader like the CPU side does and use that sync function?
DataFileReader.openReader(in, datumReader)
If you can't use their reader, then reusing the logic from the sync functionality makes sense.
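For reference, the CPU-style bounded read with the stock Avro reader looks roughly like the sketch below (a sketch assuming org.apache.avro.mapred.FsInput for opening the Hadoop path; error handling and the actual record handling are omitted):

```scala
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.mapred.FsInput
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Read only the blocks that overlap [start, start + length) of one partition.
def readRange(path: Path, start: Long, length: Long, conf: Configuration): Unit = {
  val reader = DataFileReader.openReader(
    new FsInput(path, conf), new GenericDatumReader[GenericRecord]())
  try {
    val stopPosition = start + length
    reader.sync(start)                         // jump to the first sync marker at or after `start`
    while (reader.hasNext && !reader.pastSync(stopPosition)) {
      val record = reader.next()               // block metadata and data are consumed together
      // ... hand `record` to downstream processing
    }
  } finally reader.close()
}
```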

@firestarman
Collaborator

firestarman commented Apr 29, 2022

We are introducing this sync in our new AVRO reader; you can find the draft code here, and it is used in the multi-threaded reader.

@tgravescs
Collaborator

Can you please answer my questions to confirm my understanding?

@firestarman
Collaborator

firestarman commented May 4, 2022

Yes, the current filterBlocks reads all of the blocks' metadata in a file first, then filters out the unnecessary blocks according to start and stopPosition. Later the reader opens the file again to read the necessary data according to the filtered blocks' metadata.
Yes, the CPU reader is only reading what is necessary. Besides, it does not collect the blocks' metadata first; instead it reads the block metadata and data together in the iterator pattern, so the CPU reader reads through only the necessary part of the file, and only once.

What's more, I think the operation that really takes time could be the seek for each block, since in our test case the perf is still bad even after introducing sync and pastSync to read only the necessary blocks.
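One way to avoid that per-block remote seek is the buffering idea from the proposal above: stitch the downloaded header bytes and block bytes together and let the stock Avro reader parse the blocks from memory. A hedged sketch, assuming `headerBytes` ends with the header's sync marker and `blockBytes` starts right after a sync marker and contains only complete blocks:

```scala
import org.apache.avro.file.{DataFileReader, SeekableByteArrayInput}
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}

// headerBytes: magic + metadata + trailing sync marker (from the header download).
// blockBytes: complete blocks starting right after a sync marker (range + tail downloads).
// Parsing happens entirely in memory, so there is no remote seek per block.
def readStitched(headerBytes: Array[Byte], blockBytes: Array[Byte]): Seq[GenericRecord] = {
  val stitched = headerBytes ++ blockBytes
  val reader = DataFileReader.openReader(
    new SeekableByteArrayInput(stitched), new GenericDatumReader[GenericRecord]())
  try {
    val records = scala.collection.mutable.ArrayBuffer.empty[GenericRecord]
    while (reader.hasNext) records += reader.next()
    records.toSeq
  } finally reader.close()
}
```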
