Closes #3083: Optimize Parquet string read code #3082

Merged: 45 commits into Bears-R-Us:master, Apr 16, 2024

Conversation

@bmcdonald3 (Contributor) commented Apr 5, 2024

This PR reworks the existing Parquet string read implementation by adding a distinct code path for strings, since their layout differs from that of the other types.

Unlike other Arkouda types such as int and bool, Parquet strings cannot be read directly into Chapel buffers, because Arrow uses its own representation for strings.
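For readers unfamiliar with that layout: Arrow stores a string column as a shared values buffer plus offsets, not as one independently owned byte buffer per element, so the values have to be copied out into the flat, null-terminated form Arkouda's Chapel buffers use. Below is a minimal, illustrative C++ sketch of that kind of flattening copy, assuming Arrow's `arrow::StringArray` API; it is not code from this PR.

```cpp
// Illustrative only, not code from this PR: flatten an Arrow string
// column into a single byte buffer with null terminators, the rough
// shape Arkouda's Chapel-side string buffers expect.
#include <arrow/array.h>
#include <cstdint>
#include <vector>

std::vector<uint8_t> flattenStrings(const arrow::StringArray& arr) {
  std::vector<uint8_t> out;
  for (int64_t i = 0; i < arr.length(); i++) {
    if (arr.IsNull(i)) {        // treat nulls as empty strings here
      out.push_back('\0');
      continue;
    }
    auto s = arr.GetView(i);    // view into Arrow's shared values buffer
    out.insert(out.end(), s.begin(), s.end());
    out.push_back('\0');        // Arkouda-style terminator
  }
  return out;
}
```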

Prior to this PR, Parquet string reading read each file twice: a first pass to compute the total number of bytes (which requires reading the entire file, since the byte count is not available in the Parquet metadata), and a second pass to actually read the data and store it into the Arkouda buffer. Each Parquet string was read one value at a time and then deep-copied on the C++ side into the Chapel buffer.
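As a rough sketch of what that first pass looks like with the low-level Parquet C++ reader API (illustrative only; `countStringBytes` is a hypothetical name, not one of the PR's functions, and this assumes a required, non-nullable string column for brevity):

```cpp
// Illustrative sketch, not the PR's code: the pre-PR first pass walked
// the whole file just to learn the total byte count, because that count
// is not stored in the Parquet metadata.
#include <parquet/api/reader.h>
#include <memory>
#include <string>

int64_t countStringBytes(const std::string& path, int colIdx) {
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile(path);
  int64_t totalBytes = 0;
  int numRowGroups = reader->metadata()->num_row_groups();
  for (int rg = 0; rg < numRowGroups; rg++) {
    auto rgReader = reader->RowGroup(rg);
    auto col = std::static_pointer_cast<parquet::ByteArrayReader>(
        rgReader->Column(colIdx));
    parquet::ByteArray value;
    int64_t valuesRead = 0;
    while (col->HasNext()) {
      col->ReadBatch(1, nullptr, nullptr, &value, &valuesRead);
      totalBytes += value.len + 1;   // +1 for Arkouda's null terminator
    }
  }
  // The old flow then re-opened the file and repeated this walk,
  // copying each value one at a time into the Chapel buffer.
  return totalBytes;
}
```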

This PR takes a different approach: it reads all of the values in a single pass and stores them in an intermediate data structure, avoiding the second file read.

To get the data from Arrow into Chapel, we create a row group reader for each row group in the file; these readers need to be kept alive so that the memory backing the Arrow strings is not cleaned up. Once all of the data is stored in the intermediate data structure, the copies into the Chapel buffer can be done in parallel on the Chapel side.
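Below is a rough, illustrative C++ sketch of that single-pass shape, assuming the low-level `parquet::ByteArrayReader` API and a required (non-nullable) column. The names (`StringColumnChunks`, `readStringColumn`) are hypothetical and the exact ownership details differ from the PR; the point it illustrates is that the reader objects owning the decoded buffers must stay alive until the Chapel side has copied the bytes out.

```cpp
// Illustrative sketch, not the PR's code: read every string value once,
// keeping the readers alive so the memory the ByteArrays point into is
// not freed before the Chapel side copies it out.
#include <parquet/api/reader.h>
#include <memory>
#include <string>
#include <vector>

struct StringColumnChunks {            // hypothetical holder type
  // The file/column readers must outlive the ByteArrays below: each
  // ByteArray::ptr points into buffers owned by the readers' decoders.
  std::unique_ptr<parquet::ParquetFileReader> file;
  std::vector<std::shared_ptr<parquet::ColumnReader>> columnReaders;
  std::vector<std::vector<parquet::ByteArray>> values;  // one vector per row group
};

StringColumnChunks readStringColumn(const std::string& path, int colIdx) {
  StringColumnChunks out;
  out.file = parquet::ParquetFileReader::OpenFile(path);
  int numRowGroups = out.file->metadata()->num_row_groups();
  for (int rg = 0; rg < numRowGroups; rg++) {
    auto rgReader = out.file->RowGroup(rg);
    int64_t numRows = rgReader->metadata()->num_rows();
    auto col = std::static_pointer_cast<parquet::ByteArrayReader>(
        rgReader->Column(colIdx));
    std::vector<parquet::ByteArray> vals(numRows);
    int64_t totalRead = 0;
    while (col->HasNext() && totalRead < numRows) {
      int64_t valuesRead = 0;
      col->ReadBatch(numRows - totalRead, nullptr, nullptr,
                     vals.data() + totalRead, &valuesRead);
      totalRead += valuesRead;
    }
    out.columnReaders.push_back(col);   // keep alive past this loop
    out.values.push_back(std::move(vals));
  }
  // The Chapel side can now copy out.values[rg][i].ptr / .len into its
  // own buffer in parallel, and only then let this struct be destroyed.
  return out;
}
```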

Performance results reading one large file collected on an XC with a Lustre file system:

| nl | master | my branch |
|---:|-------:|----------:|
| 1  | 8.17s  | 1.56s     |
| 2  | 8.06s  | 1.95s     |
| 4  | 8.32s  | 1.90s     |
| 8  | 8.31s  | 1.96s     |

So, this results in a 4-5x speedup.

The reason that 1 node reads faster than 2 nodes is that each file is assigned to a single locale, rather than split into chunks as in the existing implementation. The first node therefore reads the file's contents and then broadcasts them out to the other nodes, which adds a modest amount of overhead.

Closes #3083

@bmcdonald3 bmcdonald3 changed the title Optimize Parquet string read code Closes #3083: Optimize Parquet string read code Apr 5, 2024
@stress-tess (Member) left a comment:
Looks good to me!! Thanks for doing this! I don't envy u having to figure out how to keep the maps from going out of scope switching from c++ to chpl lol

Review threads (all resolved):
- src/ArrowFunctions.h
- src/ParquetMsg.chpl
@bmcdonald3 (Contributor, Author) replied:
Also, regarding your higher-level comment: the hard part here wasn't getting the maps in place, it was figuring out that we needed them in the first place! There is no indication in the API that you need to keep column readers in scope to keep the strings' allocated data around, so that was a long rabbit chase to figure out.

@jaketrookman (Contributor) left a comment:

Looks good

@stress-tess (Member) commented Apr 15, 2024:

awesome @bmcdonald3 it looks like this will be good to go once the merge conflicts are resolved (sorry I messed urs up by merging mine first 😅 )

@bmcdonald3 (Contributor, Author) commented:
@stress-tess should be good to go

@stress-tess stress-tess added this pull request to the merge queue Apr 16, 2024
Merged via the queue into Bears-R-Us:master with commit f22e5b2 Apr 16, 2024
13 checks passed