Add avro reader support [databricks] #4956

Merged
merged 37 commits into from
Mar 23, 2022

wbo4958
Collaborator

@wbo4958 wbo4958 commented Mar 15, 2022

This PR adds Avro reader support for basic types.

Avro files carry no metadata describing their data blocks, so this PR first walks the whole Avro file to collect the info for every block. It does not read the entire file; it seeks to each block's position and reads only a few bytes there. It then filters the blocks according to the PartitionedFile, reads the selected blocks into CPU memory, and sends them to the GPU to decode ...

Closes #4935.
Re #4831.
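
The block-filtering step in the summary above can be illustrated with a small sketch. The function name `select_blocks_for_split` is hypothetical, and the ownership rule it uses (a block belongs to the split whose byte range contains the block's first byte) is an assumption for illustration, not something this PR confirms. The offsets in the example call are made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the block start offsets owned by one split.
# Assumption (not confirmed by the PR): a block belongs to the split whose
# byte range [start, start + length) contains the block's first byte.
select_blocks_for_split() {
    local split_start=$1 split_length=$2
    shift 2
    local split_end=$(( split_start + split_length ))
    local offset
    for offset in "$@"; do
        # Keep only blocks that begin inside this split.
        if (( offset >= split_start && offset < split_end )); then
            echo "$offset"
        fi
    done
}

# Four made-up block offsets; the split covers bytes 1000 through 2999.
select_blocks_for_split 1000 2000 120 1100 2500 3900
```

With a rule like this, each block is claimed by exactly one split as long as the splits tile the file without gaps or overlaps, so no block would be decoded twice.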

@wbo4958 wbo4958 changed the title [WIP] Add avro reader support Add avro reader support [databricks] Mar 16, 2022
@wbo4958 wbo4958 marked this pull request as ready for review March 16, 2022 10:49
@wbo4958
Collaborator Author

wbo4958 commented Mar 16, 2022

build

@sameerz sameerz added the feature request New feature or request label Mar 16, 2022
@sameerz sameerz added this to the Feb 28 - Mar 18 milestone Mar 16, 2022
@wbo4958
Collaborator Author

wbo4958 commented Mar 17, 2022

build

@wbo4958
Collaborator Author

wbo4958 commented Mar 20, 2022

build

@wbo4958
Collaborator Author

wbo4958 commented Mar 20, 2022

build

@wbo4958 wbo4958 requested review from jlowe and revans2 March 21, 2022 01:00
@wbo4958
Collaborator Author

wbo4958 commented Mar 21, 2022

build

jlowe
jlowe previously approved these changes Mar 21, 2022
revans2
revans2 previously approved these changes Mar 21, 2022
  # Only 3 jars: cudf.jar dist.jar integration-test.jar
- ALL_JARS="$CUDF_JARS $PLUGIN_JARS $TEST_JARS"
+ ALL_JARS="$CUDF_JARS $PLUGIN_JARS $TEST_JARS $AVRO_JARS"
Collaborator

@firestarman firestarman Mar 22, 2022

NIT: Better to check whether the avro jar exists. If not, set AVRO_JARS="" and CI_EXCLUDE_AVRO=true as well.

Collaborator Author

DONE
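
A minimal sketch of the check suggested in this thread, assuming bash. The function name `resolve_avro_jar` and the path passed to it are hypothetical; the merged script may locate the jar differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: use the Avro jar only if it exists on disk.
# The candidate path is an illustrative assumption, not the PR's real lookup.
resolve_avro_jar() {
    local candidate=$1
    if [[ -n "$candidate" && -f "$candidate" ]]; then
        AVRO_JARS="$candidate"
    else
        # No jar available: drop it from the jar list and skip the Avro tests.
        AVRO_JARS=""
        CI_EXCLUDE_AVRO=true
    fi
}

resolve_avro_jar "/nonexistent/path/spark-avro.jar"
echo "AVRO_JARS='${AVRO_JARS}' CI_EXCLUDE_AVRO='${CI_EXCLUDE_AVRO}'"
```

Setting both variables in the same branch keeps the jar list and the test selection consistent, so the Avro tests are not attempted without the jar on the classpath.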

#
# `CI_EXCLUDE_AVRO=true ./run_pyspark_from_build.sh -k not avro_test.py` run all tests excluding
# those in avro_test.py
if [[ "${CI_EXCLUDE_AVRO}" != "" ]];
Collaborator

Suggested change
- if [[ "${CI_EXCLUDE_AVRO}" != "" ]];
+ if [[ "${CI_EXCLUDE_AVRO}" == "true" ]];

Otherwise, setting CI_EXCLUDE_AVRO=false would also skip the tests.
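
The difference between the two conditions can be checked with a short bash sketch. The function names `skip_if_nonempty` and `skip_if_true` are hypothetical stand-ins for the original and the suggested test.

```shell
#!/usr/bin/env bash
# Hypothetical predicates mirroring the two conditions under discussion.
skip_if_nonempty() { [[ "$1" != "" ]]; }      # original check
skip_if_true()     { [[ "$1" == "true" ]]; }  # suggested check

# Compare both checks for an unset/empty, a "true", and a "false" value.
for value in "" "true" "false"; do
    if skip_if_nonempty "$value"; then loose=skip; else loose=run; fi
    if skip_if_true "$value"; then strict=skip; else strict=run; fi
    printf 'CI_EXCLUDE_AVRO=%-8s non-empty check: %-5s equals-true check: %s\n' \
        "'$value'" "$loose" "$strict"
done
```

The two checks disagree only for values such as `false`: the non-empty test treats any set value as a request to skip, while the equality test skips only on an explicit `true`.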

Collaborator Author

Done

@wbo4958 wbo4958 dismissed stale reviews from revans2 and jlowe via 0be4e9a March 22, 2022 03:23
@wbo4958
Collaborator Author

wbo4958 commented Mar 22, 2022

build

@wbo4958
Collaborator Author

wbo4958 commented Mar 22, 2022

build

@wbo4958
Collaborator Author

wbo4958 commented Mar 22, 2022

build

@wbo4958 wbo4958 merged commit 19124fe into NVIDIA:branch-22.04 Mar 23, 2022
@wbo4958 wbo4958 deleted the avro branch March 23, 2022 01:16
Labels
feature request New feature or request
Development

Successfully merging this pull request may close these issues.

[FEA] Support reading Avro: primitive types
6 participants