Implement chunked ORC reader #10723

Merged 16 commits on May 16, 2024

@ttnghia (Collaborator) commented Apr 18, 2024

This implements the chunked ORC reader, which supports reading ORC files in an iterative manner. This allows reading very large files (containing more than 2B rows) while keeping device memory usage within a constant limit.
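
As a rough illustration of the idea only (this is not the plugin's or cuDF's actual API), chunked reading can be sketched as a generator that hands back one bounded piece of the input at a time, so the consumer never holds more than roughly one chunk in memory. The function name `read_in_chunks`, the per-row byte sizes, and the `chunk_limit_bytes` parameter are all hypothetical and chosen just for this sketch.

```python
from typing import Iterator, List, Sequence


def read_in_chunks(row_sizes: Sequence[int], chunk_limit_bytes: int) -> Iterator[List[int]]:
    """Yield lists of row indices whose combined size stays within the limit.

    Hypothetical sketch: `row_sizes` stands in for the decoded size of each
    row. A single row larger than the limit is emitted alone in its own chunk.
    """
    chunk: List[int] = []
    chunk_bytes = 0
    for idx, size in enumerate(row_sizes):
        # Close the current chunk before it would exceed the limit.
        if chunk and chunk_bytes + size > chunk_limit_bytes:
            yield chunk
            chunk, chunk_bytes = [], 0
        chunk.append(idx)
        chunk_bytes += size
    # Emit whatever is left after the last row.
    if chunk:
        yield chunk


# The caller processes one chunk at a time instead of the whole file.
for rows in read_in_chunks([4, 4, 4, 4, 4], chunk_limit_bytes=8):
    print(rows)
```

With five 4-byte rows and an 8-byte limit, the loop sees three chunks rather than one 20-byte batch, which is the property that keeps peak memory bounded regardless of file size.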

Depends on:

Closes #7131.

@ttnghia ttnghia added labels Apr 18, 2024: feature request (New feature or request), SQL (part of the SQL/Dataframe plugin), P1 (Nice to have for release), task (Work required that improves the product but is not user facing)
@ttnghia ttnghia requested a review from revans2 April 18, 2024 04:25
@ttnghia ttnghia self-assigned this Apr 18, 2024
ttnghia added 2 commits April 17, 2024 21:40
Signed-off-by: Nghia Truong <[email protected]>
@ttnghia ttnghia marked this pull request as ready for review April 22, 2024 18:52
ttnghia added 2 commits April 22, 2024 13:17
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
@ttnghia ttnghia force-pushed the chunked_orc_reader branch from f31fe5c to 67ec36f April 22, 2024 21:40
# Conflicts:
#	docs/additional-functionality/advanced_configs.md
#	sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala

@revans2 (Collaborator) left a comment

Looks good to me

@ttnghia (Collaborator, Author) commented May 9, 2024

build

@ttnghia (Collaborator, Author) commented May 16, 2024

I ran an NDS benchmark at scale sf200, with the data generated so that each table is written to a single file. The files are:

4.0K	./orc/household_demographics/part-00000-95dae66f-cd68-4210-83b4-12597870c423-c000.snappy.orc
4.0K	./orc/income_band/part-00000-dccb4614-4687-4ddf-a4df-24ba600b260b-c000.snappy.orc
4.0K	./orc/reason/part-00000-4425fe4e-cb76-47db-981c-b3bcd56ac303-c000.snappy.orc
4.0K	./orc/ship_mode/part-00000-9ebf1ca8-6467-4949-8971-d153e3c96daf-c000.snappy.orc
4.0K	./orc/warehouse/part-00000-5dcdeea3-9698-4687-b524-c8be98763a64-c000.snappy.orc
8.0K	./orc/call_center/part-00000-7b01e9bb-a8e3-4852-952d-a5e7877acb6c-c000.snappy.orc
8.0K	./orc/web_page/part-00000-b00dd5c2-56e5-473a-aab8-04b3b36d05e4-c000.snappy.orc
12K	./orc/web_site/part-00000-1d2a5342-3ff3-417e-a58f-f738af9efa4b-c000.snappy.orc
24K	./orc/promotion/part-00000-a1b88f57-6c34-46d3-9e1c-38b2248bbe93-c000.snappy.orc
24K	./orc/store/part-00000-fc7bb994-270e-4470-927d-455a3e6dd06e-c000.snappy.orc
120K	./orc/customer_demographics/part-00000-c1c5b651-ba99-4bbc-ad25-22dc060da252-c000.snappy.orc
432K	./orc/time_dim/part-00000-fe708db1-7126-47cb-b6a6-e1f23a290f8d-c000.snappy.orc
492K	./orc/date_dim/part-00000-200fa1e1-901c-4574-b427-ebbc9c1a00ec-c000.snappy.orc
608K	./orc/catalog_page/part-00000-88437db7-e9f5-4bde-bc8e-3fef74f9f8fc-c000.snappy.orc
4.2M	./orc/item/part-00000-a77db32e-6b0b-462b-8a43-c3bb2f8baf4e-c000.snappy.orc
15M	./orc/customer_address/part-00000-e66546d9-dc91-497c-92d5-2d3f688e3a7b-c000.snappy.orc
73M	./orc/inventory/part-00000-edbb3790-cf29-4589-8e57-42700b9ea0ff-c000.snappy.orc
81M	./orc/customer/part-00000-697a9499-9bb1-4c58-adf2-7938ffbaf421-c000.snappy.orc
828M	./orc/web_returns/part-00000-415548d7-cc74-4d2d-a033-c4929221d10d-c000.snappy.orc
1.7G	./orc/catalog_returns/part-00000-ee2c1172-11f0-431e-80ad-2ca5cd1d352a-c000.snappy.orc
2.7G	./orc/store_returns/part-00000-4284adf9-a7d5-4c24-9047-0541f696fa34-c000.snappy.orc
8.2G	./orc/web_sales/part-00000-913eb386-9d3c-426f-8333-3664a8e6f9bb-c000.snappy.orc
16G	./orc/catalog_sales/part-00000-e30529dc-28fa-41ca-956b-7b54555983c6-c000.snappy.orc
22G	./orc/store_sales/part-00000-4d890e59-72b6-4e3d-b2a0-a8dc6ccc01fb-c000.snappy.orc

and

4.0K	./parquet/income_band/part-00000-b5df2871-e2e2-46da-b51c-24ecc17d25fd-c000.snappy.parquet
4.0K	./parquet/reason/part-00000-66b0a10d-de0a-4f3c-8f1a-2ccebb7cfb4e-c000.snappy.parquet
4.0K	./parquet/ship_mode/part-00000-f9f212cf-93af-413c-b0b2-a97aab763f24-c000.snappy.parquet
8.0K	./parquet/warehouse/part-00000-aec427fc-28a0-4c6b-8da4-1acaf9b8c9eb-c000.snappy.parquet
12K	./parquet/call_center/part-00000-210572d7-fd2b-4d96-a168-8a21e2b76715-c000.snappy.parquet
12K	./parquet/web_page/part-00000-3ad7904b-2ba1-4ca5-bf42-9c054eaa9d6c-c000.snappy.parquet
16K	./parquet/web_site/part-00000-418e07f4-aef9-442c-b125-99838804bc53-c000.snappy.parquet
32K	./parquet/household_demographics/part-00000-aaf384b3-4f54-4e06-ab8c-f8d91f14a745-c000.snappy.parquet
32K	./parquet/promotion/part-00000-2f974abf-8d60-48bb-8736-0ace3417d56d-c000.snappy.parquet
32K	./parquet/store/part-00000-0f1c76fd-fd7f-49a9-93ca-0039a9fff2b9-c000.snappy.parquet
680K	./parquet/catalog_page/part-00000-af68914e-77eb-45ff-80bf-f5a7eba0d92b-c000.snappy.parquet
1.2M	./parquet/time_dim/part-00000-4aac3bc0-1e2d-4eab-a6df-d925a6aea89f-c000.snappy.parquet
1.8M	./parquet/date_dim/part-00000-decea4df-2b44-44e7-a49a-622b1dc1485a-c000.snappy.parquet
4.5M	./parquet/item/part-00000-2fce9f08-2ab6-4fe2-b4b4-4ae0cb2da622-c000.snappy.parquet
7.5M	./parquet/customer_demographics/part-00000-5d95e7a0-9582-4e92-b309-8177a497f974-c000.snappy.parquet
15M	./parquet/customer_address/part-00000-06fea56b-606b-48cd-9b07-39a5ebd887bd-c000.snappy.parquet
82M	./parquet/customer/part-00000-2f75e736-a872-4ce6-a9cb-730fa1c3aca7-c000.snappy.parquet
190M	./parquet/inventory/part-00000-cf84b5ba-b2ec-4e82-8b7d-e0ab841cd063-c000.snappy.parquet
998M	./parquet/web_returns/part-00000-99cecdf0-fc24-4774-bf7f-cbcc6a4c7967-c000.snappy.parquet
2.0G	./parquet/catalog_returns/part-00000-770ea751-3eb9-4f84-90e6-0e47927affa5-c000.snappy.parquet
3.2G	./parquet/store_returns/part-00000-e4f749ad-711e-45c7-af94-615b5f7e0f73-c000.snappy.parquet
9.1G	./parquet/web_sales/part-00000-c523d6ce-9f42-43f8-9628-59e5dd23ccfb-c000.snappy.parquet
20G	./parquet/catalog_sales/part-00000-00821685-a882-4bcd-a4ef-1be87c76fa59-c000.snappy.parquet
25G	./parquet/store_sales/part-00000-fb1c5b06-fd0b-4b4b-8a52-ae24d96ad52a-c000.snappy.parquet

Benchmark

With ORC input format:

Benchmark times
===============
313.0
310.0
315.0
307.0
304.0

With Parquet input format:

Benchmark times
===============
174.0
173.0
174.0
174.0
180.0
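
To put the two sets of timings side by side, here is a small sketch that averages the five runs of each format and takes their ratio. It uses only the numbers listed above; the comment does not state the unit of the times, so none is assumed here.

```python
from statistics import mean

# Benchmark times copied from the two runs above (unit not stated in the source).
orc_times = [313.0, 310.0, 315.0, 307.0, 304.0]
parquet_times = [174.0, 173.0, 174.0, 174.0, 180.0]

orc_mean = mean(orc_times)
parquet_mean = mean(parquet_times)
ratio = orc_mean / parquet_mean

print(f"ORC mean:      {orc_mean:.1f}")      # 309.8
print(f"Parquet mean:  {parquet_mean:.1f}")  # 175.0
print(f"ORC / Parquet: {ratio:.2f}x")        # 1.77x
```

On these five runs each, the ORC benchmark takes roughly 1.77 times as long as the Parquet one.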

Before this PR, the benchmark failed with many exceptions because the plugin could not read large files.

Note that the chunked ORC reader can be further optimized for performance; this is already planned as future work.

@ttnghia ttnghia merged commit 713bb7b into NVIDIA:branch-24.06 May 16, 2024
43 checks passed
@ttnghia ttnghia deleted the chunked_orc_reader branch May 16, 2024 06:10
Successfully merging this pull request may close these issues.

[FEA] Chunked ORC reading
2 participants