-
Notifications
You must be signed in to change notification settings - Fork 28.5k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[SPARK-41248][SQL] Add "spark.sql.json.enablePartialResults" to enabl…
…e/disable JSON partial results ### What changes were proposed in this pull request? This PR adds a SQL config `spark.sql.json.enablePartialResults` to control SPARK-40646 change. This allows us to fall back to the behaviour before the change. It was observed that SPARK-40646 could cause a performance regression for deeply nested schemas. I, however, could not reproduce the regression with Apache Spark JSON benchmarks (maybe we need to extend them, I can do it as a follow-up). Regardless, I propose to add a SQL config to have an ability to disable the change in case of performance degradation during JSON parsing. Benchmark results are attached to the JIRA ticket. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? SQL config `spark.sql.json.enablePartialResults` is added to control the behaviour of SPARK-40646 JSON partial results parsing. Users can disable the feature if they find any performance regressions when reading JSON files. ### How was this patch tested? I extended existing unit tests to test with flag enabled and disabled. Closes #38784 from sadikovi/add-flag-json-parsing. Authored-by: Ivan Sadikov <[email protected]> Signed-off-by: Max Gekk <[email protected]>
- Loading branch information
Showing
5 changed files
with
158 additions
and
110 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,120 +1,105 @@ | ||
================================================================================================ | ||
Benchmark for performance of JSON parsing | ||
================================================================================================ | ||
|
||
Preparing data for benchmarking ... | ||
OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure | ||
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz | ||
OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws | ||
Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz | ||
JSON schema inferring: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative | ||
------------------------------------------------------------------------------------------------------------------------ | ||
No encoding 3363 3446 79 1.5 672.7 1.0X | ||
UTF-8 is set 4894 4976 72 1.0 978.7 0.7X | ||
No encoding 2545 2616 65 2.0 509.0 1.0X | ||
UTF-8 is set 3845 3854 8 1.3 768.9 0.7X | ||
|
||
Preparing data for benchmarking ... | ||
OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure | ||
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz | ||
OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws | ||
Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz | ||
count a short column: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative | ||
------------------------------------------------------------------------------------------------------------------------ | ||
No encoding 3088 3123 32 1.6 617.6 1.0X | ||
UTF-8 is set 4854 4938 87 1.0 970.9 0.6X | ||
No encoding 2130 2176 41 2.3 426.0 1.0X | ||
UTF-8 is set 3907 3911 4 1.3 781.3 0.5X | ||
|
||
Preparing data for benchmarking ... | ||
OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure | ||
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz | ||
OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws | ||
Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz | ||
count a wide column: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative | ||
------------------------------------------------------------------------------------------------------------------------ | ||
No encoding 6411 7338 1497 0.2 6411.2 1.0X | ||
UTF-8 is set 10589 10644 58 0.1 10589.1 0.6X | ||
No encoding 5032 5068 50 0.2 5032.3 1.0X | ||
UTF-8 is set 8304 8349 40 0.1 8304.3 0.6X | ||
|
||
Preparing data for benchmarking ... | ||
OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure | ||
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz | ||
OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws | ||
Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz | ||
select wide row: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative | ||
------------------------------------------------------------------------------------------------------------------------ | ||
No encoding 12862 13165 263 0.0 257239.1 1.0X | ||
UTF-8 is set 14792 15110 371 0.0 295834.1 0.9X | ||
No encoding 10782 10872 78 0.0 215647.2 1.0X | ||
UTF-8 is set 12514 12560 41 0.0 250277.3 0.9X | ||
|
||
Preparing data for benchmarking ... | ||
OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure | ||
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz | ||
OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws | ||
Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz | ||
Select a subset of 10 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative | ||
------------------------------------------------------------------------------------------------------------------------ | ||
Select 10 columns 2352 2369 17 0.4 2351.8 1.0X | ||
Select 1 column 2680 2683 5 0.4 2680.0 0.9X | ||
Select 10 columns 1901 1903 2 0.5 1901.0 1.0X | ||
Select 1 column 1493 1501 8 0.7 1493.3 1.3X | ||
|
||
Preparing data for benchmarking ... | ||
OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure | ||
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz | ||
OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws | ||
Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz | ||
creation of JSON parser per line: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative | ||
------------------------------------------------------------------------------------------------------------------------ | ||
Short column without encoding 884 887 2 1.1 884.1 1.0X | ||
Short column with UTF-8 1193 1202 8 0.8 1192.6 0.7X | ||
Wide column without encoding 12289 12448 170 0.1 12289.3 0.1X | ||
Wide column with UTF-8 16609 16663 79 0.1 16608.6 0.1X | ||
Short column without encoding 697 700 3 1.4 697.2 1.0X | ||
Short column with UTF-8 979 979 0 1.0 978.7 0.7X | ||
Wide column without encoding 10365 10403 51 0.1 10364.5 0.1X | ||
Wide column with UTF-8 15209 15226 15 0.1 15208.7 0.0X | ||
|
||
Preparing data for benchmarking ... | ||
OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure | ||
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz | ||
OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws | ||
Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz | ||
JSON functions: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative | ||
------------------------------------------------------------------------------------------------------------------------ | ||
Text read 147 148 0 6.8 147.2 1.0X | ||
from_json 2201 2202 1 0.5 2200.7 0.1X | ||
json_tuple 2452 2473 20 0.4 2452.5 0.1X | ||
get_json_object 2248 2263 22 0.4 2248.2 0.1X | ||
Text read 120 123 4 8.3 120.2 1.0X | ||
from_json 1944 1957 21 0.5 1944.4 0.1X | ||
json_tuple 2142 2146 4 0.5 2141.6 0.1X | ||
get_json_object 1967 1969 2 0.5 1966.7 0.1X | ||
|
||
Preparing data for benchmarking ... | ||
OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure | ||
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz | ||
OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws | ||
Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz | ||
Dataset of json strings: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative | ||
------------------------------------------------------------------------------------------------------------------------ | ||
Text read 647 654 7 7.7 129.4 1.0X | ||
schema inferring 2842 2862 25 1.8 568.4 0.2X | ||
parsing 3213 3239 33 1.6 642.6 0.2X | ||
Text read 537 542 4 9.3 107.5 1.0X | ||
schema inferring 2319 2323 4 2.2 463.7 0.2X | ||
parsing 2828 2854 29 1.8 565.6 0.2X | ||
|
||
Preparing data for benchmarking ... | ||
OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure | ||
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz | ||
OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws | ||
Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz | ||
Json files in the per-line mode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative | ||
------------------------------------------------------------------------------------------------------------------------ | ||
Text read 1046 1058 12 4.8 209.3 1.0X | ||
Schema inferring 3321 3378 58 1.5 664.2 0.3X | ||
Parsing without charset 3751 3791 36 1.3 750.2 0.3X | ||
Parsing with UTF-8 5361 5403 37 0.9 1072.1 0.2X | ||
Text read 798 811 16 6.3 159.6 1.0X | ||
Schema inferring 2774 2781 10 1.8 554.9 0.3X | ||
Parsing without charset 3213 3218 7 1.6 642.7 0.2X | ||
Parsing with UTF-8 4574 4588 13 1.1 914.7 0.2X | ||
|
||
OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure | ||
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz | ||
OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws | ||
Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz | ||
Write dates and timestamps: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative | ||
------------------------------------------------------------------------------------------------------------------------ | ||
Create a dataset of timestamps 171 173 2 5.8 171.3 1.0X | ||
to_json(timestamp) 1414 1427 12 0.7 1414.0 0.1X | ||
write timestamps to files 1183 1211 40 0.8 1183.2 0.1X | ||
Create a dataset of dates 191 198 7 5.2 191.5 0.9X | ||
to_json(date) 934 945 16 1.1 934.1 0.2X | ||
write dates to files 727 748 22 1.4 726.9 0.2X | ||
Create a dataset of timestamps 143 144 2 7.0 142.7 1.0X | ||
to_json(timestamp) 1075 1079 7 0.9 1074.9 0.1X | ||
write timestamps to files 928 932 4 1.1 928.1 0.2X | ||
Create a dataset of dates 165 170 4 6.1 165.2 0.9X | ||
to_json(date) 739 742 3 1.4 739.0 0.2X | ||
write dates to files 573 576 4 1.7 573.4 0.2X | ||
|
||
OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure | ||
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz | ||
OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws | ||
Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz | ||
Read dates and timestamps: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative | ||
------------------------------------------------------------------------------------------------------------------------ | ||
read timestamp text from files 263 264 1 3.8 262.8 1.0X | ||
read timestamps from files 2743 2807 59 0.4 2742.9 0.1X | ||
infer timestamps from files 14799 15093 383 0.1 14799.3 0.0X | ||
read date text from files 245 253 8 4.1 245.5 1.1X | ||
read date from files 998 1008 9 1.0 998.4 0.3X | ||
timestamp strings 383 403 17 2.6 382.8 0.7X | ||
parse timestamps from Dataset[String] 3165 3185 17 0.3 3165.4 0.1X | ||
infer timestamps from Dataset[String] 15717 15830 147 0.1 15717.2 0.0X | ||
date strings 434 450 19 2.3 433.5 0.6X | ||
parse dates from Dataset[String] 1466 1472 7 0.7 1465.6 0.2X | ||
from_json(timestamp) 4682 4736 50 0.2 4681.9 0.1X | ||
from_json(date) 2823 2848 22 0.4 2822.6 0.1X | ||
read timestamp text from files 215 220 5 4.6 215.2 1.0X | ||
read timestamps from files 2389 2424 31 0.4 2388.8 0.1X | ||
infer timestamps from files 6115 6122 11 0.2 6115.4 0.0X | ||
read date text from files 191 193 2 5.2 191.4 1.1X | ||
read date from files 840 841 2 1.2 839.7 0.3X | ||
timestamp strings 301 306 4 3.3 300.8 0.7X | ||
parse timestamps from Dataset[String] 2706 2713 6 0.4 2706.1 0.1X | ||
infer timestamps from Dataset[String] 6476 6482 5 0.2 6475.9 0.0X | ||
date strings 343 343 0 2.9 342.5 0.6X | ||
parse dates from Dataset[String] 1169 1172 5 0.9 1168.6 0.2X | ||
from_json(timestamp) 4067 4074 7 0.2 4066.5 0.1X | ||
from_json(date) 2470 2472 3 0.4 2469.9 0.1X | ||
|
||
OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure | ||
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz | ||
OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws | ||
Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz | ||
Filters pushdown: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative | ||
------------------------------------------------------------------------------------------------------------------------ | ||
w/o filters 21058 21148 143 0.0 210582.1 1.0X | ||
pushdown disabled 20208 20464 226 0.0 202080.3 1.0X | ||
w/ filters 750 756 6 0.1 7499.1 28.1X | ||
|
||
|
||
w/o filters 18219 18230 18 0.0 182188.8 1.0X | ||
pushdown disabled 17180 17183 4 0.0 171798.7 1.1X | ||
w/ filters 1197 1219 22 0.1 11974.0 15.2X |
Oops, something went wrong.