Approx Percentile #3301

Merged 60 commits on Sep 28, 2021
8527f94
Rough out approx percentile aggregation
andygrove Aug 26, 2021
ae1f093
Cast percentiles to array of input data type
andygrove Aug 26, 2021
06ea6be
expand tests
andygrove Aug 26, 2021
481ea42
wip save current work
andygrove Aug 31, 2021
edc7cda
Fix bug in reported data type and implement tests that check that GPU…
andygrove Aug 31, 2021
a051425
test passes from command line and IDE now
andygrove Aug 31, 2021
a5a069a
add comment
andygrove Sep 1, 2021
5f6b4ac
docs and mapping of spark delta to gpu delta
andygrove Sep 1, 2021
208f7c6
code cleanup
andygrove Sep 1, 2021
02f83b7
code cleanup
andygrove Sep 1, 2021
8404cb7
add pyspark tests
andygrove Sep 1, 2021
808a3c6
remove window function, add high level docs
andygrove Sep 1, 2021
eb8bcf8
remove special handling for optional accuracy
andygrove Sep 1, 2021
e1fbd61
address PR feedback
andygrove Sep 1, 2021
c85cd3a
address more feedback
andygrove Sep 1, 2021
9cd15f7
simply type checks
andygrove Sep 1, 2021
031d359
handle nulls in tests
andygrove Sep 9, 2021
dc2f74f
Update t-digest data type and fix scala style issues
andygrove Sep 13, 2021
5e76dba
add config to enable approx_percentile
andygrove Sep 13, 2021
0a0f9d2
fall back to cpu for reduction
andygrove Sep 14, 2021
c783dbd
Test with Spark 3.2.1-SNAPSHOT (#3479)
andygrove Sep 15, 2021
18c1761
more tests
andygrove Sep 15, 2021
f1dd0b4
docs
andygrove Sep 16, 2021
c052cb0
merge from branch-21.10
andygrove Sep 17, 2021
c9dc22b
merge from branch-21.10
andygrove Sep 17, 2021
7dbcb0a
fix regression
andygrove Sep 17, 2021
16af8a9
Use same delta as Spark, with minimum of 1000
andygrove Sep 21, 2021
ca60d44
save progress with integration tests
andygrove Sep 21, 2021
5013dd9
make scala and integration tests consistent
andygrove Sep 21, 2021
52d82fc
more tests
andygrove Sep 21, 2021
532e970
handle scalar percentile
andygrove Sep 22, 2021
afbc999
handle scalar percentile
andygrove Sep 22, 2021
70451fc
make test more robust and remove temp debug logging
andygrove Sep 22, 2021
9b53e70
rename var
andygrove Sep 22, 2021
f5fa656
documentation
andygrove Sep 22, 2021
40ca3c3
Fall back to CPU for some edge cases
andygrove Sep 22, 2021
fc823a2
array with nulls fallback to cpu
andygrove Sep 22, 2021
856e941
fall back to CPU for decimal types
andygrove Sep 22, 2021
2d1935d
Remove allowedNonGpu=SortExec in tests
andygrove Sep 22, 2021
1764c87
use ArrayData instead of GenericArrayData
andygrove Sep 22, 2021
f0ac460
improve exception
andygrove Sep 22, 2021
a1879a9
enable decimal tests
andygrove Sep 22, 2021
b50c124
update comment
andygrove Sep 22, 2021
d5aa9e3
revert TypeSig change
andygrove Sep 22, 2021
ff5f101
update supported ops
andygrove Sep 22, 2021
474fcd5
upmerge
andygrove Sep 22, 2021
3ac4da7
fix merge conflict
andygrove Sep 22, 2021
97c1c8c
documentation for tests
andygrove Sep 22, 2021
8e7c986
add error handling for unexpected percentile type
andygrove Sep 22, 2021
4eea440
remove custom config and use disableByDefault instead
andygrove Sep 22, 2021
d6a73f7
remove custom config and use disableByDefault instead
andygrove Sep 22, 2021
aa05cc4
cherry pick groupByOnly
ttnghia Sep 22, 2021
a653ef8
use ExprChecks.groupByOnly
andygrove Sep 22, 2021
e49576e
generated docs
andygrove Sep 23, 2021
5d42e2c
Update sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuApproxima…
andygrove Sep 23, 2021
3514d4e
use extractListElement instead of getChildColumnView
andygrove Sep 23, 2021
a774f2c
Merge branch 'approx-percentile' of github.com:andygrove/spark-rapids…
andygrove Sep 23, 2021
967db59
fix regression and address nits
andygrove Sep 27, 2021
75efd33
fix build error with 320
andygrove Sep 27, 2021
31759a7
Use columnarEvalToColumn and ignore ANSI mode
andygrove Sep 27, 2021
1 change: 1 addition & 0 deletions docs/configs.md
@@ -299,6 +299,7 @@ Name | SQL Function(s) | Description | Default Value | Notes
<a name="sql.expression.WindowSpecDefinition"></a>spark.rapids.sql.expression.WindowSpecDefinition| |Specification of a window function, indicating the partitioning-expression, the row ordering, and the width of the window|true|None|
<a name="sql.expression.Year"></a>spark.rapids.sql.expression.Year|`year`|Returns the year from a date or timestamp|true|None|
<a name="sql.expression.AggregateExpression"></a>spark.rapids.sql.expression.AggregateExpression| |Aggregate expression|true|None|
+<a name="sql.expression.ApproximatePercentile"></a>spark.rapids.sql.expression.ApproximatePercentile|`percentile_approx`, `approx_percentile`|Approximate percentile|true|None|
<a name="sql.expression.Average"></a>spark.rapids.sql.expression.Average|`avg`, `mean`|Average aggregate operator|true|None|
<a name="sql.expression.CollectList"></a>spark.rapids.sql.expression.CollectList|`collect_list`|Collect a list of non-unique elements, not supported in reduction.|true|None|
<a name="sql.expression.CollectSet"></a>spark.rapids.sql.expression.CollectSet|`collect_set`|Collect a set of unique elements, not supported in reduction.|true|None|
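The row added to docs/configs.md in this hunk lists the key `spark.rapids.sql.expression.ApproximatePercentile`. As a minimal sketch (not part of this PR), the Python helper below builds the `--conf` arguments one might pass to `spark-submit` to toggle such a key. The helper name `expression_conf_args` and the constant `CONF_PREFIX` are hypothetical, and the sketch assumes the key follows the `spark.rapids.sql.expression.<Name>` pattern visible in the neighbouring rows.

```python
from typing import List

# Hypothetical helper (not from the PR): build spark-submit arguments that
# toggle a RAPIDS expression config such as the one listed in docs/configs.md.
CONF_PREFIX = "spark.rapids.sql.expression."  # prefix seen in the diff rows


def expression_conf_args(expression: str, enabled: bool) -> List[str]:
    """Return ["--conf", "<key>=<true|false>"] for the given expression name."""
    if not expression.isidentifier():
        # Guard against typos such as embedded spaces or dots.
        raise ValueError(f"unexpected expression name: {expression!r}")
    value = "true" if enabled else "false"
    return ["--conf", f"{CONF_PREFIX}{expression}={value}"]


if __name__ == "__main__":
    # Example: explicitly enable the approximate percentile expression.
    print(" ".join(expression_conf_args("ApproximatePercentile", True)))
```

Setting the key explicitly may be useful because the default shown in this table can change between revisions of the PR; check the generated docs for the release you run.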
376 changes: 373 additions & 3 deletions docs/supported_ops.md
@@ -534,9 +534,9 @@ Accelerator supports are described below.
<td>S</td>
<td><b>NS</b></td>
<td><b>NS</b></td>
-<td><em>PS<br/>not allowed for grouping expressions;<br/>max nested DECIMAL precision of 18;<br/>UTC is only supported TZ for nested TIMESTAMP;<br/>missing nested BINARY, CALENDAR, ARRAY, MAP, STRUCT, UDT</em></td>
-<td><em>PS<br/>not allowed for grouping expressions;<br/>max nested DECIMAL precision of 18;<br/>UTC is only supported TZ for nested TIMESTAMP;<br/>missing nested BINARY, CALENDAR, ARRAY, MAP, STRUCT, UDT</em></td>
-<td><em>PS<br/>not allowed for grouping expressions;<br/>max nested DECIMAL precision of 18;<br/>UTC is only supported TZ for nested TIMESTAMP;<br/>missing nested BINARY, CALENDAR, ARRAY, MAP, STRUCT, UDT</em></td>
+<td><em>PS<br/>not allowed for grouping expressions;<br/>max nested DECIMAL precision of 18;<br/>UTC is only supported TZ for nested TIMESTAMP;<br/>missing nested BINARY, CALENDAR, ARRAY, MAP, UDT</em></td>
+<td><em>PS<br/>not allowed for grouping expressions;<br/>max nested DECIMAL precision of 18;<br/>UTC is only supported TZ for nested TIMESTAMP;<br/>missing nested BINARY, CALENDAR, ARRAY, MAP, UDT</em></td>
+<td><em>PS<br/>not allowed for grouping expressions;<br/>max nested DECIMAL precision of 18;<br/>UTC is only supported TZ for nested TIMESTAMP;<br/>missing nested BINARY, CALENDAR, ARRAY, MAP, UDT</em></td>
<td><b>NS</b></td>
</tr>
<tr>
@@ -13045,6 +13045,376 @@ are limited.
<th>UDT</th>
</tr>
<tr>
<td rowSpan="16">ApproximatePercentile</td>
<td rowSpan="16">`percentile_approx`, `approx_percentile`</td>
<td rowSpan="16">Approximate percentile</td>
<td rowSpan="16">None</td>
<td rowSpan="4">aggregation</td>
<td>input</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>UTC is only supported TZ for TIMESTAMP</em></td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>UTC is only supported TZ for nested TIMESTAMP</em></td>
<td><em>PS<br/>UTC is only supported TZ for nested TIMESTAMP</em></td>
<td><em>PS<br/>UTC is only supported TZ for nested TIMESTAMP</em></td>
<td>S</td>
</tr>
<tr>
<td>percentage</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>UTC is only supported TZ for TIMESTAMP</em></td>
<td>S</td>
<td><em>PS<br/>max DECIMAL precision of 18</em></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><em>PS<br/>max nested DECIMAL precision of 18;<br/>UTC is only supported TZ for nested TIMESTAMP;<br/>missing nested NULL, BINARY, CALENDAR, ARRAY, MAP, STRUCT, UDT</em></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
</tr>
<tr>
<td>accuracy</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>UTC is only supported TZ for TIMESTAMP</em></td>
<td>S</td>
<td><em>PS<br/>max DECIMAL precision of 18</em></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
</tr>
<tr>
<td>result</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td><em>PS<br/>missing nested BOOLEAN, DATE, TIMESTAMP, STRING, NULL, BINARY, CALENDAR, ARRAY, MAP, STRUCT, UDT</em></td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td rowSpan="4">reduction</td>
<td>input</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>UTC is only supported TZ for TIMESTAMP</em></td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>UTC is only supported TZ for nested TIMESTAMP</em></td>
<td><em>PS<br/>UTC is only supported TZ for nested TIMESTAMP</em></td>
<td><em>PS<br/>UTC is only supported TZ for nested TIMESTAMP</em></td>
<td>S</td>
</tr>
<tr>
<td>percentage</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>UTC is only supported TZ for TIMESTAMP</em></td>
<td>S</td>
<td><em>PS<br/>max DECIMAL precision of 18</em></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><em>PS<br/>max nested DECIMAL precision of 18;<br/>UTC is only supported TZ for nested TIMESTAMP;<br/>missing nested NULL, BINARY, CALENDAR, ARRAY, MAP, STRUCT, UDT</em></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
</tr>
<tr>
<td>accuracy</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>UTC is only supported TZ for TIMESTAMP</em></td>
<td>S</td>
<td><em>PS<br/>max DECIMAL precision of 18</em></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
</tr>
<tr>
<td>result</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td><em>PS<br/>missing nested BOOLEAN, DATE, TIMESTAMP, STRING, NULL, BINARY, CALENDAR, ARRAY, MAP, STRUCT, UDT</em></td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td rowSpan="4">window</td>
<td>input</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>UTC is only supported TZ for TIMESTAMP</em></td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>UTC is only supported TZ for nested TIMESTAMP</em></td>
<td><em>PS<br/>UTC is only supported TZ for nested TIMESTAMP</em></td>
<td><em>PS<br/>UTC is only supported TZ for nested TIMESTAMP</em></td>
<td>S</td>
</tr>
<tr>
<td>percentage</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>UTC is only supported TZ for TIMESTAMP</em></td>
<td>S</td>
<td><em>PS<br/>max DECIMAL precision of 18</em></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><em>PS<br/>max nested DECIMAL precision of 18;<br/>UTC is only supported TZ for nested TIMESTAMP;<br/>missing nested NULL, BINARY, CALENDAR, ARRAY, MAP, STRUCT, UDT</em></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
</tr>
<tr>
<td>accuracy</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>UTC is only supported TZ for TIMESTAMP</em></td>
<td>S</td>
<td><em>PS<br/>max DECIMAL precision of 18</em></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
</tr>
<tr>
<td>result</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td><em>PS<br/>missing nested BOOLEAN, DATE, TIMESTAMP, STRING, NULL, BINARY, CALENDAR, ARRAY, MAP, STRUCT, UDT</em></td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td rowSpan="4">project</td>
<td>input</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>UTC is only supported TZ for TIMESTAMP</em></td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>UTC is only supported TZ for nested TIMESTAMP</em></td>
<td><em>PS<br/>UTC is only supported TZ for nested TIMESTAMP</em></td>
<td><em>PS<br/>UTC is only supported TZ for nested TIMESTAMP</em></td>
<td>S</td>
</tr>
<tr>
<td>percentage</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>UTC is only supported TZ for TIMESTAMP</em></td>
<td>S</td>
<td><em>PS<br/>max DECIMAL precision of 18</em></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><em>PS<br/>max nested DECIMAL precision of 18;<br/>UTC is only supported TZ for nested TIMESTAMP;<br/>missing nested NULL, BINARY, CALENDAR, ARRAY, MAP, STRUCT, UDT</em></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
</tr>
<tr>
<td>accuracy</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>UTC is only supported TZ for TIMESTAMP</em></td>
<td>S</td>
<td><em>PS<br/>max DECIMAL precision of 18</em></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
</tr>
<tr>
<td>result</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td><em>PS<br/>missing nested BOOLEAN, DATE, TIMESTAMP, STRING, NULL, BINARY, CALENDAR, ARRAY, MAP, STRUCT, UDT</em></td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<th>Expression</th>
<th>SQL Functions(s)</th>
<th>Description</th>
<th>Notes</th>
<th>Context</th>
<th>Param/Output</th>
<th>BOOLEAN</th>
<th>BYTE</th>
<th>SHORT</th>
<th>INT</th>
<th>LONG</th>
<th>FLOAT</th>
<th>DOUBLE</th>
<th>DATE</th>
<th>TIMESTAMP</th>
<th>STRING</th>
<th>DECIMAL</th>
<th>NULL</th>
<th>BINARY</th>
<th>CALENDAR</th>
<th>ARRAY</th>
<th>MAP</th>
<th>STRUCT</th>
<th>UDT</th>
</tr>
<tr>
<td rowSpan="6">Average</td>
<td rowSpan="6">`avg`, `mean`</td>
<td rowSpan="6">Average aggregate operator</td>