Minor: Add tests showing aggregate behavior for NaNs #10634

Merged
merged 3 commits into from
May 27, 2024
35 changes: 35 additions & 0 deletions datafusion/sqllogictest/test_files/aggregate.slt
@@ -4626,6 +4626,41 @@ GROUP BY dummy
----
text1, text1, text1

# Tests for aggregating with NaN values
statement ok
CREATE TABLE float_table (
col_f32 FLOAT,
col_f32_nan FLOAT,
col_f64 DOUBLE,
col_f64_nan DOUBLE
) as VALUES
( -128.2, -128.2, -128.2, -128.2 ),
( 32768.3, arrow_cast('NAN','Float32'), 32768.3, 32768.3 ),
( 27.3, 27.3, 27.3, arrow_cast('NAN','Float64') );

query RRRRI
select min(col_f32), max(col_f32), avg(col_f32), sum(col_f32), count(col_f32) from float_table;
----
-128.2 32768.3 10889.13359451294 32667.40078353882 3
Contributor

What concerns me is the output below:

DataFusion CLI v38.0.0
> CREATE TABLE float_table (
     col_f32 FLOAT,
     col_f32_nan FLOAT,
     col_f64 DOUBLE,
     col_f64_nan DOUBLE
 ) as VALUES
 ( -128.2,  -128.2,                      -128.2,  -128.2 ),
 ( 32768.3, arrow_cast('NAN','Float32'), 32768.3, 32768.3 ),
 ( 27.3,    27.3,                        27.3,    arrow_cast('NAN','Float64') );
0 row(s) fetched. 
Elapsed 0.028 seconds.

> select min(col_f32), max(col_f32), avg(col_f32), sum(col_f32), count(col_f32) from float_table;
+--------------------------+--------------------------+--------------------------+--------------------------+----------------------------+
| MIN(float_table.col_f32) | MAX(float_table.col_f32) | AVG(float_table.col_f32) | SUM(float_table.col_f32) | COUNT(float_table.col_f32) |
+--------------------------+--------------------------+--------------------------+--------------------------+----------------------------+
| -128.2                   | 32768.3                  | 10889.13359451294        | 32667.40078353882        | 3                          |
+--------------------------+--------------------------+--------------------------+--------------------------+----------------------------+
1 row(s) fetched. 
Elapsed 0.011 seconds.


but in both DuckDB and Postgres:

>
>  select -128.2 col_f32,  -128.2 col_f32_nan,                      -128.2 col_f64,  -128.2 col_f64_nan union all
>  select 32768.3, 'NaN'::DOUBLE PRECISION, 32768.3, 32768.3 union all
> select 27.3,    27.3,                        27.3,    'NaN'::DOUBLE PRECISION) x
> ;
┌──────────────┬──────────────┬────────────────────┬───────────────┬────────────────┐
│ min(col_f32) │ max(col_f32) │    avg(col_f32)    │ sum(col_f32)  │ count(col_f32) │
│ decimal(6,1) │ decimal(6,1) │       double       │ decimal(38,1) │     int64      │
├──────────────┼──────────────┼────────────────────┼───────────────┼────────────────┤
│       -128.2 │      32768.3 │ 10889.133333333333 │       32667.4 │              3 │
└──────────────┴──────────────┴────────────────────┴───────────────┴────────────────┘

Contributor

AVG is incorrect precision

Contributor Author

> AVG is incorrect precision

One reason the AVG result may differ is that your example used single precision (f32). When I run the same query with double precision (f64), DataFusion gets the same 10889.133333333333 as DuckDB:

> select min(col_f64), max(col_f64), avg(col_f64), sum(col_f64), count(col_f64) from float_table;
+--------------------------+--------------------------+--------------------------+--------------------------+----------------------------+
| MIN(float_table.col_f64) | MAX(float_table.col_f64) | AVG(float_table.col_f64) | SUM(float_table.col_f64) | COUNT(float_table.col_f64) |
+--------------------------+--------------------------+--------------------------+--------------------------+----------------------------+
| -128.2                   | 32768.3                  | 10889.133333333333       | 32667.4                  | 3                          |
+--------------------------+--------------------------+--------------------------+--------------------------+----------------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.
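
The single-precision explanation can be checked outside the engine. The following is a minimal Rust sketch, assuming (not confirmed by this thread) that the Float32 inputs are widened to f64 before being accumulated; `widened_sum_and_avg` is a hypothetical helper, not a DataFusion function. Under that assumption, the rounding error already present in each f32 value carries into the f64 sum, which is consistent with the 32667.40078353882 and 10889.13359451294 reported above.

```rust
/// Sum and average Float32 inputs after widening each value to f64.
/// Assumption: illustrative only; this is not DataFusion's accumulator.
fn widened_sum_and_avg(values: &[f32]) -> (f64, f64) {
    // f64::from(f32) is exact, so the f32 rounding error is preserved.
    let sum: f64 = values.iter().map(|&v| f64::from(v)).sum();
    (sum, sum / values.len() as f64)
}

fn main() {
    // Same three values as col_f32 and col_f64 in the test table.
    let col_f32 = [-128.2_f32, 32768.3, 27.3];
    let col_f64 = [-128.2_f64, 32768.3, 27.3];

    let (sum32, avg32) = widened_sum_and_avg(&col_f32);
    let sum64: f64 = col_f64.iter().sum();
    let avg64 = sum64 / col_f64.len() as f64;

    println!("f32 widened: sum={sum32} avg={avg32}");
    println!("f64 native:  sum={sum64} avg={avg64}");
}
```

The f32 literals cannot represent values such as 32768.3 exactly, so the widened sum differs from 32667.4 in the fourth decimal place, while the native f64 path stays close to the decimal result.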


query RRRRI
select min(col_f32_nan), max(col_f32_nan), avg(col_f32_nan), sum(col_f32_nan), count(col_f32_nan) from float_table;
----
-128.2 NaN NaN NaN 3

query RRRRI
select min(col_f64), max(col_f64), avg(col_f64), sum(col_f64), count(col_f64) from float_table;
----
-128.2 32768.3 10889.133333333333 32667.4 3

query RRRRI
select min(col_f64_nan), max(col_f64_nan), avg(col_f64_nan), sum(col_f64_nan), count(col_f64_nan) from float_table;
----
-128.2 NaN NaN NaN 3

statement ok
drop table float_table
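
The NaN expectations above (min is -128.2, max/avg/sum are NaN, count is 3) can be reproduced with a small Rust sketch. It assumes that min and max compare values with a total order in which a positive NaN sorts above every number, which is what `f64::total_cmp` provides; whether DataFusion uses that exact comparison is an assumption here, and `FloatAgg` and `aggregate` are hypothetical names used only for illustration.

```rust
/// Aggregates over a non-empty slice of f64 values (hypothetical helper type).
struct FloatAgg {
    min: f64,
    max: f64,
    sum: f64,
    avg: f64,
    count: usize,
}

/// Compute min/max with a total order and sum/avg with ordinary IEEE 754 addition.
/// Assumption: NaN is counted like any other non-null value.
fn aggregate(values: &[f64]) -> Option<FloatAgg> {
    // total_cmp places a positive NaN above +infinity, so it wins max but not min.
    let min = values.iter().copied().min_by(|a, b| a.total_cmp(b))?;
    let max = values.iter().copied().max_by(|a, b| a.total_cmp(b))?;
    // Any addition involving NaN yields NaN, so sum and avg become NaN.
    let sum: f64 = values.iter().sum();
    let count = values.len();
    Some(FloatAgg { min, max, sum, avg: sum / count as f64, count })
}

fn main() {
    // A quiet NaN with the sign bit clear, so its position under total_cmp is well defined.
    let nan = f64::from_bits(0x7ff8_0000_0000_0000);
    // Same values as col_f64_nan in the test table.
    let agg = aggregate(&[-128.2, 32768.3, nan]).expect("non-empty input");
    println!("{} {} {} {} {}", agg.min, agg.max, agg.avg, agg.sum, agg.count);
}
```

A NaN with the sign bit set would sort below every number under `total_cmp`, so the sketch builds its NaN from an explicit bit pattern rather than relying on a constant.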


# Queries with nested count(*)
