
API: expected result for pow(1, pd.NA) or pow(pd.NA, 0) #29997

Closed
TomAugspurger opened this issue Dec 3, 2019 · 13 comments · Fixed by #30097
Labels
Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Numeric Operations Arithmetic, Comparison, and Logical operations
Milestone

Comments

@TomAugspurger
Contributor

Should pow(1, pd.NA) be 1 or NA?

In [1]: import pandas as pd

In [2]: 1 ** pd.NA
Out[2]: NA

In [3]: import numpy as np

In [5]: 1 ** np.nan
Out[5]: 1.0

cc @jorisvandenbossche

@TomAugspurger
Contributor Author

R gives 1

> 1 ^ NA
[1] 1

@jorisvandenbossche
Member

Julia seems to return missing:

julia> 1 ^ missing
missing

But I think the rationale of returning 1 (i.e. "whatever number is used, 1 to the power of anything is always 1") also makes sense.

@jschendel
Member

Likewise there's an inconsistency with np.nan for pow(pd.NA, 0):

In [1]: import numpy as np; import pandas as pd; pd.__version__
Out[1]: '0.26.0.dev0+1155.ged20822a5'

In [2]: pd.NA ** 0
Out[2]: NA

In [3]: np.nan ** 0
Out[3]: 1.0

@jorisvandenbossche
Member

R also gives 1 then:

> NA ^ 0
[1] 1

@jorisvandenbossche
Member

I am +1 on changing those two cases

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Dec 5, 2019
@jreback jreback added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Numeric Operations Arithmetic, Comparison, and Logical operations labels Dec 5, 2019
@jreback jreback added this to the 1.0 milestone Dec 5, 2019
@brandon-b-miller

@TomAugspurger @jorisvandenbossche Sorry to raise an old issue; I've been looking at aligning cuDF with pandas for these special cases. I believe I follow what's been discussed so far here and in the associated PR; however, stopping nulls from propagating for special values arguably breaks some consistency. Would you mind sharing your thoughts on this tradeoff?

I don't think consistency in itself is necessarily a rock solid reason to change this, but also am struggling to be completely convinced that this is exactly right either.

@TomAugspurger
Contributor Author

I think our invariant is that NA represents an unknown value. So if you have an operation whose result is known from only one operand, like “True | NA”, then the output should be known and not NA.
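(For illustration, a minimal sketch of that invariant for a Kleene-style `or`, using a hypothetical `NA` stand-in rather than the real `pd.NA` implementation:)

```python
# Toy sketch of Kleene (three-valued) OR: NA propagates only when the
# result genuinely depends on the missing operand. `NAType`/`NA` here
# are hypothetical stand-ins, not pandas internals.
class NAType:
    def __repr__(self):
        return "NA"

    def __or__(self, other):
        # True | NA is True no matter what NA stands for;
        # False | NA stays unknown.
        return True if other is True else NA

    __ror__ = __or__

NA = NAType()

print(True | NA)   # True: result is known from one operand
print(False | NA)  # NA: result depends on the missing value
```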

@brandon-b-miller

brandon-b-miller commented Mar 2, 2021

@TomAugspurger Thank you for your response. That line of logic makes a lot of sense to me and I agree with it, with the True | NA example being especially clear.

That said, I noticed that in SQL and Spark, doing a SELECT TRUE | NULL gives me NULL. There's no standard that I know of for how to handle this in Python specifically, but if one takes the position that pandas and other dataframe-centric tools derive a lot of their logic and use cases from SQL, one might look to the ANSI SQL standard for answers. To be clear, I have not reviewed it yet, so I'm not sure whether it says anything about this, but I think it has a fairly robust set of semantics for missing data, and if we in the Python ecosystem opt to go a different way, I think it's worth discussing the implications of that choice.

EDIT: Did some asking around and turned up this text from an early draft of the 92 standard, FWIW.

        1) If the value of any <numeric primary> simply contained in a
            <numeric value expression> is the null value, then the result of
            the <numeric value expression> is the null value.

@jorisvandenbossche
Member

That said, I noticed that in SQL and Spark, doing a SELECT TRUE | NULL gives me NULL.

In the original discussion about using such Kleene logic (i.e. three-valued logic) for logical operators, I actually used the fact that SQL does this as well as an argument for it ... (see #28778).

Now, I am no SQL expert, so I could quite well be mistaken here. But from googling, https://modern-sql.com/concept/three-valued-logic seems to indicate that as well. And testing with Postgres gives:

test_db=# SELECT TRUE OR NULL;
 ?column? 
----------
 t
(1 row)

(so this returned True)

@brandon-b-miller

I think there might be something specific about short-circuiting in logical OR (or other such ops) that is producing that behavior; in Spark, I get the same thing. As such, I think I can concede the logical OR case. However, this rule doesn't seem to extend to POW.

sqltest=# SELECT POW(1, 2);
 pow 
-----
   1
(1 row)

sqltest=# SELECT POW(1, NULL);
 pow 
-----
    
(1 row)
sqltest=# SELECT POW(5, 0);
 pow 
-----
   1
(1 row)

sqltest=# SELECT POW(NULL, 0);
 pow 
-----
    
(1 row)

in spark, running

sp_df.withColumn('c', sp_df['a']**sp_df['b']).show()

gives me

+---+----+----+
|  a|   b|   c|
+---+----+----+
|  1|null|null|
+---+----+----+

@jorisvandenbossche jorisvandenbossche changed the title pow(1, pd.NA) maybe gives the wrong result API: expected result for pow(1, pd.NA) or pow(pd.NA, 0) Mar 8, 2021
@jorisvandenbossche
Member

Yeah, the power and logical ops are of course different operations, and indeed for power operations SQL seems to have a different behaviour compared to some other systems (eg numpy or R).

Now, I think there are valid arguments for both behaviours (always propagate nulls in arithmetic operations vs. the result is known regardless of which value the null would represent). So I'm not really sure what the best option is here.

@TomAugspurger
Contributor Author

Now, I think there are valid arguments for both behaviours (always propagate nulls in arithmetic operations vs. the result is known regardless of which value the null would represent). So I'm not really sure what the best option is here.

Why would we treat arithmetic operations differently from boolean? I think we should propagate NA if and only if the result is unknown. For the case of NA ** 0 the result is known to be 1 for all real numbers (assuming you accept that 0**0 is 1, which is what Python does).

@brandon-b-miller

Having thought about this for a few days, I admit I still lean a little towards propagating nulls, because it is a simple rule that leads to predictable behavior. It also means a library doesn't have to introspect its own data to decide when to do an end run around its own null logic. Admittedly this is more of a problem for the library I work on than for pandas, but it seems like it could mean extra maintenance on the pandas side as well, especially when other questionable cases come to mind (should pd.NA * 0 be 0?)
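(For what it's worth, the NA * 0 case does differ from NA ** 0 under float semantics: if the missing value could be infinite, the product is not determined, whereas the power still is. A quick check in plain Python:)

```python
import math

# Under IEEE 754 float arithmetic, 0 times an infinite value is NaN,
# so "NA * 0 == 0" only holds if NA is known to be finite.
print(0 * math.inf)              # nan
print(math.isnan(0 * math.inf))  # True

# By contrast, x ** 0 is 1 even for infinite x:
print(math.inf ** 0)             # 1.0
```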
