REGR: AssertionError when subtracting Timestamp-valued DataFrames with non-indentical column index #31623

DomKennedy · 2020-02-03T15:10:53Z

import pandas as pd

df = pd.DataFrame(
    {
        "foo": [pd.Timestamp("2019"), pd.Timestamp("2020")],
        "bar": [pd.Timestamp("2018"), pd.Timestamp("2021")],
    }
)

df2 = df[["foo"]]

print(df - df2)

Problem description

The above snippet raises the following exception:

Traceback (most recent call last):
  File ".venv/lib/python3.6/site-packages/pandas/core/ops/array_ops.py", line 149, in na_arithmetic_op
    result = expressions.evaluate(op, str_rep, left, right)
  File ".venv/lib/python3.6/site-packages/pandas/core/computation/expressions.py", line 208, in evaluate
    return _evaluate(op, op_str, a, b)
  File ".venv/lib/python3.6/site-packages/pandas/core/computation/expressions.py", line 70, in _evaluate_standard
    return op(a, b)
  File ".venv/lib/python3.6/site-packages/pandas/core/ops/common.py", line 64, in new_method
    return method(self, other)
  File ".venv/lib/python3.6/site-packages/pandas/core/ops/__init__.py", line 500, in wrapper
    result = arithmetic_op(lvalues, rvalues, op, str_rep)
  File ".venv/lib/python3.6/site-packages/pandas/core/ops/array_ops.py", line 192, in arithmetic_op
    res_values = dispatch_to_extension_op(op, lvalues, rvalues)
  File ".venv/lib/python3.6/site-packages/pandas/core/ops/dispatch.py", line 125, in dispatch_to_extension_op
    res_values = op(left, right)
  File ".venv/lib/python3.6/site-packages/pandas/core/arrays/datetimelike.py", line 1390, in __rsub__
    f"cannot subtract {type(self).__name__} from {type(other).__name__}"
TypeError: cannot subtract DatetimeArray from ndarray

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "pandas_bug.py", line 36, in <module>
    print(df2 - df)
  File ".venv/lib/python3.6/site-packages/pandas/core/ops/__init__.py", line 703, in f
    new_data = left._combine_frame(right, pass_op, fill_value)
  File ".venv/lib/python3.6/site-packages/pandas/core/frame.py", line 5297, in _combine_frame
    new_data = ops.dispatch_to_series(self, other, _arith_op)
  File ".venv/lib/python3.6/site-packages/pandas/core/ops/__init__.py", line 416, in dispatch_to_series
    new_data = expressions.evaluate(column_op, str_rep, left, right)
  File ".venv/lib/python3.6/site-packages/pandas/core/computation/expressions.py", line 208, in evaluate
    return _evaluate(op, op_str, a, b)
  File ".venv/lib/python3.6/site-packages/pandas/core/computation/expressions.py", line 70, in _evaluate_standard
    return op(a, b)
  File ".venv/lib/python3.6/site-packages/pandas/core/ops/__init__.py", line 385, in column_op
    return {i: func(a.iloc[:, i], b.iloc[:, i]) for i in range(len(a.columns))}
  File ".venv/lib/python3.6/site-packages/pandas/core/ops/__init__.py", line 385, in <dictcomp>
    return {i: func(a.iloc[:, i], b.iloc[:, i]) for i in range(len(a.columns))}
  File ".venv/lib/python3.6/site-packages/pandas/core/ops/array_ops.py", line 121, in na_op
    return na_arithmetic_op(x, y, op, str_rep)
  File ".venv/lib/python3.6/site-packages/pandas/core/ops/array_ops.py", line 151, in na_arithmetic_op
    result = masked_arith_op(left, right, op)
  File ".venv/lib/python3.6/site-packages/pandas/core/ops/array_ops.py", line 75, in masked_arith_op
    assert isinstance(x, np.ndarray), type(x)

This is a 1.0.0 regression; in 0.25.3, the operation succeeds and the unmatched bar column is filled with NaN in the output.

The same error occurs with:

Any combination of incompatible columns (strict subset, strict superset, overlapping, disjoint)
Calling the subtract method instead of using the subtraction operator
Timezone-aware Timestamps as well as timezone-naive

It does not seem to occur with:

Mismatches on the row index; transposing the dataframes in the above example prevents the error from occurring.
pd.Series objects with mismatched indexes (e.g. calling the above on the first row of each dataframe works fine)
Other dtypes; bool, float, and int seem to work fine. Similarly, if the dataframes are explicitly cast to dtype object, the operation succeeds.
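The Series case from the list above can be checked directly. This is a minimal sketch that reuses the frames from the report; the names `row_full`, `row_sub`, and `series_diff` are only for illustration and are not part of the original snippet.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "foo": [pd.Timestamp("2019"), pd.Timestamp("2020")],
        "bar": [pd.Timestamp("2018"), pd.Timestamp("2021")],
    }
)
df2 = df[["foo"]]

# First row of each frame: two Series whose indexes only partly overlap.
row_full = df.iloc[0]  # labels: foo, bar
row_sub = df2.iloc[0]  # labels: foo

# Series arithmetic aligns on the index; the label missing from
# row_sub ends up as a missing value in the result.
series_diff = row_full - row_sub
print(series_diff)
```

The shared label `foo` yields a zero timedelta, and the unmatched label `bar` is missing rather than raising.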

Expected Output

   bar    foo
0  NaN 0 days
1  NaN 0 days
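The expected output above can be turned into a small regression check. This is a sketch that assumes a pandas version in which the regression is fixed (this issue was closed via #31679); on 1.0.0 the subtraction raises instead. The name `result` is only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "foo": [pd.Timestamp("2019"), pd.Timestamp("2020")],
        "bar": [pd.Timestamp("2018"), pd.Timestamp("2021")],
    }
)
df2 = df[["foo"]]

result = df - df2

# Shared column: identical timestamps, so every difference is zero.
assert (result["foo"] == pd.Timedelta(0)).all()
# Unmatched column: nothing to subtract, so every entry is missing.
assert result["bar"].isna().all()
print(result)
```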

Output of `pd.show_versions()`

<details>

```
INSTALLED VERSIONS
------------------
commit : None
python : 3.6.8.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-74-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.0.0
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 41.6.0
Cython : None
pytest : 5.3.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.5
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None
```

</details>


TomAugspurger · 2020-02-03T16:03:12Z

Thanks for the report. The NaNs are introduced in pandas/pandas/core/ops/__init__.py, line 725 in a2721fd:

self, other = _align_method_FRAME(self, other, axis, flex=True, level=level)

which calls DataFrame.align.

I wonder, should this be changed?

In [6]: df.align(df2)[1]
Out[6]:
   bar        foo
0  NaN 2019-01-01
1  NaN 2020-01-01

to have bar be datetime64[ns] dtype, to match the left?
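To see what alignment does to the right-hand frame, the dtypes of the aligned result can be inspected. This is a minimal sketch using the frames from the report; the dtype given to the newly created `bar` column may differ between pandas versions, so the code prints it rather than assuming a value.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "foo": [pd.Timestamp("2019"), pd.Timestamp("2020")],
        "bar": [pd.Timestamp("2018"), pd.Timestamp("2021")],
    }
)
df2 = df[["foo"]]

# align returns both frames reindexed to the union of their labels.
left, right = df.align(df2)

# The left frame keeps its datetime columns.
print(left.dtypes)
# The right frame gains a "bar" column that holds only missing values;
# its dtype is what the question above is about.
print(right.dtypes)
print(right["bar"].isna().all())
```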

TomAugspurger · 2020-02-03T16:03:56Z

cc @jbrockmendel.

jbrockmendel · 2020-02-03T16:20:29Z

I'll look at this today.

jbrockmendel · 2020-02-04T19:58:10Z

So this is pretty ugly, but one option that tentatively works is to patch ops._arith_method_FRAME so that we only operate on shared columns, then reindex the result.
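The idea of operating only on shared columns and then reindexing can be illustrated at user level. This is a rough sketch, not the actual patch to `ops._arith_method_FRAME`; the helper name `sub_on_shared_columns` is hypothetical.

```python
import pandas as pd


def sub_on_shared_columns(left, right):
    """Subtract only the columns both frames have, then restore the rest as missing."""
    shared = left.columns.intersection(right.columns)
    all_columns = left.columns.union(right.columns)
    # The operation only ever sees columns present on both sides,
    # so no NaN-filled column is paired with a datetime column.
    partial = left[shared] - right[shared]
    # Reindexing adds the unmatched columns back, filled with missing values.
    return partial.reindex(columns=all_columns)


df = pd.DataFrame(
    {
        "foo": [pd.Timestamp("2019"), pd.Timestamp("2020")],
        "bar": [pd.Timestamp("2018"), pd.Timestamp("2021")],
    }
)
df2 = df[["foo"]]

shared_result = sub_on_shared_columns(df, df2)
print(shared_result)
```

Because the unmatched columns never enter the arithmetic, the cost of the operation scales with the number of shared columns rather than with the union.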

jbrockmendel · 2020-02-04T20:21:12Z

might actually improve perf for cases where we have very few shared columns

TomAugspurger · 2020-02-04T20:22:02Z

That seems reasonable. The alternative is to ensure that the correct fill_value is used in align, which seems difficult since we'd potentially have different fill values for different columns / dtypes.

Is that likely to cause issues with methods like DataFrame.add? I forget whether the fill_value from add is done before or after the op.
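For reference, the pandas documentation describes `fill_value` in the flex methods as being applied before the computation, to existing missing values and to elements created by alignment. A small sketch with numeric frames (chosen for clarity; the frames `a` and `b` are made up for this example) shows the visible effect:

```python
import pandas as pd

a = pd.DataFrame({"x": [1.0, 2.0], "y": [10.0, 20.0]})
b = pd.DataFrame({"x": [5.0, 6.0]})

# Without fill_value, the column missing from b is missing in the result.
plain = a.add(b)

# With fill_value=0, the missing side is treated as 0 before adding,
# so column "y" keeps the values from a.
filled = a.add(b, fill_value=0)

print(plain)
print(filled)
```

This matters for the shared-columns approach because a frame operation with a `fill_value` cannot simply drop the unmatched columns; they contribute to the result.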

jbrockmendel · 2020-02-05T00:21:01Z

> The alternative is to ensure that the correct fill_value is used in align, which seems difficult since we'd potentially have different fill values for different columns / dtypes.

Yeah, it would also depend on the op, which would become a nightmare.

I'll put up a proof of concept in a bit

TomAugspurger · 2020-02-05T16:04:07Z

@DomKennedy in the meantime, here's a workaround

In [14]: import operator

In [15]: operator.sub(*df.align(df2, fill_value=pd.NaT))
Out[15]:
  bar    foo
0 NaT 0 days
1 NaT 0 days

There are lots of issues with that (if you have other columns that don't align, NaT won't be the right fill value) but hopefully not too bad for now.

We'll try to get this fixed properly for 1.0.2.

DomKennedy changed the title ~~REGR: AssertionError when when subtracting Timestamp-valued DataFrames with non-indentical column index~~ REGR: AssertionError when subtracting Timestamp-valued DataFrames with non-indentical column index Feb 3, 2020

TomAugspurger added the Missing-data and Numeric Operations labels Feb 3, 2020

jorisvandenbossche added this to the 1.0.1 milestone Feb 4, 2020

jorisvandenbossche added the Regression label Feb 4, 2020

jbrockmendel mentioned this issue Feb 5, 2020

REGR: fix op(frame, frame2) with reindex #31679

Merged

5 tasks

TomAugspurger mentioned this issue Feb 5, 2020

RLS: 1.0.1 #31523

Closed

jorisvandenbossche modified the milestones: 1.0.1, 1.0.2 Feb 5, 2020

jreback closed this as completed in #31679 Feb 19, 2020

sfc-gh-mvashishtha mentioned this issue Aug 16, 2024

BUG: subtracting datetime series from datetime dataframe, or datetime dataframe from datetime series, raises TypeError or UFuncTypeError #59529

Open

3 tasks
