Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Change in series copy behavior in 1.5 #48664

Open
3 tasks done
anthonyaag opened this issue Sep 20, 2022 · 6 comments
Open
3 tasks done

BUG: Change in series copy behavior in 1.5 #48664

anthonyaag opened this issue Sep 20, 2022 · 6 comments

Comments

@anthonyaag
Copy link

anthonyaag commented Sep 20, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
ser1 = pd.Series([1,2,3,4])
ser2 = pd.Series(ser1)
ser3 = pd.Series(ser1, index=ser1.index)
print(ser1.index is ser2.index)
print(ser1.index is ser3.index)
print(ser1._mgr is ser2._mgr)
print(ser1._mgr is ser3._mgr)

Issue Description

Hi All!

Very excited to use pandas==1.5.0 but noticed this change. Not sure if its intentional as part of some of the series refactoring

In pandas 1.4.4 all of the above return true but in pandas 1.5.0 ser1._mgr is ser3._mgr returns False which is unexpected

Now when passed an index kwarg, that generates a new SingleBlockManager, but before 1.5 it didn't.

Expected Behavior

In [1]: import pandas as pd

In [2]: ser1 = pd.Series([1,2,3,4])
...: ser2 = pd.Series(ser1)
...: ser3 = pd.Series(ser1, index=ser1.index)
...: print(ser1.index is ser2.index)
...: print(ser1.index is ser3.index)
...: print(ser1._mgr is ser2._mgr)
...: print(ser1._mgr is ser3._mgr)
True
True
True
True

Installed Versions

INSTALLED VERSIONS

commit : 87cfe4e
python : 3.10.7.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.117.hrtdev
Version : #3 SMP Tue May 11 13:44:03 EDT 2021
machine : x86_64
processor :
byteorder : little
LC_ALL :
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.5.0
numpy : 1.23.2
pytz : 2022.1
dateutil : 2.8.2
setuptools : 62.1.0
pip : 22.2.2
Cython : 0.29.30
pytest : 7.1.2
hypothesis : 6.52.3
sphinx : 5.0.2
blosc : 1.10.6
feather : None
xlsxwriter : None
lxml.etree : 4.9.1
html5lib : 1.1
pymysql : 1.0.2
psycopg2 : 2.9.3
jinja2 : 3.1.2
IPython : 8.4.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : 1.3.5
brotli : 1.0.9
fastparquet : 0.8.1
fsspec : 2022.5.0
gcsfs : None
matplotlib : 3.5.2
numba : 0.56.2
numexpr : 2.8.3
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 7.0.0
pyreadstat : None
pyxlsb : 1.0.9
s3fs : None
scipy : 1.9.1
snappy : None
sqlalchemy : 1.4.39
tables : 3.7.0
tabulate : 0.8.10
xarray : 2022.6.0
xlrd : 2.0.1
xlwt : None
zstandard : 0.18.0
tzdata : 2022.1

@anthonyaag anthonyaag added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 20, 2022
@phofl
Copy link
Member

phofl commented Sep 20, 2022

This was caused by

commit 221f6362bc25833da87f00015d4d5418ee316eff
Author: Joris Van den Bossche <[email protected]>
Date:   Sat Aug 20 20:45:23 2022 +0200

    API: New copy / view semantics using Copy-on-Write (#46958)

cc @jorisvandenbossche was this intended?

In general, this is private. Does this leak somehow into the public api?

@mroeschke mroeschke added Copy / view semantics and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 20, 2022
@phofl
Copy link
Member

phofl commented Sep 20, 2022

Adding milestone 1.5.1 for now

@phofl phofl added this to the 1.5.1 milestone Sep 20, 2022
@jorisvandenbossche
Copy link
Member

Thanks for the ping. I have to look into it in more depth, but from a quick look: this might be due to a change in reindex (when passing a Series object to the Series constructor, and also passing an index, then we reindex that series:

data = data.reindex(index, copy=copy)
). Before, reindex() could return self, but in #46958 I changed that to return a shallow copy (https://github.com/pandas-dev/pandas/pull/46958/files#diff-1a2e3df0db7dd8bddc2ec4bff9de8a7a55e328e6c32e2cecde761dc9549fcd46L5273-L5275).

That change in reindex itself was intentional (needed for CoW), but I didn't really consider the potential impact of it for reindex no longer returning self in case of an "identical" index (same values + same name). With 1.4, we have:

n [1]: s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

In [2]: s.reindex(s.index, copy=False) is s
Out[2]: True

In [3]: s.reindex(pd.Index(['a', 'b', 'c'], name="test"), copy=False) is s
Out[3]: False

In [4]: s.reindex(pd.Index(['a', 'b', 'c']), copy=False) is s
Out[4]: True

Personally, I would see this as a good change for reindex (to consistently return a new object, regardless of the index being equal or not). But it is a change in behaviour. So if we would like to preserve the old behaviour, it could easily be changed to only do this change if actually using CoW, and so the default behaviour wouldn't change.


@anthonyaag could you give a bit more detail how you ran into this specific issue? In what way to do rely on those managers being identical?

@anthonyaag
Copy link
Author

anthonyaag commented Sep 20, 2022

With a few extension of a Series & Indexes. The test case that I found was breaking was basically testing almost exactly the above behavior.

        cs2 = CustomSeries(cs1)
        assert_series_equal(cs2, cs1)
        # this also changes cs1's index 
        cs2.index = CustomIndex(n=cs2.size)
        assert_series_equal(cs2, cs1)

So when the index stoped changing cs1's index then the test failed.

Also if this new behavior is better then that's fine too -- just wanted to raise it as it was a change. 😀

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Sep 26, 2022

Interesting, that's a bit different from the code in the top post, though. Replicating this example with pure pandas (without subclass), I still see this behaviour of changing the second Series' index also updating the original one (using the latest main branch):

In [21]: s = pd.Series([1, 2, 3])

In [22]: s2 = pd.Series(s)

In [23]: s2.index = pd.Index(['a', 'b', 'c'])

In [24]: s.index
Out[24]: Index(['a', 'b', 'c'], dtype='object')

In [25]: pd.testing.assert_series_equal(s, s2)

But so the case in the top post is when passing an index to the constructor. And so also that case breaks here:

In [3]: s3 = pd.Series(s, index=s.index)

In [4]: s3.index = pd.Index([10, 11, 12])

In [5]: s.index
Out[5]: RangeIndex(start=0, stop=3, step=1)

In [6]: pd.testing.assert_series_equal(s, s3)
...
AssertionError: Series.index are different

In [7]: s3._mgr is s._mgr
Out[7]: False

Personally, seeing those examples, I very much prefer the behaviour where mutating one Series index propery doesn't change the other. But as mentioned above, it is clearly a breaking change, so we could also revert this for now, and only change it later (eg in 2.0, or 3.0).

(note: updated this after posting, I made an error while testing this, the last case did change behaviour, I initially wrote it didn't)

@datapythonista datapythonista modified the milestones: 1.5.1, 1.5.2 Oct 20, 2022
@datapythonista datapythonista modified the milestones: 1.5.2, 1.5.3 Nov 15, 2022
@datapythonista datapythonista modified the milestones: 1.5.3, 1.5.4 Jan 18, 2023
@datapythonista datapythonista modified the milestones: 1.5.4, 2.0 Feb 27, 2023
@MarcoGorelli MarcoGorelli modified the milestones: 2.0, 2.1 Mar 27, 2023
@MarcoGorelli
Copy link
Member

moving off the 2.0 milestone as it's a regression from 1.5

@mroeschke mroeschke removed this from the 2.1 milestone Aug 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants