Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update to pandas 2.1.2 #108

Merged
merged 6 commits into from
Nov 13, 2023
Merged

Update to pandas 2.1.2 #108

merged 6 commits into from
Nov 13, 2023

Conversation

nanne-aben
Copy link
Owner

@nanne-aben nanne-aben commented Oct 31, 2023

Main objective of this PR is to update the pandas requirement to 2.1.2.

Unfortunately, the CI failed initially. It seems that some functions within pandas 2.1.2 use in-place modification. When called on a DataSet, it raises an error. For more details, see for example this run.

The solution I take here is to automatically convert a DataSet to DataFrame when calling a DataFrame function, such that the immutable option does not become a problem. Shouldn't be a problem, since all DataFrame functions called on a DataSet would convert the DataSet to a DataFrame anyway. Now we just do it upfront.

@nanne-aben nanne-aben requested a review from caneff November 10, 2023 16:18
@caneff
Copy link
Collaborator

caneff commented Nov 10, 2023

Main objective of this PR is to update the pandas requirement to 2.1.2.

Unfortunately, the CI failed initially. It seems that some functions within pandas 2.1.2 use in-place modification. When called on a DataSet, it raises an error. For more details, see for example this run.

Did you file an issue with this to Pandas? Seems bad that this should happen with just a patch version increase (or at all really pandas shouldn't be doing anything in place unless you explicitly ask any more). Note that I didn't actually look at the code for why it failed I am just commenting on your comment here.

The solution I take here is to automatically convert a DataSet to DataFrame when calling a DataFrame function, such that the immutable option does not become a problem. Shouldn't be a problem, since all DataFrame functions called on a DataSet would convert the DataSet to a DataFrame anyway. Now we just do it upfront.

@nanne-aben
Copy link
Owner Author

Did you file an issue with this to Pandas? Seems bad that this should happen with just a patch version increase (or at all really pandas shouldn't be doing anything in place unless you explicitly ask any more).

Thanks, good point! I did some debugging. The following raises an error (in pandas 2.1.2, prior to this PR):

import pandas as pd
from strictly_typed_pandas import DataSet

class B:
    id: int
    age: int

df1 = pd.DataFrame(
    dict(
        id=[1, 2, 3],
        name=['a', 'b', 'c'],
    )
)

df2 = DataSet[B](
    pd.DataFrame(
        dict(
            id=[1, 2, 3],
            age=[10, 20, 30],
        ),
    ).set_index('age')
)

df3 = df1.merge(df2, on='id')

Error:

File ~/.pyenv/versions/3.11.2/envs/strictly_typed_pandas/lib/python3.11/site-packages/pandas/core/frame.py:10487, in DataFrame.merge(self, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
  10468 @Substitution(\"\")
  10469 @Appender(_merge_doc, indents=2)
  10470 def merge(
   (...)
  10483     validate: MergeValidate | None = None,
  10484 ) -> DataFrame:
  10485     from pandas.core.reshape.merge import merge
> 10487     return merge(
  10488         self,
  10489         right,
  10490         how=how,
  10491         on=on,
  10492         left_on=left_on,
  10493         right_on=right_on,
  10494         left_index=left_index,
  10495         right_index=right_index,
  10496         sort=sort,
  10497         suffixes=suffixes,
  10498         copy=copy,
  10499         indicator=indicator,
  10500         validate=validate,
  10501     )

File ~/.pyenv/versions/3.11.2/envs/strictly_typed_pandas/lib/python3.11/site-packages/pandas/core/reshape/merge.py:169, in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
    154     return _cross_merge(
    155         left_df,
    156         right_df,
   (...)
    166         copy=copy,
    167     )
    168 else:
--> 169     op = _MergeOperation(
    170         left_df,
    171         right_df,
    172         how=how,
    173         on=on,
    174         left_on=left_on,
    175         right_on=right_on,
    176         left_index=left_index,
    177         right_index=right_index,
    178         sort=sort,
    179         suffixes=suffixes,
    180         indicator=indicator,
    181         validate=validate,
    182     )
    183     return op.get_result(copy=copy)

File ~/.pyenv/versions/3.11.2/envs/strictly_typed_pandas/lib/python3.11/site-packages/pandas/core/reshape/merge.py:797, in _MergeOperation.__init__(self, left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, indicator, validate)
    794     self.left = self.left._drop_labels_or_levels(left_drop)
    796 if right_drop:
--> 797     self.right = self.right._drop_labels_or_levels(right_drop)
    799 self._maybe_require_matching_dtypes(self.left_join_keys, self.right_join_keys)
    800 self._validate_tolerance(self.left_join_keys)

File ~/Documents/projects/2022/strictly_typed_pandas/strictly_typed_pandas/immutable.py:43, in inplace_argument_interceptor.<locals>.func(*args, **kwargs)
     40 if \"inplace\" in kwargs and kwargs[\"inplace\"]:
     41     raise NotImplementedError(immutable_error_msg)
---> 43 return call(*args, **kwargs)

File ~/.pyenv/versions/3.11.2/envs/strictly_typed_pandas/lib/python3.11/site-packages/pandas/core/generic.py:1922, in NDFrame._drop_labels_or_levels(self, keys, axis)
   1920     # Handle dropping columns labels
   1921     if labels_to_drop:
-> 1922         dropped.drop(labels_to_drop, axis=1, inplace=True)
   1923 else:
   1924     # Handle dropping column levels
   1925     if levels_to_drop:

File ~/Documents/projects/2022/strictly_typed_pandas/strictly_typed_pandas/immutable.py:41, in inplace_argument_interceptor.<locals>.func(*args, **kwargs)
     38     raise NotImplementedError(immutable_error_msg)
     40 if \"inplace\" in kwargs and kwargs[\"inplace\"]:
---> 41     raise NotImplementedError(immutable_error_msg)
     43 return call(*args, **kwargs)

NotImplementedError: To ensure that the DataSet adheres to its schema, you cannot perform inplace modifications. You can either use dataset.to_dataframe() to cast the DataSet to a DataFrame, or use operations that return a DataFrame, e.g. df = df.assign(...)."
}

Let's reproduce that without stp. And let's see if there are any side-effects (due to in-place modifications) in df2.

import pandas as pd

df1 = pd.DataFrame(
    dict(
        id=[1, 2, 3],
        name=['a', 'b', 'c'],
    )
)

df2 = pd.DataFrame(
    dict(
        id=[1, 2, 3],
        age=[10, 20, 30],
    ),
).set_index('age')

df3 = df1.merge(df2, on='id')

df2

Output:
afbeelding

Appears like there are no side-effects.

So I guess pandas does an inplace modification of a copy of the DataFrame. And since we subclass the DataFrame into a DataSet (where we forbid inplace modifications) this results in an error. I could report it with them, but I don't think they'd consider this part of their public interface.

Anyhow, it seems like we got a workaround. I'll merge this now!

@nanne-aben nanne-aben merged commit 1097617 into main Nov 13, 2023
6 checks passed
@nanne-aben nanne-aben deleted the fix-newest-pandas branch November 13, 2023 08:41
caneff pushed a commit to caneff/strictly_typed_pandas that referenced this pull request Nov 14, 2023
* Update to pandas 2.1.2

* Update to pandas 2.1.2

* import from typing_extensions

* remove args and kwargs type annotations, they do not work with typeguard for older versions of python

* convert dataset to dataframe when calling a dataframe function, such that the immutable option does not become a problem
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants