reset_index followed by groupby causes exception in some cases #4522
Signed-off-by: Devin Petersohn <[email protected]>
Quick addition: the repro script has to group by more than one column to trigger the bug:
df = pd.read_csv("some.csv", index_col=[0,1,2]).reset_index()
df.groupby(df.columns[:2]).count() # error
I did some digging, and I believe that the error is caused in this specific case because the results of the
If I create a dataframe where the index and a column label have the same name and try a groupby, that errors out:
In [11]: df = pd.DataFrame([[1, 2, 3]], index=pd.Index([0], name="so"), columns=['so', 'b', 'c'])
In [12]: df.groupby(df.columns[:2]).count()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [12], in <cell line: 1>()
----> 1 df.groupby(df.columns[:2]).count()
File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/frame.py:7712, in DataFrame.groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed, dropna)
7707 axis = self._get_axis_number(axis)
7709 # https://github.com/python/mypy/issues/7642
7710 # error: Argument "squeeze" to "DataFrameGroupBy" has incompatible type
7711 # "Union[bool, NoDefault]"; expected "bool"
-> 7712 return DataFrameGroupBy(
7713 obj=self,
7714 keys=by,
7715 axis=axis,
7716 level=level,
7717 as_index=as_index,
7718 sort=sort,
7719 group_keys=group_keys,
7720 squeeze=squeeze, # type: ignore[arg-type]
7721 observed=observed,
7722 dropna=dropna,
7723 )
File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:882, in GroupBy.__init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, mutated, dropna)
879 if grouper is None:
880 from pandas.core.groupby.grouper import get_grouper
--> 882 grouper, exclusions, obj = get_grouper(
883 obj,
884 keys,
885 axis=axis,
886 level=level,
887 sort=sort,
888 observed=observed,
889 mutated=self.mutated,
890 dropna=self.dropna,
891 )
893 self.obj = obj
894 self.axis = obj._get_axis_number(axis)
File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/grouper.py:893, in get_grouper(obj, key, axis, level, sort, observed, mutated, validate, dropna)
888 in_axis = False
890 # create the Grouping
891 # allow us to passing the actual Grouping as the gpr
892 ping = (
--> 893 Grouping(
894 group_axis,
895 gpr,
896 obj=obj,
897 level=level,
898 sort=sort,
899 observed=observed,
900 in_axis=in_axis,
901 dropna=dropna,
902 )
903 if not isinstance(gpr, Grouping)
904 else gpr
905 )
907 groupings.append(ping)
909 if len(groupings) == 0 and len(obj):
File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/grouper.py:481, in Grouping.__init__(self, index, grouper, obj, level, sort, observed, in_axis, dropna)
479 self.level = level
480 self._orig_grouper = grouper
--> 481 self.grouping_vector = _convert_grouper(index, grouper)
482 self._all_grouper = None
483 self._index = index
File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/grouper.py:937, in _convert_grouper(axis, grouper)
935 elif isinstance(grouper, (list, tuple, Index, Categorical, np.ndarray)):
936 if len(grouper) != len(axis):
--> 937 raise ValueError("Grouper and axis must be same length")
939 if isinstance(grouper, (list, tuple)):
940 grouper = com.asarray_tuplesafe(grouper)
ValueError: Grouper and axis must be same length
It also errors in Modin, but for a different reason:
ray::_apply_list_of_funcs() (pid=56846, ip=127.0.0.1)
File "/Users/rehandurrani/Documents/modin/modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition.py", line 417, in _apply_list_of_funcs
partition = func(partition.copy(), *args, **kwargs)
File "/Users/rehandurrani/Documents/modin/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 360, in map_func
return apply_func(df, **{other_name: other})
File "/Users/rehandurrani/Documents/modin/modin/core/dataframe/algebra/groupby.py", line 449, in _map
result = wrapper(df.copy(), other if other is None else other.copy())
File "/Users/rehandurrani/Documents/modin/modin/core/dataframe/algebra/groupby.py", line 432, in wrapper
return cls.map(
File "/Users/rehandurrani/Documents/modin/modin/core/dataframe/algebra/groupby.py", line 141, in map
df.groupby(by=by_part, axis=axis, **groupby_kwargs), *agg_args, **agg_kwargs
File "/Users/rehandurrani/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/frame.py", line 7712, in groupby
return DataFrameGroupBy(
File "/Users/rehandurrani/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 882, in __init__
grouper, exclusions, obj = get_grouper(
File "/Users/rehandurrani/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/grouper.py", line 872, in get_grouper
obj._check_label_or_level_ambiguity(gpr, axis=axis)
File "/Users/rehandurrani/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/generic.py", line 1794, in _check_label_or_level_ambiguity
raise ValueError(msg)
ValueError: 'so' is both an index level and a column label, which is ambiguous.
If I change it to be
both succeed.
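For reference, here is a minimal sketch of the ambiguity error in plain pandas, using the same frame as above. The workaround shown (clearing the index name on a copy) is an assumption added for illustration; it is not necessarily the change referred to in the comment.

```python
import pandas as pd

# Same frame as in the comment: the index is named "so" and so is the first column.
df = pd.DataFrame(
    [[1, 2, 3]],
    index=pd.Index([0], name="so"),
    columns=["so", "b", "c"],
)

# Grouping by label: pandas cannot tell whether "so" means the index level
# or the column, so it raises a ValueError.
try:
    df.groupby(["so", "b"]).count()
    ambiguity_message = None
except ValueError as err:
    ambiguity_message = str(err)
print(ambiguity_message)

# Hypothetical workaround (an assumption, not taken from the issue):
# clear the index name so that "so" can only refer to the column.
renamed = df.copy()
renamed.index.name = None
counts = renamed.groupby(["so", "b"]).count()
print(counts)
```

Once the index name no longer collides with a column label, the label-based groupby resolves "so" to the column and the count succeeds.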
If I try the repro script in pandas:
In [23]: import pandas as pd
In [24]: df = pd.read_csv("b.csv", index_col=[0,1,2]).reset_index()
...: df.groupby(df.columns[:2]).count() # error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [24], in <cell line: 2>()
1 df = pd.read_csv("b.csv", index_col=[0,1,2]).reset_index()
----> 2 df.groupby(df.columns[:2]).count()
File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/frame.py:7712, in DataFrame.groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed, dropna)
7707 axis = self._get_axis_number(axis)
7709 # https://github.com/python/mypy/issues/7642
7710 # error: Argument "squeeze" to "DataFrameGroupBy" has incompatible type
7711 # "Union[bool, NoDefault]"; expected "bool"
-> 7712 return DataFrameGroupBy(
7713 obj=self,
7714 keys=by,
7715 axis=axis,
7716 level=level,
7717 as_index=as_index,
7718 sort=sort,
7719 group_keys=group_keys,
7720 squeeze=squeeze, # type: ignore[arg-type]
7721 observed=observed,
7722 dropna=dropna,
7723 )
File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:882, in GroupBy.__init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, mutated, dropna)
879 if grouper is None:
880 from pandas.core.groupby.grouper import get_grouper
--> 882 grouper, exclusions, obj = get_grouper(
883 obj,
884 keys,
885 axis=axis,
886 level=level,
887 sort=sort,
888 observed=observed,
889 mutated=self.mutated,
890 dropna=self.dropna,
891 )
893 self.obj = obj
894 self.axis = obj._get_axis_number(axis)
File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/grouper.py:893, in get_grouper(obj, key, axis, level, sort, observed, mutated, validate, dropna)
888 in_axis = False
890 # create the Grouping
891 # allow us to passing the actual Grouping as the gpr
892 ping = (
--> 893 Grouping(
894 group_axis,
895 gpr,
896 obj=obj,
897 level=level,
898 sort=sort,
899 observed=observed,
900 in_axis=in_axis,
901 dropna=dropna,
902 )
903 if not isinstance(gpr, Grouping)
904 else gpr
905 )
907 groupings.append(ping)
909 if len(groupings) == 0 and len(obj):
File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/grouper.py:481, in Grouping.__init__(self, index, grouper, obj, level, sort, observed, in_axis, dropna)
479 self.level = level
480 self._orig_grouper = grouper
--> 481 self.grouping_vector = _convert_grouper(index, grouper)
482 self._all_grouper = None
483 self._index = index
File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/grouper.py:937, in _convert_grouper(axis, grouper)
935 elif isinstance(grouper, (list, tuple, Index, Categorical, np.ndarray)):
936 if len(grouper) != len(axis):
--> 937 raise ValueError("Grouper and axis must be same length")
939 if isinstance(grouper, (list, tuple)):
940 grouper = com.asarray_tuplesafe(grouper)
ValueError: Grouper and axis must be same length
it fails.
Never mind - converting the index to a list works in pandas.
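A small sketch of that difference in plain pandas follows. The CSV contents are a hypothetical stand-in, since the file from the report is not included in the issue. The likely reason for the difference is that pandas treats an Index passed as by as a single array of group keys (one per row), while a list is treated as a list of column labels.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV in the report (not included in the issue).
csv_text = "a,b,c,d\n1,1,1,10\n1,1,2,20\n2,1,1,30\n"
df = pd.read_csv(io.StringIO(csv_text), index_col=[0, 1, 2]).reset_index()

# Passing an Index: two column names are compared against three rows,
# which is expected to fail the grouper length check.
try:
    df.groupby(df.columns[:2]).count()
    index_error = None
except ValueError as err:
    index_error = str(err)
print(index_error)

# Passing a list of labels: each entry is looked up as a column name.
counts = df.groupby(list(df.columns[:2])).count()
print(counts)
```

This only covers the pandas side. The Modin failure reported above comes from a different code path and is not reproduced by this snippet.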
System information
modin.__version__: latest
Describe the problem
This only happens in a very narrow corner case: when the groupby by parameter contains 2 or more columns added by the reset_index call.