modin.Series is not a subtype of pd.Series #1459

Closed
isilopekel opened this issue May 18, 2020 · 4 comments · Fixed by #1828

@isilopekel

  • System information
  • OS Platform and Distribution: Linux Ubuntu 16.04
  • Modin version: 0.7.3
  • Python version: 3.6.7

Code we can use to reproduce:
While trying to work around issue 1457, I ran into another problem.

# Create a dummy.csv file with the following content
a,b,c
1,2,3
4,5,6


if __name__ == "__main__":
    import numpy as np
    import pandas as pd

    df = pd.read_csv("dummy.csv", sep=",")
    df_count = df.groupby(list(df.columns)).size().reset_index(name="cnt")
    df_count.columns = df.columns.append(pd.Index(["cnt"]))  # not needed for pandas

    df_agg = df_count.groupby(["a", "b"]).agg(
        Max=('cnt', np.max),
        Sum=('cnt', np.sum),
        Num=("c", pd.Series.nunique)
    ).reset_index()

    print(df_agg)

    import modin.pandas as mpd
    df = mpd.read_csv("dummy.csv", sep=",")
    df_count = df.groupby(list(df.columns)).size().reset_index(name="cnt")
    df_count.columns = df.columns.append(mpd.Index(["cnt"]))  # FIX FOR MODIN (see Issue 1457)

    # returns sum and max value of Count where target column is unique
    df_agg = df_count.groupby(["a", "b"]).agg(
        Max=('cnt', np.max),
        Sum=('cnt', np.sum),
        Num=("c", mpd.Series.nunique)
    ).reset_index()
   
    print(df_agg)


Describe the problem

The Modin part throws an error:

.../venv/lib/python3.6/site-packages/modin/pandas/series.py", line 1291, in nunique
return super(Series, self).nunique(dropna=dropna)
TypeError: super(type, obj): obj must be an instance or subtype of type

The error disappears when
Num=("c", mpd.Series.nunique) is replaced by
Num=("c", pd.Series.nunique)
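The TypeError itself can be reproduced without pandas or Modin. Below is a minimal sketch, assuming two unrelated classes: KernelSeries and FrontendSeries are hypothetical stand-ins for pandas.Series and modin.pandas.Series, not the real classes. It shows that an unbound method which calls super(FrontendSeries, self) fails as soon as it is handed an object of an unrelated type, while a lambda that calls the method on the object dispatches on the object's own type.

```python
class KernelSeries:
    """Hypothetical stand-in for pandas.Series."""

    def __init__(self, values):
        self.values = list(values)

    def nunique(self):
        return len(set(self.values))


class FrontendBase:
    """Hypothetical shared base of the frontend classes."""

    def __init__(self, values):
        self.values = list(values)

    def nunique(self):
        return len(set(self.values))


class FrontendSeries(FrontendBase):
    """Hypothetical stand-in for modin.pandas.Series.

    It deliberately does NOT inherit from KernelSeries.
    """

    def nunique(self):
        # super(FrontendSeries, self) requires self to be a FrontendSeries.
        return super(FrontendSeries, self).nunique()


kernel_obj = KernelSeries([3, 3, 6])

# Calling the unbound frontend method on a kernel object fails inside super().
try:
    FrontendSeries.nunique(kernel_obj)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# A lambda dispatches on the runtime type of its argument, so it works
# for either class.
via_lambda = (lambda s: s.nunique())(kernel_obj)

print(error_message)
print(via_lambda)
```

This mirrors the report: the frontend method ends up being called on an object of the other library's type, which super() rejects.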

@isilopekel isilopekel added the bug 🦗 Something isn't working label May 18, 2020
@devin-petersohn devin-petersohn added this to the 0.7.4 milestone May 18, 2020
@devin-petersohn
Collaborator

Hi @isilopekel, thanks for raising the issue!

This was an intentional design decision, as libraries that check explicitly for pandas.Series are often using non-public methods and fields (_data for example). Modin does not try to implement these non-public interfaces.

With that said, this bug is an issue with passing the mpd.Series.nunique object to agg, which should be supported.

@isilopekel
Author

Hi Devin,

Thanks for the reply. I am not quite sure if I understand you correctly. I am passing modin.Series to the groupby function on a modin.DataFrame. Since it is not a good practice to mix both libraries, I assume that accepting modin.Series should be the default behavior for the aggregation function. Additionally, pandas.Series can be supported transparently by converting it internally to modin.Series or could be alternatively left out because it is a bad practice. Could you please share your thoughts as well?

Best,
Isil

@devin-petersohn
Collaborator

The fix here will be an internal conversion to our internal representation. Internally, Modin does not have a concept of a Series; it is just a 1-column DataFrame. The problem is that we are passing modin.pandas.Series.nunique to the internal representation, which in this case is the pandas kernels (partitions of smaller pandas objects), hence the error.
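One way to picture such a conversion is a lookup by method name before the callable reaches the partitions. The sketch below is an illustration only, not Modin's actual fix: to_kernel_callable, KernelSeries, and FrontendSeries are hypothetical names, with the two classes standing in for pandas.Series and modin.pandas.Series.

```python
class KernelSeries:
    """Hypothetical stand-in for the pandas objects held in partitions."""

    def __init__(self, values):
        self.values = list(values)

    def nunique(self):
        return len(set(self.values))


class FrontendSeries:
    """Hypothetical stand-in for the user-facing Series class."""

    def __init__(self, values):
        self.values = list(values)

    def nunique(self):
        # Only meaningful on FrontendSeries instances.
        if not isinstance(self, FrontendSeries):
            raise TypeError("expected a FrontendSeries")
        return len(set(self.values))


def to_kernel_callable(func, frontend_cls, kernel_cls):
    """Swap a frontend method for the kernel method of the same name.

    Any callable that is not an attribute of frontend_cls is returned
    unchanged.
    """
    name = getattr(func, "__name__", None)
    if name is not None and getattr(frontend_cls, name, None) is func:
        return getattr(kernel_cls, name)
    return func


# The user passes the frontend method; the partitions hold kernel objects.
partitions = [KernelSeries([1, 1, 2]), KernelSeries([5, 6, 7])]
translated = to_kernel_callable(FrontendSeries.nunique, FrontendSeries, KernelSeries)
results = [translated(part) for part in partitions]
print(results)
```

Matching by identity rather than by name alone keeps unrelated user functions that happen to share a name with a Series method untouched.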

@dchigarev dchigarev self-assigned this Jul 27, 2020
dchigarev added a commit to dchigarev/modin that referenced this issue Jul 27, 2020
devin-petersohn pushed a commit that referenced this issue Jul 28, 2020
aregm pushed a commit to aregm/modin that referenced this issue Sep 16, 2020
@adriangay

adriangay commented Sep 27, 2024

I have tried converting some Pandas code to Modin and this still seems to be an issue. I literally just installed Modin:

pip install modin
Collecting modin
  Downloading modin-0.32.0-py3-none-any.whl (1.1 MB)
     |████████████████████████████████| 1.1 MB 7.6 MB/s 
Requirement already satisfied: numpy>=1.22.4 in /opt/conda/lib/python3.10/site-packages (from modin) (1.26.4)
Requirement already satisfied: pandas<2.3,>=2.2 in /opt/conda/lib/python3.10/site-packages (from modin) (2.2.2)
Requirement already satisfied: fsspec>=2022.11.0 in /opt/conda/lib/python3.10/site-packages (from modin) (2024.6.1)
Requirement already satisfied: packaging>=21.0 in /opt/conda/lib/python3.10/site-packages (from modin) (24.1)
Requirement already satisfied: psutil>=5.8.0 in /opt/conda/lib/python3.10/site-packages (from modin) (5.9.3)
Requirement already satisfied: tzdata>=2022.7 in /opt/conda/lib/python3.10/site-packages (from pandas<2.3,>=2.2->modin) (2024.1)
Requirement already satisfied: python-dateutil>=2.8.2 in /opt/conda/lib/python3.10/site-packages (from pandas<2.3,>=2.2->modin) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /opt/conda/lib/python3.10/site-packages (from pandas<2.3,>=2.2->modin) (2024.1)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.10/site-packages (from python-dateutil>=2.8.2->pandas<2.3,>=2.2->modin) (1.16.0)
Installing collected packages: modin
Successfully installed modin-0.32.0

and replaced Pandas with Modin:

# import pandas as pd
import modin.pandas as pd

and ran the code again and I get this:

Cell In[2], line 18, in CONTENT_EMBEDDINGS.__init__(self, kg, min_label_count, laplace_smooth_add, k, stretch, n_iter)
     14 self._stretch = stretch
     15 self._min_label_count = min_label_count
     17 self._word_item_tensor = self._get_word_item_tensor(
---> 18     *self.process_metadata(kg, min_label_count)
     19 )
     21 self._k = min(k, len(self._word2ind) - 1)

Cell In[2], line 196, in CONTENT_EMBEDDINGS.process_metadata(self, kg, min_label_count)
    195 v = kg[["tags"]]
--> 196 kg = kg[v.replace(v.apply(pd.Series.value_counts)).gt(min_label_count).all(1)]
    199 self._word2ind = dict(
    200     [(v, i) for i, v in enumerate(kg["tags"].unique().tolist())]
    201 )

File /opt/conda/lib/python3.10/site-packages/pandas/core/frame.py:10374, in DataFrame.apply(self, func, axis, raw, result_type, args, by_row, engine, engine_kwargs, **kwargs)
  10360 from pandas.core.apply import frame_apply
  10362 op = frame_apply(
  10363     self,
  10364     func=func,
   (...)
  10372     kwargs=kwargs,
  10373 )
> 10374 return op.apply().__finalize__(self, method="apply")

File /opt/conda/lib/python3.10/site-packages/pandas/core/apply.py:916, in FrameApply.apply(self)
    913 elif self.raw:
    914     return self.apply_raw(engine=self.engine, engine_kwargs=self.engine_kwargs)
--> 916 return self.apply_standard()

File /opt/conda/lib/python3.10/site-packages/pandas/core/apply.py:1063, in FrameApply.apply_standard(self)
   1061 def apply_standard(self):
   1062     if self.engine == "python":
-> 1063         results, res_index = self.apply_series_generator()
   1064     else:
   1065         results, res_index = self.apply_series_numba()

File /opt/conda/lib/python3.10/site-packages/pandas/core/apply.py:1081, in FrameApply.apply_series_generator(self)
   1078 with option_context("mode.chained_assignment", None):
   1079     for i, v in enumerate(series_gen):
   1080         # ignore SettingWithCopy here in case the user mutates
-> 1081         results[i] = self.func(v, *self.args, **self.kwargs)
   1082         if isinstance(results[i], ABCSeries):
   1083             # If we have a view on v, we need to make a copy because
   1084             #  series_generator will swap out the underlying data
   1085             results[i] = results[i].copy(deep=False)

File /opt/conda/lib/python3.10/site-packages/modin/logging/logger_decorator.py:144, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    129 """
    130 Compute function with logging if Modin logging is enabled.
    131 
   (...)
    141 Any
    142 """
    143 if LogMode.get() == "disable":
--> 144     return obj(*args, **kwargs)
    146 logger = get_logger()
    147 logger.log(log_level, start_line)

File /opt/conda/lib/python3.10/site-packages/modin/pandas/series.py:2177, in Series.value_counts(self, normalize, sort, ascending, bins, dropna)
   2165 if bins is not None:
   2166     # Potentially we could implement `cut` function from pandas API, which
   2167     # bins values into intervals, and then we can just count them as regular values.
   2168     # TODO #1333: new_self = self.__constructor__(pd.cut(self, bins, include_lowest=True), dtype="interval")
   2169     return self._default_to_pandas(
   2170         pandas.Series.value_counts,
   2171         normalize=normalize,
   (...)
   2175         dropna=dropna,
   2176     )
-> 2177 counted_values = super(Series, self).value_counts(
   2178     subset=self,
   2179     normalize=normalize,
   2180     sort=sort,
   2181     ascending=ascending,
   2182     dropna=dropna,
   2183 )
   2184 return counted_values

TypeError: super(type, obj): obj must be an instance or subtype of type

Have I misunderstood something fundamental? Thank you.
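The traceback above passes through pandas/core/frame.py, which suggests the frame being applied is a plain pandas object while pd.Series.value_counts now refers to Modin's method. A possible workaround, sketched here with plain pandas and made-up data, is to avoid unbound Series methods altogether: use a string name in agg and a lambda in apply, so the call dispatches on whatever object actually arrives. Whether this also avoids the error under Modin 0.32.0 is an assumption that has not been verified here.

```python
import pandas as pd  # assumption: the same calls are intended to work with modin.pandas

# Made-up data shaped like df_count from the original report.
df_count = pd.DataFrame(
    {"a": [1, 1, 4], "b": [2, 2, 5], "c": [3, 7, 6], "cnt": [1, 2, 1]}
)

# String names let the library pick its own implementation of each reduction.
df_agg = df_count.groupby(["a", "b"]).agg(
    Max=("cnt", "max"),
    Sum=("cnt", "sum"),
    Num=("c", "nunique"),
).reset_index()

# A lambda calls value_counts on the object it receives, whatever its type.
tags = pd.DataFrame({"tags": ["x", "y", "x"]})
counts = tags.apply(lambda s: s.value_counts())

print(df_agg)
print(counts)
```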
