modin.Series is not a subtype of pd.Series #1459

isilopekel · 2020-05-18T11:03:05Z

System information
OS Platform and Distribution: Linux Ubuntu 16.04
Modin version : 0.7.3
Python version: 3.6.7

Code we can use to reproduce:
While trying to work around issue 1457, I ran into another problem.

# Create a dummy.csv file with the following content
a,b,c
1,2,3
4,5,6


if __name__ == "__main__":
    import numpy as np
    import pandas as pd

    df = pd.read_csv("dummy.csv", sep=",")
    df_count = df.groupby(list(df.columns)).size().reset_index(name="cnt")
    df_count.columns = df.columns.append(pd.Index(["cnt"]))  # not needed for pandas

    df_agg = df_count.groupby(["a", "b"]).agg(
        Max=('cnt', np.max),
        Sum=('cnt', np.sum),
        Num=("c", pd.Series.nunique)
    ).reset_index()

    print(df_agg)

    import modin.pandas as mpd
    df = mpd.read_csv("dummy.csv", sep=",")
    df_count = df.groupby(list(df.columns)).size().reset_index(name="cnt")
    df_count.columns = df.columns.append(mpd.Index(["cnt"]))  # FIX FOR MODIN (see Issue 1457)

    # returns sum and max value of Count where target column is unique
    df_agg = df_count.groupby(["a", "b"]).agg(
        Max=('cnt', np.max),
        Sum=('cnt', np.sum),
        Num=("c", mpd.Series.nunique)
    ).reset_index()
   
    print(df_agg)

Describe the problem

The Modin part throws an error:

.../venv/lib/python3.6/site-packages/modin/pandas/series.py", line 1291, in nunique
return super(Series, self).nunique(dropna=dropna)
TypeError: super(type, obj): obj must be an instance or subtype of type

Error disappears when
Num=("c", mpd.Series.nunique) is replaced by
Num=("c", pd.Series.nunique)

devin-petersohn · 2020-05-18T15:24:37Z

Hi @isilopekel thanks for raising the issue!

This was an intentional design decision, as libraries that check explicitly for pandas.Series are often using non-public methods and fields (_data for example). Modin does not try to implement these non-public interfaces.

With that said, this bug is an issue with passing the Series.nunique method object to agg, which should be supported.

isilopekel · 2020-05-18T15:37:07Z

Hi Devin,

Thanks for the reply. I am not quite sure if I understand you correctly. I am passing modin.Series to the groupby function on a modin.DataFrame. Since it is not a good practice to mix both libraries, I assume that accepting modin.Series should be the default behavior for the aggregation function. Additionally, pandas.Series can be supported transparently by converting it internally to modin.Series or could be alternatively left out because it is a bad practice. Could you please share your thoughts as well?

Best,
Isil

devin-petersohn · 2020-05-18T15:42:03Z

The fix here will be an internal conversion to our internal representation. Internally, Modin does not have a concept of a Series, it is just a 1-column DataFrame. The problem is that we are passing the modin.pandas.Series.nunique to the internal representation, which in this case is the pandas kernels (partitions of smaller pandas objects), thus the error.
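The mechanism described above can be reproduced without Modin or pandas. The following is a minimal sketch using stand-in classes (all class and function names here are hypothetical, not part of either library): an unbound method that calls `super(Class, self)` fails as soon as `self` is an object of an unrelated class, which is what happens when `modin.pandas.Series.nunique` is handed a plain pandas partition.

```python
class PandasLikeSeries:
    """Stand-in for a pandas Series (hypothetical, for illustration only)."""

    def __init__(self, values):
        self.values = list(values)

    def nunique(self):
        return len(set(self.values))


class ModinLikeBase:
    """Stand-in for the base class a Modin-style Series inherits from."""

    def __init__(self, values):
        self.values = list(values)

    def nunique(self):
        return len(set(self.values))


class ModinLikeSeries(ModinLikeBase):
    def nunique(self):
        # Explicit two-argument super(), as in the traceback in this issue.
        # It requires `self` to be an instance of ModinLikeSeries.
        return super(ModinLikeSeries, self).nunique()


def call_unbound(func, obj):
    """Call an unbound method on obj and return (result, error_message)."""
    try:
        return func(obj), None
    except TypeError as exc:
        return None, str(exc)


# Same class: the unbound method works.
ok_result, ok_error = call_unbound(
    ModinLikeSeries.nunique, ModinLikeSeries([1, 1, 2])
)

# Unrelated class with the same interface: super() rejects the object.
bad_result, bad_error = call_unbound(
    ModinLikeSeries.nunique, PandasLikeSeries([1, 1, 2])
)

print(ok_result, ok_error)
print(bad_result, bad_error)
```

The exact wording of the `TypeError` differs between Python versions, but it always begins with `super(type, obj)`.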

adriangay · 2024-09-27T17:18:57Z

I have tried converting some Pandas code to Modin and this still seems to be an issue. I literally just installed Modin:

pip install modin
Collecting modin
  Downloading modin-0.32.0-py3-none-any.whl (1.1 MB)
     |████████████████████████████████| 1.1 MB 7.6 MB/s 
Requirement already satisfied: numpy>=1.22.4 in /opt/conda/lib/python3.10/site-packages (from modin) (1.26.4)
Requirement already satisfied: pandas<2.3,>=2.2 in /opt/conda/lib/python3.10/site-packages (from modin) (2.2.2)
Requirement already satisfied: fsspec>=2022.11.0 in /opt/conda/lib/python3.10/site-packages (from modin) (2024.6.1)
Requirement already satisfied: packaging>=21.0 in /opt/conda/lib/python3.10/site-packages (from modin) (24.1)
Requirement already satisfied: psutil>=5.8.0 in /opt/conda/lib/python3.10/site-packages (from modin) (5.9.3)
Requirement already satisfied: tzdata>=2022.7 in /opt/conda/lib/python3.10/site-packages (from pandas<2.3,>=2.2->modin) (2024.1)
Requirement already satisfied: python-dateutil>=2.8.2 in /opt/conda/lib/python3.10/site-packages (from pandas<2.3,>=2.2->modin) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /opt/conda/lib/python3.10/site-packages (from pandas<2.3,>=2.2->modin) (2024.1)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.10/site-packages (from python-dateutil>=2.8.2->pandas<2.3,>=2.2->modin) (1.16.0)
Installing collected packages: modin
Successfully installed modin-0.32.0

and replaced Pandas with Modin:

# import pandas as pd
import modin.pandas as pd

and ran the code again and I get this:

Cell In[2], line 18, in CONTENT_EMBEDDINGS.__init__(self, kg, min_label_count, laplace_smooth_add, k, stretch, n_iter)
     14 self._stretch = stretch
     15 self._min_label_count = min_label_count
     17 self._word_item_tensor = self._get_word_item_tensor(
---> 18     *self.process_metadata(kg, min_label_count)
     19 )
     21 self._k = min(k, len(self._word2ind) - 1)

Cell In[2], line 196, in CONTENT_EMBEDDINGS.process_metadata(self, kg, min_label_count)
    195 v = kg[["tags"]]
--> 196 kg = kg[v.replace(v.apply(pd.Series.value_counts)).gt(min_label_count).all(1)]
    199 self._word2ind = dict(
    200     [(v, i) for i, v in enumerate(kg["tags"].unique().tolist())]
    201 )

File /opt/conda/lib/python3.10/site-packages/pandas/core/frame.py:10374, in DataFrame.apply(self, func, axis, raw, result_type, args, by_row, engine, engine_kwargs, **kwargs)
  10360 from pandas.core.apply import frame_apply
  10362 op = frame_apply(
  10363     self,
  10364     func=func,
   (...)
  10372     kwargs=kwargs,
  10373 )
> 10374 return op.apply().__finalize__(self, method="apply")

File /opt/conda/lib/python3.10/site-packages/pandas/core/apply.py:916, in FrameApply.apply(self)
    913 elif self.raw:
    914     return self.apply_raw(engine=self.engine, engine_kwargs=self.engine_kwargs)
--> 916 return self.apply_standard()

File /opt/conda/lib/python3.10/site-packages/pandas/core/apply.py:1063, in FrameApply.apply_standard(self)
   1061 def apply_standard(self):
   1062     if self.engine == "python":
-> 1063         results, res_index = self.apply_series_generator()
   1064     else:
   1065         results, res_index = self.apply_series_numba()

File /opt/conda/lib/python3.10/site-packages/pandas/core/apply.py:1081, in FrameApply.apply_series_generator(self)
   1078 with option_context("mode.chained_assignment", None):
   1079     for i, v in enumerate(series_gen):
   1080         # ignore SettingWithCopy here in case the user mutates
-> 1081         results[i] = self.func(v, *self.args, **self.kwargs)
   1082         if isinstance(results[i], ABCSeries):
   1083             # If we have a view on v, we need to make a copy because
   1084             #  series_generator will swap out the underlying data
   1085             results[i] = results[i].copy(deep=False)

File /opt/conda/lib/python3.10/site-packages/modin/logging/logger_decorator.py:144, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    129 """
    130 Compute function with logging if Modin logging is enabled.
    131 
   (...)
    141 Any
    142 """
    143 if LogMode.get() == "disable":
--> 144     return obj(*args, **kwargs)
    146 logger = get_logger()
    147 logger.log(log_level, start_line)

File /opt/conda/lib/python3.10/site-packages/modin/pandas/series.py:2177, in Series.value_counts(self, normalize, sort, ascending, bins, dropna)
   2165 if bins is not None:
   2166     # Potentially we could implement `cut` function from pandas API, which
   2167     # bins values into intervals, and then we can just count them as regular values.
   2168     # TODO #1333: new_self = self.__constructor__(pd.cut(self, bins, include_lowest=True), dtype="interval")
   2169     return self._default_to_pandas(
   2170         pandas.Series.value_counts,
   2171         normalize=normalize,
   (...)
   2175         dropna=dropna,
   2176     )
-> 2177 counted_values = super(Series, self).value_counts(
   2178     subset=self,
   2179     normalize=normalize,
   2180     sort=sort,
   2181     ascending=ascending,
   2182     dropna=dropna,
   2183 )
   2184 return counted_values

TypeError: super(type, obj): obj must be an instance or subtype of type

Have I misunderstood something fundamental? Thank you.
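The traceback suggests that a pandas `DataFrame.apply` received Modin's `Series.value_counts` and called it on plain pandas columns. One possible way around this kind of mismatch is to pass a callable that looks the method up on whatever object it receives, instead of an unbound method tied to one class. The sketch below uses stand-in classes only (every name in it is hypothetical); it has not been run against Modin 0.32.0.

```python
from collections import Counter
from operator import methodcaller


class PlainSeries:
    """Stand-in for pandas.Series (hypothetical)."""

    def __init__(self, values):
        self.values = list(values)

    def value_counts(self):
        return dict(Counter(self.values))


class WrapperBase:
    def __init__(self, values):
        self.values = list(values)

    def value_counts(self):
        return dict(Counter(self.values))


class WrapperSeries(WrapperBase):
    """Stand-in for modin.pandas.Series (hypothetical)."""

    def value_counts(self):
        return super(WrapperSeries, self).value_counts()


def apply_to_columns(columns, func):
    """Stand-in for DataFrame.apply: call func on every column."""
    return {name: func(col) for name, col in columns.items()}


columns = {"tags": PlainSeries(["x", "y", "x"])}

# Unbound method from the wrapper class: fails on plain columns.
try:
    apply_to_columns(columns, WrapperSeries.value_counts)
    unbound_failed = False
except TypeError:
    unbound_failed = True

# Callables that resolve the method on the object they are given.
by_name = apply_to_columns(columns, methodcaller("value_counts"))
by_lambda = apply_to_columns(columns, lambda s: s.value_counts())

print(unbound_failed, by_name, by_lambda)
```

In the code from the traceback, the corresponding change would be along the lines of `v.apply(lambda s: s.value_counts())` in place of `v.apply(pd.Series.value_counts)`; whether that resolves the error in that environment is an assumption.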

isilopekel added the bug 🦗 Something isn't working label May 18, 2020

devin-petersohn added this to the 0.7.4 milestone May 18, 2020

dchigarev self-assigned this Jul 27, 2020

dchigarev added a commit to dchigarev/modin that referenced this issue Jul 27, 2020

FIX-modin-project#1459: 'to_pandas' of nested objects added

d8406aa

Signed-off-by: Dmitry Chigarev <[email protected]>

dchigarev mentioned this issue Jul 27, 2020

FIX-#1459: 'to_pandas' of nested objects added #1828

Merged

6 tasks

devin-petersohn closed this as completed in #1828 Jul 28, 2020

devin-petersohn pushed a commit that referenced this issue Jul 28, 2020

FIX-#1459: 'to_pandas' of nested objects added (#1828)

a95ddf9

Signed-off-by: Dmitry Chigarev <[email protected]>

aregm pushed a commit to aregm/modin that referenced this issue Sep 16, 2020

FIX-modin-project#1459: 'to_pandas' of nested objects added (modin-pr…

4418f0f

…oject#1828) Signed-off-by: Dmitry Chigarev <[email protected]>
