Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Row-Wise Mode Functionality (axis=1) and Improve Metadata Handling in _collection.py #1137

Open
wants to merge 8 commits into
base: main
Choose a base branch
from
47 changes: 40 additions & 7 deletions dask_expr/_collection.py
Original file line number Diff line number Diff line change
Expand Up @@ -3650,13 +3650,46 @@ def query(self, expr, **kwargs):
return new_collection(Query(self, expr, kwargs))

@derived_from(pd.DataFrame)
def mode(self, dropna=True, split_every=False, numeric_only=False):
modes = []
for _, col in self.items():
if numeric_only and not pd.api.types.is_numeric_dtype(col.dtype):
continue
modes.append(col.mode(dropna=dropna, split_every=split_every))
return concat(modes, axis=1)
# GH#11389 - Dask issue related to adding row-wise mode functionality
# GH#1136 - Dask-Expr specific implementation for row-wise mode functionality
# Contributor: @thyripian
def mode(self, axis=0, numeric_only=False, dropna=True, split_every=False):
if axis == 0:
# Existing logic for axis=0 (column-wise mode)
modes = []
for _, col in self.items():
if numeric_only and not pd.api.types.is_numeric_dtype(col.dtype):
continue
modes.append(col.mode(dropna=dropna, split_every=split_every))
return concat(modes, axis=1)
elif axis == 1:
# Implement axis=1 (row-wise mode)
num_columns = len(self.columns) # Maximum possible number of modes per row

def row_wise_mode(df):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't have to do this, we generally expect that the columns in every partition are matching

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify which part isn't necessary? Are you referring to determining the number of columns for row-wise mode?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything that is in def row_wise_mode basically

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. I'll make the necessary revisions this afternoon and push them to this PR once tested.

result = df.mode(axis=1, numeric_only=numeric_only, dropna=dropna)
# Ensure consistent number of columns across all partitions
if result.shape[1] < num_columns:
# Pad with NaN columns
for i in range(result.shape[1], num_columns):
result[i] = np.nan
elif result.shape[1] > num_columns:
# Trim extra columns
result = result.iloc[:, :num_columns]
# Reindex columns to ensure consistent order
result = result.reindex(columns=range(num_columns))
# Set column data types to float64 to accommodate NaN values
result = result.astype("float64")
return result

# Create metadata with the correct number of columns and float64 dtype
meta = pd.DataFrame(
{i: pd.Series(dtype="float64") for i in range(num_columns)}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you forcing float in all cases?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for highlighting that forcing columns to float64 can lead to unintended type conversions. I initially used float64 to handle NaN values but agree it's better to preserve original data types. I'll update the implementation to dynamically determine data types based on the input, using pandas' nullable types like Int64 for integers and string for text data. This will handle missing values without unnecessary type changes, ensuring consistent data types across partitions while maintaining data integrity. Please let me know if you have further suggestions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't do any explicit type casting, just use the nonempty meta of the input and call mode on it

)

return self.map_partitions(row_wise_mode, meta=meta)
else:
raise ValueError(f"No axis named {axis} for object type {type(self)}")

@derived_from(pd.DataFrame)
def add_prefix(self, prefix):
Expand Down
Loading