REFACTOR-#5424: Replace dtypes="copy" with copy_dtypes flag #5426
Conversation
Why is the extra parameter better than what it was before?
@@ -269,7 +276,7 @@ def reduce(
         The axis to perform the reduce over.
     function : callable(row|col) -> single value
         The reduce function to apply to each column.
-    dtypes : str, optional
+    dtypes : pandas.Series, optional
Is scalar type not accepted here?
Scalar types are not accepted here. The control flow here is `reduce` -> `_compute_tree_reduce_metadata` -> dataframe constructor, which directly sets `self.dtypes` on the new dataframe. In contrast, `map` is written to duplicate scalar datatypes into a series here.
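For illustration, a minimal sketch of the difference described above, assuming made-up helper names rather than Modin's exact internals: `map` can take a scalar dtype and duplicate it into a `pandas.Series` keyed by the result columns, while `reduce` hands `dtypes` straight to the constructor, so it must already be a Series.

```python
import numpy as np
import pandas

def broadcast_scalar_dtype(dtype, columns):
    # One entry per result column, all sharing the same scalar dtype;
    # this mirrors the behavior described for `map` above.
    return pandas.Series([np.dtype(dtype)] * len(columns), index=columns)

print(broadcast_scalar_dtype("float64", ["a", "b", "c"]))
# a    float64
# b    float64
# c    float64
# dtype: object
```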
@@ -2637,10 +2659,13 @@ def broadcast_apply_full_axis(
     enumerate_partitions : bool, default: False
         Whether pass partition index into applied `func` or not.
         Note that `func` must be able to obtain `partition_idx` kwarg.
-    dtypes : list-like, default: None
+    dtypes : list-like, optional
Is it possible to use the `list-like` type everywhere, or the `pandas.Series` one? To be more consistent.
I'll change it to `pandas.Series`, since there are some locations where the dtype eventually gets passed to the constructor, where it has to be a series.
Thanks for reviewing @anmyachev, I'll make the changes after coming back from holiday in January.
Why is the extra parameter better than what it was before?
The extra parameter makes it clearer when a dtype hint is being passed, as opposed to when it's being copied. I've previously had some code changes where I ran into errors because I neglected to handle the case where `dtypes="copy"`, and keeping control flow (whether to infer or copy dtype) separate from data (the actual resulting dtype) makes more sense to me.
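As a hedged before/after sketch of that distinction (signatures simplified; the real Modin methods take more arguments), the same intent expressed both ways:

```python
def reduce_before(axis, function, dtypes=None):
    # Old style: `dtypes` was overloaded, so every consumer had to remember
    # to special-case the sentinel string "copy".
    return "reuse self.dtypes" if dtypes == "copy" else dtypes

def reduce_after(axis, function, dtypes=None, copy_dtypes=False):
    # New style: `copy_dtypes` is pure control flow and `dtypes` is pure data.
    return "reuse self.dtypes" if copy_dtypes else dtypes

# Same request, expressed the old way and the new way.
print(reduce_before(0, sum, dtypes="copy"))
print(reduce_after(0, sum, copy_dtypes=True))
```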
On Thu, Dec 15, 2022 at 06:15, Anatoly Myachev wrote:
@anmyachev commented on this pull request.
Why is the extra parameter better than what it was before?
------------------------------
In modin/core/dataframe/base/dataframe/dataframe.py
<#5426 (comment)>:
> The data types for the result. This is an optimization
because there are functions that always result in a particular data
type, and this allows us to avoid (re)computing it.
⬇️ Suggested change
- The data types for the result. This is an optimization
- because there are functions that always result in a particular data
- type, and this allows us to avoid (re)computing it.
+ The data types for the result. This is an optimization
+ because there are functions that always result in a particular data
+ type, and this allows us to avoid (re)computing it.
+ If the argument is a scalar type, then that type is assigned to each result column.
------------------------------
In modin/core/dataframe/base/dataframe/dataframe.py
<#5426 (comment)>:
> @@ -269,7 +276,7 @@ def reduce(
The axis to perform the reduce over.
function : callable(row|col) -> single value
The reduce function to apply to each column.
- dtypes : str, optional
+ dtypes : pandas.Series, optional
Is scalar type not accepted here?
------------------------------
In modin/core/dataframe/base/dataframe/dataframe.py
<#5426 (comment)>:
> @@ -308,7 +315,7 @@ def tree_reduce(
The map function to apply to each column.
reduce_func : callable(row|col) -> single value, optional
The reduce function to apply to the results of the map function.
- dtypes : str, optional
+ dtypes : pandas.Series, optional
Is scalar type not accepted here?
------------------------------
In modin/core/dataframe/pandas/dataframe/dataframe.py
<#5426 (comment)>:
> @@ -1604,6 +1604,8 @@ def _compute_tree_reduce_metadata(self, axis, new_parts):
The axis on which reduce function was applied.
new_parts : NumPy 2D array
Partitions with the result of applied function.
+ dtypes : pandas.Series, optional
Is scalar type not accepted here?
------------------------------
In modin/core/dataframe/pandas/dataframe/dataframe.py
<#5426 (comment)>:
> @@ -1643,7 +1644,7 @@ def reduce(
The axis to perform the reduce over.
function : callable(row|col) -> single value
The reduce function to apply to each column.
- dtypes : str, optional
+ dtypes : pandas.Series, optional
Is scalar type not accepted here?
------------------------------
In modin/core/dataframe/pandas/dataframe/dataframe.py
<#5426 (comment)>:
> @@ -2637,10 +2659,13 @@ def broadcast_apply_full_axis(
enumerate_partitions : bool, default: False
Whether pass partition index into applied `func` or not.
Note that `func` must be able to obtain `partition_idx` kwarg.
- dtypes : list-like, default: None
+ dtypes : list-like, optional
Is it possible to use list-like type everywhere or pandas.Series one? To
be more consistent.
------------------------------
In modin/core/dataframe/pandas/dataframe/dataframe.py
<#5426 (comment)>:
> @@ -1684,7 +1685,7 @@ def tree_reduce(
reduce_func : callable(row|col) -> single value, optional
Callable function to reduce the dataframe.
If none, then apply map_func twice.
- dtypes : str, optional
+ dtypes : pandas.Series, optional
Is scalar type not accepted here?
Happy holiday!
Signed-off-by: Jonathan Shi <[email protected]>
Signed-off-by: Jonathan Shi <[email protected]>
Co-authored-by: Anatoly Myachev <[email protected]>
Signed-off-by: Jonathan Shi <[email protected]>
What do these changes do?
This replaces the dtypes="copy" argument (used in some operators and internal dataframe apply/broadcast methods) with a boolean flag copy_dtypes.

- flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
- black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
- git commit -s
- Replace dtypes="copy" with copy_dtypes flag #5424
- docs/development/architecture.rst is up-to-date
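To make the description concrete, here is a hypothetical sketch (not Modin's actual implementation) of how an internal method might resolve result dtypes once the copy behavior is a separate flag rather than a sentinel value:

```python
import numpy as np
import pandas

def resolve_result_dtypes(own_dtypes, dtypes=None, copy_dtypes=False):
    # `copy_dtypes` carries the control flow: reuse the existing dtypes
    # unchanged rather than recomputing or inferring them.
    if copy_dtypes:
        return own_dtypes.copy()
    # Otherwise `dtypes` is only ever real dtype data; None means "infer later".
    return dtypes

own = pandas.Series({"a": np.dtype("int64"), "b": np.dtype("float64")})
print(resolve_result_dtypes(own, copy_dtypes=True))
```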