Bump pandas from 1.5.3 to 2.0.3 #422
base: main
Conversation
Signed-off-by: Yerzhaisang Taskali <[email protected]>
CHANGELOG.md
Outdated
@@ -46,6 +46,7 @@ Inspired from [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
- updating listing file with three v2 sparse model - by @dhrubo-os ([#412](https://github.com/opensearch-project/opensearch-py-ml/pull/412))
- Update model upload history - opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini (v.1.0.0)(TORCH_SCRIPT) by @dhrubo-os ([#417](https://github.com/opensearch-project/opensearch-py-ml/pull/417))
- Update model upload history - opensearch-project/opensearch-neural-sparse-encoding-v2-distill (v.1.0.0)(TORCH_SCRIPT) by @dhrubo-os ([#419](https://github.com/opensearch-project/opensearch-py-ml/pull/419))
- Bump pandas from 1.5.3 to 2.0.3 bu @yerzhaisang ([#422](https://github.com/opensearch-project/opensearch-py-ml/pull/422))
by*
Fixed
docs/requirements-docs.txt
Outdated
@@ -1,5 +1,5 @@
opensearch-py>=2
pandas>=1.5,<3
pandas==2.0.3
Can we upgrade to a more recent version? Any reason specifically for 2.0.3?
Bumping to a later version introduces datatype issues, including an ImportError like this:
ImportError: cannot import name 'is_datetime_or_timedelta_dtype' from 'pandas.core.dtypes.common'
Given that the issue was only to upgrade to a 2.x version, I thought 2.0.3 would be sufficient.
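For reference, one way to tolerate newer pandas later on would be a small import shim. This is only a sketch, not part of this PR; the fallback uses public `pandas.api.types` helpers:

```python
# Hedged sketch, not part of this PR: guard the private import that newer
# pandas removed and fall back to public pandas.api.types helpers.
try:
    from pandas.core.dtypes.common import is_datetime_or_timedelta_dtype
except ImportError:
    from pandas.api.types import is_datetime64_any_dtype, is_timedelta64_dtype

    def is_datetime_or_timedelta_dtype(arr_or_dtype):
        # Same spirit as the removed helper: True for datetime64 or timedelta64 dtypes.
        return is_datetime64_any_dtype(arr_or_dtype) or is_timedelta64_dtype(arr_or_dtype)
```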
Sure, let's focus on bumping to 2.0.3 for now, and then we can create another issue to upgrade further if needed.
opensearch_py_ml/common.py
Outdated
) -> pd.Series:
    """Builds a pd.Series while squelching the warning
    for unspecified dtype on empty series
    """
    dtype = dtype or (EMPTY_SERIES_DTYPE if not data else dtype)
    if dtype is not None:
        kwargs["dtype"] = dtype
    if index_name:
Let's keep an explicit check for None: if index_name is not None:
What happens if the index is not found?
> Let's keep an explicit check for None: if index_name is not None:

Done
opensearch_py_ml/dataframe.py
Outdated
# axes, _ = pd.DataFrame()._construct_axes_from_arguments(
#     (index, columns), {}
# )
can remove these comments if no longer used
Done
opensearch_py_ml/dataframe.py
Outdated
if index is not None:
    if isinstance(index, pd.Index):
        index = index.tolist()  # Convert Index to list
    elif not is_list_like(index):
        index = [index]  # Convert to list if it's not list-like already
    axes["index"] = index
else:
    axes["index"] = None

if columns is not None:
    if isinstance(columns, pd.Index):
        columns = columns.tolist()  # Convert Index to list
    elif not is_list_like(columns):
        columns = [columns]  # Convert to list if it's not list-like already
    axes["columns"] = columns
else:
    axes["columns"] = None
Can this be wrapped in a method? Then we could do something like this:
axes = {
    "index": to_list_if_needed(index),
    "columns": pd.Index(to_list_if_needed(columns)) if columns is not None else None,
}
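A minimal sketch of such a helper; the name to_list_if_needed follows the suggestion above and is not an existing opensearch-py-ml API:

```python
import pandas as pd
from pandas.api.types import is_list_like

def to_list_if_needed(value):
    # Normalize an index/columns argument: None passes through, a pd.Index and
    # scalars become plain lists, other list-likes are returned unchanged.
    if value is None:
        return None
    if isinstance(value, pd.Index):
        return value.tolist()
    if not is_list_like(value):
        return [value]
    return value

# Example: to_list_if_needed("Carrier") -> ["Carrier"], to_list_if_needed(None) -> None
```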
Done
opensearch_py_ml/dataframe.py
Outdated
if columns is not None:
    if isinstance(columns, pd.Index):
        columns = columns.tolist()  # Convert Index to list
columns to list
Fixed
opensearch_py_ml/dataframe.py
Outdated
if not is_list_like(columns):
    columns = [columns]
repeated logic? this is handled in lines 443 and 444 right?
Fixed
@@ -424,9 +424,36 @@ def drop(
    axis = pd.DataFrame._get_axis_name(axis)
    axes = {axis: labels}
elif index is not None or columns is not None:
Kind of confused here: the parent branch already checks that one of them is not None, but inside it's checking again (lines 431 and 440).
Maybe this could be simplified to what @pyek-bot suggested about creating a wrapper like convertToList, if needed.
Fixed
Hey, left some comments. I notice you repeatedly reimplement the mad function. I would consider this a friction point: if someone new decides to use mad, they have to apply the lambda formula again, which is prone to errors. I suggested extracting that repeated formula into a utility class since it is duplicated in several places.
What about the built-in mad function is causing inconsistent behavior? Is there an example you can provide?
Also, I am a bit concerned that you are pinning strictly to 2.0.3, which forces people to code against only that version. Can we refactor so that if we want to bump again, not a lot of code changes are needed?
@@ -475,7 +475,7 @@ def _terms_aggs(
except IndexError:
    name = None

return build_pd_series(results, name=name)
return build_pd_series(results, index_name=name, name="count")
How come it's using "count" here when before it was using name?
In pandas 2.0.3, a change in the value_counts method resulted in the following behavior:
The method now uses "count" as the name for the values column, while the original column name (e.g., "Carrier") is used for the index name. This differs from earlier versions, where the values column would inherit the name of the original column.
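A tiny illustration of the rename; the sample data is made up, but the name/index behavior matches the pandas 2.0 change:

```python
import pandas as pd

s = pd.Series(["Logstash Airways", "JetBeats", "JetBeats"], name="Carrier")
counts = s.value_counts()

# pandas 1.5.x: counts.name == "Carrier" and counts.index.name is None
# pandas 2.x:   counts.name == "count"   and counts.index.name == "Carrier"
print(counts.name, counts.index.name)
```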
Hey @dhrubo-os what are your thoughts? Thanks for the info @Yerzhaisang
@Yerzhaisang could you please share any documentation about the changed behavior of the value_counts method from version 1.5.3 to 2.0.3? I think that between pandas 1.5.3 and 2.0.3 the value_counts() method has remained consistent in functionality. Please let me know if you think otherwise.
In pandas 1.5.3, the series name is used as the name of the result series.
In pandas 2.0.3, the series name is used as the index name, while the result series is named "count".
@@ -106,10 +106,18 @@ def test_groupby_aggs_mad_var_std(self, pd_agg, dropna):
pd_flights = self.pd_flights().filter(self.filter_data)
oml_flights = self.oml_flights().filter(self.filter_data)

pd_groupby = getattr(pd_flights.groupby("Cancelled", dropna=dropna), pd_agg)()
if pd_agg == "mad":
Is it possible to extract these values like "mad", "var", "std" as constants (it would remove friction for newcomers)? For example, MEAN_ABSOLUTE_DEVIATION = "mad". Not sure if that would take a lot of effort, but something to think about.
Done
pd_groupby = getattr(pd_flights.groupby("Cancelled", dropna=dropna), pd_agg)()
if pd_agg == "mad":
    pd_groupby = pd_flights.groupby("Cancelled", dropna=dropna).agg(
        lambda x: (x - x.mean()).abs().mean()
I see this lambda used in other places; could we possibly have a util class that dispatches these common function names?
Fixed
pd_metric = getattr(pd_flights, func)(
    **({"numeric_only": True} if func != "mad" else {})
)
if func == "mad":
I noticed you introduced branching. What if some day in the future we need to reimplement another function from scratch? Instead of creating a new branch every time, can we extract this behavior out? Perhaps have a utility class with our own custom implementations, like
customFunctionMap = {"mad": lambda x: (x - x.median()).abs().mean()}
and then dispatch it instead of branching for every method we need to reimplement, something like:
if func in customFunctionMap:
    # apply the custom function
else:
    # use the already given functionality
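For concreteness, a minimal sketch of that dispatch idea; the names mirror the suggestion above, not necessarily the final code in this PR, and the formula follows the test lambda (deviation about the mean):

```python
class CustomFunctionDispatcher:
    # Custom replacements for aggregations removed from newer pandas,
    # e.g. Series.mad() (mean absolute deviation).
    customFunctionMap = {
        "mad": lambda x: (x - x.mean()).abs().mean(),
    }

    @classmethod
    def apply_custom_function(cls, func, data):
        # Use the custom implementation when one is registered,
        # otherwise fall back to the built-in pandas method.
        if func in cls.customFunctionMap:
            return cls.customFunctionMap[func](data)
        return getattr(data, func)()
```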
Done
Signed-off-by: Yerzhaisang Taskali <[email protected]>
…readability and reusability Signed-off-by: Yerzhaisang Taskali <[email protected]>
Only had one piece of feedback you could implement. Code structure: I think it's good (not a pandas expert though).
pd.Index(to_list_if_needed(columns))
if columns is not None
else None
I noticed that in the implementation of opensearch_py_ml.utils, when the value is None it already returns None, so maybe we don't need a ternary operation here since it's already doing that?
Hey Brian, we can't remove the ternary operation at this point because calling pd.Index(None) results in a TypeError. Additionally, we need to keep this check since it's required here.
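Quick check of why the guard is needed:

```python
import pandas as pd

columns = None
# pd.Index(None) raises TypeError, since None is not a collection,
# so None has to be short-circuited before building the Index.
safe = pd.Index(columns) if columns is not None else None
print(safe)  # None
```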
if pd_agg in CustomFunctionDispatcher.customFunctionMap:
    pd_groupby = pd_flights.groupby("Cancelled", dropna=dropna).agg(
        lambda x: CustomFunctionDispatcher.apply_custom_function(pd_agg, x)
Nice!
) -> pd.Series:
    """Builds a pd.Series while squelching the warning
    for unspecified dtype on empty series
    """
    dtype = dtype or (EMPTY_SERIES_DTYPE if not data else dtype)
    if dtype is not None:
        kwargs["dtype"] = dtype
    if index_name is not None:
Can we add a comment explaining why we need this?
done
opensearch_py_ml/constants.py
Outdated
# specific language governing permissions and limitations
# under the License.

MEAN_ABSOLUTE_DEVIATION = "mad"
Rather than creating another file, can we use utils.py?
done
@@ -1326,7 +1331,6 @@ def to_csv(
    compression="infer",
    quoting=None,
    quotechar='"',
    line_terminator=None,
why are we removing this?
I restored it as lineterminator to align with recent pandas updates, but it’s still not actively used elsewhere in the code.
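For reference, a small illustration of the keyword rename (pandas 1.5 deprecated line_terminator in to_csv in favor of lineterminator, and pandas 2.0 removed the old name):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
csv_text = df.to_csv(lineterminator="\n")   # accepted on pandas >= 1.5
# df.to_csv(line_terminator="\n")           # TypeError on pandas >= 2.0
print(csv_text)
```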
Can we keep it as it is, line_terminator, not lineterminator?
Logstash Airways    3331
JetBeats            3274
Kibana Airlines     3234
ES-Air              3220
Name: Carrier, dtype: int64
Name: count, dtype: int64
This is what we don't want, right? Carrier is the column name, which we are changing to count, and that's not right.
Hey @dhrubo, this is actually correct. In pandas 2.0.3, Carrier is set as the index name, and count is the column name.
@@ -34,7 +34,7 @@ def test_flights_describe(self):
pd_flights = self.pd_flights()
oml_flights = self.oml_flights()

pd_describe = pd_flights.describe()
pd_describe = pd_flights.describe().drop(["timestamp"], axis=1)
why are we removing this?
def test_groupby_aggs_mad_var_std(self, pd_agg, dropna):
    # For these aggs pandas doesn't support numeric_only
    pd_flights = self.pd_flights().filter(self.filter_data)
    oml_flights = self.oml_flights().filter(self.filter_data)

    pd_groupby = getattr(pd_flights.groupby("Cancelled", dropna=dropna), pd_agg)()
    if pd_agg in CustomFunctionDispatcher.customFunctionMap:
Why do we need this?
oml_groupby = getattr(oml_flights.groupby("Cancelled", dropna=dropna), pd_agg)(
    numeric_only=True
)
pd_groupby = pd_groupby[oml_groupby.columns]
Why do we need this? We shouldn't use any oml resource/info in pd, since the goal is to check that the pd functionality matches the oml functionality.
ok, fixed
@@ -224,15 +240,44 @@ def test_groupby_dataframe_mad(self):
pd_flights = self.pd_flights().filter(self.filter_data + ["DestCountry"])
oml_flights = self.oml_flights().filter(self.filter_data + ["DestCountry"])

pd_mad = pd_flights.groupby("DestCountry").mad()
pd_mad = pd_flights.groupby("DestCountry").apply(
Can we get some ideas from this PR: https://github.com/elastic/eland/pull/602/files ?
Also, shouldn't we keep BWC (backward compatibility) in mind?
I believe we don’t need to worry about backward compatibility, as we haven’t modified our built-in methods. We only customized the deprecated pandas methods within the tests.
What I was wondering: currently we are doing pandas==2.0.3, but can't we do pandas>=2.0.3, so that we can support other versions of pandas too?
got it, let me do some research
@dhrubo-os thank you for this good point. Yeah, we shouldn't force customers to install only one version.
I did some research, and the current functionality is compatible with pandas>=1.5.2,<2.1.
What do you think about that?
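A quick, illustrative sanity check for the proposed range; the packaging library is assumed to be available, and this snippet is not part of the PR:

```python
import pandas as pd
from packaging.version import Version

# Proposed constraint from this discussion: pandas>=1.5.2,<2.1
installed = Version(pd.__version__)
assert Version("1.5.2") <= installed < Version("2.1"), f"unsupported pandas {installed}"
```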
Signed-off-by: Yerzhaisang Taskali <[email protected]>
@@ -1,5 +1,5 @@
opensearch-py>=2
pandas>=1.5,<3
pandas>=1.5.2,<2.1
Did you get time to test functionalities with different versions of pandas? If yes, which versions did you test?
Hey @dhrubo-os, I haven't completed it yet. It's still in progress; I will let you know when it's done.
Description
Issues Resolved
#270
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.