[SPARK-47995][INFRA][PYTHON] Refresh testing image for pyarrow 17 #47965
@@ -723,7 +723,7 @@ jobs:
 # See 'ipython_genutils' in SPARK-38517
 # See 'docutils<0.18.0' in SPARK-39421
 python3.9 -m pip install 'sphinx==4.5.0' mkdocs 'pydata_sphinx_theme>=0.13' sphinx-copybutton nbsphinx numpydoc jinja2 markupsafe 'pyzmq<24.0.0' \
-  ipython ipython_genutils sphinx_plotly_directive 'numpy>=1.20.0' pyarrow pandas 'plotly>=4.8' 'docutils<0.18.0' \
+  ipython ipython_genutils sphinx_plotly_directive 'numpy==1.26.4' pyarrow pandas 'plotly>=4.8' 'docutils<0.18.0' \
Pin numpy==1.26.4 to avoid test failures:
https://github.com/zhengruifeng/spark/actions/runs/10675688719/job/29589058669
The NumPy behavior needs more investigation.
Hm, that sounds like a regression somewhere. We fixed it in #47083.
Alright, I think the initial fix was a partial fix, and we would need a similar fix for pandas API on Spark too, cc @xinrong-meng @itholic FYI.
It is interesting that the output of pandas itself also varies after the NumPy upgrade:
before
In [4]: import pandas as pd
In [5]: import numpy as np
In [6]: pd.Series([None, None, 3, 4, 5], index=[100, 200, 300, 400, 500]).first_valid_index()
Out[6]: 300
In [7]: pd.__version__
Out[7]: '2.2.2'
In [8]: np.__version__
Out[8]: '1.26.4'
after
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: pd.Series([None, None, 3, 4, 5], index=[100, 200, 300, 400, 500]).first_valid_index()
Out[3]: np.int64(300)
In [4]: pd.__version__
Out[4]: '2.2.2'
In [5]: np.__version__
Out[5]: '2.1.0'
Another example, with NumPy 1.26.4:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]], index=[4, 5, 6], columns=['A', 'B', 'C'])
In [3]: df.at[4, 'B']
Out[3]: 2
With NumPy 2.1.0:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]], index=[4, 5, 6], columns=['A', 'B', 'C'])
In [3]: df.at[4, 'B']
Out[3]: np.int64(2)
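The difference in both examples looks like a change in how NumPy 2 prints scalars rather than a change in the value itself: `np.int64(300)` still compares equal to `300`. Below is a minimal sketch of one way to keep printed output stable across NumPy versions, assuming it is acceptable to convert NumPy scalars to Python builtins before displaying them. `to_builtin` is a hypothetical helper for illustration, not something from Spark or pandas.

```python
import numpy as np


def to_builtin(value):
    """Convert a NumPy scalar to the matching Python builtin.

    Hypothetical helper: values that are not NumPy scalars pass through unchanged.
    """
    if isinstance(value, np.generic):
        # .item() returns a plain Python int, float, bool, etc.
        return value.item()
    return value


idx = np.int64(300)

# The value is unchanged by the NumPy upgrade; only the repr of the scalar differs.
print(idx == 300)                # True
print(repr(to_builtin(idx)))     # 300
print(type(to_builtin(idx)))     # <class 'int'>
```

Converting at the point of display keeps doctest-style expected output independent of the installed NumPy version, at the cost of dropping the NumPy dtype from what is shown.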
Thanks for sharing!
Also pandas only provides a minimum supported version of NumPy (here), similar to what we did, rather than a “recommended” version.
It’s surprising to see such changes in return results across supported NumPy versions.
I didn't find any existing discussion in the pandas community on this. I'm wondering if we should raise an issue there.
Me too, I cannot find any related documentation. Please help file a pandas issue, thanks!
Sounds good, filed pandas-dev/pandas#59838
Oh, is MLflow 2.16.0 ready? In the community, I've been testing this until now here.
The blocker was MLflow until 2.15.x. If you don't mind, please use SPARK-47995 instead of a new JIRA ID because it was filed earlier. Then, I'll close my PR.
Thank you for working on this, @zhengruifeng .
+1, LGTM.
I revised the PR title to use SPARK-47995 and added Closes #46232 to the PR description.
Thank you, @zhengruifeng and @HyukjinKwon .
Merged to master.
@dongjoon-hyun thanks for taking care of it. I was not aware of that ticket, so I filed a new one :)
No problem at all~ Thank you for doing this. I've been waiting for this so long. ;)
### What changes were proposed in this pull request?
Refresh testing image for pyarrow 17

### Why are the changes needed?
currently the cached `pyarrow==15.0.2` is used in [CI](https://github.com/apache/spark/actions/runs/10674534002/job/29585233434), we need to test Spark with latest pyarrow

### Does this PR introduce _any_ user-facing change?
No, infra only

### How was this patch tested?
updated ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#46232

Closes apache#47965 from zhengruifeng/infra_refresh_test_doc.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
Refresh testing image for pyarrow 17
Why are the changes needed?
Currently the cached pyarrow==15.0.2 is used in CI; we need to test Spark with the latest pyarrow.
Does this PR introduce any user-facing change?
No, infra only
How was this patch tested?
updated ci
Was this patch authored or co-authored using generative AI tooling?
no
Closes #46232