[SPARK-47995][INFRA][PYTHON] Refresh testing image for pyarrow 17 #47965

zhengruifeng · 2024-09-03T02:45:51Z

What changes were proposed in this pull request?

Refresh testing image for pyarrow 17

Why are the changes needed?

Currently, the cached pyarrow==15.0.2 is used in CI; we need to test Spark with the latest pyarrow.
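To confirm which versions an environment actually resolves (for example, whether a cached image still carries the old pyarrow), a small check like the following can help. This is only a sketch: the helper name `installed_version` is hypothetical and is not part of the Spark CI scripts.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Print the versions of the packages discussed in this PR.
for name in ("pyarrow", "numpy", "pandas"):
    print(name, installed_version(name) or "not installed")
```

Running this inside the testing image shows whether the refreshed image picked up the expected pyarrow release.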

Does this PR introduce any user-facing change?

No, infra only

How was this patch tested?

updated ci

Was this patch authored or co-authored using generative AI tooling?

no

Closes #46232

zhengruifeng · 2024-09-03T06:46:42Z

.github/workflows/build_and_test.yml

@@ -723,7 +723,7 @@ jobs:
        # See 'ipython_genutils' in SPARK-38517
        # See 'docutils<0.18.0' in SPARK-39421
        python3.9 -m pip install 'sphinx==4.5.0' mkdocs 'pydata_sphinx_theme>=0.13' sphinx-copybutton nbsphinx numpydoc jinja2 markupsafe 'pyzmq<24.0.0' \
-          ipython ipython_genutils sphinx_plotly_directive 'numpy>=1.20.0' pyarrow pandas 'plotly>=4.8' 'docutils<0.18.0' \
+          ipython ipython_genutils sphinx_plotly_directive 'numpy==1.26.4' pyarrow pandas 'plotly>=4.8' 'docutils<0.18.0' \


pin numpy==1.26.4 to avoid test failures

https://github.com/zhengruifeng/spark/actions/runs/10675688719/job/29589058669

The NumPy failures need more investigation.

Hm, that sounds like a regression somewhere. We fixed it in #47083.

Alright, I think the initial fix was a partial fix, and we would need a similar fix for pandas API on Spark too, cc @xinrong-meng @itholic FYI.

Interestingly, the output type of pandas itself also varies after the NumPy upgrade:

before

In [4]: import pandas as pd
In [5]: import numpy as np
In [6]: pd.Series([None, None, 3, 4, 5], index=[100, 200, 300, 400, 500]).first_valid_index()
Out[6]: 300
In [7]: pd.__version__
Out[7]: '2.2.2'
In [8]: np.__version__
Out[8]: '1.26.4'

after

In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: pd.Series([None, None, 3, 4, 5], index=[100, 200, 300, 400, 500]).first_valid_index()
Out[3]: np.int64(300)
In [4]: pd.__version__
Out[4]: '2.2.2'
In [5]: np.__version__
Out[5]: '2.1.0'

another example:

NumPy 1.26.4

In [1]: import pandas as pd
In [2]: df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]], index=[4, 5, 6], columns=['A', 'B', 'C'])
In [3]: df.at[4, 'B']
Out[3]: 2

NumPy 2.1.0

In [1]: import pandas as pd
In [2]: df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]], index=[4, 5, 6], columns=['A', 'B', 'C'])
In [3]: df.at[4, 'B']
Out[3]: np.int64(2)
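If the difference is only in how the scalar is printed (an assumption here; this thread does not confirm the root cause), one way to compare printed outputs across NumPy versions is to normalize the text before comparing. The helper below is a hypothetical sketch, not the fix applied in Spark, and it covers only integer and float scalar reprs of the form shown above.

```python
import re

# Matches scalar reprs such as "np.int64(300)" or "np.float64(1.5)".
# Assumption: only int/uint/float scalars need normalizing.
_NP_SCALAR_REPR = re.compile(r"\bnp\.(?:int|uint|float)\d+\(([^()]*)\)")


def strip_numpy_scalar_repr(text: str) -> str:
    """Rewrite 'np.int64(300)' style reprs to the bare value '300'."""
    return _NP_SCALAR_REPR.sub(r"\1", text)


print(strip_numpy_scalar_repr("Out[3]: np.int64(300)"))  # -> Out[3]: 300
print(strip_numpy_scalar_repr("Out[3]: 2"))              # -> Out[3]: 2
```

With this normalization, the "before" and "after" outputs above compare equal as text, so an expected-output check would not depend on the installed NumPy version.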

Thanks for sharing!
Also pandas only provides a minimum supported version of NumPy (here), similar to what we did, rather than a “recommended” version.
It’s surprising to see such changes in return results across supported NumPy versions.

I didn't find any existing discussion in the pandas community on this. I'm wondering if we should raise an issue there.

Me too, I cannot find any related documentation. Please help file a pandas issue, thanks!

Sounds good, filed pandas-dev/pandas#59838

dongjoon-hyun

Oh, is MLflow 2.16.0 ready? In the community, I've been testing this until now here.

#46232

The blocker was MLflow until 2.15.x. If you don't mind, use SPARK-47995 instead of a new JIRA ID because it was filed earlier. Then, I'll close my PR.

Thank you for working on this, @zhengruifeng.

dongjoon-hyun

+1, LGTM.

I revised the PR title to use SPARK-47995 and added Closes #46232 to the PR description.

Thank you, @zhengruifeng and @HyukjinKwon.

Merged to master.

zhengruifeng · 2024-09-04T00:28:48Z

@dongjoon-hyun thanks for taking care of it. I was not aware of that ticket, so I filed a new one :)

dongjoon-hyun · 2024-09-04T19:44:30Z

No problem at all~ Thank you for doing this. I've been waiting for this for so long. ;)

### What changes were proposed in this pull request?

Refresh testing image for pyarrow 17

### Why are the changes needed?

currently the cached `pyarrow==15.0.2` is used in [CI](https://github.com/apache/spark/actions/runs/10674534002/job/29585233434), we need to test Spark with latest pyarrow

### Does this PR introduce _any_ user-facing change?

No, infra only

### How was this patch tested?

updated ci

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#46232

Closes apache#47965 from zhengruifeng/infra_refresh_test_doc.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

init

10a5321

github-actions bot added the BUILD label Sep 3, 2024

HyukjinKwon approved these changes Sep 3, 2024

pin numpy

451720c

github-actions bot added the INFRA label Sep 3, 2024

zhengruifeng changed the title ~~[WIP][INFRA] Refresh testing image for pyarrow 17~~ [SPARK-49496][INFRA][PYTHON] Refresh testing image for pyarrow 17 Sep 3, 2024

zhengruifeng commented Sep 3, 2024

dongjoon-hyun reviewed Sep 3, 2024

dongjoon-hyun changed the title ~~[SPARK-49496][INFRA][PYTHON] Refresh testing image for pyarrow 17~~ [SPARK-47995][INFRA][PYTHON] Refresh testing image for pyarrow 17 Sep 3, 2024

dongjoon-hyun approved these changes Sep 3, 2024

dongjoon-hyun closed this in 7508f6d Sep 3, 2024

zhengruifeng deleted the infra_refresh_test_doc branch September 4, 2024 00:25
