[BEAM-13605] Add support for pandas 1.4.0 #16590

Merged
merged 12 commits into from
Feb 3, 2022

yeandy
Contributor

@yeandy yeandy commented Jan 21, 2022

This PR includes all changes necessary for the pandas==1.4.0 update, except the apply() operation and its dependencies drop_duplicates() and nunique().




@yeandy
Contributor Author

yeandy commented Jan 21, 2022

@TheNeuralBit

Almost done. Will tag you for final review when finished.

I opened this PR against master b/c I was having trouble pushing to your fork/branch. As a result, the title is the same as your original PR. 😆 Will work on pushing to branch to consolidate.

@TheNeuralBit
Member

oh, rather than trying to push to my branch, how about we just go ahead and merge #16571? If you LGTM it I can merge it.

@yeandy
Contributor Author

yeandy commented Jan 25, 2022

R: @TheNeuralBit

@yeandy
Contributor Author

yeandy commented Jan 25, 2022

I see a bunch of grpc errors in some of the unit tests. I think we can ignore?

I also see doctest errors for the replace method, specifically for the s.replace([1, 2], method='bfill') and s.replace('a', None) tests. However, when I run doctests locally, I don't see these errors. I believe these two tests are properly accounted for under wont_implement_ok, but maybe I overlooked something?

Edit: I think I know why. It's because we haven't updated the version of pandas to 1.4.0 in the precommit tasks / setup.py

@TheNeuralBit
Member

TheNeuralBit commented Jan 26, 2022

> I see a bunch of grpc errors in some of the unit tests. I think we can ignore?

Yeah I think these are safe to ignore. Sometimes the GHA checks flake with lots of grpc.FutureTimeoutError (BEAM-12163), in which case we can just try re-running.

> I also see doctest errors for the replace method, specifically for the s.replace([1, 2], method='bfill') and s.replace('a', None) tests. However, when I run doctests locally, I don't see these errors. I believe these two tests are properly accounted for under wont_implement_ok, but maybe I overlooked something?
>
> Edit: I think I know why. It's because we haven't updated the version of pandas to 1.4.0 in the precommit tasks / setup.py

I think your edit is correct, but that is by design. We will still verify with pandas <1.4.0 in the Python PreCommit, because we want the API to work with multiple minor versions of pandas.

Ideally we will find a way to modify the implementation that is still compatible with pandas <1.4.0
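
One way to keep an implementation working across several pandas minor versions is to gate newer keyword arguments on the installed version. The following is a minimal sketch of that idea, not the change made in this PR. The helper name `value_counts_kwargs` is hypothetical, and `pd_version` is assumed to be a `(major, minor)` tuple parsed from `pd.__version__`. The `dropna` keyword is used as the illustration because it is described later in this thread as new in pandas 1.3.

```python
def value_counts_kwargs(pd_version, dropna=True):
    """Build keyword arguments that the installed pandas is assumed to accept.

    pd_version is a (major, minor) tuple such as (1, 3). This helper is
    hypothetical and only illustrates version gating.
    """
    kwargs = {}
    if pd_version >= (1, 3):
        # Newer pandas: forward the argument unchanged.
        kwargs['dropna'] = dropna
    elif not dropna:
        # Older pandas cannot express this request, so fail loudly
        # instead of silently returning a different result.
        raise NotImplementedError('dropna=False requires pandas >= 1.3')
    return kwargs
```

With this shape, callers on older pandas keep the default behavior, and only the request that the older release cannot honor raises an error.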

@yeandy
Contributor Author

yeandy commented Jan 27, 2022

Fixed the backwards compatibility issue, and it now passes on Python 3.6 and 3.7. The PR is now in a state for your review and/or other additions.

@TheNeuralBit TheNeuralBit self-requested a review January 27, 2022 22:01
Member

@TheNeuralBit TheNeuralBit left a comment

Thanks! This looks good, I just have a few suggestions around value_counts. I'll work on the groupby.apply issue this week and we can put these together.

sdks/python/apache_beam/dataframe/frames.py Outdated Show resolved Hide resolved
sdks/python/apache_beam/dataframe/frames.py Show resolved Hide resolved
@TheNeuralBit
Member

Alright! Now that #16706 is merged I think this is good to go! Could you rebase or merge master?

This is probably something we should mention in CHANGES.md for the 2.37.0 release too

@yeandy
Contributor Author

yeandy commented Feb 3, 2022

Just rebased!

Also added two small commits before merge:

  1. There was a failing doctest, which I skipped. A new pandas change now allows constructing a DataFrame from a Series, and that doctest fails because the construction calls len(), which we don't allow.
  2. I also added a CHANGES.md entry to this PR. Beam 2.36 is still unreleased, but I don't think these changes should be added to the 2.36 cut? Let me know if you think this should be in a separate PR.
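
The len() problem in item 1 can be illustrated with a small sketch. It assumes only that a deferred object describes a computation and holds no data until the pipeline runs, so it cannot report a length up front. `DeferredSeries` and `eager_constructor` are hypothetical stand-ins, not Beam or pandas classes.

```python
class DeferredSeries:
    """Hypothetical stand-in for a deferred object with no materialized data."""

    def __init__(self, name):
        self.name = name

    def __len__(self):
        # The size is unknown until the pipeline executes, so refuse to answer.
        raise TypeError(
            f'len() is not available on deferred {self.name!r}: '
            'the size is unknown until the pipeline runs')


def eager_constructor(data):
    """Hypothetical constructor that sizes its input before building anything."""
    return {'rows': len(data)}
```

Calling `eager_constructor([1, 2, 3])` works because a list knows its length, while `eager_constructor(DeferredSeries('s'))` raises `TypeError`. This mirrors how a doctest that builds a DataFrame from a deferred Series could fail.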

@codecov

codecov bot commented Feb 3, 2022

Codecov Report

Merging #16590 (83523c6) into master (9794fb4) will increase coverage by 9.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master   #16590      +/-   ##
==========================================
+ Coverage   74.60%   83.64%   +9.03%     
==========================================
  Files         654      452     -202     
  Lines       82029    62168   -19861     
==========================================
- Hits        61201    52001    -9200     
+ Misses      19841    10167    -9674     
+ Partials      987        0     -987     
Impacted Files Coverage Δ
sdks/python/apache_beam/dataframe/frames.py 95.04% <100.00%> (+0.06%) ⬆️
...pache_beam/dataframe/pandas_top_level_functions.py 90.52% <0.00%> (-3.16%) ⬇️
sdks/python/apache_beam/utils/interactive_utils.py 92.68% <0.00%> (-2.44%) ⬇️
sdks/python/apache_beam/internal/metrics/metric.py 90.00% <0.00%> (-1.00%) ⬇️
...ks/python/apache_beam/runners/worker/sdk_worker.py 88.90% <0.00%> (-0.16%) ⬇️
...pkg/beam/runners/dataflow/dataflowlib/translate.go
sdks/go/pkg/beam/runners/direct/buffer.go
sdks/go/pkg/beam/artifact/stage.go
sdks/go/pkg/beam/core/graph/coder/time.go
sdks/go/pkg/beam/util/starcgenx/starcgenx.go
... and 198 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@TheNeuralBit
Member

> Just rebased!
>
> Also added two small commits before merge:
>
>   1. There was a failing doctest, which I skipped. A new pandas change now allows constructing a DataFrame from a Series, and that doctest fails because the construction calls len(), which we don't allow.

Sounds good!

>   2. I also added a CHANGES.md entry to this PR. Beam 2.36 is still unreleased, but I don't think these changes should be added to the 2.36 cut? Let me know if you think this should be in a separate PR.

That's right, it won't be in 2.36 since the branch was already cut. I'll be cutting the 2.37 release branch next Wednesday though. It's fine to do it in the same PR.

It looks like apache_beam.dataframe.io_test.IOTest.test_read_write_parquet is failing in the py38-pyarrow-0 configuration (where we verify different versions of pyarrow), presumably because pandas 1.4 dropped support for pyarrow 0.17. Could you just skip this test when pandas >= 1.4 and pyarrow < 1.0 are installed? Similar to what we do here:

@unittest.skipIf(PD_VERSION < (1, 3), "dropna=False is new in pandas 1.3")

We could consider just dropping support for pyarrow < 1.0, but technically the non-dataframe ParquetIO will still work with it. So I think it's better to just skip this test.
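
The suggested skip could look roughly like the sketch below. The helper names `version_tuple` and `parquet_combo_unsupported` are hypothetical, and the two version constants are hard-coded so the sketch runs without pandas or pyarrow installed. In a real test module they would be parsed from `pd.__version__` and `pa.__version__`. The skip message restates the presumption above and is not a verified compatibility statement.

```python
import re
import unittest


def version_tuple(version):
    """Return the (major, minor) prefix of a version string as ints."""
    match = re.match(r'(\d+)\.(\d+)', version)
    if match is None:
        raise ValueError(f'unrecognized version: {version!r}')
    return int(match.group(1)), int(match.group(2))


def parquet_combo_unsupported(pd_version, pa_version):
    """True when pandas >= 1.4 is paired with pyarrow < 1.0."""
    return pd_version >= (1, 4) and pa_version < (1, 0)


# Assumed values for illustration; real code would read the installed versions.
PD_VERSION = version_tuple('1.4.0')
PA_VERSION = version_tuple('0.17.1')


class IOTest(unittest.TestCase):
    @unittest.skipIf(
        parquet_combo_unsupported(PD_VERSION, PA_VERSION),
        'pandas >= 1.4 presumably dropped support for pyarrow < 1.0')
    def test_read_write_parquet(self):
        # The real test would round-trip a DataFrame through parquet here.
        pass
```

Keeping the condition in a small function makes the version combination easy to check on its own, separately from the test it guards.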

@TheNeuralBit TheNeuralBit changed the title [BEAM-13605] Update pandas_doctests_test denylists in preparation for pandas 1.4.0 [BEAM-13605] Add support for pandas 1.4.0 Feb 3, 2022
@TheNeuralBit
Member

Thanks for your help on this! It's great that we'll have this ready for 2.37.0

@TheNeuralBit TheNeuralBit merged commit 5beae2a into apache:master Feb 3, 2022
@tvalentyn
Contributor

Could this have broken Py3.8 postcommits? See: https://ci-beam.apache.org/job/beam_PostCommit_Python38/2251 ?

@TheNeuralBit
Member

Yes, this must be the culprit for those failures. It looks like a version mismatch; it should be resolved by upgrading pandas to 1.4.0 in the Python 3.8 containers.

@tvalentyn
Contributor

tvalentyn commented Feb 7, 2022

ack, thanks! I think @ibzib was working on that, so remaining item would be to update names.py
