-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support arbitrary python & beam versions, stop using pangeo/forge image #90
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #90 +/- ##
=======================================
Coverage 96.01% 96.01%
=======================================
Files 14 14
Lines 452 452
=======================================
Hits 434 434
Misses 18 18
☔ View full report in Codecov by Sentry. |
Need to figure out why the dataflow tests are failing. |
This is the error from dataflow
So, why in the fuck is this even fucking trying to load geopandas of all the things, and then failing? Sigh. |
If you try doing this yourself manually, it works:
So, dataflow is doing something to the container that's fucking things up. werecomputersamistake.com |
Requires that we explicitly specify *everything* required by our pipelines in requirements.txt, but means we no longer have to maintain a complex image here. The complex image also leads to complex hard to debug problems, like #90 (comment)
Requires that we explicitly specify *everything* required by our pipelines in requirements.txt, but means we no longer have to maintain a complex image here. The complex image also leads to complex hard to debug problems, like #90 (comment)
98c6251
to
62d3908
Compare
No longer needed as of pangeo-forge/pangeo-forge-runner#90! This simplifies pangeo-forge maintenance as well, as we no longer have to match versions of python and apache_beam with whatever is going on here.
- Beam moves reasonably fast - we're at 2.49 now. This brings us a lot of improvements, and it's quite worth moving along with the versions. - When submitting to dataflow, beam actually ships a wheel of the version of beam currently used, and installs it in the container regardless of what version of beam is actually in the container. I'm not sure if this is true in Flink, but definitely true in dataflow. - Newer versions of beam support newer versions of python, so this allows us to stop being pinned to Python 3.9
This varies by apache beam version, no reason for us to test this.
This is the behavior of newer apache beam versions, let's match that. It reverts my suggestion from #82 (comment)
We need to work on better image management to support newer versions of python
This matches what is in the container image bumped up.
Requires that we explicitly specify *everything* required by our pipelines in requirements.txt, but means we no longer have to maintain a complex image here. The complex image also leads to complex hard to debug problems, like #90 (comment)
@yuvipanda for most of our current use cases, which don't require Conda-only dependencies, I think this is 💯 huge win. I think we do have to expect that some recipes may require Conda-only dependencies (for preprocessing data, e.g.). In that case, because beam doesn't yet provide builtin Conda support, what would we recommend to users following this PR? (As unmaintainable/bloated as it was, the suggestion before this PR would've been to add the required Conda dependency to the In general this PR is mindblowing-ly awesome 🤯 😃 . Just want to know what to say to the inevitable Conda question (even if unimplemented at the moment). |
@cisaacstern in that case, the end user would need to make a separate image that has the dependencies they need. We're just changing the defaults. When a custom image is made, they would have to be responsible to make sure they match versions of python, beam, etc. |
Makes sense. In case it wasn't clear from earlier comment, I've been having an impending sense of doom related to the (un)sustainability of the |
No longer needed as of pangeo-forge/pangeo-forge-runner#90! This simplifies pangeo-forge maintenance as well, as we no longer have to match versions of python and apache_beam with whatever is going on here.
No longer needed as of pangeo-forge/pangeo-forge-runner#90! This simplifies pangeo-forge maintenance as well, as we no longer have to match versions of python and apache_beam with whatever is going on here. * Remove a couple more mentions of pangeo/forge * Add note about pangeo-forge
What else is needed for this branch to be merged? I imagine now that I can get reproducible and successful Flink runs off this branch (and only this branch) as talked about on #111 that I should aim to fix the Flink integration test here? Thoughts? |
In testing matrix, replace @Injections with 0.10.3
Getting dataflow integration tests to pass is the main criteria for me, I'm looking into that now... |
Bump gcsfs version in dataflow test
Use explicit version ref for >=0.10 integration tests
Ok I got all tests except Flink to pass here... ... @ranchodeluxe should we just merge this and then you can make #114 against |
If you got things working on DataFlow then works for me 👍 |
@ranchodeluxe all set! @yuvipanda thanks for this major leap forward! |
yay, feels good to get this merged :) |
Following pangeo-forge/pangeo-forge-runner#90. Also sort dependency list alphabetically.
* updated v0.9.2 * MNT: Re-rendered with conda-build 3.27.0, conda-smithy 3.29.0, and conda-forge-pinning 2023.11.21.15.03.38 * Drop runtime dependency on pangeo-forge-recipes Xref pangeo-forge/pangeo-forge-runner#130 * Add fsspec to test.requires To fix `ModuleNotFoundError: No module named 'fsspec'` when running `pangeo-forge-runner --help`. * Remove apache-beam from runtime dependencies Following pangeo-forge/pangeo-forge-runner#90. Also sort dependency list alphabetically. * Add apache-beam to test.requires Fixes `ModuleNotFoundError: No module named 'apache_beam'` when running `pangeo-forge-runner --help`. * MNT: Re-rendered with conda-build 3.27.0, conda-smithy 3.30.4, and conda-forge-pinning 2024.01.22.14.29.27 --------- Co-authored-by: Wei Ji <[email protected]>
This PR started out as a way to unpin ourselves from ancient apache beam version
(2.42) and move to something newer. However, I eventually ran into the following error #90 (comment) that is primarily caused by the fact that we are using a heavy base image (https://github.com/pangeo-data/pangeo-docker-images/tree/master/forge) on top of which beam installs some custom packages with requirements.txt. This leads to hard to debug errors like this, and life is far too short to deal with python dependency problems.
So instead, we finally pull the plug (as we have discussed many times) on using the pangeo/forge image completely, and now start requiring that all dependencies be listed explicitly in
requirements.txt
. Beam automatically determines the image to use based on both the version of python as well as beam, picking one of the images the beam community maintains. These are also far smaller than the pangeo forge image, and everything is cleaner.The complicated weird geopandas error I was running into in the pangeo/forge image is completely gone here!
We also add tests to run on 3 versions of python (3.9, 3.10 and 3.11), and
they all pass, including on dataflow!
In addition, this undos my suggestions about setting parallelism=None in #82 (comment),
and use @cisaacstern's original idea for passing -1. That's what newer versions of
beam do anyway.
TODO: