Support arbitrary python & beam versions, stop using pangeo/forge image #90

Merged
cisaacstern merged 17 commits into main from unpin-beam on Oct 24, 2023

yuvipanda
Collaborator

@yuvipanda yuvipanda commented Aug 19, 2023

This PR started out as a way to unpin ourselves from an ancient Apache Beam version (2.42) and move to something newer. However, I eventually ran into the error in #90 (comment), which is primarily caused by our use of a heavy base image (https://github.com/pangeo-data/pangeo-docker-images/tree/master/forge), on top of which Beam installs some custom packages from requirements.txt. This leads to hard-to-debug errors like this one, and life is far too short to deal with Python dependency problems.

So instead, we finally pull the plug (as we have discussed many times) on using the pangeo/forge image completely, and now require that all dependencies be listed explicitly in requirements.txt. Beam automatically determines the image to use based on the versions of both Python and Beam, picking one of the images the Beam community maintains. These are also far smaller than the pangeo/forge image, and everything is cleaner.
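
As a rough illustration of that image selection (a sketch, not the runner's or Beam's actual code): the SDK images the Beam community publishes on Docker Hub follow the naming pattern `apache/beam_python<major>.<minor>_sdk:<beam version>`. The helper name `default_beam_image` is hypothetical, and a given runner such as Dataflow may pull from its own mirror of these images rather than Docker Hub.

```python
import sys


def default_beam_image(beam_version, python_version=None):
    """Return the conventional Beam SDK image name for a Python/Beam pair.

    Hypothetical helper: it only mirrors the Docker Hub naming pattern
    apache/beam_python<major>.<minor>_sdk:<beam_version>.
    """
    if python_version is None:
        # Default to the interpreter that is submitting the job.
        python_version = sys.version_info[:2]
    major, minor = python_version[:2]
    return f"apache/beam_python{major}.{minor}_sdk:{beam_version}"


print(default_beam_image("2.49.0", (3, 10)))
# apache/beam_python3.10_sdk:2.49.0
```

The point is that the image is a function of only two inputs, so nothing else has to be kept in sync by hand.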

The complicated weird geopandas error I was running into in the pangeo/forge image is completely gone here!

We also add tests that run on three versions of Python (3.9, 3.10 and 3.11), and they all pass, including on Dataflow!

In addition, this undoes my suggestion about setting parallelism=None in #82 (comment), and uses @cisaacstern's original idea of passing -1. That's what newer versions of Beam do anyway.

TODO:

  • What to do for recipes versions < 0.10?

@yuvipanda yuvipanda added the test-dataflow Add this label to PRs to trigger Dataflow integration test. label Aug 19, 2023
@codecov

codecov bot commented Aug 19, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (241c167) 96.01% compared to head (dc3b25a) 96.01%.
Report is 2 commits behind head on main.

❗ Current head dc3b25a differs from pull request most recent head 37d2710. Consider uploading reports for the commit 37d2710 to get more accurate results

Additional details and impacted files
@@           Coverage Diff           @@
##             main      #90   +/-   ##
=======================================
  Coverage   96.01%   96.01%           
=======================================
  Files          14       14           
  Lines         452      452           
=======================================
  Hits          434      434           
  Misses         18       18           
Files Coverage Δ
pangeo_forge_runner/bakery/flink.py 93.75% <ø> (-0.37%) ⬇️
pangeo_forge_runner/commands/bake.py 90.29% <100.00%> (+0.29%) ⬆️


@yuvipanda
Collaborator Author

Need to figure out why the dataflow tests are failing.

@yuvipanda
Collaborator Author

This is the error from dataflow

Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.10/runpy.py", line 187, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/srv/conda/envs/notebook/lib/python3.10/runpy.py", line 110, in _get_module_details
    __import__(pkg_name)
  File "/srv/conda/envs/notebook/lib/python3.10/site-packages/apache_beam/__init__.py", line 88, in <module>
    from apache_beam import io
  File "/srv/conda/envs/notebook/lib/python3.10/site-packages/apache_beam/io/__init__.py", line 36, in <module>
    from apache_beam.io.gcp.bigquery import *
  File "/srv/conda/envs/notebook/lib/python3.10/site-packages/apache_beam/io/gcp/bigquery.py", line 384, in <module>
    from apache_beam.io.gcp import bigquery_schema_tools
  File "/srv/conda/envs/notebook/lib/python3.10/site-packages/apache_beam/io/gcp/bigquery_schema_tools.py", line 31, in <module>
    import apache_beam.io.gcp.bigquery_tools
  File "/srv/conda/envs/notebook/lib/python3.10/site-packages/apache_beam/io/gcp/bigquery_tools.py", line 76, in <module>
    from google.cloud import bigquery as gcp_bigquery
  File "/srv/conda/envs/notebook/lib/python3.10/site-packages/google/cloud/bigquery/__init__.py", line 35, in <module>
    from google.cloud.bigquery.client import Client
  File "/srv/conda/envs/notebook/lib/python3.10/site-packages/google/cloud/bigquery/client.py", line 76, in <module>
    from google.cloud.bigquery import _job_helpers
  File "/srv/conda/envs/notebook/lib/python3.10/site-packages/google/cloud/bigquery/_job_helpers.py", line 24, in <module>
    from google.cloud.bigquery import job
  File "/srv/conda/envs/notebook/lib/python3.10/site-packages/google/cloud/bigquery/job/__init__.py", line 27, in <module>
    from google.cloud.bigquery.job.copy_ import CopyJob
  File "/srv/conda/envs/notebook/lib/python3.10/site-packages/google/cloud/bigquery/job/copy_.py", line 22, in <module>
    from google.cloud.bigquery.table import TableReference
  File "/srv/conda/envs/notebook/lib/python3.10/site-packages/google/cloud/bigquery/table.py", line 43, in <module>
    import geopandas  # type: ignore
  File "/srv/conda/envs/notebook/lib/python3.10/site-packages/geopandas/__init__.py", line 1, in <module>
    from geopandas._config import options  # noqa
  File "/srv/conda/envs/notebook/lib/python3.10/site-packages/geopandas/_config.py", line 109, in <module>
    default_value=_default_use_pygeos(),
  File "/srv/conda/envs/notebook/lib/python3.10/site-packages/geopandas/_config.py", line 95, in _default_use_pygeos
    import geopandas._compat as compat
  File "/srv/conda/envs/notebook/lib/python3.10/site-packages/geopandas/_compat.py", line 251, in <module>
    import rtree  # noqa
  File "/srv/conda/envs/notebook/lib/python3.10/site-packages/rtree/__init__.py", line 9, in <module>
    from .index import Index, Rtree  # noqa
  File "/srv/conda/envs/notebook/lib/python3.10/site-packages/rtree/index.py", line 17, in <module>
    from . import core
  File "/srv/conda/envs/notebook/lib/python3.10/site-packages/rtree/core.py", line 74, in <module>
    rt = finder.load()
  File "/srv/conda/envs/notebook/lib/python3.10/site-packages/rtree/finder.py", line 118, in load
    raise OSError("Could not load libspatialindex_c library")
OSError: Could not load libspatialindex_c library

So, why in the fuck is this even fucking trying to load geopandas of all the things, and then failing? Sigh.

@yuvipanda
Collaborator Author

If you try doing this yourself manually, it works:

$ docker run -it quay.io/pangeo/forge:554675c  /bin/bash
$ python
>>> from apache_beam import io
>>> 

So, dataflow is doing something to the container that's fucking things up.

werecomputersamistake.com

yuvipanda added a commit that referenced this pull request Aug 19, 2023
Requires that we explicitly specify *everything*
required by our pipelines in requirements.txt, but means
we no longer have to maintain a complex image here. The
complex image also leads to complex hard to debug problems,
like #90 (comment)
@yuvipanda yuvipanda changed the title Unpin version of apache_beam Stop using pangeo/forge image Aug 20, 2023
@yuvipanda yuvipanda changed the title Stop using pangeo/forge image Support arbitrary python & beam versions, stop using pangeo/forge image Aug 20, 2023
@yuvipanda yuvipanda force-pushed the unpin-beam branch 2 times, most recently from 98c6251 to 62d3908 Compare August 20, 2023 01:57
yuvipanda added a commit to yuvipanda/pangeo-docker-images that referenced this pull request Aug 20, 2023
No longer needed as of
pangeo-forge/pangeo-forge-runner#90!

This simplifies pangeo-forge maintenance as well, as we no
longer have to match versions of python and apache_beam with
whatever is going on here.
- Beam moves reasonably fast - we're at 2.49 now. This brings
  us a lot of improvements, and it's quite worth moving along
  with the versions.
- When submitting to dataflow, beam actually ships a wheel of the
  version of beam currently used, and installs it in the container
  regardless of what version of beam is actually in the container.
  I'm not sure if this is true in Flink, but definitely true in
  dataflow.
- Newer versions of beam support newer versions of python, so this
  allows us to stop being pinned to Python 3.9

This varies by apache beam version, no reason for us to
test this.

This is the behavior of newer apache beam versions,
let's match that. It reverts my suggestion
from #82 (comment)

We need to work on better image management to support newer
versions of python

This matches what is in the container image bumped up.

Requires that we explicitly specify *everything*
required by our pipelines in requirements.txt, but means
we no longer have to maintain a complex image here. The
complex image also leads to complex hard to debug problems,
like #90 (comment)
@cisaacstern
Member

cisaacstern commented Aug 21, 2023

@yuvipanda for most of our current use cases, which don't require Conda-only dependencies, I think this is a 💯 huge win.

I think we do have to expect that some recipes may require Conda-only dependencies (for preprocessing data, e.g.). In that case, because beam doesn't yet provide builtin Conda support, what would we recommend to users following this PR? (As unmaintainable/bloated as it was, the suggestion before this PR would've been to add the required Conda dependency to the pangeo/forge image.)

In general this PR is mindblowing-ly awesome 🤯 😃 . Just want to know what to say to the inevitable Conda question (even if unimplemented at the moment).

@yuvipanda
Collaborator Author

@cisaacstern in that case, the end user would need to make a separate image that has the dependencies they need. We're just changing the defaults. When a custom image is made, they would be responsible for making sure it matches the versions of Python, Beam, etc. used to submit the job.
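
A minimal sketch of what such a check could look like, assuming the versions inside the custom image are known ahead of time (for example, recorded when the image was built). The function name `find_version_mismatches` and the dictionary layout are hypothetical and not part of pangeo-forge-runner.

```python
def find_version_mismatches(local, image):
    """Compare versions in the submitting environment against the image.

    Both arguments map a package name (e.g. "python", "apache-beam") to a
    version string. Python is compared on major.minor only; every other
    package must match exactly.
    """
    problems = []
    for name, local_version in local.items():
        image_version = image.get(name)
        if image_version is None:
            problems.append(f"{name}: missing from image")
            continue
        if name == "python":
            same = local_version.split(".")[:2] == image_version.split(".")[:2]
        else:
            same = local_version == image_version
        if not same:
            problems.append(
                f"{name}: local {local_version} != image {image_version}"
            )
    return problems


local = {"python": "3.10.12", "apache-beam": "2.49.0"}
image = {"python": "3.10.8", "apache-beam": "2.42.0"}
print(find_version_mismatches(local, image))
# ['apache-beam: local 2.49.0 != image 2.42.0']
```

In practice the local values could come from `platform.python_version()` and `importlib.metadata.version("apache-beam")`; how the image's values are obtained is left open here.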

@cisaacstern
Member

Makes sense. In case it wasn't clear from my earlier comment, I've been having an impending sense of doom related to the (un)sustainability of the pangeo/forge default image, so I'm super grateful for this PR and the work that you put into it. I think it puts us in a much better position.

yuvipanda added a commit to yuvipanda/pangeo-docker-images that referenced this pull request Sep 13, 2023
No longer needed as of
pangeo-forge/pangeo-forge-runner#90!

This simplifies pangeo-forge maintenance as well, as we no
longer have to match versions of python and apache_beam with
whatever is going on here.
weiji14 pushed a commit to pangeo-data/pangeo-docker-images that referenced this pull request Sep 14, 2023
No longer needed as of
pangeo-forge/pangeo-forge-runner#90!

This simplifies pangeo-forge maintenance as well, as we no
longer have to match versions of python and apache_beam with
whatever is going on here.

* Remove a couple more mentions of pangeo/forge
* Add note about pangeo-forge
@ranchodeluxe
Collaborator

What else is needed for this branch to be merged? I imagine now that I can get reproducible and successful Flink runs off this branch (and only this branch) as talked about on #111 that I should aim to fix the Flink integration test here? Thoughts?

@cisaacstern cisaacstern added the test-flink Add this label to PRs to trigger Dataflow integration test. label Oct 23, 2023
@cisaacstern
Member

> What else is needed for this branch to be merged?

Getting Dataflow integration tests to pass is the main criterion for me; I'm looking into that now...

@cisaacstern
Member

Ok I got all tests except Flink to pass here...

... @ranchodeluxe should we just merge this and then you can make #114 against main?

@ranchodeluxe
Collaborator

ranchodeluxe commented Oct 24, 2023

> Ok I got all tests except Flink to pass here...
>
> ... @ranchodeluxe should we just merge this and then you can make #114 against main?

If you got things working on DataFlow then works for me 👍

@cisaacstern cisaacstern merged commit 23ec9a0 into main Oct 24, 2023
@cisaacstern
Member

@ranchodeluxe all set!

@yuvipanda thanks for this major leap forward!

@yuvipanda
Collaborator Author

yay, feels good to get this merged :)

@yuvipanda yuvipanda deleted the unpin-beam branch November 3, 2023 12:26
weiji14 added a commit to regro-cf-autotick-bot/pangeo-forge-runner-feedstock that referenced this pull request Nov 21, 2023
Following pangeo-forge/pangeo-forge-runner#90. Also sort dependency list alphabetically.
weiji14 added a commit to conda-forge/pangeo-forge-runner-feedstock that referenced this pull request Jan 23, 2024
* updated v0.9.2

* MNT: Re-rendered with conda-build 3.27.0, conda-smithy 3.29.0, and conda-forge-pinning 2023.11.21.15.03.38

* Drop runtime dependency on pangeo-forge-recipes

Xref pangeo-forge/pangeo-forge-runner#130

* Add fsspec to test.requires

To fix `ModuleNotFoundError: No module named 'fsspec'` when running `pangeo-forge-runner --help`.

* Remove apache-beam from runtime dependencies

Following pangeo-forge/pangeo-forge-runner#90. Also sort dependency list alphabetically.

* Add apache-beam to test.requires

Fixes `ModuleNotFoundError: No module named 'apache_beam'` when running `pangeo-forge-runner --help`.

* MNT: Re-rendered with conda-build 3.27.0, conda-smithy 3.30.4, and conda-forge-pinning 2024.01.22.14.29.27

---------

Co-authored-by: Wei Ji