[Bug]: Dependencies from private repositories unable to be seen #25085
Comments
CC: @robertwb Sounds like a regression. Is there a workaround to mitigate this? |
ack, thanks, I'll try to get some eyes here. |
I looked at the PR mentioned as the culprit, and I don't think it's quite the cause, because it does not discard anything that used to work earlier. I'll take a closer look at the bug for other possibilities. |
are your private dependencies listed in |
FWIW, if there is a regression between versions, it should be possible to bisect the regression to an exact commit. |
Re: 'from versions: none' - just to double check, when you changed the Beam version, did you by chance also change the version of the Python interpreter? Could you confirm that it didn't change? |
Sorry I've been away for the past week.
The other part to this is that I've created a base Docker container for my workers on GCP to use. The private Docker image referred to in my Dockerfile
@tvalentyn no, I'm afraid my Python version has remained constant. |
I see. It looks like you may be copying the site-packages directory from a different virtual environment. A recent change, #16658, creates one virtual environment per SDK process. You may have been impacted by this change if you have been using a non-global site-packages directory to store your packages after
Note that dependencies installed in the global Python environment should still be accessible in the individual Python environments, which are created after #16658. |
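A minimal sketch of the behavior described above, using only the standard-library `venv` module. It assumes the per-process environments behave like ordinary venvs created with system site-packages enabled; the thread only states that globally installed dependencies remain accessible, so treat that as an assumption. The helper name `create_env` is hypothetical.

```python
import tempfile
import venv
from pathlib import Path


def create_env(target: Path, inherit_global: bool) -> dict:
    """Create a venv without pip and return its pyvenv.cfg as a dict.

    Hypothetical helper: `inherit_global` maps to the venv option that
    lets the new environment see packages installed globally.
    """
    builder = venv.EnvBuilder(system_site_packages=inherit_global, with_pip=False)
    builder.create(target)

    cfg = {}
    for line in (target / "pyvenv.cfg").read_text().splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that are not "key = value"
            cfg[key.strip()] = value.strip()
    return cfg


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cfg = create_env(Path(tmp) / "env", inherit_global=True)
        # Shows whether the new environment can import globally installed packages.
        print(cfg["include-system-site-packages"])
```

An environment created this way only sees the global site-packages of the interpreter that created it. Packages installed into some other, separately activated venv are not picked up.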
I think 2.44.0 is the first release that includes #16658, which matches the timing you describe. |
Ah ok, yep that sounds like it could be the culprit then. I've noticed that the dataflow docs use The envvar |
The global environment will have packages installed in ./usr/local/lib/python3.8/site-packages. These packages will be available to other environments. If you activate a custom venv, I think it will be ignored now that the codepath has changed in #16658, and a python process creates an individual environment. I suppose you could try to manipulate the PYTHONPATH variable to include your environment, but that may be brittle if you have package mismatches. |
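One way to check this from inside a custom container is a small diagnostic like the sketch below. It uses only the standard library; the helper names `in_virtualenv` and `module_location` are hypothetical, and the module name you pass in would be one of your own private packages.

```python
import importlib.util
import site
import sys
from typing import Optional


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    return sys.prefix != sys.base_prefix


def module_location(name: str) -> Optional[str]:
    """Return where `name` would be imported from, or None if it is not found."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # Raised, for example, when the parent package of a dotted name is missing.
        return None
    return None if spec is None else spec.origin


if __name__ == "__main__":
    print("running in a venv:", in_virtualenv())
    print("site-packages directories:", site.getsitepackages())
    # Replace "json" with the name of a private dependency to see where it resolves.
    print("json resolves to:", module_location("json"))
```

If a private dependency resolves to `None` when this runs with the image's default interpreter, it was probably installed into an environment that interpreter cannot see.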
I'm back working on this now. I tried altering my I'm now experimenting using |
sg, thanks. |
Yep, that worked! My new Dockerfile, in case it helps anyone:
tl;dr for anyone skipping to the end: Make sure your Python packages are installed in
Cheers for your help everyone! Should I close, or would you like it kept open? I guess at a minimum this should be documented somewhere. |
If you'd like, you could modify CHANGES.md to document suggestions or instructions for the change in behavior in 2.44.0, and link this issue. |
Glad to hear you resolved the issue. |
I just tried raising a PR, but it appears that I don't have the needed permissions to push to this repo. This is the diff of my PR:
diff --git a/CHANGES.md b/CHANGES.md
index 871f24bf9d..c7578a8a61 100644
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -254,6 +254,8 @@
runner (such as Dataflow Runner v2) will need to provide this package and its dependencies.
* Slices now use the Beam Iterable Coder. This enables cross language use, but breaks pipeline updates
if a Slice type is used as a PCollection element or State API element. (Go)[#24339](https://github.com/apache/beam/issues/24339)
+* Custom worker Dockerfiles must now install their dependencies in the global python environment. For example, when using poetry
+ you must use `poetry config virtualenvs.create false` before installing deps [#25085](https://github.com/apache/beam/issues/25085)
## Deprecations
diff --git a/website/www/site/content/en/documentation/runtime/environments.md b/website/www/site/content/en/documentation/runtime/environments.md
index 17ee452a57..46a7f69209 100644
--- a/website/www/site/content/en/documentation/runtime/environments.md
+++ b/website/www/site/content/en/documentation/runtime/environments.md
@@ -198,6 +198,7 @@ Beam offers a way to provide your own custom container image. The easiest way to
>The version specified in the `RUN` instruction must match the version used to launch the pipeline.<br>
>**Make sure that the Python or Java runtime version specified in the base image is the same as the version used to run the pipeline.**
+>**NOTE**: When using version >=2.44.0 you must ensure dependencies are installed in the global python environment in the resulting image
2. [Build](https://docs.docker.com/engine/reference/commandline/build/) and [push](https://docs.docker.com/engine/reference/commandline/push/) the image using Docker.
|
np. You might have to fork the repo first to create PRs. Sent you #26471 |
Thanks a lot! |
What happened?
When running a GCP Dataflow job with the Python SDK 2.44.0, I can no longer access my private repositories. It works on 2.43.0.
My set up is as follows:
This is my run command:
According to my Dataflow worker logs, the workers fail to see my private repos:
I think this is the culprit PR: https://github.com/apache/beam/pull/23684/files#diff-cc1f3d7f808c692a6102847bec78809f2e4350c5ee34278100ce0f55d8c23d68R234
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components