Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move to src/ layout? #813

Closed
MattWellie opened this issue Jul 4, 2024 · 1 comment
Closed

Move to src/ layout? #813

MattWellie opened this issue Jul 4, 2024 · 1 comment
Labels
core Changes to the cpg-workflows api enhancement New feature or request shared The change impacts both the LC and RD pipeline.

Comments

@MattWellie
Copy link
Contributor

Inspired by the python best practices @dancoates mentioned earlier in the week

https://www.pyopensci.org/python-package-guide/package-structure-code/python-package-structure.html

We have a fundamental issue with our usage of cpg_workflows in Cloud infrastructure:

  • we use a cpg_workflows docker image, which installs a version of the production-pipelines codebase
  • we're using a flat layout (cpg_workflows is at the root of the project folder)
  • the first operation inside the driver image is to run a git clone of production-pipelines && cd into that folder
  • this places the local folder cpg_workflows at the front of the Python PATH (i.e. when main.py imports stages from cpg_workflows.X it uses the git cloned version of the code, not the installed version. The packages and versions available in the environment at runtime are sourced from the installation in the image). This is an inherent conflict
  • when we create a batch using main.py, we're in a position where cpg_utils, or seqr-loading-pipelines are sourced from the installation inside the image, while cpg_workflows is sourced from the cloned folder.
  • when we use query_command we pickle a bit of source code, pass it into a new VM, and run it there. When this code is run, we import the packages inside that image. i.e. for every one of our query_modules scripts which is run using query_command, we potentially import different versions of the cpg_workflows library (here) compared with when main.py was run to create the batch jobs.
  • Yes, this issue is just Remove/modify Query_command? #647 in disguise, but seeing as Extract 'Alignment' and 'Genotype' to be separate independent pipelines #783 has been opened to split our pipelines into separate smaller pipelines for versioning reasons, it seems to me like we should address the more fundamental issue of 'which code are we even importing' before we start splitting up our codebase for more advanced versioning purposes.

Something as basic as moving to a src layout could be a partial solution to this. We would never be in a position to accidentally import from cpg_workflows import X and pick up the cloned code instead of the installed code. That would create stability between the code running when creating a batch, and the code running inside a batch.

The issue with the flat layout we have is that you can accidentally run code you don't intend to. Normally running in a container with pre-installed versions would overcome this, but analysis-runner clones the codebase into the running container before doing anything else, so we're right back at this vulnerability.

This change would make the pipeline more cumbersome to run while undergoing development - to test new changes, you need to push a new image and use it as a driver image - but it would add zero burden to code in main, and stability there should be the most important factor when we are trying to nail down exact versions of our tools and code.

@MattWellie MattWellie added enhancement New feature or request core Changes to the cpg-workflows api shared The change impacts both the LC and RD pipeline. labels Jul 10, 2024
@MattWellie
Copy link
Contributor Author

Now ain't the time

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Changes to the cpg-workflows api enhancement New feature or request shared The change impacts both the LC and RD pipeline.
Projects
None yet
Development

No branches or pull requests

1 participant