Containerized, Database-Free Job Creation #10873

Open
Tracked by #12399
jmchilton opened this issue Dec 8, 2020 · 0 comments

jmchilton commented Dec 8, 2020

Overview

Continuation of job cleanup work in #7050.

Over the course of a year we were able to de-couple the cleanup and metadata collection from the backend so that jobs could largely be completed remotely.

In order to work with truly remote data that is never initially processed by Galaxy with vanilla Galaxy tools - we need to be able to push metadata generation for inputs off into jobs so that it can be executed before command-line generation (and environment generation, script generation, dependency resolution, etc.), all of which also need to be delayed and evaluated remotely.

We can broadly divide this work into two big tasks - dealing with remote job setup and dealing with remote, un-initialized data in APIs and in the database.

Remote Job Setup

  • Command-line generation.
  • Environment construction.
  • Embed metadata generation in job setup (when needed).
    • Simple datasets.
    • Composite datatypes.
    • Collections.
  • Dependency resolution.
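
As a rough illustration of the first item, command-line generation that runs without any database access could look like the sketch below. It assumes input metadata is already available and ignores environment construction and dependency resolution. The function name `render_command` and its arguments are hypothetical, not Galaxy APIs. Galaxy tools use Cheetah templates for their command blocks; `string.Template` is swapped in here only to keep the sketch standard-library only.

```python
import shlex
from string import Template
from typing import Dict


def render_command(template: str, paths: Dict[str, str], params: Dict[str, object]) -> str:
    """Fill a command template from plain dictionaries (hypothetical helper).

    Nothing here touches a database: the caller passes already-resolved
    file paths and parameter values, as a remote container would receive them.
    """
    merged = {**paths, **params}
    # Quote every value so paths with spaces survive shell evaluation.
    quoted = {key: shlex.quote(str(value)) for key, value in merged.items()}
    # substitute() raises KeyError if the template names an unknown variable.
    return Template(template).substitute(quoted)


command = render_command(
    "head -n $count $input1 > $output1",
    {"input1": "/data/my file.txt", "output1": "/data/out.txt"},
    {"count": 5},
)
print(command)
```

The point of the sketch is only that the inputs to command generation can be reduced to serializable values, which is what would let this step move into the remote container.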

Remote Data

  • Support this in the object store and in a data model that allows HDA-like entities with lazy, un-initialized metadata.
  • Support some new form of upload for these.
  • Augment tool framework to allow inputs to be declared as files instead of datasets for when we don't need metadata at all.
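
One way to picture an HDA-like entity with lazy, un-initialized metadata is the minimal sketch below. `LazyDataset` and `metadata_loader` are hypothetical names invented for illustration; this is not a Galaxy model class, and a real implementation would run the loader inside a job rather than in-process.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass
class LazyDataset:
    """Hypothetical HDA-like record whose metadata starts out un-initialized."""

    source_uri: str
    extension: str
    # Caller-supplied function that computes metadata from the URI.
    metadata_loader: Callable[[str], Dict[str, Any]]
    _metadata: Optional[Dict[str, Any]] = field(default=None, repr=False)

    @property
    def metadata_initialized(self) -> bool:
        return self._metadata is not None

    @property
    def metadata(self) -> Dict[str, Any]:
        # Compute on first access only, then reuse the cached value.
        if self._metadata is None:
            self._metadata = self.metadata_loader(self.source_uri)
        return self._metadata


def count_lines_stub(uri: str) -> Dict[str, Any]:
    # Stand-in for real metadata generation; returns a fixed value.
    return {"data_lines": 3}


dataset = LazyDataset("https://example.org/reads.tsv", "tabular", count_lines_stub)
print(dataset.metadata_initialized)  # False until metadata is first requested
print(dataset.metadata)
```

A tool input declared as a plain file would simply never touch `metadata`, so the loader would never run for it.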

We need to establish internal APIs that consume and produce JSON job descriptions that are detached from the database.
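
A database-free job description of this kind might be sketched as plain dataclasses that round-trip through JSON. Every class and field name below (`JobDescription`, `InputDescription`, `inputs_needing_metadata`) is an assumption made for illustration and does not reflect an existing Galaxy schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class InputDescription:
    name: str
    uri: str
    extension: str
    # None means metadata has not been generated yet.
    metadata: Optional[Dict[str, Any]] = None


@dataclass
class JobDescription:
    tool_id: str
    parameters: Dict[str, Any]
    inputs: List[InputDescription] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into the nested InputDescription objects.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "JobDescription":
        raw = json.loads(text)
        inputs = [InputDescription(**item) for item in raw.pop("inputs")]
        return cls(inputs=inputs, **raw)

    def inputs_needing_metadata(self) -> List[str]:
        # Names of inputs whose metadata must be generated during job setup.
        return [item.name for item in self.inputs if item.metadata is None]


job = JobDescription(
    tool_id="example_tool",
    parameters={"lines": 10},
    inputs=[
        InputDescription("input1", "file:///data/a.txt", "txt", {"data_lines": 2}),
        InputDescription("input2", "https://example.org/b.txt", "txt"),
    ],
)
restored = JobDescription.from_json(job.to_json())
print(restored.inputs_needing_metadata())
```

Carrying an optional `metadata` field per input is one possible way to let a single description mix inputs that already have metadata with inputs that do not.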

Open questions:

  • Is there a way to prevent duplicated generation of metadata for inputs?
  • Is there a way to mix and match inputs that have metadata and those that don't?

While a lot of the work from making job completion work remotely will be useful (model stores, package decomposition of Galaxy, source URI on datasets, serializable objectstores, runtime integration with Pulsar, structured objectstore access through Pulsar, etc.) - I (@jmchilton) think this is a very large (multi-quarter) project. So another open question is: can something useful and smaller be completed within one quarter?

2021 Q1 Prototype?

Ignore metadata generation, remote data, environment variables, collections, etc. and simply extend the job-execution package and the two-container pod to allow command-line generation from the remote container for the simplest of jobs with metadata already available?

My reservation here is that for job completion we had this big list of super awesome, fairly universal architectural advantages we got from decoupling that stuff from Galaxy (check out issue for details). I don't really see those upsides here - this will either all work and be useful in some specific situations... or it won't.
