Containerized, Database-Free Job Creation #10873

jmchilton · 2020-12-08T19:08:39Z

Overview

Continuation of job cleanup work in #7050.

Over the course of a year we were able to de-couple the cleanup and metadata collection from the backend so that jobs could largely be completed remotely.

To work with truly remote data that Galaxy never processes up front, using vanilla Galaxy tools, we need to push metadata generation for inputs into the jobs themselves so that it can run before command-line generation. Command-line generation (along with environment generation, script generation, dependency resolution, etc.) must in turn be delayed and evaluated remotely.

We can broadly divide this work into two big tasks - dealing with remote job setup and dealing with remote, un-initialized data in APIs and in the database.

Remote Job Setup

- Command-line generation.
- Environment construction.
- Embed metadata generation in job setup (when needed).
  - Simple datasets.
  - Composite datatypes.
  - Collections.
- Dependency resolution.
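The ordering constraint behind this list can be sketched roughly as follows: metadata for inputs is generated first (only where it is missing), and only then are the environment and command line built. This is a minimal illustration, not Galaxy's implementation. The function names, the dictionary layout, and the use of `string.Template` in place of Galaxy's real templating are all assumptions made for the example, and dependency resolution is left out entirely.

```python
import shlex
from string import Template


def generate_metadata(dataset):
    """Hypothetical stand-in for datatype-specific metadata collection."""
    with open(dataset["path"], "rb") as handle:
        return {"data_lines": sum(1 for _ in handle)}


def setup_job_remotely(job):
    """Run job setup steps in the order a remote container would need them."""
    # 1. Metadata generation, only for inputs that arrived without metadata.
    for dataset in job["inputs"].values():
        if dataset.get("metadata") is None:
            dataset["metadata"] = generate_metadata(dataset)

    # 2. Environment construction (here: just a copy of the declared variables).
    environment = dict(job.get("environment", {}))

    # 3. Command-line generation, which may now reference input metadata.
    params = {}
    for name, dataset in job["inputs"].items():
        params[name] = shlex.quote(dataset["path"])
        for key, value in dataset["metadata"].items():
            params[f"{name}_{key}"] = shlex.quote(str(value))
    command = Template(job["command_template"]).substitute(params)

    return {"command": command, "environment": environment}
```

Because step 3 reads values produced in step 1, the command line cannot be generated on the Galaxy server if metadata only becomes available on the remote side, which is why both steps move together.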

Remote Data

- Support this in the object store and in a data model with HDA-like entities whose metadata is lazy and uninitialized.
- Support some new form of upload for these entities.
- Augment the tool framework to allow inputs to be declared as files instead of datasets, for cases where we don't need metadata at all.
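One way to picture an HDA-like entity with lazy, uninitialized metadata is an object that only knows its source URI until something actually asks for metadata. The sketch below is purely illustrative: `LazyDataset`, `metadata_loader`, and `has_metadata` are hypothetical names, not part of Galaxy's data model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class LazyDataset:
    """Hypothetical HDA-like entity whose metadata is computed on first access."""

    uri: str
    metadata_loader: Callable[[str], Dict[str, object]]
    _metadata: Optional[Dict[str, object]] = field(default=None, repr=False)

    @property
    def has_metadata(self) -> bool:
        # True once metadata has been generated (or supplied up front).
        return self._metadata is not None

    @property
    def metadata(self) -> Dict[str, object]:
        # Generate metadata at most once, and only when it is requested.
        if self._metadata is None:
            self._metadata = self.metadata_loader(self.uri)
        return self._metadata
```

A tool input declared as a plain file would only ever touch `uri`, so the loader would never run for it.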

We need to establish internal APIs that consume and produce JSON job descriptions detached from the database.
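As a rough illustration of what a database-detached job description could look like, the sketch below round-trips a small structure through JSON with no ORM objects or sessions involved. The field names (`tool_id`, `command_template`, `inputs`, `environment`) are assumptions chosen for the example and do not reflect an existing Galaxy schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass
class JobDescription:
    """Hypothetical JSON-serializable job description with no database ties."""

    tool_id: str
    command_template: str
    inputs: Dict[str, dict] = field(default_factory=dict)
    environment: Dict[str, str] = field(default_factory=dict)

    def to_json(self) -> str:
        # Plain dicts, lists, and strings only, so any process can consume this.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "JobDescription":
        return cls(**json.loads(payload))
```

An input entry can carry `"metadata": null` to signal that metadata still has to be generated by whoever consumes the description.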

Open questions:

- Is there a way to prevent duplicated generation of metadata for inputs?
- Is there a way to mix and match inputs that have metadata and those that don't?

While a lot of the work from making job completion work remotely will be useful (model stores, package decomposition of Galaxy, source URI on datasets, serializable objectstores, runtime integration with Pulsar, structured objectstore access through Pulsar, etc.), I (@jmchilton) think this is a very large, multi-quarter project. So another open question is: can something useful and smaller be completed within one quarter?

2021 Q1 Prototype?

Ignore metadata generation, remote data, environment variables, collections, etc. and simply extend the job-execution package and two-container pod to allow command-line generation from the remote container for the simplest of jobs with metadata already available?

My reservation here is that for job completion we had this big list of super awesome, fairly universal architectural advantages we got from decoupling that stuff from Galaxy (check out issue for details). I don't really see those upsides here - this will either all work and be useful in some specific situations... or it won't.

jmchilton added kind/feature area/jobs area/objectstore labels Dec 8, 2020

jmchilton mentioned this issue Apr 5, 2021

Directions to Improve Distributed Data Handling in Galaxy #11787

Open

This was referenced Apr 12, 2021

Initial Support for Consuming GA4GH DRS URIs #11819

Open

Prototype New Tool Submission/Job Creation Endpoint Based on Celery #11820

Open

Executive summary of 2021 Q2 Backend Goals #11824

Closed

jmchilton mentioned this issue Sep 1, 2021

Executive summary of 22.01 Release Cycle Backend Working Group Goals #12399

Closed

24 tasks
