## Overview

Continuation of job cleanup work in #7050.
Over the course of a year we were able to decouple job cleanup and output metadata collection from the Galaxy backend so that jobs could largely be completed remotely.
To work with truly remote data that is never initially processed by Galaxy using vanilla Galaxy tools, we need to be able to push metadata generation for inputs off into jobs so that it can be executed before command-line generation (along with environment generation, script generation, dependency resolution, etc.), all of which also needs to be delayed and evaluated remotely.
We can broadly divide this work into two big tasks: dealing with remote job setup, and dealing with remote, uninitialized data in APIs and in the database.
## Remote Job Setup

- Command-line generation.
- Environment construction.
- Embed metadata generation in job setup (when needed), covering:
  - Simple datasets.
  - Composite datatypes.
  - Collections.
- Dependency resolution (see the sketch after this list for how these steps might fit together).
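To make the ordering concrete, here is a minimal, hypothetical sketch of a remote setup routine. None of these names are real Galaxy or Pulsar APIs; the point is just that input metadata generation has to run first, because the templated command line may depend on it.

```python
from dataclasses import dataclass, field


@dataclass
class JobSetupState:
    """Everything the remote side accumulates while staging a job."""
    input_metadata: dict = field(default_factory=dict)
    environment: dict = field(default_factory=dict)
    resolved_dependencies: list = field(default_factory=list)
    command_line: str = ""


def remote_job_setup(job_description: dict) -> JobSetupState:
    state = JobSetupState()
    # 1. Generate metadata for inputs that arrived without it. Simple
    #    datasets, composite datatypes, and collections would each need
    #    their own handling; only the simple case is sketched here.
    for name, path in job_description.get("inputs", {}).items():
        state.input_metadata[name] = {"path": path, "ext": path.rsplit(".", 1)[-1]}
    # 2. Resolve tool dependencies, e.g. into environment activation lines.
    state.resolved_dependencies = [
        f"conda activate {requirement}"
        for requirement in job_description.get("requirements", [])
    ]
    # 3. Construct the job environment.
    state.environment = dict(job_description.get("env", {}))
    # 4. Only now template the command line, since it can reference the
    #    metadata collected in step 1.
    paths = {name: meta["path"] for name, meta in state.input_metadata.items()}
    state.command_line = job_description["command_template"].format(**paths)
    return state


state = remote_job_setup({
    "inputs": {"input1": "/staging/dataset_1.fastq"},
    "requirements": ["fastqc=0.11.9"],
    "env": {"GALAXY_SLOTS": "4"},
    "command_template": "fastqc {input1}",
})
print(state.command_line)  # -> fastqc /staging/dataset_1.fastq
```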
## Remote Data

- Extend the object store and data model to support HDA-like entities with lazy, uninitialized metadata (see the sketch after this list).
- Support some new form of upload for these.
- Augment the tool framework to allow inputs to be declared as files instead of datasets for cases where we don't need metadata at all.
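As a sketch of what "lazy, uninitialized metadata" could look like at the model level; the `DeferredDataset` name and its fields are invented for illustration and are not Galaxy's actual data model:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeferredDataset:
    """An HDA-like entity whose data Galaxy has never touched."""
    # The remote location the data lives at; Galaxy never stages it.
    source_uri: str
    # None signals "metadata not yet generated" to APIs and the UI.
    metadata: Optional[dict] = None

    @property
    def metadata_deferred(self) -> bool:
        return self.metadata is None

    def ensure_metadata(self) -> dict:
        """Populate metadata lazily, e.g. from inside a remote job."""
        if self.metadata is None:
            # A real implementation would sniff the datatype, count
            # lines, parse headers, etc. against source_uri.
            self.metadata = {"uri": self.source_uri, "ext": "auto"}
        return self.metadata
```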
We need to establish internal APIs that consume and produce JSON job descriptions detached from the database.
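A minimal sketch of such a description, with invented field names; the point is only that it round-trips through JSON and holds URIs rather than database ids or ORM objects:

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class JobDescription:
    """A job description holding no references into Galaxy's database."""
    tool_id: str
    tool_version: str
    # Inputs are referenced by URI, not by database id.
    inputs: dict = field(default_factory=dict)
    parameters: dict = field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, payload: str) -> "JobDescription":
        return cls(**json.loads(payload))


# Round trip: what Galaxy would hand to a remote setup process.
desc = JobDescription(
    tool_id="fastqc",
    tool_version="0.73",
    inputs={"input1": "gxfiles://bucket/sample.fastq"},
    parameters={"quiet": True},
)
assert JobDescription.from_json(desc.to_json()) == desc
```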
Open questions:

- Is there a way to prevent duplicated generation of metadata for inputs?
- Is there a way to mix and match inputs that have metadata with those that don't?
While a lot of the work from making job completion work remotely will be reusable here (model stores, package decomposition of Galaxy, source URIs on datasets, serializable object stores, runtime integration with Pulsar, structured object store access through Pulsar, etc.), I (@jmchilton) think this is a very large, multi-quarter project. So another open question is: can something useful and smaller be completed within one quarter?
## 2021 Q1 Prototype?
Ignore metadata generation, remote data, environment variables, collections, etc., and simply extend the job-execution package and the two-container pod setup to allow command-line generation from the remote container for the simplest of jobs, with metadata already available?
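For concreteness, a hypothetical entrypoint for the job-side container in such a pod; the file layout and the `command_template`/`inputs` keys are assumptions carried over from the sketches above, not an existing interface:

```python
import json
import sys


def main(description_path: str) -> None:
    # Read a job description staged into the pod by the Galaxy-side container.
    with open(description_path) as handle:
        description = json.load(handle)
    # Metadata is assumed to already be available, so templating the
    # command line is the only remote step the prototype needs.
    command = description["command_template"].format(**description["inputs"])
    # A real entrypoint would exec this; printing keeps the sketch inert.
    print(command)


if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "job_description.json")
```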
My reservation here is that for job completion we had this big list of super awesome, fairly universal architectural advantages we got from decoupling that work from Galaxy (check out #7050 for details). I don't really see those upsides here: this will either all work and be useful in some specific situations... or it won't.