We are doing something similar in cloud-ilastik, in the sense that we are specifying jobs, running them, and collecting their outputs. I would say it's healthy to split the concept of a job specification (let's call it JobSpec here) from the concept of a job runner.
Right now, those two ideas are merged into the class BatchJob, which specifies both what is to be run (some application like ilastik) and how/where it is to be run (local, slurm). There are a few negative implications to this design (sketched right after this list):
For every new backend, every child of BatchJob has to be updated to implement the matching runner in its self.runners dict;
It is not immediately clear to users of the jobs whether they can run on some specific backend, e.g. IlastikPredictions(target='unicore') looks valid, but isn't;
It is not immediately clear what the supported runner backends are;
Runner backend logic isn't abstracted away from the jobs, so all of them have to re-implement the running logic for each backend they plan to support;
Nothing immediately flags it as wrong to subclass BatchJob with an empty self.runners, or with a self.runners that has wrong keys or implementations for runners that don't exist in the rest of the system and therefore won't integrate with other jobs.
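To make the coupling concrete, here is a rough caricature of the current shape (not the actual BatchJob code; the method names and the contents of the runner dict are just for illustration):

```python
# Rough caricature of the current design, NOT the real BatchJob implementation:
# every job both describes itself and owns one runner per backend.
class BatchJob:
    def __init__(self, target: str = "local"):
        self.target = target
        self.runners = {}  # every subclass must fill this for every backend

    def run(self, *args, **kwargs):
        # A typo'd or unsupported target only blows up here, at run time.
        return self.runners[self.target](*args, **kwargs)


class IlastikPredictions(BatchJob):
    def __init__(self, target: str = "local"):
        super().__init__(target)
        self.runners = {
            "local": self._run_local,
            "slurm": self._run_slurm,
            # adding a new backend means touching every job class like this one
        }

    def _run_local(self, *args, **kwargs):
        ...

    def _run_slurm(self, *args, **kwargs):
        ...
```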
I would suggest you consider implementing something akin to a JobSpec base class, which can then be subclassed by all your different jobs. Those child classes would be completely specified in their constructors, so no other part of the system needs to know anything about them beyond the fact that they can be __call__ed with a list of file inputs (or whatever is common to all jobs):
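Something along these lines — a minimal sketch, where the names (JobSpec, the concrete IlastikPredictions fields, the exact ilastik flags) and the choice of having __call__ return the command line are just assumptions for illustration:

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import List


class JobSpec(ABC):
    """Describes *what* to run; knows nothing about local/slurm/unicore."""

    @abstractmethod
    def __call__(self, input_files: List[Path]) -> List[str]:
        """Return the command line to execute for the given inputs."""


class IlastikPredictions(JobSpec):
    # Everything the job needs is fixed in the constructor ...
    def __init__(self, ilastik_bin: Path, project: Path, output_dir: Path):
        self.ilastik_bin = ilastik_bin
        self.project = project
        self.output_dir = output_dir

    # ... so the rest of the system only needs to know it can be called
    # with a list of input files.
    def __call__(self, input_files: List[Path]) -> List[str]:
        return [
            str(self.ilastik_bin), "--headless",
            f"--project={self.project}",
            f"--output_filename_format={self.output_dir}/{{nickname}}.h5",
            *map(str, input_files),
        ]
```

Then different runner implementations could be used to fire those jobs:

```python
# Again only a sketch; the runner names and the sbatch invocation are assumptions.
import subprocess


class LocalRunner:
    """Executes a JobSpec's command directly on the current machine."""

    def run(self, spec: JobSpec, input_files: List[Path]) -> None:
        subprocess.run(spec(input_files), check=True)


class SlurmRunner:
    """Submits the same command through sbatch instead."""

    def __init__(self, partition: str, time_limit: str = "01:00:00"):
        self.partition = partition
        self.time_limit = time_limit

    def run(self, spec: JobSpec, input_files: List[Path]) -> None:
        subprocess.run(
            ["sbatch",
             f"--partition={self.partition}",
             f"--time={self.time_limit}",
             "--wrap", " ".join(spec(input_files))],
            check=True,
        )
```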
And if you eventually find that some particulars of the job specification change depending on where it is supposed to run, you could further derive your job specs to account for that:
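For example (again hypothetical; the resource fields and defaults are made up):

```python
class SlurmJobSpec(JobSpec):
    """A JobSpec that additionally knows its slurm resource requirements."""

    def __init__(self, n_cpus: int, mem_mb: int):
        self.n_cpus = n_cpus
        self.mem_mb = mem_mb


class SlurmIlastikPredictions(SlurmJobSpec):
    def __init__(self, ilastik_bin: Path, project: Path, output_dir: Path,
                 n_cpus: int = 4, mem_mb: int = 16000):
        super().__init__(n_cpus, mem_mb)
        # reuse the plain spec for the actual command line
        self._base = IlastikPredictions(ilastik_bin, project, output_dir)

    def __call__(self, input_files: List[Path]) -> List[str]:
        return self._base(input_files)
```

Then, if you really want to go crazy on the types, you could even make it so that your hypothetical SlurmRunner only accepts SlurmJobSpec jobs:

```python
class SlurmRunner:
    """Only accepts specs that carry slurm resource requirements."""

    def __init__(self, partition: str):
        self.partition = partition

    def run(self, spec: SlurmJobSpec, input_files: List[Path]) -> None:
        # The type annotation lets a checker (and the reader) reject
        # specs that have nothing to say about slurm resources.
        subprocess.run(
            ["sbatch",
             f"--partition={self.partition}",
             f"--cpus-per-task={spec.n_cpus}",
             f"--mem={spec.mem_mb}",
             "--wrap", " ".join(spec(input_files))],
            check=True,
        )
```

I think this might help the architecture scale more easily into other jobs and other runners.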
Thanks for looking into this. I agree with the issues in the current design.
I think what you propose makes sense, but I haven't fully grasped it yet.
For our current use cases, the design issues are not a big problem: I have already added almost all the jobs we will need, and the number of job classes is small enough that I can just manually add runners for a new target.
If we ever decide to make this maintainable "long-term", that is a different question.
(But for that, I would much rather have a clean parent implementation of the JobSpec / JobRunner concept and share it with cloud_ilastik and others.)
@Tomaz-Vieira
I have realized that we don't need the fanciness of running on different targets here; we can just use the usual sbatch ... mechanism to run a full experiment via slurm. Experiments take less than ~1 hour, so there is no need to parallelize at an even more fine-grained level on slurm.
Anyway, I am still very interested in exploring how to use / merge CloudIlastik and cluster_tools.
I will open some issues to discuss this soon and leave this one open for reference until then.