Architecture suggestion - explicit constructors, separate the notion of runner from job specification #5

Open
Tomaz-Vieira opened this issue Apr 8, 2020 · 2 comments

Comments

@Tomaz-Vieira

We are doing something similar in cloud-ilastik, in the sense that we are specifying jobs, running them, and collecting their outputs. I would say it's healthy to split the concept of a job specification (let's call it JobSpec here) from the concept of a job runner.

Right now, those two ideas are merged into the class BatchJob, which specifies both what is to be run (some application like ilastik) and how/where it is to be run (local, slurm). There are a few negative implications to this design (a rough sketch of the current pattern follows the list):

  • For every new backend, every child of BatchJob has to be updated to implement the matching backend in its self.runners dict;
  • It is not immediately clear to users of the jobs whether they can run on some specific backend, e.g. IlastikPredictions(target='unicore') looks valid, but isn't;
  • It's not immediately clear what the supported runner backends are;
  • Job runner backend logic isn't abstracted away from the jobs, requiring all of them to re-implement the running logic for each backend they plan to support;
  • Nothing immediately flags it as wrong to subclass BatchJob with an empty self.runners, or with a self.runners that has wrong keys or implementations for runners that don't exist in the rest of the system and therefore won't integrate with other jobs.
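
To make the coupling concrete, here is a rough sketch of the pattern as I understand it (only BatchJob, self.runners and the target argument are taken from the actual code; the method names are my guesses):

class BatchJob:
    def __init__(self, target: str):
        self.target = target
        self.runners = {}  # each subclass is expected to fill this in

    def run(self, inputs):
        # nothing guarantees that self.runners has an entry for self.target
        return self.runners[self.target](inputs)

class IlastikPredictions(BatchJob):
    def __init__(self, target: str):
        super().__init__(target)
        self.runners = {
            "local": self._run_local,
            "slurm": self._run_slurm,
            # adding a new backend means touching every subclass like this one
        }

    def _run_local(self, inputs): ...
    def _run_slurm(self, inputs): ...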

I would suggest you consider implementing something akin to a JobSpec base class, which can then be subclassed by all your different jobs. Those child classes would be completely specified in their constructors, so no other part of the system needs to know anything about them other than that they can be __call__ed with a list of file inputs (or whatever else is common to all jobs):

from abc import ABC, abstractmethod

class JobSpec(ABC):
    # describes *what* to run, independently of where/how it runs
    @abstractmethod
    def __call__(self, inputs) -> JobResult:
        pass

class IlastikPrediction(JobSpec):
    def __init__(self, ilastik_thing1: str, ilastik_thing2: float, ...):
        pass

    def __call__(self, inputs) -> JobResult:
        ...

Then different runner implementations could be used to fire those jobs:

slurm_runner = SlurmRunner(...)
local_runner = LocalRunner(...)

job_specs: List[JobSpec] = [IlastikPrediction(...), Preprocess(...), IlastikPrediction(...)]
slurm_runner.run_jobs(job_specs)
local_runner.run_jobs(job_specs)
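
For completeness, the runner side could look roughly like this (a minimal sketch of my own; where the inputs come from — the runner's constructor, an argument to run_jobs, or the spec itself — is an open choice that I fix here only to keep the example self-contained):

from abc import ABC, abstractmethod
from typing import List, Sequence

class JobRunner(ABC):
    # knows how/where to execute jobs, but nothing about what they do
    @abstractmethod
    def run_jobs(self, job_specs: Sequence[JobSpec]) -> List[JobResult]:
        ...

class LocalRunner(JobRunner):
    # simplest possible backend: execute every spec in the current process
    def __init__(self, inputs):
        self.inputs = inputs

    def run_jobs(self, job_specs: Sequence[JobSpec]) -> List[JobResult]:
        return [spec(self.inputs) for spec in job_specs]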

And if you eventually find that some particularities of the job specification change depending on where it is supposed to run, you could further derive your job specs to account for that:

class ISlurmJobSpec(ABC):
    # extra interface for anything a slurm run needs on top of a plain JobSpec
    @property
    @abstractmethod
    def some_slurm_param(self) -> int:
        pass

class IlastikSlurmPrediction(IlastikPrediction, ISlurmJobSpec):
    def __init__(self, ilastik_stuff, slurm_stuff: int):
        self.slurm_stuff = slurm_stuff
        ...

    @property
    def some_slurm_param(self) -> int:
        return self.slurm_stuff

Then, if you really want to go crazy on the types, you could even make it so that your hypothetical slurm runner only accepts ISlurmJobSpec jobs:

class SlurmJobRunner(JobRunner):
    def run_jobs(self, job_specs: Sequence[ISlurmJobSpec]) -> List[JobResult]:
        ...
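
With that in place, a static type checker (mypy or similar) would catch handing a non-slurm spec to the slurm runner, roughly like this (class names as in the sketches above):

slurm_runner = SlurmJobRunner(...)
slurm_runner.run_jobs([IlastikPrediction(...)])       # flagged: not an ISlurmJobSpec
slurm_runner.run_jobs([IlastikSlurmPrediction(...)])  # fine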

I think this might help the architecture scale more easily to other jobs and other runners.

@constantinpape
Contributor

Thanks for looking into this. I agree with the issues in the current design.
I think what you propose makes sense, but I haven't fully grasped it yet.

For our current use cases, the design issues are not a big problem, because I have already added almost all of the jobs we will need, and the number of job classes is small enough that I can just manually add runners for a new target.

If we ever decide to make this maintainable "long-term", that is a different question.
(But for that, I would much rather have a clean parent implementation of the JobSpec / JobRunner concept and share it with cloud_ilastik and others.)

@constantinpape
Contributor

@Tomaz-Vieira
I have realized that we don't need the fancy machinery for running on different targets here; we can just use the usual
sbatch ... mechanism to run a full experiment via slurm. Experiments take at most ~1 hr, so there is no need to parallelize at an even finer-grained level on slurm.

Anyway, I am still very interested in exploring how to use / merge CloudIlastik and cluster_tools.
I will make some issues to discuss this soon and will leave this one open as a reference until then.
