Architecture suggestion - explicit constructors, separate the notion of runner from job specification #5

Open
Tomaz-Vieira opened this issue Apr 8, 2020 · 2 comments

Comments

@Tomaz-Vieira

We are doing something similar in cloud-ilastik, in the sense that we are specifying jobs, running them, and collecting their outputs. I would say it's healthy to split the concept of a job specification (let's call it JobSpec here) from the concept of a job runner.

Right now, those two ideas are merged into the class BatchJob, which specifies both what is to be run (some application like ilastik) and how/where it is to be run (local, slurm). There are a few negative implications to this design (a rough sketch of the current pattern follows the list):

  • For every new backend, every child of BatchJob has to be updated to implement the matching backend in its self.runners dict;
  • It is not immediately clear to users of the jobs whether they can run on some specific backend, e.g. IlastikPredictions(target='unicore') looks valid, but isn't;
  • It's not immediately clear what the supported runner backends are;
  • Job runner backend logic isn't abstracted away from the jobs, requiring all of them to re-implement the running logic for each backend they plan to support;
  • Nothing immediately flags it as wrong to subclass BatchJob with an empty self.runners, or with a self.runners that has wrong keys or implementations for runners that don't exist in the rest of the system and therefore won't integrate with other jobs.
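
To make the coupling concrete, here is a rough sketch of the pattern as I understand it (only BatchJob, self.runners and the target argument are taken from the actual code; the method names are my guesses):

class BatchJob:
    def __init__(self, target: str):
        self.target = target
        self.runners = {}  # each subclass is expected to fill this in

    def run(self, inputs):
        # nothing guarantees that self.runners has an entry for self.target
        return self.runners[self.target](inputs)

class IlastikPredictions(BatchJob):
    def __init__(self, target: str):
        super().__init__(target)
        self.runners = {
            "local": self._run_local,
            "slurm": self._run_slurm,
            # adding a new backend means touching every subclass like this one
        }

    def _run_local(self, inputs): ...
    def _run_slurm(self, inputs): ...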

I would suggest you consider implementing something akin to a JobSpec base class, which can then be subclassed by all your different jobs. Those child classes would be completely specified in their constructors, so no other part of the system needs to know anything about them other than that they can be __call__ed with a list of file inputs (or whatever else is common to all jobs):

from abc import ABC, abstractmethod

class JobSpec(ABC):
    # describes *what* to run, independently of where/how it runs
    @abstractmethod
    def __call__(self, inputs) -> JobResult:
        pass

class IlastikPrediction(JobSpec):
    def __init__(self, ilastik_thing1: str, ilastik_thing2: float, ...):
        pass

    def __call__(self, inputs) -> JobResult:
        ...

Then different runner implementations could be used to fire those jobs:

slurm_runner = SlurmRunner(...)
local_runner = LocalRunner(...)

job_specs: List[JobSpec] = [IlastikPrediction(...), Preprocess(...), IlastikPrediction(...)]
slurm_runner.run_jobs(job_specs)
local_runner.run_jobs(job_specs)
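
For completeness, the runner side could look roughly like this (a minimal sketch of my own; where the inputs come from — the runner's constructor, an argument to run_jobs, or the spec itself — is an open choice that I fix here only to keep the example self-contained):

from abc import ABC, abstractmethod
from typing import List, Sequence

class JobRunner(ABC):
    # knows how/where to execute jobs, but nothing about what they do
    @abstractmethod
    def run_jobs(self, job_specs: Sequence[JobSpec]) -> List[JobResult]:
        ...

class LocalRunner(JobRunner):
    # simplest possible backend: execute every spec in the current process
    def __init__(self, inputs):
        self.inputs = inputs

    def run_jobs(self, job_specs: Sequence[JobSpec]) -> List[JobResult]:
        return [spec(self.inputs) for spec in job_specs]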

And if you eventually find that some particularities of the job specification change depending on where it is supposed to run, you could further derive your job specs to account for that:

class ISlurmJobSpec(ABC):
    # extra interface for anything a slurm run needs on top of a plain JobSpec
    @property
    @abstractmethod
    def some_slurm_param(self) -> int:
        pass

class IlastikSlurmPrediction(IlastikPrediction, ISlurmJobSpec):
    def __init__(self, ilastik_stuff, slurm_stuff: int):
        self.slurm_stuff = slurm_stuff
        ...

    @property
    def some_slurm_param(self) -> int:
        return self.slurm_stuff

Then, if you really want to go crazy on the types, you could even make it so that your hypothetical slurm runner only accepts ISlurmJobSpec jobs:

class SlurmJobRunner(JobRunner):
    def run_jobs(self, job_specs: Sequence[ISlurmJobSpec]) -> List[JobResult]:
        ...
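
With that in place, a static type checker (mypy or similar) would catch handing a non-slurm spec to the slurm runner, roughly like this (class names as in the sketches above):

slurm_runner = SlurmJobRunner(...)
slurm_runner.run_jobs([IlastikPrediction(...)])       # flagged: not an ISlurmJobSpec
slurm_runner.run_jobs([IlastikSlurmPrediction(...)])  # fine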

I think this might help the architecture scale more easily to other jobs and other runners.

@constantinpape
Contributor

Thanks for looking into this. I agree with the issues in the current design.
I think what you propose makes sense, but I haven't fully grasped it yet.

For our current use cases, the design issues are not a big problem, because I have already added almost all of the jobs we will need, and the number of job classes is small enough that I can just manually add runners for a new target.

If we ever decide to make this maintainable "long-term", that is a different question.
(But for that, I would much rather have a clean parent implementation of the JobSpec / JobRunner concept and share it with cloud_ilastik and others.)

@constantinpape
Contributor

@Tomaz-Vieira
I have realized that we don't need the fancy machinery for running on different targets here; we can just use the usual
sbatch ... mechanism to run a full experiment via slurm. Experiments take at most ~1 hr, so there is no need to parallelize at an even finer-grained level on slurm.

Anyway, I am still very interested in exploring how to use / merge CloudIlastik and cluster_tools.
I will make some issues to discuss this soon and will leave this one open as a reference until then.
