Using Slurm in addition to (or as a replacement of) HTCondor #40

Open
amizeranschi opened this issue Nov 25, 2022 · 3 comments
Comments

@amizeranschi
Contributor

Hi!

Is there an easy way to add a Slurm role to VGCN, ideally alongside HTCondor (or as a replacement for it, if the two can't coexist)?

My colleague Octavian (@OtimusOne) tried replacing the existing Ansible role for HTCondor with the one below, but he ran into trouble building the resulting image.
https://github.com/galaxyproject/ansible-slurm

@amizeranschi
Contributor Author

Hi @bgruening and @mira-miracoli

Thanks again for updating the VGCN image. In an attempt to revive this older thread, I wanted to ask for some advice on how Slurm could best be integrated into VGCN. My colleague looked at the galaxyproject.slurm Ansible role some time ago, but didn't manage to get it to work.

Our goal is to create heterogeneous clusters based on both HTCondor and Slurm via Terraform/Cloud-init, on OpenStack. We've been successfully following the Galaxy tutorial "Deploying a compute cluster in OpenStack via Terraform" to deploy HTCondor from the VGCN images, but some of the tools that we work with support Slurm and not HTCondor.

@natefoo and @hexylena, any advice you might have about how to approach this issue would also be much appreciated.

@hexylena
Member

hexylena commented Feb 15, 2023

(as background)
My understanding with Slurm is that it needs a lot more coordination for appearing and disappearing nodes, like one expects in a cloud environment.

That, and my previous experience with HTCondor, where we planned around transient nodes, motivated the focus solely on HTCondor.

I suspect integration of Slurm is possible; you can easily install it. But running the images and having a central coordinator keep an up-to-date list of nodes is maybe a point of complexity that you'll need to address with additional development and coordination with OpenStack. Good luck :/
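
To make the "up to date list" point a bit more concrete, here is a rough sketch of the bookkeeping a coordinator would have to redo whenever nodes appear or disappear: regenerating the `NodeName` and `PartitionName` lines of `slurm.conf` from the current set of workers. Everything here is an assumption for illustration only. The `Node` type, the `render_node_section` function, the hostnames and the `main` partition name are hypothetical, and how the node list is obtained from OpenStack, how the file is distributed, and how the Slurm daemons are made to pick up the change are all left out.

```python
from typing import Iterable, NamedTuple


class Node(NamedTuple):
    """Hypothetical description of one worker VM known to the coordinator."""
    hostname: str
    cpus: int
    memory_mb: int


def render_node_section(nodes: Iterable[Node], partition: str = "main") -> str:
    """Render the node and partition lines of slurm.conf for the given workers.

    Nodes are sorted by hostname so the output is stable between runs,
    which makes it easy to detect whether the cluster actually changed.
    """
    ordered = sorted(nodes, key=lambda n: n.hostname)
    lines = [
        f"NodeName={n.hostname} CPUs={n.cpus} RealMemory={n.memory_mb} State=UNKNOWN"
        for n in ordered
    ]
    if ordered:
        names = ",".join(n.hostname for n in ordered)
        lines.append(
            f"PartitionName={partition} Nodes={names} "
            "Default=YES MaxTime=INFINITE State=UP"
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example worker set (made-up hostnames and sizes).
    workers = [Node("vgcn-exec-1", 8, 16000), Node("vgcn-exec-0", 4, 8000)]
    print(render_node_section(workers), end="")
```

With HTCondor the workers announce themselves to the central manager, so there is no equivalent file to regenerate; with a static Slurm configuration like the one sketched here, something has to rerun this step and propagate the result each time the set of VMs changes.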

@amizeranschi
Contributor Author

Thanks @hexylena for the reply. We haven't considered transient nodes so far, as we're only getting started with this stuff. At first, we're planning to use the new hybrid cluster for use cases outside of Galaxy as well, although we're hoping to eventually move everything into Galaxy as we gain experience.

What approach would you recommend for installing Slurm and getting it to run, when starting from the latest VGCN image? Could this be done entirely through Terraform+Cloud-init? Or should Slurm be integrated into the VGCN image itself, in a similar way to how NFS, Pulsar, Singularity etc. have been added?

For the second option, beyond adding the galaxyproject.slurm Ansible role to requirements.yml, we wouldn't even know where to start right now.
