chore: deprecating container_runtime config, agentrm supporting singularity, podman, and apptainer #9516
Conversation
agent/internal/agent.go
Outdated
}()
docker := docker.NewClient(dcl)

// a.log.Tracef("setting up %s runtime", a.opts.ContainerRuntime)
Commented out a few parts of the code instead of completely removing them, since we are unclear whether we want to deprecate the whole of container_runtime (configuration option/variable) or just the overloaded use of "container runtime" (container "providers", e.g. docker, podman, apptainer).
I will make corrections as needed in future commits
I think the only valid option for container_runtime should be docker, so if I had a config file that specified container_runtime: docker, I would expect that not to error after this change.
For every other container runtime, I would expect the agent to return an error message saying it can't run.
agent/cmd/determined-agent/init.go
Outdated
@@ -152,6 +152,6 @@ func registerAgentConfig() {
	registerInt(flags, name("agent-reconnect-backoff"), defaults.AgentReconnectBackoff,
		"Time between agent reconnect attempts")

	registerString(flags, name("container-runtime"), defaults.ContainerRuntime,
		"The container runtime to use")
	registerString(flags, name("docker-container-runtime"), defaults.DockerContainerRuntime,
What is this change?
The change is to make the config definition for the /etc/determined/agent.yaml file. My changes here will be undone after your suggestions below.
)

// ContainerRuntime is our interface for interacting with runtimes like Docker or Singularity.
type ContainerRuntime interface {
I kinda like the interface. I think it's a lot harder to go the other way (that is, put code behind a clean interface when it isn't) than it is to go from a clean interface to not having one.
So I think my preference would be to just leave this in here for now.
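For what it's worth, a minimal sketch of the kind of seam this interface gives you; the method set below is hypothetical and does not match the actual definition in agent/internal/container_runtime.go:

package container

import "context"

// ContainerRuntime is a narrow seam between the agent and a concrete
// runtime client: callers program against this interface instead of a
// specific Docker client type. The methods are illustrative placeholders.
type ContainerRuntime interface {
	// PullImage fetches the image a task container will be created from.
	PullImage(ctx context.Context, image string) error
	// RunContainer starts a task container and returns its runtime ID.
	RunContainer(ctx context.Context, image string, cmd []string) (string, error)
}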
agent/cmd/determined-agent/init.go
Outdated
@@ -152,6 +152,6 @@ func registerAgentConfig() {
	registerInt(flags, name("agent-reconnect-backoff"), defaults.AgentReconnectBackoff,
I think we are going to need a lot of changes to the circleci file to no longer run agent tests
agent/internal/options/options.go
Outdated
// requires root or a suid installation with /etc/subuid --fakeroot.
AllowNetworkCreation bool `json:"allow_network_creation"`
}

// PodmanOptions configures how we interact with podman.
type PodmanOptions struct {
I am currently leaving this in; I would remove it if it's not needed in the next subtask.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #9516      +/-   ##
==========================================
+ Coverage   52.91%   53.07%   +0.16%
==========================================
  Files        1255     1253       -2
  Lines      153086   152507     -579
  Branches     3230     3229       -1
==========================================
- Hits        81004    80950      -54
+ Misses      71931    71406     -525
  Partials      151      151

Flags with carried forward coverage won't be shown.
agent/internal/agent.go
Outdated
@@ -124,29 +122,14 @@ func (a *Agent) run(ctx context.Context) error {
	var cruntime container.ContainerRuntime
	switch a.opts.ContainerRuntime {
	case options.PodmanContainerRuntime:
I think we also want to remove this switch statement; no one should even be using the configuration option. I think we should simply set cruntime to options.DockerContainerRuntime (or whatever makes sense in the code). We can check if a.opts.ContainerRuntime has a value and is not docker if we want to print a generic warning ... but I think we can simplify this section of the code.
Yes, earlier I removed the switch statement altogether and had only Docker configs, but @NicholasBlaskey suggested we throw an error here. I agree with both cases; maybe now we can return an error, and later just have a default case which handles it with a generic error message.
This configuration option shouldn't be known to users; we do not want them using it even if they have discovered it.
It looks like docker is the default for the container runtime configuration. I think this is because there's no default clause in the original switch statement, but you should confirm this is true.
I think we can have a much simpler check. In pseudo-code, something like:

if a.opts.ContainerRuntime != "" and a.opts.ContainerRuntime != "Docker" {
    // print warning
    // error and exit
}
// continue to set up ContainerRuntime interface using Docker

Let me know if this is still confusing and we can go over the code together! (I might be able to explain it better in a call :)
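A rough Go rendering of that pseudo-code, with made-up names standing in for the agent's real constants (assume options.DockerContainerRuntime is "docker"):

package main

import (
	"fmt"
	"log"
)

// dockerContainerRuntime stands in for options.DockerContainerRuntime.
const dockerContainerRuntime = "docker"

// validateContainerRuntime accepts an unset value (the default) or "docker"
// and rejects everything else before any runtime setup happens.
func validateContainerRuntime(configured string) error {
	if configured != "" && configured != dockerContainerRuntime {
		return fmt.Errorf("container runtime not available: %s", configured)
	}
	return nil
}

func main() {
	if err := validateContainerRuntime("podman"); err != nil {
		log.Println(err) // prints: container runtime not available: podman
	}
}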
agent/internal/options/options.go
Outdated
SingularityOptions SingularityOptions `json:"singularity_options"`
PodmanOptions      PodmanOptions      `json:"podman_options"`
ContainerRuntime   string             `json:"container_runtime"`
PodmanOptions      PodmanOptions      `json:"podman_options"`
We don't want users to be using these configuration options. I'm ok with removing them in the next set of work though if that's what you prefer (I saw one of your comments).
I was thinking of keeping the container_runtime configuration as is, but will try PodmanOptions in later iterations.
I see Nick likes the ContainerRuntime interface and wants to keep it. I think that makes a lot of sense.
I'm saying we don't need to keep the user-facing container_runtime configuration option. Either way, yes, we should definitely remove PodmanOptions in the next PR if it's not done in this one.
I see what you mean here. That's possible and makes sense.
I think there may be some confusion since there are multiple ContainerRuntimes in the code. Looking at this snippet:

var cruntime container.ContainerRuntime // ContainerRuntime is an interface, defined in agent/internal/container_runtime.go
switch a.opts.ContainerRuntime { // ContainerRuntime is a string, populated by user configuration

The interface is good to keep. The configuration option one is up for debate.
.circleci/real_config.yml
Outdated
@@ -2840,7 +2840,7 @@ jobs:
	PROJECT=$(terraform -chdir=tools/slurm/terraform output --raw project)
	gcloud compute scp agent/build/determined-agent "$INSTANCE_NAME":~ --zone $ZONE
	gcloud compute ssh --zone "$ZONE" "$INSTANCE_NAME" --project "$PROJECT" -- \
		srun determined-agent --master-host=<<parameters.master-host>> --master-port=<<parameters.master-port>> --resource-pool=default --container-runtime=<<parameters.container-run-type>>
		srun determined-agent --master-host=<<parameters.master-host>> --master-port=<<parameters.master-port>> --resource-pool=default
We can delete the agent-use parameter and all steps that use it since it should be unused. So I would completely delete this when block and the when block below this.
Yes, I understand your comment here and will update it.
We would need to remove the test-e2e-slurm case as well, right?

jobs:
  test-e2e-slurm:
    parameters:
    - when:
        condition:
          equal: ["-A", <<parameters.agent-use>>]
agent/cmd/determined-agent/init.go
Outdated
@@ -151,7 +151,4 @@ func registerAgentConfig() {
		"Max attempts agent has to reconnect")
	registerInt(flags, name("agent-reconnect-backoff"), defaults.AgentReconnectBackoff,
		"Time between agent reconnect attempts")

	registerString(flags, name("container-runtime"), defaults.ContainerRuntime,
I think we should leave this flag in but only allow docker as the container-runtime. If for some reason someone has determined-agent --container-runtime=docker, I don't think we should break that use case.
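A standalone sketch of keeping the flag but accepting only docker; it uses the standard library flag package instead of the agent's registerString helper, so treat it as an illustration of the intended behavior rather than the actual code:

package main

import (
	"flag"
	"fmt"
	"os"
)

func main() {
	// Keep the flag so existing invocations such as
	// determined-agent --container-runtime=docker continue to work.
	runtime := flag.String("container-runtime", "docker",
		"The container runtime to use (only docker is supported)")
	flag.Parse()

	// Any other value fails fast with an explicit error.
	if *runtime != "docker" {
		fmt.Fprintf(os.Stderr, "container runtime not available: %s\n", *runtime)
		os.Exit(1)
	}
	fmt.Println("using docker container runtime")
}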
@@ -81,10 +80,6 @@ type Options struct {
	// master config.
	AgentReconnectBackoff int `json:"agent_reconnect_backoff"`

	ContainerRuntime string `json:"container_runtime"`
Same comment about leaving this option in but checking that it is always docker. So if someone has container_runtime: docker in their agent config, I think we shouldn't break them.
According to our previous iterations on the PR, I thought we agreed on removing the ContainerRuntime config altogether. I see what you are saying here and agree; earlier I assumed that because this config was never documented, it was not actually used by any customers.
@kkunapuli - Do you agree?
@ShreyaLnuHpe Yes, that was my understanding from talking with Bradley (that the container_runtime configuration option was never documented).
As you've dug into the task more thoroughly, it seems you've uncovered an additional use case that was documented: using agents with slurm. I'm ok with leaving the configuration option in, with non-Docker values returning an error. I thought that extra config options would be ignored, not result in errors.
@NicholasBlaskey @kkunapuli Shall we now document the container_runtime config, mentioning that it only accepts the docker value in the case of AgentRM?
No, I think don't document it. We won't block someone from having container_runtime: docker, but we don't want anyone adding it either.
tools/slurm/scripts/slurmcluster.sh
Outdated
@@ -76,7 +76,7 @@ while [[ $# -gt 0 ]]; do
	echo ' -A '
I think the -A flag is the one we want to remove. Do we need any code changes to the scripts here to no longer support -A, beyond changing the usage description?
tools/slurm/scripts/slurmcluster.sh
Outdated
@@ -76,7 +76,7 @@ while [[ $# -gt 0 ]]; do
	echo ' -A '
	echo "    Description: Invokes a slurmcluster that uses agents instead of the launcher."
	echo "    Example: $0 -A"
	echo ' -c {enroot|podman|singularity}'
	echo ' -c {docker}'
I think enroot|podman|singularity are the correct options, since we don't support docker with launcher and we will still support the other runtimes for launcher.
@@ -120,47 +118,18 @@ func (a *Agent) run(ctx context.Context) error {
		return fmt.Errorf("failed to detect devices: %v", devices)
	}

	a.log.Tracef("setting up %s runtime", a.opts.ContainerRuntime)
do we have a release note?
I will add a release note, thanks
file. Below is a minimal example using a resource pool named for the user (``$USER``) and
``singularity`` as the container runtime platform. If configured using variables such as ``$HOME``,
a single ``agent.yaml`` could be shared by all users.
file. Below is a minimal example using a resource pool named for the user (``$USER``). ``docker``
I don't think this will work. I think in slurm environments it is very likely that the launched jobs won't have access to docker, so the agent method won't be supported.
I think this whole hpc-with-agent workflow is no longer going to be supported, so I think we can remove all these docs.
Would this entail removing the page https://docs.determined.ai/latest/setup-cluster/slurm/hpc-with-agent.html? If so, please create a redirect. There is a guide on using the docs redirect tool here: https://hpe-aiatscale.atlassian.net/wiki/spaces/DOC/pages/1338310660/How+to+Use+the+Docs+Redirect+Tool
Let me know how I can help.
@tara-det-ai - While using a redirect to remove a .rst file, what do we update after the colon ':' in this file?

"setup-cluster/slurm/hpc-with-agent.rst" : "_index.html",

[Edited] The above code gives the below error:

broken redirect detected: Link(src='setup-cluster/deploy-cluster/slurm/hpc-with-agent', dst='setup-cluster/slurm/hpc-with-agent')
check failed; the following previously-published urls seem to have been dropped and should be assigned redirects:
setup-cluster/slurm/hpc-with-agent

The "Redirect removing file" doc mentions what to do when moving a file from one directory location to another, but doesn't mention anything about removing the file completely.
Remove the page with git rm:
- run git rm setup-cluster/slurm/hpc-with-agent.rst

Remove the toctree element since the page is gone:
- edit the docs/setup-cluster/slurm/_index.rst page to remove the toctree element hpc-with-agent

Create a new redirect by editing an existing redirect:
- modify the docs/.redirects/redirects.json file, replacing line 16

"setup-cluster/deploy-cluster/slurm/hpc-with-agent": "../../slurm/hpc-with-agent.html",

with

"setup-cluster/deploy-cluster/slurm/hpc-with-agent": "../../slurm/_index.html",

It should build with no errors after that.
An error occurred while building after making the above suggested change:

% make -C docs check
git ls-files -z '*.rst' | xargs -0 rstfmt -w 100 --check
python3 redirects.py check
check failed; the following previously-published urls seem to have been dropped and should be assigned redirects:
setup-cluster/slurm/hpc-with-agent
make: *** [check] Error 1

The above was solved when I added another line (a reference to setup-cluster/slurm/hpc-with-agent) to the docs/.redirects/redirects.json file:

"setup-cluster/slurm/hpc-with-agent": "../slurm/_index.html",

Let me know if this looks right.
tools/slurm/README.md
Outdated
@@ -5,7 +5,7 @@
1. Install Terraform following [these instructions](https://developer.hashicorp.com/terraform/downloads).
2. Download the [GCP CLI](https://cloud.google.com/sdk/docs/install-sdk) and run `gcloud auth application-default login` to get credentials.
3. Run `make slurmcluster` from the root of the repo and wait (up to 10 minutes) for it to start.
- To specify which container runtime environment to use, pass in `FLAGS="-c {container_run_type}"` to `make slurmcluster`. Choose from either `singularity` (default), `podman`, or `enroot`.
- To specify which container runtime environment to use, pass in `FLAGS="-c {container_run_type}"` to `make slurmcluster`. Choose from either `singularity` (default), `podman`, or `enroot`. If utilizing Slurmcluster with Determined Agents, `docker` container runtime environment is the sole available option.
I don't think we support Slurmcluster with docker
For clarity, the intent is to no longer give customers the option to use slurm with determined agents.
At least, that's what I understood from the slack thread: https://hpe-aiatscale.slack.com/archives/C06GMG83ZE0/p1718224380888699
Corrected it. Let me know whether mentioning it straight out makes sense or not.
I think we want to remove the entire last sentence: "If utilizing Slurmcluster with Determined Agents, docker container runtime environment is the sole available option."
We don't want customers to use Slurmcluster with Determined Agents. At least, that's my understanding.
@@ -81,10 +80,6 @@ type Options struct {
	// master config.
	AgentReconnectBackoff int `json:"agent_reconnect_backoff"`

	ContainerRuntime string `json:"container_runtime"`
What about image_root (https://docs.determined.ai/latest/reference/deploy/agent-config-reference.html#image-root)? Does this have any effect on Docker?
I am not very clear on your question here.
If you mean that removing Singularity/Podman/Apptainer as container_runtime for AgentRM might affect setting those image caches via the image_root parameter: that looks unlikely, as we know we have these options as is when deploying Slurm through Launcher or any non-agent way shown here.
Or if you mean that as AgentRM won't be supporting Singularity/Podman/Apptainer, we would have to make corrections here: I do not think docker will be affected even in this case.
Kindly correct me if my understanding is not right.
I think there are two image_root config options. One lives in the master config as resource_manager.agent_root while the other is in the agent config as image_root.
For the agent config option, I think that Docker does not use image_root, given it was added for other runtimes. So now that we have removed the other runtimes, I think we can also remove image_root if Docker does not use it.
viewed hpc-with-agent
I'm OK with the infra-relevant portion of this PR, pending the stuff Nick suggested.
looks pretty good
Does the slurm CI still pass on this branch?
agent/internal/agent.go
Outdated
@@ -121,46 +119,24 @@ func (a *Agent) run(ctx context.Context) error {
	}

	a.log.Tracef("setting up %s runtime", a.opts.ContainerRuntime)
	if a.opts.ContainerRuntime != options.DockerContainerRuntime {
		a.log.Error("%w creation is not supported, please update agent container runtime config to use Docker instead.",
Do we mean container runtime instead of creation?
agent/internal/agent.go
Outdated
		a.opts.ContainerRuntime)
	return fmt.Errorf("%s creation not available", a.opts.ContainerRuntime)
}

var cruntime container.ContainerRuntime
I think we don't need to declare cruntime ahead of time (var cruntime container.ContainerRuntime) and can just do this on line 139: cruntime := docker.NewClient(dcl)
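A tiny self-contained illustration of why the up-front var declaration was only needed while multiple runtimes existed (the types here are invented for the example):

package main

import "fmt"

// runtimeClient stands in for the container.ContainerRuntime interface.
type runtimeClient interface{ Name() string }

// dockerClient stands in for the client that docker.NewClient returns.
type dockerClient struct{}

func (dockerClient) Name() string { return "docker" }

func main() {
	// With several runtimes, the interface-typed variable had to be
	// declared first (var cruntime runtimeClient) so each switch case
	// could assign a different implementation. With Docker as the only
	// runtime, a short variable declaration is enough.
	cruntime := dockerClient{}
	var _ runtimeClient = cruntime // compile-time check: still satisfies the interface
	fmt.Println("runtime:", cruntime.Name())
}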
- Singularity, Podman, and Apptainer Container runtimes for AgentRM: Launching a Singluarity/Podman/Apptainer container runtimes for Agent is no longer supported. Docker is the only option that is supported.

- Determined Agent on Slurm/PBS: Slurmcluster with Determined Agents is not supported any more. For detailed instructions on existing ways to deploy, visit :ref:deploy-on-slurm-pbs. This change was announced in version 0.33.0.
I think we can remove the second note since slurmcluster is only used internally.
**Deprecations**

- Singularity, Podman, and Apptainer Container runtimes for AgentRM: Launching a Singluarity/Podman/Apptainer container runtimes for Agent is no longer supported. Docker is the only option that is supported.
To avoid repeating here, I think I would start the note with "AgentRM: ...".
I would add a note that reminds users: "If you want to use singularity, podman, or apptainer the Determined master enterprise edition still supports this."
I think I would also clarify that "This change only affects when container_runtime is set to podman; using a podman emulation layer is unchanged."
Thanks for suggesting! It's a far better approach!
@@ -214,16 +206,6 @@ if [[ $OPT_CONTAINER_RUN_TYPE == "enroot" ]]; then
	fi
fi

TEMPYAML=$TEMPDIR/slurmcluster.yaml
I think on line 143 there is another reference to $DETERMINED_AGENT
I didn't remove line 143 because the description of those lines meant grabbing a token for the launcher service, which I think is still valid. Though since we do not set DETERMINED_AGENT=1 anymore, that set of lines will never be executed.
I am inclined towards removing the line 143 if [[ -z $DETERMINED_AGENT ]]; then condition and bringing lines 144-150 outside the if block.
What do you think?
I think since we don't set DETERMINED_AGENT, the if [[ -z $DETERMINED_AGENT ]]; then checks that DETERMINED_AGENT isn't set, so the if will always be true. So I think your suggestion of removing the if statement and keeping the rest of the code is right.
@@ -214,16 +206,6 @@ if [[ $OPT_CONTAINER_RUN_TYPE == "enroot" ]]; then
	fi
fi

TEMPYAML=$TEMPDIR/slurmcluster.yaml
envsubst <$PARENT_PATH/slurmcluster.yaml >$TEMPYAML
if [[ -n $DETERMINED_AGENT ]]; then
I'm curious here: what deploys the agents? As in, is there another step which launches the agents? I'm not super familiar with this code.
I am myself not very familiar with this shell script, but according to these lines I understand that if $DETERMINED_AGENT is set, this command deletes the lines between resource_manager and resource_manager_end in $TEMPYAML.
But we do not need to generate a devcluster file deployed with an agent anymore.
ah ok, looks like agents are deployed manually so we don't need to do anything else
tools/slurm/README.md
Outdated
@@ -5,7 +5,7 @@
1. Install Terraform following [these instructions](https://developer.hashicorp.com/terraform/downloads).
2. Download the [GCP CLI](https://cloud.google.com/sdk/docs/install-sdk) and run `gcloud auth application-default login` to get credentials.
3. Run `make slurmcluster` from the root of the repo and wait (up to 10 minutes) for it to start.
- To specify which container runtime environment to use, pass in `FLAGS="-c {container_run_type}"` to `make slurmcluster`. Choose from either `singularity` (default), `podman`, or `enroot`. If utilizing Slurmcluster with Determined Agents, `docker` container runtime environment is the sole available option.

Suggested change:
- To specify which container runtime environment to use, pass ``FLAGS="-c {container_run_type}"`` to make slurmcluster. You can choose from ``singularity`` (default), ``podman``, or ``enroot``. If you are using Slurmcluster with Determined Agents, docker is the only available container runtime environment.
- AgentRM: Launching a Singluarity/Podman/Apptainer container runtimes for Agent is no longer
  supported. Docker is the only option that is supported. This change only affects when
  container_runtime is set to podman, using a podman emulation layer is unchanged. If you want to
  use singularity, podman, or apptainer the Determined master enterprise edition still supports it.
> Docker is the only option that is supported. This change only affects when container_runtime is set to podman, using a podman emulation layer is unchanged.

This is confusing to me. Maybe something like this would be clearer: "Docker and Podman using the emulation layer are still supported."

> If you want to use singularity, podman, or apptainer the Determined master enterprise edition still supports it.

Is this statement accurate? Are you able to run AgentRM with singularity, podman, or apptainer in EE? If so, then wouldn't it be the same amount of work to support all containers in both EE and OSS? Why bother deprecating it?
> If you want to use singularity, podman, or apptainer the Determined master enterprise edition still supports it.

This came from a comment from Nick: #9516 (comment). I'm not exactly sure what he means. I think he was referring to how the other container runtimes (podman/singularity/apptainer) still work with slurm (and slurm is ee only). If I recall correctly, podman also works with kubernetesrm though ...

If it's accurate, I think I'd rather say something like "Kubernetes and Slurm resource managers still support singularity, podman, or apptainer use."
.circleci/real_config.yml
Outdated
@@ -5614,39 +5450,6 @@ workflows:
	ld_library_path:
	security:
		initial_user_password: ${INITIAL_USER_PASSWORD}
- test-e2e-slurm:
	name: test-e2e-slurm-agent-singularity-znode
@NicholasBlaskey / @kkunapuli - Please confirm whether we no longer need this test as well.
My understanding is that we should not test any agent-related singularity znode changes, but I am not sure, so I need an opinion.
Ongoing discussion on the failed state of the test: https://hpe-aiatscale.slack.com/archives/C04C9JXB1C2/p1720448458815989
Is this testing EE + Slurm + Singularity, which remains a supported use case if I'm reading the PR description correctly?

Update: this is testing https://docs.determined.ai/latest/setup-cluster/slurm/hpc-with-agent.html. Makes sense to treat these the same, so remove?

4524: name: [test-e2e-slurm-agent-podman-gcp]
5427: name: [test-e2e-slurm-agent-podman-gcp]
5620: name: test-e2e-slurm-agent-singularity-znode

All the znodes tests should start getting scheduled and running after the backlog is cleared; let's check on that in the channel.
I agree that EE + Slurm + Singularity is a supported use case, but the name of the test test-e2e-slurm-agent-singularity-znode confuses me a little, as we are deprecating all AGENT-related singularity/podman/apptainer use cases.
Yes, for this PR, we have removed the Agent on Slurm/PBS document and removed the below test runs (you can see above in this PR) as well:

4524: name: [test-e2e-slurm-agent-podman-gcp]
5427: name: [test-e2e-slurm-agent-podman-gcp]
Great work!
I left a couple minor suggestions. As a reminder, please don't merge until after user acceptance testing.
agent/internal/agent.go
Outdated
if a.opts.ContainerRuntime != options.DockerContainerRuntime {
	a.log.Error(a.opts.ContainerRuntime,
		" Container Runtime is not supported, please update runtime config to use Docker instead.")
	return fmt.Errorf("%s Container Runtime not available", a.opts.ContainerRuntime)
nit: I think it's more idiomatic to format the error with the %s at the end, so: container runtime not available: %s. I think Go requires errors to be all lowercase?
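For reference, the compiler doesn't require lowercase; it's a convention from the Go Code Review Comments wiki that linters flag. A minimal sketch of the suggested format:

package main

import "fmt"

func main() {
	runtime := "podman"
	// Convention: error strings start lowercase, have no trailing
	// punctuation, and put the variable part at the end after a colon.
	err := fmt.Errorf("container runtime not available: %s", runtime)
	fmt.Println(err) // container runtime not available: podman
}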
- AgentRM: Support for Singluarity, Podman, and Apptainer has been deprecated in 0.33.0 and is now
  removed. Docker is the only container runtime supported by Agent resource manager (AgentRM). It
  is still possible to use podman with AgentRM by using the podman emulation layer. For detailed
  instructions, follow steps in the link: `Emulating Docker CLI with Podman
I think to be more consistent with docs style, we should use "visit" instead of "follow steps in the link". E.g.:

For detailed instructions, visit `Emulating Docker CLI with Podman <https://podman-desktop.io/docs/migrating-from-docker/emulating-docker-cli-with-podman>`
tools/slurm/README.md
Outdated
Now, you can launch jobs like normal using the Determined CLI. You can check the status of the allocated resources using `det slot list`.

If you encounter an issue with jobs failing due to `ModuleNotFoundError: No module named 'determined'` run `make clean all` to rebuild determined.
### Note: We no longer support Slurmcluster with Determined Agents
nit: I think it's more consistent with docs style to say:
Slurmcluster with Determined Agents is no longer supported.
Config Reference
https://docs.determined.ai/latest/reference/deploy/master-config-reference.html#checkpoint-storage`

In enterprise edition, Slurm resource manager still supports singularity, podman, or apptainer use.
Update capitalization for Singularity, Podman, and Apptainer
I'm not confident, but I think we want either "In the enterprise edition, ..." or "In Enterprise Edition, ..." Maybe more of a question for Tara.
Let's discuss this during our user acceptance call with Caetano.
@@ -13,7 +13,8 @@
	"architecture/system-architecture": "../get-started/architecture/system-architecture.html",
	"architecture/introduction": "../get-started/architecture/introduction.html",
	"setup-cluster/deploy-cluster/slurm/install-on-slurm": "../../slurm/install-on-slurm.html",
	"setup-cluster/deploy-cluster/slurm/hpc-with-agent": "../../slurm/hpc-with-agent.html",
	"setup-cluster/deploy-cluster/slurm/hpc-with-agent": "../../slurm/_index.html",
	"setup-cluster/slurm/hpc-with-agent": "../slurm/_index.html",
didn't know we have this, nice.
# the master waits for agents to connect and provide resources.
sed -i -e '/resource_manager/,/resource_manager_end/d' $TEMPYAML
fi
echo "Generated devcluster file: $TEMPYAML"
Is this still getting generated? The failure might be related to this.
A lot of people have reviewed this since I last looked, so I don't feel the need to review it. Let me know if you would like me to look.
https://docs.determined.ai/latest/reference/deploy/master-config-reference.html#checkpoint-storage`

In enterprise edition, Slurm resource manager still supports Singularity, Podman, or Apptainer use.
For detailed instructions, visit :ref:deploy-on-slurm-pbs.
:orphan:

Deprecations

- AgentRM: As of version 0.33.0, support for Singularity, Podman, and Apptainer has been deprecated and is now officially removed. Docker is the only container runtime supported by Agent resource manager (AgentRM). However, you can still use Podman with AgentRM by utilizing the Podman emulation layer. For instructions, visit the Podman Desktop documentation and search for "Emulating Docker CLI with Podman". Additionally, you may need to configure checkpoint_storage in your experiment configuration or :ref:`master configuration <master-config-reference>`.

- In the enterprise edition, the Slurm Resource Manager continues to support Singularity, Podman, and Apptainer.
The only reStructuredText anchor that I could find for slurm/pbs is this one:

.. _install-on-slurm:

so maybe you meant to use the following: :ref:`install-on-slurm`

In any case, this one does not exist, and it would need proper formatting if it did exist: :ref:deploy-on-slurm-pbs.
@ShreyaLnuHpe PR 9662 should fix these
Ticket
RM-311

Description
Deprecate all container runtime configs except Docker for AgentRM.
Potential Impact: AgentRM can only run with the Docker container runtime; it can no longer run with Apptainer/Singularity or Podman.
AgentRM: Support for Singluarity, Podman, and Apptainer has been deprecated in 0.33.0 and is now
removed. Docker is the only container runtime supported by Agent resource manager (AgentRM). It
is still possible to use podman with AgentRM by using the podman emulation layer. For detailed
instructions, follow steps in the link: Emulating Docker CLI with Podman. You
might need to also configure checkpoint_storage in experiment or master configurations: Master
Config Reference
In enterprise edition, Slurm resource manager still supports singularity, podman, or apptainer use.
For detailed instructions, visit :ref:deploy-on-slurm-pbs.
Test Plan
Pass CircleCI + manual testing by setting the below config in the tools/devcluster.yaml file.

Happy case:
container_runtime: docker

Unhappy case:
container_runtime: podman

Example:
Command to run:
% devcluster -c tools/devcluster.yaml
Experiment command:
% cd examples/tutorials/mnist_pytorch
% det experiment create const.yaml .

Checklist
docs/release-notes/: See Release Note for details.