
chore: deprecating container_runtime config, AgentRM supporting singularity, podman, and apptainer #9516

Merged
merged 19 commits into from
Jul 15, 2024

Conversation

@ShreyaLnuHpe (Contributor) commented Jun 13, 2024

Ticket

RM-311

Description

Deprecate all container runtime configs except Docker for AgentRM.

Potential Impact: AgentRM can only run with the Docker container runtime. It can no longer run with Apptainer/Singularity or Podman.

AgentRM: Support for Singularity, Podman, and Apptainer was deprecated in 0.33.0 and is now
removed. Docker is the only container runtime supported by the Agent resource manager (AgentRM). It
is still possible to use Podman with AgentRM through the Podman emulation layer. For detailed
instructions, follow the steps in the link: Emulating Docker CLI with Podman. You
might also need to configure checkpoint_storage in the experiment or master configuration: Master
Config Reference

In the enterprise edition, the Slurm resource manager still supports Singularity, Podman, and Apptainer.
For detailed instructions, visit :ref:`deploy-on-slurm-pbs`.

Test Plan

Pass CircleCI + manual testing by setting the below config in the tools/devcluster.yaml file:
happy case: container_runtime: docker
unhappy case: container_runtime: podman

example:

- agent:
      pre:
        - sh: make -C agent build
      cmdline:
        - agent/build/determined-agent
        - run
        - --config-file
        - :config

      # config_file is just an agent.yaml
      config_file:
        master_host: 127.0.0.1
        master_port: 8080
        container_master_host: $DOCKER_LOCALHOST
        log:
          level: trace
        container_runtime: docker

Command to run: % devcluster -c tools/devcluster.yaml
Experiment commands:
% cd examples/tutorials/mnist_pytorch
% det experiment create const.yaml .

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

@ShreyaLnuHpe ShreyaLnuHpe requested a review from a team as a code owner June 13, 2024 18:03
@cla-bot cla-bot bot added the cla-signed label Jun 13, 2024

netlify bot commented Jun 13, 2024

Deploy Preview for determined-ui canceled.

Name Link
🔨 Latest commit b6c4e44
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/669174b782a450000804251b

}()
docker := docker.NewClient(dcl)

// a.log.Tracef("setting up %s runtime", a.opts.ContainerRuntime)
Contributor Author

Commented out a few parts of the code instead of completely removing them, as we are unclear whether we want to deprecate the whole container_runtime configuration option or just the overloaded use of "container runtime" (container "providers", e.g. docker, podman, apptainer).

I will make corrections as needed in future commits.

Contributor

I think the only valid option for container_runtime should be docker

so if I had a config file that specified container_runtime: docker I would expect that to not error after this change

Contributor

every other container runtime I would expect the agent to return an error message saying it can't run

@ShreyaLnuHpe ShreyaLnuHpe changed the title chore: deprecating container_runtime chore: draft deprecating container_runtime Jun 13, 2024
@@ -152,6 +152,6 @@ func registerAgentConfig() {
registerInt(flags, name("agent-reconnect-backoff"), defaults.AgentReconnectBackoff,
"Time between agent reconnect attempts")

registerString(flags, name("container-runtime"), defaults.ContainerRuntime,
"The container runtime to use")
registerString(flags, name("docker-container-runtime"), defaults.DockerContainerRuntime,
Contributor

What is this change?

Contributor Author

The change defines the config for the /etc/determined/agent.yaml file.
My changes here will be undone after your suggestions below.


)

// ContainerRuntime is our interface for interacting with runtimes like Docker or Singularity.
type ContainerRuntime interface {
Contributor

I kind of like the interface. I think it's a lot harder to go the other way (that is, to put code behind a clean interface when it isn't) than it is to go from a clean interface to not having one.

So I think my preference would be to just leave this in here for now.

@@ -152,6 +152,6 @@ func registerAgentConfig() {
registerInt(flags, name("agent-reconnect-backoff"), defaults.AgentReconnectBackoff,
Contributor

I think we are going to need a lot of changes to the circleci file to no longer run agent tests

@ShreyaLnuHpe ShreyaLnuHpe force-pushed the shreya/deprecateContainerRuntime branch from dd9734f to cd0e626 Compare June 14, 2024 19:26
// requires root or a suid installation with /etc/subuid --fakeroot.
AllowNetworkCreation bool `json:"allow_network_creation"`
}

// PodmanOptions configures how we interact with podman.
type PodmanOptions struct {
Contributor Author

I am currently leaving this out. Would remove it if not needed in the next subtask


codecov bot commented Jun 14, 2024

Codecov Report

Attention: Patch coverage is 0% with 12 lines in your changes missing coverage. Please review.

Project coverage is 53.07%. Comparing base (0a57cde) to head (b6c4e44).
Report is 23 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9516      +/-   ##
==========================================
+ Coverage   52.91%   53.07%   +0.16%     
==========================================
  Files        1255     1253       -2     
  Lines      153086   152507     -579     
  Branches     3230     3229       -1     
==========================================
- Hits        81004    80950      -54     
+ Misses      71931    71406     -525     
  Partials      151      151              
Flag Coverage Δ
backend 44.47% <0.00%> (+0.38%) ⬆️
harness 72.76% <ø> (ø)
web 51.30% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
agent/cmd/determined-agent/init.go 100.00% <ø> (ø)
agent/internal/options/options.go 29.09% <ø> (ø)
agent/internal/agent.go 0.00% <0.00%> (ø)

... and 6 files with indirect coverage changes

@ShreyaLnuHpe ShreyaLnuHpe changed the title chore: draft deprecating container_runtime chore: deprecating container_runtime Jun 14, 2024
@@ -124,29 +122,14 @@ func (a *Agent) run(ctx context.Context) error {
var cruntime container.ContainerRuntime
switch a.opts.ContainerRuntime {
case options.PodmanContainerRuntime:
Contributor

I think we also want to remove this switch statement; no one should even be using the configuration option.

I think we should simply set cruntime to options.DockerContainerRuntime (or whatever makes sense in the code). We can check if a.opts.ContainerRuntime has a value and is not docker if we want to print a generic warning ... but I think we can simplify this section of the code.

Contributor Author

Yes, earlier I removed the switch statement altogether and had only Docker configs, but @NicholasBlaskey suggested we throw an error in here.

I agree with both cases; maybe now we can return an error, and later just have a default case that handles it with a generic error message.

Contributor @kkunapuli commented Jun 17, 2024

This configuration option shouldn't be known to users; we do not want them using it even if they have discovered it.

It looks like docker is the default for container runtime configuration. I think this is because there's no default clause in the original switch statement, but you should confirm this is true.

I think we can have a much simpler check. In pseudo-code, something like:

if a.opts.ContainerRuntime != "" and a.opts.ContainerRuntime != "Docker" {
  // print warning
  // error and exit
}

// continue to set up ContainerRuntime interface using Docker

Let me know if this is still confusing and we can go over the code together! (I might be able to explain it better in a call :)
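The pseudo-code above could be sketched in Go roughly as follows. This is a hedged sketch, not the PR's actual code: the function name `checkContainerRuntime`, the `"docker"` default constant, and the error wording are illustrative, following the `a.opts.ContainerRuntime` snippets quoted in this thread.

```go
package main

import "fmt"

// dockerRuntime is assumed to be the only supported value, per the discussion above.
const dockerRuntime = "docker"

// checkContainerRuntime mirrors the pseudo-code: an empty value falls back to the
// Docker default, and any other non-docker value is rejected with an error.
func checkContainerRuntime(configured string) error {
	if configured != "" && configured != dockerRuntime {
		return fmt.Errorf("container runtime %q is not supported; only %q is", configured, dockerRuntime)
	}
	return nil
}

func main() {
	// The agent would error and exit here; this sketch just prints the error.
	if err := checkContainerRuntime("podman"); err != nil {
		fmt.Println(err)
	}
}
```

With this shape, the rest of the setup can unconditionally construct the Docker-backed runtime, which is the simplification being discussed.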

SingularityOptions SingularityOptions `json:"singularity_options"`
PodmanOptions PodmanOptions `json:"podman_options"`
ContainerRuntime string `json:"container_runtime"`
PodmanOptions PodmanOptions `json:"podman_options"`
Contributor

We don't want users to be using these configuration options. I'm ok with removing them in the next set of work though if that's what you prefer (I saw one of your comments).

Contributor Author

I was thinking of keeping the container_runtime configuration as is, but will try PodmanOptions in later iterations.

Contributor

I see Nick likes the ContainerRuntime interface and wants to keep it. I think that makes a lot of sense.

I'm saying, we don't need to keep the user-facing container_runtime configuration option. Either way, yes we should definitely remove PodmanOptions in the next PR if it's not done in this one.

Contributor Author

I see what you mean here. That's possible and makes sense.

Contributor

I think there may be some confusion since there are multiple ContainerRuntimes in the code.

looking at this snippet ....

	var cruntime container.ContainerRuntime // ContainerRuntime is an interface, defined in agent/internal/container_runtime.go
	switch a.opts.ContainerRuntime { // ContainerRuntime is a string, populated by user configuration

The interface is good to keep. The configuration option one is up for debate.

@ShreyaLnuHpe ShreyaLnuHpe requested a review from a team as a code owner June 18, 2024 19:13
@determined-ci determined-ci added the documentation Improvements or additions to documentation label Jun 18, 2024
@determined-ci determined-ci requested a review from a team June 18, 2024 19:13
@ShreyaLnuHpe ShreyaLnuHpe changed the title chore: deprecating container_runtime chore: deprecating container_runtime config, AgentRM supporting singularity,podman, and apptainer Jun 18, 2024
@@ -2840,7 +2840,7 @@ jobs:
PROJECT=$(terraform -chdir=tools/slurm/terraform output --raw project)
gcloud compute scp agent/build/determined-agent "$INSTANCE_NAME":~ --zone $ZONE
gcloud compute ssh --zone "$ZONE" "$INSTANCE_NAME" --project "$PROJECT" -- \
srun determined-agent --master-host=<<parameters.master-host>> --master-port=<<parameters.master-port>> --resource-pool=default --container-runtime=<<parameters.container-run-type>>
srun determined-agent --master-host=<<parameters.master-host>> --master-port=<<parameters.master-port>> --resource-pool=default
Contributor

We can delete the agent-use parameter and all steps that use it since it should be unused

So I would completely delete this when block and the when block below this

Contributor Author

Yes I understand your comment here and will update it.
We would need to remove the test-e2e-slurm case as well, right?

jobs:
  test-e2e-slurm:
      parameters:
        - when:
                  condition:
                    equal: ["-A", <<parameters.agent-use>>]

@@ -151,7 +151,4 @@ func registerAgentConfig() {
"Max attempts agent has to reconnect")
registerInt(flags, name("agent-reconnect-backoff"), defaults.AgentReconnectBackoff,
"Time between agent reconnect attempts")

registerString(flags, name("container-runtime"), defaults.ContainerRuntime,
Contributor

I think we should leave this flag in but only allow docker as the container-runtime

If for some reason someone has determined-agent --container-runtime=docker I don't think we should break that use case

@@ -81,10 +80,6 @@ type Options struct {
// master config.
AgentReconnectBackoff int `json:"agent_reconnect_backoff"`

ContainerRuntime string `json:"container_runtime"`
Contributor

Same comment about leaving this option in but checking that is it always docker

So if someone has container_runtime: docker in their agent config I think we shouldn't break them

Contributor Author @ShreyaLnuHpe commented Jun 24, 2024

According to our previous iterations in the PR, I thought we agreed on removing the ContainerRuntime config altogether. I see what you are saying here and agree; earlier I assumed that because this config was never documented, it was not actually used by any customers.

@kkunapuli - Do you agree to the same?

Contributor

@ShreyaLnuHpe Yes, that was my understanding from talking with Bradley (that the container_runtime configuration option was never documented).

As you've dug into the task more thoroughly, it seems you've uncovered an additional use case that was documented: using agents with slurm. I'm ok with leaving the configuration option in, with non-Docker values returning an error. I thought that extra config options would be ignored, not result in errors.

Contributor Author @ShreyaLnuHpe commented Jun 24, 2024

@NicholasBlaskey @kkunapuli
Shall we now document the container_runtime config, mentioning that it only accepts the docker value in the case of AgentRM?

Contributor

No, I think don't document it. We won't block someone from having container_runtime: docker but we don't want anyone adding it either.

@@ -76,7 +76,7 @@ while [[ $# -gt 0 ]]; do
echo ' -A '
Contributor

I think the -A flag is the one we want to remove

Do we need any code changes to the scripts here to no longer support -A beyond changing the usage description?

@@ -76,7 +76,7 @@ while [[ $# -gt 0 ]]; do
echo ' -A '
echo " Description: Invokes a slurmcluster that uses agents instead of the launcher."
echo " Example: $0 -A"
echo ' -c {enroot|podman|singularity}'
echo ' -c {docker}'
Contributor

I think enroot|podman|singularity are the correct options, since we don't support docker with the launcher and we will still support the other runtimes for the launcher.

@@ -120,47 +118,18 @@ func (a *Agent) run(ctx context.Context) error {
return fmt.Errorf("failed to detect devices: %v", devices)
}

a.log.Tracef("setting up %s runtime", a.opts.ContainerRuntime)
Contributor

do we have a release note?

Contributor Author

I will add a release note, thanks

file. Below is a minimal example using a resource pool named for the user (``$USER``) and
``singularity`` as the container runtime platform. If configured using variables such as ``$HOME``,
a single ``agent.yaml`` could be shared by all users.
file. Below is a minimal example using a resource pool named for the user (``$USER``). ``docker``
Contributor

I don't think this will work

I think in slurm environments it is very likely that the launched jobs won't have access to docker so the agent method won't be supported

I think this whole hpc-with-agent workload is no longer going to be supported, so I think we can remove all these docs

Contributor

would this entail removing the page?: https://docs.determined.ai/latest/setup-cluster/slurm/hpc-with-agent.html

if so, please create a redirect. there is a guide here on using the docs redirect tool https://hpe-aiatscale.atlassian.net/wiki/spaces/DOC/pages/1338310660/How+to+Use+the+Docs+Redirect+Tool

Contributor

let me know how i can help

Contributor Author @ShreyaLnuHpe commented Jul 1, 2024

@tara-det-ai - While using redirect to remove a .rst file, what do we update after the colon ':' in this file?
"setup-cluster/slurm/hpc-with-agent.rst" : "_index.html",

[Edited] The above code gives the below error-

broken redirect detected: Link(src='setup-cluster/deploy-cluster/slurm/hpc-with-agent', dst='setup-cluster/slurm/hpc-with-agent')
check failed; the following previously-published urls seem to have been dropped and should be assigned redirects:
setup-cluster/slurm/hpc-with-agent

Redirect removing file doc
It mentions moving a file from one directory location to another, but doesn't mention anything about removing the file completely.

Contributor @tara-det-ai commented Jul 2, 2024

Remove the page with git rm

  1. Run git rm setup-cluster/slurm/hpc-with-agent.rst

Remove the toctree element since the page is gone

  2. Edit the docs/setup-cluster/slurm/_index.rst page to remove the toctree element hpc-with-agent

Create a new redirect by editing an existing redirect

  3. Modify the docs/.redirects/redirects.json file, replacing line 16
    "setup-cluster/deploy-cluster/slurm/hpc-with-agent": "../../slurm/hpc-with-agent.html",
    with
    "setup-cluster/deploy-cluster/slurm/hpc-with-agent": "../../slurm/_index.html",

Should build with no errors after that.

Contributor Author

@tara-det-ai:

An error occurred while building after making the above suggested change:

% make -C docs check
git ls-files -z '*.rst' | xargs -0 rstfmt -w 100 --check
python3 redirects.py check
check failed; the following previously-published urls seem to have been dropped and should be assigned redirects:
setup-cluster/slurm/hpc-with-agent
make: *** [check] Error 1

The above was solved when I added another line (a reference to setup-cluster/slurm/hpc-with-agent) in the docs/.redirects/redirects.json file:
"setup-cluster/slurm/hpc-with-agent": "../slurm/_index.html",

Let me know if this looks right.
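For reference, the two redirect entries discussed in this thread would sit in docs/.redirects/redirects.json roughly like this. This is a sketch: surrounding entries are omitted, and the targets simply follow the lines quoted above.

```json
{
  "setup-cluster/deploy-cluster/slurm/hpc-with-agent": "../../slurm/_index.html",
  "setup-cluster/slurm/hpc-with-agent": "../slurm/_index.html"
}
```

Both old URLs then resolve to the Slurm section index instead of the removed page.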

@@ -5,7 +5,7 @@
1. Install Terraform following [these instructions](https://developer.hashicorp.com/terraform/downloads).
2. Download the [GCP CLI](https://cloud.google.com/sdk/docs/install-sdk) and run `gcloud auth application-default login` to get credentials.
3. Run `make slurmcluster` from the root of the repo and wait (up to 10 minutes) for it to start.
- To specify which container runtime environment to use, pass in `FLAGS="-c {container_run_type}"` to `make slurmcluster`. Choose from either `singularity` (default), `podman`, or `enroot`.
- To specify which container runtime environment to use, pass in `FLAGS="-c {container_run_type}"` to `make slurmcluster`. Choose from either `singularity` (default), `podman`, or `enroot`. If utilizing Slurmcluster with Determined Agents, `docker` container runtime environment is the sole available option.
Contributor

I don't think we support Slurmcluster with docker

Contributor

For clarity, the intent is to no longer give customers the option to use slurm with determined agents.

At least, that's what I understood from slack thread: https://hpe-aiatscale.slack.com/archives/C06GMG83ZE0/p1718224380888699

Contributor Author

Corrected it. Let me know whether mentioning it outright makes sense or not.

Contributor

I think we want to remove the entire last sentence, If utilizing Slurmcluster with Determined Agents, docker container runtime environment is the sole available option.

We don't want customers to use Slurmcluster with Determined Agents. At least, that's my understanding.

@@ -81,10 +80,6 @@ type Options struct {
// master config.
AgentReconnectBackoff int `json:"agent_reconnect_backoff"`

ContainerRuntime string `json:"container_runtime"`
Contributor

what about image_root?

https://docs.determined.ai/latest/reference/deploy/agent-config-reference.html#image-root

does this have any effect on Docker?

Contributor Author

I am not very clear on your question here.

If you mean that removing Singularity/Podman/Apptainer as the container_runtime for AgentRM might affect setting the image cache via the image_root parameter: that looks unlikely, as those options remain as is when deploying Slurm through the launcher or any non-agent way shown here.

Or if you mean that since AgentRM won't be supporting Singularity/Podman/Apptainer, corrections would have to be made in here: I do not think docker will be affected even in this case.

Kindly correct me if my understanding is not right.

Contributor

I think there are two image_root config options.

One lives in the master config as resource_manager.agent_root while the other is in the agent config as image_root.

For the agent config option, I think that Docker does not use the image_root option given it was added for other runtimes. So now that we have removed the other runtimes, I think we can also remove image_root if Docker does not use it.

Contributor @tara-det-ai left a comment

viewed hpc-with-agent

Contributor @dannysauer left a comment

I'm OK with the infra-relevant portion of this PR, pending the stuff Nick suggested.

@determined-ci determined-ci requested a review from a team July 1, 2024 18:05
Contributor @NicholasBlaskey left a comment

looks pretty good

Does the slurm CI still pass on this branch?

@@ -121,46 +119,24 @@ func (a *Agent) run(ctx context.Context) error {
}

a.log.Tracef("setting up %s runtime", a.opts.ContainerRuntime)
if a.opts.ContainerRuntime != options.DockerContainerRuntime {
a.log.Error("%w creation is not supported, please update agent container runtime config to use Docker instead.",
Contributor

do we mean container runtime instead of creation?

a.opts.ContainerRuntime)
return fmt.Errorf("%s creation not available", a.opts.ContainerRuntime)
}

var cruntime container.ContainerRuntime
Contributor

I think we don't need to declare cruntime ahead of time

	var cruntime container.ContainerRuntime

and can just do this on line 139

	cruntime := docker.NewClient(dcl)


- Singularity, Podman, and Apptainer container runtimes for AgentRM: Launching Singularity/Podman/Apptainer container runtimes for the agent is no longer supported. Docker is the only option that is supported.

- Determined Agent on Slurm/PBS: Slurmcluster with Determined Agents is no longer supported. For detailed instructions on existing ways to deploy, visit :ref:`deploy-on-slurm-pbs`. This change was announced in version 0.33.0.
Contributor

I think we can remove the second note since slurmcluster is only used internally


**Deprecations**

- Singularity, Podman, and Apptainer container runtimes for AgentRM: Launching Singularity/Podman/Apptainer container runtimes for the agent is no longer supported. Docker is the only option that is supported.
Contributor

To avoid repeating here I think I would start the note with

AgentRM: ...

I would add a note that reminds users
"If you want to use singularity, podman, or apptainer the Determined master enterprise edition still supports this "

I think I would also clarify that
"This change only affects when container_runtime is set to podman, using a podman emulation layer is unchanged"

Contributor Author

Thanks for suggesting! It's a far better approach!

@@ -214,16 +206,6 @@ if [[ $OPT_CONTAINER_RUN_TYPE == "enroot" ]]; then
fi
fi

TEMPYAML=$TEMPDIR/slurmcluster.yaml
Contributor

I think on line 143 there is another reference to $DETERMINED_AGENT

Contributor Author

I didn't remove line 143 because those lines are about grabbing a token for the launcher service, which I think is still valid.
Though since we do not set DETERMINED_AGENT=1 anymore, that set of lines will never be executed.

I am inclined toward removing the line 143 if [[ -z $DETERMINED_AGENT ]]; then condition and bringing lines 144-150 outside the if block.

What do you think?

Contributor

I think since we don't set DETERMINED_AGENT,

the if [[ -z $DETERMINED_AGENT ]]; then check (that DETERMINED_AGENT isn't set) will always be true. So I think your suggestion of removing the if statement and keeping the rest of the code is right.

@@ -214,16 +206,6 @@ if [[ $OPT_CONTAINER_RUN_TYPE == "enroot" ]]; then
fi
fi

TEMPYAML=$TEMPDIR/slurmcluster.yaml
envsubst <$PARENT_PATH/slurmcluster.yaml >$TEMPYAML
if [[ -n $DETERMINED_AGENT ]]; then
Contributor

I'm curious here what deploys the agents? As in is there another step which launches the agents? I'm not super familiar with this code

Contributor Author

I am myself not very familiar with this shell script, but from these lines I understand that if $DETERMINED_AGENT is set, this command deletes the lines between resource_manager and resource_manager_end in $TEMPYAML.
But we do not need to generate a devcluster file deployed with an agent anymore.

Contributor

ah ok, looks like agents are deployed manually so we don't need to do anything else

@@ -81,10 +80,6 @@ type Options struct {
// master config.
AgentReconnectBackoff int `json:"agent_reconnect_backoff"`

ContainerRuntime string `json:"container_runtime"`
Contributor

I think there are two image_root config options.

One lives in the master config as resource_manager.agent_root while the other is in the agent config as image_root.

For the agent config option, I think Docker does not use image_root, given that it was added for the other runtimes. So now that we have removed those runtimes, I think we can also remove image_root if Docker does not use it.

@@ -5,7 +5,7 @@
1. Install Terraform following [these instructions](https://developer.hashicorp.com/terraform/downloads).
2. Download the [GCP CLI](https://cloud.google.com/sdk/docs/install-sdk) and run `gcloud auth application-default login` to get credentials.
3. Run `make slurmcluster` from the root of the repo and wait (up to 10 minutes) for it to start.
- To specify which container runtime environment to use, pass in `FLAGS="-c {container_run_type}"` to `make slurmcluster`. Choose from either `singularity` (default), `podman`, or `enroot`.
- To specify which container runtime environment to use, pass in `FLAGS="-c {container_run_type}"` to `make slurmcluster`. Choose from either `singularity` (default), `podman`, or `enroot`. If utilizing Slurmcluster with Determined Agents, `docker` container runtime environment is the sole available option.
Contributor

Suggested change
- To specify which container runtime environment to use, pass in `FLAGS="-c {container_run_type}"` to `make slurmcluster`. Choose from either `singularity` (default), `podman`, or `enroot`. If utilizing Slurmcluster with Determined Agents, `docker` container runtime environment is the sole available option.
- To specify which container runtime environment to use, pass ``FLAGS="-c {container_run_type}"`` to make slurmcluster. You can choose from ``singularity`` (default), ``podman``, or ``enroot``. If you are using Slurmcluster with Determined Agents, docker is the only available container runtime environment.

@ShreyaLnuHpe ShreyaLnuHpe force-pushed the shreya/deprecateContainerRuntime branch from 8234d62 to 13516a6 Compare July 5, 2024 20:28
- AgentRM: Launching Singularity/Podman/Apptainer container runtimes for Agent is no longer
supported. Docker is the only option that is supported. This change only affects when
container_runtime is set to podman, using a podman emulation layer is unchanged. If you want to
use singularity, podman, or apptainer the Determined master enterprise edition still supports it.

Docker is the only option that is supported. This change only affects when
container_runtime is set to podman, using a podman emulation layer is unchanged.

This is confusing to me. Maybe something like this would be clearer: "Docker and Podman using the emulation layer are still supported."

If you want to
use singularity, podman, or apptainer the Determined master enterprise edition still supports it.

Is this statement accurate? Are you able to run AgentRM with singularity, podman, or apptainer in EE? If so, then wouldn't it be the same amount of work to support all containers in both EE and OSS? Why bother deprecating it?

Contributor

If you want to
use singularity, podman, or apptainer the Determined master enterprise edition still supports it.

This came from a comment from Nick: #9516 (comment)

I'm not exactly sure what he means. I think he was referring to how the other container runtimes (podman/singularity/apptainer) still work with slurm (and slurm is ee only). If I recall correctly, podman also works with kubernetesrm though ...

If it's accurate, I think I'd rather say something like "Kubernetes and Slurm resource managers still support singularity, podman, or apptainer use."

@@ -5614,39 +5450,6 @@ workflows:
ld_library_path:
security:
initial_user_password: ${INITIAL_USER_PASSWORD}
- test-e2e-slurm:
name: test-e2e-slurm-agent-singularity-znode
Contributor Author

@NicholasBlaskey / @kkunapuli - Please confirm if we no longer be needing this test as well.

My understanding is, that we should not test any agent related singularity znode changes, but I am not sure, so need an opinion.

Contributor Author

Ongoing discussion on the failed state of the test- https://hpe-aiatscale.slack.com/archives/C04C9JXB1C2/p1720448458815989

Contributor @hamidzr (Jul 11, 2024)

is this testing EE + Slurm + Singularity, which remains a supported use case if I'm reading the PR description correctly?
update: this is testing https://docs.determined.ai/latest/setup-cluster/slurm/hpc-with-agent.html
does it make sense to treat these the same and remove it?

4524:              name: [test-e2e-slurm-agent-podman-gcp]
5427:              name: [test-e2e-slurm-agent-podman-gcp]
5620:            name: test-e2e-slurm-agent-singularity-znode

All the znodes tests should start getting scheduled and running after the backlog is cleared let's check on that in the channel

Contributor Author

I agree that EE + Slurm + Singularity is a supported use case, but the name of the test test-e2e-slurm-agent-singularity-znode confuses me a little as we are deprecating all AGENT related singularity/podman/apptainer use cases.

Contributor Author

Yes, for this PR, we have removed Agent on Slurm/PBS document and removed the below test runs (you can see above in this PR) as well:

4524:              name: [test-e2e-slurm-agent-podman-gcp]
5427:              name: [test-e2e-slurm-agent-podman-gcp]

Contributor @kkunapuli left a comment

Great work!

I left a couple minor suggestions. As a reminder, please don't merge until after user acceptance testing.

if a.opts.ContainerRuntime != options.DockerContainerRuntime {
a.log.Error(a.opts.ContainerRuntime,
" Container Runtime is not supported, please update runtime config to use Docker instead.")
return fmt.Errorf("%s Container Runtime not available", a.opts.ContainerRuntime)
Contributor

nit: I think it's more idiomatic to format the error with the %s at the end, so container runtime not available: %s

I think Go requires errors to be all lowercase?
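The suggested change can be sketched like this (the constant and function names stand in for the snippet above; the surrounding agent type and its logger are omitted):

```go
package main

import "fmt"

// DockerContainerRuntime stands in for options.DockerContainerRuntime.
const DockerContainerRuntime = "docker"

// checkRuntime applies the reviewer's nits: the error text is lowercase
// and the runtime name is formatted at the end with %s.
func checkRuntime(runtime string) error {
	if runtime != DockerContainerRuntime {
		return fmt.Errorf("container runtime not available: %s", runtime)
	}
	return nil
}

func main() {
	fmt.Println(checkRuntime("podman")) // container runtime not available: podman
	fmt.Println(checkRuntime("docker")) // <nil>
}
```

Putting the variable value after the fixed message also makes related errors group together in logs, which is part of why the `%s`-at-the-end form is conventional.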

- AgentRM: Support for Singularity, Podman, and Apptainer has been deprecated in 0.33.0 and is now
removed. Docker is the only container runtime supported by Agent resource manager (AgentRM). It
is still possible to use podman with AgentRM by using the podman emulation layer. For detailed
instructions, follow steps in the link: `Emulating Docker CLI with Podman
Contributor

I think to be more consistent with docs style, we should use "visit" instead of "follow steps in the link".

e.g.,

For detailed instructions, visit `Emulating Docker CLI with Podman <https://podman-desktop.io/docs/migrating-from-docker/emulating-docker-cli-with-podman>`

Now, you can launch jobs like normal using the Determined CLI. You can check the status of the allocated resources using `det slot list`.

If you encounter an issue with jobs failing due to `ModuleNotFoundError: No module named 'determined'`, run `make clean all` to rebuild determined.
### Note: We no longer support Slurmcluster with Determined Agents
Contributor

nit: I think it's more consistent with docs style to say:

Slurmcluster with Determined Agents is no longer supported.

Config Reference
https://docs.determined.ai/latest/reference/deploy/master-config-reference.html#checkpoint-storage`

In enterprise edition, Slurm resource manager still supports singularity, podman, or apptainer use.

Update capitalization for Singularity, Podman, and Apptainer

I'm not confident, but I think we want either "In the enterprise edition, ..." or "In Enterprise Edition, ..." Maybe more of a question for Tara.

Contributor Author

Let's discuss this during our user acceptance call with Caetano.

@ShreyaLnuHpe ShreyaLnuHpe changed the title chore: deprecating container_runtime config, AgentRM supporting singularity,podman, and apptainer chore: deprecating container_runtime config, agentrm supporting singularity,podman, and apptainer Jul 11, 2024
@@ -13,7 +13,8 @@
"architecture/system-architecture": "../get-started/architecture/system-architecture.html",
"architecture/introduction": "../get-started/architecture/introduction.html",
"setup-cluster/deploy-cluster/slurm/install-on-slurm": "../../slurm/install-on-slurm.html",
"setup-cluster/deploy-cluster/slurm/hpc-with-agent": "../../slurm/hpc-with-agent.html",
"setup-cluster/deploy-cluster/slurm/hpc-with-agent": "../../slurm/_index.html",
"setup-cluster/slurm/hpc-with-agent": "../slurm/_index.html",
Contributor

didn't know we have this, nice.

# the master waits for agents to connect and provide resources.
sed -i -e '/resource_manager/,/resource_manager_end/d' $TEMPYAML
fi
echo "Generated devcluster file: $TEMPYAML"
Contributor @NicholasBlaskey left a comment

A lot of people have reviewed this since I last looked, so I don't feel the need to review it. Let me know if you would like me to look.

@ShreyaLnuHpe ShreyaLnuHpe merged commit c3e0a41 into main Jul 15, 2024
118 of 121 checks passed
@ShreyaLnuHpe ShreyaLnuHpe deleted the shreya/deprecateContainerRuntime branch July 15, 2024 16:15
https://docs.determined.ai/latest/reference/deploy/master-config-reference.html#checkpoint-storage`

In enterprise edition, Slurm resource manager still supports Singularity, Podman, or Apptainer use.
For detailed instructions, visit :ref:deploy-on-slurm-pbs.
Contributor

:orphan:

Deprecations

  • AgentRM: As of version 0.33.0, support for Singularity, Podman, and Apptainer has been deprecated and is now officially removed. Docker is the only container runtime supported by Agent resource manager (AgentRM). However, you can still use Podman with AgentRM by utilizing the Podman emulation layer. For instructions, visit the Podman Desktop documentation and search for "Emulating Docker CLI with Podman". Additionally, you may need to configure checkpoint_storage in your experiment configuration or :ref:master configuration <master-config-reference>.

  • In the enterprise edition, the Slurm Resource Manager continues to support Singularity, Podman, and Apptainer.

Contributor

The only reStructuredText anchor I could find for slurm/pbs is this one:

.. _install-on-slurm:

so maybe you meant to use the following:

:ref:install-on-slurm

In any case, this one does not exist, and it would need proper formatting even if it did: :ref:deploy-on-slurm-pbs.

Contributor

@ShreyaLnuHpe PR 9662 should fix these
