
[question] How to force CSI Node and Plugin jobs start before any other workload #8994

Closed
carlosrbcunha opened this issue Sep 30, 2020 · 6 comments



carlosrbcunha commented Sep 30, 2020

How can we force the CSI node and plugin jobs to start before any other workload?

After an infrastructure restart, workloads are being scheduled onto nodes where the CSI node task is not yet running, since it takes some time to download and start.

  • Is there any way to force the other jobs to wait for the CSI node and controller plugins?
  • Can we force the plugin container image to be cached rather than downloaded every time?
  • Can we create a task at the server level to start the node and controller plugins on the OS, and then register the CSI plugin on the socket address?
  • Can we use the Docker engine to do this, and not rely on Nomad to start these "prerequisites"?

We also get a lot of errors related to exhausted write claims. If the node and controller plugins are running before all other tasks, all is good, but if a plugin fails or takes too long to start, everything gets pretty unstable.

We are using the azuredisk CSI driver, mounting Azure managed disks to Nomad clients.

Member

tgross commented Sep 30, 2020

After an infrastructure restart, workloads are being scheduled onto nodes where the CSI node task is not yet running, since it takes some time to download and start.

If the CSI node task isn't running, then it won't have fingerprinted with the servers and shouldn't show up as a valid scheduling target. Are the plugins being shut down cleanly so that they deregister themselves from the servers?
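You can verify this from the command line; assuming the plugin ID az-disk0 used in the jobs below, nomad plugin status reports how many node plugins the servers expect versus how many are currently healthy:

# A node whose plugin task hasn't fingerprinted yet won't be counted
# as healthy and shouldn't receive volume-consuming allocations.
nomad plugin status az-disk0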

Can we force the plugin container image to be cached rather than downloaded every time?

In your plugin configuration on the clients you can pin a specific image tag; the Docker driver won't re-pull an image that's already in the local cache unless force_pull is set. You can also pre-pull the image onto the hosts yourself.
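As a sketch of one option (a pinned version tag, with force_pull left at its default):

config {
  # An image that already exists in the local Docker cache is not
  # re-pulled unless force_pull = true; pinning a version tag means
  # the cached copy stays valid across restarts.
  image      = "mcr.microsoft.com/k8s/csi/azuredisk-csi:v0.9.0"
  force_pull = false
}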

Can we create a task at the server level to start the node and controller plugins on the OS, and then register the CSI plugin on the socket address?

See my first answer above. The CSI plugin doesn't get registered until after the plugin container is running.

Can we use the Docker engine to do this, and not rely on Nomad to start these "prerequisites"?

Unfortunately that's not going to work. The directory with the CSI control socket needs to be inside the Nomad allocation directory, which Nomad creates just before the container starts.


carlosrbcunha commented Sep 30, 2020

Nice insight, thanks.
Just some related questions.
Here are our jobs; I put them together from Gitter conversations and web content. Please review to see if it's all OK.

Controller job

job "plugin-azure-disk-controller" {
  datacenters = ["dc1"]
  type = "service"

  vault {
    policies = ["nomad-jobs"]
  }

  group "controller" {
    count = 1

    # disable deployments
    update {
      max_parallel = 0
    }
    task "controller" {
      driver = "docker"

      template {
        change_mode = "noop"
        destination = "local/azure.json"
        data = <<EOH
{
"cloud":"AzurePublicCloud",
"tenantId": "{{with secret "kv/azure/credentials"}}{{.Data.ARM_TENANT_ID}}{{end}}",
"subscriptionId": "{{with secret "kv/azure/credentials"}}{{.Data.ARM_SUBSCRIPTION_ID}}{{end}}",
"aadClientId": "{{with secret "kv/azure/credentials"}}{{.Data.ARM_CLIENT_ID}}{{end}}",
"aadClientSecret": "{{with secret "kv/azure/credentials"}}{{.Data.ARM_CLIENT_SECRET}}{{end}}",
"resourceGroup": "resource-group-name",
"location": "westeurope"
}
EOH
      }

      env {
        AZURE_CREDENTIAL_FILE = "/etc/kubernetes/azure.json"
      }

      config {
        image = "mcr.microsoft.com/k8s/csi/azuredisk-csi:v0.9.0"

        volumes = [
          "local/azure.json:/etc/kubernetes/azure.json"
        ]

        args = [
          "--nodeid=${attr.unique.hostname}-vm",
          "--endpoint=unix://csi/csi.sock",
          "--logtostderr",
          "--v=5",
        ]
      }

      csi_plugin {
        id        = "az-disk0"
        type      = "controller"
        mount_dir = "/csi"
      }

      resources {
        memory = 256
      }

      # ensuring the plugin has time to shut down gracefully
      kill_timeout = "2m"
    }
  }
}

Node job

job "plugin-azure-disk-nodes" {
  # NOTE: job name assumed; the opening line was missing from the paste
  datacenters = ["dc1"]

  vault {
    policies = ["nomad-jobs"]
  }

  # you can run node plugins as service jobs as well, but this ensures
  # that all nodes in the DC have a copy.
  type = "system"

  group "nodes" {
    task "node" {
      driver = "docker"

      template {
        change_mode = "noop"
        destination = "local/azure.json"
        data = <<EOH
{
"cloud":"AzurePublicCloud",
"tenantId": "{{with secret "kv/azure/credentials"}}{{.Data.ARM_TENANT_ID}}{{end}}",
"subscriptionId": "{{with secret "kv/azure/credentials"}}{{.Data.ARM_SUBSCRIPTION_ID}}{{end}}",
"aadClientId": "{{with secret "kv/azure/credentials"}}{{.Data.ARM_CLIENT_ID}}{{end}}",
"aadClientSecret": "{{with secret "kv/azure/credentials"}}{{.Data.ARM_CLIENT_SECRET}}{{end}}",
"resourceGroup": "resource-group-name",
"location": "westeurope"
}
EOH
      }

      env {
        AZURE_CREDENTIAL_FILE = "/etc/kubernetes/azure.json"
      }

      config {
        image = "mcr.microsoft.com/k8s/csi/azuredisk-csi:v0.9.0"

        volumes = [
          "local/azure.json:/etc/kubernetes/azure.json"
        ]

        args = [
          "--nodeid=${attr.unique.hostname}-vm",
          "--endpoint=unix://csi/csi.sock",
          "--logtostderr",
          "--v=5",
        ]

        # node plugins must run as privileged jobs because they
        # mount disks to the host
        privileged = true
      }

      csi_plugin {
        id        = "az-disk0"
        type      = "node"
        mount_dir = "/csi"
      }

      resources {
        memory = 256
      }

      # ensuring the plugin has time to shut down gracefully
      kill_timeout = "2m"
    }
  }
}

We will try to implement a controlled shutdown: a drain without forcing system jobs out, and after that stop these jobs. Will that, in your opinion, solve the plugin de-registration issue?
We are using Nomad for development clusters with dev workloads that are shut down at the end of the day. The next morning, with CSI, everything is down. We are trying to get this stable. Any thoughts?

Sometimes the plugin stats get strange, with odd numbers like more nodes expected than nodes present.

Regarding azuredisk, you still do not have any documentation. It is very frustrating to only see documentation for k8s and have to port everything to Nomad. Please add more examples.

Member

tgross commented Sep 30, 2020

Please review to see if it's all OK.

I don't have an Azure environment handy to verify them, but those jobs look reasonable. Depending on the size of your deployment, I'd consider running multiple controllers spread across hosts so that you're less likely to have one of them on a node that you're in the middle of draining.
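For illustration, that could look like this in the controller group (the count and constraint here are illustrative, not required values):

group "controller" {
  count = 2

  # Keep controller allocations on different clients so a single
  # node drain can't take out every controller at once.
  constraint {
    operator = "distinct_hosts"
    value    = "true"
  }

  # ... rest of the group as in the job above ...
}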

We will try to implement a controlled shutdown: a drain without forcing system jobs out, and after that stop these jobs. Will that, in your opinion, solve the plugin de-registration issue?

It should. See this note in the csi_plugin docs:

Note: During node drains, jobs that claim volumes must be moved before the node or monolith plugin for those volumes. You should run node or monolith plugins as system jobs and use the -ignore-system flag on nomad node drain to ensure that the plugins are running while the node is being drained.
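Concretely, the drain described in that note looks something like this (the node ID is a placeholder):

# Drain regular workloads off the node while leaving system jobs,
# including the CSI node plugin, running until the drain completes.
nomad node drain -enable -ignore-system -yes <node-id>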

For your dev environment:

We are using Nomad for development clusters with dev workloads that are shut down at the end of the day. The next morning, with CSI, everything is down. We are trying to get this stable. Any thoughts?

When the servers come up (or if they're left running), they won't have had a heartbeat from the client agents in however many hours and will consider them lost. That'll result in them trying to reschedule all the workloads, including the CSI plugins. If you want to shut down the whole cluster, you'll definitely need to shut down all the volume-consuming workloads first so that the volume claims get released. You should avoid #8609 because you're shutting down rather than draining off to another client.

That being said, taking off my Nomad developer hat for a moment: if you're running on cloud infrastructure anyway, it might be worth your while to automate recreating the developer environment from scratch every day. I've done that sort of thing to prevent developers from leaving all kinds of undocumented things lying around development environments that later bite you when you get to production.

Sometimes the plugin stats get all strange with odd numbers like having more nodes expected than nodes present.

We have a couple of open bugs on that (#8948, #8628, #8034). There's also some eventual consistency because it takes a while for plugins to come live: #7296. I've got some time coming up in the next couple weeks to work on some of those items.

Regarding azuredisk, you still do not have any documentation. It is very frustrating to only see documentation for k8s and have to port everything to Nomad. Please add more examples.

Please keep in mind that Nomad's CSI integration is still marked as beta, so you'll need to do a bit of lifting here. There are around 100 different CSI drivers, so we won't be able to provide documentation for them all. A few other folks have given it a go; search through the issues for storage and Azure.

@carlosrbcunha
Author

Following our research, we implemented some guardrails in our pipeline to stop the jobs and check that the disks detach normally. Having a controlled shutdown and startup did the trick.
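For anyone landing here later, such guardrails can be sketched roughly like this (the job names and volume ID are placeholders, and the exact nomad volume status output text may vary by Nomad version):

#!/bin/sh
# 1. Stop the volume-consuming workloads first so their claims release.
nomad job stop -yes my-app

# 2. Wait until the volume reports no remaining allocations.
until nomad volume status my-volume | grep -q 'No allocations'
do
  sleep 5
done

# 3. Only then stop the plugin jobs, node plugin last.
nomad job stop -yes plugin-azure-disk-controller
nomad job stop -yes plugin-azure-disk-nodes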

My colleagues and I will continue to ask questions and contribute code or documentation so that it gets easier for the next person.
Funny thing: the issues you pointed out (storage and Azure) are mine ;)

Thanks for all your help, you are doing a great job.

Member

tgross commented Oct 1, 2020

Funny thing: the issues you pointed out (storage and Azure) are mine ;)

Ha ha! I totally missed that 😆


github-actions bot commented Nov 1, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 1, 2022