
[question] How to force CSI Node and Plugin jobs start before any other workload #8994

Closed
carlosrbcunha opened this issue Sep 30, 2020 · 6 comments



carlosrbcunha commented Sep 30, 2020

How can we force the CSI node and plugin jobs to start before any other workload?

After an infrastructure restart, workloads are being scheduled onto nodes where the CSI node task is not yet running, since it takes some time to download and start.

  • Is there any way to force the other jobs to wait for the CSI node and controller plugins?
  • Can we force the plugin container image to be cached rather than downloaded every time?
  • Can we create a task at the server level to start the node and controller plugins on the OS, and then register the CSI plugin on the socket address?
  • Can we use the Docker engine to do this, and not rely on Nomad to start these "prerequisites"?

We also get a lot of errors related to exhausted write claims. If the node and controller plugins are running before all other tasks, all is good, but if a plugin fails or takes too long to start, everything gets pretty unstable.

We are using the azuredisk CSI driver, mounting Azure managed disks to Nomad clients.

Member

tgross commented Sep 30, 2020

After an infrastructure restart, workloads are being scheduled onto nodes where the CSI node task is not yet running, since it takes some time to download and start.

If the CSI node task isn't running, then it won't have fingerprinted with the servers and shouldn't show up as a valid scheduling target. Are the plugins being shut down cleanly so that they deregister themselves from the servers?
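You can verify this from the command line; assuming the plugin ID az-disk0 used in the jobs below, nomad plugin status reports how many node plugins the servers expect versus how many are currently healthy:

# A node whose plugin task hasn't fingerprinted yet won't be counted
# as healthy and shouldn't receive volume-consuming allocations.
nomad plugin status az-disk0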

Can we force the plugin container image to be cached rather than downloaded every time?

In your plugin configuration on the clients you can pin a specific image tag; the Docker driver won't re-pull an image that's already in the local cache unless force_pull is set. You can also pre-pull the image onto the hosts yourself.
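As a sketch of one option (a pinned version tag, with force_pull left at its default):

config {
  # An image that already exists in the local Docker cache is not
  # re-pulled unless force_pull = true; pinning a version tag means
  # the cached copy stays valid across restarts.
  image      = "mcr.microsoft.com/k8s/csi/azuredisk-csi:v0.9.0"
  force_pull = false
}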

Can we create a task at the server level to start the node and controller plugins on the OS, and then register the CSI plugin on the socket address?

See my first answer above. The CSI plugin doesn't get registered until after the plugin container is running.

Can we use the Docker engine to do this, and not rely on Nomad to start these "prerequisites"?

Unfortunately that's not going to work. The directory with the CSI control socket needs to be inside the Nomad allocation directory, which Nomad creates just before the container starts.


carlosrbcunha commented Sep 30, 2020

Nice insight, thanks.
Just some related questions.
Here are our jobs; I put them together from Gitter conversations and web content. Please review to see if it's all OK.

Controller job

job "plugin-azure-disk-controller" {
  datacenters = ["dc1"]
  type = "service"

  vault {
    policies = ["nomad-jobs"]
  }

  group "controller" {
    count = 1

    # disable deployments
    update {
      max_parallel = 0
    }
    task "controller" {
      driver = "docker"

      template {
        change_mode = "noop"
        destination = "local/azure.json"
        data = <<EOH
{
"cloud":"AzurePublicCloud",
"tenantId": "{{with secret "kv/azure/credentials"}}{{.Data.ARM_TENANT_ID}}{{end}}",
"subscriptionId": "{{with secret "kv/azure/credentials"}}{{.Data.ARM_SUBSCRIPTION_ID}}{{end}}",
"aadClientId": "{{with secret "kv/azure/credentials"}}{{.Data.ARM_CLIENT_ID}}{{end}}",
"aadClientSecret": "{{with secret "kv/azure/credentials"}}{{.Data.ARM_CLIENT_SECRET}}{{end}}",
"resourceGroup": "resource-group-name",
"location": "westeurope"
}
EOH
      }

      env {
        AZURE_CREDENTIAL_FILE = "/etc/kubernetes/azure.json"
      }

      config {
        image = "mcr.microsoft.com/k8s/csi/azuredisk-csi:v0.9.0"

        volumes = [
          "local/azure.json:/etc/kubernetes/azure.json"
        ]

        args = [
          "--nodeid=${attr.unique.hostname}-vm",
          "--endpoint=unix://csi/csi.sock",
          "--logtostderr",
          "--v=5",
        ]
      }

      csi_plugin {
        id        = "az-disk0"
        type      = "controller"
        mount_dir = "/csi"
      }

      resources {
        memory = 256
      }

      # ensuring the plugin has time to shut down gracefully
      kill_timeout = "2m"
    }
  }
}

Node job

job "plugin-azure-disk-nodes" {
  # NOTE: job name assumed; the opening line was missing from the paste
  datacenters = ["dc1"]

  vault {
    policies = ["nomad-jobs"]
  }

  # you can run node plugins as service jobs as well, but this ensures
  # that all nodes in the DC have a copy.
  type = "system"

  group "nodes" {
    task "node" {
      driver = "docker"

      template {
        change_mode = "noop"
        destination = "local/azure.json"
        data = <<EOH
{
"cloud":"AzurePublicCloud",
"tenantId": "{{with secret "kv/azure/credentials"}}{{.Data.ARM_TENANT_ID}}{{end}}",
"subscriptionId": "{{with secret "kv/azure/credentials"}}{{.Data.ARM_SUBSCRIPTION_ID}}{{end}}",
"aadClientId": "{{with secret "kv/azure/credentials"}}{{.Data.ARM_CLIENT_ID}}{{end}}",
"aadClientSecret": "{{with secret "kv/azure/credentials"}}{{.Data.ARM_CLIENT_SECRET}}{{end}}",
"resourceGroup": "resource-group-name",
"location": "westeurope"
}
EOH
      }

      env {
        AZURE_CREDENTIAL_FILE = "/etc/kubernetes/azure.json"
      }

      config {
        image = "mcr.microsoft.com/k8s/csi/azuredisk-csi:v0.9.0"

        volumes = [
          "local/azure.json:/etc/kubernetes/azure.json"
        ]

        args = [
          "--nodeid=${attr.unique.hostname}-vm",
          "--endpoint=unix://csi/csi.sock",
          "--logtostderr",
          "--v=5",
        ]

        # node plugins must run as privileged jobs because they
        # mount disks to the host
        privileged = true
      }

      csi_plugin {
        id        = "az-disk0"
        type      = "node"
        mount_dir = "/csi"
      }

      resources {
        memory = 256
      }

      # ensuring the plugin has time to shut down gracefully
      kill_timeout = "2m"
    }
  }
}

We will try to implement a controlled shutdown: a drain without forcing system jobs out, and after that stop these jobs. Will that, in your opinion, solve the plugin de-registration issue?
We are using Nomad for development clusters with dev workloads that are shut down at the end of the day. The next morning, with CSI, everything is down. We are trying to get this stable. Any thoughts?

Sometimes the plugin stats get strange, with odd numbers like more nodes expected than nodes present.

Regarding azuredisk, you still do not have any documentation. It is very frustrating to only see documentation for k8s and have to port everything to Nomad. Please add more examples.

Member

tgross commented Sep 30, 2020

Please review to see if it's all OK.

I don't have an Azure environment handy to verify them, but those jobs look reasonable. Depending on the size of your deployment, I'd consider running multiple controllers spread across hosts so that you're less likely to have one of them on a node that you're in the middle of draining.
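For illustration, that could look like this in the controller group (the count and constraint here are illustrative, not required values):

group "controller" {
  count = 2

  # Keep controller allocations on different clients so a single
  # node drain can't take out every controller at once.
  constraint {
    operator = "distinct_hosts"
    value    = "true"
  }

  # ... rest of the group as in the job above ...
}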

We will try to implement a controlled shutdown: a drain without forcing system jobs out, and after that stop these jobs. Will that, in your opinion, solve the plugin de-registration issue?

It should. See this note in the csi_plugin docs:

Note: During node drains, jobs that claim volumes must be moved before the node or monolith plugin for those volumes. You should run node or monolith plugins as system jobs and use the -ignore-system flag on nomad node drain to ensure that the plugins are running while the node is being drained.
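Concretely, the drain described in that note looks something like this (the node ID is a placeholder):

# Drain regular workloads off the node while leaving system jobs,
# including the CSI node plugin, running until the drain completes.
nomad node drain -enable -ignore-system -yes <node-id>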

For your dev environment:

We are using Nomad for development clusters with dev workloads that are shut down at the end of the day. The next morning, with CSI, everything is down. We are trying to get this stable. Any thoughts?

When the servers come up (or if they're left running), they won't have had a heartbeat from the client agents in however many hours and will consider them lost. That'll result in them trying to reschedule all the workloads, including the CSI plugins. If you want to shut down the whole cluster, you'll definitely need to shut down all the volume-consuming workloads first so that the volume claims get released. You should avoid #8609 because you're shutting down rather than draining off to another client.

That being said, taking off my Nomad developer hat for a moment: if you're running on cloud infrastructure anyway, it might be worth your while to automate recreating the developer environment from scratch every day. I've done that sort of thing to prevent developers from leaving all kinds of undocumented things lying around development environments that later bite you when you get to production.

Sometimes the plugin stats get all strange with odd numbers like having more nodes expected than nodes present.

We have a couple of open bugs on that (#8948, #8628, #8034). There's also some eventual consistency because it takes a while for plugins to come live: #7296. I've got some time coming up in the next couple weeks to work on some of those items.

Regarding azuredisk, you still do not have any documentation. It is very frustrating to only see documentation for k8s and have to port everything to Nomad. Please add more examples.

Please keep in mind that Nomad's CSI integration is still marked as beta, so you'll need to do a bit of lifting here. There are around 100 different CSI drivers, so we won't be able to provide documentation for them all. A few other folks have given it a go; search through the issues for storage and Azure.

@carlosrbcunha
Author

Following our research, we implemented some guardrails in our pipeline to stop the jobs and check that the disks detach normally. Having a controlled shutdown and startup did the trick.
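For anyone landing here later, such guardrails can be sketched roughly like this (the job names and volume ID are placeholders, and the exact nomad volume status output text may vary by Nomad version):

#!/bin/sh
# 1. Stop the volume-consuming workloads first so their claims release.
nomad job stop -yes my-app

# 2. Wait until the volume reports no remaining allocations.
until nomad volume status my-volume | grep -q 'No allocations'
do
  sleep 5
done

# 3. Only then stop the plugin jobs, node plugin last.
nomad job stop -yes plugin-azure-disk-controller
nomad job stop -yes plugin-azure-disk-nodes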

My colleagues and I will continue to ask questions and contribute code or documentation so that it gets easier for the next person.
Funny thing: the issues you pointed out (storage and Azure) are mine ;)

Thanks for all your help, you are doing a great job.

Member

tgross commented Oct 1, 2020

Funny thing: the issues you pointed out (storage and Azure) are mine ;)

Ha ha! I totally missed that 😆


github-actions bot commented Nov 1, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 1, 2022