CSI volume per_alloc availability zone placement #11778
Hi @ygersie! Yeah, unfortunately you'll probably want to namespace the volume names by AZ as well if you're using the per_alloc option.
Hi @tgross thanks for the quick response! Forgive my ignorance, could you elaborate a bit on what you mean by namespacing the volumes per AZ?
The jobspec you have crosses AZs, so the allocations are being placed based on binpacking across those AZs. So as a very hacky workaround, you can name the volumes per AZ (as in the jobspec in the next comment).
@tgross thanks, yeah it's not pretty. In case someone runs into the same issue, I came up with the following for now:

variable "datacenters" {
  type    = list(string)
  default = ["us-west-2a", "us-west-2b", "us-west-2c"]
}

variable "instance_count" {
  type    = number
  default = 3
}

locals {
  nr_of_dcs   = length(var.datacenters)
  alloc_to_az = { for i, dc in sort(var.datacenters) : i => dc }
}

job "example" {
  type        = "service"
  region      = "us-west-2"
  datacenters = var.datacenters

  dynamic "group" {
    for_each = range(var.instance_count)
    iterator = alloc
    labels   = ["example-${alloc.key}"]

    content {
      volume "ebs" {
        type = "csi"
        # creates volume source ids like:
        #   us-west-2a-myvolume[0]
        #   us-west-2b-myvolume[1]
        source          = "${local.alloc_to_az[alloc.key % local.nr_of_dcs]}-myvolume[${alloc.key}]"
        read_only       = false
        attachment_mode = "file-system"
        access_mode     = "single-node-writer"

        mount_options {
          fs_type = "ext4"
        }
      }

      task "example" {
        driver = "docker"

        config {
          image = "alpine"
          args  = ["tail", "-f", "/dev/null"]
        }

        volume_mount {
          volume      = "ebs"
          destination = "/data"
          read_only   = false
        }

        resources {
          cpu    = 100
          memory = 64
        }
      }
    }
  }
}

At least this makes it easier to schedule a different number of instances in a single job.
Just a heads up that I'm actively working on #7669, which will resolve this issue.
Resolved by #12129, which will ship in Nomad 1.3.0.
Awesome stuff @tgross!
Hey @tgross, I just got time to test this out. Volume specs:

id = "ygersie[0]"
name = "ygersie[0]"
namespace = "mynamespace"
type = "csi"
plugin_id = "aws-ebs-us-west-2a"
external_id = "vol-asdfasdfasdfasdfasdf"
capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}

mount_options {
  fs_type = "ext4"
}

topology_request {
  required {
    topology {
      segments {
        "topology.ebs.csi.aws.com/zone" = "us-west-2a"
      }
    }
  }
}

and the second:

id = "ygersie[1]"
name = "ygersie[1]"
namespace = "mynamespace"
type = "csi"
plugin_id = "aws-ebs-us-west-2b"
external_id = "vol-asdfasdfasdfasdfasdf"
capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}

mount_options {
  fs_type = "ext4"
}

topology_request {
  required {
    topology {
      segments {
        "topology.ebs.csi.aws.com/zone" = "us-west-2b"
      }
    }
  }
}

and the job:

job "ygersie" {
region = "us-west-2"
datacenters = ["us-west-2a", "us-west-2b", "us-west-2c", "us-west-2d"]
namespace = "mynamespace"
group "example" {
count = 2
volume "ebs" {
type = "csi"
source = "ygersie"
read_only = false
per_alloc = true
attachment_mode = "file-system"
access_mode = "single-node-writer"
mount_options {
fs_type = "ext4"
}
}
task "example" {
driver = "docker"
config {
image = "alpine"
args = ["tail", "-f", "/dev/null"]
}
volume_mount {
volume = "ebs"
destination = "/data"
}
resources {
cpu = 100
memory = 64
}
}
}
}

The plan is sometimes completely fine and sometimes still shows:

$ nomad job plan example-job-volume.hcl
+ Job: "ygersie"
+ Task Group: "example" (2 create)
+ Task: "example" (forces create)
Scheduler dry-run:
- WARNING: Failed to place all allocations.
Task Group "example" (failed to place 1 allocation):
* Class "default": 1 nodes excluded by filter
* Constraint "CSI plugin aws-ebs-us-west-2b is missing from client 17224e99-393e-16d2-aa0a-329ff47ca63b": 1 nodes excluded by filter
Job Modify Index: 0
To submit the job with version verification run:
nomad job run -check-index 0 example-job-volume.hcl
When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

If I do a job run, it'll fail initially, and on the second attempt it seems to schedule correctly. However, I wonder if it is just luck that the job schedules on the second attempt due to the low number of (test) nodes I have, or if there's a difference between the plan and the evaluation generated on job creation. As long as Nomad guarantees it will be placed once the job is created, I can get away with calling this a "cosmetic" bug. I'm still using the plugin_id with the AZ in there, as this is how we currently have our plugins deployed; I wouldn't expect this to be related to the issue I'm seeing here.
Hi @ygersie! That message bubbles up from the per-node feasibility check (ref feasible.go#L305-L309), which looks like this:

plugin, ok := n.CSINodePlugins[vol.PluginID]
if !ok {
	return false, fmt.Sprintf(FilterConstraintCSIPluginTemplate, vol.PluginID, n.ID)
}

The reason you're seeing different behaviors on different runs is most likely because the scheduler shuffles the list of nodes. For each allocation, the list is feasibility-checked (filtered) and scored until we either have at least 1 (and at most 2) viable node scores or we run out of nodes. So suppose we have (nodeA, nodeB, nodeC, nodeD), one for each DC. Our example[0] has volume[0] in zone A, and example[1] has volume[1] in zone B. One possible iteration through the scheduler might look like this:
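A heavily simplified, standalone Go sketch of that iteration (illustrative only, not Nomad's actual scheduler code; the stop-after-two-candidates behavior is as described above, and the node and plugin names are placeholders):

```go
package main

import (
	"fmt"
	"math/rand"
)

type node struct {
	id      string
	plugins map[string]bool // CSI node plugins registered on this client
}

// hasPlugin mirrors the feasibility check quoted above: a node is only
// feasible for the volume if the volume's plugin is registered on it.
func hasPlugin(n node, pluginID string) bool {
	return n.plugins[pluginID]
}

func main() {
	// One client per DC, each running only its own zone's EBS plugin.
	nodes := []node{
		{"nodeA", map[string]bool{"aws-ebs-us-west-2a": true}},
		{"nodeB", map[string]bool{"aws-ebs-us-west-2b": true}},
		{"nodeC", map[string]bool{"aws-ebs-us-west-2c": true}},
		{"nodeD", map[string]bool{"aws-ebs-us-west-2d": true}},
	}

	// example[0] claims volume[0] (zone A plugin), example[1] claims
	// volume[1] (zone B plugin).
	placements := []struct{ alloc, plugin string }{
		{"example[0]", "aws-ebs-us-west-2a"},
		{"example[1]", "aws-ebs-us-west-2b"},
	}

	for _, p := range placements {
		// The scheduler shuffles the node list for each placement...
		shuffled := append([]node(nil), nodes...)
		rand.Shuffle(len(shuffled), func(i, j int) {
			shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
		})

		// ...then filters and scores nodes until it has 1-2 viable
		// candidates or runs out of nodes entirely.
		var viable []string
		for _, n := range shuffled {
			if !hasPlugin(n, p.plugin) {
				continue // "CSI plugin ... is missing from client ..."
			}
			viable = append(viable, n.id)
			if len(viable) == 2 {
				break
			}
		}
		fmt.Printf("%s needs %s -> viable: %v\n", p.alloc, p.plugin, viable)
	}
}
```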
The failure you're getting suggests that one of the zones doesn't have a plugin (much less a healthy one). So when you say:

"I'm still using the plugin_id with the AZ in there as this is how we currently have our plugins deployed. I wouldn't expect this to be related to the issue I'm seeing here."

I think this may unexpectedly still be a factor. If it were topology-related, I'd expect to see "did not meet topology requirement" instead. I'll re-open this issue while we debug this. Some questions on your test environment that might help narrow down the behavior:

- Can you verify that the volumes were created in the correct AWS AZ as their topology says?
- Do you have more than one node in each DC with the running plugins?
- Do any nodes in the DC not have a plugin?
- Which DC is 17224e99-393e-16d2-aa0a-329ff47ca63b in? It would be interesting to see whether that was a DC with a different plugin or a DC without a plugin at all.
Hey @tgross

"Can you verify that the volumes were created in the correct AWS AZ as their topology says?"

They are, otherwise it would never work as planned, but here is the confirmation:

$ aws ec2 describe-volumes --volume-ids vol-0f53db9b7f68bca2c vol-07aae447a2bec7826 --query 'Volumes[].[VolumeId,AvailabilityZone]'
[
[
"vol-07aae447a2bec7826",
"us-west-2b"
],
[
"vol-0f53db9b7f68bca2c",
"us-west-2a"
]
]

and the volumes in Nomad:

$ nomad volume status -json ygersie[0] | jq -r '"\(.ExternalID)\n\(.Topologies)"'
vol-0f53db9b7f68bca2c
[{"Segments":{"topology.ebs.csi.aws.com/zone":"us-west-2a"}}]
$ nomad volume status -json ygersie[1] | jq -r '"\(.ExternalID)\n\(.Topologies)"'
vol-07aae447a2bec7826
[{"Segments":{"topology.ebs.csi.aws.com/zone":"us-west-2b"}}] Do you have more than one node in each DC with the running pluginsI do have more than one node running with the plugins, here's the nomad plugin output: $ nomad plugin status -verbose
Container Storage Interface
ID Provider Controllers Healthy/Expected Nodes Healthy/Expected
aws-ebs-us-west-2a ebs.csi.aws.com 2/2 3/3
aws-ebs-us-west-2b ebs.csi.aws.com 2/2 3/3
aws-ebs-us-west-2c ebs.csi.aws.com 2/2 3/3
aws-ebs-us-west-2d ebs.csi.aws.com 2/2 3/3
aws-efs efs.csi.aws.com 2/2 12/12

I have 2 controllers running per AZ (== Nomad datacenter) and each node then runs the node plugin as a system job. All plugins are reported healthy and there is sufficient capacity for placement.

"Do any nodes in the DC not have a plugin?"

No, each node has an EBS node plugin running.

"Which DC is 17224e99-393e-16d2-aa0a-329ff47ca63b in? It would be interesting to see whether that was a DC with a different plugin or a DC without a plugin at all."

That node is running in

$ nomad plugin status -verbose aws-ebs-us-west-2a
ID = aws-ebs-us-west-2a
Provider = ebs.csi.aws.com
Version = v1.4.0
Controllers Healthy = 2
Controllers Expected = 2
Nodes Healthy = 3
Nodes Expected = 3
Controller Capabilities
ATTACH_READONLY
CLONE_VOLUME
CONTROLLER_ATTACH_DETACH
CREATE_DELETE_SNAPSHOT
CREATE_DELETE_VOLUME
EXPAND_VOLUME
GET_CAPACITY
GET_VOLUME
LIST_SNAPSHOTS
LIST_VOLUMES
LIST_VOLUMES_PUBLISHED_NODES
VOLUME_CONDITION
Node Capabilities
EXPAND_VOLUME
GET_VOLUME_STATS
STAGE_UNSTAGE_VOLUME
VOLUME_ACCESSIBILITY_CONSTRAINTS
VOLUME_CONDITION
Accessible Topologies
Node ID Accessible Topology
6095ae77 topology.ebs.csi.aws.com/zone=us-west-2a
a2fca83c topology.ebs.csi.aws.com/zone=us-west-2a
4a3e8c7c topology.ebs.csi.aws.com/zone=us-west-2a
Allocations
No allocations placed

and

$ nomad plugin status -verbose aws-ebs-us-west-2b
ID = aws-ebs-us-west-2b
Provider = ebs.csi.aws.com
Version = v1.4.0
Controllers Healthy = 2
Controllers Expected = 2
Nodes Healthy = 3
Nodes Expected = 3
Controller Capabilities
ATTACH_READONLY
CLONE_VOLUME
CONTROLLER_ATTACH_DETACH
CREATE_DELETE_SNAPSHOT
CREATE_DELETE_VOLUME
EXPAND_VOLUME
GET_CAPACITY
GET_VOLUME
LIST_SNAPSHOTS
LIST_VOLUMES
LIST_VOLUMES_PUBLISHED_NODES
VOLUME_CONDITION
Node Capabilities
EXPAND_VOLUME
GET_VOLUME_STATS
STAGE_UNSTAGE_VOLUME
VOLUME_ACCESSIBILITY_CONSTRAINTS
VOLUME_CONDITION
Accessible Topologies
Node ID Accessible Topology
e2044d7c topology.ebs.csi.aws.com/zone=us-west-2b
d45499f7 topology.ebs.csi.aws.com/zone=us-west-2b
2b5f4317 topology.ebs.csi.aws.com/zone=us-west-2b
Allocations
No allocations placed
@ygersie that's all super helpful.
Very interesting! OK, you provided a ton of info here, so I can probably write a standalone test that lets me exercise the whole scheduler with roughly the same state. Let me have a go at that; hopefully it'll come up with a reproduction and clues as to what's going wrong. Thanks again!
@tgross you're welcome, and thanks for taking a quick look; the sooner this gets resolved the better :)

$ nomad volume status ygersie[0]
ID = ygersie[0]
Name = ygersie[0]
External ID = vol-0f53db9b7f68bca2c
Plugin ID = aws-ebs-us-west-2a
Provider = ebs.csi.aws.com
Version = v1.4.0
Schedulable = true
Controllers Healthy = 2
Controllers Expected = 2
Nodes Healthy = 3
Nodes Expected = 3
Access Mode = single-node-writer
Attachment Mode = file-system
Mount Options = fs_type: ext4
Namespace = mynamespace
Topologies
Topology Segments
00 topology.ebs.csi.aws.com/zone=us-west-2a
Allocations
ID Node ID Task Group Version Desired Status Created Modified
bdb1e655 4a3e8c7c example 0 run running 8h32m ago 8h32m ago

and

$ nomad volume status ygersie[1]
ID = ygersie[1]
Name = ygersie[1]
External ID = vol-07aae447a2bec7826
Plugin ID = aws-ebs-us-west-2b
Provider = ebs.csi.aws.com
Version = v1.4.0
Schedulable = true
Controllers Healthy = 2
Controllers Expected = 2
Nodes Healthy = 3
Nodes Expected = 3
Access Mode = single-node-writer
Attachment Mode = file-system
Mount Options = fs_type: ext4
Namespace = mynamespace
Topologies
Topology Segments
00 topology.ebs.csi.aws.com/zone=us-west-2b
Allocations
ID Node ID Task Group Version Desired Status Created Modified
eb6d7301 2b5f4317 example 0 run running 8h34m ago 8h34m ago
It seems this issue even occurs with the manual AZ mapping of source volumes; this worked at least before the 1.3.0 upgrade. I have a job that is failing to be scheduled with the same constraint issue. The job has 3 task groups, each with a volume claim looking like:

"Volumes": {
"data": {
"AccessMode": "single-node-writer",
"AttachmentMode": "file-system",
"MountOptions": {
"FSType": "ext4",
"MountFlags": null
},
"Name": "data",
"PerAlloc": false,
"ReadOnly": false,
"Source": "us-west-2c-redis[2]",
"Type": "csi"
}
}

The volume status:

$ nomad volume status us-west-2c-redis[2]
ID = us-west-2c-redis[2]
Name = redis[2]
External ID = vol-asdfasdfasdfasdf
Plugin ID = aws-ebs-us-west-2c
Provider = ebs.csi.aws.com
Version = v1.4.0
Schedulable = true
Controllers Healthy = 2
Controllers Expected = 2
Nodes Healthy = 3
Nodes Expected = 3
Access Mode = single-node-writer
Attachment Mode = file-system
Mount Options = fs_type: ext4
Namespace = mynamespace
Allocations
No allocations placed

And the job status:

Placement Failure
Task Group "redis-0":
* Class "default": 1 nodes excluded by filter
* Constraint "CSI plugin aws-ebs-us-west-2a is missing from client f13406f2-1cb1-008b-ebf5-74c27f418a46": 1 nodes excluded by filter
Task Group "redis-1":
* Class "default": 1 nodes excluded by filter
* Constraint "CSI plugin aws-ebs-us-west-2b is missing from client f562d820-a589-6c02-807a-d653ebfb7b14": 1 nodes excluded by filter
Task Group "redis-2":
* Class "default": 1 nodes excluded by filter
* Constraint "CSI plugin aws-ebs-us-west-2c is missing from client 17224e99-393e-16d2-aa0a-329ff47ca63b": 1 nodes excluded by filter The node ids are indeed in completely different datacenters, so they are rightfully missing the plugin. |
Ok, I haven't had a chance to come back to write that test. But:

Just FYI on this the
I'm now also seeing this every now and then:

Placement Failure
Task Group "example":
* Class "default": 1 nodes excluded by filter
* Constraint "did not meet topology requirement": 1 nodes excluded by filter There really seems to be a problem with selecting feasible nodes. |
Ok, I've got a failing test that demonstrates the issue:
That test can be found in the
Edit: it occurred to me that this plugin ID feasibility check happens before topology entirely. So I removed topology from the plugins and volume requests, and it still fails the same way. That probably eliminates topology as the source of the issue.

Edit 2: I've run out of my timebox for this today, but in some detailed exercising of this test (with a lot of printf debugging), I'm finding that the CSI feasibility checker simply isn't getting the full set of nodes to process, which suggests there's some deeper bug lurking in the scheduler that we've been missing for a while. I saw your comment on #12748 @ygersie, so I'll pick this up again a bit later this week with that in mind.
I've just opened a PR in #13274, which should ship in Nomad 1.3.2 with backports to Nomad 1.2.x and Nomad 1.1.x. Something interesting we discovered here is that this bug has existed since we first implemented CSI, but topology constraints make the feasibility check "sparser" and therefore more likely to hit this bug. Anyone running CSI on a cluster with a lot of heterogeneity could have easily hit it as well.
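To get a rough feel for why a sparser feasibility set makes this more likely to surface, here is a toy, standalone Go simulation. The node counts and the assumption that the buggy pass only examines part of the shuffled node list are illustrative stand-ins based on the debugging notes above, not Nomad's actual scheduler logic:

```go
package main

import (
	"fmt"
	"math/rand"
)

// missRate models the failure mode described above: out of nodeCount
// clients, only feasibleCount can take the volume (e.g. one AZ out of
// four), and the buggy feasibility pass only ever examines `seen` nodes
// of the shuffled list instead of all of them.
func missRate(nodeCount, feasibleCount, seen, trials int) float64 {
	misses := 0
	for t := 0; t < trials; t++ {
		perm := rand.Perm(nodeCount) // shuffled node order for this evaluation
		found := false
		for _, idx := range perm[:seen] {
			if idx < feasibleCount { // treat nodes 0..feasibleCount-1 as feasible
				found = true
				break
			}
		}
		if !found {
			misses++
		}
	}
	return float64(misses) / float64(trials)
}

func main() {
	// Sparse feasibility (3 of 12 nodes, i.e. one AZ) misses a noticeable
	// fraction of the time.
	fmt.Printf("sparse (3/12 feasible): ~%.1f%% of placements miss\n", 100*missRate(12, 3, 6, 100000))
	// Dense feasibility (6 of 12 nodes) almost never surfaces the bug.
	fmt.Printf("dense  (6/12 feasible): ~%.1f%% of placements miss\n", 100*missRate(12, 6, 6, 100000))
}
```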
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
v1.2.3
Issue
According to the recommendation in #10793 (comment) and in the docs, the proper way to deal with scheduling volumes that are required to be mounted in the same AZ is to run a plugin controller per AZ. When using the per_alloc volume option, this doesn't work as expected. I assume this has to do with the fact that an alloc ID isn't known at scheduling time, so Nomad tries to randomly assign an alloc to a node that can't satisfy the volume requirement; this, however, conflicts with the purpose of the per_alloc directive. It also makes me wonder whether the alloc index is even supposed to be a runtime variable, as it is, as far as I can tell, determined at scheduling time.

With the following volumes and job:
The result is that placement intermittently fails: