Plan returns an erroneous in-place update in diff when task group has a constraint #10836
Comments
I just tried moving the constraint up to the job instead of the task group, and it fixes this behavior. For me, that's a decent workaround since these jobs only have a single task group anyway.
Summary: Having the constraints on the task group causes Nomad to always think there's a diff on these jobs, even when nothing has changed. I filed hashicorp/nomad#10836 about that, but in the meantime, this works around the issue.
Test Plan: I installed these apps from Nomadic locally, then installed them again and saw that it did not try to submit the jobs again.
Reviewers: matt
Reviewed By: matt
Differential Revision: https://code.home.mattmoriarity.com/D20
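To make the workaround concrete, here is a hypothetical sketch using the Nomad Go API client (a later comment in the thread notes the affected jobs are built from Go code against the Nomad API). The helper function and the job/group names are illustrative, not the actual Nomadic code:

package main

import (
	"fmt"

	"github.com/hashicorp/nomad/api"
)

// moveGroupConstraintsToJob applies the workaround described above: lift any
// constraints defined on the task groups up to the job itself. Job-level
// constraints apply to every group, so for single-group jobs placement is
// unchanged, but the plan no longer reports a spurious in-place update.
func moveGroupConstraintsToJob(job *api.Job) {
	for _, group := range job.TaskGroups {
		job.Constraints = append(job.Constraints, group.Constraints...)
		group.Constraints = nil
	}
}

func main() {
	// Hypothetical single-group job with a node constraint on the group.
	group := api.NewTaskGroup("tripplite-exporter", 1).
		Constrain(api.NewConstraint("${node.unique.name}", "=", "raspberrypi"))
	job := api.NewServiceJob("tripplite-exporter", "tripplite-exporter", "global", 70).
		AddTaskGroup(group)

	moveGroupConstraintsToJob(job)
	fmt.Printf("job constraints: %d, group constraints: %d\n",
		len(job.Constraints), len(job.TaskGroups[0].Constraints))
}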
Hi @mjm, thanks for reporting! So far I haven't been able to reproduce what you're seeing, but it might be that the group constraint isn't actually the problem. It might be helpful if you could post the output from the CLI when running the plan. Here's the job I'm submitting:
job "example" {
  datacenters = ["dc1"]

  group "sleep" {
    constraint {
      operator  = "="
      attribute = "${node.unique.name}"
      value     = "laptop"
    }

    task "sleep" {
      driver = "exec"

      config {
        command = "/bin/sleep"
        args    = ["100"]
      }
    }
  }
}
Thanks for looking at this! Here's the JSON version of one of these jobs. I have some code that does some transformations on the original HCL, so this JSON version is what actually gets applied. I have to believe that the constraint is somehow relevant (maybe it's not the whole story) because moving it up to the job does fix the problem.
{
"Region": null,
"Namespace": null,
"ID": "tripplite-exporter",
"Name": "tripplite-exporter",
"Type": "system",
"Priority": 70,
"AllAtOnce": null,
"Datacenters": [
"dc1"
],
"Constraints": null,
"Affinities": null,
"TaskGroups": [
{
"Name": "tripplite-exporter",
"Count": 1,
"Constraints": [
{
"LTarget": "${node.unique.name}",
"RTarget": "raspberrypi",
"Operand": "="
}
],
"Affinities": null,
"Tasks": [
{
"Name": "tripplite-exporter",
"Driver": "docker",
"User": "",
"Lifecycle": null,
"Config": {
"command": "/tripplite_exporter",
"image": "index.docker.io/mmoriarity/tripplite-exporter@sha256:c955272aa83f9eccfe461a8b96ef8f299e13b3cb71a7a7bcad5db6376d27ace6",
"logging": {
"config": [
{
"tag": "tripplite-exporter"
}
],
"type": "journald"
},
"mount": [
{
"source": "/dev/bus/usb",
"target": "/dev/bus/usb",
"type": "bind"
}
],
"ports": [
"http"
],
"privileged": true
},
"Constraints": null,
"Affinities": null,
"Env": {
"HOSTNAME": "${attr.unique.hostname}",
"HOST_IP": "${attr.unique.network.ip-address}",
"NOMAD_CLIENT_ID": "${node.unique.id}"
},
"Services": null,
"Resources": {
"CPU": 30,
"MemoryMB": 30,
"DiskMB": null,
"Networks": null,
"Devices": null,
"IOPS": null
},
"RestartPolicy": null,
"Meta": null,
"KillTimeout": null,
"LogConfig": null,
"Artifacts": null,
"Vault": null,
"Templates": null,
"DispatchPayload": null,
"VolumeMounts": null,
"Leader": false,
"ShutdownDelay": 0,
"KillSignal": "",
"Kind": "",
"ScalingPolicies": null
}
],
"Spreads": null,
"Volumes": null,
"RestartPolicy": null,
"ReschedulePolicy": null,
"EphemeralDisk": null,
"Update": null,
"Migrate": null,
"Networks": [
{
"Mode": "",
"Device": "",
"CIDR": "",
"IP": "",
"DNS": {
"Servers": [
"10.0.2.101"
],
"Searches": null,
"Options": null
},
"ReservedPorts": null,
"DynamicPorts": [
{
"Label": "http",
"Value": 0,
"To": 8080,
"HostNetwork": ""
}
],
"MBits": null
}
],
"Meta": null,
"Services": [
{
"Id": "",
"Name": "tripplite-exporter",
"Tags": null,
"CanaryTags": null,
"EnableTagOverride": false,
"PortLabel": "http",
"AddressMode": "",
"Checks": [
{
"Id": "",
"Name": "",
"Type": "http",
"Command": "",
"Args": null,
"Path": "/healthz",
"Protocol": "",
"PortLabel": "",
"Expose": false,
"AddressMode": "",
"Interval": 30000000000,
"Timeout": 5000000000,
"InitialStatus": "",
"TLSSkipVerify": false,
"Header": null,
"Method": "",
"CheckRestart": null,
"GRPCService": "",
"GRPCUseTLS": false,
"TaskName": "",
"SuccessBeforePassing": 3,
"FailuresBeforeCritical": 0
}
],
"CheckRestart": null,
"Connect": null,
"Meta": {
"metrics_path": "/metrics"
},
"CanaryMeta": null,
"TaskName": ""
}
],
"ShutdownDelay": null,
"StopAfterClientDisconnect": null,
"Scaling": null
}
],
"Update": null,
"Multiregion": null,
"Spreads": null,
"Periodic": null,
"ParameterizedJob": null,
"Reschedule": null,
"Migrate": null,
"Meta": null,
"ConsulToken": null,
"VaultToken": null,
"Stop": null,
"ParentID": null,
"Dispatched": false,
"Payload": null,
"VaultNamespace": null,
"NomadTokenID": null,
"Status": null,
"StatusDescription": null,
"Stable": null,
"Version": null,
"SubmitTime": null,
"CreateIndex": null,
"ModifyIndex": null,
"JobModifyIndex": null
}
Hi @mjm, so far I still haven't reproduced what you're seeing; however, I did notice one interesting thing: in your plan output we see
but when I submit a similar job, get the JSON from inspect, and submit it for planning, I always get
I don't know if that's actually related, but it seems suspicious. Did a job of this name once exist as a periodic job?
I had 3 different jobs affected by this. One is a periodic batch job, one is a batch job I trigger manually with a dispatch payload when necessary, and the other is a system job (that's the one I included here). They've all been those types of jobs from the beginning, as far as I remember. The JSON I got there is coming from some Go code that interacts with the Nomad API to plan and submit jobs, rather than the CLI.
Thanks for pointing that out @luckymike; indeed, there does seem to be a problem mixing system jobs with constraints. I'm finally able to reproduce the symptom here. In fact, all I needed to do was run my same sample job above, but on a cluster with more than one client 😬
This PR causes Nomad to no longer memoize the String value of a Constraint. The private memoized variable may or may not be initialized at any given time, which means a reflect.DeepEqual comparison between two jobs (e.g. during Plan) may return incorrect results. Fixes #10836
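To illustrate the failure mode the PR describes, here is a toy, self-contained sketch (made-up types, not Nomad's actual Constraint struct): two semantically identical values stop comparing equal under reflect.DeepEqual once one of them has populated its memoized field as a side effect.

package main

import (
	"fmt"
	"reflect"
)

// constraint is a stand-in for a struct that caches its String() result.
type constraint struct {
	LTarget string
	RTarget string
	Operand string

	str string // memoized String() value; may or may not be set
}

func (c *constraint) String() string {
	if c.str == "" {
		c.str = fmt.Sprintf("%s %s %s", c.LTarget, c.Operand, c.RTarget)
	}
	return c.str
}

func main() {
	a := &constraint{LTarget: "${node.unique.name}", RTarget: "raspberrypi", Operand: "="}
	b := &constraint{LTarget: "${node.unique.name}", RTarget: "raspberrypi", Operand: "="}

	fmt.Println(reflect.DeepEqual(a, b)) // true: neither cache is populated yet

	// Calling String() on one copy (e.g. while logging or building a diff)
	// silently mutates it...
	_ = a.String()

	// ...so the "unchanged" values now compare as different.
	fmt.Println(reflect.DeepEqual(a, b)) // false
}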
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Operating system and Environment details
Ubuntu 20.04.2 LTS
Issue
I have some deploy automation for my Nomad cluster that, for each job, first runs a plan to see whether the job has any changes that need to be applied. I've noticed that for some of my jobs, the plan always has type "Edited" even when there are no changes. If I look in the "Versions" tab for the job in the UI, it lists that version as having "0 changes".
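For illustration, a plan-then-apply step like this can be sketched with the Nomad Go API client as follows. This is a simplified, hypothetical example, not the actual automation; the maintainer's example job from earlier in the thread stands in for the real jobs, which are built from transformed HCL.

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Stand-in job: one task group with a node constraint, mirroring the
	// example HCL earlier in the thread (region and priority are defaults).
	task := api.NewTask("sleep", "exec")
	task.Config = map[string]interface{}{"command": "/bin/sleep", "args": []string{"100"}}
	group := api.NewTaskGroup("sleep", 1).
		Constrain(api.NewConstraint("${node.unique.name}", "=", "laptop")).
		AddTask(task)
	job := api.NewServiceJob("example", "example", "global", 50).AddTaskGroup(group)

	// Ask the scheduler for a plan, requesting a diff.
	resp, _, err := client.Jobs().Plan(job, true, nil)
	if err != nil {
		log.Fatal(err)
	}

	// Only register the job when the plan reports a change. For the affected
	// jobs the diff type is always "Edited", so they get resubmitted every run.
	if resp.Diff != nil && resp.Diff.Type != "None" {
		fmt.Println("changes detected; registering job")
		if _, _, err := client.Jobs().Register(job, nil); err != nil {
			log.Fatal(err)
		}
	} else {
		fmt.Println("no changes; skipping")
	}
}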
Here's an example of the response from the plan endpoint:
This is happening with 3 of my jobs, and the one thing I've noticed they all have in common is a constraint to place them only on a particular node. My other jobs don't have this constraint on a task group and aren't exhibiting this behavior.
I tried digging through the code for planning jobs but I got a bit lost trying to figure out where this kind of decision was made.
Reproduction steps
Register a job with a single task group with a constraint like the one shown above.
Request a plan for the job without any changes.
Expected Result
The plan has diff type "None" because nothing has changed.
Actual Result
The plan has diff type "Edited" due to an in-place update.
Job file (if appropriate)
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)