Plan returns an erroneous in-place update in diff when task group has a constraint #10836

Closed
mjm opened this issue Jul 1, 2021 · 8 comments · Fixed by #10896
Labels
stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/plan, type/bug

Comments

mjm commented Jul 1, 2021

Nomad version

Nomad v1.1.2 (60638a086ef9630e2a9ba1e237e8426192a44244)

Operating system and Environment details

Ubuntu 20.04.2 LTS

Issue

I have some deploy automation for my Nomad cluster that, for each job, first runs a plan to check whether the job has any changes to apply. For some of my jobs, the plan always has type "Edited" even when there are no changes. If I look in the "Versions" tab for the job in the UI, it lists that version as having "0 changes".

Here's an example of the response from the plan endpoint:

{
   "Annotations" : {
      "DesiredTGUpdates" : {
         "tripplite-exporter" : {
            "Canary" : 0,
            "DestructiveUpdate" : 0,
            "Ignore" : 0,
            "InPlaceUpdate" : 1,
            "Migrate" : 0,
            "Place" : 0,
            "Preemptions" : 0,
            "Stop" : 0
         }
      },
      "PreemptedAllocs" : null
   },
   "CreatedEvals" : null,
   "Diff" : {
      "Fields" : null,
      "ID" : "tripplite-exporter",
      "Objects" : null,
      "TaskGroups" : [
         {
            "Fields" : null,
            "Name" : "tripplite-exporter",
            "Objects" : null,
            "Tasks" : [
               {
                  "Annotations" : null,
                  "Fields" : null,
                  "Name" : "tripplite-exporter",
                  "Objects" : null,
                  "Type" : "None"
               }
            ],
            "Type" : "Edited",
            "Updates" : {
               "in-place update" : 1
            }
         }
      ],
      "Type" : "Edited"
   },
   "FailedTGAllocs" : null,
   "JobModifyIndex" : 409749,
   "NextPeriodicLaunch" : "0001-01-01T00:00:00Z",
   "Warnings" : ""
}

This is happening with 3 of my jobs, and the one thing I've noticed they all have in common is that they all have a constraint to only place them on a particular node. My other jobs are not exhibiting this behavior and don't have this constraint on a task group.

        {
          "LTarget": "${node.unique.name}",
          "RTarget": "raspberrypi",
          "Operand": "="
        }

I tried digging through the code for planning jobs but I got a bit lost trying to figure out where this kind of decision was made.

Reproduction steps

1. Register a job with a single task group with a constraint like the one shown above.
2. Request a plan for the job without any changes.

Expected Result

The plan has diff type "None" because nothing has changed.

Actual Result

The plan has diff type "Edited" due to an in-place update.
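
For illustration, here is a minimal Go sketch of the kind of check a deploy script might run against the plan response. It is not the reporter's actual automation: the `planResponse` type and the `needsSubmit` helper are hypothetical, and only the field names (`Diff`, `Type`, `TaskGroups`, `Updates`) are taken from the JSON shown above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// planResponse mirrors only the plan response fields used here.
// Hypothetical type for illustration; field names come from the JSON above.
type planResponse struct {
	Diff struct {
		Type       string
		TaskGroups []struct {
			Name    string
			Type    string
			Updates map[string]uint64
		}
	}
}

// needsSubmit reports whether the plan describes any change,
// i.e. whether the top-level diff type is anything other than "None".
func needsSubmit(raw []byte) (bool, error) {
	var p planResponse
	if err := json.Unmarshal(raw, &p); err != nil {
		return false, err
	}
	return p.Diff.Type != "None", nil
}

func main() {
	// Trimmed-down version of the plan response from this report.
	raw := []byte(`{"Diff":{"ID":"tripplite-exporter","Type":"Edited",
		"TaskGroups":[{"Name":"tripplite-exporter","Type":"Edited",
		"Updates":{"in-place update":1}}]}}`)

	changed, err := needsSubmit(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println("submit job:", changed)
}
```

With the response from this report, a check like this always decides to resubmit the job, because the diff type is "Edited" even though nothing in the job changed.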

Job file (if appropriate)

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

@mjm mjm added the type/bug label Jul 1, 2021
Author

mjm commented Jul 1, 2021

I just tried moving the constraint up to the job instead of the task group, and that fixes this behavior. For me, that's a decent workaround since these jobs only have a single task group anyway.
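
A rough Go sketch of that workaround, for anyone who generates job JSON programmatically. The `Job`, `TaskGroup`, and `Constraint` types below are simplified local stand-ins (not the Nomad API types), and `hoistConstraints` is a hypothetical helper. It only touches jobs with exactly one task group, since that is the case where moving a constraint to the job level does not change which groups it applies to.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the job structure shown in this
// thread; only the fields needed for the workaround are included.
type Constraint struct {
	LTarget string
	RTarget string
	Operand string
}

type TaskGroup struct {
	Name        string
	Constraints []Constraint
}

type Job struct {
	ID          string
	Constraints []Constraint
	TaskGroups  []TaskGroup
}

// hoistConstraints moves the constraints of a job's only task group up to
// the job itself. It returns false and leaves the job untouched when the job
// does not have exactly one task group or that group has no constraints.
func hoistConstraints(j *Job) bool {
	if len(j.TaskGroups) != 1 {
		return false
	}
	tg := &j.TaskGroups[0]
	if len(tg.Constraints) == 0 {
		return false
	}
	j.Constraints = append(j.Constraints, tg.Constraints...)
	tg.Constraints = nil
	return true
}

func main() {
	job := &Job{
		ID: "tripplite-exporter",
		TaskGroups: []TaskGroup{{
			Name: "tripplite-exporter",
			Constraints: []Constraint{{
				LTarget: "${node.unique.name}",
				RTarget: "raspberrypi",
				Operand: "=",
			}},
		}},
	}
	moved := hoistConstraints(job)
	fmt.Println(moved, len(job.Constraints), len(job.TaskGroups[0].Constraints))
}
```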

mjm added a commit to mjm/pi-tools that referenced this issue Jul 1, 2021
Summary: Having the constraints on the task group causes Nomad to always think there's a diff on these jobs, even when nothing has changed. I filed hashicorp/nomad#10836 about that, but in the meantime, this works around the issue.

Test Plan: I installed these apps from Nomadic locally, then installed them again and saw that it did not try to submit the jobs again.

Reviewers: matt

Reviewed By: matt

Differential Revision: https://code.home.mattmoriarity.com/D20
@shoenig shoenig self-assigned this Jul 1, 2021
Contributor

shoenig commented Jul 1, 2021

Hi @mjm, thanks for reporting!

So far I haven't been able to reproduce what you're seeing - but it might just not be the group constraint that's actually the problem? It might be helpful if you could post the output from the CLI when running nomad job plan -diff -verbose <job>, or just one of the symptomatic job files.

Here's the job I'm submitting:

job "example" {
  datacenters = ["dc1"]

  group "sleep" {

    constraint {
      operator  = "="
      attribute = "${node.unique.name}"
      value     = "laptop"
    }

    task "sleep" {
      driver = "exec"

      config {
        command = "/bin/sleep"
        args    = ["100"]
      }
    }
  }
}

Author

mjm commented Jul 1, 2021

Thanks for looking at this! Here's the JSON version of one of these jobs. I have some code that does some transformations on the original HCL, so this JSON version is what actually gets applied. I have to believe that the constraint is somehow relevant (maybe it's not the whole story) because moving it up to the job does fix the problem.

{
  "Region": null,
  "Namespace": null,
  "ID": "tripplite-exporter",
  "Name": "tripplite-exporter",
  "Type": "system",
  "Priority": 70,
  "AllAtOnce": null,
  "Datacenters": [
    "dc1"
  ],
  "Constraints": null,
  "Affinities": null,
  "TaskGroups": [
    {
      "Name": "tripplite-exporter",
      "Count": 1,
      "Constraints": [
        {
          "LTarget": "${node.unique.name}",
          "RTarget": "raspberrypi",
          "Operand": "="
        }
      ],
      "Affinities": null,
      "Tasks": [
        {
          "Name": "tripplite-exporter",
          "Driver": "docker",
          "User": "",
          "Lifecycle": null,
          "Config": {
            "command": "/tripplite_exporter",
            "image": "index.docker.io/mmoriarity/tripplite-exporter@sha256:c955272aa83f9eccfe461a8b96ef8f299e13b3cb71a7a7bcad5db6376d27ace6",
            "logging": {
              "config": [
                {
                  "tag": "tripplite-exporter"
                }
              ],
              "type": "journald"
            },
            "mount": [
              {
                "source": "/dev/bus/usb",
                "target": "/dev/bus/usb",
                "type": "bind"
              }
            ],
            "ports": [
              "http"
            ],
            "privileged": true
          },
          "Constraints": null,
          "Affinities": null,
          "Env": {
            "HOSTNAME": "${attr.unique.hostname}",
            "HOST_IP": "${attr.unique.network.ip-address}",
            "NOMAD_CLIENT_ID": "${node.unique.id}"
          },
          "Services": null,
          "Resources": {
            "CPU": 30,
            "MemoryMB": 30,
            "DiskMB": null,
            "Networks": null,
            "Devices": null,
            "IOPS": null
          },
          "RestartPolicy": null,
          "Meta": null,
          "KillTimeout": null,
          "LogConfig": null,
          "Artifacts": null,
          "Vault": null,
          "Templates": null,
          "DispatchPayload": null,
          "VolumeMounts": null,
          "Leader": false,
          "ShutdownDelay": 0,
          "KillSignal": "",
          "Kind": "",
          "ScalingPolicies": null
        }
      ],
      "Spreads": null,
      "Volumes": null,
      "RestartPolicy": null,
      "ReschedulePolicy": null,
      "EphemeralDisk": null,
      "Update": null,
      "Migrate": null,
      "Networks": [
        {
          "Mode": "",
          "Device": "",
          "CIDR": "",
          "IP": "",
          "DNS": {
            "Servers": [
              "10.0.2.101"
            ],
            "Searches": null,
            "Options": null
          },
          "ReservedPorts": null,
          "DynamicPorts": [
            {
              "Label": "http",
              "Value": 0,
              "To": 8080,
              "HostNetwork": ""
            }
          ],
          "MBits": null
        }
      ],
      "Meta": null,
      "Services": [
        {
          "Id": "",
          "Name": "tripplite-exporter",
          "Tags": null,
          "CanaryTags": null,
          "EnableTagOverride": false,
          "PortLabel": "http",
          "AddressMode": "",
          "Checks": [
            {
              "Id": "",
              "Name": "",
              "Type": "http",
              "Command": "",
              "Args": null,
              "Path": "/healthz",
              "Protocol": "",
              "PortLabel": "",
              "Expose": false,
              "AddressMode": "",
              "Interval": 30000000000,
              "Timeout": 5000000000,
              "InitialStatus": "",
              "TLSSkipVerify": false,
              "Header": null,
              "Method": "",
              "CheckRestart": null,
              "GRPCService": "",
              "GRPCUseTLS": false,
              "TaskName": "",
              "SuccessBeforePassing": 3,
              "FailuresBeforeCritical": 0
            }
          ],
          "CheckRestart": null,
          "Connect": null,
          "Meta": {
            "metrics_path": "/metrics"
          },
          "CanaryMeta": null,
          "TaskName": ""
        }
      ],
      "ShutdownDelay": null,
      "StopAfterClientDisconnect": null,
      "Scaling": null
    }
  ],
  "Update": null,
  "Multiregion": null,
  "Spreads": null,
  "Periodic": null,
  "ParameterizedJob": null,
  "Reschedule": null,
  "Migrate": null,
  "Meta": null,
  "ConsulToken": null,
  "VaultToken": null,
  "Stop": null,
  "ParentID": null,
  "Dispatched": false,
  "Payload": null,
  "VaultNamespace": null,
  "NomadTokenID": null,
  "Status": null,
  "StatusDescription": null,
  "Stable": null,
  "Version": null,
  "SubmitTime": null,
  "CreateIndex": null,
  "ModifyIndex": null,
  "JobModifyIndex": null
}

Contributor

shoenig commented Jul 6, 2021

Hi @mjm, so far I still haven't reproduced what you're seeing. However, I did notice one interesting thing: in your plan output we see

"NextPeriodicLaunch" : "0001-01-01T00:00:00Z",

but when I submit a similar job, get the JSON from inspect, and submit it for planning, I always get

"NextPeriodicLaunch":null,

I don't know if that's actually related, but it seems suspicious. Did a job of this name once exist as a periodic job?

Author

mjm commented Jul 7, 2021

I had 3 different jobs affected by this. One is a periodic batch job, one is a batch job I trigger manually with a dispatch payload when necessary, and the other is a system job (that's the one I included here). They've all been those types of jobs from the beginning as far as I remember.

The JSON I got there is coming from some Go code that interacts with the Nomad API to plan and submit jobs, rather than the nomad CLI tool. So that JSON is produced by json.Marshaling the JobPlanResponse type. The NextPeriodicLaunch field is a time.Time, not a pointer, so I think that value I have is just Go's zero value for that type.
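
The zero-value behavior is easy to confirm in isolation. In the sketch below, `valueResp` and `pointerResp` are hypothetical stand-ins that carry only a `NextPeriodicLaunch` field. The pointer variant is included purely for contrast and is not a claim about what produced the `null` in the other output.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// valueResp holds the time by value, so an unset field is Go's zero time.
type valueResp struct {
	NextPeriodicLaunch time.Time
}

// pointerResp holds the time by pointer, so an unset field is nil.
type pointerResp struct {
	NextPeriodicLaunch *time.Time
}

// marshal encodes v as JSON and returns it as a string.
func marshal(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(marshal(valueResp{}))   // {"NextPeriodicLaunch":"0001-01-01T00:00:00Z"}
	fmt.Println(marshal(pointerResp{})) // {"NextPeriodicLaunch":null}
}
```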

@luckymike
@shoenig this appears to be similar to the issue reported in #9804

Contributor

shoenig commented Jul 12, 2021

Thanks for pointing that out @luckymike, indeed there does seem to be a problem mixing system jobs with constraints. I'm finally able to reproduce the symptom here. In fact, all I needed to do was run my same sample job above, but on a cluster with more than one client 😬

@shoenig shoenig added the stage/accepted label and removed the stage/waiting-reply label Jul 13, 2021
shoenig added a commit that referenced this issue Jul 14, 2021
This PR causes Nomad to no longer memoize the String value of
a Constraint. The private memoized variable may or may not be
initialized at any given time, which means a reflect.DeepEqual
comparison between two jobs (e.g. during Plan) may return incorrect
results.

Fixes #10836
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 17, 2022
3 participants