Cannot get job status for jobs that are the prefix of another job #10625

Closed
dansteen opened this issue May 19, 2021 · 8 comments · Fixed by #10648

@dansteen

dansteen commented May 19, 2021

Nomad version

client:

Nomad v1.1.0 (2678c3604bc9530014208bc167415e167fd440fc)

(but is also a problem going as far back as 0.12.7 in my testing just now - although it was not a problem previously (see below))

Server:

Nomad v1.0.4 (9294f35f9aa8dbb4acb6e85fa88e3e2534a3e41a)

Operating system and Environment details

Client - Arch Linux
Server - Debian Linux (10.9)

Issue

Since we added namespaces that match our job names, we can no longer use the CLI to get status info or inspect jobs whose names are the prefix of other jobs.

To clarify:

Given a job name: rec-service-stag
and a job name: rec-service-stag-test

since we have added a namespace called rec-service-stag, the output of:

nomad job status rec-service-stag

is the list of jobs that match that prefix:

rec-service-stag                      default    service              50        running  2021-05-19T15:18:09-04:00
rec-service-stag-test                 default    batch/parameterized  50        running  2021-01-19T17:01:17-05:00

Rather than the information about the rec-service-stag job.

The output of nomad status rec-service-stag (without the word job in the command line) is information about the namespace rather than the job. If we remove the namespace, we get an error (rather than information about the job):

$ nomad status recommendation-service-stag
Unable to resolve ID: "recommendation-service-stag"

Expected Result

We used to get the status of the job that was explicitly named instead of it treating it like a prefix.

Actual Result

It treats the name as a prefix and gives us a list of jobs

Thanks!

@dansteen
Author

Ok. So I narrowed this down a bit and figured it out a bit more. If you set:

NOMAD_NAMESPACE='*'

Then, when you attempt to do a nomad job status rec-service-stag it will assume that that is a prefix, and return the list of matching jobs for that prefix. If, however, you unset NOMAD_NAMESPACE or you set NOMAD_NAMESPACE="default" (the jobs are in the default namespace in my case) then it will treat the same query as a job name and return the job status.

@tgross
Member

tgross commented May 21, 2021

Hi @dansteen! It seems like there are two behaviors here, so I want to try to discuss them separately. Note that nomad status and nomad job status are not the same command, it just happens to look like it in a lot of cases because nomad status is a bit "do what I mean":

The status command accepts any Nomad identifier or identifier prefix as its sole argument. The command detects the type of the identifier and routes to the appropriate status command to display more detailed output.

If the ID is omitted, the command lists out all of the existing jobs. This is for backwards compatibility and should not be relied on.

So that nomad status command hits a general prefix search API, and that's why you're getting the namespace first there. If you're trying to search for a job that's in a namespace and that namespace name is a prefix of the job, it's going to use the exact match first.

Now onto the nomad job status behavior. Like a lot of Nomad commands, it makes a few API calls. The client is configured with the namespace from NOMAD_NAMESPACE. When you pass a job ID, the initial API call is a prefix search job_status.go#L151, and that prefix search will be over the IDs in that namespace. In 0.12.0 85db718 we added the ability to use a wildcard *, which gives us another path by which we need to ask you as the user which job you meant (ref job_status.go#L160)

So I think what's not clear to me from what you're describing is what namespace the two jobs are in? Is the rec-service-stag job in a rec-service-stag namespace and the rec-service-stag-test job is in a rec-service-stag-test namespace? Or is one of the jobs in the default namespace?

@dansteen
Author

hi @tgross thanks for the response!

We are actually testing out namespaces at the moment. So while there is a namespace named rec-service-stag both the rec-service-stag job and the rec-service-stag-test job are in the default namespace.

@tgross
Member

tgross commented May 21, 2021

Thanks @dansteen, that's helpful. I was able to reproduce the behavior, but it turns out that having a namespace named the same as one of the jobs (or a prefix of it) was a red herring. The difference in behavior is only about the NOMAD_NAMESPACE='*' wildcard. My reproduction is below.

The cause is that this line in job_status.go#L160 takes a bit of a shortcut: it assumes that if we get more than one job back, they're in different namespaces, or at least that the match is ambiguous.

At first I was looking at this and thinking I could argue we should use the exact match if all the jobs that came back are in the same namespace. But when @notnoop implemented this, I think he was being careful to consider some nasty edge cases.

Imagine the following scenario: you have a job prod in a namespace foo, and a job prod-staging in the default namespace. If you do nomad job status prod you'd get the prod-staging job, but if you did NOMAD_NAMESPACE='*' nomad job status prod you'd now get the prod job without a prompt. (And this same logic is reused for destructive behaviors like nomad job stop!) So I'm not totally sure we'd want to change this behavior, even though it's a little bit of a surprise.
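The behavior described above can be modeled with a minimal Python sketch. This is not Nomad's Go code from job_status.go; `Job`, `prefix_search`, and `resolve_as_described` are hypothetical names, and the logic only mirrors what this thread reports: a prefix search scoped to the configured namespace, an exact match preferred for a concrete namespace, and any multi-job result treated as ambiguous under the `*` wildcard.

```python
from typing import List, NamedTuple, Optional, Union


class Job(NamedTuple):
    """Hypothetical stand-in for a job stub returned by a prefix search."""
    id: str
    namespace: str


def prefix_search(jobs: List[Job], prefix: str, namespace: str) -> List[Job]:
    """Return jobs whose ID starts with `prefix`, limited to `namespace`
    unless the wildcard "*" is used."""
    return [
        j for j in jobs
        if j.id.startswith(prefix) and (namespace == "*" or j.namespace == namespace)
    ]


def resolve_as_described(
    jobs: List[Job], job_id: str, namespace: str = "default"
) -> Optional[Union[Job, List[Job]]]:
    """Model of the behavior reported in this thread (an assumption, not
    Nomad's implementation). Returns a single Job, a list of ambiguous
    matches, or None when nothing matches."""
    matches = prefix_search(jobs, job_id, namespace)
    if not matches:
        return None
    # Reported shortcut: with the wildcard, more than one match is always
    # treated as ambiguous, even when one of them is an exact match.
    if namespace == "*" and len(matches) > 1:
        return matches
    exact = [j for j in matches if j.id == job_id]
    if len(exact) == 1:
        return exact[0]
    if len(matches) == 1:
        return matches[0]
    return matches


if __name__ == "__main__":
    jobs = [Job("prod", "foo"), Job("prod-staging", "default")]
    print(resolve_as_described(jobs, "prod", "default"))  # the prod-staging job
    print(resolve_as_described(jobs, "prod", "*"))        # both jobs, needs a prompt
```

In the `prod` / `prod-staging` scenario, the default namespace resolves to the only prefix match, while the wildcard returns both jobs so the user has to choose.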


To reproduce: set up a namespace and two jobs, with prefix-sharing names.

$ nomad namespace apply staging
Successfully applied namespace "staging"!

$ nomad job run ./staging.nomad
==> Monitoring evaluation "76367077"
    Evaluation triggered by job "staging"
    Allocation "3f62bb8f" created: node "edf6f6c1", group "web"
==> Monitoring evaluation "76367077"
    Evaluation within deployment: "dbaf1612"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "76367077" finished with status "complete"

$ nomad job run ./staging-test.nomad
==> Monitoring evaluation "8a3927f9"
    Evaluation triggered by job "staging-test"
    Allocation "db932199" created: node "edf6f6c1", group "web"
==> Monitoring evaluation "8a3927f9"
    Evaluation within deployment: "6024f1fa"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "8a3927f9" finished with status "complete"

The nomad status command works as expected, regardless of the NOMAD_NAMESPACE variable:

$ nomad status
ID            Type     Priority  Status   Submit Date
staging       service  50        running  2021-05-21T14:23:00-04:00
staging-test  service  50        running  2021-05-21T14:23:05-04:00

$ NOMAD_NAMESPACE='*' nomad status
ID            Namespace  Type     Priority  Status   Submit Date
staging       default    service  50        running  2021-05-21T14:23:00-04:00
staging-test  default    service  50        running  2021-05-21T14:23:05-04:00

$ NOMAD_NAMESPACE=default nomad status
ID            Type     Priority  Status   Submit Date
staging       service  50        running  2021-05-21T14:23:00-04:00
staging-test  service  50        running  2021-05-21T14:23:05-04:00

The nomad job status command in list mode works as expected, regardless of the NOMAD_NAMESPACE:

$ nomad job status
ID            Type     Priority  Status   Submit Date
staging       service  50        running  2021-05-21T14:23:00-04:00
staging-test  service  50        running  2021-05-21T14:23:05-04:00

$ NOMAD_NAMESPACE='*' nomad job status
ID            Type     Priority  Status   Submit Date
staging       service  50        running  2021-05-21T14:23:00-04:00
staging-test  service  50        running  2021-05-21T14:23:05-04:00

It's where we get into the specific nomad job status :id that things go slightly awry. Without the NOMAD_NAMESPACE set (or with it set to "default"):

$ nomad job status staging
ID            = staging
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
3f62bb8f  edf6f6c1  web         0        run      running  45s ago  33s ago

$ nomad job status staging-test
ID            = staging-test
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
db932199  edf6f6c1  web         0        run      running  41s ago  30s ago

But with the wildcard set, we get the "Prefix matched multiple jobs" when the job is a prefix:

$ NOMAD_NAMESPACE='*' nomad job status staging
Prefix matched multiple jobs

ID            Namespace  Type     Priority  Status   Submit Date
staging       default    service  50        running  2021-05-21T14:23:00-04:00
staging-test  default    service  50        running  2021-05-21T14:23:05-04:00

$ NOMAD_NAMESPACE='*' nomad job status staging-test
ID            = staging-test
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
db932199  edf6f6c1  web         0        run      running  53s ago  42s ago

Arguably the "Prefix matched multiple jobs" should be the correct behavior here, and that it's wrong when the NOMAD_NAMESPACE isn't set to "*". The staging name is a prefix of the staging-test name! But that's not how Nomad has historically done it.

But as it turns out, the fact that your namespace name overlaps the job ID is a red herring. If we delete the namespace, we see the exact same behavior:

$ nomad namespace delete staging
Successfully deleted namespace "staging"!

$ nomad job status staging
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
ae8c682b  dec429cc  web         0        run      running  23s ago  11s ago

$ NOMAD_NAMESPACE='*' nomad job status staging
Prefix matched multiple jobs

ID            Namespace  Type     Priority  Status   Submit Date
staging       default    service  50        running  2021-05-21T14:31:45-04:00
staging-test  default    service  50        running  2021-05-21T14:31:41-04:00

@dansteen
Author

dansteen commented May 21, 2021

hi @tgross thanks for the super detailed response! Some thoughts:

I agree that the namespace being a prefix was a red herring. thanks for refining that down.

Arguably the "Prefix matched multiple jobs" should be the correct behavior here, and that it's wrong when the NOMAD_NAMESPACE isn't set to "*". The staging name is a prefix of the staging-test name! But that's not how Nomad has historically done it.

I'm not sure I'm following here. I would argue the exact opposite. "Prefix matched multiple jobs" should only show when the string does not also match a specific job. That's the way nomad has always done it, and that's the behavior that I would argue should happen here as well - as it's extremely confusing if commands change their semantics based on the content of ENV variables.

Assuming two jobs staging and staging-test - both of which are in the default namespace - the current behavior will display different results for the following two commands:

NOMAD_NAMESPACE='default' nomad job status staging

NOMAD_NAMESPACE='*' nomad job status staging

The first will display the job while the second will display the "Prefix matched multiple jobs". But the intent and semantics of what I have requested are exactly the same. I'm increasing my search area to multiple namespaces, but I still want that specific job if it exists or a list of prefixes if it does not.
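The rule argued for here (return the specifically named job if it exists, otherwise fall back to the list of prefix matches, regardless of how many namespaces are searched) can be sketched in Python as follows. The names `Job` and `resolve_exact_first` are hypothetical, and this is not necessarily what Nomad or the eventual fix implements. One added assumption: if the same exact ID exists in several namespaces, the result is still treated as ambiguous.

```python
from typing import List, NamedTuple, Optional, Union


class Job(NamedTuple):
    """Hypothetical stand-in for a job stub returned by a prefix search."""
    id: str
    namespace: str


def resolve_exact_first(
    jobs: List[Job], job_id: str, namespace: str = "default"
) -> Optional[Union[Job, List[Job]]]:
    """Prefer a unique exact ID match over prefix matches, with the same
    semantics for a concrete namespace and for the "*" wildcard."""
    matches = [
        j for j in jobs
        if j.id.startswith(job_id) and (namespace == "*" or j.namespace == namespace)
    ]
    if not matches:
        return None
    exact = [j for j in matches if j.id == job_id]
    if len(exact) == 1:
        # A single job carries exactly this ID: return it without a prompt.
        return exact[0]
    if len(exact) > 1:
        # Assumption: the same ID in several namespaces stays ambiguous.
        return exact
    if len(matches) == 1:
        return matches[0]
    return matches


if __name__ == "__main__":
    jobs = [Job("staging", "default"), Job("staging-test", "default")]
    # Both calls resolve to the same job under this rule.
    print(resolve_exact_first(jobs, "staging", "default"))
    print(resolve_exact_first(jobs, "staging", "*"))
```

Under this rule the two commands above give the same answer, and a partial ID such as `stag` still yields the list of candidates.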

Going backwards a bit in your post:

Imagine the following scenario: you have a job prod in a namespace foo, and a job prod-staging in the default namespace. If you do nomad job status prod you'd get the prod-staging job, but if you did NOMAD_NAMESPACE='*' nomad job status prod you'd now get the prod job without a prompt. (And this same logic is reused for destructive behaviors like nomad job stop!) So I'm not totally sure we'd want to change this behavior, even though it's a little bit of a surprise.

Please please please don't fall into the trap of trying to protect me from myself. If I run a command it's better to assume that I meant what I typed than to change the expected output in an effort to prevent me from shooting myself in the foot. If you really feel that you must be protective, add in confirmation warnings (with a command line -y flag style override!) - but don't do the unexpected in an effort to protect me - that's just going to cause more trouble! Doing something unexpected is always bad.

/endrant/ :-)

Thanks!

@tgross
Member

tgross commented May 24, 2021

I'm not sure I'm following here. I would argue the exact opposite. "Prefix matched multiple jobs" should only show when the string does not also match a specific job. That's the way nomad has always done it, and that's the behavior that I would argue should happen here as well

Right, sorry... what I was trying to say there was that, supposing we were writing the feature for the first time, I'd probably recommend we do what the NOMAD_NAMESPACE='*' behavior is doing now, rather than what Nomad has done historically. But given what Nomad has always done, I agree we should be fixing the new behavior so it matches the existing one.

Please please please don't fall into the trap of trying to protect me from myself. If I run a command it's better to assume that I meant what I typed than to change the expected output in an effort to prevent me from shooting myself in the foot.

I totally respect that viewpoint! But if we were to take it to its logical conclusion, we wouldn't have prefix matching at all and would just return an error if the job ID didn't match exactly. Prefix matching is already a "do what I mean" behavior 😀

Fix is in PR #10648

@tgross
Member

tgross commented May 24, 2021

I've merged #10648 and that will ship in the next patch release of Nomad. Thanks again for opening this @dansteen!

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 19, 2022