Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow wildcard datacenters to be specified in job file #11170

Merged
merged 18 commits into from
Feb 2, 2023

Conversation

jmwilkinson
Copy link
Contributor

@jmwilkinson jmwilkinson commented Sep 10, 2021

Overview

This PR addresses the issue #9024

It uses the filepath.Match golang std lib to accomplish the matching, which is somewhat simpler and more intuitive than using a regex. Because of that simplicity, it should be minimally impactful.

It is not completely backwards compatible, as datacenters with "*", "]", "?" in their name may be impacted. I do not believe the requested feature could be implemented without a theoretical compatibility break, or an entirely new property (which feels worse).

Notes

The docs do not appear to be part of the project, at least not that I could see, so they would need to be updated as well.

The returned list of datacenters will no longer have keys with a node count of 0. This is because we cannot a priori determine the number of datacenters based on wildcard dc specs. So we have to allocate a map, then add and increment each dc as we match it.

I do not know what, if any, impact this will have.

Example

job "test" {
    name = "test"

    datacenters = ["dc1", "site.*", "*.site", "dc?"]

    type = "system"

    group "test" {
        task "test" {
            driver = "docker"

            config {
                image = "redis"
            }
        }
    }
}

Interactions with other features

Spread

There was a comment in the feature request thread about the interaction with the spread stanza. As far as I can tell, the nodes are set on the stack based on the results of the readyNodesInDCs function, and the stack then uses those nodes when computing selections, both for spread and for bin packing. So it seems like it should just work, and my limited testing with spread has done what was expected, but I could well be missing something.

multiregion.region[*].datacenters

There was another comment about parity with multiregion.region[*].datacenters for interpolation reasons. While I'd image the job spec that is received by the scheduler, which has the Datacenters property used by readyNodesInDCs, has that property set to the applicable region value, or the default job value, by the time it reaches the function, I have been unable to confirm that.

As multiregion requires Nomad Enterprise, I have also been unable to test it.

@hashicorp-cla
Copy link

hashicorp-cla commented Sep 10, 2021

CLA assistant check
All committers have signed the CLA.

@vercel vercel bot temporarily deployed to Preview – nomad September 10, 2021 18:46 Inactive
@jmwilkinson jmwilkinson changed the title Jwilkinson wildcard dc Allow wildcard datacenters to be specified in job file Sep 10, 2021
@vercel vercel bot temporarily deployed to Preview – nomad September 10, 2021 19:30 Inactive
@jmwilkinson jmwilkinson changed the title Allow wildcard datacenters to be specified in job file Draft: Allow wildcard datacenters to be specified in job file Sep 10, 2021
@jmwilkinson jmwilkinson changed the title Draft: Allow wildcard datacenters to be specified in job file Allow wildcard datacenters to be specified in job file Sep 10, 2021
@jrasell
Copy link
Member

jrasell commented Sep 13, 2021

Hi @jmwilkinson and thanks so much for this. I have marked this for our next major release meaning we won't review this immediately, but will perform this once that development cycle is open to merging into main.

@jrasell jrasell added this to the 1.2.0 milestone Sep 13, 2021
@elqueffo
Copy link

We would love to have this ASAP; is there a date set for 1.2 release?

@schmichael
Copy link
Member

schmichael commented Oct 13, 2021

The Nomad team loves this feature but have some concerns that need to be addressed before we merge it:

filepath.Match

We should not use filepath.Match for this feature because it is designed for use in paths and stops at separators. Since Windows and Unix have different path separators this could cause different interpretations depending on the Nomad server's OS! Even ignoring that the behavior might be surprising for people using / or \ in their datacenters.

Another factor is that we will likely need to implement this same logic in the UI so we can render the valid datacenters for a job even if it is using globbing. filepath.Match has more features than #9024 requires, so it would be extra burden in the UI for little benefit.

What if we only allowed * globbing and no character escaping? That greatly simplifies the logic to reimplement and hopefully makes the querying work as people expect.

Client Validation

I think we should restrict the use of special characters in datacenter names. We could allow escaping them to disambiguate between foo* (matches foo, foobar, and foo*) and foo\\* (matches only foo*). I think restricting is the better choice as it keeps the query syntax simple and obvious while not meaningfully restricting how people can use datacenter.

Sadly we currently only restrict datacenter from using null bytes (\000), so users are welcome to use * in datacenter names. We should restrict the characters used in datacenter names like we do for null bytes: https://github.com/hashicorp/nomad/blob/v1.1.6/command/agent/command.go#L317-L321

Upgrading

We need to inform users of the datacenter restrictions in our upgrade guide documentation: https://www.nomadproject.io/docs/upgrade/upgrade-specific

While we normally try to avoid breaking backward compatibility in this way, I think it's very unlikely people use * in datacenter names. Therefore I think this should be a small disruption.

multiregion.region[*].datacenters

Since this is an Enterprise Only feature the Nomad team would have to implement it. This isn't a problem and shouldn't prevent the merging of this improvement.

Testing

In particular we should test that a system job scheduled against datacenter = "foo*" gets scheduled against nodes added later with matching datacenters (eg datacenter = "foobar").

@jmwilkinson
Copy link
Contributor Author

That's a good point about filepath.Match being designed for paths, totally valid.

As an alternative, what about this linear time glob function which Russ Cox based the filepath.Match function on? https://research.swtch.com/glob

I've dropped it in an it works well. It's also easy enough to remove the single character matching block but I think that is useful, especially as it keeps it in line with the original glob behavior which devs are most familiar with.

I added the validation as well. I'm trying to change as little as possible, but with respect to the validation, I think it would be valuable to report on precisely which characters are invalid. Of course, that requires more changes and more work.

I've also tried to add some documentation, but I'm quite sure how it should look or what the process is for that, so I'll remove it or update it as necessary.

@lgfa29 lgfa29 removed this from the 1.2.0 milestone Nov 1, 2021
@dpogorzelski
Copy link

This change is quite useful, could a specific target milestone be set to know when to expect it? :)

@mikenomitch
Copy link
Contributor

Hey @dpogorzelski, sorry about the delay on this. Internally we agree its a good feature, but there are some tweaks we have to make to get this merged - the UI and client updates that @schmichael mentioned.

Since this would include breaking changes, we'll need to include this in a major release, and unfortunately I don't think we'll be able to sneak it into 1.3. Once we ship 1.3, I'll revisit this to see if we can queue it up early in the 1.4 cycle. I'll update publicly if we're committing to it for that release.

Copy link
Member

@schmichael schmichael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the slow replies. We still want to get this in, but probably only in a 1.X.0 release like 1.3.0 or 1.4.0. We need to make sure we have time to add UI support.

scheduler/datacenters.go Outdated Show resolved Hide resolved
scheduler/util.go Outdated Show resolved Hide resolved
website/content/docs/job-specification/job.mdx Outdated Show resolved Hide resolved
@kinnalru
Copy link

Is there any updates?

@mikenomitch
Copy link
Contributor

Hey @kinnalru, this update is still the latest - #11170 (comment)

Unfortunately, this won't make 1.3 (beta coming in a few weeks!), but we'll try to queue it up early in the 1.4 cycle so it can definitely make that major release.

@dpogorzelski
Copy link

It seems like this one is still not part of the 1.4.0 milestone. Is that intentional? :)

@mikenomitch mikenomitch added this to the 1.5.0 milestone Dec 22, 2022
@tgross tgross self-assigned this Feb 1, 2023
@tgross
Copy link
Member

tgross commented Feb 1, 2023

Hi folks! 👋 We've had a few requests to try to ship this PR. Because it's got some related backwards compatibility warnings, we're going to ship this in the upcoming Nomad 1.5.0. I've rebased this PR on main and today I'll be addressing the remaining comments to land the great work that @jmwilkinson has already contributed.

@tgross
Copy link
Member

tgross commented Feb 1, 2023

I've updated the PR to pick up all the review comments, added documentation, fixed up some tests, and make sure that we can detect node updates correctly. I've had a look at the multiregion bits and it doesn't look like there's anything to actually do there, but I'll verify that once we've merged this PR and had the OSS->ENT sync run.

There's one open discussion we're having internally about whether or not we should have a default datacenters = ["*"] value. If we decide we don't want that (it can be messy for some users we know of who are using datacenters in unusual ways), I'll back that part out.

Copy link
Contributor

@lgfa29 lgfa29 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor note on the upgrade path, but I'm not sure we need that much detail.

website/content/docs/upgrade/upgrade-specific.mdx Outdated Show resolved Hide resolved
website/content/docs/job-specification/job.mdx Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

Successfully merging this pull request may close these issues.

10 participants