windows: fix inefficient gathering of task processes #20619
LGTM!
Shows what a leetcode problem this is that I immediately started thinking about memoizing by checking if the candidate.PPid() was already in family ... but for that to be at all meaningful the individual tasks would have to have so many processes.
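For readers wondering what that memoization might look like, here is a rough sketch. It is not the code in this PR: `inFamily`, the `parent` map (PID to parent PID), and the `memo` cache are hypothetical names, and the snapshot is a plain map instead of the real Windows process table.

```go
package main

import "fmt"

// inFamily reports whether pid is root or a descendant of root by walking up
// the hypothetical parent map (PID -> parent PID). The answer for every PID on
// the walked path is cached in memo so later candidates can stop early.
func inFamily(pid, root int, parent map[int]int, memo map[int]bool) bool {
	var path []int
	result := false
	cur := pid
	// Bound the walk so a cycle in the snapshot cannot loop forever.
	for steps := 0; steps <= len(parent); steps++ {
		if cur == root {
			result = true
			break
		}
		if known, ok := memo[cur]; ok {
			result = known
			break
		}
		path = append(path, cur)
		next, ok := parent[cur]
		if !ok {
			break // reached a process with no recorded parent
		}
		cur = next
	}
	for _, p := range path {
		memo[p] = result
	}
	return result
}

func main() {
	parent := map[int]int{10: 1, 11: 10, 12: 11, 20: 1}
	memo := map[int]bool{}
	fmt.Println(inFamily(12, 10, parent, memo)) // true
	fmt.Println(inFamily(20, 10, parent, memo)) // false
}
```

The cache only pays off when many candidates share long ancestor chains, which matches the caveat above about tasks needing a lot of processes.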
all, err := processes()
if err != nil {
	return set.New[ProcessID](0)
Could we at least return the executor PID here? Or is the thinking that if this call fails we can't trust any of the results?
Oh good idea, let me add that
@shoenig you don't need a 1.8.x backport label on this, as we'll cut the RC release from main
Yeah we could definitely add base cases for:
But at that point you're optimizing microseconds when the original problem is this horrible process table snapshot syscall that takes tens of milliseconds. I'd rather the code just be readable.
In #20619 we overhauled how we were gathering stats for Windows processes. Unlike in Linux where we can ask for processes in a cgroup, on Windows we have to make a single expensive syscall to get all the processes and then build the tree ourselves. Our algorithm to do so is recursive and quadratic in both steps and space with the number of processes on the host. For busy hosts this hits the stack limit and panics the Nomad client. We already build a map of parent PID to PID, so modify this to be a map of parent PID to slice of children and then traverse that tree only from the root we care about (the executor PID). This moves the allocations to the heap but makes the stats gathering linear in steps and space required. Fixes: #23984
…ar space (#24182)
In #20619 we overhauled how we were gathering stats for Windows processes. Unlike in Linux where we can ask for processes in a cgroup, on Windows we have to make a single expensive syscall to get all the processes and then build the tree ourselves. Our algorithm to do so is recursive and quadratic in both steps and space with the number of processes on the host. For busy hosts this hits the stack limit and panics the Nomad client.
We already build a map of parent PID to PID, so modify this to be a map of parent PID to slice of children and then traverse that tree only from the root we care about (the executor PID). This moves the allocations to the heap but makes the stats gathering linear in steps and space required.
This changeset also moves as much of this code as possible into an area not conditionally-compiled by OS, as the tagged test file was not being run in CI.
Fixes: #23984
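The approach described in this commit message can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual Nomad code: the `proc` type and the `descendants` function are hypothetical, and the process snapshot is passed in as a plain slice instead of coming from the Windows syscall.

```go
package main

import "fmt"

// proc is a hypothetical stand-in for one entry of the process table snapshot.
type proc struct {
	pid, ppid int
}

// descendants returns root plus every process reachable from it by following
// parent -> children links. Each process is visited at most once, and an
// explicit heap-allocated stack replaces recursion.
func descendants(root int, all []proc) map[int]struct{} {
	// Index the snapshot once: parent PID -> slice of child PIDs.
	children := make(map[int][]int, len(all))
	for _, p := range all {
		children[p.ppid] = append(children[p.ppid], p.pid)
	}

	family := map[int]struct{}{root: {}}
	stack := []int{root}
	for len(stack) > 0 {
		pid := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		for _, child := range children[pid] {
			if _, seen := family[child]; seen {
				continue // guards against cycles, e.g. a PID listed as its own parent
			}
			family[child] = struct{}{}
			stack = append(stack, child)
		}
	}
	return family
}

func main() {
	all := []proc{{1, 0}, {10, 1}, {11, 10}, {12, 10}, {20, 1}, {21, 20}}
	fmt.Println(len(descendants(10, all))) // counts processes 10, 11 and 12
}
```

Building the index is one pass over the snapshot, and the traversal only touches processes under the chosen root, so unrelated processes on a busy host cost nothing beyond the indexing step.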
This PR overhauls the way processes are scanned on Windows for the exec driver(s).
A regression was introduced in Nomad 1.7 where we would invoke an expensive syscall for every descendant of a process to build a tree of the processes associated with a task. Instead, we restore the behavior of scanning the process table once and rebuild the process tree manually.
Fixes #20042