Set Limit To Maximum Concurrent Survey Jobs #838
Conversation
…g, R is garbage, who thought using this in a production system was a good idea
I think bumping
I think this pattern should be used for all job types rather than only for Surveyor jobs. That would give us full control over how many jobs make it into the queue, rather than hoping that the average surveyor job doesn't queue 100 jobs. Additionally, the fact that it is only applied to Surveyor jobs means it's likely that when we turn things back on we'll just fall right back over again, because the Foreman will requeue just as many processor jobs and downloader jobs as knocked it over this weekend.
Also, since we're now not initially queueing the jobs at all, our num_retries count is being incremented before the job has been tried for the first time. You could start them out at -1, increase the max, or just add an additional field to the jobs like has_been_queued to denote whether a job has ever been queued.
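A minimal sketch of that second suggestion, assuming a hypothetical has_been_queued flag and a stand-in dispatch callable in place of the project's actual queuing helper (these names are illustrative, not the codebase's API):

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Job:
    num_retries: int = 0
    has_been_queued: bool = False  # the extra field suggested above (illustrative name)


def queue_job(job: Job, dispatch: Callable[[Job], None]) -> None:
    """Dispatch a job, counting a retry only if it has been queued before."""
    if job.has_been_queued:
        job.num_retries += 1        # a genuine retry: the job already reached the queue once
    else:
        job.has_been_queued = True  # first attempt: don't consume a retry
    dispatch(job)                   # stand-in for the project's real dispatch call
```

The same check works whether the flag lives on a dataclass like this or on a database model field.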
MAX_TOTAL_JOBS = int(get_env_variable_gracefully("MAX_TOTAL_JOBS", 1000))

lost_jobs = []
all_jobs = nomad_client.jobs.get_jobs()
This could be called in handle_survey_jobs, and then its length could be used to determine how many times to call this function, rather than calling it for each individual job.
Like that?
Wouldn't it also make sense to stop surveyor jobs from queuing the downloader jobs they create so the foreman has more control over the job queue? Processor jobs could go either way because it might be good to get them queued up ASAP to reduce the time data sits on disk, but at the same time a single downloader job can spawn multiple processor jobs.
I also think 1000 is too low of a job cap. I've seen prod go up as high as 850 jobs naturally, and I think having a decent queue depth will help prevent us from ever waiting on work to be queued. I'm guessing you chose 1000 in case they were all surveyor jobs, since we wouldn't want those to spawn too many jobs. However, that is a very variable and imprecise way to try to limit the actual queue depth. I think making the Foreman in charge of queuing at least downloader jobs, and maybe also processor jobs, would be more reliable. I've seen Nomad handle upwards of 40k jobs in prod without a problem, so maybe a 10k queue depth would be both safe and deep enough to never starve any workers.
Finally, I think there's a problem related to the foreman threads. Each function that monitors jobs is wrapped in a do_forever block and then managed in its own thread. This means that if each thread checks how many jobs there are and sees that it can queue 10k jobs, then in total 60k jobs could be queued (3 different monitoring functions per job type). I think we should have a single function go through all the processor-job functions, then all the downloader-job functions, then all the surveyor-job functions, and wrap THAT function in a do_forever loop. With just one thread, we'd only make it as far through that function as there is room in the queue for.
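A rough sketch of that single-thread arrangement, assuming the get_jobs() output shown in the diff and hypothetical monitor functions that accept a max_jobs_to_queue argument and return how many jobs they queued (those signatures are assumptions, not the project's actual API):

```python
MAX_TOTAL_JOBS = 10000  # illustrative cap


def monitor_everything(nomad_client, processor_monitors, downloader_monitors,
                       surveyor_monitors):
    """One pass over every monitoring function, sharing a single queue-depth budget."""
    # Only count live jobs toward the cap.
    live_jobs = [job for job in nomad_client.jobs.get_jobs()
                 if job["Status"] in ("pending", "running")]
    room_in_queue = MAX_TOTAL_JOBS - len(live_jobs)

    # Processor jobs first, then downloaders, then surveyors.
    for monitor in (*processor_monitors, *downloader_monitors, *surveyor_monitors):
        if room_in_queue <= 0:
            break
        room_in_queue -= monitor(max_jobs_to_queue=room_in_queue)
```

Wrapping monitor_everything (rather than each monitor) in the project's do_forever helper would give exactly one thread, and therefore one shared budget, so multiple monitoring threads can no longer each spend the full cap on their own.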
logger.info("Not requeuing job until we're running fewer jobs.")
return False

jobs_dispatched = 0
This should start at len(all_jobs), not 0.
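In other words (a sketch; jobs_to_queue and the dispatch call are stand-ins for whatever the surrounding code actually does):

```python
all_jobs = nomad_client.jobs.get_jobs()
jobs_dispatched = len(all_jobs)  # count jobs Nomad already knows about, not 0

while jobs_dispatched < MAX_TOTAL_JOBS and jobs_to_queue:
    dispatch(jobs_to_queue.pop())  # stand-in for the real queuing call
    jobs_dispatched += 1
```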
logger.info("Not requeuing job until we're running fewer jobs.")
return False

jobs_dispatched = 0
Same.
logger.info("Not requeuing job until we're running fewer jobs.")
return False

jobs_dispatched = 0
Same.
# Maximum number of total jobs running at a time.
# We do this now rather than at import time for testing purposes.
MAX_TOTAL_JOBS = int(get_env_variable_gracefully("MAX_TOTAL_JOBS", 1000))
all_jobs = nomad_client.jobs.get_jobs()
I just realized that this will include dead jobs; we only want to consider pending and running jobs.
The scalability problem isn't from the number of jobs the system can place, it's from how many Nomad can keep track of. I don't see any benefit to having it be higher than 10,000 (20TB/2GB), and everything more than that is just going to eat performance and stability. Even 10k seems high to me. Similarly, the status of jobs doesn't matter, since they're still going to be duplicated across the cluster (which is where the memory problems come from), so we'll want to wait for the dead ones to get GC'd anyway. The threading issue was the original reason I had it check the length on a per-job basis, which we could still do (and is ultimately really the only way to be accurate, since processor jobs can dispatch jobs directly as well). If you want to put the whole foreman back into a single thread… why did we bother with a thread pool at all? Do we still want to do that? Serious question.
I think it was a case of overzealous premature optimization on my part. Thinking about it now, there really isn't a reason for the Foreman to be multithreaded, and I think now there's a good reason to revert that: by only having one place where the Foreman queues jobs, we can give it better control over the queue and potentially add logic later to prioritize jobs. I wasn't suggesting we go higher than 10k, and it does make sense that we need to be concerned about dead jobs as well. I don't know what Nomad's GC cycle is like; I just hope it won't let thousands of jobs sit in the dead state. As long as we always have pending jobs we should be good, though. You're right that querying Nomad before queuing each job is the only way to be perfectly accurate about the job queue depth; however, Nomad's API always seemed somewhat slow to me, and I was worried about the impact that hammering it with HTTP requests as fast as possible would have on performance and stability as well.
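For the "ignore dead jobs" point raised above, the filter might look something like this (a sketch assuming the Status field returned by the job-list call in the diff; the function name is illustrative):

```python
def count_live_jobs(nomad_client) -> int:
    """Count only pending and running jobs toward MAX_TOTAL_JOBS; ignore dead ones."""
    all_jobs = nomad_client.jobs.get_jobs()
    return sum(1 for job in all_jobs if job["Status"] in ("pending", "running"))
```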
Looks good! Just two more things:
- You already set Surveyor jobs to start at -1.
- Doesn't the Foreman use send_job to queue its jobs as well? Seems like they might never get queued now.
Set Limit To Maximum Concurrent Survey Jobs
Issue Number
hashicorp/nomad#4864
Purpose/Implementation Notes
Nomad is garbage, part two.
We keep everything the jobs need in the database, so we don't need to have everything dispatched to a queue anyway. This sets an arbitrary limit on the number of survey jobs dispatched at a time. It looks like our problems mostly started at >100,000 jobs, so the new limit of 1000 should be safer without limiting our throughput.
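Roughly, the check matches the diff fragments quoted in the conversation above; the okay_to_dispatch wrapper and the plain os.environ lookup below are illustrative, since the project uses its own get_env_variable_gracefully helper:

```python
import logging
import os

logger = logging.getLogger(__name__)

# Cap on the total number of jobs handed to Nomad at once (1000 in this PR).
MAX_TOTAL_JOBS = int(os.environ.get("MAX_TOTAL_JOBS", 1000))


def okay_to_dispatch(nomad_client) -> bool:
    """Return True only while Nomad is tracking fewer than MAX_TOTAL_JOBS jobs."""
    all_jobs = nomad_client.jobs.get_jobs()
    if len(all_jobs) >= MAX_TOTAL_JOBS:
        logger.info("Not requeuing job until we're running fewer jobs.")
        return False
    return True
```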
Types of changes
Functional tests
New test added. Currently running, might fail due to some unrelated R garbage.