
Set Limit To Maximum Concurrent Survey Jobs #838

Merged: 17 commits merged into dev from mis/maxjobs on Nov 15, 2018

Conversation

@Miserlou (Contributor)

Issue Number

hashicorp/nomad#4864

Purpose/Implementation Notes

Nomad is garbage, part two.

We keep everything the jobs need in the database, so we don't need to dispatch all of it to the queue up front anyway. This sets an arbitrary limit on the number of survey jobs dispatched at a time. It looks like our problems mostly started at >100,000 jobs, so the new limit of 1,000 should be safer without limiting our throughput.
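
Roughly, the new guard looks like the sketch below (simplified; nomad_client, logger, and get_env_variable_gracefully come from the existing foreman module, and the surrounding function here is only illustrative):

def requeue_survey_job(last_job):
    # Read the cap at call time rather than import time for testing purposes.
    MAX_TOTAL_JOBS = int(get_env_variable_gracefully("MAX_TOTAL_JOBS", 1000))

    all_jobs = nomad_client.jobs.get_jobs()
    if len(all_jobs) >= MAX_TOTAL_JOBS:
        logger.info("Not requeuing job until we're running fewer jobs.")
        return False

    # ...otherwise create the new SurveyJob and dispatch it to Nomad as before.
    return True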

Types of changes

  • New feature (non-breaking change which adds functionality)

Functional tests

New test added. Currently running, might fail due to some unrelated R garbage.

@Miserlou (Contributor, Author)

I think bumping rlang fixed that R problem.

@kurtwheeler (Contributor) left a comment:

I think this pattern should be used for all job types rather than only for Surveyor jobs. That would give us full control over how many jobs make it into the queue, rather than hoping that the average surveyor job doesn't queue 100 jobs. Additionally, I think the fact that it is only applied to Surveyor jobs makes it likely that when we turn things back on we'll fall right back over, because the Foreman will requeue just as many processor jobs and downloader jobs as knocked it over this weekend.

Also, since we're now not queueing the jobs right away, our num_retries count is incremented before the job has been tried for the first time. You could start them out at -1, increase the max, or add an additional field to the jobs, such as has_been_queued, to denote whether a job has ever been queued.
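
For the has_been_queued option, a minimal sketch might look like this, assuming a Django model along the lines of our SurveyJob (the field name is just a suggestion, and this is not the project's actual model definition):

from django.db import models

class SurveyJob(models.Model):
    num_retries = models.IntegerField(default=0)
    # False until send_job() actually dispatches the job, so the Foreman can
    # tell "never queued" apart from "queued and retried".
    has_been_queued = models.BooleanField(default=False)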

MAX_TOTAL_JOBS = int(get_env_variable_gracefully("MAX_TOTAL_JOBS", 1000))

lost_jobs = []
all_jobs = nomad_client.jobs.get_jobs()
@kurtwheeler (Contributor) commented on this code:

This could be called once in handle_survey_jobs, and its length could be used to determine how many times to call this function, rather than calling it for each individual job.
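
Something along these lines (sketch only; the exact signatures are assumptions):

def handle_survey_jobs(jobs):
    # Query Nomad once per batch instead of once per individual job.
    num_nomad_jobs = len(nomad_client.jobs.get_jobs())
    queue_capacity = max(MAX_TOTAL_JOBS - num_nomad_jobs, 0)

    # Only requeue as many survey jobs as there is room for in the queue.
    for job in jobs[:queue_capacity]:
        requeue_survey_job(job)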

@Miserlou (Contributor, Author)

Like that?

@kurtwheeler (Contributor) left a comment:

Wouldn't it also make sense to stop surveyor jobs from queuing the downloader jobs they create so the foreman has more control over the job queue? Processor jobs could go either way because it might be good to get them queued up ASAP to reduce the time data sits on disk, but at the same time a single downloader job can spawn multiple processor jobs.

I also think 1000 is too low a job cap. I've seen prod go as high as 850 jobs naturally, and I think having a decent queue depth will help prevent us from ever waiting on work to be queued. I'm guessing you chose 1000 in case they were all surveyor jobs, since we wouldn't want those to spawn too many jobs. However, that is a highly variable and imprecise way to try to limit the actual queue depth. I think making the Foreman in charge of queuing at least downloader jobs, and maybe also processor jobs, would be more reliable. I've seen Nomad handle upwards of 40k jobs in prod without a problem, so a 10k queue depth might be both safe and deep enough to never starve any workers.

Finally, I think there's a problem related to the Foreman threads. Each function that monitors jobs is wrapped in a do_forever block and then managed in its own thread. This means that if each thread checks how many jobs there are and sees that it can queue 10k jobs, then in total 60k jobs could be queued (3 different monitoring functions per job type). I think we should instead have one function go through all the processor-job functions, then all the downloader-job functions, then all the surveyor-job functions, and wrap THAT function in a do_forever loop. With only one thread, we only make it as far through that function as we have room in the queue for.
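
A sketch of what that single loop might look like (the monitoring function names are placeholders for the existing per-job-type functions):

import time

MONITORING_FUNCTIONS = [
    retry_failed_processor_jobs, retry_lost_processor_jobs, retry_hung_processor_jobs,
    retry_failed_downloader_jobs, retry_lost_downloader_jobs, retry_hung_downloader_jobs,
    retry_failed_survey_jobs, retry_lost_survey_jobs, retry_hung_survey_jobs,
]

def monitor_jobs():
    # One sequential pass over every job type; each function stops queuing once
    # the shared MAX_TOTAL_JOBS cap is hit, so a single pass can't overshoot the
    # queue depth the way independent threads can.
    for monitor in MONITORING_FUNCTIONS:
        monitor()

def do_forever(func, seconds_between_runs=60):
    while True:
        func()
        time.sleep(seconds_between_runs)

# One thread instead of one thread per monitoring function.
do_forever(monitor_jobs)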

logger.info("Not requeuing job until we're running fewer jobs.")
return False

jobs_dispatched = 0
@kurtwheeler (Contributor) commented on this code:

This should start at len(all_jobs), not 0.

logger.info("Not requeuing job until we're running fewer jobs.")
return False

jobs_dispatched = 0
@kurtwheeler (Contributor) commented on this code:

Same.

logger.info("Not requeuing job until we're running fewer jobs.")
return False

jobs_dispatched = 0
@kurtwheeler (Contributor) commented on this code:

Same.

# Maximum number of total jobs running at a time.
# We do this now rather than import time for testing purposes.
MAX_TOTAL_JOBS = int(get_env_variable_gracefully("MAX_TOTAL_JOBS", 1000))
all_jobs = nomad_client.jobs.get_jobs()
@kurtwheeler (Contributor) commented on this code:

I just realized that this will include dead jobs; we only want to consider pending and running jobs.
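
Something like this sketch, assuming each entry returned by nomad_client.jobs.get_jobs() carries a "Status" of "pending", "running", or "dead":

all_jobs = nomad_client.jobs.get_jobs()
live_jobs = [job for job in all_jobs if job.get("Status") in ("pending", "running")]

# Compare the cap against live jobs only, then bail out of the requeue as before.
if len(live_jobs) >= MAX_TOTAL_JOBS:
    logger.info("Not requeuing job until we're running fewer jobs.")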

@Miserlou (Contributor, Author)

The scalability problem isn't from the number of jobs the system can place, it's from how many jobs Nomad can keep track of. I don't see any benefit to having it be higher than 10,000 (20TB / 2GB per job), and everything beyond that is just going to eat performance and stability. Even 10k seems high to me. Similarly, the status of the jobs doesn't matter, since they're still going to be duplicated across the cluster (which is where the memory problems come from), so we'll want to wait for the dead ones to get GC'd anyway.

The threading issue was the original reason I had it check the length on a per-job basis, which we could still do (and is ultimately the only way to be accurate, since processor jobs can dispatch jobs directly as well). If you want to put the whole Foreman back into a single thread.. why did we bother with a thread pool at all? Do we still want to do that? Serious question.

@kurtwheeler (Contributor) commented Nov 14, 2018:

If you want to put the whole foreman back into a single thread.. why did we bother with a thread pool at all? Do we still want to do that? Serious question.

I think it was a case of overzealous premature optimization on my part. Thinking about it now, there really isn't a reason for the Foreman to be multithreaded, and there's now a good reason to revert that: by having only one place where the Foreman queues jobs, we give it better control over the queue and can potentially add logic later to prioritize jobs.

I wasn't suggesting we go higher than 10k, and it does make sense that we need to be concerned about dead jobs as well. I don't know what Nomad's GC cycle is like; I just hope it won't let thousands of jobs sit in the dead state. As long as we always have pending jobs we should be good, though.

You're right that querying Nomad before queuing each job is the only way to be perfectly accurate about the job queue depth. However, Nomad's API has always seemed somewhat slow to me, and I was worried about the impact that hammering it with HTTP requests as fast as possible would have on performance and stability as well.

@kurtwheeler (Contributor) left a comment:

Looks good! Just two more things:

  • You already set Surveyor jobs to start at -1.
  • Doesn't the Foreman use send_job to queue its jobs as well? Seems like they might never get queued now.

@Miserlou merged commit 10cc8c3 into dev Nov 15, 2018
@Miserlou deleted the mis/maxjobs branch November 15, 2018 20:04
kurtwheeler pushed a commit that referenced this pull request Jan 10, 2019
Set Limit To Maximum Concurrent Survey Jobs