Set Limit To Maximum Concurrent Survey Jobs #838
Conversation
…g, R is garbage, who thought using this in a production system was a good idea
I think bumping
I think this pattern should be used for all job types rather than only for Surveyor jobs. That would give us full control over how many jobs make it into the queue, rather than hoping that the average surveyor job doesn't queue 100 jobs. Additionally, the fact that it is only applied to Surveyor jobs means it's likely that when we turn things back on we'll just fall right back over again, because the Foreman will requeue just as many processor jobs and downloader jobs as knocked it over this weekend.
Also, since we're now not initially queueing the jobs at all, our num_retries count is being incremented before the job has been tried for the first time. You could start them out at -1, increase the max, or just add an additional field to the jobs like has_been_queued to denote whether a job has ever been queued.
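A minimal sketch of that second suggestion, assuming a hypothetical has_been_queued flag and a stand-in dispatch callable in place of the project's actual queuing helper (these names are illustrative, not the codebase's API):

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Job:
    num_retries: int = 0
    has_been_queued: bool = False  # the extra field suggested above (illustrative name)


def queue_job(job: Job, dispatch: Callable[[Job], None]) -> None:
    """Dispatch a job, counting a retry only if it has been queued before."""
    if job.has_been_queued:
        job.num_retries += 1        # a genuine retry: the job already reached the queue once
    else:
        job.has_been_queued = True  # first attempt: don't consume a retry
    dispatch(job)                   # stand-in for the project's real dispatch call
```

The same check works whether the flag lives on a dataclass like this or on a database model field.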
MAX_TOTAL_JOBS = int(get_env_variable_gracefully("MAX_TOTAL_JOBS", 1000))

lost_jobs = []
all_jobs = nomad_client.jobs.get_jobs()
This could be called in handle_survey_jobs, and then its length could be used to determine how many times to call this function, rather than calling it for each individual job.
Like that?
Wouldn't it also make sense to stop surveyor jobs from queuing the downloader jobs they create so the foreman has more control over the job queue? Processor jobs could go either way because it might be good to get them queued up ASAP to reduce the time data sits on disk, but at the same time a single downloader job can spawn multiple processor jobs.
I also think 1000 is too low of a job cap. I've seen prod go up as high as 850 jobs naturally, and I think having a decent queue depth will help prevent us from ever waiting on work to be queued. I'm guessing you chose 1000 in case they were all surveyor jobs, since we wouldn't want those to spawn too many jobs. However, that is a very variable and imprecise way to try to limit the actual queue depth. I think making the Foreman in charge of queuing at least downloader jobs, and maybe also processor jobs, would be more reliable. I've seen Nomad handle upwards of 40k jobs in prod without a problem, so maybe a 10k queue depth would be both safe and deep enough to never starve any workers.
Finally, I think there's a problem related to the foreman threads. Each function that monitors jobs is wrapped in a do_forever block and then managed in its own thread. This means that if each thread checks how many jobs there are and sees that it can queue 10k jobs, then in total 60k jobs could be queued (3 different monitoring functions per job type). I think we should have a single function go through all the processor-job functions, then all the downloader-job functions, then all the surveyor-job functions, and wrap THAT function in a do_forever loop. With just one thread, we'd only make it as far through that function as there is room in the queue for.
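A rough sketch of that single-thread arrangement, assuming the get_jobs() output shown in the diff and hypothetical monitor functions that accept a max_jobs_to_queue argument and return how many jobs they queued (those signatures are assumptions, not the project's actual API):

```python
MAX_TOTAL_JOBS = 10000  # illustrative cap


def monitor_everything(nomad_client, processor_monitors, downloader_monitors,
                       surveyor_monitors):
    """One pass over every monitoring function, sharing a single queue-depth budget."""
    # Only count live jobs toward the cap.
    live_jobs = [job for job in nomad_client.jobs.get_jobs()
                 if job["Status"] in ("pending", "running")]
    room_in_queue = MAX_TOTAL_JOBS - len(live_jobs)

    # Processor jobs first, then downloaders, then surveyors.
    for monitor in (*processor_monitors, *downloader_monitors, *surveyor_monitors):
        if room_in_queue <= 0:
            break
        room_in_queue -= monitor(max_jobs_to_queue=room_in_queue)
```

Wrapping monitor_everything (rather than each monitor) in the project's do_forever helper would give exactly one thread, and therefore one shared budget, so multiple monitoring threads can no longer each spend the full cap on their own.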
logger.info("Not requeuing job until we're running fewer jobs.")
return False

jobs_dispatched = 0
This should start at len(all_jobs), not 0.
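In other words (a sketch; jobs_to_queue and the dispatch call are stand-ins for whatever the surrounding code actually does):

```python
all_jobs = nomad_client.jobs.get_jobs()
jobs_dispatched = len(all_jobs)  # count jobs Nomad already knows about, not 0

while jobs_dispatched < MAX_TOTAL_JOBS and jobs_to_queue:
    dispatch(jobs_to_queue.pop())  # stand-in for the real queuing call
    jobs_dispatched += 1
```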
logger.info("Not requeuing job until we're running fewer jobs.")
return False

jobs_dispatched = 0
Same.
logger.info("Not requeuing job until we're running fewer jobs.")
return False

jobs_dispatched = 0
Same.
# Maximum number of total jobs running at a time.
# We do this now rather than at import time for testing purposes.
MAX_TOTAL_JOBS = int(get_env_variable_gracefully("MAX_TOTAL_JOBS", 1000))
all_jobs = nomad_client.jobs.get_jobs()
I just realized that this will include dead jobs; we only want to consider pending and running jobs.
The scalability problem isn't from the number of jobs the system can place, it's from how many Nomad can keep track of. I don't see any benefit to having it be higher than 10,000 (20TB/2GB), and everything more than that is just going to eat performance and stability. Even 10k seems high to me. Similarly, the status of jobs doesn't matter, since they're still going to be duplicated across the cluster (which is where the memory problems come from), so we'll want to wait for the dead ones to get GC'd anyway. The threading issue was the original reason I had it check the length on a per-job basis, which we could still do (and is ultimately really the only way to be accurate, since processor jobs can dispatch jobs directly as well). If you want to put the whole foreman back into a single thread… why did we bother with a thread pool at all? Do we still want to do that? Serious question.
I think it was a case of overzealous premature optimization on my part. Thinking about it now, there really isn't a reason for the Foreman to be multithreaded, and I think now there's a good reason to revert that: by only having one place where the Foreman queues jobs, we can give it better control over the queue and potentially add logic later to prioritize jobs. I wasn't suggesting we go higher than 10k, and it does make sense that we need to be concerned about dead jobs as well. I don't know what Nomad's GC cycle is like; I just hope it won't let thousands of jobs sit in the dead state. As long as we always have pending jobs we should be good, though. You're right that querying Nomad before queuing each job is the only way to be perfectly accurate about the job queue depth; however, Nomad's API always seemed somewhat slow to me, and I was worried about the impact that hammering it with HTTP requests as fast as possible would have on performance and stability as well.
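For the "ignore dead jobs" point raised above, the filter might look something like this (a sketch assuming the Status field returned by the job-list call in the diff; the function name is illustrative):

```python
def count_live_jobs(nomad_client) -> int:
    """Count only pending and running jobs toward MAX_TOTAL_JOBS; ignore dead ones."""
    all_jobs = nomad_client.jobs.get_jobs()
    return sum(1 for job in all_jobs if job["Status"] in ("pending", "running"))
```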
Looks good! Just two more things:
- You already set Surveyor jobs to start at -1.
- Doesn't the Foreman use send_job to queue its jobs as well? Seems like they might never get queued now.
Set Limit To Maximum Concurrent Survey Jobs
Issue Number
hashicorp/nomad#4864
Purpose/Implementation Notes
Nomad is garbage, part two.
We keep everything the jobs need in the database, so we don't need to have everything dispatched to a queue anyway. This sets an arbitrary limit on the number of survey jobs dispatched at a time. It looks like our problems mostly started at >100,000 jobs, so the new limit of 1000 should be safer without limiting our throughput.
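Roughly, the check matches the diff fragments quoted in the conversation above; the okay_to_dispatch wrapper and the plain os.environ lookup below are illustrative, since the project uses its own get_env_variable_gracefully helper:

```python
import logging
import os

logger = logging.getLogger(__name__)

# Cap on the total number of jobs handed to Nomad at once (1000 in this PR).
MAX_TOTAL_JOBS = int(os.environ.get("MAX_TOTAL_JOBS", 1000))


def okay_to_dispatch(nomad_client) -> bool:
    """Return True only while Nomad is tracking fewer than MAX_TOTAL_JOBS jobs."""
    all_jobs = nomad_client.jobs.get_jobs()
    if len(all_jobs) >= MAX_TOTAL_JOBS:
        logger.info("Not requeuing job until we're running fewer jobs.")
        return False
    return True
```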
Types of changes
Functional tests
New test added. Currently running, might fail due to some unrelated R garbage.