Set Limit To Maximum Concurrent Survey Jobs #838
Merged
Commits (17):

8dc155a no dispatch directly from surveyor (Miserlou)
5f69a60 add max jobs limit in foreman (Miserlou)
e84a963 test max jobs in true and false scenarios (Miserlou)
1c51701 maybe bump rlang will work but this is not a good solution to anythin… (Miserlou)
82dcf0f bump timeouts (Miserlou)
ea45d48 explicit order by (Miserlou)
8ef5bf4 duplicate max_total_jobs for all processor types and update tests (Miserlou)
450b2fb start with neg one (Miserlou)
155c50b rm dead test (Miserlou)
2475531 forgot to commit (Miserlou)
82bc0b9 update count every 100 jobs (Miserlou)
aab7918 only dispatch processor jobs directly (Miserlou)
33ed1ca comment to explain new behavior (Miserlou)
60f15e1 centralize logic to send_job with is_dispatch argument (Miserlou)
c33da40 type fix (Miserlou)
4e43f41 maybe fix tests, since foreman isnt available for local e2e test driver (Miserlou)
9538fb8 mayyyyybe this will work (Miserlou)
@@ -32,6 +32,9 @@
 # greater than this because of the first attempt
 MAX_NUM_RETRIES = 2

+# This can be overridden by the env var "MAX_TOTAL_JOBS"
+DEFAULT_MAX_JOBS = 10000
+
 # The fastest each thread will repeat its checks.
 # Could be slower if the thread takes longer than this to check its jobs.
 MIN_LOOP_TIME = timedelta(minutes=2)
@@ -139,12 +142,34 @@ def requeue_downloader_job(last_job: DownloaderJob) -> None:

 def handle_downloader_jobs(jobs: List[DownloaderJob]) -> None:
     """For each job in jobs, either retry it or log it."""
-    for job in jobs:
+    nomad_host = get_env_variable("NOMAD_HOST")
+    nomad_port = get_env_variable("NOMAD_PORT", "4646")
+    nomad_client = Nomad(nomad_host, port=int(nomad_port), timeout=30)
+    # Maximum number of total jobs running at a time.
+    # We do this now rather than import time for testing purposes.
+    MAX_TOTAL_JOBS = int(get_env_variable_gracefully("MAX_TOTAL_JOBS", DEFAULT_MAX_JOBS))
+    len_all_jobs = len(nomad_client.jobs.get_jobs())
+    if len_all_jobs >= MAX_TOTAL_JOBS:
+        logger.info("Not requeuing job until we're running fewer jobs.")
+        return False
+
+    jobs_dispatched = 0
+    for count, job in enumerate(jobs):
         if job.num_retries < MAX_NUM_RETRIES:
             requeue_downloader_job(job)
+            jobs_dispatched = jobs_dispatched + 1
         else:
             handle_repeated_failure(job)

+        if (count % 100) == 0:
+            len_all_jobs = len(nomad_client.jobs.get_jobs())
+
+        if (jobs_dispatched + len_all_jobs) >= MAX_TOTAL_JOBS:
+            logger.info("We hit the maximum total jobs ceiling, so we're not handling any more downloader jobs now.")
+            return False
+
+    return True

 @do_forever(MIN_LOOP_TIME)
 def retry_failed_downloader_jobs() -> None:
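The ceiling logic added in this hunk can be sketched in isolation as follows. This is a simplified, dependency-free sketch, not the PR's code: `get_job_count`, `requeue`, and `handle_failure` are hypothetical stand-ins for `nomad_client.jobs.get_jobs()`, `requeue_downloader_job`, and `handle_repeated_failure`.

```python
from typing import Callable, List

MAX_NUM_RETRIES = 2


def handle_jobs_with_ceiling(jobs: List[dict],
                             get_job_count: Callable[[], int],
                             max_total_jobs: int,
                             requeue: Callable[[dict], None],
                             handle_failure: Callable[[dict], None]) -> bool:
    """Requeue jobs until a cluster-wide job ceiling is reached.

    Sketch of the pattern in this PR; the callables are hypothetical
    stand-ins for the real Nomad client and requeue helpers.
    """
    len_all_jobs = get_job_count()
    if len_all_jobs >= max_total_jobs:
        return False  # Already at the ceiling; do nothing this loop.

    jobs_dispatched = 0
    for count, job in enumerate(jobs):
        if job["num_retries"] < MAX_NUM_RETRIES:
            requeue(job)
            jobs_dispatched += 1
        else:
            handle_failure(job)

        # Refresh the cluster-wide count only every 100 jobs
        # to limit round trips to the scheduler API.
        if (count % 100) == 0:
            len_all_jobs = get_job_count()

        if (jobs_dispatched + len_all_jobs) >= max_total_jobs:
            return False  # Hit the ceiling mid-batch; stop dispatching.
    return True
```

Refreshing the count only every 100 iterations trades accuracy for fewer scheduler API calls; between refreshes `jobs_dispatched` approximates the growth since the last snapshot.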
@@ -166,7 +191,7 @@ def retry_hung_downloader_jobs() -> None:

     nomad_host = get_env_variable("NOMAD_HOST")
     nomad_port = get_env_variable("NOMAD_PORT", "4646")
-    nomad_client = Nomad(nomad_host, port=int(nomad_port), timeout=5)
+    nomad_client = Nomad(nomad_host, port=int(nomad_port), timeout=30)
     hung_jobs = []
     for job in potentially_hung_jobs:
         try:
@@ -205,7 +230,7 @@ def retry_lost_downloader_jobs() -> None:

     nomad_host = get_env_variable("NOMAD_HOST")
     nomad_port = get_env_variable("NOMAD_PORT", "4646")
-    nomad_client = Nomad(nomad_host, port=int(nomad_port), timeout=5)
+    nomad_client = Nomad(nomad_host, port=int(nomad_port), timeout=30)
     lost_jobs = []
     for job in potentially_lost_jobs:
         try:
@@ -306,12 +331,34 @@ def requeue_processor_job(last_job: ProcessorJob) -> None:

 def handle_processor_jobs(jobs: List[ProcessorJob]) -> None:
     """For each job in jobs, either retry it or log it."""
-    for job in jobs:
+    nomad_host = get_env_variable("NOMAD_HOST")
+    nomad_port = get_env_variable("NOMAD_PORT", "4646")
+    nomad_client = Nomad(nomad_host, port=int(nomad_port), timeout=30)
+    # Maximum number of total jobs running at a time.
+    # We do this now rather than import time for testing purposes.
+    MAX_TOTAL_JOBS = int(get_env_variable_gracefully("MAX_TOTAL_JOBS", DEFAULT_MAX_JOBS))
+    len_all_jobs = len(nomad_client.jobs.get_jobs())
+    if len_all_jobs >= MAX_TOTAL_JOBS:
+        logger.info("Not requeuing job until we're running fewer jobs.")
+        return False
+
+    jobs_dispatched = 0

[Review comment on `jobs_dispatched = 0`: Same.]

+    for count, job in enumerate(jobs):
         if job.num_retries < MAX_NUM_RETRIES:
             requeue_processor_job(job)
+            jobs_dispatched = jobs_dispatched + 1
         else:
             handle_repeated_failure(job)

+        if (count % 100) == 0:
+            len_all_jobs = len(nomad_client.jobs.get_jobs())
+
+        if (jobs_dispatched + len_all_jobs) >= MAX_TOTAL_JOBS:
+            logger.info("We hit the maximum total jobs ceiling, so we're not handling any more processor jobs now.")
+            return False
+
+    return True

 @do_forever(MIN_LOOP_TIME)
 def retry_failed_processor_jobs() -> None:
@@ -338,7 +385,7 @@ def retry_hung_processor_jobs() -> None:

     nomad_host = get_env_variable("NOMAD_HOST")
     nomad_port = get_env_variable("NOMAD_PORT", "4646")
-    nomad_client = Nomad(nomad_host, port=int(nomad_port), timeout=5)
+    nomad_client = Nomad(nomad_host, port=int(nomad_port), timeout=30)
     hung_jobs = []
     for job in potentially_hung_jobs:
         try:
@@ -425,6 +472,8 @@ def requeue_survey_job(last_job: SurveyJob) -> None:
     The new survey job will have num_retries one greater than
     last_job.num_retries.
     """
+
+    lost_jobs = []
     num_retries = last_job.num_retries + 1

     new_job = SurveyJob(num_retries=num_retries,
@@ -468,20 +517,44 @@ def requeue_survey_job(last_job: SurveyJob) -> None:
         # Can't communicate with nomad just now, leave the job for a later loop.
         new_job.delete()

+    return True
+

 def handle_survey_jobs(jobs: List[SurveyJob]) -> None:
     """For each job in jobs, either retry it or log it."""
-    for job in jobs:
+    nomad_host = get_env_variable("NOMAD_HOST")
+    nomad_port = get_env_variable("NOMAD_PORT", "4646")
+    nomad_client = Nomad(nomad_host, port=int(nomad_port), timeout=30)
+    # Maximum number of total jobs running at a time.
+    # We do this now rather than import time for testing purposes.
+    MAX_TOTAL_JOBS = int(get_env_variable_gracefully("MAX_TOTAL_JOBS", DEFAULT_MAX_JOBS))
+    len_all_jobs = len(nomad_client.jobs.get_jobs())
+    if len_all_jobs >= MAX_TOTAL_JOBS:
+        logger.info("Not requeuing job until we're running fewer jobs.")
+        return False
+
+    jobs_dispatched = 0

[Review comment on `jobs_dispatched = 0`: Same.]

+    for count, job in enumerate(jobs):
         if job.num_retries < MAX_NUM_RETRIES:
             requeue_survey_job(job)
+            jobs_dispatched = jobs_dispatched + 1
         else:
             handle_repeated_failure(job)

+        if (count % 100) == 0:
+            len_all_jobs = len(nomad_client.jobs.get_jobs())
+
+        if (jobs_dispatched + len_all_jobs) >= MAX_TOTAL_JOBS:
+            logger.info("We hit the maximum total jobs ceiling, so we're not handling any more survey jobs now.")
+            return False
+
+    return True

 @do_forever(MIN_LOOP_TIME)
 def retry_failed_survey_jobs() -> None:
     """Handle survey jobs that were marked as a failure."""
-    failed_jobs = SurveyJob.objects.filter(success=False, retried=False)
+    failed_jobs = SurveyJob.objects.filter(success=False, retried=False).order_by('pk')
     if failed_jobs:
         logger.info(
             "Handling failed (explicitly-marked-as-failure) jobs!",
@@ -499,11 +572,11 @@ def retry_hung_survey_jobs() -> None:
         end_time=None,
         start_time__isnull=False,
         no_retry=False
-    )
+    ).order_by('pk')

     nomad_host = get_env_variable("NOMAD_HOST")
     nomad_port = get_env_variable("NOMAD_PORT", "4646")
-    nomad_client = Nomad(nomad_host, port=int(nomad_port), timeout=5)
+    nomad_client = Nomad(nomad_host, port=int(nomad_port), timeout=30)
     hung_jobs = []
     for job in potentially_hung_jobs:
         try:
|
@@ -541,12 +614,13 @@ def retry_lost_survey_jobs() -> None: | |
start_time=None, | ||
end_time=None, | ||
no_retry=False | ||
) | ||
).order_by('pk') | ||
|
||
nomad_host = get_env_variable("NOMAD_HOST") | ||
nomad_port = get_env_variable("NOMAD_PORT", "4646") | ||
nomad_client = Nomad(nomad_host, port=int(nomad_port), timeout=5) | ||
nomad_client = Nomad(nomad_host, port=int(nomad_port), timeout=30) | ||
lost_jobs = [] | ||
|
||
for job in potentially_lost_jobs: | ||
try: | ||
# Surveyor jobs didn't always have nomad_job_ids. If they | ||
|
@@ -583,7 +657,7 @@ def retry_lost_survey_jobs() -> None: | |
handle_survey_jobs(lost_jobs) | ||
|
||
## | ||
# Main loop | ||
# Janitor | ||
## | ||
|
||
@do_forever(JANITOR_DISPATCH_TIME) | ||
|
@@ -610,6 +684,9 @@ def send_janitor_jobs():
         # If we can't dispatch this job, something else has gone wrong.
         continue

+##
+# Main loop
+##

 def monitor_jobs():
     """Runs a thread for each job monitoring loop."""
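The `@do_forever(MIN_LOOP_TIME)` decorator applied to each retry loop above enforces a minimum period per iteration. The refinebio implementation is not shown in this diff, so the following is only a hedged sketch of how such a decorator could work:

```python
import functools
import time
from datetime import timedelta


def do_forever(min_loop_time: timedelta):
    """Sketch of a minimum-period loop decorator (assumption: the real
    refinebio decorator may differ in details such as error handling).

    Repeats the wrapped function forever, sleeping out the remainder of
    min_loop_time whenever an iteration finishes early.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            while True:
                start = time.monotonic()
                fn(*args, **kwargs)
                elapsed = time.monotonic() - start
                remaining = min_loop_time.total_seconds() - elapsed
                if remaining > 0:
                    # Iteration was faster than the minimum loop time;
                    # sleep so checks never repeat faster than allowed.
                    time.sleep(remaining)
        return wrapper
    return decorator
```

With this shape, a slow iteration simply starts the next check immediately, matching the comment on `MIN_LOOP_TIME` ("Could be slower if the thread takes longer than this to check its jobs").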
Review comment: This should start at len(all_jobs), not 0.
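One reading of this comment: `jobs_dispatched` starts at 0 and is added to a `len_all_jobs` snapshot that is only refreshed every 100 iterations, so after a refresh the same dispatched jobs can be counted twice. A hedged sketch of the suggested fix, seeding a single running total with the current cluster count (this is an interpretation of the comment, not code from the PR):

```python
from typing import Callable, List


def handle_jobs_seeded(jobs: List[dict],
                       get_job_count: Callable[[], int],
                       max_total_jobs: int,
                       requeue: Callable[[dict], None],
                       handle_failure: Callable[[dict], None],
                       max_num_retries: int = 2) -> bool:
    """Variant reflecting the review comment: keep one running total,
    seeded with the current cluster job count, instead of a separate
    jobs_dispatched counter that starts at zero. All callables are
    hypothetical stand-ins for the Nomad client and requeue helpers.
    """
    total_jobs = get_job_count()
    if total_jobs >= max_total_jobs:
        return False  # Already at the ceiling.

    for job in jobs:
        if job["num_retries"] < max_num_retries:
            requeue(job)
            total_jobs += 1  # Count each dispatch exactly once.
        else:
            handle_failure(job)

        if total_jobs >= max_total_jobs:
            return False  # Ceiling reached mid-batch.
    return True
```

This removes the periodic re-fetch entirely; whether to also refresh the total occasionally (to account for jobs started by other processes) is a separate trade-off the PR's every-100-jobs check was addressing.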