-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve performance of cleanup_jobs #6166
Improve performance of cleanup_jobs #6166
Conversation
Build failed.
|
Build failed.
|
Build failed.
|
num_deleted = 0 | ||
self.logger.info("deleting total of %d jobs", num_to_delete) | ||
while num_deleted < num_to_delete: | ||
pk_list = Job.objects.filter(created__lt=self.cutoff)[0:batch_size].values_list('pk') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to put this set of queries in a with transaction.atomic()
for safety.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep the def handle() in cleanup_jobs.py is wrapped in @transaction.atomic decorator
qs_batch = Job.objects.filter(pk__in=pk_list) | ||
num_deleted += qs_batch.count() | ||
if not self.dry_run: | ||
qs_batch.delete() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that .delete()
returns the number of rows deleted: https://docs.djangoproject.com/en/2.2/ref/models/querysets/#delete
It might be worthwhile to make use of it in the not dry_run
case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
☝️
}) | ||
|
||
|
||
class Collector: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you intend to make this inherit from the base collector implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(or are there no other methods to override)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I think we will want to inherit from base Collector for this.
done
Build failed.
|
}) | ||
|
||
|
||
class AWXCollector(Collector): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, this is much more manageable since we're not pulling everything in, but instead just subclassing.
batch_size = 1000000 | ||
num_deleted = 0 | ||
self.logger.info("deleting total of %d jobs", num_to_delete) | ||
while num_deleted < num_to_delete: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related to what @jbradberry mentions about .delete()
returning a count, another way to do this loop would be instead of doing math on each iteration, just break out of the loop once there's nothing else to delete:
num_deleted = collector.delete()
self.logger.info('deleted %d jobs', num_deleted)
if num_deleted == 0:
break
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I like this.
if self.logger: | ||
self.logger.info("Collecting objects for %s", objs.model._meta.label) | ||
|
||
if not getattr(objs, 'polymorphic_disabled', None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems to be a bug with some of our polymorphic classes in old collector
c = Collector('default')
c.collect(UnifiedJobTemplate.objects.all())
ValueError: Cannot query "Demo Project-10": Must be "JobTemplate" instance.
disabling polymorphic fixes this
Build failed.
|
collector = AWXCollector('default', self.logger) | ||
pk_list = Job.objects.filter(created__lt=self.cutoff)[0:batch_size].values_list('pk') | ||
qs_batch = Job.objects.filter(pk__in=pk_list) | ||
num_deleted += qs_batch.count() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably can update this as well to remove the .count()
(instead, just accumulate the just_deleted
on each loop, like the function above).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, fixed!
cc @matburt because I'd like him to be aware of this change and weigh in on it. So I've looked this over, and I feel pretty good about the custom collector given that we're only using it explicitly here in this cleanup command, and not in other places. The changes @fosterseth made to the upstream Django implementation make sense to me. I think it would be even better if we tried to clean up the Collector changes and open an upstream PR with similar changes and attempt to get these optimizations into upstream Django (though that's not going to happen in the very near term, obviously): django/django@2.2.4...fosterseth:fix-deletion-2.2.4 If such a PR had an issue had an associated issue filed in the Django issue racker that illustrated the problem, passed existing Django tests, and included new tests that illustrated what the optimization accomplished, I think it would be more likely to be well-received by the upstream community. But let's do that work after we're done with this, because it has a longer tail, and will be an uphill battle. It might also be an opportunity for the Django community to point out any issues with our implementation or approach here. Given a choice between option 1 and 2 (
We do have tests in our codebase for verifying the result of https://github.com/ansible/awx/tree/devel/awx/main/tests/functional/commands So in my opinion, we should:
|
|
||
If 'keep_parents' is True, data of parent model's will be not deleted. | ||
""" | ||
if self.logger: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should consider removing this self.logger
usage; it's unlikely to be accepted upstream.
Build failed.
|
fd9f9b0
to
970f747
Compare
Build failed.
|
Build failed.
|
|
||
|
||
class AWXCollector(Collector): | ||
def __init__(self, using): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can probably just remove this __init__
since it's not doing anything special.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, fixed
Build failed.
|
8b9fca0
to
4e1c5d6
Compare
Build succeeded.
|
1c39cd5
to
8232a5b
Compare
Build succeeded.
|
fdce179
to
6891981
Compare
Build succeeded.
|
170cdf8
to
6dcb0b2
Compare
Build succeeded.
|
@fosterseth @ryanpetrello reading this last comment, it is unclear to me what option you went with. of "option 1" and "option 2"
Can we sync up about this and clarify more about these test cases, because I'm not 100% clear
|
@kdelee We implemented option 2 --- the faster version :) the way I envision testing is to create a handful of Jobs with creation date (the then you can run the awx-manage command We can assert the older jobs are removed, and make sure the newer jobs are still in the database. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this, but it warrants it's own mention in docs/ architecture docs.
OK talked to @fosterseth and we have come up with the following plan: What does do?:
What are consequences? Deleting JOBS needs to ALSO deletes and/or get nulled out:
if wrong thing happens probably see postgres error in logs and thing would not get deleted (violates ForeignKey restraint) Do we have any special concerns for upgrades? No, no migrations or model changes have occured. TODO:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, we can talk at merge meeting about testing performed
6dcb0b2
to
1db9b48
Compare
Build succeeded.
|
batch_size = 1000000 | ||
|
||
while True: | ||
pk_list = Job.objects.filter(created__lt=self.cutoff).exclude(status__in=['pending', 'waiting', 'running'])[0:batch_size].values_list('pk') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think given this that you'll have an infinite loop if you do a dry-run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep you're right, let me think about a cleaner way to write this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
Build succeeded.
|
Build succeeded.
|
a71bb84
to
ad64b6f
Compare
Build failed.
|
The commit is intended to speed up the cleanup_jobs command in awx. Old methods takes 7+ hours to delete 1 million old jobs. New method takes around 6 minutes. Leverages a sub-classed Collector, called AWXCollector, that does not load in objects before deleting them. Instead querysets, which are lazily evaluated, are used in places where Collector normally keeps a list of objects. Finally, a couple of tests to ensure parity between old Collector and AWXCollector. That is, any object that is updated/removed from the database using Collector should be have identical operations using AWXCollector. tower issue 1103
ad64b6f
to
88fb30e
Compare
Build succeeded.
|
Build succeeded (gate pipeline).
|
[4.2] Backport Validate same start/end day different time schedule
SUMMARY
related https://github.com/ansible/tower/issues/1103
The old implementation iterates through Jobs older than
--days
, and calls.delete()
on one job at a time. This is slow.I have developed two solutions for this problem.
benchmark on 1 Million jobs (ec2 m5.large, 37500 iops)
method 1 : 1.5 hoursNote: JobEvent objects can be fast deleted, so the overall deletion time is more dependent on number of Jobs in the database. Also, method 2 is fast for everything, not just jobs.
method 1 is implemented indef cleanup_jobs (self)
method 2 is implemented in
def cleanup_jobs(self)
method 2 can be called viaawx-manage cleanup_jobs --days 90 --jobs_fast
Method 1 deletes jobs in batches of 10,000Method 2 is complicated because it involves an override of Django's built-in
Collector
class.Previously this class pulls in all objects into memory before deleting them. I re-wrote it to use querysets (lazily evaluated) instead of objects. This is where the performance gain comes from.
Note, only the cleanup_jobs tool would use this rewritten class -- the rest of the application will continue to use the old Collector class.
You can view the changes I made to
Collector
(deletion.py) heredjango/django@2.2.4...fosterseth:fix-deletion-2.2.4
(you can see an example of obj to queryset on line 217)
How
Collector
workson_delete CASCADE or NULL
). For example, you must deleteJobEvent
entry before deleting the correspondingJob
.Collector.collect()
will gather all objectsCollector.sort()
resolves the dependencies -- it will determine which order it can safely delete objects before others.Collector.delete()
will do field updates (e.g. change foreign key to NULL) if there are any, then do fast_deletes (objects without signals or foreign keys constraints), then do the rest of the deletes.Outstanding issues:
I should probably rename Collector toAwxCollector
so people know it's a mod.ISSUE TYPE
COMPONENT NAME
AWX VERSION
ADDITIONAL INFORMATION
awx/main/tests/functional/commands/test_cleanup_jobs.py
Here are the related objects to class
Job
that are affected if we delete a job object.The test is designed to find the object for each of the above relationships, and then checks that the object is deleted (if
on_delete == CASCADE
) or set to None (ifon_delete == SET_NULL
)Additionally, I have a test in place to check parity between django's
Collector.collect()
results, and theAWXCollector.collect()
results. These methods popular a few dictionaries of objects that should be deleted or updated. This includes related objects to related objects, etc. Going forward, this test should ensureAWXCollector
will be properly affecting the exact same objects thatCollector
does.