Improve performance of cleanup_jobs #6166

fosterseth
Member

@fosterseth fosterseth commented Mar 4, 2020

SUMMARY

related https://github.com/ansible/tower/issues/1103

The old implementation iterates through Jobs older than --days, and calls .delete() on one job at a time. This is slow.

I have developed two solutions for this problem.
Benchmark on 1 million jobs (EC2 m5.large, 37,500 IOPS):

  • old way: ~ 7 hours (estimate from my 50k benchmark)
  • method 1 : 1.5 hours
  • method 2 : 6 minutes

Note: JobEvent objects can be fast deleted, so the overall deletion time is more dependent on number of Jobs in the database. Also, method 2 is fast for everything, not just jobs.

Method 1 is implemented in def cleanup_jobs(self).
Method 2 is implemented in def cleanup_jobs_fast(self).

method 2 can be called via awx-manage cleanup_jobs --days 90 --jobs_fast

Method 1 deletes jobs in batches of 10,000

Method 2 is more involved because it overrides Django's built-in Collector class.
The stock class pulls every object into memory before deleting it. I rewrote it to work with lazily evaluated querysets instead of lists of objects, and that is where the performance gain comes from.

Note, only the cleanup_jobs tool would use this rewritten class -- the rest of the application will continue to use the old Collector class.
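
To make the difference concrete, here is a minimal sketch that uses the standard-library sqlite3 module rather than Django. The job table, its age_days column, and both function names are assumptions made up for illustration; they are not the AWX schema or the code in this PR. The first function loads every matching row into Python and deletes rows one by one, roughly like the old approach; the second describes the rows in a single statement and lets the database do the work, which is closer in spirit to operating on querysets.

```python
import sqlite3


def make_db(n_jobs=1000):
    """Build a throwaway in-memory table of hypothetical jobs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, age_days INTEGER)")
    conn.executemany(
        "INSERT INTO job (age_days) VALUES (?)",
        [(i % 200,) for i in range(n_jobs)],
    )
    conn.commit()
    return conn


def delete_one_at_a_time(conn, days):
    # Old style: materialize every matching row in Python, then delete each one.
    rows = conn.execute(
        "SELECT id FROM job WHERE age_days > ?", (days,)
    ).fetchall()
    for (job_id,) in rows:
        conn.execute("DELETE FROM job WHERE id = ?", (job_id,))
    conn.commit()
    return len(rows)


def delete_as_one_query(conn, days):
    # Set-based style: describe the rows and let the database remove them.
    cur = conn.execute("DELETE FROM job WHERE age_days > ?", (days,))
    conn.commit()
    return cur.rowcount
```

Both functions remove the same rows; the second never holds the rows in Python memory and issues one statement instead of one per job.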

You can view the changes I made to Collector (deletion.py) here

django/django@2.2.4...fosterseth:fix-deletion-2.2.4
(you can see an example of obj to queryset on line 217)

How Collector works

  • Collector must gather objects, their parents, and related objects.
  • It is recursive; for each parent and related object, it must find their parents and related objects, and so forth
  • When deleting an object, it must honor foreign key constraints (on_delete CASCADE or SET_NULL). For example, the JobEvent entries must be deleted before the corresponding Job.
  • Collector.collect() gathers all objects.
  • Collector.sort() resolves the dependencies: it determines an order in which objects can be deleted safely.
  • Collector.delete() first applies field updates, if any (e.g. setting a foreign key to NULL), then performs fast_deletes (objects without signals or foreign key constraints), then performs the remaining deletes.
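
The collect, sort, delete sequence described above can be sketched in plain Python. This is a toy model, not Django's Collector: the FOREIGN_KEYS mapping is a hypothetical, heavily simplified schema, the function names are made up, and the fast_deletes step is left out entirely.

```python
# Hypothetical, simplified schema: child model -> (parent model, on_delete rule).
FOREIGN_KEYS = {
    "JobEvent": ("Job", "CASCADE"),
    "JobHostSummary": ("Job", "CASCADE"),
    "Host": ("Job", "SET_NULL"),
}


def collect(root):
    """Walk relations recursively; return models to delete and models to update."""
    to_delete, to_update = [root], []
    stack = [root]
    while stack:
        parent = stack.pop()
        for child, (target, rule) in FOREIGN_KEYS.items():
            if target != parent:
                continue
            if rule == "CASCADE" and child not in to_delete:
                to_delete.append(child)
                stack.append(child)  # the child's own dependents are visited too
            elif rule == "SET_NULL" and child not in to_update:
                to_update.append(child)
    return to_delete, to_update


def sort_for_delete(to_delete):
    """Order models so that every cascading child comes before its parent."""
    ordered = []

    def visit(model):
        if model in ordered:
            return
        for child, (target, rule) in FOREIGN_KEYS.items():
            if target == model and rule == "CASCADE" and child in to_delete:
                visit(child)
        ordered.append(model)

    for model in to_delete:
        visit(model)
    return ordered


def plan(root):
    """Field updates first, then deletes in dependency order."""
    to_delete, to_update = collect(root)
    steps = [("set_null", model) for model in to_update]
    steps += [("delete", model) for model in sort_for_delete(to_delete)]
    return steps
```

Calling plan("Job") yields the set-null step for Host first, then the deletes for the cascading children, and the Job delete last.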

Outstanding issues:

  • Method 2 has yet to fail me but we probably need more testing. What I've been doing is running integration tests with teardown off to populate the database with dependent data, then running this command and checking that things are deleting properly.
  • Lugging around a modified Collector class is not pretty.
  • Might look into adding pre and post delete signals into the modified Collector
  • I should probably rename Collector to AwxCollector so people know it's a mod.
ISSUE TYPE
  • Feature Pull Request
COMPONENT NAME
  • API
AWX VERSION
awx: 9.2.0
ADDITIONAL INFORMATION

awx/main/tests/functional/commands/test_cleanup_jobs.py
Here are the related objects to class Job that are affected if we delete a job object.

[<ManyToOneRel: main.unifiedjobtemplate>,
 <ManyToOneRel: main.unifiedjobtemplate>,
 <ManyToOneRel: main.unifiedjob_dependent_jobs>,
 <ManyToOneRel: main.unifiedjob_dependent_jobs>,
 <ManyToOneRel: main.unifiedjob_notifications>,
 <ManyToOneRel: main.unifiedjob_labels>,
 <ManyToOneRel: main.unifiedjob_credentials>,
 <OneToOneRel: main.joblaunchconfig>,
 <ManyToOneRel: main.activitystream_unified_job>,
 <OneToOneRel: main.workflowjobnode>,
 <ManyToOneRel: main.jobevent>,
 <ManyToOneRel: main.jobhostsummary>,
 <ManyToOneRel: main.host>,
 <ManyToOneRel: main.activitystream_job>]

The test finds the object for each of the above relationships, then checks that the object is deleted (if on_delete == CASCADE) or that its reference is set to None (if on_delete == SET_NULL).
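
As a rough analogue of those two outcomes, the sketch below uses sqlite3 with database-level ON DELETE rules. Note the difference: Django applies on_delete in Python through the Collector rather than relying on database constraints, so this only illustrates the expected end state. The table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
    CREATE TABLE job (id INTEGER PRIMARY KEY);
    CREATE TABLE job_event (
        id INTEGER PRIMARY KEY,
        job_id INTEGER REFERENCES job(id) ON DELETE CASCADE
    );
    CREATE TABLE host (
        id INTEGER PRIMARY KEY,
        last_job_id INTEGER REFERENCES job(id) ON DELETE SET NULL
    );
    INSERT INTO job (id) VALUES (1);
    INSERT INTO job_event (id, job_id) VALUES (10, 1);
    INSERT INTO host (id, last_job_id) VALUES (20, 1);
""")

# Deleting the job should remove its event and null out the host's reference.
conn.execute("DELETE FROM job WHERE id = 1")

events_left = conn.execute("SELECT COUNT(*) FROM job_event").fetchone()[0]
host_fk = conn.execute(
    "SELECT last_job_id FROM host WHERE id = 20"
).fetchone()[0]
hosts_left = conn.execute("SELECT COUNT(*) FROM host").fetchone()[0]
```

After the delete, the cascading row is gone while the SET NULL row survives with an empty reference, which is the pair of assertions the test makes per relationship.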

Additionally, I have a test in place to check parity between the results of Django's Collector.collect() and AWXCollector.collect(). These methods populate a few dictionaries of objects that should be deleted or updated, including related objects of related objects, and so on. Going forward, this test should ensure that AWXCollector affects exactly the same objects that Collector does.
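
One possible shape for such a parity check, sketched without Django: reduce whatever each collector gathered to a mapping of model label to a set of primary keys, then compare. The input format and both helper names are assumptions for illustration and do not mirror the internal attributes of either collector.

```python
def normalize(collected):
    """Reduce {model_label: iterable_of_pks} to {model_label: set_of_pks}.

    Empty entries are dropped so that "no rows" and "label absent" compare equal.
    """
    normalized = {}
    for label, pks in collected.items():
        pk_set = set(pks)  # also consumes lazy iterables such as generators
        if pk_set:
            normalized[label] = pk_set
    return normalized


def assert_parity(reference, candidate):
    """Raise AssertionError if the two collections touch different objects."""
    ref, cand = normalize(reference), normalize(candidate)
    assert ref.keys() == cand.keys(), f"models differ: {ref.keys() ^ cand.keys()}"
    for label in ref:
        assert ref[label] == cand[label], f"{label}: {ref[label] ^ cand[label]}"
```

Comparing sets of primary keys sidesteps the fact that one side holds object lists and the other holds lazily evaluated collections.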

@softwarefactory-project-zuul
Contributor

Build failed.

@softwarefactory-project-zuul
Contributor

Build failed.

@softwarefactory-project-zuul
Contributor

Build failed.

num_deleted = 0
self.logger.info("deleting total of %d jobs", num_to_delete)
while num_deleted < num_to_delete:
    pk_list = Job.objects.filter(created__lt=self.cutoff)[0:batch_size].values_list('pk')
Contributor

We may want to put this set of queries in a with transaction.atomic() for safety.

Contributor

Agreed.

Member Author

Yep, the def handle() in cleanup_jobs.py is wrapped in the @transaction.atomic decorator.

qs_batch = Job.objects.filter(pk__in=pk_list)
num_deleted += qs_batch.count()
if not self.dry_run:
    qs_batch.delete()
Contributor

Note that .delete() returns the number of rows deleted: https://docs.djangoproject.com/en/2.2/ref/models/querysets/#delete

It might be worthwhile to make use of it in the not dry_run case.

Contributor

☝️

})


class Collector:
Contributor

Did you intend to make this inherit from the base collector implementation?

Contributor

(or are there no other methods to override)

Member Author

@fosterseth fosterseth Mar 5, 2020

yeah I think we will want to inherit from base Collector for this.

done

@softwarefactory-project-zuul
Contributor

Build failed.

})


class AWXCollector(Collector):
Contributor

Okay, this is much more manageable since we're not pulling everything in, but instead just subclassing.

batch_size = 1000000
num_deleted = 0
self.logger.info("deleting total of %d jobs", num_to_delete)
while num_deleted < num_to_delete:
Contributor

@ryanpetrello ryanpetrello Mar 5, 2020

Related to what @jbradberry mentions about .delete() returning a count: instead of doing math on each iteration, another way to write this loop is to break out once there's nothing left to delete:

num_deleted = collector.delete()
self.logger.info('deleted %d jobs', num_deleted)
if num_deleted == 0:
    break

Member Author

changed!

Contributor

Yeah, I like this.

if self.logger:
    self.logger.info("Collecting objects for %s", objs.model._meta.label)

if not getattr(objs, 'polymorphic_disabled', None):
Member Author

@fosterseth fosterseth Mar 5, 2020

seems to be a bug with some of our polymorphic classes in old collector

c = Collector('default')
c.collect(UnifiedJobTemplate.objects.all())

ValueError: Cannot query "Demo Project-10": Must be "JobTemplate" instance.

disabling polymorphic fixes this

@softwarefactory-project-zuul
Contributor

Build failed.

collector = AWXCollector('default', self.logger)
pk_list = Job.objects.filter(created__lt=self.cutoff)[0:batch_size].values_list('pk')
qs_batch = Job.objects.filter(pk__in=pk_list)
num_deleted += qs_batch.count()
Contributor

@ryanpetrello ryanpetrello Mar 6, 2020

Probably can update this as well to remove the .count() (instead, just accumulate the just_deleted on each loop, like the function above).

Member Author

thanks, fixed!

@ryanpetrello
Contributor

ryanpetrello commented Mar 6, 2020

cc @matburt because I'd like him to be aware of this change and weigh in on it.

So I've looked this over, and I feel pretty good about the custom collector given that we're only using it explicitly here in this cleanup command, and not in other places. The changes @fosterseth made to the upstream Django implementation make sense to me.

I think it would be even better if we tried to clean up the Collector changes and open an upstream PR with similar changes and attempt to get these optimizations into upstream Django (though that's not going to happen in the very near term, obviously):

django/django@2.2.4...fosterseth:fix-deletion-2.2.4

If such a PR had an associated issue filed in the Django issue tracker that illustrated the problem, passed existing Django tests, and included new tests that illustrated what the optimization accomplished, I think it would be more likely to be well-received by the upstream community. But let's do that work after we're done with this, because it has a longer tail, and will be an uphill battle. It might also be an opportunity for the Django community to point out any issues with our implementation or approach here.

Given a choice between option 1 and 2 (cleanup_jobs, and cleanup_jobs_fast), I think I'm comfortable with just going w/ option 2 iff we can have somebody available to write some integration tests that verify this for correctness in-depth. Specifically, I'd like to see tests that:

  1. Set up a number of jobs that are "expired" and a number that are not expired, and establish links to their various cascading dependencies:
In [10]: [f for f in Job._meta.get_fields(include_hidden=True) if f.auto_created and not f.concrete and (f.one_to_one or f.one_to_many)]
Out[10]:
[<ManyToOneRel: main.unifiedjobtemplate>,
 <ManyToOneRel: main.unifiedjobtemplate>,
 <ManyToOneRel: main.unifiedjob_dependent_jobs>,
 <ManyToOneRel: main.unifiedjob_dependent_jobs>,
 <ManyToOneRel: main.unifiedjob_notifications>,
 <ManyToOneRel: main.unifiedjob_labels>,
 <ManyToOneRel: main.unifiedjob_credentials>,
 <OneToOneRel: main.joblaunchconfig>,
 <ManyToOneRel: main.activitystream_unified_job>,
 <OneToOneRel: main.workflowjobnode>,
 <ManyToOneRel: main.jobevent>,
 <ManyToOneRel: main.jobhostsummary>,
 <ManyToOneRel: main.host>,
 <ManyToOneRel: main.activitystream_job>]
  2. Delete the older jobs, and assert that the relations referenced above are properly updated to reflect the deletion.

We do have tests in our codebase for verifying the result of manage.py commands:

https://github.com/ansible/awx/tree/devel/awx/main/tests/functional/commands

So in my opinion, we should:

  • Remove the distinction between "fast" and "not fast" for jobs (just go with fast)
  • Add some rigorous testing for this command in the way described above, specifically for cleanup_jobs; then I'll be more comfortable approving this PR.


If 'keep_parents' is True, data of parent model's will be not deleted.
"""
if self.logger:
Contributor

I think we should consider removing this self.logger usage; it's unlikely to be accepted upstream.

@softwarefactory-project-zuul
Contributor

Build failed.

@fosterseth fosterseth force-pushed the feature-cleanup_jobs-perf branch from fd9f9b0 to 970f747 Compare March 6, 2020 21:08
@softwarefactory-project-zuul
Contributor

Build failed.

@softwarefactory-project-zuul
Contributor

Build failed.



class AWXCollector(Collector):
    def __init__(self, using):
Contributor

We can probably just remove this __init__ since it's not doing anything special.

Member Author

thanks, fixed

@softwarefactory-project-zuul
Contributor

Build failed.

@fosterseth fosterseth force-pushed the feature-cleanup_jobs-perf branch from 8b9fca0 to 4e1c5d6 Compare March 10, 2020 19:55
@fosterseth fosterseth changed the title [WIP] Improve performance of cleanup_jobs Improve performance of cleanup_jobs Mar 10, 2020
@fosterseth fosterseth closed this Mar 10, 2020
@fosterseth fosterseth reopened this Mar 10, 2020
@softwarefactory-project-zuul
Contributor

Build succeeded.

@fosterseth fosterseth force-pushed the feature-cleanup_jobs-perf branch from 1c39cd5 to 8232a5b Compare March 11, 2020 15:08
@softwarefactory-project-zuul
Contributor

Build succeeded.

@fosterseth fosterseth force-pushed the feature-cleanup_jobs-perf branch from fdce179 to 6891981 Compare March 11, 2020 19:57
@softwarefactory-project-zuul
Contributor

Build succeeded.

@fosterseth fosterseth force-pushed the feature-cleanup_jobs-perf branch from 170cdf8 to 6dcb0b2 Compare March 12, 2020 15:28
@softwarefactory-project-zuul
Contributor

Build succeeded.

@ryanpetrello ryanpetrello requested a review from matburt March 12, 2020 19:21
@kdelee
Member

kdelee commented Mar 16, 2020

@fosterseth @ryanpetrello reading this last comment, it is unclear to me which option you went with, "option 1" or "option 2".

Given a choice between option 1 and 2 (cleanup_jobs, and cleanup_jobs_fast), I think I'm comfortable with just going w/ option 2 iff we can have somebody available to write some integration tests that verify this for correctness in-depth. Specifically, I'd like to see tests that:

Can we sync up about this and clarify more about these test cases, because I'm not 100% clear

Set up a number of jobs that are "expired", and a number that are not expired, and establishes links to their various cascading dependencies:

Deletes older jobs, and has assertions to verify that the relations referenced above are properly updated to reflect the deletion.

@fosterseth
Member Author

@kdelee We implemented option 2 --- the faster version :)

The way I envision testing is to create a handful of jobs whose creation date (the created field) is today, then create a handful of jobs with a created date of 10 days ago.

then you can run the awx-manage command
awx-manage cleanup_jobs --days 10

We can assert the older jobs are removed, and make sure the newer jobs are still in the database.
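
A minimal, framework-free sketch of that test idea follows. The cleanup_jobs function here is a stand-in written for this example, not the management command; it keeps jobs created at or after the cutoff, mirroring the created__lt=self.cutoff filter seen in the diff. The old jobs are aged 11 days rather than exactly 10 so they sit clearly past the cutoff instead of on the boundary.

```python
from datetime import datetime, timedelta


def cleanup_jobs(jobs, days, now):
    """Return the jobs that survive: those created at or after the cutoff."""
    cutoff = now - timedelta(days=days)
    return [job for job in jobs if job["created"] >= cutoff]


# Fixed "now" so the example is deterministic.
now = datetime(2020, 3, 16, 12, 0, 0)

# Hypothetical job records: three fresh ones and three expired ones.
new_jobs = [{"id": i, "created": now} for i in range(3)]
old_jobs = [
    {"id": 100 + i, "created": now - timedelta(days=11)} for i in range(3)
]

survivors = cleanup_jobs(new_jobs + old_jobs, days=10, now=now)
survivor_ids = [job["id"] for job in survivors]
```

The two assertions from the comment then become: no old job id appears in survivor_ids, and every new job id does.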

Member

@matburt matburt left a comment

I like this, but it warrants its own mention in docs/ architecture docs.

@kdelee
Member

kdelee commented Mar 16, 2020

OK talked to @fosterseth and we have come up with the following plan:

What does this do?

  • This cleans up adhoc jobs, inventory updates, jobs, project updates, workflow jobs....think that is it?

What are consequences?
Jobs older than X days (prompt-able field on launching job) get deleted.

Deleting jobs ALSO needs to delete and/or null out:

  • job events
  • job host summary
  • hosts (brought in by inventory update)
  • activity stream items related to the job
  • workflow job nodes
  • job launch configs

If the wrong thing happens, we would probably see a postgres error in the logs and the thing would not get deleted (violates a ForeignKey constraint).

Do we have any special concerns for upgrades?

No, no migrations or model changes have occurred.

TODO:

  • define which of above get links nulled out and which get deleted @ryanpetrello can you help clarify expectations on what gets nulled out and what gets deleted?

  • deploy instance + load with variety of job types w/ real events, activity stream things, etc

  • wait one day or send tower into the future

  • load with NEW set of job types w/ real events, activity stream things, etc

  • check expectations about deleted vs. nulled out links

  • document appropriate info for developers in architecture docs

  • update appropriate info in docs + release notes

Member

@kdelee kdelee left a comment

Looks great, we can talk at merge meeting about testing performed

@fosterseth fosterseth force-pushed the feature-cleanup_jobs-perf branch from 6dcb0b2 to 1db9b48 Compare March 17, 2020 19:25
@softwarefactory-project-zuul
Contributor

Build succeeded.

batch_size = 1000000

while True:
    pk_list = Job.objects.filter(created__lt=self.cutoff).exclude(status__in=['pending', 'waiting', 'running'])[0:batch_size].values_list('pk')
Contributor

I think given this that you'll have an infinite loop if you do a dry-run.
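
To illustrate the concern: a loop that runs "until no matching rows remain" never terminates when nothing is actually removed. One way around it, sketched here on a plain list and not necessarily the fix that was merged, is to handle the dry-run case with a single count and skip the loop. The function name and the list-of-ages representation are assumptions for this example.

```python
def cleanup(ages, days, batch_size, dry_run=False):
    """Return (count removed or that would be removed, remaining ages)."""
    remaining = list(ages)
    if dry_run:
        # Nothing is removed in a dry run, so looping until no matches
        # remain would never end. Count once and return instead.
        would_delete = sum(1 for age in remaining if age > days)
        return would_delete, remaining

    total = 0
    while True:
        # Select the next batch of expired entries.
        batch = [age for age in remaining if age > days][:batch_size]
        if not batch:
            break  # nothing left to delete
        for age in batch:
            remaining.remove(age)
        total += len(batch)
    return total, remaining
```

With dry_run=True the data is left untouched and the reported count matches what a real run would delete.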

Member Author

yep you're right, let me think about a cleaner way to write this

Member Author

fixed

@softwarefactory-project-zuul
Contributor

Build succeeded.

@softwarefactory-project-zuul
Contributor

Build succeeded.

@fosterseth fosterseth force-pushed the feature-cleanup_jobs-perf branch from a71bb84 to ad64b6f Compare March 19, 2020 17:03
@softwarefactory-project-zuul
Contributor

Build failed.

The commit is intended to speed up the cleanup_jobs command in awx. The
old method takes 7+ hours to delete 1 million old jobs. The new method
takes around 6 minutes.

Leverages a sub-classed Collector, called AWXCollector, that does not
load in objects before deleting them. Instead querysets, which are
lazily evaluated, are used in places where Collector normally keeps a
list of objects.

Finally, a couple of tests to ensure parity between old Collector and
AWXCollector. That is, any object that is updated/removed from the
database using Collector should have identical operations using
AWXCollector.

tower issue 1103
@fosterseth fosterseth force-pushed the feature-cleanup_jobs-perf branch from ad64b6f to 88fb30e Compare March 19, 2020 18:14
@softwarefactory-project-zuul
Contributor

Build succeeded.

@softwarefactory-project-zuul
Contributor

Build succeeded (gate pipeline).

@softwarefactory-project-zuul softwarefactory-project-zuul bot merged commit 0a5acb6 into ansible:devel Mar 19, 2020
AlanCoding pushed a commit to AlanCoding/awx that referenced this pull request Jan 4, 2023
[4.2] Backport Validate same start/end day different time schedule