Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use exists subquery to find deleted model instance pks #791

Merged
merged 2 commits into from
Jun 3, 2019

Conversation

jeremy-engel
Copy link
Contributor

This is an idea that may improve the performance of get_deleted for some of the users in #748.

The changes shift the subquery on the model on get_deleted to use the built-in Exists (only django versions 1.11 and up are listed as supported as of now, and Exists was added in this version). The subquery should work across supported backends, the test suite passes, and I've used locally in MySQL and Postgres successfully.

As far as performance goes, things get a bit tricky, and I've been able to look into query times only in Postgres at the moment. But, in Postgres, I've experienced performance changes similar similar to the following when running get_deleted on the entire Version queryset:

  • For smaller models (in my testing, models with 10,000 or fewer instances), the original query takes 20% of the time of the new query (though on reasonable hardware for models of this size we are talking only a few milliseconds)
  • For larger models (in my testing, models with 100,000 or more instances), I've found that the new query takes 20% of the time of the original query (from roughly 10 seconds to 2 seconds in a relatively large model)

But there is another use case in which I've found more significant performance improvements (see jedie/django-reversion-compare#95). In the linked issue, the performance issues are from the following line (https://github.com/jedie/django-reversion-compare/blob/0bfe214e40933e38a5e5a94f3c6d0f56de9051be/reversion_compare/compare.py#L206):

    deleted = list(Version.objects.filter(revision=old_revision).get_deleted(related_model))

Here, the original query still must iterate over the pks the related model, but using Exists results in a significant performance improvement (in a model with a few million instances and many deletions, the query went from about 50 seconds to 1.5 milliseconds).

That being said, there are a couple of issues:

  1. The performance improvement for larger models and filtered querysets may not be worth the cost for other cases
  2. Should work with django's natively supported database backends, but performance may vary

In the general case, performance should be similar to the previous
subquery. But when the queryset is filtered, a potentially costly scan
of the model pks is avoided, greatly improving performance in models
with many instances and revisions. Such a situation occurs in the
`django-reversion-compare` package in
https://github.com/jedie/django-reversion-compare/blob/0bfe214e40933e38a5e5a94f3c6d0f56de9051be/reversion_compare/compare.py#L206:
@etianen
Copy link
Owner

etianen commented May 30, 2019

I like losing the custom SQL, and I like the idea of some performance improvements.

Can you make some benchmarks for the affected queries and views please? I only care about PostgreSQL, MySQL and SQLite. I'm happy to accept a small performance degradation for small datasets if the improvement for large datasets is as significant as you say.

@jeremy-engel
Copy link
Contributor Author

To get some cross-database benchmarks, I used TestModel from test_app. To populate, the objects were created in a revision, saved in another revision, and then every other object was deleted using the following:

from test_app.models import TestModel
import reversion
from reversion.models import Version


def create_test_objects(num_objects):
    for obj_num in range(num_objects):
        with reversion.create_revision():
            obj = TestModel.objects.create()
        with reversion.create_revision():
            obj.save()
        # Delete half of created objects
        if obj_num % 2 == 0:
            obj.delete() 

I ran the original query vs. the version in this pull request for postgres, sqlite, and mysql with 100, 1,000, and 10,000 objects (and 100,000 for postgres) for Version.objects.get_deleted(TestModel):

Postgres 11.3

Number of objects Original query New query
100 1.5 ms 1.7 ms
1000 4.8 ms 5.8 ms
10000 92.3 ms 35.0 ms
100000 933.7 ms 257.7 ms

MySQL 8.0

Number of objects Original query New query
100 2.1 ms 7.7 ms
1000 12.8 ms 237.7 ms
10000 85.8 ms 22313.4 ms

SQLite 3.24

Number of objects Original query New query
100 167 ms 166 ms
1000 1422 ms 1398 ms
10000 25430 ms 22557 ms

So MySQL should get the old fallback subquery again, but the postgres results are improved.

@etianen etianen merged commit 5a5b653 into etianen:master Jun 3, 2019
@etianen
Copy link
Owner

etianen commented Jun 3, 2019

Thanks for a great PR, and a solid set of benchmarks! Merged.

@jeremy-engel jeremy-engel deleted the get-deleted-speed branch June 5, 2019 03:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants