Add final SQL check when looking up involved tables #199

Merged
merged 28 commits into from
Dec 27, 2021

Conversation

dbartenstein
Contributor

@dbartenstein dbartenstein commented Aug 21, 2021

Description

  • Add a final SQL check to include potentially overlooked tables when looking up involved tables.
  • Add unit tests showing queries which do "order by" using a field of a referenced table. These tests would fail without the final SQL check.

Rationale

Changing the referenced object should also invalidate the query as calling the query again might lead to another result.

"Order by" allows expressions such as Coalesce as well: https://docs.djangoproject.com/en/3.2/ref/models/querysets/#order-by
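To illustrate the rationale, here is a minimal sketch that uses Python's built-in sqlite3 module rather than Django or cachalot. The table and column names (`owner`, `test`, `username`) are made up for the example. The query selects only from `test` but orders by a column of the referenced `owner` table, so an update to `owner` changes the result even though `test` itself is untouched.

```python
import sqlite3


def ordered_test_names(conn):
    """Return test names ordered by a column of the referenced owner table."""
    rows = conn.execute(
        "SELECT t.name FROM test AS t "
        "JOIN owner AS o ON o.id = t.owner_id "
        "ORDER BY o.username ASC"
    ).fetchall()
    return [name for (name,) in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE owner (id INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE test (
        id INTEGER PRIMARY KEY,
        name TEXT,
        owner_id INTEGER REFERENCES owner(id)
    );
    INSERT INTO owner VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO test VALUES (1, 'test1', 1), (2, 'test2', 2);
    """
)

before = ordered_test_names(conn)
# Only the referenced table changes; the rows of "test" stay the same.
conn.execute("UPDATE owner SET username = 'zoe' WHERE id = 1")
after = ordered_test_names(conn)
```

A cache that tracks only `test` for this query would keep serving `before` after the update, which is the stale result this PR is meant to prevent.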

Discussion

Initially I thought of adding the final SQL check as a configuration option. After looking at all the queries, I believe that it should be the default behavior, so I did not make it an option for now.

@dbartenstein dbartenstein changed the title Test with "order by" using field of another table. "order by" using field of another table. Aug 21, 2021
Proof of concept for conservative mode: final SQL query check.
@dbartenstein
Contributor Author

dbartenstein commented Aug 21, 2021

@Andrew-Chen-Wang

  • Two additional order_by tests which would fail without the additional SQL query check.
  • I have written a Proof of Concept for the "redundancy mode" as I believe it’s easier to discuss when there is a code proposal on the table. This redundancy mode makes both order_by tests succeed and would have caught the previously unhandled Case example as well.

What do you think about the proposal of providing the option of enabling an additional SQL check to be on the safe side?

Dominik Bartenstein added 2 commits August 22, 2021 13:05
Print differences between regular checks and SQL check.
Adapt unit tests.
@dbartenstein dbartenstein changed the title "order by" using field of another table. Add final SQL check when looking up involved tables Aug 22, 2021
@@ -328,13 +338,13 @@ def test_subquery(self):
     def test_custom_subquery(self):
         tests = Test.objects.filter(permission=OuterRef('pk')).values('name')
         qs = Permission.objects.annotate(first_permission=Subquery(tests[:1]))
-        self.assert_tables(qs, Permission, Test)
+        self.assert_tables(qs, Permission, Test, ContentType)
Contributor Author

@dbartenstein dbartenstein Aug 22, 2021

The final SQL includes an ORDER BY "django_content_type"."app_label" ASC
https://github.com/django/django/blob/ca9872905559026af82000e46cde6f7dedc897b6/django/contrib/auth/models.py#L72

Thus it’s correct behavior to add ContentType as an involved model.

cachalot/tests/read.py Outdated Show resolved Hide resolved
@dbartenstein
Contributor Author

dbartenstein commented Aug 23, 2021

@Andrew-Chen-Wang: before diving too deep - what do you think about this "Proof of concept" in general? I.e. the idea of doing a final SQL query check to catch unconsidered tables?
Hint: there still is some way to go - especially with Django 3.1 which seems to have some issues with returning the SQL query.

Collaborator

@Andrew-Chen-Wang Andrew-Chen-Wang left a comment

I think it's alr and makes sense. I'm just worried about false positives; although it's better to invalidate than pass, if there are still too many queries that are invalidated, then that's a prob.

The original behavior for pre-Django 1.11 in cachalot is exactly what you've implemented here. Going through the commit history, you can see that the IsRawSQL exception was created for tracking subqueries. The question is why not just check the generated SQL query; why bother going through all these Pythonic subqueries in the first place, especially with the addition of quote_name?

I guess that's a test we can try: replace everything in _get_tables with a single call to _get_tables_from_sql with the generated SQL and quote_name.

Note: I've published a patch release; will be gone for the next two weeks cuz school is starting.

# Additional check of the final SQL.
# Potentially overlooked tables are added here. Tables may be overlooked by the regular checks
# as not all expressions are handled yet. This final check acts as safety net.
final_check_tables = _get_tables_from_sql(connections[db_alias], str(query), enable_quote=True)
Collaborator

difference between this and L224? I say just put in try clause.

Contributor Author

> difference between this and L224? I say just put in try clause.

Put it into a try - except - else.
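The try - except - else flow discussed here can be sketched roughly as follows. This is an illustration under assumptions, not the code from this PR: `RawQueryDetected`, `get_tables`, and the two lookup callables are hypothetical stand-ins for the regular checks and for the SQL-based lookup.

```python
class RawQueryDetected(Exception):
    """Hypothetical signal that the regular checks cannot analyze a query."""


def get_tables(regular_lookup, sql_lookup, final_sql_check=True):
    """Combine the regular table lookup with a final check of the SQL text."""
    try:
        tables = set(regular_lookup())
    except RawQueryDetected:
        # The regular checks gave up: rely on the SQL text alone.
        tables = set(sql_lookup())
    else:
        # The regular checks succeeded: the SQL check only adds tables
        # they may have overlooked.
        if final_sql_check:
            tables |= set(sql_lookup())
    return tables


def failing_lookup():
    raise RawQueryDetected


merged = get_tables(lambda: {"test"}, lambda: {"test", "owner"})
unchecked = get_tables(
    lambda: {"test"}, lambda: {"test", "owner"}, final_sql_check=False
)
fallback = get_tables(failing_lookup, lambda: {"test", "owner"})
```

Using `else` keeps the final check out of the `try` body, so an exception raised by the SQL lookup itself is not mistaken for a failure of the regular checks.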

    """Returns names of involved tables after analyzing the final SQL query."""
    return {table for table in connection.introspection.django_table_names()
            + cachalot_settings.CACHALOT_ADDITIONAL_TABLES
            if _quote_table_name(table, connection, enable_quote) in lowercased_sql}
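As a rough, standalone illustration of why quoting matters for this substring check, consider the sketch below. It does not use Django: `quote_table_name` is a hypothetical helper that wraps names in ANSI double quotes, whereas the code under review delegates quoting to the database backend's quote_name.

```python
def quote_table_name(table, enable_quote=False):
    """Hypothetical helper: wrap the name in ANSI double quotes if requested."""
    return f'"{table}"' if enable_quote else table


def tables_in_sql(known_tables, sql, enable_quote=False):
    """Return the known tables whose (optionally quoted) name occurs in sql."""
    lowercased_sql = sql.lower()
    return {
        table
        for table in known_tables
        if quote_table_name(table, enable_quote).lower() in lowercased_sql
    }


known = ["auth_user", "auth_user_groups"]
sql = 'SELECT "auth_user_groups"."id" FROM "auth_user_groups"'

unquoted_matches = tables_in_sql(known, sql)
quoted_matches = tables_in_sql(known, sql, enable_quote=True)
```

Without quoting, `auth_user` matches as a prefix of `auth_user_groups`, which is a false positive. With quoting, only the table that actually appears in the statement is reported.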
Collaborator

nice addition 👍

I also wanna double check that quote_name is not making a call to the database since that sometimes happens for some obscure thing.

Contributor Author

> nice addition +1
>
> I also wanna double check that quote_name is not making a call to the database since that sometimes happens for some obscure thing.

At first glance quote_name seems to be safe, but it’s better to double-check of course!
https://github.com/django/django/blob/e703b152c6148ddda1b072a4353e9a41dca87f90/django/db/backends/mysql/operations.py#L177

@dbartenstein
Contributor Author

> I think it's alr and makes sense. I'm just worried about false positives; although it's better to invalidate than pass, if there are still too many queries that are invalidated, then that's a prob.

My take: Better false positives than false negatives. But of course we should avoid them both. One thing to note is that with the final SQL check the parent models’ tables have to be included in CACHALOT_ONLY_CACHABLE_TABLES.

> The original behavior for pre-Django 1.11 in cachalot is exactly what you've implemented here. Going through the commit history, you can see that the IsRawSQL exception was created for tracking subqueries. The question is why not just check the generated SQL query; why bother going through all these Pythonic subqueries in the first place, especially with the addition of quote_name?

Yes, that will be something to investigate.

> I guess that's a test we can try: replace everything in _get_tables with a single call to _get_tables_from_sql with the generated SQL and quote_name.
>
> Note: I've published a patch release; will be gone for the next two weeks cuz school is starting.

👍 And happy start of 🏫!
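The note above about CACHALOT_ONLY_CACHABLE_TABLES can be made concrete with a small sketch. It assumes that an allow-list permits caching only when every involved table is listed; `are_all_cachable` here is a simplified stand-in written for this example, and the table names `shop_item` (parent) and `shop_product` (child) are hypothetical.

```python
def are_all_cachable(tables, only_cachable_tables=frozenset()):
    """Simplified stand-in: with an allow-list, every table must be listed."""
    if only_cachable_tables:
        return set(tables) <= set(only_cachable_tables)
    return True


allow_list = {"shop_product"}

# Assumption for this example: the regular checks report only the child table.
regular_tables = {"shop_product"}
# The final SQL check also sees the joined parent table in the SQL text.
final_check_tables = regular_tables | {"shop_item"}

cachable_before = are_all_cachable(regular_tables, allow_list)
cachable_after = are_all_cachable(final_check_tables, allow_list)
cachable_fixed = are_all_cachable(final_check_tables, allow_list | {"shop_item"})
```

Under these assumptions, a query that used to be cached stops being cached once the final SQL check is active, until the parent table is added to the allow-list.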

Collaborator

@Andrew-Chen-Wang Andrew-Chen-Wang left a comment

LGTM. Just a couple of typings and add-ons, but looks great.

Comment on lines 104 to 105
def _get_tables_from_sql(connection, lowercased_sql,
enable_quote=False):
Collaborator

Suggested change
-def _get_tables_from_sql(connection, lowercased_sql,
-                         enable_quote=False):
+def _get_tables_from_sql(connection, lowercased_sql, enable_quote=False):

Contributor Author

@dbartenstein dbartenstein Aug 24, 2021

@Andrew-Chen-Wang: what’s the line length limit used in the cachalot project? 120?

Collaborator

It's black's default 88 I believe. Unfortunately, I've got a small computer, so that's why I can't match line length with Django's standards.

Contributor Author

Ok - but black is not used for cachalot, is it? Would that be an option?

cachalot/utils.py Show resolved Hide resolved
cachalot/utils.py Outdated Show resolved Hide resolved
cachalot/utils.py Outdated Show resolved Hide resolved
cachalot/tests/read.py Show resolved Hide resolved
cachalot/tests/read.py Outdated Show resolved Hide resolved
Collaborator

@Andrew-Chen-Wang Andrew-Chen-Wang left a comment

LGTM! Thanks a lot for this!

@@ -917,7 +954,7 @@ def test_now_annotate(self):
"""Check that queries with a Now() annotation are not cached #193"""
qs = Test.objects.annotate(now=Now())
self.assert_query_cached(qs, after=1)

Collaborator

weird space? I can use pre-commit later to remove this space though, so dw about it

Contributor Author

> weird space? I can use pre-commit later to remove this space though, so dw about it

Would it make sense to add the pre-commit configuration to the project itself?

cachalot/tests/transaction.py Show resolved Hide resolved
@Andrew-Chen-Wang
Collaborator

@dbartenstein you'll need to convert this draft PR to a finalized PR

@dbartenstein
Contributor Author

dbartenstein commented Aug 25, 2021

> @dbartenstein you'll need to convert this draft PR to a finalized PR

@Andrew-Chen-Wang: I will be on vacation for the next 10 days, so I wonder whether it would be better to postpone merging. Or would you like to do a release ASAP? It’s up to you.

@dbartenstein dbartenstein marked this pull request as ready for review August 25, 2021 07:10
@Andrew-Chen-Wang
Collaborator

Postpone since I like to give a couple days for others to view and for me to think. Plus school. It's not unusual for releases to be a monthly thing.

@dbartenstein
Contributor Author

> Postpone since I like to give a couple days for others to view and for me to think. Plus school. It's not unusual for releases to be a monthly thing.

@Andrew-Chen-Wang: just wanted to inform you that I am back from vacation 🌴.

@coveralls

coveralls commented Nov 25, 2021

Pull Request Test Coverage Report for Build 1626626393

  • 16 of 16 (100.0%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.2%) to 97.189%

Totals Coverage Status
Change from base Build 1551740406: 0.2%
Covered Lines: 657
Relevant Lines: 676

💛 - Coveralls

@dbartenstein
Contributor Author

> Thanks for running the benchmark for us @Fasther! And yes, I hope to get a quick docker-compose file up, so sorry about the inconvenience.
>
> Pertaining to the results, I don't think many organizations (or I personally) would mind the decrease in the "faster" column, though the initial slowdown might be troublesome as that can add up quickly. It may be better to roll this out as a separate, experimental feature @dbartenstein that can be enabled by a bool. That way, we can slowly improve this feature over time while not breaking any benchmarks for current users of cachalot.

@Andrew-Chen-Wang: I just did another optimization to prevent as_sql() from being called twice for the whole query: once in get_query_cache_key and once in _get_tables. This will very likely improve performance. I didn’t find a better way than adding an attribute to the compiler object for storing the generated SQL query. In general I would favor a more object-oriented approach.

@Fasther: can you please share another benchmark with us?

I am also fine with introducing a bool: CACHALOT_FINAL_SQL_CHECK which is False by default.
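The optimization described above, storing the generated SQL on the compiler object so that as_sql() runs only once, could look roughly like this. It is a sketch under assumptions: `FakeCompiler` stands in for Django's SQL compiler, and the attribute name `_cachalot_generated_sql` and the helper `cached_as_sql` are hypothetical, not taken from the PR.

```python
class FakeCompiler:
    """Stand-in for a SQL compiler; counts how often as_sql() is called."""

    def __init__(self):
        self.calls = 0

    def as_sql(self):
        self.calls += 1
        return 'SELECT "test"."id" FROM "test" WHERE "test"."id" = %s', (1,)


def cached_as_sql(compiler):
    """Return compiler.as_sql(), computing it at most once per compiler."""
    generated = getattr(compiler, "_cachalot_generated_sql", None)
    if generated is None:
        generated = compiler.as_sql()
        # Hypothetical attribute name, used only to memoize the result.
        compiler._cachalot_generated_sql = generated
    return generated


compiler = FakeCompiler()
sql_for_cache_key = cached_as_sql(compiler)   # e.g. when building the cache key
sql_for_table_check = cached_as_sql(compiler)  # e.g. when looking up tables
```

Both callers receive the same `(sql, params)` tuple, and the second call no longer pays for SQL generation.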

@Andrew-Chen-Wang
Collaborator

@dbartenstein yes please introduce a setting for this feature regardless of the results. I think it would be a breaking change regardless due to possible addition of tables unexpectedly.
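Assuming the setting keeps the name proposed in this thread, CACHALOT_FINAL_SQL_CHECK, and stays off by default, opting in from a project's Django settings module might look like this:

```python
# settings.py (fragment)

# Opt in to the additional check of the final SQL. According to the
# discussion in this thread the default is False, so existing projects
# keep their current behavior unless they enable it explicitly.
CACHALOT_FINAL_SQL_CHECK = True
```

Keeping the check opt-in means the extra cost on cache misses and the possible addition of unexpected tables affect only projects that ask for them.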

@PavelPancocha

Optimized solution:

mysql      is 1.5× slower then 9.0× faster
postgresql is 1.3× slower then 10.5× faster
sqlite     is 1.4× slower then 2.6× faster
filebased  is 1.4× slower then 9.1× faster
locmem     is 1.3× slower then 9.9× faster
pylibmc    is 1.4× slower then 7.5× faster
pymemcache is 1.4× slower then 6.5× faster
redis      is 1.5× slower then 6.2× faster

@dbartenstein
Contributor Author

dbartenstein commented Nov 27, 2021

@Andrew-Chen-Wang: @Fasther did great work on the PR and introduced the CACHALOT_FINAL_SQL_CHECK setting 👏

  • From your point of view - is the PR ready to be merged?
  • Are you going to update the documentation to include the setting CACHALOT_FINAL_SQL_CHECK?

Collaborator

@Andrew-Chen-Wang Andrew-Chen-Wang left a comment

Adding the setting to the docs would be great! Additionally, a CHANGELOG entry for this feature for version bump to 2.5.0 would be great.

Just in the docs, explaining why this setting was added would be great. Adding on the performance metrics done by Fasther would be helpful for organizations to decide whether to enable the feature.

Excited to get this out!

cachalot/tests/postgres.py Show resolved Hide resolved
cachalot/tests/postgres.py Show resolved Hide resolved
@dbartenstein
Contributor Author

> Adding the setting to the docs would be great! Additionally, a CHANGELOG entry for this feature for version bump to 2.5.0 would be great.
>
> Just in the docs, explaining why this setting was added would be great. Adding on the performance metrics done by Fasther would be helpful for organizations to decide whether to enable the feature.
>
> Excited to get this out!

@Andrew-Chen-Wang from our point of view (Thanks to @Fasther) the PR is ready to be merged. It includes documentation as well. What do you think?

@Andrew-Chen-Wang
Collaborator

Thank you. I will review and push a minor version release today.

Collaborator

@Andrew-Chen-Wang Andrew-Chen-Wang left a comment

Thanks so much @dbartenstein @Fasther this is a great addition!

@Andrew-Chen-Wang Andrew-Chen-Wang merged commit 434a575 into noripyt:master Dec 27, 2021
@dbartenstein
Contributor Author

> Thanks so much @dbartenstein @Fasther this is a great addition!

@Andrew-Chen-Wang you’re welcome! 🙇

@dbartenstein
Contributor Author

@Andrew-Chen-Wang when do you plan to make the next release containing this PR?

@Andrew-Chen-Wang
Collaborator

Done 👍 thanks for this PR again everyone!
