fix: Fix unintended cache misses with async queries #14291

Merged 4 commits on Apr 28, 2021

@benjreinhart benjreinhart commented Apr 22, 2021

Some charts are rendering errors when GLOBAL_ASYNC_QUERIES is turned on (#14289).

Some of the fixes:

  1. The background worker stores form_data so it can validate subsequent requests when clients ask for the query results. However, the code that runs the query mutates form_data, and the background worker was storing the mutated object. Since this object is used to derive a unique cache key, any mutation leads to a different key and, subsequently, a cache miss.
  2. Even if 1 is solved, in some cases the code modifies form_data in a non-deterministic way before the cache key is derived from it (e.g., adding UUIDs to it). This always leads to a unique cache key.
  3. In two places, deduplicating a list could change its ordering, producing a different cache key even though I think ordering doesn't matter here. Sorting the lists ensures the same result. Either way, the ordering should be the same given the same input, and that wasn't always the case.
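
As a rough illustration of the third fix, the sketch below shows why list order matters once form_data is hashed, and how sorting after deduplication keeps the key stable. `cache_key` and `dedupe_sorted` are hypothetical helpers written for this example, not Superset functions, and hashing the JSON form with MD5 is only an assumption about how a key could be derived.

```python
import hashlib
import json
from typing import Any, Dict, List


def cache_key(form_data: Dict[str, Any]) -> str:
    """Hypothetical key derivation: MD5 of the canonical JSON form."""
    payload = json.dumps(form_data, sort_keys=True, default=str)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def dedupe_sorted(items: List[str]) -> List[str]:
    """Deduplicate, then sort, so equal inputs always yield the same order."""
    return sorted(set(items))


# Same columns, different order: the keys differ even though the
# queries are logically equivalent.
key_a = cache_key({"columns": ["year", "country"]})
key_b = cache_key({"columns": ["country", "year"]})

# After sorted deduplication the lists, and therefore the keys, match.
stable_a = cache_key({"columns": dedupe_sorted(["year", "country", "year"])})
stable_b = cache_key({"columns": dedupe_sorted(["country", "year"])})
```

A bare `list(set(items))` would deduplicate too, but the iteration order of a set of strings is not guaranteed to match between Python processes, which is one plausible way a web request and a background worker could end up with different keys.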

Note: although the breakage shows up in the async queries experience, these fixes should benefit everyone, since they also prevent unnecessary cache misses in the regular flow.

@benjreinhart benjreinhart changed the title from "bug: Fix unintended cache misses with async queries" to "fix: Fix unintended cache misses with async queries" Apr 22, 2021
@@ -1065,6 +1065,9 @@ def to_adhoc(
    elif expression_type == "SQL":
        result.update({"sqlExpression": filt.get(clause)})

    deterministic_name = md5_sha_from_dict(result)
    result["filterOptionName"] = deterministic_name
Member

I'm unclear what the impact of changing the filterOptionName value is here. @rusackas @villebro

Contributor Author

@benjreinhart benjreinhart Apr 22, 2021

Yeah, I agree we should get some 👀 on this from folks more familiar. AFAICT, this preserves the existing behavior of generating a unique key (which I assume is needed on clients) but also keeps it deterministic so that hashing form_data will work properly (both with async/non-async caching).
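
To make the trade-off concrete, here is a minimal sketch contrasting a random name with one derived from the filter's own contents. It is a simplified stand-in, not Superset's `to_adhoc` or `md5_sha_from_dict`: the function names, the reduced set of filter fields, and the JSON-based hashing are all assumptions made for illustration.

```python
import hashlib
import json
import uuid
from typing import Any, Dict


def md5_of_dict(obj: Dict[str, Any]) -> str:
    """Hypothetical stand-in for a dict-hashing helper."""
    payload = json.dumps(obj, sort_keys=True, default=str)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def adhoc_with_uuid(sql: str) -> Dict[str, Any]:
    """Random name: two calls with the same input practically never match."""
    result: Dict[str, Any] = {"expressionType": "SQL", "sqlExpression": sql}
    result["filterOptionName"] = str(uuid.uuid4())
    return result


def adhoc_with_hash(sql: str) -> Dict[str, Any]:
    """Deterministic name: derived from the filter's other fields."""
    result: Dict[str, Any] = {"expressionType": "SQL", "sqlExpression": sql}
    # The hash is computed before the name is added, so it only
    # depends on the filter's actual contents.
    result["filterOptionName"] = md5_of_dict(result)
    return result


random_pair = (adhoc_with_uuid("a > 1"), adhoc_with_uuid("a > 1"))
stable_pair = (adhoc_with_hash("a > 1"), adhoc_with_hash("a > 1"))
```

With the random variant, any hash of the enclosing form_data changes on every call; with the hashed variant, the same filter always yields the same dict, while filters with different contents still get different names.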

Member

With a deterministic filterOptionName we of course now risk having duplicate filterOptionNames, which is probably precisely what the uuid4 key aims to avoid (duplicates could cause trouble in React components if they get returned to the frontend). So maybe we should deduplicate filters, too, which in edge cases would also avoid unnecessary cache misses (i.e., when the exact same filter has been defined twice in one query but only once in another)?

Contributor Author

@benjreinhart benjreinhart Apr 23, 2021

Hmm... can you point to a specific example of where that might be an issue? Wouldn't each filter have different attributes? If not, is it a bug on the client that the same filters for the same chart would be added more than once? The idea is this should be deterministic for the same filter dict, but the set of filter dicts should have different properties, causing a different sha.
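
The case under discussion can be sketched as follows: names only collide when two filters have exactly the same key/value pairs, and such repeats could be dropped by name. This is an illustration only. `filter_name` and `dedupe_filters` are hypothetical, the filter fields are made up for the example, and nothing in this thread confirms that deduplication was added to the PR.

```python
import hashlib
import json
from typing import Any, Dict, List


def filter_name(filt: Dict[str, Any]) -> str:
    """Hypothetical deterministic name: MD5 of the filter's JSON form."""
    payload = json.dumps(filt, sort_keys=True, default=str)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def dedupe_filters(filters: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Drop exact duplicates, keeping the first occurrence and input order."""
    seen = set()
    unique = []
    for filt in filters:
        name = filter_name(filt)
        if name not in seen:
            seen.add(name)
            unique.append(filt)
    return unique


filters = [
    {"subject": "country", "operator": "==", "comparator": "FI"},
    {"subject": "year", "operator": ">", "comparator": 2000},
    {"subject": "country", "operator": "==", "comparator": "FI"},  # exact repeat
]
names = [filter_name(f) for f in filters]
unique_filters = dedupe_filters(filters)
```

Only the first and third filters share a name, so deduplicating by name leaves two filters and preserves their original order.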

# objects modify the form_data object. If the modified version were
# to be cached here, it will lead to a cache miss when clients
# attempt to retrieve the value of the completed async query.
original_form_data = copy.deepcopy(form_data)
Member

💯
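
A small sketch of why the snapshot has to be a deep copy taken before the query runs. `cache_key` and `run_query` are hypothetical stand-ins (the real query code and key derivation are not shown in this thread), and the specific mutations are invented for the example.

```python
import copy
import hashlib
import json
from typing import Any, Dict


def cache_key(form_data: Dict[str, Any]) -> str:
    """Hypothetical key derivation: MD5 of the canonical JSON form."""
    payload = json.dumps(form_data, sort_keys=True, default=str)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def run_query(form_data: Dict[str, Any]) -> None:
    """Hypothetical query step that mutates its input, including nested values."""
    form_data["metrics"].append("injected_metric")
    form_data["row_limit"] = 10000


client_form_data = {"viz_type": "table", "metrics": ["count"]}
key_from_client = cache_key(client_form_data)

# Worker side: snapshot first, then let the query code mutate its own copy.
worker_form_data = copy.deepcopy(client_form_data)
original_form_data = copy.deepcopy(worker_form_data)
shallow_snapshot = dict(worker_form_data)  # shares the nested "metrics" list
run_query(worker_form_data)
```

After `run_query`, only `original_form_data` still hashes to the client's key. The shallow snapshot avoids the new top-level key but still sees the appended metric, because it shares the nested list with the mutated dict.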

@benjreinhart benjreinhart marked this pull request as ready for review April 22, 2021 21:07
codecov bot commented Apr 23, 2021

Codecov Report

Merging #14291 (789e109) into master (86d2a61) will decrease coverage by 0.08%.
The diff coverage is 87.50%.

@@            Coverage Diff             @@
##           master   #14291      +/-   ##
==========================================
- Coverage   76.79%   76.71%   -0.09%     
==========================================
  Files         955      955              
  Lines       48251    48255       +4     
  Branches     6030     6030              
==========================================
- Hits        37055    37018      -37     
- Misses      11001    11042      +41     
  Partials      195      195              
| Flag | Coverage Δ |
|---|---|
| hive | 80.77% <87.50%> (+<0.01%) ⬆️ |
| mysql | 81.03% <87.50%> (+<0.01%) ⬆️ |
| postgres | 81.07% <87.50%> (+<0.01%) ⬆️ |
| presto | ? |
| python | 81.44% <87.50%> (-0.16%) ⬇️ |
| sqlite | 80.67% <87.50%> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|---|---|
| superset/viz.py | 55.49% <50.00%> (ø) |
| superset/tasks/async_queries.py | 92.30% <100.00%> (+0.24%) ⬆️ |
| superset/utils/core.py | 88.77% <100.00%> (+0.02%) ⬆️ |
| superset/db_engine_specs/presto.py | 84.42% <0.00%> (-5.90%) ⬇️ |
| superset/connectors/sqla/models.py | 88.61% <0.00%> (-1.46%) ⬇️ |
| superset/models/core.py | 88.85% <0.00%> (-0.28%) ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 86d2a61...789e109.

Member

@villebro villebro left a comment


One minor comment

@benjreinhart benjreinhart force-pushed the benjreinhart/query-caching branch from 860f7fc to 789e109 Compare April 27, 2021 23:18
@pull-request-size pull-request-size bot added size/L and removed size/M labels Apr 27, 2021
@benjreinhart
Contributor Author

Following up on this:

I just spent some time playing with the UI and poking through the code. I am pretty confident the filterOptionName change will not cause issues (at least with how things are written today). If you filter out tests and superset/examples, there are only 14 references across the front and back end.

As mentioned above, the only way I see a conflict is if two adhoc filters contained the exact same key/value pairs, which I don't think is an expected / desired use case to support. I think we're good on dashboards too since the filter handling should be scoped to a single chart.

FWIW, I tested adding a duplicate filter pair, running the chart query, saving the chart, reloading it, etc., and it worked as expected.

cc @robdiciuccio @villebro

@robdiciuccio robdiciuccio merged commit e7f5100 into apache:master Apr 28, 2021
@robdiciuccio robdiciuccio deleted the benjreinhart/query-caching branch April 28, 2021 19:14
QAlexBall pushed a commit to QAlexBall/superset that referenced this pull request Dec 29, 2021
* bug: Fix unintended cache misses with async queries

* Ensure sort order

* Ensure columns are sorted

* Update failing tests
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 1.2.0 labels Mar 12, 2024
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels size/L 🚢 1.2.0