Load async sql lab results early for Presto #4834
Conversation
Force-pushed from ee27e77 to dcbeeb2
@timifasubaa would you mind including screenshots and/or an animated GIF (you can use LICEcap to record a video) to provide more context with regards to how the UX has changed?

@timifasubaa if you update your description, the following keywords can be used to automatically close issues.
Force-pushed from 11386cd to e629a3e
@john-bodley Updated both.
Force-pushed from 628ea56 to 6970d1c
Force-pushed from 2606e22 to de04dea
Codecov Report

```diff
@@            Coverage Diff             @@
##           master    #4834      +/-  ##
=========================================
- Coverage   77.51%   77.22%     -0.3%
=========================================
  Files          44       44
  Lines        8735     8780       +45
=========================================
+ Hits         6771     6780        +9
- Misses       1964     2000       +36
```

Continue to review full report at Codecov.
Force-pushed from de04dea to 0793b2f
A few general questions:
- Currently this is defined for Presto, but I was wondering whether it should be enabled for other databases as well. I presume any engine which supports `fetchmany` could in theory work.
- What happens when the result set is less than 1,000 rows?
- Can you also add some unit tests?
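A minimal sketch of why this is engine-agnostic (the helper name is hypothetical; `sqlite3` stands in here for any DB-API driver). It also answers the small-result-set question: `fetchmany` simply returns fewer rows when the result set is smaller than the batch size.

```python
import sqlite3

def prefetch_first_batch(cursor, batch_size=1000):
    # DB-API fetchmany() returns at most batch_size rows; if the result
    # set is smaller it just returns everything, so a < 1,000-row query
    # would have its whole result set "prefetched".
    return cursor.fetchmany(batch_size)

# Usage with sqlite3, whose cursors implement DB-API fetchmany():
conn = sqlite3.connect(":memory:")
cur = conn.execute("SELECT 1 AS x UNION SELECT 2 UNION SELECT 3 ORDER BY x")
first = prefetch_first_batch(cur, batch_size=2)    # [(1,), (2,)]
rest = prefetch_first_batch(cur, batch_size=1000)  # [(3,)] — fewer than asked
```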
superset/db_engine_specs.py
Outdated

```python
payload = dict(query_id=query.id)
payload.update({
    'status': utils.QueryStatus.PREFETCHED,
    'data': cdf.data if cdf.data else [],
```
I don't think you need the if/else, as `cdf.data` returns an empty array if the data frame is empty per this. Also, `x or []` is preferred over `x if x else []`.
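A quick illustration of the `x or []` idiom the comment suggests — both spellings behave identically here, since `or` falls back to `[]` for any falsy value:

```python
# `x or []` returns [] for None (or any falsy value), and x itself otherwise,
# which is exactly what `x if x else []` spells out more verbosely.
data = None
assert (data if data else []) == (data or []) == []

data = [1, 2]
assert (data if data else []) == (data or []) == [1, 2]
```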
```python
})

json_payload = json.dumps(payload, default=utils.json_iso_dttm_ser)
key = '{}'.format(uuid.uuid4())
```
No need to format this as a string as it's already a string.
superset/db_engine_specs.py
Outdated

```python
query = session.query(type(query)).filter_by(id=query.id).one()
if query.status == QueryStatus.STOPPED:
    cursor.cancel()
    break

if (
    config.get('PREFETCH_PRESTO') and
    processed_rows > 1000 and
```
Why is this a hard-coded value? Shouldn't this be tied to the limit defined in the `prefetch_results` method?
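One way to avoid the hard-coded value, sketched with a hypothetical helper (`should_prefetch`); the config keys match the ones used in this PR, so the threshold check and `prefetch_results` would read the same number:

```python
# Hypothetical: tie the "enough rows to prefetch?" check to the same
# config value that prefetch_results() uses, instead of a literal 1000.
config = {'PREFETCH_PRESTO': True, 'PREFETCH_ROWS': 1000}

def should_prefetch(processed_rows, config):
    threshold = config.get('PREFETCH_ROWS', 1000)
    return bool(config.get('PREFETCH_PRESTO')) and processed_rows > threshold

print(should_prefetch(1500, config))  # True
print(should_prefetch(500, config))   # False
```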
```python
    'columns': cdf.columns if cdf.columns else [],
    'query': query.to_dict(),
})
if store_results:
    key = '{}'.format(uuid.uuid4())
    if not query.results_key:
        query.results_key = '{}'.format(uuid.uuid4())
```
See note above.
superset/sql_lab.py
Outdated

```python
# @@ -158,6 +136,9 @@ def handle_error(msg):
if store_results and not results_backend:
    return handle_error("Results backend isn't configured.")
cache_timeout = database.cache_timeout
```
This will never execute and is duplicate logic from line 256.
superset/utils.py
Outdated

```python
def get_original_key(prefetch_key):
    return prefetch_key[:-9]
```
Seems a tad brittle.
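A slightly more defensive sketch: strip an explicit suffix rather than slicing a fixed nine characters. The `_prefetch` suffix is an assumption (it happens to be nine characters, but the PR's actual suffix isn't shown here):

```python
PREFETCH_SUFFIX = '_prefetch'  # hypothetical; nine characters, matching [:-9]

def get_original_key(prefetch_key):
    # Strip the suffix by name rather than by length, so a key that is
    # not a prefetch key passes through unchanged instead of being mangled.
    if prefetch_key.endswith(PREFETCH_SUFFIX):
        return prefetch_key[:-len(PREFETCH_SUFFIX)]
    return prefetch_key

print(get_original_key('abc123_prefetch'))  # abc123
print(get_original_key('abc123'))           # abc123 (unchanged)
```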
superset/views/core.py
Outdated

```python
blob = results_backend.get(key)
if not blob:
    return json_error_response(
        'Data could not be retrieved. '
        'You may want to re-run the query.',
        status=410,
    )
if utils.is_prefetch_key(key):  # hack to not break when requesting prefetch
```
Having the term "hack" in a comment is asking for trouble. Maybe we could rethink this.
Force-pushed from 0793b2f to d93f393
Force-pushed from 6f72db0 to 5391cd2
The current approach incorporates the various feedback I have received on this PR.
Force-pushed from 5391cd2 to f6b9069
Force-pushed from 403058c to c7a3534
Force-pushed from c7a3534 to 8d54892
While this is a sweet feature, I think it adds a somewhat complex layer on top of something that's already fairly complex and hard to reason about.
I'm not saying we shouldn't do this, just saying that we should make it as straightforward/manageable as possible.
```jsx
<div>
  {progressBar}
</div>
<VisualizeModal
```
This whole section is copy-pasted from earlier in this very method. This should be refactored into its own component, or at least a `renderDataSection` method or something like that, that would receive different props/params as needed.
```diff
@@ -726,7 +728,29 @@ def extra_table_metadata(cls, database, table_name, schema_name):
     }

     @classmethod
-    def handle_cursor(cls, cursor, query, session):
+    def prefetch_results(cls, cursor, query, cache_timeout, session, limit):
```
It appears this method is not covered by tests.
```python
data = cursor.fetchmany(limit)
column_names = cls.get_normalized_column_names(cursor.description)
cdf = utils.convert_results_to_df(column_names, data)
payload = dict(query_id=query.id)
```
Much of the logic here is not specific to Presto and should probably live in the base class or outside this module. Maybe something like `cache_prefetched_data(data)`.
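A hedged sketch of what an engine-agnostic `cache_prefetched_data` might look like in the base class. Everything after the `fetchmany` call (building the payload, serializing it, writing to the results backend) is the same for any database. All names, the payload shape, and the backend's `set` API are assumptions:

```python
import json
import uuid

def cache_prefetched_data(results_backend, query_id, data, columns):
    # Hypothetical engine-agnostic helper: serialize a prefetched batch
    # and store it under a fresh key in the results backend.
    payload = {
        'query_id': query_id,
        'status': 'prefetched',
        'data': data or [],
        'columns': columns or [],
    }
    key = str(uuid.uuid4())
    results_backend.set(key, json.dumps(payload))
    return key

# Minimal stand-in for a results backend (e.g. S3/Redis behind set/get):
class DictBackend:
    def __init__(self):
        self.store = {}
    def set(self, key, value):
        self.store[key] = value

backend = DictBackend()
key = cache_prefetched_data(backend, 42, None, ['a', 'b'])
```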
```diff
@@ -294,6 +294,12 @@ class CeleryConfig(object):
 # Timeout duration for SQL Lab synchronous queries
 SQLLAB_TIMEOUT = 30

+# When set to true, results from asynchronous sql lab are prefetched
+PREFETCH_ASYNC = True
```
Should this be a db-level param?
```python
    (not query.has_loaded_early)
):
    query.has_loaded_early = True
    limit = config.get('PREFETCH_ROWS')
```
`prefetch_count` would be a better name than `limit`, as "limit" has a different meaning around limiting the query itself.
```diff
@@ -72,8 +74,8 @@ class BaseEngineSpec(object):
     inner_joins = True

     @classmethod
-    def fetch_data(cls, cursor, limit):
-        if cls.limit_method == LimitMethod.FETCH_MANY:
+    def fetch_data(cls, cursor, limit, prefetch=False):
```
Does fetch_data need a new arg or does it just need to be called with a limit?
```python
):
    query.has_loaded_early = True
    limit = config.get('PREFETCH_ROWS')
    PrestoEngineSpec.prefetch_results(
```
Is it possible that since `fetchmany` is synchronous, it will prevent publishing any query % progress until some rows are fetched? I think we may be losing or obfuscating progress here. Say a query with a large groupby returning a small result set that takes 5 minutes to scan will show 0% until it can return the small result set. I also wonder how the UI behaves if/when the prefetch and the final result arrive at around the same moment (large scan query with small result set): does it flicker, does it look ok?
This PR enables SQL Lab to load query results early, as soon as the first 100 rows are available. Since this uses `fetchmany`, it should be possible for all our databases, but this PR implements it for Presto only.

This has the benefit of making users wait less before seeing query results, enabling them to iterate on their queries quicker, and letting them visualize much earlier in the process.

Closes #4588

At a high level, the celery worker loads the first 100 rows into an S3 bucket and serves them. When the rest of the data is ready, it combines the old and the new and puts them in the same S3 bucket.
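The flow described above can be sketched roughly as follows (all names are hypothetical; a plain dict stands in for the S3-backed results store, and `sqlite3` for the Presto cursor):

```python
import sqlite3

def run_query_with_prefetch(cursor, store, key, prefetch_count=100):
    # Sketch of the described flow: publish the first batch under a key
    # as soon as it is available, then overwrite the same key with the
    # combined (first batch + remainder) result set once the query finishes.
    first = cursor.fetchmany(prefetch_count)
    store[key] = list(first)               # early, partial payload
    rest = cursor.fetchall()               # completes when the query does
    store[key] = list(first) + list(rest)  # full payload replaces the partial
    return store[key]

# Usage with an in-memory database of 5 rows and a prefetch of 2:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(5)])
store = {}
rows = run_query_with_prefetch(
    conn.execute("SELECT x FROM t ORDER BY x"), store, "results-key",
    prefetch_count=2)
```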
@graceguo-supercat @john-bodley @mistercrunch @williaster