-
Notifications
You must be signed in to change notification settings - Fork 14k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: welcome presto to the suite of tested databases #10498
Conversation
9172b7e
to
e0841a1
Compare
1c17b53
to
e0841a1
Compare
requirements-dev.txt
Outdated
pyhive==0.6.2 | ||
# Enable in-place multirow inserts | ||
# TODO(bkyryliuk): release new version of pyhive | ||
git+https://github.com/dropbox/PyHive@master |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will be a blocker - the community has not been accepting of installing dependencies from Github.
superset/examples/birth_names.py
Outdated
@@ -54,19 +54,26 @@ def gen_filter( | |||
|
|||
def load_data(tbl_name: str, database: Database, sample: bool = False) -> None: | |||
pdf = pd.read_json(get_example_data("birth_names.json.gz")) | |||
pdf.ds = pd.to_datetime(pdf.ds, unit="ms") | |||
if database.backend != "presto": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: switching the order of the conditions might make this more readable.
@@ -1022,6 +1022,13 @@ def get_example_database() -> "Database": | |||
return get_or_create_db("examples", db_uri) | |||
|
|||
|
|||
def get_main_database() -> "Database": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We usually refer to this as the "metadata" database - should that be in the function name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is called in superset as a main database on the dev installation, I am fine with renaming it - however we should probably do it across the board in a separate PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, forgive the confusion on my part. This is fine as-is then.
Sample test data Datetime conversion Sample test data Fix tests
695c296
to
e90deaa
Compare
cc2b2e0
to
d45cbf9
Compare
d45cbf9
to
8fee355
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should aim to put as much db-specific logic in db_engine_specs
, especially in the examples loading logic. We need to add similar logic for mutating column names, e.g. BigQuery doesn't support columns starting with a number (there's other similar problems for other dbs).
WRT the test assertions, I think those are fine left as-is. Also curious about that extra cache key test failing, that seems like a potential security problem, so I'm happy to help track down what's causing that.
if database.backend == "presto": | ||
pdf.ds = pd.to_datetime(pdf.ds, unit="ms") | ||
pdf.ds = pdf.ds.dt.strftime("%Y-%m-%d %H:%M%:%S") | ||
else: | ||
pdf.ds = pd.to_datetime(pdf.ds, unit="ms") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic should ideally be moved out to db_engine_specs/presto.py
to keep this clean of db-specific logic. Something along the lines of BaseEngineSpec.convert_examples_datetime()
or similar (there are other similar methods there).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my plan here is a bit different, we have talked about getting rid of load_examples for the test, I plan to move initialization code into the pytest fixture and will do that cleanup & restructure code to be more database generic.
The scope of the changes is test only, I prefer not to modify production code.
# TODO(bkyryliuk): use TIMESTAMP type for presto | ||
"ds": DateTime if database.backend != "presto" else String(255), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, e.g. get_examples_datetime_type()
. There are a few places below to which I feel the same applies.
tests/celery_tests.py
Outdated
expected_result = [] | ||
if backend == "presto": | ||
expected_result = ( | ||
[{"rows": 1}] if ctas_method == CtasMethod.TABLE else [{"result": True}] | ||
) | ||
self.assertEqual(expected_result, result["data"]) | ||
expected_columns = [] | ||
if backend == "presto": | ||
expected_columns = [ | ||
{ | ||
"name": "rows" if ctas_method == CtasMethod.TABLE else "result", | ||
"type": "BIGINT" if ctas_method == CtasMethod.TABLE else "BOOLEAN", | ||
"is_date": False, | ||
} | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also runs the risk of becoming convoluted if we introduce db-specific logic for all supported dbs. As this logic is repeated below, I think there is merit to centralizing this somewhere, too, but db_engine_specs
probably isn't the right place for test assertion related stuff. I'm open to suggestions, but feel we can probably leave this as-is until we come up with a more scalable solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree, will tackle it in the upcoming PRs, do not have clear solution yet.
most likely it would be possible to introduce TestDBSpec and hide that complexity there.
@@ -222,7 +223,7 @@ def test_get_select_star_not_found_table(self): | |||
return | |||
uri = f"api/v1/database/{example_db.id}/select_star/table_does_not_exist/" | |||
rv = self.client.get(uri) | |||
self.assertEqual(rv.status_code, 404) | |||
self.assertEqual(rv.status_code, 404 if example_db.backend != "presto" else 500) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be interested in understanding why this is returning a 500. Perhaps add a TODO here so we can follow up on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pyhive raises the exception, probably it is not caught, left todo
tests/sqla_models_tests.py
Outdated
# TODO: make it work with presto | ||
if get_example_database().backend == "presto": | ||
assert extra_cache_keys == [] | ||
else: | ||
assert extra_cache_keys == ["abc"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, I wonder why this is failing for presto. This really shouldn't have anything to do with the underlying analytical db type..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree, did not investigate - worth looking into
tests/sqla_models_tests.py
Outdated
# TODO: make it work with presto | ||
if get_example_database().backend == "presto": | ||
assert extra_cache_keys == [] | ||
else: | ||
assert extra_cache_keys == ["abc"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here..
Agree with all the suggestions, my only comment here is this PR is fairly large already - I am happy to tackle the suggestions in the followup PRs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good and I agree it makes sense to break it up into more small PRs than aim for a solve-it-all mega PR. LGTM.
* Add presto to the CI Sample test data Datetime conversion Sample test data Fix tests * TODO to switch to timestamps * Address feedback * Update requirements * Add TODOs Co-authored-by: bogdan kyryliuk <[email protected]>
* Add presto to the CI Sample test data Datetime conversion Sample test data Fix tests * TODO to switch to timestamps * Address feedback * Update requirements * Add TODOs Co-authored-by: bogdan kyryliuk <[email protected]>
Goal: adds presto to the CI.
Non goal: clean up inconsistencies in handling different databases
Follow up: there will be more PRs cleaning up inconsistent handling of the presto db, moving load_examples into the pytest fixture etc
SUMMARY
Based on: #10487
Test only change.
TEST PLAN