Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: welcome presto to the suite of tested databases #10498

Merged
merged 5 commits into from
Aug 6, 2020

Conversation

bkyryliuk
Copy link
Member

@bkyryliuk bkyryliuk commented Aug 1, 2020

Goal: adds presto to the CI.
Non goal: clean up inconsistencies in handling different databases

Follow up: there will be more PRs cleaning up inconsistent handling of the presto db, moving load_examples into the pytest fixture etc

SUMMARY

  • introduces test suite with presto & memory connector
  • splits main from examples db in the tests

Based on: #10487
Test only change.

TEST PLAN

  • CI

@bkyryliuk bkyryliuk marked this pull request as draft August 1, 2020 19:46
@bkyryliuk bkyryliuk force-pushed the bogdan/add_presto_to_tests branch 6 times, most recently from 9172b7e to e0841a1 Compare August 3, 2020 18:22
@bkyryliuk bkyryliuk marked this pull request as ready for review August 3, 2020 18:44
@bkyryliuk bkyryliuk force-pushed the bogdan/add_presto_to_tests branch 2 times, most recently from 1c17b53 to e0841a1 Compare August 4, 2020 16:59
pyhive==0.6.2
# Enable in-place multirow inserts
# TODO(bkyryliuk): release new version of pyhive
git+https://github.com/dropbox/PyHive@master
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will be a blocker - the community has not been accepting of installing dependencies from Github.

@@ -54,19 +54,26 @@ def gen_filter(

def load_data(tbl_name: str, database: Database, sample: bool = False) -> None:
pdf = pd.read_json(get_example_data("birth_names.json.gz"))
pdf.ds = pd.to_datetime(pdf.ds, unit="ms")
if database.backend != "presto":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: switching the order of the conditions might make this more readable.

@@ -1022,6 +1022,13 @@ def get_example_database() -> "Database":
return get_or_create_db("examples", db_uri)


def get_main_database() -> "Database":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually refer to this as the "metadata" database - should that be in the function name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is called in superset as a main database on the dev installation, I am fine with renaming it - however we should probably do it across the board in a separate PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, forgive the confusion on my part. This is fine as-is then.

Sample test data

Datetime conversion

Sample test data

Fix tests
@bkyryliuk bkyryliuk force-pushed the bogdan/add_presto_to_tests branch 2 times, most recently from 695c296 to e90deaa Compare August 5, 2020 18:15
@bkyryliuk bkyryliuk force-pushed the bogdan/add_presto_to_tests branch 2 times, most recently from cc2b2e0 to d45cbf9 Compare August 5, 2020 21:58
Copy link
Member

@villebro villebro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should aim to put as much db-specific logic in db_engine_specs, especially in the examples loading logic. We need to add similar logic for mutating column names, e.g. BigQuery doesn't support columns starting with a number (there's other similar problems for other dbs).

WRT the test assertions, I think those are fine left as-is. Also curious about that extra cache key test failing, that seems like a potential security problem, so I'm happy to help track down what's causing that.

Comment on lines +57 to +61
if database.backend == "presto":
pdf.ds = pd.to_datetime(pdf.ds, unit="ms")
pdf.ds = pdf.ds.dt.strftime("%Y-%m-%d %H:%M%:%S")
else:
pdf.ds = pd.to_datetime(pdf.ds, unit="ms")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic should ideally be moved out to db_engine_specs/presto.py to keep this clean of db-specific logic. Something along the lines of BaseEngineSpec.convert_examples_datetime() or similar (there are other similar methods there).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my plan here is a bit different, we have talked about getting rid of load_examples for the test, I plan to move initialization code into the pytest fixture and will do that cleanup & restructure code to be more database generic.

The scope of the changes is test only, I prefer not to modify production code.

Comment on lines +70 to +71
# TODO(bkyryliuk): use TIMESTAMP type for presto
"ds": DateTime if database.backend != "presto" else String(255),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, e.g. get_examples_datetime_type(). There are a few places below to which I feel the same applies.

Comment on lines 181 to 195
expected_result = []
if backend == "presto":
expected_result = (
[{"rows": 1}] if ctas_method == CtasMethod.TABLE else [{"result": True}]
)
self.assertEqual(expected_result, result["data"])
expected_columns = []
if backend == "presto":
expected_columns = [
{
"name": "rows" if ctas_method == CtasMethod.TABLE else "result",
"type": "BIGINT" if ctas_method == CtasMethod.TABLE else "BOOLEAN",
"is_date": False,
}
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also runs the risk of becoming convoluted if we introduce db-specific logic for all supported dbs. As this logic is repeated below, I think there is merit to centralizing this somewhere, too, but db_engine_specs probably isn't the right place for test assertion related stuff. I'm open to suggestions, but feel we can probably leave this as-is until we come up with a more scalable solution.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree, will tackle it in the upcoming PRs, do not have clear solution yet.
most likely it would be possible to introduce TestDBSpec and hide that complexity there.

@@ -222,7 +223,7 @@ def test_get_select_star_not_found_table(self):
return
uri = f"api/v1/database/{example_db.id}/select_star/table_does_not_exist/"
rv = self.client.get(uri)
self.assertEqual(rv.status_code, 404)
self.assertEqual(rv.status_code, 404 if example_db.backend != "presto" else 500)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be interested in understanding why this is returning a 500. Perhaps add a TODO here so we can follow up on it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pyhive raises the exception, probably it is not caught, left todo

Comment on lines 96 to 100
# TODO: make it work with presto
if get_example_database().backend == "presto":
assert extra_cache_keys == []
else:
assert extra_cache_keys == ["abc"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I wonder why this is failing for presto. This really shouldn't have anything to do with the underlying analytical db type..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree, did not investigate - worth looking into

Comment on lines 132 to 136
# TODO: make it work with presto
if get_example_database().backend == "presto":
assert extra_cache_keys == []
else:
assert extra_cache_keys == ["abc"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here..

@bkyryliuk
Copy link
Member Author

I think we should aim to put as much db-specific logic in db_engine_specs, especially in the examples loading logic. We need to add similar logic for mutating column names, e.g. BigQuery doesn't support columns starting with a number (there's other similar problems for other dbs).

WRT the test assertions, I think those are fine left as-is. Also curious about that extra cache key test failing, that seems like a potential security problem, so I'm happy to help track down what's causing that.

Agree with all the suggestions, my only comment here is this PR is fairly large already - I am happy to tackle the suggestions in the followup PRs.

Copy link
Member

@villebro villebro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good and I agree it makes sense to break it up into more small PRs than aim for a solve-it-all mega PR. LGTM.

@bkyryliuk bkyryliuk merged commit 62b873e into apache:master Aug 6, 2020
Ofeknielsen pushed a commit to ofekisr/incubator-superset that referenced this pull request Oct 5, 2020
* Add presto to the CI

Sample test data

Datetime conversion

Sample test data

Fix tests

* TODO to switch to timestamps

* Address feedback

* Update requirements

* Add TODOs

Co-authored-by: bogdan kyryliuk <[email protected]>
auxten pushed a commit to auxten/incubator-superset that referenced this pull request Nov 20, 2020
* Add presto to the CI

Sample test data

Datetime conversion

Sample test data

Fix tests

* TODO to switch to timestamps

* Address feedback

* Update requirements

* Add TODOs

Co-authored-by: bogdan kyryliuk <[email protected]>
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.38.0 labels Mar 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels size/XL 🚢 0.38.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants