[druid] Updating refresh logic #4655

john-bodley merged 1 commit into apache:master from the john-bodley-refresh-druid branch on Mar 27, 2018

john-bodley
Member

@john-bodley john-bodley commented Mar 21, 2018

Though the term "refresh" is somewhat vague from a Druid metadata perspective, I sense it translates to create or update. Previously we were creating or updating Druid columns but only creating Druid metrics when the Druid metadata was synced/refreshed.

This PR ensures that refreshing is consistent for both Druid columns and metrics and specifically addresses the following:

  1. Removes redundancy by controlling metric specifications only within the DruidMetric class. Previously there was somewhat duplicated logic in both the DruidColumn and DruidMetric classes.
  2. Renames generate_metrics to refresh_metrics to imply that we're both creating and updating.
  3. Updates the SQL filters to use IN rather than a series of ORs.
  4. Adds metric creation for float types.
  5. Adds the missing migration for creating Druid uniqueness constraints on the columns and metrics tables, which were added in "Adding YAML Import-Export for Datasources to CLI" (#3978). Note that, to avoid this MySQL issue, the metric_name column was reduced from 512 to 255 characters.
  6. Corrects the --merge option for the refresh_druid command, which should be a flag (true/false) rather than an option that takes a value.
  7. Note that I didn't want to change the structure of get_metrics in terms of checking whether said metric already exists, to ensure consistency with SQLA; hence the merging logic is handled in refresh_metrics.
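
To illustrate the flag-versus-option distinction in item 6, here is a minimal sketch using Python's standard argparse module. It is not Superset's actual CLI wiring; the parser setup and the `--datasource` argument are assumptions made purely for contrast.

```python
import argparse


def build_parser():
    """Build an illustrative parser; not the real refresh_druid command."""
    parser = argparse.ArgumentParser(prog="refresh_druid")
    # Hypothetical value-taking option, shown only for contrast.
    parser.add_argument("--datasource", default=None)
    # A flag: present means True, absent means False; it takes no value.
    parser.add_argument("--merge", action="store_true", default=False)
    return parser


parser = build_parser()
with_flag = parser.parse_args(["--merge"])
without_flag = parser.parse_args([])
```

With a flag, `--merge` never consumes the next token on the command line, so it cannot accidentally swallow a following argument the way a value-taking option would.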

@fabianmenges I only added the missing Druid migrations; however, I believe there are additional migrations from your PR (#3978) which are missing for the following tables:

  • table_columns
  • tables

to: @mistercrunch @Mogball

@john-bodley john-bodley force-pushed the john-bodley-refresh-druid branch 5 times, most recently from 7c69592 to 36bccde Compare March 21, 2018 06:44
@codecov-io

codecov-io commented Mar 21, 2018

Codecov Report

Merging #4655 into master will increase coverage by 0.12%.
The diff coverage is 90%.

@@            Coverage Diff             @@
##           master    #4655      +/-   ##
==========================================
+ Coverage   71.37%   71.49%   +0.12%     
==========================================
  Files         190      190              
  Lines       14918    14911       -7     
  Branches     1102     1102              
==========================================
+ Hits        10648    10661      +13     
+ Misses       4267     4247      -20     
  Partials        3        3
Impacted Files                        Coverage Δ
superset/cli.py                       48.38% <ø> (ø) ⬆️
superset/connectors/druid/views.py    68.02% <0%> (ø) ⬆️
superset/connectors/druid/models.py   78.85% <100%> (+2.5%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ec06967...a334220. Read the comment docs.

Contributor

@fabianmenges fabianmenges left a comment

Looks good to me. I'll check the state of the constraints for the tables you mentioned.

    'metric1': {
        'type': 'FLOAT', 'hasMultipleValues': False,
        'size': 100000, 'cardinality': None, 'errorMessage': None},
},
'aggregators': {
Contributor

@fabianmenges fabianmenges Mar 21, 2018

Are we not testing aggregators anymore?

Member Author

@fabianmenges I'm not the original author of this code so I don't have complete context, but what I observed whilst debugging was that these aggregators were not being tested. I've re-added this in case I was wrong.

"""Generate metrics based on the column metadata"""
metrics = self.get_metrics()
dbmetrics = (
    db.session.query(DruidMetric)
Contributor

Running a loop that issues a query for each metric/column means that many queries have to be made, as opposed to just one to grab them all at once. If you've got tons of metrics and such, this adds up and can be reasonably slow.

Member Author

@Mogball that was one of my concerns and I agree with your comment. I've refactored the logic to do one query per DruidColumn.
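
The difference between querying per metric and fetching everything at once can be sketched as follows. This is a simplified illustration using the standard sqlite3 module rather than the SQLAlchemy session and DruidMetric model from the PR; the table layout, the `refresh_metrics` signature, and the `metrics` mapping are all assumptions.

```python
import sqlite3


def refresh_metrics(conn, datasource, metrics):
    """Create or update metrics using one SELECT instead of one per metric.

    `metrics` maps metric_name -> JSON spec (a hypothetical shape).
    """
    names = list(metrics)
    if not names:
        return
    # A single IN filter replaces a loop of per-metric lookups.
    placeholders = ", ".join("?" for _ in names)
    rows = conn.execute(
        "SELECT metric_name FROM metrics "
        f"WHERE datasource_name = ? AND metric_name IN ({placeholders})",
        [datasource, *names],
    ).fetchall()
    existing = {row[0] for row in rows}
    for name, spec in metrics.items():
        if name in existing:
            # Update: the metric is already known for this datasource.
            conn.execute(
                "UPDATE metrics SET json = ? "
                "WHERE datasource_name = ? AND metric_name = ?",
                (spec, datasource, name),
            )
        else:
            # Create: the metric is new.
            conn.execute(
                "INSERT INTO metrics (datasource_name, metric_name, json) "
                "VALUES (?, ?, ?)",
                (datasource, name, spec),
            )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metrics ("
    "datasource_name TEXT, metric_name TEXT, json TEXT, "
    "UNIQUE (datasource_name, metric_name))"
)
conn.execute("INSERT INTO metrics VALUES ('ds', 'sum__m1', 'old')")
refresh_metrics(conn, "ds", {"sum__m1": "new", "min__m1": "new"})
result = dict(
    conn.execute("SELECT metric_name, json FROM metrics ORDER BY metric_name")
)
```

The lookup cost stays at one round trip regardless of how many metrics are refreshed; only the writes scale with the number of metrics.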

@john-bodley john-bodley force-pushed the john-bodley-refresh-druid branch 2 times, most recently from 51564ee to 75e886e Compare March 21, 2018 20:36
@@ -220,13 +220,13 @@ def refresh(self, datasource_names, merge_flag, refreshAll):
 if datatype == 'hyperUnique' or datatype == 'thetaSketch':
     col_obj.count_distinct = True
 # Allow sum/min/max for long or double
-if datatype == 'LONG' or datatype == 'DOUBLE':
+if datatype == 'LONG' or datatype in ('FLOAT', 'DOUBLE'):
Member

could go col_obj.is_num() that comes from the base column class and includes all these
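
A rough sketch of the idea behind this suggestion: centralise the numeric-type check on the base column class so callers don't enumerate types themselves. The class below is hypothetical, not Superset's actual base column; the `num_types` list is an assumption, and `is_num` is modelled here as a property.

```python
class BaseColumn:
    """Hypothetical base column; not Superset's real class."""

    # Assumed list of type names treated as numeric.
    num_types = ("LONG", "FLOAT", "DOUBLE", "INT", "BIGINT")

    def __init__(self, type_):
        self.type = type_

    @property
    def is_num(self):
        # True when the column's type is one of the known numeric types.
        return bool(self.type) and self.type.upper() in self.num_types


float_col = BaseColumn("FLOAT")
string_col = BaseColumn("STRING")
```

A check like `if col_obj.is_num:` then covers LONG, FLOAT, and DOUBLE in one place, so adding a new numeric type does not require touching each call site.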

@john-bodley john-bodley force-pushed the john-bodley-refresh-druid branch 4 times, most recently from acd8e5a to ca91e67 Compare March 22, 2018 00:52
with db.session.no_autoflush:
    db.session.add(metric)
def refresh_metrics(self):
    for col in self.columns:
Contributor

I mean the same could apply here as well, where it's possible to combine all of the columns' metrics into one query

@john-bodley john-bodley force-pushed the john-bodley-refresh-druid branch from ca91e67 to a334220 Compare March 22, 2018 06:01
@john-bodley john-bodley merged commit f9d85bd into apache:master Mar 27, 2018
@john-bodley john-bodley deleted the john-bodley-refresh-druid branch March 27, 2018 01:35
john-bodley added a commit to john-bodley/superset that referenced this pull request Mar 27, 2018
# Add the missing uniqueness constraints.
for table, column in names.items():
    with op.batch_alter_table(table, naming_convention=conv) as batch_op:
        batch_op.create_unique_constraint(
Member

Hit an issue on this line while upgrading our staging. I wrapped the statement in a try block locally so that I could move forward

Member

For the record it was something to the effect of the constraint existing already
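
The local workaround described above (wrapping the statement in a try block) might look roughly like the following. This sketch uses the standard sqlite3 module and a unique index as a stand-in for Alembic's `batch_op.create_unique_constraint`; it is not the migration's code, and the `create_unique_index` helper and its naming scheme are hypothetical.

```python
import sqlite3


def create_unique_index(conn, table, column):
    """Create a unique index, tolerating one left over from a partial run.

    Returns True if the index was created, False if it already existed.
    """
    name = f"uq_{table}_{column}"
    try:
        conn.execute(f"CREATE UNIQUE INDEX {name} ON {table} ({column})")
        return True
    except sqlite3.OperationalError as exc:
        # Only swallow the "already exists" case; re-raise anything else.
        if "already exists" not in str(exc):
            raise
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (metric_name TEXT)")
first = create_unique_index(conn, "metrics", "metric_name")
second = create_unique_index(conn, "metrics", "metric_name")
```

Catching only the specific "already exists" failure keeps the rerun safe without hiding unrelated errors such as a missing table.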

@john-bodley
Member Author

@mistercrunch do you recall whether the constraint which existed previously was added manually? As far as I can tell this constraint never existed and thus the upgrade/downgrade logic should be sound.

@mistercrunch
Member

Can't recall creating it. Could also be that it failed half way through before or timed out and got this error the next time around... Dunno.

michellethomas pushed a commit to michellethomas/panoramix that referenced this pull request May 24, 2018
wenchma pushed a commit to wenchma/incubator-superset that referenced this pull request Nov 16, 2018
@mistercrunch mistercrunch added the 🏷️ bot and 🚢 0.25.0 labels Feb 27, 2024
5 participants