fix: Remedy logic for UpdateDatasetCommand uniqueness check #28341

Member

@john-bodley john-bodley commented May 4, 2024

SUMMARY

At Airbnb we detected a number of duplicate datasets even though the UpdateDatasetCommand.validate() method checks that the new dataset name is unique. The issue was that the check wrongly used the schema of the existing (source) table rather than the target schema.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TESTING INSTRUCTIONS

Added unit tests.

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@john-bodley john-bodley force-pushed the john-bodley--fix-update-dataset-command-uniqueness-check branch from 3b626e2 to 62ae9d2 Compare May 4, 2024 03:33

  # Validate uniqueness
  if not DatasetDAO.validate_uniqueness(database_id, table):
-     exceptions.append(DatasetExistsValidationError(table_name))
+     exceptions.append(DatasetExistsValidationError(table))
Member Author

We should be using a Table object rather than the table name, which is too terse, i.e., it doesn't contain the schema or catalog.
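
As a rough illustration of why a bare table name is too terse for the error message, consider the sketch below. The `Table` dataclass here is a hypothetical stand-in written for this example, not Superset's actual class, and both its string format and the `exists_message` helper are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Table:
    """Hypothetical stand-in for a fully qualified table reference."""

    table: str
    schema: Optional[str] = None
    catalog: Optional[str] = None

    def __str__(self) -> str:
        # Join only the parts that are set, outermost qualifier first.
        parts = [self.catalog, self.schema, self.table]
        return ".".join(part for part in parts if part)


def exists_message(table: Table) -> str:
    # Assumed message format, modeled on the "already exists" text in this PR.
    return f"Dataset {table} already exists"
```

With a schema attached, the message names the conflicting table unambiguously; with only a table name, two datasets in different schemas would produce the same message.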

# Validate uniqueness
if not DatasetDAO.validate_update_uniqueness(
self._model.database_id,
Table(table_name, self._model.schema, self._model.catalog),
Member Author

Here's the bug. The schema should be the schema defined in the properties and not the schema associated with the model.
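
The bug can be reproduced in miniature with the sketch below. All names (`Table`, `Dataset`, `check_buggy`, `check_fixed`) are hypothetical simplifications for illustration and do not mirror Superset's real DAO or command signatures; only the idea (source schema versus target schema) comes from this PR.

```python
from typing import NamedTuple, Optional


class Table(NamedTuple):
    """Hypothetical stand-in for a (table, schema, catalog) reference."""

    table: str
    schema: Optional[str] = None
    catalog: Optional[str] = None


class Dataset(NamedTuple):
    """Hypothetical, simplified dataset record."""

    id: int
    database_id: int
    table: Table


def validate_update_uniqueness(datasets, database_id, table, dataset_id) -> bool:
    """Return True if no *other* dataset in the database uses `table`."""
    return not any(
        d.database_id == database_id and d.table == table and d.id != dataset_id
        for d in datasets
    )


def check_buggy(datasets, model, properties) -> bool:
    # Bug: pairs the new name with the schema of the existing (source) model.
    table = Table(properties["table_name"], model.table.schema, model.table.catalog)
    return validate_update_uniqueness(datasets, model.database_id, table, model.id)


def check_fixed(datasets, model, properties) -> bool:
    # Fix: take the schema and catalog from the update properties (the target).
    table = Table(
        properties["table_name"],
        properties.get("schema"),
        properties.get("catalog"),
    )
    return validate_update_uniqueness(datasets, model.database_id, table, model.id)


datasets = [
    Dataset(1, 1, Table("energy_usage", "staging")),
    Dataset(2, 1, Table("energy_usage", "prod")),
]
# Move dataset 1 to schema "prod", where "energy_usage" already exists.
properties = {"table_name": "energy_usage", "schema": "prod"}
```

In this toy setup `check_buggy(datasets, datasets[0], properties)` reports the update as unique, because it compares against the source schema "staging", whereas `check_fixed` detects the collision with dataset 2.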

@@ -0,0 +1,43 @@
from unittest.mock import MagicMock
Member Author

@john-bodley john-bodley May 4, 2024

I was _shocked_ that there were zero unit or integration tests for the UpdateDatasetCommand.

Though this test isn't pretty, it's a start.
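
For readers unfamiliar with the approach, a MagicMock-based unit test of a validation step might look roughly like the following. `UpdateCommand`, `DatasetExistsError`, and the injected `dao` are hypothetical and much simpler than the real UpdateDatasetCommand; this is a sketch of the testing pattern, not the test added in this PR.

```python
from unittest.mock import MagicMock


class DatasetExistsError(Exception):
    """Hypothetical error raised when the target table is already a dataset."""


class UpdateCommand:
    """Hypothetical, simplified command; not Superset's UpdateDatasetCommand."""

    def __init__(self, dao, properties):
        self._dao = dao
        self._properties = properties

    def validate(self) -> None:
        target = (self._properties["table_name"], self._properties.get("schema"))
        if not self._dao.validate_update_uniqueness(target):
            raise DatasetExistsError(target)


def test_validate_rejects_duplicate() -> None:
    dao = MagicMock()
    # Simulate the DAO reporting that the target table is already taken.
    dao.validate_update_uniqueness.return_value = False
    command = UpdateCommand(dao, {"table_name": "energy_usage", "schema": "prod"})
    try:
        command.validate()
    except DatasetExistsError:
        pass
    else:
        raise AssertionError("expected DatasetExistsError")
    # The uniqueness check must receive the target schema from the properties.
    dao.validate_update_uniqueness.assert_called_once_with(("energy_usage", "prod"))
```

Asserting on the arguments passed to the mock is what would have caught the original bug, since the wrong schema would show up in the recorded call.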

@@ -51,7 +51,7 @@ def test_validate_update_uniqueness(session: Session) -> None:
db.session.add_all([database, dataset1, dataset2])
db.session.flush()

# same table name, different schema
Member Author

@john-bodley john-bodley May 4, 2024

These comments were in the wrong order (AFAICT) and thus misleading. It's probably best not to have comments if the logic is clear from the code.

@codecov-commenter

codecov-commenter commented May 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.21%. Comparing base (76d897e) to head (debeea5).
Report is 1094 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff             @@
##           master   #28341       +/-   ##
===========================================
+ Coverage   60.48%   83.21%   +22.72%     
===========================================
  Files        1931      521     -1410     
  Lines       76236    37179    -39057     
  Branches     8568        0     -8568     
===========================================
- Hits        46114    30938    -15176     
+ Misses      28017     6241    -21776     
+ Partials     2105        0     -2105     
Flag         Coverage Δ
hive         49.11% <35.71%> (-0.05%) ⬇️
javascript   ?
mysql        77.23% <100.00%> (?)
postgres     77.35% <100.00%> (?)
presto       53.71% <35.71%> (-0.10%) ⬇️
python       83.21% <100.00%> (+19.72%) ⬆️
sqlite       76.81% <100.00%> (?)
unit         58.23% <71.42%> (+0.60%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.


@john-bodley john-bodley force-pushed the john-bodley--fix-update-dataset-command-uniqueness-check branch 2 times, most recently from 214d13d to 96217ed Compare May 4, 2024 04:28
@pull-request-size pull-request-size bot added size/L and removed size/M labels May 4, 2024
@john-bodley john-bodley force-pushed the john-bodley--fix-update-dataset-command-uniqueness-check branch 6 times, most recently from 999fb95 to 39b95b0 Compare May 4, 2024 06:20
- "message": {"table_name": ["Dataset energy_usage already exists"]}
+ "message": {
+     "table": [
+         f"Dataset {Table(energy_usage_ds.table_name, schema)} already exists"
Member Author

The schema differs between SQLite and PostgreSQL and thus this is parameterized.
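
A minimal sketch of what such parameterization could look like is shown below. The backend-to-schema mapping ("main" for SQLite, "public" for PostgreSQL) and the `schema.table` string form are assumptions made for this example; `expected_error` is a hypothetical helper, and the actual test derives the schema differently.

```python
def expected_error(table_name: str, schema: str) -> dict:
    # Assumed payload shape, modeled on the assertion shown in this PR.
    return {"message": {"table": [f"Dataset {schema}.{table_name} already exists"]}}


# Assumed default schemas per backend (not taken from the PR itself).
DEFAULT_SCHEMAS = {"sqlite": "main", "postgresql": "public"}

# Build the expected payload once per backend instead of hard-coding one string.
expected = {
    backend: expected_error("energy_usage", schema)
    for backend, schema in DEFAULT_SCHEMAS.items()
}
```

Deriving the expected message from the schema keeps a single assertion valid across backends rather than baking one backend's schema into the test.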

@john-bodley john-bodley marked this pull request as ready for review May 4, 2024 06:39
@@ -86,15 +86,21 @@ def validate(self) -> None:
except SupersetSecurityException as ex:
raise DatasetForbiddenError() from ex

database_id = self._properties.get("database", None)
Member Author

The default for dict.get() is None and thus there's no need to explicitly define it.
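
A quick check of that behavior, using a made-up `properties` dict for illustration:

```python
properties = {"table_name": "energy_usage"}

# Both calls return None when the key is missing, so the explicit
# default in the first form is redundant.
explicit = properties.get("database", None)
implicit = properties.get("database")
```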

- "message": {"table_name": ["Dataset energy_usage already exists"]}
+ "message": {
+     "table": [
+         f"Dataset {Table(energy_usage_ds.table_name, schema)} already exists"
Member Author

See previous comment.

@john-bodley john-bodley force-pushed the john-bodley--fix-update-dataset-command-uniqueness-check branch from 39b95b0 to d90b0be Compare May 4, 2024 15:35
table = Table(
self._properties.get("table_name"), # type: ignore
self._properties.get("schema"),
self._properties.get("catalog"),
Member Author

This doesn't exist in the schema, though it is consistent with other logic. See 6cf681d#r141673924 for details.

@@ -51,7 +51,7 @@ def test_validate_update_uniqueness(session: Session) -> None:
db.session.add_all([database, dataset1, dataset2])
db.session.flush()

# same table name, different schema
#
Member

Suggested change
#

Member Author

🤦

schema = self._properties.get("schema")
catalog = self._properties.get("catalog")
sql = self._properties.get("sql")
owner_ids: Optional[list[int]] = self._properties.get("owners")

- table = Table(table_name, schema, catalog)
+ table = Table(self._properties["table_name"], schema, catalog)
Member

I think the previous version is more readable, as it groups the variable definitions in one place.

Suggested change
- table = Table(self._properties["table_name"], schema, catalog)
+ table = Table(table_name, schema, catalog)

Member Author

@michael-s-molina personally, I try to remove single-use variables (where possible), as I find it makes the code easier to read/grok.

@john-bodley john-bodley force-pushed the john-bodley--fix-update-dataset-command-uniqueness-check branch 2 times, most recently from b77077f to f48e663 Compare May 6, 2024 18:39
@john-bodley john-bodley force-pushed the john-bodley--fix-update-dataset-command-uniqueness-check branch from f48e663 to debeea5 Compare May 6, 2024 19:03
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label May 7, 2024
@john-bodley john-bodley merged commit 467e612 into apache:master May 7, 2024
29 checks passed
@michael-s-molina michael-s-molina added the v4.0 Label added by the release manager to track PRs to be included in the 4.0 branch label May 8, 2024
imancrsrk pushed a commit to imancrsrk/superset that referenced this pull request May 10, 2024
jzhao62 pushed a commit to jzhao62/superset that referenced this pull request May 16, 2024
EnxDev pushed a commit to EnxDev/superset that referenced this pull request May 31, 2024
@mistercrunch mistercrunch added 🍒 4.0.2 🏷️ bot A label used by `supersetbot` to keep track of which PRs were auto-tagged with release labels labels Jul 24, 2024
vinothkumar66 pushed a commit to vinothkumar66/superset that referenced this pull request Nov 11, 2024
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels lgtm This PR has been approved by a maintainer size/L v4.0 Label added by the release manager to track PRs to be included in the 4.0 branch 🍒 4.0.2 🚢 4.1.0