-
Notifications
You must be signed in to change notification settings - Fork 14.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Remedy logic for UpdateDatasetCommand uniqueness check #28341
fix: Remedy logic for UpdateDatasetCommand uniqueness check #28341
Conversation
3b626e2
to
62ae9d2
Compare
|
||
# Validate uniqueness | ||
if not DatasetDAO.validate_uniqueness(database_id, table): | ||
exceptions.append(DatasetExistsValidationError(table_name)) | ||
exceptions.append(DatasetExistsValidationError(table)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be using a Table
object rather than the table name which is too terse, i.e., doesn't contain the schema or catalog.
# Validate uniqueness | ||
if not DatasetDAO.validate_update_uniqueness( | ||
self._model.database_id, | ||
Table(table_name, self._model.schema, self._model.catalog), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's the bug. The schema should be the schema defined in the properties and not the schema associated with the model.
@@ -0,0 +1,43 @@ | |||
from unittest.mock import MagicMock |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was _shocked that there were zero unit or integration tests for the UpdateDatasetCommand
.
Though this test isn't pretty, it's a start.
@@ -51,7 +51,7 @@ def test_validate_update_uniqueness(session: Session) -> None: | |||
db.session.add_all([database, dataset1, dataset2]) | |||
db.session.flush() | |||
|
|||
# same table name, different schema |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These comments were ordered wrongly (AFAICT) and thus misleading. It's probably best to not have comments if the logic is clear from the code.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #28341 +/- ##
===========================================
+ Coverage 60.48% 83.21% +22.72%
===========================================
Files 1931 521 -1410
Lines 76236 37179 -39057
Branches 8568 0 -8568
===========================================
- Hits 46114 30938 -15176
+ Misses 28017 6241 -21776
+ Partials 2105 0 -2105
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
214d13d
to
96217ed
Compare
999fb95
to
39b95b0
Compare
"message": {"table_name": ["Dataset energy_usage already exists"]} | ||
"message": { | ||
"table": [ | ||
f"Dataset {Table(energy_usage_ds.table_name, schema)} already exists" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The schema differs between SQLite and PostgreSQL and thus this is parameterized.
@@ -86,15 +86,21 @@ def validate(self) -> None: | |||
except SupersetSecurityException as ex: | |||
raise DatasetForbiddenError() from ex | |||
|
|||
database_id = self._properties.get("database", None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default for dict.get()
is None
and thus there's no need to explicitly define it.
"message": {"table_name": ["Dataset energy_usage already exists"]} | ||
"message": { | ||
"table": [ | ||
f"Dataset {Table(energy_usage_ds.table_name, schema)} already exists" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See previous comment.
39b95b0
to
d90b0be
Compare
table = Table( | ||
self._properties.get("table_name"), # type: ignore | ||
self._properties.get("schema"), | ||
self._properties.get("catalog"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't exist in the schema, though is consistent with other logic. See 6cf681d#r141673924 for details.
tests/unit_tests/dao/dataset_test.py
Outdated
@@ -51,7 +51,7 @@ def test_validate_update_uniqueness(session: Session) -> None: | |||
db.session.add_all([database, dataset1, dataset2]) | |||
db.session.flush() | |||
|
|||
# same table name, different schema | |||
# |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤦
schema = self._properties.get("schema") | ||
catalog = self._properties.get("catalog") | ||
sql = self._properties.get("sql") | ||
owner_ids: Optional[list[int]] = self._properties.get("owners") | ||
|
||
table = Table(table_name, schema, catalog) | ||
table = Table(self._properties["table_name"], schema, catalog) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the previous version is more readable as it groups variables definition in just one place.
table = Table(self._properties["table_name"], schema, catalog) | |
table = Table(table_name, schema, catalog) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@michael-s-molina personally I try to remove single use variables (where possible) as I find it easier to read/grok the code.
b77077f
to
f48e663
Compare
f48e663
to
debeea5
Compare
SUMMARY
At Airbnb we detected a number of duplicate datasets even though, per the
UpdateDatasetCommand.validate()
method, there is a check to ensure that the new dataset name is unique. The issue was that it wrongfully used the schema of the existing (source) table as opposed to the target schema.BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
TESTING INSTRUCTIONS
Added unit tests.
ADDITIONAL INFORMATION