Remove DatasetConfig.dataset field + New Get DatasetConfig Endpoints [#1763] #2096
Conversation
- Add a data migration to delete the column
- Remove the "dataset" key from the data sent to DatasetConfig.create_or_update in the patch_dataset_configs endpoint
- Leave the "dataset" key in the data sent to DatasetConfig.upsert_with_ctl_dataset, because this method upserts the CTL Dataset first and then the DatasetConfig. It is currently used in the old patch dataset config endpoints (soon to be deprecated) as well as when you're creating a connection config from a template.
- Update many fixtures to not pass a "dataset" key to DatasetConfig.create
- Throw a 404 if the ctl_dataset_id does not exist when creating a DatasetConfig through the new patch_dataset_configs endpoint
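The 404 behavior in the last bullet can be sketched in plain Python. This is an illustrative stand-in, not the actual fides code: CTL_DATASETS, NotFoundError, and patch_dataset_config are hypothetical names, with a dict standing in for the ctl_datasets table and a custom exception standing in for FastAPI's HTTPException(status_code=404).

```python
# Stand-in "table" of CTL Datasets keyed by id (hypothetical data).
CTL_DATASETS = {"ctl_123": {"fides_key": "postgres_example"}}


class NotFoundError(Exception):
    """Stand-in for FastAPI's HTTPException(status_code=404)."""


def patch_dataset_config(ctl_dataset_id: str) -> dict:
    """Fail fast with a 404-style error if the referenced CTL Dataset is missing."""
    ctl_dataset = CTL_DATASETS.get(ctl_dataset_id)
    if ctl_dataset is None:
        raise NotFoundError(f"No CTL dataset found with id {ctl_dataset_id}")
    # Only the foreign key is stored; no "dataset" contents are written.
    return {"ctl_dataset_id": ctl_dataset_id}
```

Failing early here avoids creating a DatasetConfig row with a dangling foreign key.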
8dfc7e4 to 3b093f2
…instead of a Datasets tag to not conflict with CTL Datasets. Dataset Configs link a Connection Config to a CTL Dataset.
…a datasetconfig. Data categories must exist in the db.
…e fides_key of the DatasetConfig as well as the entire nested CTL Dataset which has the fides_key of the separate CTL Dataset resource.
…an collection to use new flow to upsert the CTL DatasetConfig before creating a ConnectionConfig.
)

try:
    fetched_dataset: Dataset = Dataset.from_orm(ctl_dataset)
CTL Datasets can currently be created without an organization_fides_key or a data_qualifier, so customers may have existing database records where this is the case. However, these are not optional fields on the Fideslang Dataset schema.
Trying to parse an existing CTL dataset with these issues would throw a 500 so I catch and throw a 422 instead. The issue would need to be corrected through the CTL dataset endpoint.
We should separately look into using the Fideslang Dataset schema to validate when CTL datasets are created, or alternatively allow these fields to be optional on the Fideslang Dataset. Allison's noticed this too: #2113 (comment)
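The catch-and-convert behavior described above can be sketched in plain Python. Everything here is illustrative: REQUIRED_FIELDS stands in for the non-optional fields on the Fideslang Dataset schema, and UnprocessableEntity stands in for HTTPException(status_code=422); the real code would catch a pydantic ValidationError from Dataset.from_orm.

```python
# Hypothetical stand-in for the Fideslang Dataset schema's required fields.
REQUIRED_FIELDS = ("organization_fides_key", "data_qualifier")


class UnprocessableEntity(Exception):
    """Stand-in for HTTPException(status_code=422)."""


def parse_ctl_dataset(record: dict) -> dict:
    """Surface a bad legacy record as a 422-style error instead of a 500."""
    missing = [f for f in REQUIRED_FIELDS if record.get(f) is None]
    if missing:
        # The real code catches pydantic.ValidationError and re-raises as 422.
        raise UnprocessableEntity(f"missing required fields: {missing}")
    return record
```

The caller then fixes the record through the CTL dataset endpoint, matching the behavior described above.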
should we make a ticket for this issue?
yep will do!
raise HTTPException(
    status_code=HTTP_422_UNPROCESSABLE_ENTITY, detail=e.errors()
)
validate_data_categories(fetched_dataset, db)
Since we have to fetch the dataset anyway to validate for saas connectors, I'm going ahead and checking that its data categories are also in the database, like we do on the existing dataset config endpoints. If data categories are invalid, they need to be addressed on the ctl dataset.
It's easier to add this check here than to a generic crud endpoint on the ctl side, but I'm not sure how I feel about it. It's more of a meta question about where is the best place to validate that data categories exist? On resource creation? When we use the resource, like in a DSR? In both places?
The reason this isn't part of a standard validator is that it needs access to a database session.
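The check being discussed can be sketched with an in-memory set standing in for the database session (which is exactly why it can't live in a standard pydantic validator). All names here are illustrative, not the actual fides implementation.

```python
# Hypothetical stand-in for the data categories stored in the database.
KNOWN_CATEGORIES = {"user.contact.email", "user.name"}


def validate_data_categories(categories: list, known: set) -> None:
    """Reject any data category not present in the (stand-in) database."""
    unknown = sorted(set(categories) - set(known))
    if unknown:
        raise ValueError(f"Unknown data categories: {unknown}")
```

In the real endpoint the `known` argument would be replaced by a query against the db session, which is why this can't run inside a schema-level validator.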
My instinct would be on resource creation, since that's the closest layer to the db, so it'd catch the most things. But yes, the generic CRUD endpoint is nice code-cleanliness-wise, but makes changes like this more difficult. Maybe something to discuss with the team and make a follow-up ticket for?
It's tricky though because the data categories could change later! They may be valid at the time of creation and not at the time of use!
will make a ticket!
db=db, conditions=(DatasetConfig.connection_config_id == connection_config.id)
).order_by(DatasetConfig.created_at.desc())

return paginate(dataset_configs, params)
This is identical to get_datasets, except the response has both the DatasetConfig fides_key and the CTL Dataset fides_key, which can differ. The other parallel endpoint will eventually be deprecated.
def get_dataset_config(
    fides_key: FidesKey,
    db: Session = Depends(deps.get_db),
    connection_config: ConnectionConfig = Depends(_get_connection_config),
) -> DatasetConfig:
    """Returns the specific Dataset Config linked to the Connection Config."""
This is identical to get_dataset, except the response has both the DatasetConfig fides_key and the CTL Dataset fides_key, which can differ. The other parallel endpoint will eventually be deprecated.
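The shape difference being described can be sketched as a tiny response builder. This is an illustrative stand-in, not the actual fides response schema: it just shows that the DatasetConfig's own fides_key and the nested CTL Dataset's fides_key are independent values that may differ.

```python
def dataset_config_response(config_key: str, ctl_key: str) -> dict:
    """Build a minimal stand-in for the new endpoint's response body."""
    return {
        "fides_key": config_key,                # the DatasetConfig's own key
        "ctl_dataset": {"fides_key": ctl_key},  # the linked CTL Dataset's key
    }
```

The deprecated endpoint only exposed one of these keys, which is why the new response nests the full CTL Dataset.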
@@ -117,6 +113,7 @@ def upsert_ctl_dataset(ctl_dataset_obj: Optional[CtlDataset]) -> CtlDataset:
        fetched_ctl_dataset
    )  # Create/update existing ctl_dataset first
    data["ctl_dataset_id"] = ctl_dataset.id
    data.pop("dataset", None)
This method allows both the CTL Dataset and the DatasetConfig to be upserted. The dataset is passed in for the CTL Dataset, but we don't want this field when we update the DatasetConfig.
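The pop shown in the hunk above can be isolated as a small helper. This is an illustrative sketch, not the actual fides code: prepare_dataset_config_row is a hypothetical name for the step that builds the DatasetConfig row after the CTL Dataset has been upserted.

```python
def prepare_dataset_config_row(data: dict, ctl_dataset_id: str) -> dict:
    """Build the DatasetConfig row: keep the FK, drop the removed column."""
    row = dict(data)
    row["ctl_dataset_id"] = ctl_dataset_id  # link to the upserted CTL Dataset
    row.pop("dataset", None)  # "dataset" fed the CTL upsert; column is gone
    return row
```

The incoming "dataset" payload is still needed earlier to upsert the CTL Dataset; it just must not reach the DatasetConfig write now that the column no longer exists.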
excluded = dict(insert_stmt.excluded.items())  # type: ignore[attr-defined]
excluded.pop("id", None)  # If updating, don't update the "id"

result = await session.execute(
    insert_stmt.on_conflict_do_update(
        index_elements=["fides_key"],
-       set_=insert_stmt.excluded,
+       set_=excluded,
Allison tried to upsert a CTL Dataset that was already linked to a DatasetConfig and ran into:
fides-fides-1 | 2022-12-22 17:00:45.731 [DEBUG] (crud:upsert_resources:182): (sqlalchemy.dialects.postgresql.asyncpg.IntegrityError) <class 'asyncpg.exceptions.ForeignKeyViolationError'>: update or delete on table "ctl_datasets" violates foreign key constraint "datasetconfig_ctl_dataset_id_fkey" on table "datasetconfig"
fides-fides-1 | DETAIL: Key (id)=(ctl_8ca1f08e-53de-4954-a56a-73123145946e) is still referenced from table "datasetconfig".
I believe upserting was also trying to update the id of an existing resource, but we need this to stay consistent, especially since DatasetConfigs now have a FK to this table.
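The fix can be modeled with an in-memory stand-in for the Postgres `ON CONFLICT DO UPDATE`. This is an illustrative sketch, not the actual crud code: a dict keyed by fides_key plays the table, and the `excluded.pop("id", ...)` mirrors the diff above so the conflicting row keeps its original primary key.

```python
def upsert(table: dict, row: dict) -> dict:
    """In-memory sketch of upsert-by-fides_key that never rewrites "id"."""
    existing = table.get(row["fides_key"])
    if existing is None:
        # No conflict: insert the row as-is, id included.
        table[row["fides_key"]] = dict(row)
        return table[row["fides_key"]]
    excluded = dict(row)
    excluded.pop("id", None)  # keep the stable PK that DatasetConfig FKs reference
    existing.update(excluded)
    return existing
```

With the id preserved, the `datasetconfig_ctl_dataset_id_fkey` constraint from the log above is never violated by a re-upsert.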
such thorough tests!
Contains migration. Note: this PR is against the unified-fides-resources branch, not main.

Closes #1763

Code Changes
- Remove the "dataset" key from the data sent to the DatasetConfig.create_or_update call in the new patch_dataset_configs endpoint
- Leave the "dataset" key in the data sent to DatasetConfig.upsert_with_ctl_dataset, because this method currently upserts the CTL Dataset first and then the DatasetConfig. It is currently used in the old patch dataset config endpoints (json and yaml, soon to be deprecated) as well as when you're creating a connection config from a template.

Steps to Confirm

Pre-Merge Checklist
- CHANGELOG.md
Description Of Changes
Get rid of the DatasetConfig.dataset field.
The previous PR stopped reading from the DatasetConfig.dataset field and started writing to the new DatasetConfig.ctl_dataset resource.
Nothing should be using this field so let's get rid of it.
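The "contains migration" note above implies a schema migration dropping the column. A hypothetical Alembic fragment for that step might look like the following; the revision identifiers, and the column type used in the downgrade, are assumptions rather than the actual fides migration.

```
import sqlalchemy as sa
from alembic import op


def upgrade() -> None:
    # Nothing reads DatasetConfig.dataset anymore, so the column can go.
    op.drop_column("datasetconfig", "dataset")


def downgrade() -> None:
    # Column type is assumed; the original stored the dataset contents as JSON.
    op.add_column("datasetconfig", sa.Column("dataset", sa.JSON(), nullable=True))
```

Because the data now lives on the linked ctl_datasets row, the upgrade is a pure column drop with no data backfill.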