
Remove DatasetConfig.dataset field + New Get DatasetConfig Endpoints [#1763] #2096

Conversation

@pattisdr pattisdr commented Dec 20, 2022

❗ Contains migration
👉 This PR is against the unified-fides-resources branch, not main.

Closes #1763

Code Changes

  • Add a data migration to delete the DatasetConfig.dataset column. It should have been previously migrated to a ctl dataset resource
  • Remove the "dataset" key from the data sent to DatasetConfig.create_or_update call in the new patch_dataset_configs endpoint
    • Leave the dataset key in the data sent to DatasetConfig.upsert_with_ctl_dataset because this method currently upserts both the CTL Dataset first and then the DatasetConfig. This method is currently used in the old patch dataset config endpoints (json and yaml, soon to be deprecated) as well as when you're creating a connection config from a template.
  • Update fixtures to not pass in a "dataset" key to data passed to DatasetConfig.create
  • Throw a 404 if the ctl_dataset_id does not exist when creating a DatasetConfig through the new patch_dataset_configs endpoint
  • Also add validation when linking an existing CTL Dataset to a DatasetConfig: the existing dataset's data categories must exist in the db.
  • Add a fix to prevent the existing upsert from attempting to update the id of an existing resource. Datasets are referenced by DatasetConfigs now, so their ids need to stay the same.
  • Add new GET datasetconfig list and detail endpoints that include both the fides_key and nested ctl_dataset. The DatasetConfig and CTLDataset are different resources and their fides_keys can differ, so both keys need to be in the response.
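
For context, a response model along these lines would carry both keys. This is only a rough sketch; the class name, comments, and config are assumptions rather than the code in this PR:

```python
from fideslang.models import Dataset
from fideslang.validation import FidesKey
from pydantic import BaseModel


class DatasetConfigSchemaWithDataset(BaseModel):
    """Hypothetical response model: exposes the DatasetConfig's own fides_key
    plus the entire nested CTL Dataset, whose fides_key may differ."""

    fides_key: FidesKey   # fides_key of the DatasetConfig
    ctl_dataset: Dataset  # the linked CTL Dataset resource, with its own fides_key

    class Config:
        orm_mode = True   # allow building the response directly from the ORM object
```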

Steps to Confirm

  • list any manual steps for reviewers to confirm the changes

Pre-Merge Checklist

Description Of Changes

Get rid of the DatasetConfig.dataset field.

The previous PR stopped reading from DatasetConfig.dataset field and started writing to the new DatasetConfig.ctl_dataset resource.

Nothing should be using this field so let's get rid of it.

- Add a data migration to delete the column (see the sketch after this list)
- Remove the "dataset" key from the data sent to DatasetConfig.create_or_update in the patch_dataset_configs endpoint
- Leave the "dataset" key in the data sent to DatasetConfig.upsert_with_ctl_dataset because this method currently upserts both the CTL Dataset first and then the DatasetConfig.  This method is currently used in the old patch dataset config endpoints (soon to be deprecated) as well as when you're creating a connection config from a template.
- Update a lot of fixtures to not pass in a "dataset" key to DatasetConfig.create
- Throw a 404 if the ctl_dataset_id does not exist when creating a DatasetConfig through the new patch_dataset_configs endpoint
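
For reference, the migration mentioned in the first bullet would look roughly like the sketch below (revision ids, the column type, and the downgrade behavior are placeholders, not the actual migration):

```python
"""Sketch of a migration dropping the legacy DatasetConfig.dataset column."""
import sqlalchemy as sa
from alembic import op

# Placeholder revision identifiers, not the real ones.
revision = "xxxxxxxxxxxx"
down_revision = "yyyyyyyyyyyy"


def upgrade():
    # The contents were already migrated to ctl_datasets in the previous PR,
    # so the column can simply be dropped.
    op.drop_column("datasetconfig", "dataset")


def downgrade():
    # Recreate the column as nullable JSON (type assumed); the original data
    # is not restored.
    op.add_column("datasetconfig", sa.Column("dataset", sa.JSON(), nullable=True))
```
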
@pattisdr pattisdr self-assigned this Dec 20, 2022
@pattisdr pattisdr added the "Unified Fides Resources" and "run unsafe ci checks" (runs fides-related CI checks that require sensitive credentials) labels Dec 20, 2022
@pattisdr pattisdr force-pushed the fidesops_1763_stop_storing_datasetconfig_dataset branch from 8dfc7e4 to 3b093f2 on December 21, 2022 at 00:53
Commits (messages truncated):
  • …instead of a Datasets tag to not conflict with CTL Datasets. Dataset Configs link a Connection Config to a CTL Dataset.
  • …a datasetconfig. Data categories must exist in the db.
  • …e fides_key of the DatasetConfig as well as the entire nested CTL Dataset which has the fides_key of the separate CTL Dataset resource.
@pattisdr pattisdr changed the title from "Remove DatasetConfig.dataset field [#1763]" to "Remove DatasetConfig.dataset field + New Get DatasetConfig Endpoints [#1763]" Dec 21, 2022
```python
)

try:
    fetched_dataset: Dataset = Dataset.from_orm(ctl_dataset)
```

@pattisdr pattisdr commented Dec 22, 2022

CTL Datasets can currently be created without an organization_fides_key or a data_qualifier - so customers may have existing database records where this is the case. However, these are not optional fields on the Fideslang Dataset schema.

Trying to parse an existing CTL dataset with these issues would throw a 500, so I catch it and throw a 422 instead. The issue would need to be corrected through the CTL dataset endpoint.


We should separately look into using the Fideslang Dataset schema to validate when CTL datasets are created, or alternatively allow these fields to be optional on the Fideslang Dataset. Allison's noticed this too #2113 (comment)
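
Pieced together from the snippets in this thread, the validation in the new patch_dataset_configs endpoint is roughly the following shape (a sketch only; the helper name and signature are hypothetical, and the lookup of the CtlDataset row by ctl_dataset_id happens before this):

```python
from typing import Any, Optional

from fastapi import HTTPException
from fideslang.models import Dataset
from pydantic import ValidationError
from starlette.status import HTTP_404_NOT_FOUND, HTTP_422_UNPROCESSABLE_ENTITY


def validate_ctl_dataset(ctl_dataset: Optional[Any], ctl_dataset_id: str) -> Dataset:
    """Hypothetical helper: 404 if the referenced CTL dataset doesn't exist,
    422 if its stored data can't be parsed into the Fideslang Dataset schema."""
    if ctl_dataset is None:
        # ctl_dataset_id did not match any existing CTL Dataset
        raise HTTPException(
            status_code=HTTP_404_NOT_FOUND,
            detail=f"No CTL dataset found with id '{ctl_dataset_id}'",
        )
    try:
        fetched_dataset: Dataset = Dataset.from_orm(ctl_dataset)
    except ValidationError as e:
        # Existing records may be missing organization_fides_key or
        # data_qualifier; surface that as a 422 instead of a 500.
        raise HTTPException(
            status_code=HTTP_422_UNPROCESSABLE_ENTITY, detail=e.errors()
        )
    return fetched_dataset
```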

A contributor commented:
should we make a 🎫 for this issue?

@pattisdr (author) replied:
yep will do!


```python
        raise HTTPException(
            status_code=HTTP_422_UNPROCESSABLE_ENTITY, detail=e.errors()
        )
    validate_data_categories(fetched_dataset, db)
```

@pattisdr pattisdr commented Dec 22, 2022

Since we have to fetch the dataset anyway to validate for SaaS connectors, I'm going ahead and checking that its data categories are also in the database, like we do on the existing dataset config endpoints. If data categories are invalid, they need to be addressed on the ctl dataset.

It's easier to add this check here than to a generic crud endpoint on the ctl side, but I'm not sure how I feel about it. It's more of a meta question about where the best place is to validate that data categories exist: on resource creation? When we use the resource, like in a DSR? In both places?

The reason this isn't part of a standard validator is that it needs access to a database session.
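
For context, the check is conceptually something like the sketch below; the DataCategory model name is a stand-in for whatever SQLAlchemy model stores the defined categories, and the traversal of nested fields is simplified:

```python
from fastapi import HTTPException
from fideslang.models import Dataset
from sqlalchemy.orm import Session
from starlette.status import HTTP_422_UNPROCESSABLE_ENTITY


def validate_data_categories(dataset: Dataset, db: Session) -> None:
    """Sketch: every data category referenced by the dataset must already be
    defined in the database. DataCategory is a hypothetical stand-in model."""
    referenced = {
        category
        for collection in dataset.collections
        for field in collection.fields
        for category in (field.data_categories or [])
    }
    defined = {key for (key,) in db.query(DataCategory.fides_key).all()}
    unknown = referenced - defined
    if unknown:
        raise HTTPException(
            status_code=HTTP_422_UNPROCESSABLE_ENTITY,
            detail=f"Unknown data categories referenced by the dataset: {sorted(unknown)}",
        )
```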

A contributor replied:
my instinct would be on resource creation since that's the closest layer to the db so it'd catch the most things. but yes, the generic CRUD endpoint is nice code cleanliness wise, but makes changes like this more difficult. maybe something to discuss with the team and make a follow up ticket for?

@pattisdr (author) replied:
It's tricky though because the data categories could change later! They may be valid at the time of creation and not at the time of use!

@pattisdr (author) replied:
will make a ticket!


```python
        db=db, conditions=(DatasetConfig.connection_config_id == connection_config.id)
    ).order_by(DatasetConfig.created_at.desc())

    return paginate(dataset_configs, params)
```

@pattisdr (author) commented:

This is identical to get_datasets except the response has both the DatasetConfig fides_key and the CTL Dataset fides_key, which can differ. The other parallel endpoint will eventually be deprecated.
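
To illustrate, one item in the new list response might look like this (values are made up; only the relevant fields are shown):

```python
# Illustrative shape of one item in the paginated response:
example_item = {
    "fides_key": "app_postgres_config",        # fides_key of the DatasetConfig
    "ctl_dataset": {
        "fides_key": "app_postgres_dataset",   # fides_key of the CTL Dataset; can differ
        "name": "App Postgres Dataset",
        "collections": [],                     # remaining fideslang Dataset fields omitted
    },
}
```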

Comment on lines +514 to +519
```python
def get_dataset_config(
    fides_key: FidesKey,
    db: Session = Depends(deps.get_db),
    connection_config: ConnectionConfig = Depends(_get_connection_config),
) -> DatasetConfig:
    """Returns the specific Dataset Config linked to the Connection Config."""
```

@pattisdr (author) commented:
This is identical to get_dataset except the response has both the DatasetConfig fides_key and the CTL Dataset fides_key, which can differ. The other parallel endpoint will eventually be deprecated.

```diff
@@ -117,6 +113,7 @@ def upsert_ctl_dataset(ctl_dataset_obj: Optional[CtlDataset]) -> CtlDataset:
         fetched_ctl_dataset
     )  # Create/update existing ctl_dataset first
     data["ctl_dataset_id"] = ctl_dataset.id
+    data.pop("dataset", None)
```

@pattisdr (author) commented:
This method allows both the CTL Dataset and the DatasetConfig to be upserted. The dataset is passed in for the CTL Dataset, but we don't want this field when we update the DatasetConfig.

@pattisdr pattisdr marked this pull request as ready for review December 22, 2022 03:43
Comment on lines +166 to +172
```diff
+        excluded = dict(insert_stmt.excluded.items())  # type: ignore[attr-defined]
+        excluded.pop("id", None)  # If updating, don't update the "id"

         result = await session.execute(
             insert_stmt.on_conflict_do_update(
                 index_elements=["fides_key"],
-                set_=insert_stmt.excluded,
+                set_=excluded,
```

@pattisdr (author) commented:
Allison tried to upsert a CTL Dataset that was already linked to a DatasetConfig and hit:

```
fides-fides-1  | 2022-12-22 17:00:45.731 [DEBUG] (crud:upsert_resources:182): (sqlalchemy.dialects.postgresql.asyncpg.IntegrityError) <class 'asyncpg.exceptions.ForeignKeyViolationError'>: update or delete on table "ctl_datasets" violates foreign key constraint "datasetconfig_ctl_dataset_id_fkey" on table "datasetconfig"
fides-fides-1  | DETAIL:  Key (id)=(ctl_8ca1f08e-53de-4954-a56a-73123145946e) is still referenced from table "datasetconfig".
```

I believe the upsert was also trying to update the id of an existing resource, but we need it to stay consistent, especially since DatasetConfigs now have a FK to this table.

@allisonking allisonking left a comment
such thorough tests! 🤩

@pattisdr pattisdr merged commit e6ad6b3 into unified-fides-resources Dec 23, 2022
@pattisdr pattisdr deleted the fidesops_1763_stop_storing_datasetconfig_dataset branch December 23, 2022 21:31
@pattisdr pattisdr linked an issue Dec 23, 2022 that may be closed by this pull request
@pattisdr pattisdr mentioned this pull request Jan 17, 2023

Successfully merging this pull request may close these issues.

Remove dataset yaml storing code from datasetconfig endpoint