Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Does removing foreign keys in detection metrics for multi-tables make sense? #285

Closed
mohamedg-croesus opened this issue Dec 19, 2022 · 2 comments
Labels
question General question about the software resolution:duplicate This issue or pull request already exists

Comments

@mohamedg-croesus
Copy link

Environment details

If you are already running SDMetrics, please indicate the following details about the environment in
which you are running it:

  • SDMetrics version: 0.8.2
  • Python version:
  • Operating System:

Problem description

Nice correction for DetectionMetric (Solving primary_key use for detection metrics #251, #251)
Removing the primary key from the table makes more sense for the evaluation. But, what about foreign keys present in a table of a relational database. Does not the same problem also occur for the foreign keys too and they need to be deleted? Also, for the parent-child logistic detection metric, the foreign keys (referencing parent rows) in the child tables are no longer required when the denormalized tables are used.

@mohamedg-croesus mohamedg-croesus added new Label applied to new issues question General question about the software labels Dec 19, 2022
@npatki
Copy link
Contributor

npatki commented Dec 20, 2022

Hi @mohamedgy thanks for filing! I want to confirm that the SDMetrics version you tested on is 0.8.1. (Currently it is listed as 0.8.2, which doesn't exist -- yet!)

I agree that it makes sense to remove foreign keys. In fact, it makes sense to remove any column that does not contribute to the statistical properties of the data:

  • Primary keys
  • Foreign keys
  • Any other kinds of IDs or keys
  • PII columns
  • Text columns

I have filed issue #286 to track this broader issue.

As a temporary workaround, you can drop any of these columns before applying the detection metric.

Also, for the parent-child logistic detection metric, the foreign keys (referencing parent rows) in the child tables are no longer required when the denormalized tables are used.

Can you explain this a bit more? If you have a denormalized table, then there should not longer be any foreign key column, so you can use the current implementation as-is.

@npatki npatki added under discussion Issue is currently being discussed and removed new Label applied to new issues labels Dec 20, 2022
@npatki
Copy link
Contributor

npatki commented Jan 3, 2023

Hi @mohamedgy I see you filed #290 with some details about parent-child denormalization. Let's close this issue off in favor of #290 (for parent-child denormalization) and #286 (for removing unnecessary columns).

Please feel free to reply if there's anything more to add. I can always reopen the issue for continued investigation.

@npatki npatki closed this as completed Jan 3, 2023
@npatki npatki added resolution:duplicate This issue or pull request already exists and removed under discussion Issue is currently being discussed labels Jan 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question General question about the software resolution:duplicate This issue or pull request already exists
Projects
None yet
Development

No branches or pull requests

2 participants