Detection metrics should only use statistically modeled columns (filter out the rest) #286

Closed
npatki opened this issue Dec 20, 2022 · 0 comments · Fixed by #525
Labels
feature request Request for a new feature
npatki commented Dec 20, 2022

Problem Description

The Detection metrics use machine learning to determine whether real data can be distinguished from synthetic data. For this to work, they should only use columns that are statistically modeled.

Expected behavior

When running any of the detection metrics, the following columns should be ignored:

  • Primary keys
  • Foreign keys (Edit: foreign keys do not need to be considered because Detection metrics are only implemented at the single-table level.)
  • Any other kinds of IDs
  • PII or sensitive data
  • Text data (or data created by RegEx)

None of these columns provide any useful information for detection.

The remaining data types are statistically modeled and should be included: numerical, datetime, categorical (non-PII), and boolean.
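
To illustrate the requested filtering, here is a minimal sketch of how the column selection could work. It is not the library's implementation: the helper name `select_modeled_columns`, the metadata layout (a `primary_key` entry plus a `columns` dict with an `sdtype` per column), and the `pii` flag are all assumptions made for this example.

```python
# Hypothetical helper, not part of the library: pick the columns a
# detection metric should train on, based on single-table metadata.

# The statistically modeled types listed in this issue.
MODELED_SDTYPES = {"numerical", "datetime", "categorical", "boolean"}


def select_modeled_columns(metadata):
    """Return the names of columns that should be kept for detection.

    Assumed metadata layout (an assumption for this sketch):
    {"primary_key": "id", "columns": {"age": {"sdtype": "numerical"}, ...}}
    PII columns are assumed to carry an explicit "pii": True flag.
    """
    primary_key = metadata.get("primary_key")
    selected = []
    for name, info in metadata.get("columns", {}).items():
        if name == primary_key:
            continue  # skip the primary key
        if info.get("pii", False):
            continue  # skip PII or sensitive columns
        if info.get("sdtype") in MODELED_SDTYPES:
            selected.append(name)
        # Any other type (ids, text, regex-generated values) is skipped.
    return selected


example_metadata = {
    "primary_key": "user_id",
    "columns": {
        "user_id": {"sdtype": "id"},
        "age": {"sdtype": "numerical"},
        "signup_date": {"sdtype": "datetime"},
        "plan": {"sdtype": "categorical"},
        "is_active": {"sdtype": "boolean"},
        "email": {"sdtype": "categorical", "pii": True},
        "notes": {"sdtype": "text"},
    },
}

modeled = select_modeled_columns(example_metadata)
```

The same list of names could then be used to subset both the real and the synthetic table before they are passed to a detection metric, so the classifier only sees modeled columns.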

Additional context

We already filtered out primary keys in #119. The issue of foreign keys is discussed in #285.


3 participants