Metadata auto-detection should find primary keys of semantic sdtypes #1724

Closed
npatki opened this issue Dec 19, 2023 · 0 comments · Fixed by #1731
npatki commented Dec 19, 2023

Problem Description

As of SDV v1.8.0, the metadata auto-detection can identify a wide variety of sdtypes:

  • statistical columns such as 'numerical', 'datetime', 'categorical', 'boolean'
  • semantic concepts such as 'email', 'phone_number', 'latitude', etc.
  • structured identifiers, 'id'

However, when detecting a primary key, it only considers columns that are sdtype 'id'. In reality, semantic columns such as 'email' or 'phone_number' may also be primary keys and should be considered as possibilities.
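To make the request concrete, here is a minimal sketch of the kind of check being asked for. It is not SDV's implementation: the function name, the set of semantic sdtypes, and the toy data are all hypothetical, and the uniqueness and no-missing-values rule is an assumption about what makes a column a reasonable primary key candidate.

```python
# Hypothetical sketch; these names are illustrative and not part of SDV's API.

# Assumed subset of semantic sdtypes that may act as primary keys.
SEMANTIC_KEY_SDTYPES = {'email', 'phone_number'}


def find_primary_key_candidates(columns, sdtypes):
    """Return the names of columns that could serve as a primary key.

    columns: dict mapping column name -> list of values
    sdtypes: dict mapping column name -> detected sdtype string
    """
    candidates = []
    for name, values in columns.items():
        sdtype = sdtypes.get(name)
        # Consider 'id' columns (current behavior) plus semantic columns (requested).
        if sdtype != 'id' and sdtype not in SEMANTIC_KEY_SDTYPES:
            continue
        has_missing = any(value is None for value in values)
        is_unique = len(set(values)) == len(values)
        if not has_missing and is_unique:
            candidates.append(name)
    return candidates


# Toy data, not taken from the demo dataset.
columns = {
    'guest_email': ['a@example.com', 'b@example.com', 'c@example.com'],
    'room_type': ['BASIC', 'BASIC', 'DELUXE'],
    'contact_phone': ['555-0100', '555-0100', '555-0101'],
}
sdtypes = {
    'guest_email': 'email',
    'room_type': 'categorical',
    'contact_phone': 'phone_number',
}

print(find_primary_key_candidates(columns, sdtypes))  # ['guest_email']
```

In this toy example 'contact_phone' is a semantic column but contains a repeated value, so only 'guest_email' qualifies.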

Expected behavior

Consider the default demo dataset. The first column, 'guest_email', is the primary key. The metadata should continue to detect it as sdtype 'email', and it should also mark it as the primary key.

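from sdv.datasets.demo import download_demo
from sdv.metadata import SingleTableMetadata

# Download the demo table; the demo's bundled metadata is discarded here.
data, _ = download_demo(
    modality='single_table',
    dataset_name='fake_hotel_guests'
)

# Auto-detect the metadata from the data alone.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data)

# Desired result: 'guest_email' keeps sdtype 'email' and becomes the primary key.
print(metadata.columns['guest_email'])
print(metadata.primary_key)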

Additional context

Note that if the column were named 'guest_id' and contained random numeric IDs, then the metadata auto-detection would correctly identify the sdtype as 'id' and would mark it as the primary key.

@npatki npatki added the feature request Request for a new feature label Dec 19, 2023
@npatki npatki changed the title Metadata auto-detection should find primary keys of any sdtype Metadata auto-detection should find primary keys of semantic sdtypes Dec 19, 2023
@npatki npatki added the feature:metadata Related to describing the dataset label Dec 19, 2023
@amontanez24 amontanez24 added this to the 1.9.0 milestone Jan 10, 2024