Add set_address_columns method #1593

Closed
amontanez24 opened this issue Sep 20, 2023 · 0 comments · Fixed by #1607 or #1643
Labels: feature request (Request for a new feature)

amontanez24 commented Sep 20, 2023

Problem Description

As a user, I'd like a way to specify that multiple columns in my dataset together make up a single address. This way, the synthetic data preserves the relationship between those columns and produces a valid address.

Expected behavior

In the BaseSynthesizer and BaseMultiTableSynthesizer classes, add a method called set_address_columns

  • This method should be used to set the transformer to use on the columns provided.

Parameters

  • (required, multi table only) table_name: String that is the name of the table. This must be one of the tables specified in the metadata.
  • (required) column_names: A list of one or more column names. These must be specified in the metadata for the table.
  • (optional) anonymization_level: String that is the type of anonymization the user wants to see. Must be one of:
    • (default) 'full': Anonymize all components of the address (no learning required)
    • 'street_address': Anonymize only the precise street address (and secondary address). Learn everything else.

Behavior

Follow the logic below for transformer assignment

anonymization_level    Assigned Transformer
'full'                 RandomLocationGenerator(locales=<synthesizer locales>)
'street_address'       RegionalAnonymizer(locales=<synthesizer locales>)
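
The assignment logic could be sketched roughly as follows. This is only an illustration: the transformer names and the level names ('full', 'street_address') are taken from this issue, while `LEVEL_TO_TRANSFORMER` and `choose_address_transformer` are hypothetical names, and the transformers are represented by their names rather than real classes.

```python
# Transformer names come from this issue; everything else here is hypothetical.
LEVEL_TO_TRANSFORMER = {
    'full': 'RandomLocationGenerator',
    'street_address': 'RegionalAnonymizer',
}


def choose_address_transformer(anonymization_level='full', locales=None):
    """Return (transformer_name, kwargs) for the requested anonymization level."""
    if anonymization_level not in LEVEL_TO_TRANSFORMER:
        valid = ', '.join(repr(level) for level in LEVEL_TO_TRANSFORMER)
        raise ValueError(
            f'Invalid anonymization_level {anonymization_level!r}. '
            f'Must be one of: {valid}.'
        )
    # The synthesizer's own locales are forwarded to whichever transformer is chosen.
    return LEVEL_TO_TRANSFORMER[anonymization_level], {'locales': locales}
```

Keeping the mapping in a single dictionary means an unsupported level fails in one place, before any transformer is assigned.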

Validation

  • (multi table only) The table_name must be found in the metadata
    Error: Unknown table name 'userss'. Please choose a table name from the metadata.
  • The column_names must be found in the metadata for the table
    Error: Unknown column names ('A', 'B', 'C'). Please choose column names listed in the metadata for your table.
  • The sdtypes for the columns must be compatible with the address transformer. That is, they must be one of: 'country', 'country_code', 'administrative_unit', etc.
    Error: Column 'city_name' has invalid sdtype 'categorical'. Please provide a column that is compatible with address data.
  • You cannot have 2 or more of the same sdtype within an address.
    Error: Columns 'state_name' and 'state' have the same sdtype 'administrative_unit'. Your address data cannot have duplicate fields.
  • If the user has already fit the data, show a warning. The user will need to re-fit in order to get this to work.
    Warning: Please refit your synthesizer for the address changes to appear in your synthetic data.

Example

# `metadata` must already describe the 'users' table and the columns listed below
synthesizer = HSASynthesizer(metadata)
synthesizer.set_address_columns(
    table_name='users',
    column_names=['line1', 'line2', 'city', 'state', 'zip_code', 'country'],
    anonymization_level='full'
)
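
The rules in the Validation section could be sketched as below. This is a minimal illustration, not the SDV implementation: it assumes the metadata has been reduced to plain dictionaries, the function names are hypothetical, and `ADDRESS_SDTYPES` contains only the three sdtypes this issue names before "etc.", so it is deliberately incomplete. The error and warning messages are the ones proposed above.

```python
import warnings

# Assumption: only the sdtypes named in this issue; the real list is longer ("etc.").
ADDRESS_SDTYPES = {'country', 'country_code', 'administrative_unit'}


def validate_table_name(tables, table_name):
    """Multi table only: `tables` maps table name -> {column_name: sdtype}."""
    if table_name not in tables:
        raise ValueError(
            f"Unknown table name '{table_name}'. Please choose a table name "
            'from the metadata.'
        )


def validate_address_columns(columns, column_names, fitted=False):
    """Check `column_names` against a {column_name: sdtype} mapping for one table."""
    unknown = [name for name in column_names if name not in columns]
    if unknown:
        listed = ', '.join(repr(name) for name in unknown)
        raise ValueError(
            f'Unknown column names ({listed}). Please choose column names '
            'listed in the metadata for your table.'
        )

    seen = {}  # sdtype -> first column that used it
    for name in column_names:
        sdtype = columns[name]
        if sdtype not in ADDRESS_SDTYPES:
            raise ValueError(
                f"Column '{name}' has invalid sdtype '{sdtype}'. Please provide "
                'a column that is compatible with address data.'
            )
        if sdtype in seen:
            raise ValueError(
                f"Columns '{seen[sdtype]}' and '{name}' have the same sdtype "
                f"'{sdtype}'. Your address data cannot have duplicate fields."
            )
        seen[sdtype] = name

    if fitted:
        # Not an error: the call succeeds, but the user has to re-fit.
        warnings.warn(
            'Please refit your synthesizer for the address changes to appear '
            'in your synthetic data.'
        )
```

Unknown columns are checked first so that the sdtype lookup never hits a missing key, and the refit warning is emitted only after every check has passed.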

Additional context

  • This method is meant to replace the update_transformers method for address columns. The transformer assignment will happen after the HyperTransformer config is created.
  • We will want to keep track of columns that are treated as an address, so we may need to store that information in a list/dict somewhere. This will help for error handling when users call update_transformers on one of these columns or try to add them to constraints.
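
One possible shape for that bookkeeping is sketched below. The class, its method names, and the error wording are all hypothetical; the issue only says the information may need to be stored "in a list/dict somewhere".

```python
class AddressColumnRegistry:
    """Hypothetical record of which columns have been registered as an address."""

    def __init__(self):
        # (table_name, column_name) -> anonymization_level
        self._levels = {}

    def register(self, table_name, column_names, anonymization_level='full'):
        """Remember the columns passed to set_address_columns."""
        for name in column_names:
            self._levels[(table_name, name)] = anonymization_level

    def check_not_address(self, table_name, column_names, action):
        """Raise if any of the columns is already part of an address."""
        taken = [name for name in column_names if (table_name, name) in self._levels]
        if taken:
            listed = ', '.join(repr(name) for name in taken)
            # Hypothetical message; the issue does not specify the wording.
            raise ValueError(
                f'Cannot {action} for columns ({listed}) because they are '
                'part of an address.'
            )
```

A synthesizer could call `check_not_address` at the start of update_transformers and when constraints are added; for a single-table synthesizer, `table_name` could simply be `None`.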