HMASynthesizer generates IDs in order (even though real data has random IDs) #1596

Closed
majidliaquat opened this issue Sep 20, 2023 · 8 comments
Labels
feature request Request for a new feature resolution:duplicate This issue or pull request already exists

Comments

@majidliaquat

I am using HMASynthesizer for multi-table data and it runs fine, but the synthetic data it generates does not resemble the real data. I explain the issue in detail below.

Issue with generated data

I generated the data after adjusting the table parameters (shown further below). You can refer to the image:

multiTableData

The parent table (real data) has a shape of (4500, 4). In the real dataset, account_id contains random values, such as the one highlighted in row 51.

The generated data has the same shape of (4500, 4), but its account_id values run from 0 to 4,499, identical to the index (highlighted with the red and orange boxes).

I want the generated IDs to be random, like those in the real data. Here are the table parameters I set:

synthesizer.set_table_parameters(
    table_name='parent table',
    table_parameters={
        "enforce_min_max_values": True,
        "enforce_rounding": True,
        "locales": ["en_GB"],
        "numerical_distributions": {
            "account_id": "norm"
        },
        "default_distribution": "uniform"
    }
)

I checked the documentation on the website but couldn't find anything related to this issue.

Thanks in advance
Many Thanks,
Majid

@majidliaquat majidliaquat added bug Something isn't working new Automatic label applied to new issues labels Sep 20, 2023
@npatki
Contributor

npatki commented Sep 20, 2023

Hi @majidliaquat,

Thanks for filing this issue with this example. I'm going to re-title this issue to be more precise about the nature of the problem.

In the SDV, we consider the purpose of ID columns to be unique identifiers (in the case of a primary key), or part of a relationship (in the case of a foreign key). So a lot of our users have not found it necessary to follow any specific patterns for ID columns. I think we can mark this as a new feature request, since the SDV itself is working as intended.

To help us prioritize, it would be great to provide a bit more information about your use case. Why is having random IDs a requirement for your project? Are you planning to use the synthetic data in a way that requires this randomness?

If I can understand the nature of this concern, then I can try to suggest other workarounds for you too. Thanks.

@npatki npatki changed the title Issue with generated Data on HMASynthesizer HMASynthesizer generates IDs in order (even though real data has random IDs) Sep 20, 2023
@npatki npatki added feature request Request for a new feature under discussion Issue is currently being discussed and removed bug Something isn't working new Automatic label applied to new issues labels Sep 20, 2023
@majidliaquat
Author

Hi @npatki,

Thank you for your response. The real table had some missing values, so I dropped those rows, which leaves the IDs non-sequential. I need random IDs because in some tables those IDs represent a place name in another table.

So I was wondering whether we could generate random IDs. That would make the generated table more realistic and might also improve the quality score.

I was thinking we could pass parameters through set_table_parameters to generate random numbers for a specific ID column:

synthesizer.set_table_parameters(
    table_name='parent table',
    table_parameters={
        "numerical_distributions": {
            "account_id": "norm",
            "min_id_value": 0,
            "max_id_value": 1000
        },
        "default_distribution": "uniform"
    }
)

This may not be the right way to pass them; I haven't tested it yet.

Thanks

@npatki
Contributor

npatki commented Sep 25, 2023

Hi @majidliaquat, the numerical_distributions parameter does not support min or max values.

I think if your IDs specifically refer to a different concept or table, then it would be best to mark them as "categorical" in the metadata. That way, the SDV will learn correlations between those IDs and other columns. It will then recreate the same IDs in the output. This blog post has more information about when you should consider something categorical.
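As a rough illustration of the suggestion above, the sketch below flips one column's sdtype to "categorical" in a metadata dictionary shaped like the SDV JSON format. It works on a plain dict only; the helper name `mark_as_categorical` and the small `accounts` table are hypothetical, and the rule that a primary key should stay an ID is an assumption of this sketch. With the SDV itself you would normally make this change through the metadata object rather than by editing the dict.

```python
import copy
import json


def mark_as_categorical(metadata, table_name, column_name):
    """Return a copy of the metadata dict with one column set to 'categorical'.

    Hypothetical helper: it edits a plain dict, not an SDV metadata object.
    """
    updated = copy.deepcopy(metadata)
    table = updated["tables"][table_name]
    # Assumption of this sketch: a primary key should remain an ID column.
    if table.get("primary_key") == column_name:
        raise ValueError(
            f"{column_name!r} is the primary key of {table_name!r}; leave it as an id"
        )
    table["columns"][column_name] = {"sdtype": "categorical"}
    return updated


# Hypothetical excerpt with a single table.
metadata = {
    "tables": {
        "accounts": {
            "columns": {
                "account_id": {"sdtype": "id"},
                "district_id": {"sdtype": "id"},
            },
            "primary_key": "account_id",
        }
    },
    "relationships": [],
}

updated = mark_as_categorical(metadata, "accounts", "district_id")
print(json.dumps(updated["tables"]["accounts"]["columns"], indent=2))
```

The original dict is left untouched, so the two versions can be compared before the updated one is saved.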

It would be very helpful if you could provide the metadata file so we could assist you further.

@majidliaquat
Author

Hi @npatki,

Here is the metadata with parent with two child tables:

{
  "tables": {
    "Table 1 account.csv": {
      "columns": {
        "account_id": {
          "sdtype": "id"
        },
        "district_id": {
          "sdtype": "id"
        },
        "frequency": {
          "sdtype": "categorical"
        },
        "date": {
          "sdtype": "numerical"
        }
      },
      "primary_key": "account_id"
    },
    "Table 2 loan.csv": {
      "columns": {
        "loan_id": {
          "sdtype": "id"
        },
        "account_id": {
          "sdtype": "id"
        },
        "date": {
          "sdtype": "numerical"
        },
        "amount": {
          "sdtype": "numerical"
        },
        "duration": {
          "sdtype": "numerical"
        },
        "payments": {
          "sdtype": "numerical"
        },
        "status": {
          "sdtype": "categorical"
        }
      },
      "primary_key": "loan_id"
    },
    "Table 3 order.csv": {
      "columns": {
        "order_id": {
          "sdtype": "id"
        },
        "account_id": {
          "sdtype": "id"
        },
        "bank_to": {
          "sdtype": "categorical"
        },
        "account_to": {
          "sdtype": "numerical"
        },
        "amount": {
          "sdtype": "numerical"
        },
        "k_symbol": {
          "sdtype": "categorical"
        }
      },
      "primary_key": "order_id"
    }
  },
  "relationships": [
    {
      "parent_table_name": "Table 1 account.csv",
      "child_table_name": "Table 2 loan.csv",
      "parent_primary_key": "account_id",
      "child_foreign_key": "account_id"
    },
    {
      "parent_table_name": "Table 1 account.csv",
      "child_table_name": "Table 3 order.csv",
      "parent_primary_key": "account_id",
      "child_foreign_key": "account_id"
    }
  ],
  "METADATA_SPEC_VERSION": "MULTI_TABLE_V1"
}
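A quick way to sanity-check a metadata dictionary like the one above is to confirm that every relationship points at the parent's declared primary key and at a column that exists in the child table. The following is a minimal sketch in plain Python, not SDV's own validation; the function name `check_relationships` is hypothetical, the example metadata is abbreviated, and the sdtype comparison is an assumption of the sketch.

```python
def check_relationships(metadata):
    """Return a list of human-readable problems found in the relationships."""
    problems = []
    tables = metadata["tables"]
    for rel in metadata.get("relationships", []):
        parent = rel["parent_table_name"]
        child = rel["child_table_name"]
        pk = rel["parent_primary_key"]
        fk = rel["child_foreign_key"]
        if parent not in tables or child not in tables:
            problems.append(f"unknown table in relationship {parent!r} -> {child!r}")
            continue
        if tables[parent].get("primary_key") != pk:
            problems.append(f"{pk!r} is not the primary key of {parent!r}")
        child_columns = tables[child]["columns"]
        if fk not in child_columns:
            problems.append(f"{fk!r} is not a column of {child!r}")
            continue
        # Assumption: both ends of a relationship should share the same sdtype.
        parent_sdtype = tables[parent]["columns"].get(pk, {}).get("sdtype")
        if child_columns[fk].get("sdtype") != parent_sdtype:
            problems.append(f"sdtype mismatch between {parent}.{pk} and {child}.{fk}")
    return problems


# Abbreviated version of the metadata above (only the key columns).
metadata = {
    "tables": {
        "Table 1 account.csv": {
            "columns": {"account_id": {"sdtype": "id"}, "district_id": {"sdtype": "id"}},
            "primary_key": "account_id",
        },
        "Table 2 loan.csv": {
            "columns": {"loan_id": {"sdtype": "id"}, "account_id": {"sdtype": "id"}},
            "primary_key": "loan_id",
        },
    },
    "relationships": [
        {
            "parent_table_name": "Table 1 account.csv",
            "child_table_name": "Table 2 loan.csv",
            "parent_primary_key": "account_id",
            "child_foreign_key": "account_id",
        }
    ],
}

print(check_relationships(metadata))
```

In this abbreviated metadata, district_id appears only as a plain ID column with no relationship attached, so the check has nothing to say about it.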

I will try what you are suggesting after reading the blog post.

Many Thanks.

@npatki
Contributor

npatki commented Sep 27, 2023

Hi @majidliaquat, thanks for providing the metadata. I have a follow-up question. Earlier you mentioned:

I need random IDs because in some tables those IDs represent a place name in another table.

In the example, you want random IDs for the account_id column (in the accounts table). But in the metadata, account_id is listed as a primary key, so I am a bit confused.

Generally, a primary key is a unique identifier for the table it's defined in. So how can it be a primary key and also represent a place name in another table? Could you explain that a bit more?

@majidliaquat
Author

majidliaquat commented Sep 29, 2023

Hi @npatki,

Yes, the metadata I provided does not cover the case I mentioned before. I have two tables: one contains district names and IDs, and the other contains only the district ID (as a foreign key). That is why I asked for random IDs within a range.

I am using the dataset: 1999 Czech Financial Dataset - Real Anonymized Transactions

Here is the data map:
image

Many Thanks.

@npatki
Contributor

npatki commented Oct 3, 2023

Hi @majidliaquat, I am still a bit confused. Would it be possible to provide the metadata for this schema and the relevant column(s) that are giving you problems? Is district_id the primary key of Demograph?

When the SDV generates multi-table data, it will guarantee the following:

  1. Primary keys of a table will always be a unique identifier (it will follow the Regex format you can optionally provide), and
  2. Foreign keys will always point to primary keys in a different table. That is: The references will be consistent.

So if district_id is a foreign key in a table, then the values will be pointing to the primary key of a different table.
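The two guarantees above can be checked on sampled output with a few lines of plain Python. This is a minimal sketch rather than SDV code: rows are assumed to be lists of dicts, the helper names are hypothetical, and the tiny `accounts` and `loans` tables are made up for illustration.

```python
def primary_key_is_unique(rows, key):
    """True if no value of `key` repeats across the rows."""
    values = [row[key] for row in rows]
    return len(values) == len(set(values))


def dangling_foreign_keys(child_rows, foreign_key, parent_rows, primary_key):
    """Return the foreign key values that match no primary key in the parent."""
    parent_ids = {row[primary_key] for row in parent_rows}
    child_ids = {row[foreign_key] for row in child_rows}
    return sorted(child_ids - parent_ids)


# Made-up sample rows standing in for generated tables.
accounts = [{"account_id": 0}, {"account_id": 1}, {"account_id": 2}]
loans = [
    {"loan_id": 0, "account_id": 2},
    {"loan_id": 1, "account_id": 0},
]

print(primary_key_is_unique(accounts, "account_id"))  # True
print(dangling_foreign_keys(loans, "account_id", accounts, "account_id"))  # []
```

An empty list from the second check means every reference in the child table resolves to a row in the parent table.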

If you have the metadata, that will help me understand what's happening. Thanks.

@npatki npatki removed the under discussion Issue is currently being discussed label Oct 16, 2023
@sdv-dev sdv-dev deleted a comment from Abduttayyeb Apr 17, 2024
@npatki
Contributor

npatki commented Apr 22, 2024

Closing as a duplicate of #1922 -- we will now scramble up the IDs so that they do not appear in order. This will make the IDs appear more random.

Note that for some cases (like small batch sampling), this may not appear as random. Truly random ID generation is something the team is actively working on.
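For readers who need random-looking IDs within a range, one possible workaround is to remap the IDs after sampling and apply the same mapping to every foreign key column that references them. The sketch below is an assumption-laden illustration in plain Python, not how the SDV scrambles IDs: the helper names are hypothetical, rows are lists of dicts, and the 0 to 1000 range simply mirrors the values proposed earlier in the thread.

```python
import random


def build_id_mapping(ids, low, high, seed=None):
    """Map each distinct ID to a distinct random integer in [low, high]."""
    unique_ids = list(dict.fromkeys(ids))  # drop duplicates, keep order
    if high - low + 1 < len(unique_ids):
        raise ValueError("range is too small to give every ID a unique value")
    rng = random.Random(seed)
    # sample() draws without replacement, so the new IDs stay unique.
    new_ids = rng.sample(range(low, high + 1), len(unique_ids))
    return dict(zip(unique_ids, new_ids))


def remap_column(rows, column, mapping):
    """Return new rows with `column` replaced according to `mapping`."""
    return [{**row, column: mapping[row[column]]} for row in rows]


# Made-up rows standing in for sampled parent and child tables.
accounts = [{"account_id": i} for i in range(5)]
loans = [
    {"loan_id": 0, "account_id": 3},
    {"loan_id": 1, "account_id": 3},
    {"loan_id": 2, "account_id": 0},
]

mapping = build_id_mapping(
    [row["account_id"] for row in accounts], low=0, high=1000, seed=0
)
accounts_random = remap_column(accounts, "account_id", mapping)
loans_random = remap_column(loans, "account_id", mapping)
print(accounts_random)
print(loans_random)
```

Because the parent and child tables share one mapping, the references stay consistent: loans that pointed at the same account before the remap still point at the same account afterwards.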

@npatki npatki closed this as completed Apr 22, 2024
@npatki npatki added the resolution:duplicate This issue or pull request already exists label Apr 22, 2024

2 participants