HMASynthesizer generates IDs in order (even though real data has random IDs) #1596

majidliaquat · 2023-09-20T14:12:13Z

I am using HMASynthesizer for multi-table data and it runs fine, but the synthetic data it generates is not similar to the real data. I explain the issue in detail below.

Issue with generated data

I generated the data after setting the table parameters listed further down. See the image below:

The parent table (real data) has shape (4500, 4). In the real dataset, account_id holds random values, such as the one highlighted in row 51.

The generated data has the same shape, (4500, 4), but its account_id values run from 0 to 4,499, identical to the index (highlighted with the red and orange boxes).

I want the generated IDs to be random, like the real data. Here are the table parameters I set:

synthesizer.set_table_parameters(
    table_name='parent table',
    table_parameters={
        "enforce_min_max_values": True,
        "enforce_rounding": True,
        "locales": "en_GB",
        "numerical_distributions": {
            "account_id": "norm"
        },
        "default_distribution": "uniform"
    }
)

I checked the documentation on the website but couldn't find anything related to this issue.

Thanks in advance.
Many Thanks,
Majid


npatki · 2023-09-20T20:35:19Z

Hi @majidliaquat,

Thanks for filing this issue with this example. I'm going to re-title this issue to be more precise about the nature of the problem.

In the SDV, we consider the purpose of ID columns to be unique identifiers (in the case of a primary key), or part of a relationship (in the case of a foreign key). So a lot of our users have not found it necessary to follow any specific patterns for ID columns. I think we can mark this as a new feature request, since the SDV itself is working as intended.

To help us prioritize, it would be great to provide a bit more information about your use case. Why is having random IDs a requirement for your project? Are you planning to use the synthetic data in a way that requires this randomness?

If I can understand the nature of this concern, then I can try to suggest other workarounds for you too. Thanks.

majidliaquat · 2023-09-25T09:16:28Z

Hi @npatki,

Thank you for your response. The real table had some missing values, so I dropped those rows, which leaves the IDs non-sequential. I need random IDs because in some tables those IDs represent a place name stored in another table.

So I was wondering whether we could generate random IDs. That would make the generated table more realistic and might also improve the quality score.

I was thinking we could pass parameters to set_table_parameters to generate random numbers for a specific ID column:

synthesizer.set_table_parameters(
    table_name='parent table',
    table_parameters={
        "numerical_distributions": {
            "account_id": "norm",
            # min_id_value and max_id_value are proposed keys (untested)
            "min_id_value": 0,
            "max_id_value": 1000
        },
        "default_distribution": "uniform"
    }
)

This may not be the right way to pass them; I haven't tested it yet.

Thanks

npatki · 2023-09-25T21:09:40Z

Hi @majidliaquat, the numerical_distributions parameter does not support min or max values.

I think if your IDs are specifically referring to a different concept or table, then it would be best to mark them as "categorical" in the metadata. That way, the SDV will learn correlations between those IDs and other columns. It will then recreate the same IDs in the output. This blog post has more information about when you should consider something categorical.

It would be very helpful if you could provide the metadata file so we could assist you further.
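As a rough illustration of the "categorical" suggestion above, the change amounts to editing one column entry in the metadata dictionary. This is only a sketch on a plain Python dict: the `client` table, its columns, and the `mark_as_categorical` helper are hypothetical and not part of the SDV API, and it assumes the column is not the primary key and is not used in a declared relationship.

```python
import copy

# Hypothetical fragment of a multi-table metadata dictionary; the table and
# column names are illustrative and not taken from the reporter's project.
metadata = {
    "tables": {
        "client": {
            "columns": {
                "client_id": {"sdtype": "id"},
                "district_id": {"sdtype": "id"},
            },
            "primary_key": "client_id",
        }
    }
}


def mark_as_categorical(metadata_dict, table_name, column_name):
    """Return a copy of the metadata with one column's sdtype set to categorical."""
    updated = copy.deepcopy(metadata_dict)
    table = updated["tables"][table_name]
    # A primary key has to remain a unique identifier, so refuse to change it.
    if table.get("primary_key") == column_name:
        raise ValueError("a primary key should stay an id column")
    table["columns"][column_name] = {"sdtype": "categorical"}
    return updated


updated_metadata = mark_as_categorical(metadata, "client", "district_id")
```

The original dictionary is left untouched, so the two versions can be compared before fitting a synthesizer with the updated one.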

majidliaquat · 2023-09-27T08:56:35Z

Hi @npatki,

Here is the metadata with parent with two child tables:

{
  "tables": {
    "Table 1 account.csv": {
      "columns": {
        "account_id": {
          "sdtype": "id"
        },
        "district_id": {
          "sdtype": "id"
        },
        "frequency": {
          "sdtype": "categorical"
        },
        "date": {
          "sdtype": "numerical"
        }
      },
      "primary_key": "account_id"
    },
    "Table 2 loan.csv": {
      "columns": {
        "loan_id": {
          "sdtype": "id"
        },
        "account_id": {
          "sdtype": "id"
        },
        "date": {
          "sdtype": "numerical"
        },
        "amount": {
          "sdtype": "numerical"
        },
        "duration": {
          "sdtype": "numerical"
        },
        "payments": {
          "sdtype": "numerical"
        },
        "status": {
          "sdtype": "categorical"
        }
      },
      "primary_key": "loan_id"
    },
    "Table 3 order.csv": {
      "columns": {
        "order_id": {
          "sdtype": "id"
        },
        "account_id": {
          "sdtype": "id"
        },
        "bank_to": {
          "sdtype": "categorical"
        },
        "account_to": {
          "sdtype": "numerical"
        },
        "amount": {
          "sdtype": "numerical"
        },
        "k_symbol": {
          "sdtype": "categorical"
        }
      },
      "primary_key": "order_id"
    }
  },
  "relationships": [
    {
      "parent_table_name": "Table 1 account.csv",
      "child_table_name": "Table 2 loan.csv",
      "parent_primary_key": "account_id",
      "child_foreign_key": "account_id"
    },
    {
      "parent_table_name": "Table 1 account.csv",
      "child_table_name": "Table 3 order.csv",
      "parent_primary_key": "account_id",
      "child_foreign_key": "account_id"
    }
  ],
  "METADATA_SPEC_VERSION": "MULTI_TABLE_V1"
}

I will try what you are suggesting after reading the blog post.

Many Thanks.

npatki · 2023-09-27T20:31:48Z

Hi @majidliaquat, thanks for providing the metadata. I have a follow-up question. Earlier you mentioned:

Actually I need random IDs as in some tables those IDs representing a place name in other table.

In the example, you want random IDs for the account_id column (in the accounts table). But from the metadata, account_id seems to be listed as a primary key, so I am a bit confused.

Generally, a primary key is a unique identifier for the table it's defined in. So how can it be a primary key and also represent a place name in another table? Could you explain that a bit more?

majidliaquat · 2023-09-29T09:07:23Z

Hi @npatki,

Yes, the metadata I provided does not cover what I mentioned before. I have two tables: one contains the district names and IDs, and the other contains just the district ID (as a foreign key). That is why I asked for random IDs within a range.

I am using the dataset: 1999 Czech Financial Dataset - Real Anonymized Transactions

Here is the data map:

Many Thanks.

npatki · 2023-10-03T05:39:17Z

Hi @majidliaquat, I am still a bit confused. Would it be possible to provide the metadata for this schema and the relevant column(s) that are giving you problems? Is district_id the primary key of Demograph?

When the SDV generates multi-table data, it will guarantee the following:

Primary keys of a table will always be a unique identifier (it will follow the Regex format you can optionally provide), and
Foreign keys will always point to primary keys in a different table. That is: The references will be consistent.

So if district_id is a foreign key in a table, then the values will be pointing to the primary key of a different table.
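The two guarantees above can be checked on any pair of sampled tables. The following is a minimal sketch in plain Python, assuming each table is a list of row dictionaries; `check_keys` is a hypothetical helper for illustration, not an SDV function, and the sample rows are made up.

```python
def check_keys(parent_rows, child_rows, primary_key, foreign_key):
    """Return (primary_keys_unique, orphaned_foreign_keys)."""
    parent_ids = [row[primary_key] for row in parent_rows]
    # Guarantee 1: every primary key value appears exactly once.
    unique = len(parent_ids) == len(set(parent_ids))
    # Guarantee 2: every foreign key value exists among the parent's primary keys.
    orphans = sorted({row[foreign_key] for row in child_rows} - set(parent_ids))
    return unique, orphans


# Made-up rows standing in for a sampled parent and child table.
accounts = [{"account_id": 0}, {"account_id": 1}, {"account_id": 2}]
loans = [{"loan_id": 0, "account_id": 2}, {"loan_id": 1, "account_id": 0}]

is_unique, orphans = check_keys(accounts, loans, "account_id", "account_id")
```

An empty `orphans` list together with `is_unique` being true means the references are consistent, regardless of whether the ID values look sequential or random.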

If you have the metadata, that will help me understand what's happening. Thanks.

npatki · 2024-04-22T21:28:49Z

Closing as a duplicate of #1922 -- we will now scramble up the IDs so that they do not appear in order. This will make the IDs appear more random.

Note that for some cases (like small batch sampling), this may not appear as random. Truly random ID generation is something the team is actively working on.
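For readers on a version without this change, one possible workaround is to remap the sequential IDs after sampling while keeping the foreign keys consistent. The sketch below is not how the SDV scrambles IDs internally; `scramble_ids`, its parameters, and the sample rows are hypothetical, and tables are assumed to be lists of row dictionaries.

```python
import random


def scramble_ids(parent_rows, child_tables, primary_key, foreign_key,
                 id_min, id_max, seed=None):
    """Replace primary keys with unique random integers in [id_min, id_max)
    and apply the same mapping to the foreign key of every child table."""
    rng = random.Random(seed)
    old_ids = [row[primary_key] for row in parent_rows]
    # random.sample draws without replacement, so the new IDs stay unique.
    # It raises ValueError if the range is smaller than the number of rows.
    new_ids = rng.sample(range(id_min, id_max), len(old_ids))
    mapping = dict(zip(old_ids, new_ids))
    new_parent = [
        {**row, primary_key: mapping[row[primary_key]]} for row in parent_rows
    ]
    new_children = [
        [{**row, foreign_key: mapping[row[foreign_key]]} for row in rows]
        for rows in child_tables
    ]
    return new_parent, new_children


# Made-up rows standing in for sampled tables with sequential IDs.
accounts = [{"account_id": i} for i in range(5)]
loans = [{"loan_id": 0, "account_id": 3}, {"loan_id": 1, "account_id": 3}]

new_accounts, (new_loans,) = scramble_ids(
    accounts, [loans], "account_id", "account_id",
    id_min=0, id_max=1000, seed=42,
)
```

Because one mapping is applied to the parent and to every child table, rows that shared a parent before the remapping still share it afterwards.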

majidliaquat added bug Something isn't working new Automatic label applied to new issues labels Sep 20, 2023

npatki changed the title ~~Issue with generated Data on HMASynthesizer~~ HMASynthesizer generates IDs in order (even though real data has random IDs) Sep 20, 2023

npatki added feature request Request for a new feature under discussion Issue is currently being discussed and removed bug Something isn't working new Automatic label applied to new issues labels Sep 20, 2023

npatki removed the under discussion Issue is currently being discussed label Oct 16, 2023

This was referenced Apr 17, 2024

Tabular models: Add option to randomize ids #329

Closed

Random Primary IDs Regex #1062

Closed

sdv-dev deleted a comment from Abduttayyeb Apr 17, 2024

npatki closed this as completed Apr 22, 2024

npatki added the resolution:duplicate This issue or pull request already exists label Apr 22, 2024
