HMASynthesizer generates IDs in order (even though real data has random IDs) #1596
Comments
Hi @majidliaquat, Thanks for filing this issue with this example. I'm going to re-title this issue to be more precise about the nature of the problem. In the SDV, we consider the purpose of ID columns to be unique identifiers (in the case of a primary key), or part of a relationship (in the case of a foreign key). So a lot of our users have not found it necessary to follow any specific patterns for ID columns. I think we can mark this as a new feature request, since the SDV itself is working as intended. To help us prioritize, it would be great if you could provide a bit more information about your use case. Why is having random IDs a requirement for your project? Are you planning to use the synthetic data in a way that requires this randomness? If I can understand the nature of this concern, then I can try to suggest other workarounds for you too. Thanks.
Hi @npatki, Thank you for your response. In my case the real table had some missing values, which I dropped, so the remaining IDs now look random. I actually need random IDs because in some tables those IDs represent a place name in another table. So I was just wondering whether we can generate random IDs. That would make the generated table more realistic and would also improve the quality score. I was thinking of passing parameters with
This may not be the right way to pass them, because I haven't tested it yet. Thanks
Hi @majidliaquat, the I think if your IDs are specifically referring to a different concept or table, then it would be best to mark them as It would be very helpful if you could provide the metadata file so we could assist you further.
Hi @npatki, Here is the metadata with parent with two child tables:
I will try what you are suggesting after reading the blog post. Many Thanks.
Hi @majidliaquat, thanks for providing the metadata. I have a follow-up question. Earlier you mentioned:
In the example, you are wanting random IDs for the Generally, a primary key is a unique identifier for the table that it's defined in. So how can it be a primary key and also represent a place name in another table? Could you explain that a bit more?
Hi @npatki, Yes, the metadata I provided is not about what I mentioned before. I have two tables: one contains the district names and IDs, and the other contains only the district ID (as a foreign key). For that reason I asked for random IDs within a range. I am using the dataset: 1999 Czech Financial Dataset - Real Anonymized Transactions. Many Thanks.
Hi @majidliaquat, I am still a bit confused. Would it be possible to provide the metadata for this schema and the relevant column(s) that are giving you problems? Is When the SDV generates multi-table data, it will guarantee the following:
So if If you have the metadata, that will help me understand what's happening. Thanks.
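For readers following along: the list of guarantees is not reproduced above, but two properties mentioned earlier in the thread (a primary key uniquely identifies rows, and a foreign key refers to a row in the parent table) can be checked on any pair of tables with plain pandas. The sketch below is only an illustration. The helper name `check_keys` and the toy tables are hypothetical and are not part of the SDV or of the dataset discussed here.

```python
import pandas as pd


def check_keys(parent, child, primary_key, foreign_key):
    """Return (primary_key_is_unique, every_foreign_key_resolves)."""
    pk_unique = parent[primary_key].is_unique
    # A foreign key "resolves" if its value appears in the parent's primary key.
    fk_resolves = child[foreign_key].isin(parent[primary_key]).all()
    return bool(pk_unique), bool(fk_resolves)


# Hypothetical toy tables, made up for illustration only.
districts = pd.DataFrame({"district_id": [7, 42, 13], "name": ["A", "B", "C"]})
accounts = pd.DataFrame({"account_id": [0, 1, 2, 3], "district_id": [42, 7, 7, 13]})

result = check_keys(districts, accounts, "district_id", "district_id")
orphans = pd.DataFrame({"district_id": [99]})
orphan_result = check_keys(districts, orphans, "district_id", "district_id")
```

Note that neither check depends on whether the IDs are sequential or random. Both hold as long as the values are unique and consistent across tables.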
Closing as a duplicate of #1922. We will now scramble up the IDs so that they do not appear in order. This will make the IDs appear more random. Note that for some cases (like small batch sampling), this may not appear as random. Truly random ID generation is something the team is actively working on.
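Until then, one possible workaround is to post-process the synthetic tables yourself: replace the sequential primary key with unique random values drawn from a range, and apply the same mapping to the foreign key so the relationship is preserved. This is a minimal sketch using pandas and NumPy, not the SDV's own implementation. The function name `scramble_ids`, its parameters, and the toy tables are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def scramble_ids(parent, child, primary_key, foreign_key, low, high, seed=None):
    """Remap parent IDs to unique random integers in [low, high) and
    apply the same mapping to the child's foreign key column."""
    rng = np.random.default_rng(seed)
    # Sampling without replacement keeps the new primary key unique.
    # This raises ValueError if the range holds fewer values than rows.
    new_ids = rng.choice(np.arange(low, high), size=len(parent), replace=False)
    mapping = pd.Series(new_ids, index=parent[primary_key].to_numpy())
    new_parent = parent.assign(**{primary_key: parent[primary_key].map(mapping)})
    new_child = child.assign(**{foreign_key: child[foreign_key].map(mapping)})
    return new_parent, new_child


# Hypothetical synthetic output with sequential IDs, as described in this issue.
accounts = pd.DataFrame({"account_id": range(5), "frequency": list("abcde")})
transactions = pd.DataFrame({"trans_id": range(4), "account_id": [0, 0, 1, 4]})

new_accounts, new_transactions = scramble_ids(
    accounts, transactions, "account_id", "account_id", low=1000, high=10000, seed=0
)
```

Because the parent and child share one mapping, each transaction still points at the same account row as before. Only the ID values change.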
I am using HMASynthesizer for multi-table data and it works fine, but when I generate synthetic data, the generated data is not similar to the real data. In the section below I have tried to explain the issue I am facing in detail.
Issue with generated data
I generated data by adjusting the table parameters mentioned earlier. You can see the reference image below:
The parent table (real data) has a shape of (4500, 4). In the real dataset, account_id has random values, such as the one highlighted in row 51. The generated data has the same shape of (4500, 4), but the problem is that the generated account_id values run from 0 to 4,499, the same as the index (highlighted with the red and orange boxes). I want the generated data to be random as well, like the real data. Here are the table parameters I set:
I referred to the documentation on the website but couldn't find anything related to this issue.
Thanks in advance
Many Thanks,
Majid