Synthetic data generation using SDV and KFP #159
Signed-off-by: Tarek Abouzeid <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@kubeflow/wg-pipeline-leads can someone please review?
author: "<a href='https://www.linkedin.com/in/aaked'>Åke Edlund</a>, <a href='https://www.linkedin.com/in/tarekabouzeid91'>Tarek Abouzeid</a>"
---
## Synthetic data - why and how?
First I'd outline what Synthetic Data Generation (SDG) is, what KFP is, and how they work together at a high level.
Updated
The best results come from real data, but accessing it often requires lengthy security and legal processes. The data may also be incomplete, biased, or too small, and during early exploration, we may not even know if it's worth pursuing. While real data is essential for proper evaluation, gaps or limited access frequently hinder progress until the formal process is complete.
The best results come from real data, but accessing it often requires lengthy security and legal processes.
I think it'd be good to outline what we're trying to do here and provide context on the problem we're trying to solve.
Updated
While the above focuses on speed and augmentation, there are more motivations for *creating* (synthetic) data. Using synthetic data *could* give us new ways to improve on speed of development, handling biases, and more:
I like this sentence but I think it's missing something about what we're trying to improve. The speed of development for building models?
Thanks, updated.
For us, open source frameworks are the only ones of interest.
Cloud-based solutions that require sending data samples to the cloud miss the whole point: some of our data cannot be sent to the cloud.
For data already in the cloud, we can use other cloud-based frameworks.
Synthesizers are motivated by multiple factors, but in this context, our focus remains on generating synthetic data for on-premise use.
We should define synthesizer first somewhere
Thanks, added.

### The Synthetic Data Vault (SDV) - high level

When you initialize and fit a synthesizer (like GaussianCopulaSynthesizer, CTGANSynthesizer, etc.), it trains a model based on
Could you link to some of those synthesizer resources?
They appear directly on the landing page linked above (https://sdv.dev). I'm not sure whether SDV might move its sub-links around (hopefully not sdv.dev itself).

The synthesizer doesn't memorize individual records from the dataset.
Instead, it tries to learn the underlying statistical patterns, correlations, and distributions present in the data.
Each synthesizer does this differently:
nice
Thanks :-)
Co-authored-by: Francisco Arceo <[email protected]> Signed-off-by: Ake Edlund <[email protected]>
Updated, thanks to input from franciscojavierarceo Signed-off-by: Ake Edlund <[email protected]>
AI-Generated Synthetic Data, using Kubeflow pipelines blog.
Authored by @akeed and @tarekabouzeid.