Synthetic data generation using SDV and KFP #159

Open: wants to merge 38 commits into base: master

Conversation

tarekabouzeid
Member

@tarekabouzeid tarekabouzeid commented Feb 16, 2025

Blog post: AI-generated synthetic data using Kubeflow Pipelines.
Authored by @akeed and @tarekabouzeid.

Signed-off-by: Tarek Abouzeid <[email protected]>
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign johnugeorge for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tarekabouzeid
Member Author

@kubeflow/wg-pipeline-leads can someone please review?
cc @akgraner

@juliusvonkohout
Member

juliusvonkohout commented Feb 17, 2025

Maybe post it in the kubeflow pipelines slack channel as well. @hbelmiro @HumairAK @rimolive

@tarekabouzeid
Member Author

> Maybe post it in the kubeflow pipelines slack channel as well. @hbelmiro @HumairAK @rimolive

Thanks Julius, will do that as well.

author: "<a href='https://www.linkedin.com/in/aaked'>Åke Edlund</a>, <a href='https://www.linkedin.com/in/tarekabouzeid91'>Tarek Abouzeid</a>"
---
## Synthetic data - why and how?

Contributor

First I'd outline what Synthetic Data Generation (SDG) is, what KFP is, and how they work together at a high level.

Updated

---
## Synthetic data - why and how?

The best results come from real data, but accessing it often requires lengthy security and legal processes. The data may also be incomplete, biased, or too small, and during early exploration, we may not even know if it's worth pursuing. While real data is essential for proper evaluation, gaps or limited access frequently hinder progress until the formal process is complete.
Contributor

> The best results come from real data, but accessing it often requires lengthy security and legal processes.

I think it'd be good to outline what we're trying to do here and provide context on the problem we're trying to solve.

Updated


The best results come from real data, but accessing it often requires lengthy security and legal processes. The data may also be incomplete, biased, or too small, and during early exploration, we may not even know if it's worth pursuing. While real data is essential for proper evaluation, gaps or limited access frequently hinder progress until the formal process is complete.

While the above focuses on speed and augmentation, there are more motivations for *creating* (synthetic) data. Using synthetic data *could* give us new ways to improve on speed of development, handling biases, and more:
Contributor

I like this sentence but I think it's missing something about what we're trying to improve. The speed of development for building models?

Thanks, updated.

For us, open source frameworks are the only ones of interest.
Cloud-based solutions that require sending data samples to the cloud miss the whole point — some of our data cannot be sent to the cloud.
For data already in cloud, we can use other cloud-based frameworks.
Synthesizers are motivated by multiple factors, but in this context, our focus remains on generating synthetic data for on-premise use.
Contributor

@franciscojavierarceo franciscojavierarceo Feb 24, 2025

We should define synthesizer first somewhere

Thanks, added.


### The Synthetic Data Vault (SDV) - high level

When you initialize and fit a synthesizer (like GaussianCopulaSynthesizer, CTGANSynthesizer, etc.), it trains a model based on
Contributor

Could you link to some of those synthesizer resources?

They appear directly on the landing page linked above (https://sdv.dev). I'm not sure whether SDV might move their sub-links around (hopefully not sdv.dev itself).


The synthesizer doesn't memorize individual records from the dataset.
Instead, it tries to learn the underlying statistical patterns, correlations, and distributions present in the data.
Each synthesizer does this differently:
Contributor

nice

Thanks :-)
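
To make the point in the hunk above concrete (a synthesizer learns statistical patterns rather than memorizing records), here is a minimal sketch in plain Python. It is not SDV's implementation and uses no SDV API: `fit_gaussian`, `sample`, and the `real_rows` values are hypothetical names and made-up illustration data, and the model is only a two-column Gaussian (two means, two standard deviations, one correlation). What it tries to show is that only summary statistics survive fitting, so sampled rows are drawn from the learned distribution instead of being copied from the input.

```python
import random
import statistics

# Made-up (age, income) rows, for illustration only.
real_rows = [(25, 32000), (31, 41000), (38, 52000),
             (45, 61000), (52, 58000), (60, 70000)]


def fit_gaussian(rows):
    """Learn summary statistics of two numeric columns.

    The rows themselves are not stored in the returned model.
    Assumes at least two rows and that both columns actually vary.
    """
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    std_x, std_y = statistics.stdev(xs), statistics.stdev(ys)
    # Sample covariance, then Pearson correlation, clamped to [-1, 1]
    # to guard against floating-point drift.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (len(rows) - 1)
    rho = max(-1.0, min(1.0, cov / (std_x * std_y)))
    return {"mean": (mean_x, mean_y), "std": (std_x, std_y), "rho": rho}


def sample(model, num_rows, seed=None):
    """Draw new rows from the fitted bivariate Gaussian."""
    rng = random.Random(seed)
    mean_x, mean_y = model["mean"]
    std_x, std_y = model["std"]
    rho = model["rho"]
    # Weight of the independent noise term; max() avoids a negative
    # value under the square root when |rho| is 1.
    noise = max(0.0, 1.0 - rho * rho) ** 0.5
    rows = []
    for _ in range(num_rows):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + std_x * z1
        # Mixing z1 into y reproduces the learned correlation.
        y = mean_y + std_y * (rho * z1 + noise * z2)
        rows.append((x, y))
    return rows


model = fit_gaussian(real_rows)
synthetic = sample(model, num_rows=5, seed=0)
```

The two-step shape (fit on real data, then sample any number of rows) mirrors how the SDV synthesizers named in the hunk are used, but everything inside the two functions is a deliberately simplified stand-in.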

akeed and others added 30 commits February 24, 2025 20:03
Co-authored-by: Francisco Arceo <[email protected]>
Signed-off-by: Ake Edlund <[email protected]>
Updated, thanks to input from franciscojavierarceo

Signed-off-by: Ake Edlund <[email protected]>

4 participants