HMASynthesizer creates too much synthetic data (always creates a child for every parent row) #1696

Merged: 4 commits merged into main from issue-1673-fix-hma-child-rows on Nov 30, 2023

Conversation

frances-h
Contributor

CU-86ayp2nrr
Resolve #1673

Round num_rows to the nearest integer instead of using math.ceil, so that a parent row can be sampled with zero child rows. If no child rows are sampled for the table at all, the sampler tries to find the parent row with the largest expected number of children (falling back to a randomly selected parent row if it cannot) and forces 1 child row to be created for it.
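
To illustrate the change described above, here is a minimal sketch in plain Python. It is not the SDV implementation: the function names `child_counts_ceil` and `child_counts_round` are hypothetical, and using `None` to stand for a parent row with no known expected child count is an assumption made only for this example.

```python
import math
import random


def child_counts_ceil(expected):
    """Old behavior: any positive expectation becomes at least 1 child."""
    return [math.ceil(value) for value in expected]


def child_counts_round(expected, rng=None):
    """Sketch of the new behavior (hypothetical helper, not SDV's API).

    `expected` holds one expected child count per parent row; `None`
    marks a parent whose expectation is unknown (an assumption here).
    """
    rng = rng or random.Random()
    # Rounding lets small expectations collapse to zero children.
    # Note: Python's built-in round() rounds halves to the nearest even
    # integer, so round(0.5) == 0.
    counts = [0 if value is None else max(round(value), 0) for value in expected]

    if counts and sum(counts) == 0:
        # Nothing was sampled for the whole table: force exactly one child.
        known = [i for i, value in enumerate(expected) if value is not None]
        if known:
            # Prefer the parent with the largest expected number of children.
            target = max(known, key=lambda i: expected[i])
        else:
            # No expectations available: pick a parent at random.
            target = rng.randrange(len(counts))
        counts[target] = 1

    return counts


print(child_counts_ceil([0.2, 0.4, 1.6]))   # [1, 1, 2]
print(child_counts_round([0.2, 0.4, 1.6]))  # [0, 0, 2]
print(child_counts_round([0.1, 0.4, 0.2]))  # [0, 1, 0]
```

With `math.ceil`, every parent whose expectation is above zero gets a child, which matches the symptom in the linked issue. With rounding, parents can end up childless, and the fallback only fires when the whole table would otherwise be empty.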

@frances-h frances-h requested a review from a team as a code owner November 28, 2023 16:33
@frances-h frances-h requested review from pvk-developer and removed request for a team November 28, 2023 16:33
codecov-commenter commented Nov 28, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (9ade85b) 97.11% compared to head (ec59d9a) 97.12%.
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1696   +/-   ##
=======================================
  Coverage   97.11%   97.12%           
=======================================
  Files          47       47           
  Lines        4402     4410    +8     
=======================================
+ Hits         4275     4283    +8     
  Misses        127      127           

@frances-h frances-h force-pushed the issue-1673-fix-hma-child-rows branch from b34d8a8 to 10b4915 Compare November 28, 2023 17:21
@amontanez24 amontanez24 left a comment

Looks good! I'm mostly curious about what causes situations where the num rows column isn't present in the parent.

tests/unit/sampling/test_hierarchical_sampler.py
        pd.testing.assert_frame_equal(result_frame, expected_frame)

    def test__sample_children_no_rows_sampled_no_num_rows(self):
        """Test sampling the children of a table where no rows created.

Contributor

what is the case where this happens?

Contributor Author

As far as I know it doesn't/shouldn't happen. I added it because _get_num_rows_from_parent has a similar check, so I assumed it was something we'd encountered in the past.

Contributor Author

@amontanez24 we don't hit either of these checks in our integration tests, so not sure if we should keep either around

Contributor

Ok, can we file an issue to investigate it a little more before removing?

Contributor Author

Done in #1703

@amontanez24 amontanez24 left a comment

LGTM!

@frances-h frances-h merged commit 8ef36b7 into main Nov 30, 2023
37 checks passed
@frances-h frances-h deleted the issue-1673-fix-hma-child-rows branch November 30, 2023 22:55