-
Notifications
You must be signed in to change notification settings - Fork 322
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HMASynthesizer creates too much synthetic data (always creates a child for every parent row) #1696
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## main #1696 +/- ##
=======================================
Coverage 97.11% 97.12%
=======================================
Files 47 47
Lines 4402 4410 +8
=======================================
+ Hits 4275 4283 +8
Misses 127 127 ☔ View full report in Codecov by Sentry. |
b34d8a8
to
10b4915
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Mostly curious about what causes situations where the num rows column isn't present in the parent
pd.testing.assert_frame_equal(result_frame, expected_frame) | ||
|
||
def test__sample_children_no_rows_sampled_no_num_rows(self): | ||
"""Test sampling the children of a table where no rows created. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the case where this happens?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I know it doesn't/shouldn't happen. I added it because _get_num_rows_from_parent
has a similar check, so I assumed it was something we'd encountered in the past.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amontanez24 we don't hit either of these checks in our integration tests, so not sure if we should keep either around
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, can we file an issue to investigate it a little more before removing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in #1703
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
CU-86ayp2nrr
Resolve #1673
Round
num_rows
instead of usingmath.ceil
to allow for zero child rows to be sampled. If no child rows are sampled, tries to find the parent row with the largest expected number of children (otherwise randomly selects a parent row) and forces 1 child row to be created for it.