Fixes two bugs.
Teams
table), but as a result of the particular foreign key frequency on some child table it is possible to not hit the uniqueness threshold when appearing as a joined ancestral column. This led to a bug where we would train on that data and attempt to use it as a seed column, but the synthetic child data used as a seed would include categorical values not seen during training, ultimately resulting in having to throw away records as invalid conditional seeds.

Implementation note
The logic for determining if a column is highly NaN or highly unique categorical now exists in the `RelationalData` class itself, rather than being owned by the `AncestralStrategy`. One immediate benefit is we now perform the calculation just once for each table and cache the result, rather than recalculate it every time the table is joined onto a child table. Additionally, in the future there might be ways for us to get this metadata from a more sophisticated connector... if that is possible, the result would need to live in `RelationalData` anyways, so making this change now sets us up better for that possibility.
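
As a rough illustration of the caching described above, here is a minimal sketch. It is not the actual `RelationalData` implementation: the class name `RelationalDataSketch`, the method `unsafe_seed_columns`, the `computations` counter, and both threshold values are assumptions made up for this example, and tables are modeled as plain dicts of column lists rather than DataFrames. The point it shows is that the flags are computed from the source table itself, once, and then reused for every child table that joins it.

```python
import math
from typing import Any, Dict, List, Set

# Hypothetical thresholds; the PR does not state the real values.
NAN_THRESHOLD = 0.6
UNIQUE_THRESHOLD = 0.8


def _is_missing(value: Any) -> bool:
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


class RelationalDataSketch:
    """Illustrative stand-in for RelationalData; not the real class."""

    def __init__(self, tables: Dict[str, Dict[str, List[Any]]]) -> None:
        # table name -> column name -> list of values
        self._tables = tables
        self._unsafe_columns: Dict[str, Set[str]] = {}
        self.computations = 0  # counts how often the scan actually runs

    def unsafe_seed_columns(self, table: str) -> Set[str]:
        """Columns that are highly NaN or highly unique categorical.

        Computed from the source table (not a joined view) and cached,
        so joining the table onto many children costs one scan.
        """
        if table not in self._unsafe_columns:
            self.computations += 1
            self._unsafe_columns[table] = {
                col
                for col, values in self._tables[table].items()
                if self._highly_nan(values) or self._highly_unique_categorical(values)
            }
        return self._unsafe_columns[table]

    @staticmethod
    def _highly_nan(values: List[Any]) -> bool:
        if not values:
            return False
        missing = sum(1 for v in values if _is_missing(v))
        return missing / len(values) > NAN_THRESHOLD

    @staticmethod
    def _highly_unique_categorical(values: List[Any]) -> bool:
        present = [v for v in values if not _is_missing(v)]
        # Only string-valued columns count as categorical in this sketch.
        if not present or not all(isinstance(v, str) for v in present):
            return False
        return len(set(present)) / len(present) > UNIQUE_THRESHOLD


rel = RelationalDataSketch({
    "users": {
        "id": [1, 2, 3, 4, 5],
        "email": ["a@x", "b@x", "c@x", "d@x", "e@x"],
        "plan": ["free", "free", "pro", "free", "pro"],
        "referrer": [None, None, None, None, "ad"],
    }
})
first = rel.unsafe_seed_columns("users")
second = rel.unsafe_seed_columns("users")  # served from the cache
```

Because the ratios are taken over the parent table's own rows, they do not shift with how often each parent row is referenced by a given child table, which is the inconsistency the bug fix addresses.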