This issue is for the in-progress metadata-improvment branch, which aims to improve the accuracy of metadata detection from datasets.
Background
Sometimes a column is stored as int/float values but actually encodes distinct categories, for example an HTTP status code such as 404 or 200. To distinguish numerical from categorical columns, the metadata detection looks at the cardinality, i.e. the number of unique values in a column.
Current logic: If cardinality < (length of data)/10, then mark it as "categorical"
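For reference, the current rule can be sketched as below. This is a minimal illustration rather than the code on the branch: the function name `is_categorical_current` is hypothetical, and a plain Python list stands in for a dataset column.

```python
def is_categorical_current(values):
    """Sketch of the current rule: categorical if cardinality < len(values) / 10.

    `values` is any sequence of hashable column values (assumption: a plain
    list stands in for a dataframe column here).
    """
    cardinality = len(set(values))  # number of unique values in the column
    return cardinality < len(values) / 10


# A column of HTTP status codes: 3 unique values across 60 rows.
status_codes = [200, 404, 500] * 20
print(is_categorical_current(status_codes))
```

With 60 rows the threshold is 6, so a column with 3 unique values is marked as categorical.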
Problem
The current logic tolerates too high a cardinality for large datasets. E.g., with 500K rows, we allow a cardinality of up to 50K.
Expected behavior
We propose capping the cardinality threshold at 10, which means using the following logic:
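One possible reading of the capped rule, as a minimal sketch. It assumes the cap is combined with the existing ratio rule by taking the smaller of the two thresholds and that the comparison stays strict (`<`); both are assumptions, and the names `is_categorical_capped` and `MAX_CATEGORICAL_CARDINALITY` are hypothetical.

```python
MAX_CATEGORICAL_CARDINALITY = 10  # cap proposed in this issue


def is_categorical_capped(values, cap=MAX_CATEGORICAL_CARDINALITY):
    """Sketch of the proposed rule: categorical if cardinality < min(len/10, cap).

    Assumption: the cap is applied on top of the existing len/10 rule via min(),
    and the strict `<` comparison is kept.
    """
    cardinality = len(set(values))  # number of unique values in the column
    threshold = min(len(values) / 10, cap)
    return cardinality < threshold


# 500K rows with 49,999 unique values: passes the current len/10 rule
# (threshold 50,000) but is rejected once the threshold is capped at 10.
large_column = [i % 49_999 for i in range(500_000)]
print(is_categorical_capped(large_column))

# A small status-code column is still treated as categorical.
print(is_categorical_capped([200, 404, 500] * 20))
```

For small datasets the len/10 term still dominates, so behavior only changes once a dataset has more than 100 rows.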