Problem Description
In this paper, we introduced a new methodology for calculating multi-sequence metrics called MSAS. We should add the MSAS-related metrics to SDMetrics so that users with sequential data can use them for evaluation.
Expected behavior
Add a new metric called SequenceLengthSimilarity to SDMetrics.
Data compatibility: ID columns (representing the sequence key)
Parameters:
(required) real_data: A column (pd.Series) containing the sequence key of the real data
(required) synthetic_data: A column (pd.Series) containing the sequence key of the synthetic data
Output: A score in range [0, 1] -- 0 being the worst and 1 being the best
How does it work? The length of a sequence is the number of times its sequence key occurs. For example, if id_09231 appears 150 times in the sequence key column, then that sequence has length 150. This metric compares the lengths of all sequences in the real data vs. the synthetic data:
Calculate the length of each sequence in the real data (call this distribution D_r)
Calculate the length of each sequence in the synthetic data (call this distribution D_s)
Apply the KSComplement metric to the two distributions (D_r, D_s) and return its score.
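The three steps above can be sketched as follows. This is only an illustration, not the proposed SDMetrics implementation: the function names `sequence_length_similarity` and `ks_complement` are hypothetical, and the sketch assumes KSComplement is 1 minus the two-sample Kolmogorov-Smirnov statistic, computed here directly with NumPy rather than by calling the existing SDMetrics metric.

```python
import numpy as np
import pandas as pd


def ks_complement(real_values, synthetic_values):
    """Return 1 minus the two-sample Kolmogorov-Smirnov statistic.

    Assumption: this stands in for the existing KSComplement metric.
    """
    real_sorted = np.sort(np.asarray(real_values, dtype=float))
    synth_sorted = np.sort(np.asarray(synthetic_values, dtype=float))
    # Evaluate both empirical CDFs at every observed value.
    grid = np.concatenate([real_sorted, synth_sorted])
    real_cdf = np.searchsorted(real_sorted, grid, side="right") / len(real_sorted)
    synth_cdf = np.searchsorted(synth_sorted, grid, side="right") / len(synth_sorted)
    # The KS statistic is the largest gap between the two CDFs.
    return 1.0 - float(np.max(np.abs(real_cdf - synth_cdf)))


def sequence_length_similarity(real_data, synthetic_data):
    """Compare sequence-length distributions of two sequence key columns."""
    # Each key's occurrence count is the length of its sequence.
    real_lengths = real_data.value_counts()            # D_r
    synthetic_lengths = synthetic_data.value_counts()  # D_s
    return ks_complement(real_lengths, synthetic_lengths)


# Real sequences have lengths 3, 2, 1; synthetic sequences all have length 2.
real_keys = pd.Series(["a", "a", "a", "b", "b", "c"])
synthetic_keys = pd.Series(["x", "x", "y", "y", "z", "z"])
score = sequence_length_similarity(real_keys, synthetic_keys)
```

Only the counts matter here: the key values themselves need not match between the real and synthetic data, so two columns with entirely different IDs but the same length distribution score 1.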