Add metric for sequence length similarity #638

Closed
npatki opened this issue Oct 9, 2024 · 0 comments · Fixed by #643 or #662

Labels: data:sequential (Related to timeseries datasets), feature request (Request for a new feature)

npatki (Contributor) commented Oct 9, 2024

Problem Description

In this paper, we introduced a new methodology for calculating multi-sequence metrics called MSAS. We should add the MSAS-related metrics to SDMetrics so that users with sequential data can use them for evaluation.

Expected behavior

Add a new metric called SequenceLengthSimilarity to SDMetrics.

Data compatibility: ID columns (representing the sequence key)

Parameters:

  • (required) real_data: A column (pd.Series) containing the sequence key of the real data
  • (required) synthetic_data: A column (pd.Series) containing the sequence key of the synthetic data

Output: A score in the range [0, 1], where 0 is the worst and 1 is the best

from sdmetrics.single_column import SequenceLengthSimilarity

# real_table and synthetic_table are pandas DataFrames; 'patient_id' is the sequence key column
score = SequenceLengthSimilarity.compute(
    real_data=real_table['patient_id'],
    synthetic_data=synthetic_table['patient_id']
)

How does it work? The length of a sequence is the number of times its sequence key occurs. For example, if id_09231 appears 150 times in the sequence key column, then that sequence has length 150. This metric compares the distribution of sequence lengths in the real data with the distribution in the synthetic data:

  1. Calculate the length of each sequence in the real data (call this distribution D_r)
  2. Calculate the length of each sequence in the synthetic data (call this distribution D_s)
  3. Apply the KSComplement metric to compare the two distributions (D_r, D_s), and return the resulting score.
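
The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the SDMetrics implementation: it uses the standard library alone, it assumes KSComplement means 1 minus the two-sample Kolmogorov-Smirnov statistic, and the helper names (sequence_lengths, ks_complement, sequence_length_similarity) are hypothetical. Both inputs are assumed to be non-empty iterables of sequence keys.

```python
from bisect import bisect_right
from collections import Counter


def sequence_lengths(sequence_keys):
    """Return the sorted sequence lengths, i.e. how often each key occurs."""
    return sorted(Counter(sequence_keys).values())


def ks_complement(sample_a, sample_b):
    """Return 1 minus the two-sample Kolmogorov-Smirnov statistic.

    Assumption: this mirrors what KSComplement computes; both samples
    must be non-empty.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    # The empirical CDFs only change at observed values, so it is enough
    # to compare them at every value seen in either sample.
    ks_statistic = max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in set(a) | set(b)
    )
    return 1.0 - ks_statistic


def sequence_length_similarity(real_keys, synthetic_keys):
    """Hypothetical helper following steps 1 to 3 of the issue."""
    d_r = sequence_lengths(real_keys)       # step 1: lengths in the real data
    d_s = sequence_lengths(synthetic_keys)  # step 2: lengths in the synthetic data
    return ks_complement(d_r, d_s)          # step 3: compare the distributions


# Toy data: real lengths are [3, 3, 5], synthetic lengths are [3, 5, 5].
real_keys = ["a"] * 3 + ["b"] * 3 + ["c"] * 5
synthetic_keys = ["x"] * 3 + ["y"] * 5 + ["z"] * 5
score = sequence_length_similarity(real_keys, synthetic_keys)
print(round(score, 4))  # 0.6667
```

In the toy data, the empirical CDFs differ most at length 3 (2/3 of real sequences versus 1/3 of synthetic ones), so the KS statistic is 1/3 and the score is 2/3. The key values themselves never matter, only how often each one occurs.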

npatki added the feature request and data:sequential labels on Oct 9, 2024
fealho self-assigned this on Nov 14, 2024
fealho added this to the 0.16.1 milestone on Nov 14, 2024