Improve quality of sequence_index: Move the start dates into the context model #1760
Problem Description
In multi-sequence data, a sequence index is used to denote both the order of sequences and the intervals between them. Usually, this column is datetime (but it may also be numerical if the user is storing absolute time values). In this example dataset, the sequence_index is the Date column.

The sequence_index is generally supposed to adhere to some special rules. However, when using the PARSynthesizer as-is, users have reported that various properties are not being learned and the rules are being broken.

- sequence_index is supposed to be unique, in order to uniquely determine the sequence order. Users are finding duplicates (see PARSynthesizer: dates duplicates in synthetic data #1723).
- sequence_index value should be increasing, as most sequential datasets are presented in order. Users are finding that the value sometimes decreases (see Sequence index values should be strictly increasing in the synthetic data #466).
- sequence_index should continue on into the future. Users are finding that the current index values are very limited in range (see PARSynthesizer creates limited ranges (and is unable to forecast past the max date) #1752).

Root Cause
In an investigation with @amontanez24 and @frances-h, we found a potential root cause:
During Fit: This function is computing the diffs between rows (as expected) but it is also storing the start sequence index as a single, static value within the sequence. PAR cannot effectively learn a different static value per sequence.
SDV/sdv/sequential/par.py
Lines 205 to 212 in e6e508b
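To make the fit-time behavior concrete, here is a minimal sketch of the kind of transformation described above. It is not the code in par.py: the helper name, the column suffixes, and the toy data are all hypothetical, and it assumes the sequence index has already been converted to a numeric value.

```python
import pandas as pd


def split_index_into_diffs_and_start(data, sequence_key, sequence_index):
    """Hypothetical helper: replace the index with row diffs plus a start value."""
    out = data.sort_values([sequence_key, sequence_index]).copy()
    grouped = out.groupby(sequence_key)[sequence_index]
    # Difference to the previous row of the same sequence (0 for the first row).
    out[f"{sequence_index}.diff"] = grouped.diff().fillna(0)
    # The start value is repeated on every row, so it is static within a sequence.
    out[f"{sequence_index}.start"] = grouped.transform("min")
    return out.drop(columns=[sequence_index])


# Toy data: two sequences with a numeric sequence index "t".
data = pd.DataFrame({"id": ["A", "A", "A", "B", "B"], "t": [10, 12, 15, 100, 101]})
transformed = split_index_into_diffs_and_start(data, "id", "t")
print(transformed)
```

In this toy output the `t.start` column holds one constant per sequence (10 for A, 100 for B), which is the "single, static value within the sequence" that a row-level sequence model is then asked to reproduce.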
During Sampling: The diffs between rows are being added up (as expected) but they are all using a different starting value due to the issue above.
SDV/sdv/sequential/par.py
Lines 299 to 303 in e6e508b
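The effect at sampling time can be illustrated with a small NumPy sketch. This is one plausible reading of the behavior described above, not the actual sampling code: `rebuild_index` and the sample values are hypothetical, chosen only to show what happens when the start value differs from row to row.

```python
import numpy as np


def rebuild_index(diffs, starts):
    """Hypothetical reconstruction: cumulative sum of diffs plus a per-row start."""
    return np.cumsum(diffs) + np.asarray(starts)


diffs = [0, 2, 3]

# Start value identical on every row: the rebuilt index only moves forward.
constant_start = rebuild_index(diffs, [10, 10, 10])

# Start value that varies per row: the same diffs can yield a decreasing index.
varying_start = rebuild_index(diffs, [10, 7, 11])

print(constant_start)  # [10 12 15]
print(varying_start)   # [10  9 16]
```

With a constant start the rebuilt index is increasing, while a varying start can produce decreases (and, with other values, duplicates), which matches the symptoms listed in the Problem Description.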
Solution
During fit:
During sample:
Additional Notes
sequence_index is an optional concept, so we should verify that PAR can continue to work without it.
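When verifying a fix, a check along these lines could be run on the synthetic data. It is only a sketch: `check_sequence_index` is a hypothetical helper, not part of SDV, and it covers the uniqueness and ordering rules from the Problem Description but not the range rule. It returns an empty report when no sequence_index is set, since the concept is optional.

```python
import pandas as pd


def check_sequence_index(data, sequence_key, sequence_index=None):
    """Hypothetical check: list the sequences that break each rule."""
    if sequence_index is None:
        # Nothing to check when the dataset has no sequence index.
        return {"duplicates": [], "not_increasing": []}
    grouped = data.groupby(sequence_key)[sequence_index]
    has_duplicates = grouped.apply(lambda s: s.duplicated().any())
    not_increasing = grouped.apply(lambda s: not s.is_monotonic_increasing)
    return {
        "duplicates": has_duplicates[has_duplicates].index.tolist(),
        "not_increasing": not_increasing[not_increasing].index.tolist(),
    }


# Toy synthetic data: sequence A repeats a date, sequence B goes backwards.
synthetic = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B", "B"],
    "Date": pd.to_datetime([
        "2023-01-01", "2023-01-02", "2023-01-02",
        "2023-01-03", "2023-01-01", "2023-01-05",
    ]),
})
report = check_sequence_index(synthetic, "id", "Date")
print(report)
```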