Sequence index values should be strictly increasing in the synthetic data #466

Closed
silvaac opened this issue Jun 9, 2021 · 3 comments
Labels
data:sequential (related to timeseries datasets), feature request (request for a new feature), resolution:duplicate (this issue or pull request already exists), resolution:resolved (the issue was fixed, the question was answered, etc.)

Comments

silvaac commented Jun 9, 2021

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDV version: 0.10.0
  • Python version: 3.8.10
  • Operating System: windows

Error Description

The dates in the synthetic time series are out of sequence, repeated, and/or separated by large gaps. I would expect the synthetic data to keep the same time step as the real data, or at least to preserve the time ordering of the dates.

Steps to reproduce

In fact, the basic example on the web page shows it. This is a screenshot:

image

This is an alternative code that presents the issues:

import numpy as np
import pandas as pd
from pandas_datareader import data
from sdv.timeseries import PAR

if __name__ == '__main__':
    # Download SPY daily prices and compute daily log returns
    P = data.DataReader('SPY', data_source='yahoo', start='1990-01-01')
    x0 = P.assign(logP=np.log(P["Adj Close"]))
    x = x0.assign(r=x0.logP.diff())
    x = x.dropna()
    R = x["r"]

    # Simulator: fit PAR using 'Date' as the sequence index
    sequence_index = 'Date'
    model = PAR(
        sequence_index=sequence_index,
        epochs=3000,
        sample_size=10,
        verbose=True,
    )
    model.fit(R.reset_index())

    # Sample one synthetic sequence; its 'Date' column shows the problem
    sample = model.sample(num_sequences=1)
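
The three symptoms described above (dates going backwards, repeated dates, and large gaps) can be counted directly on a sampled date column. The following is a minimal sketch using only the standard library; the helper name `diagnose_sequence_index`, the `max_gap` threshold, and the example dates are all hypothetical and are not part of SDV.

```python
from datetime import date, timedelta


def diagnose_sequence_index(values, max_gap=timedelta(days=7)):
    """Count out-of-order, repeated, and widely spaced consecutive values.

    `max_gap` is an illustrative threshold chosen for this sketch,
    not something defined by SDV.
    """
    report = {"out_of_order": 0, "repeated": 0, "large_gaps": 0}
    # Compare each value with the one immediately before it
    for previous, current in zip(values, values[1:]):
        if current < previous:
            report["out_of_order"] += 1
        elif current == previous:
            report["repeated"] += 1
        elif current - previous > max_gap:
            report["large_gaps"] += 1
    return report


# Hypothetical sampled dates showing all three symptoms
dates = [
    date(2021, 1, 4),
    date(2021, 1, 5),
    date(2021, 1, 5),  # repeated
    date(2021, 1, 3),  # goes backwards
    date(2021, 3, 1),  # large jump forward
]
report = diagnose_sequence_index(dates)
print(report)
```

Assuming the `sample` DataFrame from the script above has a `Date` column, something like `diagnose_sequence_index(list(sample['Date']))` should give the same kind of report.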

silvaac added the bug and pending review labels on Jun 9, 2021
npatki (Contributor) commented Jun 30, 2022

Thanks for filing this issue @silvaac.

There is a related feature request in #678 about ensuring that sequence index rows are all unique. I will turn this one into a complementary feature request for ensuring that the values in the sequence index are strictly increasing.

You touched upon another issue, which is that there are sometimes large gaps between sequence indices. It would be great if you could file a new feature request for this, as I have some follow-up questions about what is in the input data vs. expected as output.

npatki changed the title from "SDV/PAR synthetic time-series dates are often out-of-sequence, repeated or with large gaps" to "Sequence index values should be strictly increasing in the synthetic data" on Jun 30, 2022
npatki added the feature request, data:sequential, and under discussion labels and removed the bug and pending review labels on Jun 30, 2022
npatki removed the under discussion label on Jul 12, 2022
npatki (Contributor) commented Jan 26, 2024

Update on this: We have identified a root cause, which is now linked in #1760. You can follow that issue for an update.

Please note that this may not be the only root cause, so we will keep this issue open until we resolve #1760.

npatki (Contributor) commented Feb 20, 2024

Hi @silvaac,

Good news -- this issue should be fixed in the latest SDV release (1.10.0). Please upgrade for the latest results.

On the NASDAQ demo dataset, I've confirmed that the values are strictly increasing within a sequence.

image

Do note that if your real data contains decreasing values, then your synthetic data would as well -- as the synthesizer learns what is possible based on your real data. Please feel free to reach out if you have any questions. Thanks.
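
Since the synthesizer learns from the real data, it can help to check whether each real sequence is strictly increasing before fitting. The sketch below uses pandas on a small made-up table; the column names `Symbol` and `Date`, the helper name, and the data are assumptions for illustration. Sorting by the sequence index is only one possible preprocessing step, not something this thread prescribes, and it does not remove duplicate values.

```python
import pandas as pd


def strictly_increasing_per_sequence(df, sequence_key, sequence_index):
    """Return a boolean Series, one entry per sequence.

    A sequence passes when its index values never decrease and never repeat.
    """
    def check(values):
        # Non-decreasing plus unique is equivalent to strictly increasing
        return bool(values.is_monotonic_increasing and values.is_unique)

    return df.groupby(sequence_key)[sequence_index].apply(check)


# Hypothetical real data: sequence "BBB" has its dates out of order
real = pd.DataFrame({
    "Symbol": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "Date": pd.to_datetime([
        "2019-01-02", "2019-01-03", "2019-01-04",
        "2019-01-03", "2019-01-02", "2019-01-04",
    ]),
})
result = strictly_increasing_per_sequence(real, "Symbol", "Date")
print(result)

# One possible fix before fitting: sort rows within each sequence
sorted_real = real.sort_values(["Symbol", "Date"]).reset_index(drop=True)
result_sorted = strictly_increasing_per_sequence(sorted_real, "Symbol", "Date")
print(result_sorted)
```

The same helper could be run on the sampled output to confirm that the synthetic sequences are strictly increasing as well.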

P.S. Since the original issue was filed, we've significantly updated the SDV API as well as the docs. Here is the new reference for PARSynthesizer.

npatki closed this as completed on Feb 20, 2024
npatki added the resolution:duplicate and resolution:resolved labels on Feb 20, 2024