KeyError: 'date' on synthetic data #281

Closed
marrrcin opened this issue Jul 17, 2023 · 7 comments

marrrcin commented Jul 17, 2023

Hi,
I'm exploring the use of your library and I've stumbled across an error when working with my data.

Popmon version: 1.4.5
Error:

 in <lambda>(plot)
    157             # filter out potential empty plots
    158             plots = [e for e in plots if len(e)]
--> 159             plots = sorted(plots, key=lambda plot: plot["date"])
    160 
    161             # basic checks for histograms

KeyError: 'date'
Full stack trace: ⬇️
KeyError                                  Traceback (most recent call last)
<ipython-input-39-c55c117796f8> in <cell line: 1>()
----> 1 report = popmon.df_stability_report(
      2     df,
      3     time_axis="time",
      4     time_width="1w",
      5 )

7 frames
/usr/local/lib/python3.10/dist-packages/popmon/pipeline/report.py in df_stability_report(df, settings, time_width, time_offset, var_dtype, reference, split, **kwargs)
    196 
    197     # generate data stability report
--> 198     return stability_report(
    199         hists=hists,
    200         settings=settings,

/usr/local/lib/python3.10/dist-packages/popmon/pipeline/report.py in stability_report(hists, settings, reference, **kwargs)
     73     # execute reporting pipeline
     74     pipeline = get_report_pipeline_class(settings.reference_type, reference)(**cfg)
---> 75     result = pipeline.transform(datastore)
     76 
     77     stability_report_result = StabilityReport(datastore=result)

/usr/local/lib/python3.10/dist-packages/popmon/base/pipeline.py in transform(self, datastore)
     65         for module in self.modules:
     66             self.logger.debug(f"transform {module.__class__.__name__}")
---> 67             datastore = module.transform(datastore)
     68         return datastore
     69 

/usr/local/lib/python3.10/dist-packages/popmon/pipeline/report_pipelines.py in transform(self, datastore)
    255     def transform(self, datastore):
    256         self.logger.info(f'Generating report "{self.store_key}".')
--> 257         return super().transform(datastore)

/usr/local/lib/python3.10/dist-packages/popmon/base/pipeline.py in transform(self, datastore)
     65         for module in self.modules:
     66             self.logger.debug(f"transform {module.__class__.__name__}")
---> 67             datastore = module.transform(datastore)
     68         return datastore
     69 

/usr/local/lib/python3.10/dist-packages/popmon/base/module.py in _transform(self, datastore)
     49 
     50         # transformation
---> 51         outputs = func(self, *list(inputs.values()))
     52 
     53         # transform returns None if no update needs to be made

/usr/local/lib/python3.10/dist-packages/popmon/visualization/histogram_section.py in transform(self, data_obj, sections)
    157             # filter out potential empty plots
    158             plots = [e for e in plots if len(e)]
--> 159             plots = sorted(plots, key=lambda plot: plot["date"])
    160 
    161             # basic checks for histograms

/usr/local/lib/python3.10/dist-packages/popmon/visualization/histogram_section.py in <lambda>(plot)
    157             # filter out potential empty plots
    158             plots = [e for e in plots if len(e)]
--> 159             plots = sorted(plots, key=lambda plot: plot["date"])
    160 
    161             # basic checks for histograms

KeyError: 'date'

Reproduction steps:
https://colab.research.google.com/drive/1N59kn7C9LN6W9AJkfz9SougiZoOMM0bn?usp=sharing

Additional information:
I'm using a function to generate synthetic data (see colab). When I generate less data (e.g. 200 days), the code works fine, but past some unknown threshold (around 360 days) it breaks.
I've also tried changing the time_width parameter: sometimes it starts to work with 2w, sometimes it works with 1d, but I haven't found any pattern.

Also note that it happens both for self-referencing data as well as data with a reference set (see second part of the colab).

Expected result:
Monitoring report generates properly.
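
The failing line from the traceback can be reproduced outside popmon. This is a minimal sketch, assuming the plot entries are plain dicts: the `len(e)` filter on line 158 removes only empty entries, so a non-empty entry without a "date" key still reaches the sort and raises. The dict contents and both helper names are hypothetical, and the defensive variant is only an illustration, not the fix that popmon released.

```python
def sort_plots_strict(plots):
    # Mirrors lines 158-159 of histogram_section.py as shown in the traceback.
    plots = [e for e in plots if len(e)]
    return sorted(plots, key=lambda plot: plot["date"])


def sort_plots_defensive(plots):
    # Hypothetical alternative: keep only entries that actually carry a "date".
    plots = [e for e in plots if "date" in e]
    return sorted(plots, key=lambda plot: plot["date"])


# Made-up entries; the real structure of popmon's plot dicts is not shown in this thread.
plots = [
    {"date": "2022-01-10", "plot": "<svg>"},
    {},                               # empty: removed by the len() filter
    {"name": "feature_anomalies"},    # non-empty but has no "date" key
    {"date": "2022-01-03", "plot": "<svg>"},
]

try:
    sort_plots_strict(plots)
    strict_error = None
except KeyError as exc:
    strict_error = exc

dates = [p["date"] for p in sort_plots_defensive(plots)]
print(repr(strict_error), dates)
```

Silently dropping entries would hide whatever produced the incomplete entry in the first place, so this is useful for understanding the error rather than as a patch.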

@marrrcin marrrcin changed the title KeyError: 'date' on a synthetic data KeyError: 'date' on synthetic data Jul 17, 2023
@sbrugman
Collaborator

Thanks for reporting Marcin, will look into it

@marrrcin
Author

@sbrugman an update from my side:
It seems like the following lines in the data generator are causing popmon to break:

    feature_anomalies = np.random.normal(loc=0.5, scale=0.05, size=num_days)
    anomaly_indices = np.random.choice(num_days, num_anomalies, replace=False)
    feature_anomalies[anomaly_indices] = np.random.uniform(low=-5, high=0.1, size=num_anomalies)
    feature_out_of_range = np.random.uniform(low=0, high=1, size=num_days)
    out_of_range_indices = np.random.choice(num_days, num_out_of_range, replace=False)
    feature_out_of_range[out_of_range_indices] = np.random.uniform(low=2, high=3, size=num_out_of_range)

Initially, I thought it had something to do with memory allocation / assignments, but it seems the range of values is the problem. If I increase num_anomalies to something closer to half of my examples (that is, generating more examples that are e.g. out of range), the code proceeds normally. It should work in both cases, though.
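
To make the "range of values" observation concrete, here is a small standard-library sketch of how a handful of outliers can stretch a histogram's range. It assumes equal-width bins over the full min-max range, which is an assumption for illustration and not necessarily how popmon bins its data. `bin_occupancy` is a hypothetical helper, and the parameters mirror the `feature_anomalies` lines above.

```python
import random


def bin_occupancy(values, num_bins):
    """Count values per equal-width bin spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # Clamp so the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return counts


rng = random.Random(666)
num_days, num_anomalies = 300, 10

# Bulk of the data: tightly clustered around 0.5.
values = [rng.gauss(0.5, 0.05) for _ in range(num_days)]
# A few anomalies spread over a much wider interval.
for i in rng.sample(range(num_days), num_anomalies):
    values[i] = rng.uniform(-5, 0.1)

counts = bin_occupancy(values, num_bins=40)
empty_bins = sum(1 for c in counts if c == 0)
print(f"{empty_bins} of {len(counts)} bins are empty; fullest bin holds {max(counts)} values")
```

With only a few anomalies, nearly all values fall into a few bins while most of the range stays empty. Whether that sparsity is what triggers the error inside popmon is not established in this thread.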

@sbrugman
Collaborator

@marrrcin Could you please provide the minimum reproducible code here as a snippet? Policy doesn't allow us to use colab...

@marrrcin
Author

Absolutely!

import pandas as pd
import popmon
import numpy as np

def generate_mock_data(num_days, num_anomalies, num_out_of_range, random_state=666, start_date='1/1/2022'):
    np.random.seed(random_state)
    time = pd.date_range(start=start_date, periods=num_days, freq='D')
    feature_increasing = np.arange(1, num_days+1)
    feature_decreasing = np.arange(1000000, 1000000-num_days, -1)
    feature_stable = np.random.normal(loc=0.5, scale=0.05, size=num_days)
    feature_unstable = np.random.normal(loc=0.5, scale=2.0, size=num_days)
    feature_anomalies = np.random.normal(loc=0.5, scale=0.05, size=num_days)
    anomaly_indices = np.random.choice(num_days, num_anomalies, replace=False)
    feature_anomalies[anomaly_indices] = np.random.uniform(low=-5, high=0.1, size=num_anomalies)
    feature_out_of_range = np.random.uniform(low=0, high=1, size=num_days)
    out_of_range_indices = np.random.choice(num_days, num_out_of_range, replace=False)
    feature_out_of_range[out_of_range_indices] = np.random.uniform(low=2, high=3, size=num_out_of_range)
    trend_change = np.concatenate([np.linspace(0, 3.0, num_days//2+(num_days % 2)), np.linspace(3.0, 0, num_days//2)]) + np.random.normal(loc=0, scale=0.01, size=num_days)
    cyclic_feature = np.sin(np.linspace(0, 4*np.pi, num_days)) + np.random.normal(loc=0, scale=0.1, size=num_days)
    data = {'time': time, 'feature_increasing': feature_increasing, 'feature_decreasing': feature_decreasing, 'feature_stable': feature_stable, 'feature_unstable': feature_unstable, 'feature_anomalies': feature_anomalies, 'feature_out_of_range': feature_out_of_range, 'trend_change': trend_change, 'cyclic_feature': cyclic_feature}
    df = pd.DataFrame(data)
    return df


df = generate_mock_data(num_days=300, num_anomalies=10, num_out_of_range=13)


report = popmon.df_stability_report(
    df,
    time_axis="time",
    time_width="1w",
)
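
To narrow down the threshold mentioned earlier in the thread (works at 200 days, breaks somewhere around 360, and varies with time_width), a sweep over both parameters may help. This is a sketch: `sweep` and its parameter names are hypothetical, and it takes the data generator and the report call as arguments so it assumes nothing about popmon beyond the call shown above.

```python
from typing import Any, Callable, Dict, Iterable, Optional, Tuple


def sweep(
    make_df: Callable[[int], Any],
    make_report: Callable[[Any, str], Any],
    day_counts: Iterable[int],
    time_widths: Iterable[str],
) -> Dict[Tuple[int, str], Optional[str]]:
    """Run make_report for every (num_days, time_width) pair.

    Returns a dict mapping each pair to None on success,
    or to "ExceptionName: message" on failure.
    """
    widths = list(time_widths)  # materialize so it can be reused per day count
    outcomes: Dict[Tuple[int, str], Optional[str]] = {}
    for num_days in day_counts:
        df = make_df(num_days)
        for width in widths:
            try:
                make_report(df, width)
                outcomes[(num_days, width)] = None
            except Exception as exc:  # broad on purpose: we only record what failed
                outcomes[(num_days, width)] = f"{type(exc).__name__}: {exc}"
    return outcomes
```

With the snippet above, `make_df` could be `lambda n: generate_mock_data(num_days=n, num_anomalies=10, num_out_of_range=13)` and `make_report` could be `lambda df, w: popmon.df_stability_report(df, time_axis="time", time_width=w)`, with `day_counts` such as `[200, 300, 360]` and `time_widths` such as `["1d", "1w", "2w"]`.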

@sbrugman
Collaborator

Can confirm this is a bug with the histogram plotting with outliers, will release a patch soon!

@sbrugman
Collaborator

@marrrcin Release is out, feel free to open up another issue if you encounter other problems. Thanks a lot!

@marrrcin
Author

Thanks for the quick fix, I confirm that it works now!
