Make `TSDataset.to_flatten` faster for big datasets #848

Mr-Geekman · 2022-08-11T12:44:29Z

Before submitting (must do checklist)

Did you read the contribution guide?
Did you update the docs? We use Numpy format for all the methods and classes.
Did you write any new necessary tests?
Did you update the CHANGELOG?

Proposed Changes

See #781.

Closing issues

Closes #781.
Closes #777.

github-actions · 2022-08-11T12:48:25Z

🚀 Deployed on https://deploy-preview-848--etna-docs.netlify.app

Mr-Geekman · 2022-08-11T12:53:05Z

Code for generating the benchmark datasets:

import numpy as np
import pandas as pd
from etna.datasets import generate_ar_df


def load_dataset(
    num_segments: int,
    num_periods: int = 100,
    num_add_blocks: int = 0,
    add_object_category: bool = False,
    make_encoded_category: bool = True,
    random_state: int = 0,
) -> pd.DataFrame:
    rng = np.random.default_rng(random_state)
    df = generate_ar_df(
        periods=num_periods, start_time="2020-01-01", n_segments=num_segments
    )

    for i in range(num_add_blocks):
        # add int column
        df[f"new_int_{i}"] = rng.integers(low=-100, high=100, size=df.shape[0])

        # add float column
        df[f"new_float_{i}"] = rng.uniform(low=-100, high=100, size=df.shape[0])

        # add category column (at least one category, even when num_segments < 10)
        num_categories = max(1, num_segments // 10)
        categories = list(range(num_categories))
        column_values = rng.choice(categories, size=df.shape[0])

        # store the integer codes with the pandas "category" dtype
        if make_encoded_category:
            df[f"new_cat_{i}_cat"] = column_values
            df[f"new_cat_{i}_cat"] = df[f"new_cat_{i}_cat"].astype("category")
        # keep the codes as plain ints (this can be beneficial for some methods)
        else:
            df[f"new_cat_{i}_encoded"] = column_values

    if add_object_category:
        # string-valued categories; avoid an empty list when num_segments < 10
        num_categories = max(1, num_segments // 10)
        categories = [str(cat) for cat in range(num_categories)]
        df["new_obj_cat"] = rng.choice(categories, size=df.shape[0])
        df["new_obj_cat"] = df["new_obj_cat"].astype("category")

    return df
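
The function above only builds the data; the timing code itself is not shown in this comment. As a rough sketch of how such measurements could be collected, the helper below times any zero-argument callable with the standard library. The name `time_call` and the choice of min/median/max are assumptions for illustration, not the harness used for the results in this PR.

```python
import statistics
import time
from typing import Callable, Dict


def time_call(func: Callable[[], object], repeats: int = 5) -> Dict[str, float]:
    """Run ``func`` several times and summarise wall-clock durations in seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")

    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        func()
        durations.append(time.perf_counter() - start)

    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


if __name__ == "__main__":
    # Stand-in workload (hypothetical); in a real benchmark this would be a
    # lambda wrapping the conversion being measured.
    print(time_call(lambda: sum(range(100_000)), repeats=3))
```

For a benchmark like the one here, the callable would presumably wrap the `TSDataset.to_flatten` call on a dataset built from `load_dataset(...)`, repeated for each combination of `num_segments`, `num_add_blocks` and `add_object_category`.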

Results of benchmark

metrics.csv

There is a small mistake in the pictures below: the second plot in each image represents the case where add_object_category=True.

num_add_blocks=0:

num_add_blocks=1:

num_add_blocks=3:

codecov-commenter · 2022-08-11T12:53:53Z

Codecov Report

Merging #848 (ff21234) into master (5675c17) will decrease coverage by 35.29%.
The diff coverage is 100.00%.

@@             Coverage Diff             @@
##           master     #848       +/-   ##
===========================================
- Coverage   84.65%   49.36%   -35.30%     
===========================================
  Files         130      130               
  Lines        7411     7414        +3     
===========================================
- Hits         6274     3660     -2614     
- Misses       1137     3754     +2617

| Impacted Files | Coverage Δ | |
|---|---|---|
| etna/datasets/tsdataset.py | `67.17% <100.00%> (-23.52%)` | ⬇️ |
| etna/commands/__init__.py | `0.00% <0.00%> (-100.00%)` | ⬇️ |
| etna/commands/backtest_command.py | `0.00% <0.00%> (-97.06%)` | ⬇️ |
| etna/commands/forecast_command.py | `0.00% <0.00%> (-94.88%)` | ⬇️ |
| etna/models/utils.py | `12.50% <0.00%> (-87.50%)` | ⬇️ |
| etna/commands/__main__.py | `0.00% <0.00%> (-87.50%)` | ⬇️ |
| etna/libs/pmdarima_utils/arima.py | `16.00% <0.00%> (-84.00%)` | ⬇️ |
| etna/commands/resolvers.py | `0.00% <0.00%> (-80.00%)` | ⬇️ |
| etna/analysis/outliers/density_outliers.py | `22.44% <0.00%> (-75.52%)` | ⬇️ |
| etna/datasets/datasets_generation.py | `27.02% <0.00%> (-72.98%)` | ⬇️ |
| ... and 80 more | | |

martins0n

👍

d.a.bunin added 2 commits August 11, 2022 15:28

Write new code for TSDataset.to_flatten, update tests for it

0772c9f

Fix test on to_flatten for old versions of pandas (like 1.1.5)

6ad66ec

Mr-Geekman self-assigned this Aug 11, 2022

Update changelog

9e9ccba

Reformat code

d4e062a

Change order of segment column

fb301fb

martins0n self-requested a review August 12, 2022 07:30

Merge branch 'master' into issue-781

ff21234

martins0n approved these changes Aug 15, 2022

martins0n merged commit 5a6bdcc into master Aug 15, 2022
