Make TSDataset.to_flatten faster for big datasets #848

Merged (6 commits) on Aug 15, 2022

Mr-Geekman
Contributor

@Mr-Geekman Mr-Geekman commented Aug 11, 2022

Before submitting (must do checklist)

  • Did you read the contribution guide?
  • Did you update the docs? We use Numpy format for all the methods and classes.
  • Did you write any new necessary tests?
  • Did you update the CHANGELOG?

Proposed Changes

See #781.

Closing issues

Closes #781.
Closes #777.
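
For readers unfamiliar with the operation being optimized: flattening turns a wide frame (timestamps as the index, a two-level `segment`/`feature` column index) into a long frame with one row per timestamp and segment. The sketch below is not the code from this PR. It is a minimal pandas/NumPy illustration under assumed names (`make_wide`, `flatten_with_stack`, `flatten_with_numpy`, and the `exog` feature are all hypothetical) that contrasts a generic `stack`-based reshape with building the long columns directly from NumPy arrays.

```python
import numpy as np
import pandas as pd


def make_wide(n_segments: int = 3, n_periods: int = 4, seed: int = 0) -> pd.DataFrame:
    """Build a small wide frame with (segment, feature) columns. Hypothetical helper."""
    rng = np.random.default_rng(seed)
    timestamps = pd.date_range("2020-01-01", periods=n_periods, freq="D", name="timestamp")
    segments = [f"segment_{i}" for i in range(n_segments)]
    columns = pd.MultiIndex.from_product(
        [segments, ["exog", "target"]], names=["segment", "feature"]
    )
    values = rng.normal(size=(n_periods, len(columns)))
    return pd.DataFrame(values, index=timestamps, columns=columns)


def flatten_with_stack(wide: pd.DataFrame) -> pd.DataFrame:
    """Generic reshape: move the segment level from the columns into the rows."""
    long = wide.stack(level="segment")
    long.columns.name = None
    return long.reset_index().sort_values(["segment", "timestamp"], ignore_index=True)


def flatten_with_numpy(wide: pd.DataFrame) -> pd.DataFrame:
    """Build each long column directly from arrays, without a generic reshape."""
    segments = wide.columns.get_level_values("segment").unique().sort_values()
    features = wide.columns.get_level_values("feature").unique()
    n_timestamps = len(wide.index)
    data = {
        # the timestamps repeat once per segment
        "timestamp": np.tile(wide.index.to_numpy(), len(segments)),
        # each segment label repeats once per timestamp
        "segment": np.repeat(segments.to_numpy(), n_timestamps),
    }
    for feature in features:
        block = wide.xs(feature, axis=1, level="feature")[segments]
        # column-major ravel concatenates the segments one after another
        data[feature] = block.to_numpy().ravel(order="F")
    return pd.DataFrame(data)


wide = make_wide()
long_stack = flatten_with_stack(wide)
long_numpy = flatten_with_numpy(wide)
```

Both functions return the same rows in the same order (sorted by segment, then timestamp); only the way the long columns are assembled differs.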

@Mr-Geekman Mr-Geekman self-assigned this Aug 11, 2022
@github-actions

github-actions bot commented Aug 11, 2022

🚀 Deployed on https://deploy-preview-848--etna-docs.netlify.app

@github-actions github-actions bot temporarily deployed to pull request August 11, 2022 12:48 Inactive
@Mr-Geekman
Contributor Author

Mr-Geekman commented Aug 11, 2022

Code for generating the benchmark datasets:

import numpy as np
import pandas as pd
from etna.datasets import generate_ar_df


def load_dataset(
    num_segments: int,
    num_periods: int = 100,
    num_add_blocks: int = 0,
    add_object_category: bool = False,
    make_encoded_category: bool = True,
    random_state: int = 0,
) -> pd.DataFrame:
    rng = np.random.default_rng(random_state)
    df = generate_ar_df(
        periods=num_periods, start_time="2020-01-01", n_segments=num_segments
    )

    for i in range(num_add_blocks):
        # add int column
        df[f"new_int_{i}"] = rng.integers(low=-100, high=100, size=df.shape[0])

        # add float column
        df[f"new_float_{i}"] = rng.uniform(low=-100, high=100, size=df.shape[0])

        # add category column
        num_categories = num_segments // 10
        categories = list(range(num_categories))
        column_values = rng.choice(categories, size=df.shape[0])

        # store the encoded categories with the pandas "category" dtype
        if make_encoded_category:
            df[f"new_cat_{i}_cat"] = column_values
            df[f"new_cat_{i}_cat"] = df[f"new_cat_{i}_cat"].astype("category")
        # keep the encoded categories as plain ints (can be beneficial for some methods)
        else:
            df[f"new_cat_{i}_encoded"] = column_values

    if add_object_category:
        num_categories = num_segments // 10
        categories = [str(cat) for cat in range(num_categories)]
        df["new_obj_cat"] = rng.choice(categories, size=df.shape[0])
        df["new_obj_cat"] = df["new_obj_cat"].astype("category")

    return df

Results of benchmark

metrics.csv

There is a small mistake in the pictures below: the second plot in each image shows the case where add_object_category=True.

num_add_blocks=0:
image

num_add_blocks=1:
image

num_add_blocks=3:
image
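
The script that produced metrics.csv is not shown in this thread. As a rough sketch of how such timings could be collected, the helper below (the name `time_call` and its `repeats` parameter are hypothetical) measures the median wall-clock time of a callable using only the standard library.

```python
import time
from statistics import median


def time_call(func, *args, repeats: int = 5, **kwargs) -> float:
    """Return the median wall-clock time in seconds over `repeats` calls of `func`."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return median(timings)


# Self-contained demo on a cheap built-in; any callable can be timed the same way.
elapsed = time_call(sorted, list(range(10_000, 0, -1)), repeats=3)
```

Assuming `to_flatten` is callable as `TSDataset.to_flatten(wide_df)`, a measurement for one dataset configuration would look like `time_call(TSDataset.to_flatten, wide_df)`, repeated for each combination of `num_segments`, `num_add_blocks`, and `add_object_category`. The median is used here because it is less sensitive to occasional slow runs than the mean.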

@codecov-commenter

codecov-commenter commented Aug 11, 2022

Codecov Report

Merging #848 (ff21234) into master (5675c17) will decrease coverage by 35.29%.
The diff coverage is 100.00%.

@@             Coverage Diff             @@
##           master     #848       +/-   ##
===========================================
- Coverage   84.65%   49.36%   -35.30%     
===========================================
  Files         130      130               
  Lines        7411     7414        +3     
===========================================
- Hits         6274     3660     -2614     
- Misses       1137     3754     +2617     
Impacted Files                               Coverage Δ
etna/datasets/tsdataset.py                   67.17% <100.00%> (-23.52%) ⬇️
etna/commands/__init__.py                    0.00% <0.00%> (-100.00%) ⬇️
etna/commands/backtest_command.py            0.00% <0.00%> (-97.06%) ⬇️
etna/commands/forecast_command.py            0.00% <0.00%> (-94.88%) ⬇️
etna/models/utils.py                         12.50% <0.00%> (-87.50%) ⬇️
etna/commands/__main__.py                    0.00% <0.00%> (-87.50%) ⬇️
etna/libs/pmdarima_utils/arima.py            16.00% <0.00%> (-84.00%) ⬇️
etna/commands/resolvers.py                   0.00% <0.00%> (-80.00%) ⬇️
etna/analysis/outliers/density_outliers.py   22.44% <0.00%> (-75.52%) ⬇️
etna/datasets/datasets_generation.py         27.02% <0.00%> (-72.98%) ⬇️
... and 80 more

@github-actions github-actions bot temporarily deployed to pull request August 11, 2022 13:00 Inactive
@github-actions github-actions bot temporarily deployed to pull request August 11, 2022 13:36 Inactive
@martins0n martins0n self-requested a review August 12, 2022 07:30
@github-actions github-actions bot temporarily deployed to pull request August 12, 2022 09:59 Inactive
Contributor

@martins0n martins0n left a comment

👍

@martins0n martins0n merged commit 5a6bdcc into master Aug 15, 2022
Successfully merging this pull request may close these issues.

Make TSDataset.to_flatten faster
[BUG] Fix to_flatten on pandas==1.1.5
3 participants