ARIMA - Add support for missing observations and padding #4058

Nyrio · 2021-07-15T17:42:01Z

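This PR adds support for missing observations and for padding at the start of series in variable-length batches. Example:

![missing_obs_0](https://user-images.githubusercontent.com/17441062/125832072-1ff903c9-088e-4d77-9b17-be365890d982.png)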

Note: I had to change the ARIMA tests because I compute the initial parameter estimate differently from statsmodels, which the tests use as a reference. statsmodels drops all missing observations before its initial least-squares estimation; I fill them with naive replacements instead. This preserves the temporal relationships in the data and gives a much better initial estimate, and often a better final fit, according to some MASE measurements I made. I therefore updated the integration test to compare MASE values and pass if we are approximately the same as or better than statsmodels.
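To make the note above concrete, here is a minimal pure-Python sketch of the two ideas it mentions: filling gaps with naive replacements, and comparing fits with the MASE. The helper names (`naive_fill`, `mase`, `passes`), the exact fill rule (previous valid value, then first valid value for leading gaps), and the 5% tolerance are illustrative assumptions, not the code or thresholds used in this PR.

```python
import math


def naive_fill(series):
    """Fill NaNs with the previous valid value (hypothetical helper).

    Leading NaNs, such as start padding, take the first valid value.
    The replacement rule used in the PR may differ.
    """
    filled = list(series)
    # Forward pass: carry the last valid value forward.
    last = math.nan
    for i, x in enumerate(filled):
        if math.isnan(x):
            filled[i] = last
        else:
            last = x
    # Backward pass: fill any remaining leading gaps.
    nxt = math.nan
    for i in range(len(filled) - 1, -1, -1):
        if math.isnan(filled[i]):
            filled[i] = nxt
        else:
            nxt = filled[i]
    return filled


def mase(y_train, y_test, y_pred):
    """Mean absolute scaled error with a non-seasonal naive baseline.

    Assumes y_train contains no missing values.
    """
    mae = sum(abs(a - p) for a, p in zip(y_test, y_pred)) / len(y_test)
    # Scale: in-sample MAE of the one-step naive forecast y[t] = y[t-1].
    scale = sum(
        abs(y_train[i] - y_train[i - 1]) for i in range(1, len(y_train))
    ) / (len(y_train) - 1)
    return mae / scale


def passes(mase_ours, mase_ref, tol=0.05):
    """Pass if our MASE is about the same as or better than the reference.

    The tolerance value is an assumption made for this sketch.
    """
    return mase_ours <= mase_ref * (1 + tol)
```

Because a lower MASE is better, the one-sided check in `passes` accepts any result that beats the reference and only rejects results that are worse by more than the tolerance.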

ajschmidt8 · 2021-09-02T14:35:28Z

Removing ops-codeowners from the required reviews since it doesn't seem there are any file changes that we're responsible for. Feel free to add us back if necessary.

Nyrio · 2021-09-02T16:16:24Z

I fixed a bug where numerical issues could produce NaNs, changed one of the new test cases to a more stable ARIMA order, and relaxed the tolerance of the confidence interval tests.

The PR should be ready to merge if CI passes.

Nyrio · 2021-09-03T08:33:54Z

rerun tests

Nyrio · 2021-09-10T12:04:19Z

I have fixed some C++ tests that were failing due to a change in the Jones transform.

Nyrio · 2021-09-20T10:40:53Z

@dantegd Can you please update your review?

Nyrio · 2021-09-21T10:54:51Z

rerun tests

codecov-commenter · 2021-09-21T15:37:26Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@d657178).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-21.10    #4058   +/-   ##
===============================================
  Coverage                ?   86.20%           
===============================================
  Files                   ?      231           
  Lines                   ?    19072           
  Branches                ?        0           
===============================================
  Hits                    ?    16441           
  Misses                  ?     2631           
  Partials                ?        0

| Flag | Coverage Δ |
|---|---|
| dask | `47.75% <0.00%> (?)` |
| non-dask | `78.76% <0.00%> (?)` |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Last update d657178...aa544b7.

dantegd · 2021-09-22T17:39:15Z

@gpucibot merge

The formula that we have been using since I added support for confidence intervals in ARIMA is slightly different than the one used in statsmodels. The difference is in particular quite pronounced when datasets have missing observations, which pushed me to raise tolerance for the intervals unit tests when I added test cases in the recent PR #4058. In this PR, I change our calculation to match statsmodels, and decrease the corresponding test tolerance, as we now have a strict match with statsmodels.

Previous formula:

```python
lower_t = fc_t - sqrt(2) * erfinv(level) * sqrt(F_t * mean(v_i**2 / F_i))
upper_t = fc_t + sqrt(2) * erfinv(level) * sqrt(F_t * mean(v_i**2 / F_i))
```

New formula:

```python
lower_t = fc_t - sqrt(2) * erfinv(level) * sqrt(F_t)
upper_t = fc_t + sqrt(2) * erfinv(level) * sqrt(F_t)
```

Authors:
- Louis Sugy (https://github.com/Nyrio)

Approvers:
- Tamas Bela Feher (https://github.com/tfeher)
- Dante Gama Dessavre (https://github.com/dantegd)

URL: #4248
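The two interval formulas from #4248 can be sketched in plain Python as follows. This assumes `F_t` is the forecast variance at step `t`, and that `v_i` and `F_i` are the in-sample innovations and their variances; the message does not define these symbols, so treat that reading as an assumption. The function names are hypothetical. The sketch uses the identity `sqrt(2) * erfinv(level) == NormalDist().inv_cdf((1 + level) / 2)` so that only the standard library is needed.

```python
from math import sqrt
from statistics import NormalDist


def z_value(level):
    """Two-sided standard normal quantile, equal to sqrt(2) * erfinv(level)."""
    return NormalDist().inv_cdf((1 + level) / 2)


def interval_new(fc_t, F_t, level):
    """New formula: half-width depends only on the forecast variance F_t."""
    half = z_value(level) * sqrt(F_t)
    return fc_t - half, fc_t + half


def interval_old(fc_t, F_t, level, v, F):
    """Previous formula: F_t is rescaled by mean(v_i**2 / F_i)."""
    scale = sum(v_i**2 / F_i for v_i, F_i in zip(v, F)) / len(v)
    half = z_value(level) * sqrt(F_t * scale)
    return fc_t - half, fc_t + half


# Example: a 95% interval around a forecast of 10 with variance 4.
lower, upper = interval_new(10.0, 4.0, 0.95)
```

The two versions agree whenever `mean(v_i**2 / F_i)` equals 1 and diverge otherwise, which is consistent with the message's remark that the gap is most visible on datasets with missing observations.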

This PR allows support for missing observations and padding at the start for variable-length batch. Example:

![missing_obs_0](https://user-images.githubusercontent.com/17441062/125832072-1ff903c9-088e-4d77-9b17-be365890d982.png)

Note: I had to change ARIMA tests because I used a different method than statsmodels (which is used as a reference in tests) to compute the initial parameter estimation. They cut all missing observations for their initial least-square estimation, and I decided to fill them with naive replacements instead, so I keep the temporal relationships in the data and have a much better initial estimate and often a better fit in the end, according to some MASE measurements I made. So I updated the integration test to use the MASE and pass if we are approximately the same _or better_ than statsmodels.

Authors:
- Louis Sugy (https://github.com/Nyrio)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- Tamas Bela Feher (https://github.com/tfeher)
- Dante Gama Dessavre (https://github.com/dantegd)
- Ray Douglass (https://github.com/raydouglass)

URL: rapidsai#4058

Nyrio added 28 commits June 16, 2021 05:41

Initial rewrite of the Kalman filter

8edacaf

Remove unused sparse matrices

1f1bce5

More efficient block-level gemv

8a0ac86

Shared mem Z and alpha

a4f8fa2

Shared mem K

c21b0a0

Rewrite loads for block gemm with support for vectorization

b1a27f6

Tune Kalman loop gemm policy

ffdb182

Clean-up, add docs and add test

9c7200a

Merge branch 'branch-21.08' into opt-kalman-loop

eeada6f

Merge branch 'branch-21.08' into opt-kalman-loop

1aa9b85

Test improvements + small fixes

c07967c

Test multiple policies for block gemm and gemv

5f048b4

Overlap host and device computations in block prims tests

5e84fcf

Test with and without preload for gemv and xAx'

3073f8a

Include style fix

c4685f8

Compute log-likelihood directly in Kalman kernel

ced2b86

Use predictions directly instead of residuals

8c8eaf2

Detecting missing observations

7ecaddf

Use naive replacements for missing observations in initial parameter estimate

801c729

Add support for missing observations in block-local Kalman kernel

785b0a2

Add support for missing observations in thread-local Kalman kernel

d39252c

Update pytests for missing observations

5648006

Improve fillna primitive with a scan

1d770db

Apply new formatting rules to facilitate merging

c1c0a88

Merge branch 'branch-21.08' into fea-missing-observations

42c3c91

Formatting fix, improve fillna kernel and remove unused args in tests

9d073b7

Update ARIMA notebook

0d9db3b

Copy dataframe in arima notebook before modifying it

1229bc3

Nyrio requested review from a team as code owners July 15, 2021 17:42

ajschmidt8 removed the request for review from a team September 2, 2021 14:35

Fix copyright headers to pass checks

0979f33

github-actions bot removed the "conda" label Sep 2, 2021

Nyrio added 5 commits September 8, 2021 02:44

Provide right stream to scan in fillna

06f6247

Improve initialization with missing observations

271599c

Add error in AutoARIMA when missing observations are detected (not yet supported)

5a1e975

Merge branch 'branch-21.10' into fea-missing-observations

d80f1bb

Fix Jones transform test wrt. clamping parameters

cb419dc

Nyrio added the "4 - Waiting on Reviewer" label and removed the "4 - Waiting on Author" label Sep 10, 2021

Nyrio added 2 commits September 20, 2021 04:14

Modify dataset for missing observations stress test

993126e

Merge branch 'branch-21.10' into fea-missing-observations

aa544b7

Nyrio mentioned this pull request Sep 22, 2021

Add support for exogenous variables to ARIMA #4221

Merged

dantegd approved these changes Sep 22, 2021

raydouglass approved these changes Sep 22, 2021

rapids-bot bot merged commit f432537 into rapidsai:branch-21.10 Sep 22, 2021

Nyrio mentioned this pull request Sep 29, 2021

Change calculation of ARIMA confidence intervals #4248

Merged
