ARIMA - Add support for missing observations and padding #4058
I fixed a numerical issue that could result in NaNs, changed one of the new test cases to a more stable ARIMA order, and relaxed the tolerance of the confidence interval tests. The PR should be ready to merge if CI passes.
rerun tests
I have fixed some C++ tests that were failing due to a change in the Jones transform.
@dantegd Can you please update your review?
rerun tests
Codecov Report
@@ Coverage Diff @@
## branch-21.10 #4058 +/- ##
===============================================
Coverage ? 86.20%
===============================================
Files ? 231
Lines ? 19072
Branches ? 0
===============================================
Hits ? 16441
Misses ? 2631
Partials ? 0
@gpucibot merge
The formula that we have been using since I added support for confidence intervals in ARIMA is slightly different from the one used in statsmodels. The difference is particularly pronounced when datasets have missing observations, which pushed me to raise the tolerance of the interval unit tests when I added test cases in the recent PR #4058. In this PR, I change our calculation to match statsmodels and decrease the corresponding test tolerance, as we now have a strict match with statsmodels.

Previous formula:

```python
lower_t = fc_t - sqrt(2) * erfinv(level) * sqrt(F_t * mean(v_i**2 / F_i))
upper_t = fc_t + sqrt(2) * erfinv(level) * sqrt(F_t * mean(v_i**2 / F_i))
```

New formula:

```python
lower_t = fc_t - sqrt(2) * erfinv(level) * sqrt(F_t)
upper_t = fc_t + sqrt(2) * erfinv(level) * sqrt(F_t)
```

Authors:
- Louis Sugy (https://github.com/Nyrio)

Approvers:
- Tamas Bela Feher (https://github.com/tfeher)
- Dante Gama Dessavre (https://github.com/dantegd)

URL: #4248
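The two formulas above can be compared numerically. The following is a minimal sketch, not the cuML implementation. It assumes `fc_t` is the point forecast, `F_t` the forecast variance at step `t`, and `v_i`, `F_i` the in-sample residuals and their variances (these interpretations are not spelled out in the PR text). It relies on the identity sqrt(2) * erfinv(level) = Φ⁻¹((1 + level) / 2) so that only the standard library is needed. The function names and sample numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean


def z_value(level):
    """Two-sided normal quantile; equal to sqrt(2) * erfinv(level)."""
    return NormalDist().inv_cdf((1.0 + level) / 2.0)


def interval_new(fc_t, F_t, level):
    """New formula: the half-width depends only on the forecast variance F_t."""
    half = z_value(level) * sqrt(F_t)
    return fc_t - half, fc_t + half


def interval_old(fc_t, F_t, level, v, F):
    """Previous formula: F_t is rescaled by mean(v_i**2 / F_i)."""
    scale = fmean([v_i ** 2 / F_i for v_i, F_i in zip(v, F)])
    half = z_value(level) * sqrt(F_t * scale)
    return fc_t - half, fc_t + half


# Hypothetical values, for illustration only.
v = [0.5, -1.5, 1.0, -0.8]  # assumed in-sample residuals
F = [1.0, 1.2, 1.1, 1.0]    # assumed residual variances
lower_new, upper_new = interval_new(10.0, 2.0, 0.95)
lower_old, upper_old = interval_old(10.0, 2.0, 0.95, v, F)
print(f"new: [{lower_new:.3f}, {upper_new:.3f}]")
print(f"old: [{lower_old:.3f}, {upper_old:.3f}]")
```

Under these assumptions, the two intervals coincide exactly when `mean(v_i**2 / F_i)` equals 1. Otherwise the previous formula widens or narrows the interval by the square root of that factor.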
This PR adds support for missing observations and padding at the start for variable-length batches.

Example:

![missing_obs_0](https://user-images.githubusercontent.com/17441062/125832072-1ff903c9-088e-4d77-9b17-be365890d982.png)

Note: I had to change the ARIMA tests because I used a different method than statsmodels (which is used as a reference in tests) to compute the initial parameter estimate. Statsmodels drops all missing observations for its initial least-squares estimation. I decided to fill them with naive replacements instead, which keeps the temporal relationships in the data and gives a much better initial estimate and often a better final fit, according to some MASE measurements I made. I therefore updated the integration test to use the MASE and to pass if we are approximately the same _or better_ than statsmodels.

Authors:
- Louis Sugy (https://github.com/Nyrio)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- Tamas Bela Feher (https://github.com/tfeher)
- Dante Gama Dessavre (https://github.com/dantegd)
- Ray Douglass (https://github.com/raydouglass)

URL: rapidsai#4058
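The acceptance criterion described in the note above (pass when the MASE is approximately the same as, or better than, the reference) can be sketched as follows. This is an illustration under assumptions, not the test code from the PR. `mase` uses the common definition: the forecast mean absolute error scaled by the in-sample mean absolute error of a one-step naive forecast. The helper `passes`, its `tolerance` parameter, and the sample data are hypothetical.

```python
from statistics import fmean


def mase(y_train, y_test, y_pred):
    """Mean absolute scaled error with a one-step naive in-sample baseline."""
    forecast_mae = fmean([abs(a - p) for a, p in zip(y_test, y_pred)])
    naive_mae = fmean([abs(b - a) for a, b in zip(y_train, y_train[1:])])
    return forecast_mae / naive_mae


def passes(mase_candidate, mase_reference, tolerance=0.05):
    """Pass if the candidate is about as good as the reference, or better.

    `tolerance` is a hypothetical relative slack, not a value from the PR.
    """
    return mase_candidate <= mase_reference * (1.0 + tolerance)


# Hypothetical data, for illustration only.
y_train = [1.0, 2.0, 4.0, 7.0, 11.0]
y_test = [16.0, 22.0]
pred_candidate = [15.0, 21.0]  # stands in for the model under test
pred_reference = [14.0, 20.0]  # stands in for the reference model

m_candidate = mase(y_train, y_test, pred_candidate)
m_reference = mase(y_train, y_test, pred_reference)
print(passes(m_candidate, m_reference))
```

A one-sided check of this kind does not fail the test when the candidate fits better than the reference, which a symmetric closeness assertion would do.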