
Add the sp parameter to ARIMA models #3597

Merged · 9 commits merged into main on Jul 8, 2022

Conversation

@eccabay (Contributor) commented Jun 30, 2022

A bit of information on setting this parameter: http://alkaline-ml.com/pmdarima/tips_and_tricks.html#period

I tried to offer users the flexibility to set it themselves if they know it, while also adding quick default values for common time series frequencies.

Perf tests are complete.
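
For background, the linked pmdarima page describes this quantity as the seasonal period m, which is what sp corresponds to here. Purely as an illustration of that underlying knob (not this PR's evalml API), a minimal pmdarima sketch for monthly data with a yearly cycle might look like:

import pmdarima as pm

# Monthly series with a yearly seasonal cycle, so the period is 12.
y = pm.datasets.load_wineind()

# seasonal=True plus m=12 makes auto_arima search SARIMA orders with a
# 12-step seasonal period; larger m values make fitting markedly slower.
model = pm.auto_arima(y, seasonal=True, m=12, suppress_warnings=True)
print(model.summary())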

codecov bot commented Jun 30, 2022

Codecov Report

Merging #3597 (cfdf6c1) into main (26896e9) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #3597     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        335     335             
  Lines      33387   33456     +69     
=======================================
+ Hits       33258   33327     +69     
  Misses       129     129             
Impacted Files Coverage Δ
...omponents/estimators/regressors/arima_regressor.py 100.0% <100.0%> (ø)
...alml/tests/component_tests/test_arima_regressor.py 100.0% <100.0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data


@eccabay marked this pull request as ready for review July 1, 2022 12:27

@jeremyliweishih (Collaborator) left a comment

Great work on this @eccabay! Just some additional testing suggestions, but let's wait to review the perf test results before merging!

return 1
freq_mappings = {
"D": 7,
"W": 52,

Contributor:

One of the main reasons I never implemented the mapping of sp was because of how ridiculously long ARIMA would take to fit on sp values above 12 (the R implementation seems to be much faster). I like this approach of mapping to at least daily, monthly, and quarterly data. I'm curious to see the fit time outcomes of the perf tests for any weekly datasets we have

@eccabay (Contributor, PR author) replied:

After taking a look, we don't have any weekly datasets in the perf tests 😭. I'm happy to remove this default for now, since I've noticed similar time issues while testing locally.
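
For concreteness, a rough sketch of the kind of frequency-to-sp mapping discussed above. The "M" and "Q" entries and the fallback behavior are assumptions for illustration; only the "D" and "W" entries appear in the diff context:

# Sketch only: entries beyond "D" and "W", and the fallback to 1, are assumed.
freq_mappings = {
    "D": 7,    # daily data -> weekly seasonality
    "W": 52,   # weekly data -> yearly seasonality (slow to fit, per the comment above)
    "M": 12,   # monthly data -> yearly seasonality
    "Q": 4,    # quarterly data -> yearly seasonality
}

def resolve_sp(inferred_freq):
    """Map an inferred pandas frequency string (e.g. '2D', 'MS') to a default sp."""
    if inferred_freq is None:
        return 1  # no detectable frequency: treat the series as non-seasonal
    base = inferred_freq.lstrip("0123456789")  # strip multipliers, e.g. the 2 in "2D"
    return freq_mappings.get(base[:1], 1)      # match on the leading frequency letter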

@chukarsten (Contributor) left a comment

Looking solid, @eccabay. Just left a few comments for exploration, particularly with respect to the performance implications of re-inferring temporal frequencies and the setting/re-setting of the _component_obj in __init__() and __fit__().

time_index = self._parameters.get("time_index", None)
sp = self.arima_parameters["sp"]
if sp == "detect":
    inferred_freqs = X.ww.infer_temporal_frequencies()

Contributor:

So, dumb question here: if X already has its freqs inferred, will this reinfer them? Inference of data types is expensive, I'd assume freq inference is, too.

@eccabay (Contributor, PR author) replied:

Unfortunately, from what I can tell, neither woodwork nor pandas saves the frequency information - you can run inference in either library to get the frequency, but it's not saved anywhere between runs.

Contributor:

Hmm, that's weird, because I know a pandas DatetimeIndex saves the inferred frequency in both the freq and freqstr attributes. If you instantiate the DatetimeIndex like dti = pd.DatetimeIndex( ... , freq="D"), it should carry that freq around with it. I guess this is different because Woodwork doesn't necessarily have a datetime index and just has columns, one of which is the time index column.
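
A quick illustration of the pandas behavior being described here, using stock pandas only:

import pandas as pd

# A DatetimeIndex built with an explicit (or regular) frequency carries it around:
dti = pd.date_range("2021-01-01", periods=5, freq="D")
print(dti.freq, dti.freqstr)  # <Day> D

# A plain datetime column has no freq attribute, so the frequency has to be
# (re)inferred from the values each time it is needed:
df = pd.DataFrame({"dates": pd.date_range("2021-01-01", periods=5, freq="D")})
print(pd.infer_freq(df["dates"]))  # 'D'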

sp_ = clf_month._get_sp(X)
assert sp_ == 12

X = pd.DataFrame({"dates": pd.date_range("2021-01-01", periods=500, freq="2D")})

Contributor:

I think, going back to one of my earlier comments, if X already knows the frequency we might not want to have it re-inferred. Perhaps we want to add a test counting the number of times infer_temporal_freqs() is called when a known frequency is provided, and limit it to 1?
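
One possible shape for such a test, using unittest.mock to count calls into woodwork. The patch target path and the returned dict format are assumptions about the installed woodwork version, and the call under the patch is just a stand-in for whatever code path ends up being exercised:

from unittest.mock import patch

import pandas as pd
import woodwork  # noqa: F401 -- registers the DataFrame.ww accessor

def test_frequency_inferred_at_most_once():
    X = pd.DataFrame({"dates": pd.date_range("2021-01-01", periods=500, freq="D")})
    X.ww.init()
    # Patch target path is an assumption; adjust to wherever the method actually lives.
    with patch(
        "woodwork.table_accessor.WoodworkTableAccessor.infer_temporal_frequencies",
        return_value={"dates": "D"},
    ) as mock_infer:
        # Stand-in for the code under test, e.g. the component's _get_sp(X).
        X.ww.infer_temporal_frequencies()
    assert mock_infer.call_count <= 1
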

@eccabay requested a review from chukarsten July 6, 2022 14:18

@chukarsten (Contributor) left a comment

Thanks!

@eccabay enabled auto-merge (squash) July 8, 2022 15:08
@eccabay merged commit a303470 into main Jul 8, 2022
@eccabay deleted the 4285_arima_sp branch July 8, 2022 15:23
@chukarsten mentioned this pull request Jul 24, 2022