-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add the sp
parameter to ARIMA models
#3597
Conversation
Codecov Report
@@ Coverage Diff @@
## main #3597 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 335 335
Lines 33387 33456 +69
=======================================
+ Hits 33258 33327 +69
Misses 129 129
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great work on this @eccabay! Just some additional testing suggestions but lets wait to review the perf test results before merging!
evalml/pipelines/components/estimators/regressors/arima_regressor.py
Outdated
Show resolved
Hide resolved
evalml/pipelines/components/estimators/regressors/arima_regressor.py
Outdated
Show resolved
Hide resolved
return 1 | ||
freq_mappings = { | ||
"D": 7, | ||
"W": 52, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the main reasons I never implemented the mapping of sp
was because of how ridiculously long ARIMA would take to fit on sp
values above 12 (the R implementation seems to be much faster). I like this approach of mapping to at least daily
, monthly
, and quarterly
data. I'm curious to see the fit time outcomes of the perf tests for any weekly datasets we have
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After taking a look, we don't have any weekly datasets in the perf tests 😭. I'm happy to remove this default for now, since I've noticed similar time issues while testing locally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking solid, @eccabay . Just left a few comments for exploration, particularly with respect to performance implications of re-inferring temporal frequencies and the setting/re-setting of the _component_obj
in __init__()
and __fit__()
.
evalml/pipelines/components/estimators/regressors/arima_regressor.py
Outdated
Show resolved
Hide resolved
time_index = self._parameters.get("time_index", None) | ||
sp = self.arima_parameters["sp"] | ||
if sp == "detect": | ||
inferred_freqs = X.ww.infer_temporal_frequencies() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, dumb question here: if X already has its freqs inferred, will this reinfer them? Inference of data types is expensive, I'd assume freq inference is, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, from what I can tell, neither woodwork nor pandas saves the frequency information - you can run inference in either library to get the frequency, but it's not saved anywhere between runs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, that's weird, because I know in a pandas DatetimeIndex, it saves the inferred frequency in both the freq
and freqstr
attributes. If you instantiate the DatetimeIndex like dti = pd.DatetimeIndex( ... , freq="D")
, it should carry that freq around with it. I guess this is different because Woodwork doesn't necessarily have a datetime index and just has columns, one of which being the time index column.
evalml/pipelines/components/estimators/regressors/arima_regressor.py
Outdated
Show resolved
Hide resolved
sp_ = clf_month._get_sp(X) | ||
assert sp_ == 12 | ||
|
||
X = pd.DataFrame({"dates": pd.date_range("2021-01-01", periods=500, freq="2D")}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think going back to one of my earlier comments, if X already knows the frequency, we might not want to have the freq re-infered. Perhaps we want to add a test counting the number of times infer_temporal_freqs()
is called in the case of a known frequency provided and limit it to 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
A bit of information on setting this parameter: http://alkaline-ml.com/pmdarima/tips_and_tricks.html#period
I tried to offer the flexibility for the user to set it if they know it, but also add some quick default values for common time series frequencies.
Perf tests are complete.