-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extend DateTimeFormatCheck to work for multiseries #4300
Conversation
@@ -199,52 +210,6 @@ def validate(self, X, y): | |||
... } | |||
... ] | |||
|
|||
The column "Weeks" has a date that does not follow the weekly pattern, which is considered misaligned. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed because it looks like an exact copy of line 156-200
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## main #4300 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 357 357
Lines 39587 39694 +107
=======================================
+ Hits 39467 39574 +107
Misses 120 120
☔ View full report in Codecov by Sentry. |
if not pd.DatetimeIndex(no_nan_datetime_values).is_monotonic_increasing: | ||
messages.append( | ||
DataCheckError( | ||
message="Datetime values must be sorted in ascending order.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unsure if I need to change the message here to say something like "Datetime values {at series x} must be sorted in ascending order" or if it's alright to just keep it general (note: this line is not currently in the for loop that loops through all the different series)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine to keep it general here!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few nitpicks and some testing consolidation, but otherwise looks great! Thanks for doing this
if not pd.DatetimeIndex(no_nan_datetime_values).is_monotonic_increasing: | ||
messages.append( | ||
DataCheckError( | ||
message="Datetime values must be sorted in ascending order.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine to keep it general here!
dates = ( | ||
pd.date_range("2021-01-01", periods=15, freq="2D") | ||
.drop("2021-01-13") | ||
.append(pd.date_range("2021-01-30", periods=1)) | ||
.append(pd.date_range("2021-01-31", periods=86, freq="2D")) | ||
) | ||
X = pd.DataFrame({"dates": dates}, dtype="datetime64[ns]") | ||
|
||
ww_payload = infer_frequency( | ||
X["dates"], | ||
debug=True, | ||
window_length=WINDOW_LENGTH, | ||
threshold=THRESHOLD, | ||
) | ||
|
||
assert datetime_format_check.validate(X, y) == [ | ||
DataCheckError( | ||
message="Column 'dates' has datetime values missing between start and end date.", | ||
data_check_name=datetime_format_check_name, | ||
message_code=DataCheckMessageCode.DATETIME_IS_MISSING_VALUES, | ||
).to_dict(), | ||
DataCheckError( | ||
message="Column 'dates' has datetime values that do not align with the inferred frequency.", | ||
data_check_name=datetime_format_check_name, | ||
message_code=DataCheckMessageCode.DATETIME_HAS_MISALIGNED_VALUES, | ||
).to_dict(), | ||
get_uneven_error("dates", ww_payload), | ||
] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why was this deleted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like an exact copy of line 323 - 349
evalml/tests/data_checks_tests/test_datetime_format_data_check.py
Outdated
Show resolved
Hide resolved
evalml/tests/data_checks_tests/test_datetime_format_data_check.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - just some punctuation requested.
Co-authored-by: Jeremy Shih <[email protected]>
Co-authored-by: Jeremy Shih <[email protected]>
Pull Request Description
Extend DateTimeFormatCheck to work for multiseries, works on stacked datasets.
Closes #4301
Small example of what this does
Say we had a multiseries problem with a stacked dataset, with two series (0, 1) and goes from 2021-01-01 to 2021-01-15.
For Series 1, the date "2021-01-07" is missing (line 14) while Series 0 has an extra date "2021-01-29" (line 31).
When we use the datacheck and run validate:
It gives the following result: