Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend DateTimeFormatCheck to work for multiseries #4300

Merged
merged 13 commits into from
Sep 12, 2023
Merged

Conversation

MichaelFu512
Copy link
Contributor

@MichaelFu512 MichaelFu512 commented Sep 7, 2023

Pull Request Description

Extend DateTimeFormatCheck to work for multiseries, works on stacked datasets.

Closes #4301

Small example of what this does

Say we had a multiseries problem with a stacked dataset, with two series (0, 1) and goes from 2021-01-01 to 2021-01-15.
Screen Shot 2023-09-07 at 2 05 33 PM

For Series 1, the date "2021-01-07" is missing (line 14) while Series 0 has an extra date "2021-01-29" (line 31).

When we use the datacheck and run validate:
Screen Shot 2023-09-07 at 2 09 40 PM

It gives the following result:
Screen Shot 2023-09-07 at 2 11 32 PM

@@ -199,52 +210,6 @@ def validate(self, X, y):
... }
... ]

The column "Weeks" has a date that does not follow the weekly pattern, which is considered misaligned.
Copy link
Contributor Author

@MichaelFu512 MichaelFu512 Sep 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed because it looks like an exact copy of line 156-200

@codecov
Copy link

codecov bot commented Sep 7, 2023

Codecov Report

Patch coverage: 100.0% and project coverage change: +0.1% 🎉

Comparison is base (f7276a5) 99.7% compared to head (382fdc4) 99.7%.

Additional details and impacted files
@@           Coverage Diff           @@
##            main   #4300     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        357     357             
  Lines      39587   39694    +107     
=======================================
+ Hits       39467   39574    +107     
  Misses       120     120             
Files Changed Coverage Δ
evalml/data_checks/datetime_format_data_check.py 100.0% <100.0%> (ø)
...ta_checks_tests/test_datetime_format_data_check.py 100.0% <100.0%> (ø)

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

if not pd.DatetimeIndex(no_nan_datetime_values).is_monotonic_increasing:
messages.append(
DataCheckError(
message="Datetime values must be sorted in ascending order.",
Copy link
Contributor Author

@MichaelFu512 MichaelFu512 Sep 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsure if I need to change the message here to say something like "Datetime values {at series x} must be sorted in ascending order" or if it's alright to just keep it general (note: this line is not currently in the for loop that loops through all the different series)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to keep it general here!

@MichaelFu512 MichaelFu512 marked this pull request as ready for review September 8, 2023 00:08
Copy link
Contributor

@eccabay eccabay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few nitpicks and some testing consolidation, but otherwise looks great! Thanks for doing this

evalml/data_checks/datetime_format_data_check.py Outdated Show resolved Hide resolved
if not pd.DatetimeIndex(no_nan_datetime_values).is_monotonic_increasing:
messages.append(
DataCheckError(
message="Datetime values must be sorted in ascending order.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to keep it general here!

evalml/data_checks/datetime_format_data_check.py Outdated Show resolved Hide resolved
evalml/data_checks/datetime_format_data_check.py Outdated Show resolved Hide resolved
Comment on lines -352 to -380
dates = (
pd.date_range("2021-01-01", periods=15, freq="2D")
.drop("2021-01-13")
.append(pd.date_range("2021-01-30", periods=1))
.append(pd.date_range("2021-01-31", periods=86, freq="2D"))
)
X = pd.DataFrame({"dates": dates}, dtype="datetime64[ns]")

ww_payload = infer_frequency(
X["dates"],
debug=True,
window_length=WINDOW_LENGTH,
threshold=THRESHOLD,
)

assert datetime_format_check.validate(X, y) == [
DataCheckError(
message="Column 'dates' has datetime values missing between start and end date.",
data_check_name=datetime_format_check_name,
message_code=DataCheckMessageCode.DATETIME_IS_MISSING_VALUES,
).to_dict(),
DataCheckError(
message="Column 'dates' has datetime values that do not align with the inferred frequency.",
data_check_name=datetime_format_check_name,
message_code=DataCheckMessageCode.DATETIME_HAS_MISALIGNED_VALUES,
).to_dict(),
get_uneven_error("dates", ww_payload),
]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this deleted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like an exact copy of line 323 - 349

Copy link
Collaborator

@jeremyliweishih jeremyliweishih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - just some punctuation requested.

evalml/data_checks/datetime_format_data_check.py Outdated Show resolved Hide resolved
evalml/data_checks/datetime_format_data_check.py Outdated Show resolved Hide resolved
@MichaelFu512 MichaelFu512 merged commit 3d29ba3 into main Sep 12, 2023
22 checks passed
@MichaelFu512 MichaelFu512 deleted the TML-8377 branch September 12, 2023 19:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Extend DataTimeFormatCheck data check to support multiseries
3 participants