MySQL time column to dataframe fix for Pandas 1.x, increase MySQL integration testing #152
Conversation
# relying on Pandas:
#
# See https://github.com/bluelabsio/records-mover/pull/152
pandas_version: ">1"
Is there any sense in running these tests with old pandas as well? I mean, hopefully people aren't really using old pandas much anymore (should that maybe become the default pandas version unless overridden?) I guess that's a lot of testing to do for a relatively small segment of functionality
Honestly I'm only halfway convinced it makes sense to keep this in the tests as a special case - this one line of code is currently the issue, there's nothing else special about MySQL and Pandas, and now we have a unit test for this issue, which is lower on the test pyramid (and we run unit tests on both Pandas <1 and Pandas >1).
def test_load_bluelabs_format_with_header_row(self):
    self.load_and_verify('delimited', 'bluelabs', {'header-row': True})
    self.load_and_verify('delimited', 'bluelabs', {'compression': None, 'header-row': True})
This is unrelated to the above concern - but we weren't really testing MySQL unloads, as our test cases used compressed files, which aren't supported by MySQL.
Should that read "we aren't really testing..."? If so, I will drop an issue in for us to add some testing :)
data = np.array([pd.Timedelta(hours=0, minutes=0, seconds=0)])
series = pd.Series(data)
new_series = field.cast_series_type(series)
self.assertEqual(new_series[0], '00:00:00')
This test reproduced the problem:
def components_to_time_str(df: pd.DataFrame) -> str:
    return f"{df['hours']:02d}:{df['minutes']:02d}:{df['seconds']:02d}"
out = series.dt.components.apply(axis=1, func=components_to_time_str)
return out
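As a self-contained sketch of this components-based approach (the wrapper function name below is mine for illustration, not records-mover's actual API):

```python
import pandas as pd


def format_timedeltas_as_time_strs(series: pd.Series) -> pd.Series:
    """Render a timedelta64 Series as HH:MM:SS strings via .dt.components.

    .dt.components exposes the hours/minutes/seconds of each timedelta as
    integer columns, so the formatting does not depend on how str() happens
    to render a Timedelta in any given pandas version.
    """
    def components_to_time_str(row: pd.Series) -> str:
        return f"{row['hours']:02d}:{row['minutes']:02d}:{row['seconds']:02d}"

    return series.dt.components.apply(axis=1, func=components_to_time_str)


series = pd.Series([pd.Timedelta(0), pd.Timedelta(hours=1, minutes=2, seconds=3)])
print(format_timedeltas_as_time_strs(series).tolist())  # ['00:00:00', '01:02:03']
```

Because the zero-valued timedelta goes through the same integer-formatting path as any other value, midnight no longer needs special-casing.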
I don't know how much more or less efficient this will be. It's probably not hugely important until we start dealing with very large things exported from MySQL - and given we don't currently have a bulk unloader, adding one is probably a higher priority than trying to optimize this. 🤷
Word-- I think my expectation is that before we worry about this time transformation in pandas being a bottleneck for big datasets, we probably need to worry about pandas itself being the bottleneck.
Nice! A few small questions, but nothing to hold up the PR.
MySQL doesn't really have a time column type; it has a time delta type. People, however, use that type for both things.

In order to format CSVs after pulling data from MySQL into a dataframe, we have some code which formats timedelta columns as time strings. To do that, we were relying on some shady str() behavior in Pandas/NumPy timedelta types.

It looks like as of Pandas 1.x, for the time midnight (00:00:00), this breaks, as str(timedelta) gives us '0 days' instead of '0 days 00:00:00'. This PR changes over to a less shady method of generating those times.
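To illustrate the midnight edge case the PR describes, here is a sketch (variable names are mine, not from the PR) showing that formatting from `.dt.components` sidesteps str() rendering entirely:

```python
import pandas as pd

# Midnight is the problematic value: per the PR, under Pandas 1.x str() of
# a zero timedelta can render as '0 days' with no time portion at all.
midnight = pd.Series([pd.Timedelta(hours=0, minutes=0, seconds=0)])

# Reading the integer components directly gives a stable HH:MM:SS string
# regardless of how str() chooses to display the timedelta:
formatted = midnight.dt.components.apply(
    axis=1,
    func=lambda row: f"{row['hours']:02d}:{row['minutes']:02d}:{row['seconds']:02d}",
)
print(formatted.tolist())  # ['00:00:00']
```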