
MySQL time column to dataframe fix for Pandas 1.x, increase MySQL integration testing #152

Merged
merged 9 commits into from
Dec 10, 2020

Conversation

@vinceatbluelabs (Contributor) commented Dec 10, 2020

MySQL doesn't really have a Time column type; it has a time delta type. People, however, use that type for both things.

In order to format CSVs after pulling data from MySQL into a dataframe, we have some code which formats timedelta columns as time strings. To do that, we were relying on some shady str() behavior in Pandas/NumPy timedelta types.

It looks like as of Pandas 1.x, for the time midnight (00:00:00), this breaks: str(timedelta) gives us 0 days instead of 0 days 00:00:00.

This PR changes over to a less shady method of generating those times.
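For context, here is a standalone sketch of the approach: instead of relying on str() of a timedelta, format the string from Series.dt.components, which always exposes hours/minutes/seconds as integer columns regardless of the Pandas version. (Names here are illustrative; the PR's actual helper lives in the dataframe-to-CSV formatting code.)

```python
import pandas as pd

# Midnight is the case that broke under str() in Pandas 1.x.
series = pd.Series([pd.Timedelta(hours=0, minutes=0, seconds=0),
                    pd.Timedelta(hours=12, minutes=34, seconds=56)])

def components_to_time_str(row: pd.Series) -> str:
    # Each row of .dt.components has integer hours/minutes/seconds.
    return f"{row['hours']:02d}:{row['minutes']:02d}:{row['seconds']:02d}"

out = series.dt.components.apply(axis=1, func=components_to_time_str)
print(list(out))  # ['00:00:00', '12:34:56']
```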

@vinceatbluelabs vinceatbluelabs changed the title Test loading with formats MySQL supports MySQL time column to dataframe fix Dec 10, 2020
@vinceatbluelabs vinceatbluelabs changed the title MySQL time column to dataframe fix MySQL time column to dataframe fix for Pandas 1.x Dec 10, 2020
# relying on Pandas:
#
# See https://github.com/bluelabsio/records-mover/pull/152
pandas_version: ">1"
Contributor:

Is there any sense in running these tests with old pandas as well? I mean, hopefully people aren't really using old pandas much anymore. (Should that maybe become the default pandas version unless overridden?) I guess that's a lot of testing to do for a relatively small segment of functionality.

Contributor Author:

Honestly, I'm only halfway convinced it makes sense to keep this in the tests as a special case. This one line of code is currently the issue, there's nothing else special about MySQL and Pandas, and we now have a unit test for this issue, which is lower on the test pyramid (and we run unit tests on both Pandas <1 and Pandas >1).

@vinceatbluelabs vinceatbluelabs changed the title MySQL time column to dataframe fix for Pandas 1.x MySQL time column to dataframe fix for Pandas 1.x, increase MySQL integration testing Dec 10, 2020

def test_load_bluelabs_format_with_header_row(self):
    self.load_and_verify('delimited', 'bluelabs', {'header-row': True})
    self.load_and_verify('delimited', 'bluelabs', {'compression': None, 'header-row': True})
Contributor Author:

This is unrelated to the above concern, but we weren't really testing MySQL unloads: our test cases used compressed files, and those aren't supported by MySQL.

Contributor:

Should that read "we aren't really testing..."? If so, I will drop an issue in for us to add some testing :)

data = np.array([pd.Timedelta(hours=0, minutes=0, seconds=0)])
series = pd.Series(data)
new_series = field.cast_series_type(series)
self.assertEqual(new_series[0], '00:00:00')

@vinceatbluelabs vinceatbluelabs marked this pull request as ready for review December 10, 2020 18:12
def components_to_time_str(df: pd.DataFrame) -> str:
    return f"{df['hours']:02d}:{df['minutes']:02d}:{df['seconds']:02d}"
out = series.dt.components.apply(axis=1, func=components_to_time_str)
return out
Contributor Author:

I don't know how much more or less efficient this will be. It's probably not hugely important until we start dealing with very large exports from MySQL, and given we don't currently have a bulk unloader, adding one is probably a higher priority than trying to optimize this. 🤷
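If per-row apply() ever did become a bottleneck, one vectorized alternative (a hypothetical sketch, not what this PR does) would be to build the strings column-wise from the components DataFrame, avoiding a Python call per row:

```python
import pandas as pd

series = pd.Series([pd.Timedelta(seconds=0),
                    pd.Timedelta(hours=1, minutes=2, seconds=3)])

# Column-wise string construction: one vectorized pass per component.
c = series.dt.components
out = (c['hours'].astype(str).str.zfill(2) + ':'
       + c['minutes'].astype(str).str.zfill(2) + ':'
       + c['seconds'].astype(str).str.zfill(2))
print(list(out))  # ['00:00:00', '01:02:03']
```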

Contributor:

Word-- I think my expectation is that before we worry about this time transformation in pandas being a bottleneck for big datasets, we probably need to worry about pandas itself being the bottleneck.

@cwegrzyn (Contributor) left a comment:

Nice! A few small questions, but nothing to hold up the PR. :shipit:


@vinceatbluelabs vinceatbluelabs merged commit f9d8d6d into master Dec 10, 2020