MySQL time column to dataframe fix for Pandas 1.x, increase MySQL integration testing #152
Conversation
# relying on Pandas:
#
# See https://github.com/bluelabsio/records-mover/pull/152
pandas_version: ">1"
Is there any sense in running these tests with old pandas as well? I mean, hopefully people aren't really using old pandas much anymore (should that maybe become the default pandas version unless overridden?) I guess that's a lot of testing to do for a relatively small segment of functionality
Honestly I'm only halfway convinced it makes sense to keep this in the tests as a special case - this one line of code is currently the issue, there's nothing else special about MySQL and Pandas, and now we have a unit test for this issue, which is lower on the test pyramid (and we run unit tests on both Pandas <1 and Pandas >1).
def test_load_bluelabs_format_with_header_row(self):
    self.load_and_verify('delimited', 'bluelabs', {'header-row': True})
    self.load_and_verify('delimited', 'bluelabs', {'compression': None, 'header-row': True})
This is unrelated to the above concern - but we weren't really testing MySQL unloads, as our test cases used compressed files, which aren't supported by MySQL.
Should that read "we aren't really testing..."? If so, I will drop an issue in for us to add some testing :)
data = np.array([pd.Timedelta(hours=0, minutes=0, seconds=0)])
series = pd.Series(data)
new_series = field.cast_series_type(series)
self.assertEqual(new_series[0], '00:00:00')
This test reproduced the problem:
def components_to_time_str(df: pd.DataFrame) -> str:
    return f"{df['hours']:02d}:{df['minutes']:02d}:{df['seconds']:02d}"
out = series.dt.components.apply(axis=1, func=components_to_time_str)
return out
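As a self-contained sketch of this components-based approach (the wrapper function name below is mine for illustration, not records-mover's actual API):

```python
import pandas as pd


def format_timedeltas_as_time_strs(series: pd.Series) -> pd.Series:
    """Render a timedelta64 Series as HH:MM:SS strings via .dt.components.

    .dt.components exposes the hours/minutes/seconds of each timedelta as
    integer columns, so the formatting does not depend on how str() happens
    to render a Timedelta in any given pandas version.
    """
    def components_to_time_str(row: pd.Series) -> str:
        return f"{row['hours']:02d}:{row['minutes']:02d}:{row['seconds']:02d}"

    return series.dt.components.apply(axis=1, func=components_to_time_str)


series = pd.Series([pd.Timedelta(0), pd.Timedelta(hours=1, minutes=2, seconds=3)])
print(format_timedeltas_as_time_strs(series).tolist())  # ['00:00:00', '01:02:03']
```

Because the zero-valued timedelta goes through the same integer-formatting path as any other value, midnight no longer needs special-casing.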
I don't know how much more or less efficient this will be. It's probably not hugely important until we start dealing with very large things exported from MySQL - and given we don't currently have a bulk unloader, adding one is probably a higher priority than trying to optimize this. 🤷
Word-- I think my expectation is that before we worry about this time transformation in pandas being a bottleneck for big datasets, we probably need to worry about pandas itself being the bottleneck.
Nice! A few small questions, but nothing to hold up the PR.
MySQL doesn't really have a time column type; it has a time delta type. People, however, use that type for both things.

In order to format CSVs after pulling data from MySQL into a dataframe, we have some code which formats timedelta columns as time strings. To do that, we were relying on some shady str() behavior in Pandas/NumPy timedelta types.

It looks like as of Pandas 1.x, for the time midnight (00:00:00), this breaks, as str(timedelta) gives us '0 days' instead of '0 days 00:00:00'. This PR changes over to a less shady method of generating those times.
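To illustrate the midnight edge case the PR describes, here is a sketch (variable names are mine, not from the PR) showing that formatting from `.dt.components` sidesteps str() rendering entirely:

```python
import pandas as pd

# Midnight is the problematic value: per the PR, under Pandas 1.x str() of
# a zero timedelta can render as '0 days' with no time portion at all.
midnight = pd.Series([pd.Timedelta(hours=0, minutes=0, seconds=0)])

# Reading the integer components directly gives a stable HH:MM:SS string
# regardless of how str() chooses to display the timedelta:
formatted = midnight.dt.components.apply(
    axis=1,
    func=lambda row: f"{row['hours']:02d}:{row['minutes']:02d}:{row['seconds']:02d}",
)
print(formatted.tolist())  # ['00:00:00']
```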