Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Loosen validation and mypy type requirements for date/time types, start increased testing #144

Merged
merged 9 commits into from
Dec 9, 2020

Conversation

vinceatbluelabs
Copy link
Contributor

The Records Spec uses hints to describe how dates, times, datetime, and datetimetz types are formatted in CSVs.

It does this by describing a set of allowed values. The format of the values were originally cribbed from Redshift's COPY statement.

The current set of actually allowed values for those types (dateformat/timeonlyformat/datetimeformattz/datetimeformat) are fairly arbitrary, incomplete, inconsistent with each other, and don't hold up as terribly consistent under scrutiny. They've also been a source of bugs over time.

To make those complete even insomuch as the patterns they currently try to support would involve the start of some combinatorial explosion:

  • Hour can be specified as HH, HH24 or HH12 (note also that HH and HH24 mean the same thing, but we documented it both ways in at least one case.)
  • SS can either be specified or not.
  • OF can be specified for timezone offset or not
  • Dates can be in MDY, DMY or YMD format
  • Years can be in YYYY or YY format
  • Dates can be segmented by - or /
  • ...

I think this does not bode well for maintaining an allowlist of formats - instead, I think we're going to need to (like Redshift) support a general formatting language in the spec, and do translations as needed to the formatting languages of the different databases/tools that Records Mover integrates with.

In the face of that, I'd like to expand the allowed values in a future PR. The motivation there is to ensure we allow likely DMY-flavored formats, to more generally support more things users may need in the future.

This PR is step one of that. It adds one additional test as a beginning of comprehensive testing of the existing supported values, and makes it easy to add more testable values in the future.

To deal with the combinatorial explosion problem for the time being, I've loosened the internal constraints by changing the Literal types to strings, and fixed up the various things based on what the testing found.

This PR does not yet loosen up the spec in any way.

The rest of the tests can be found in #140, which is probably too big to review right now--this is the first PR chipped off of that boulder.

'YYYY-MM-DD HH:MI:SS']:
pass
else:
cant_handle_hint(fail_if_cant_handle_hint, 'datetimeformattz', hints)
quiet_remove(unhandled_hints, 'datetimeformattz')

if hints.timeonlyformat == 'HH24:MI:SS':
if hints.timeonlyformat in ['HH24:MI:SS', 'HH:MI:SS']:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixes issues found during testing!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifically, the Redshift spec for date/time formats defines HH as equivalent to HH24 - so there's no reason now to handle it similarly to the HH24 cases. These were reproduced with test cases defined in datetime_cases.py in tests/component below.

"YYYY-MM-DD HH:MI:SS",
"YYYY-MM-DD HH12:MI AM",
"MM/DD/YY HH24:MI",
]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not all of this file is probably used in this PR - it's the full set of things needed to define the test cases for #140

if should_raise[dateformat]:
pass
else:
raise
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The general intent with these tests is that as we add new things to the below datetime_cases.py module, tests will blow up and we'll need to go back and hand-add expectations of what each driver should produce for each input.

@@ -324,13 +324,14 @@ def add_load_job_csv_config(unhandled_hints: Set[str],
quiet_remove(unhandled_hints, 'datetimeformat')

if hints.datetimeformattz in ['YYYY-MM-DD HH:MI:SSOF',
'YYYY-MM-DD HH24:MI:SSOF',
'YYYY-MM-DD HH:MI:SS']:
Copy link
Contributor Author

@vinceatbluelabs vinceatbluelabs Dec 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

YYYY-MM-DD HH24:MI:SS would be a good additional item - I'll write this down and address it in #140 if not before, probably by adding it as an explicit test case in my list of test cases below.

@vinceatbluelabs vinceatbluelabs marked this pull request as ready for review December 9, 2020 19:33
@vinceatbluelabs
Copy link
Contributor Author

Looks like I have a test failure, which is annoying. I’ll open up another PR related to that and probably merge master back into this branch afterwards. If you could review the PR in the meantime ignoring the test failure for the moment, I’ll hold off merging until that fixes the passing tests.

Copy link

@crvena-sonja crvena-sonja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests are a bit of a logical brain-twister here but I like the consistency and the comments on datatypes definitely help!

Nice cleanup here and helpful context, as always.

@vinceatbluelabs vinceatbluelabs merged commit b71ed91 into master Dec 9, 2020
vinceatbluelabs added a commit that referenced this pull request Dec 9, 2020
Based on the new testing approach in #144, this PR adds component
tests for date/time types while loading/exporting CSV files in Vertica
databases.

Along the way some places for improvement were found!
vinceatbluelabs added a commit that referenced this pull request Dec 9, 2020
Based on the new testing approach in #144, this PR adds component
tests for date/time types while loading/exporting CSV files in Pandas.

Along the way some places for improvement were found!
vinceatbluelabs added a commit that referenced this pull request Dec 9, 2020
…147)

* Improve date/time hint handling for Vertica based on better testing

Based on the new testing approach in #144, this PR adds component
tests for date/time types while loading/exporting CSV files in Vertica
databases.

Along the way some places for improvement were found!

* Ratchet coverage

* Ratchet mypy

* Ratchet coverage

* Ratchet mypy coverage
vinceatbluelabs added a commit that referenced this pull request Dec 9, 2020
Based on the new testing approach in #144, this PR adds component
tests for date/time types while loading/exporting CSV files in Redshift
databases.

Along the way some places for improvement were found!
vinceatbluelabs added a commit that referenced this pull request Dec 9, 2020
Based on the new testing approach in #144, this PR adds component
tests for date/time types while loading/exporting CSV files in Redshift
databases.

Along the way some places for improvement were found!
vinceatbluelabs added a commit that referenced this pull request Dec 10, 2020
Based on the new testing approach in #144, this PR adds component
tests for date/time types while loading CSV files in MySQL.

The expectations for those test are based on the results of the
integration tests in #140, which shows our MySQL driver is way too
optimistic about what date/time hints it can import!
vinceatbluelabs added a commit that referenced this pull request Dec 10, 2020
)

* Improve date/time hint handling for Pandas based on better testing

Based on the new testing approach in #144, this PR adds component
tests for date/time types while loading/exporting CSV files in Pandas.

Along the way some places for improvement were found!

* Ratchet test coverage

* Ratchet mypy coverage

* Update records_mover/records/pandas/to_csv_options.py

Co-authored-by: Chris Wegrzyn <[email protected]>

Co-authored-by: Chris Wegrzyn <[email protected]>
vinceatbluelabs added a commit that referenced this pull request Dec 10, 2020
* Improve date/time hint handling for MySQL based on better testing

Based on the new testing approach in #144, this PR adds component
tests for date/time types while loading CSV files in MySQL.

The expectations for those test are based on the results of the
integration tests in #140, which shows our MySQL driver is way too
optimistic about what date/time hints it can import!

* Make comment in complete sentences

* Quick flake8 fix

* csv variant proved to not be possible

* Ratchet

* Ratchet
vinceatbluelabs added a commit that referenced this pull request Dec 10, 2020
…149)

* Improve date/time hint handling for Redshift based on better testing

Based on the new testing approach in #144, this PR adds component
tests for date/time types while loading/exporting CSV files in Redshift
databases.

Along the way some places for improvement were found!

* Improve date/time hint handling for Redshift based on better testing

Based on the new testing approach in #144, this PR adds component
tests for date/time types while loading/exporting CSV files in Redshift
databases.

Along the way some places for improvement were found!

* Ratchet coverage

* Remove original

* Ratchet
@ryantimjohn ryantimjohn deleted the loosen_date_requirements_1 branch December 17, 2022 11:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants