Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor CLI hint code and types, more CLI hint arguments #66

Merged
merged 30 commits into from
Jun 2, 2020

Conversation

vinceatbluelabs
Copy link
Contributor

@vinceatbluelabs vinceatbluelabs commented May 23, 2020

There are three major changes in this PR:

  • Move all load/unload code to use well-typed, pre-validated hints rather than raw input when we have a fully specified RecordsFormat.
  • Introduce a TypedDict partially specified version of hints to use as input to and output from our hint sniffing code.
  • Use the latter to flush out the remaining command line options in mvrec related to hints, and enhance our JSON schema to argparse translator to handle another case.

Some background:

Up until now, the only command line parameters for hints that we've accepted are those in the 'BootstrappingHints' type, which is typed as a TypedDict with a limited selection of keys.

We have an 'initial_hints' argument to various factory methods that accept this type. Having these separate from an optional 'records_format' argument allowed us to be told whether to do hint inference or not and still be able to override hint inference when we need to.

Because we didn't have Literal[] types for the relevant hints yet, these arguments were hand-crafted, so I didn't build all of them out. The ones I hand-crafted in were motivated by things the hint sniffing code didn't address and which commonly caused loading errors (generally empirically as folks ran into issues).

Later on I provided a 'variant' CLI argument that if provided would combine with the hints provided to fill in the 'records_format' argument to the factory methods. This was a little limited, as we still only hand selected arguments available, but it was nice for dealing with, say, an uncompresssed version of a common variant, or a slight delta on a commonly used and abused variant like 'csv'.

Now that we have the ability to easily generate the additional hints, and have types for everything, I'd like to go ahead and fully extend out the hints supported by the command line.

I made this a TypedDict with all of the hint keys included to help people who use type checkers, but we do have to be careful, as this is an external interface - just because we declared a type doesn't mean people will send us well-typed things! I'm casting those back down to our untyped dictionary type once we receive it so we don't make any bad assumptions handling that
data.

Before:

(mylibs-3.8.2) �]0;graybookpro�broz@graybookpro:~$ mvrec file2table
usage: mvrec file2table [-h] [--source.variant SOURCE.VARIANT]
                        [--source.field_delimiter SOURCE.FIELD-DELIMITER] [--source.no_compression]
                        [--source.compression {BZIP,GZIP,LZO}] [--source.no_escape] [--source.escape {\}]
                        [--source.no_quoting] [--source.quoting {all,minimal,nonnumeric}]
                        [--source.encoding {UTF8,UTF16,UTF16LE,UTF16BE,LATIN1,CP1252}]
                        [--target.existing_table {delete_and_overwrite,truncate_and_overwrite,drop_and_recreate,append}]
                        [--target.drop_and_recreate_on_load_error]
                        source.filename target.db_name target.schema_name target.table_name
mvrec file2table: error: the following arguments are required: source.filename, target.db_name, target.schema_name, target.table_name
(mylibs-3.8.2) �]0;graybookpro�broz@graybookpro:~$ 

After:

(records-mover-3.8.2) �]0;graybookpro�broz@graybookpro:~/src/records-mover$ mvrec file2table
usage: mvrec file2table [-h] [--source.variant SOURCE.VARIANT]
                        [--source.datetimeformattz {YYYY-MM-DD HH:MI:SSOF,YYYY-MM-DD HH:MI:SS,YYYY-MM-DD HH24:MI:SSOF,MM/DD/YY HH24:MI}]
                        [--source.datetimeformat {YYYY-MM-DD HH24:MI:SS,YYYY-MM-DD HH:MI:SS,YYYY-MM-DD HH12:MI AM,MM/DD/YY HH24:MI}]
                        [--source.no_compression] [--source.compression {GZIP,BZIP,LZO}]
                        [--source.no_quoting] [--source.quoting {all,minimal,nonnumeric}]
                        [--source.no_escape] [--source.escape {\}]
                        [--source.encoding {UTF8,UTF16,UTF16LE,UTF16BE,UTF16BOM,UTF8BOM,LATIN1,CP1252}]
                        [--source.dateformat {YYYY-MM-DD,MM-DD-YYYY,DD-MM-YYYY,MM/DD/YY}]
                        [--source.timeonlyformat {HH12:MI AM,HH24:MI:SS}] [--source.no_doublequote]
                        [--source.doublequote] [--source.no_header_row] [--source.header_row]
                        [--source.quotechar SOURCE.QUOTECHAR]
                        [--source.record_terminator SOURCE.RECORD-TERMINATOR]
                        [--source.field_delimiter SOURCE.FIELD-DELIMITER]
                        [--target.existing_table {delete_and_overwrite,truncate_and_overwrite,drop_and_recreate,append}]
                        [--target.drop_and_recreate_on_load_error]
                        source.filename target.db_name target.schema_name target.table_name
mvrec file2table: error: the following arguments are required: source.filename, target.db_name, target.schema_name, target.table_name
(records-mover-3.8.2) �]0;graybookpro�broz@graybookpro:~/src/records-mover$ 

@vinceatbluelabs vinceatbluelabs changed the title Refactor CLI hint code Refactor CLI hint code and types May 28, 2020
@@ -1,29 +1,43 @@
from typing_inspect import is_literal_type, get_args
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's teach the Hint type to have descriptions and be able to generate JSON schema information!

@@ -1,66 +0,0 @@
"""Defines hints supported by the job config parser."""
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the duplicative descriptions of hints that are no longer needed!

@vinceatbluelabs vinceatbluelabs marked this pull request as ready for review May 30, 2020 23:50
93.6800
Copy link
Contributor Author

@vinceatbluelabs vinceatbluelabs May 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you zoom down to the tests, I had to disable a few tests - we had code and tests supporting hint values for date/time values which don't actually exist in the spec. Now that we're doing validation using the correct list of supported hint values, those tests can't be run.

Adding those hint values makes sense, for sure, though. Adding those hint values and making sure all the different drivers support them correctly is likely to spiral just a bit from this already 56 file PR, so I'm proposing to comment out those tests and leave those hints as unsupported. I have a high priority follow-up backlog task to handle these. It's not normal at least in internal use to be passing in date/time hints, so I'm not terribly concerned about breaking any existing out-of-spec code with this.

Commented out tests:

Backlog task: https://app.asana.com/0/1128138765527694/1178155809546696

@vinceatbluelabs vinceatbluelabs changed the title Refactor CLI hint code and types Refactor CLI hint code and types, more CLI hint arguments May 31, 2020
@vinceatbluelabs vinceatbluelabs requested a review from cwegrzyn May 31, 2020 00:02
@vinceatbluelabs vinceatbluelabs merged commit 278a06f into master Jun 2, 2020
@vinceatbluelabs vinceatbluelabs deleted the refactor_cli_hints branch June 2, 2020 16:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants