-
Notifications
You must be signed in to change notification settings - Fork 307
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add custom cell magic parser #213
Conversation
@shollyman I would like to hear your thoughts on this. The current cell magic line parser is not aware of Python structures (dicts, lists ...) and can thus incorrectly parse the I took a stab at it and came up with a POC custom parser that successfully parses non-trivial input lines such as the following:
The following test snippets demonstrates that the parser correctly recognized the options: from google.cloud.bigquery.ipython_magics import line_arg_parser as lap
example_line = (
" target_var "
"--use_legacy_sql "
"--project my.project.name "
"--params {'nested_dict': {'bar': 'baz', 'some_tuple': (-1, 3.2,)}, "
"\"answer\": -42, 'array': ['aaa bbb', [\"nested\", 'list'], 'ccc ddd'], "
"'--params': 'tricky --inner-option string literal',} "
"--verbose "
)
lexer = lap.lexer.Lexer(line)
parser = lap.parser.Parser(lexer)
parse_tree = parser.input_line()
printer = lap.visitors.TreePrinter()
printer.visit(parse_tree) Output:
The custom parser could be used to extract out the Do you think we should pursue this further? The grammar used is of course a simplification that is aimed to be "good enough" to reliably extract the cc: @tswast Do customers complain about the default parser's limitations often? Do you think there even exist reasonable use cases where --params`dictionary would be incredibly complex? |
I wouldn't say often, but it has come up more than once |
5aa7f07
to
e778cf2
Compare
e778cf2
to
4e30e4c
Compare
5359b3a
to
b514850
Compare
b514850
to
5a6257a
Compare
5a6257a
to
976567c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pretty good. Thanks.
The parser should accept as wide a range of values as possible and let the code that delas with the semantics to decide whether the values are good or not.
The --params option spec must be followed by a non-alphanumeric character, otherwise it's a different option spec (e.g. --paramsX).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is looking good.
767e60b
to
5249c6a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome! It does require a bit of refactoring, though. We need to keep google.cloud.bigquery.magics
module/package name due to google.cloud.bigquery.magics.context
I'd also like to see if we can lean on pyparsing for a lot of the lexing / parsing work. I worry that those modules are going to be particularly challenging to maintain without either "literate programming" level comments or relying on common modules to make the parser easier to understand.
The context still needs to be importable from the old path
1a97836
to
19af056
Compare
Token = namedtuple("Token", ("type_", "lexeme", "pos")) | ||
StateTransition = namedtuple("StateTransition", ("new_state", "total_offset")) | ||
|
||
# Pattern matching is done with regexes, and the order in which the token patterns are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for this explanation and the clearer state names!
Fun fact: anything that can be parsed with a regular expression can also be parsed by a finite state machine and vice versa. It's called a regular language. That means we might have been able to convert this into one giant regular expression, but I like having these explicit states better.
I recall that lookahead & backtracking isn't a regular expression in the computer science definition, so I don't think this is technically a "regular" language, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, regular expressions correspond to finite automata, although the regex implementation is a bit more powerful than that (AFAIK), as it supports things like lookahead and lookbehind.
The language itself recognized by the parser is not regular, true, as the grammar allows for recursive patterns such as nested dicts. A proof of non-regularity is left as an exercise to the reader. 😆
(hint: pumping lemma).
Apparently black just places all implicitly concatenated string literals in a single line when short enough without replacing them with a single string literal.
806a4ad
to
ed01a66
Compare
This is necessary to retain Python 2 compatibility.
The tokens are designed in a way that the scanner *always* returns some match, even if just UNKNOWN or EOL. The "no matches" code path can thus never be taken, but the coverage check can't know that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! Thanks for the clear explanations.
🤖 I have created a release \*beep\* \*boop\* --- ## [1.28.0](https://www.github.com/googleapis/python-bigquery/compare/v1.27.2...v1.28.0) (2020-09-22) ### Features * add custom cell magic parser to handle complex `--params` values ([#213](https://www.github.com/googleapis/python-bigquery/issues/213)) ([dcfbac2](https://www.github.com/googleapis/python-bigquery/commit/dcfbac267fbf66d189b0cc7e76f4712122a74b7b)) * add instrumentation to list methods ([#239](https://www.github.com/googleapis/python-bigquery/issues/239)) ([fa9f9ca](https://www.github.com/googleapis/python-bigquery/commit/fa9f9ca491c3f9954287102c567ec483aa6151d4)) * add opentelemetry tracing ([#215](https://www.github.com/googleapis/python-bigquery/issues/215)) ([a04996c](https://www.github.com/googleapis/python-bigquery/commit/a04996c537e9d8847411fcbb1b05da5f175b339e)) * expose require_partition_filter for hive_partition ([#257](https://www.github.com/googleapis/python-bigquery/issues/257)) ([aa1613c](https://www.github.com/googleapis/python-bigquery/commit/aa1613c1bf48c7efb999cb8b8c422c80baf1950b)) ### Bug Fixes * fix dependency issue in fastavro ([#241](https://www.github.com/googleapis/python-bigquery/issues/241)) ([2874abf](https://www.github.com/googleapis/python-bigquery/commit/2874abf4827f1ea529519d4b138511d31f732a50)) * update minimum dependency versions ([#263](https://www.github.com/googleapis/python-bigquery/issues/263)) ([1be66ce](https://www.github.com/googleapis/python-bigquery/commit/1be66ce94a32b1f924bdda05d068c2977631af9e)) * validate job_config.source_format in load_table_from_dataframe ([#262](https://www.github.com/googleapis/python-bigquery/issues/262)) ([6160fee](https://www.github.com/googleapis/python-bigquery/commit/6160fee4b1a79b0ea9031cc18caf6322fe4c4084)) ### Documentation * recommend insert_rows_json to avoid call to tables.get ([#258](https://www.github.com/googleapis/python-bigquery/issues/258)) ([ae647eb](https://www.github.com/googleapis/python-bigquery/commit/ae647ebd68deff6e30ca2cffb5b7422c6de4940b)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please).
Fixes #166.
Fixes #108.
This PR is to preview the custom cell magic line parser.
PR checklist