Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using string array with magic block results in whitespace removal from first entry of the array #166

Closed
amardeep opened this issue Jul 14, 2020 · 7 comments · Fixed by #213
Assignees
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@amardeep
Copy link

On passing a dict { 'cats': ['apple orange', 'pear plum']} as params in bigquery magic cell, the first value is changed to appleorange - the space character is filtered out.

Environment details

  • OS type and version: colab
  • Python version: Python 3.6.9
  • pip version: 19.3.1
  • google-cloud-bigquery version: 1.21.0

Steps to reproduce

A colab notebook illustrating the error:
https://colab.research.google.com/gist/amardeep/63ec303ba8bac3db9849f4044cd19ff1/test-bigquery-array-parameter-bug.ipynb

Code example

params = {
    'cats': ['apple orange', 'pear plum']
}

%%bigquery  --params $params
SELECT * FROM UNNEST(@cats)

This results in the output:

           f0_
0  appleorange
1    pear plum
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Jul 14, 2020
@yoshi-automation yoshi-automation added the triage me I really want to be triaged. label Jul 15, 2020
@IlyaFaer IlyaFaer added type: question Request for information or clarification. Not an issue. and removed triage me I really want to be triaged. labels Jul 15, 2020
@HemangChothani
Copy link
Contributor

@amardeep The problem is related to the IPython library which used for the magics , the magic_arguments.parse_argstring method consider first element of list as a two different element so when we use join() method element will be merged.

@plamut Cloud you help in this , i don't have that much idea about the IPython 's limitation.

@plamut
Copy link
Contributor

plamut commented Jul 17, 2020

@HemangChothani @amardeep This appears to be a yet another case where IPython's argument parsing turns out to be fragile at times. It's the same reason why negative numbers in $params variable would be incorrectly recognized as additional arguments (details). The workaround also mentioned in the comment is to pass params as a JSON string variable.

Perhaps the docs on the --params option are not clear enough (they only mention the negative numbers case), but aside from possibly further improving the wording I don't think this issue can be reliably fixed in the client library itself.

Edit: Just tried it and it doesn't work, unfortunately, the spaces confuse the parser even if the value is passed as a JSON string. Maybe the we can hack around this in in the _cell_magic() function, but even if we succeed, it would still be dealing with the symptom and it might not work in all cases.

@HemangChothani Do you want me to have a closer look when I find some time?

@amardeep
Copy link
Author

It doesn't work with JSON string.

params = {
    'cats': ['apple orange', 'pear plum']
}
jparams = json.dumps(params)
jparams
'{"cats": ["apple orange", "pear plum"]}'
%%bigquery df --params $jparams
SELECT * FROM UNNEST(@cats)
           f0_
0  appleorange
1    pear plum

@HemangChothani
Copy link
Contributor

@plamut Yes, Actually i am not available for few days, so please take a look into it when you get some time, thanks.

@plamut
Copy link
Contributor

plamut commented Jul 20, 2020

The following argument string:

 --params {'cats': ['apple orange', 'pear plum']}

is tokenized by IPython as follows:

  • --params
  • {'cats':
  • ['apple
  • orange',
  • 'pear plum'
  • ]}

The parser is not aware of dicts, lists, and similar structures, thus it does not treat them as a single entity and splits them into multiple tokens. On the other hand it does appear that the parser at least recognizes multi-word strings if they are enclosed in quotes, which is why it produces 'pear plum' as a single token.

It made me wondering if we could trick the parser to treat the first array item, i.e. 'apple orange' the same, and it turned out that if we inject a space between [ and ', we can get around the issue.

>>> params = {
...         'cats': ['apple orange', 'pear plum', 'foo bar']
... }
>>> params_json = json.dumps(params)
>>> with_space = params_json.replace('["', '[ "')  # <-- Hocus-pocus, abracadabra! Magic!
>>> with_space
'{"cats": [ "apple orange", "pear plum", "foo bar"]}'

Such modified JSON string gets tokenized as follows:

  • --params
  • {"cats":
  • [
  • "apple orange"
  • "pear plum"
  • ...

The space between "apple" and "orange" is not lost anymore and correct result is produced:

%%bigquery df --params $with_space
SELECT * FROM UNNEST(@cats)

            f0_
0  apple orange
1     pear plum
2       foo bar

Such manipulation of JSON strings is, of course, far from ideal, but at least it's straightforward and should work in most cases. We are kind of limited with IPython's argument parser here.

@amardeep
Copy link
Author

Other option to consider is not using IPython argument parser, and custom parse line argument.

@plamut
Copy link
Contributor

plamut commented Jul 22, 2020

Considering the different edge cases reported where the built-in parser falls flat, we might actually have to do this at some point, yes. The replacement parser would of course have to be compatible with the existing ones in order to not break existing notebooks.

Edit: Ha! IPython's parser is based on argparse, and the latter support the type argument to argument declaration. Since type can be any callable, we can create a custom handler to correctly recognize the value passed to --params. I can take a stab at it in the next few days if the time allows.

Edit 2: Unfortunately, any type callback receives an already tokenized string, one token after --params at a time. It really seems a completely different parser would be required.

@plamut plamut added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. and removed type: question Request for information or clarification. Not an issue. labels Jul 22, 2020
@plamut plamut self-assigned this Jul 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants