The parser IniConfigFile runs into exceptions when the ini file contains Chinese or Japanese strings #3450

Closed
shzhou12 opened this issue Jun 23, 2022 · 5 comments · Fixed by #3464

@shzhou12

For the example ini file:

[Global]
secret-name = "vsphere-creds"
secret-namespace = "kube-system"
insecure-flag = "1"


[Workspace]
server = "xxxxxx"
datacenter = "1-测试部"
default-datastore = "xxxxxxxx"
folder = "/1-测试部/xxxxxxxxx"


[VirtualCenter "xxxxxx"]
datacenters = "1-测试部" 

The parser IniConfigFile runs into the following exceptions when parsing the above ini file:

  File "/Users/shan/work/ccx/lib/python3.7/site-packages/insights/core/__init__.py", line 1441, in parse_content
    super(IniConfigFile, self).parse_content(content)
  File "/Users/shan/work/ccx/lib/python3.7/site-packages/insights/core/__init__.py", line 349, in parse_content
    self.doc = self.parse_doc(content)
  File "/Users/shan/work/ccx/lib/python3.7/site-packages/insights/core/__init__.py", line 1438, in parse_doc
    return iniparser.parse_doc("\n".join(content), self, return_defaults=True, return_booleans=False)
  File "/Users/shan/work/ccx/lib/python3.7/site-packages/insights/parsr/iniparser.py", line 100, in parse_doc
    res = Entry(children=Top(content), src=ctx)
  File "/Users/shan/work/ccx/lib/python3.7/site-packages/insights/parsr/__init__.py", line 356, in __call__
    raise Exception(err.read())
Exception: At line 10 column 17:
KVPair -> EOL
    Expected EOL. Got '测'.
KVPair -> EOF
    Expected end of input. Got '测'.
KVPair -> any whitespace
    Expected any whitespace. Got '测'.
KVPair -> Sep
    Expected Sep. Got '测'.
any whitespace
    Expected any whitespace. Got '测'.
Literal'#'
    Expected '#'. Got '测'.
Literal';'
    Expected ';'. Got '测'.
KVPair -> any whitespace
    Expected any whitespace. Got '测'.
KVPair -> any whitespace
    Expected any whitespace. Got '测'.
KVPair
    Expected 1 of [' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ';', '<', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '\\', '^', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~']. Got '测'.
any whitespace
    Expected any whitespace. Got '测'.
Literal'#'
    Expected '#'. Got '测'.
Literal';'
    Expected ';'. Got '测'.
Header -> any whitespace
    Expected any whitespace. Got '测'.
Header
    Expected [. Got '测'.
any whitespace
    Expected any whitespace. Got '测'.
EOF
    Expected end of input. Got '测'.
xiangce self-assigned this Jun 23, 2022
@xiangce

xiangce commented Jun 23, 2022

This is caused by the following basic definitions in the parsr module:

SingleQuotedString = Char("'") >> String(set(string.printable) - set("'"), "'") << Char("'")
DoubleQuotedString = Char('"') >> String(set(string.printable) - set('"'), '"') << Char('"')
QuotedString = Wrapper(DoubleQuotedString | SingleQuotedString) % "quoted string"

string.printable includes only the following ASCII characters; it contains no Chinese or Japanese characters:

>>> string.printable
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
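A minimal sketch of how one might confirm which characters in a value fall outside string.printable. The helper name `non_printable_chars` is hypothetical and is not part of insights-core; it only illustrates the membership test that the String definitions above rely on.

```python
import string

# The same pool of characters that the quoted-string definitions draw from.
PRINTABLE = set(string.printable)


def non_printable_chars(text):
    """Return, in order, the characters of `text` that are not in string.printable."""
    return [ch for ch in text if ch not in PRINTABLE]


# The value from the [Workspace] section of the example file.
print(non_printable_chars('"1-测试部"'))        # ['测', '试', '部']
print(non_printable_chars('"vsphere-creds"'))  # []
```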

Since IniConfigFile allows other languages such as Chinese, Japanese, and Korean in string values, I think it's necessary to accept characters from these languages in QuotedString. However, these languages have far too many characters to list in the String set, so we may need a feasible way to redefine QuotedString rather than enumerating every character.
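One way to picture "not enumerating all the characters" is to define the string by what it excludes (the closing quote) instead of by what it includes. The sketch below uses the standard library `re` module purely as an illustration; it is not parsr code, and `QUOTED_STRING` and `parse_quoted` are hypothetical names.

```python
import re

# Accept any character except the closing quote, instead of listing allowed ones.
DOUBLE_QUOTED = r'"([^"]*)"'
SINGLE_QUOTED = r"'([^']*)'"
QUOTED_STRING = re.compile(DOUBLE_QUOTED + "|" + SINGLE_QUOTED)


def parse_quoted(text):
    """Return the content of a quoted string, or raise ValueError if it is malformed."""
    match = QUOTED_STRING.fullmatch(text.strip())
    if match is None:
        raise ValueError("not a quoted string: {!r}".format(text))
    # Exactly one of the two groups is set, depending on the quote style used.
    return match.group(1) if match.group(1) is not None else match.group(2)


print(parse_quoted('"/1-测试部/xxxxxxxxx"'))  # /1-测试部/xxxxxxxxx
```

Whether an equivalent "everything but the quote" definition can be expressed with the existing parsr primitives is left open here.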

What's your idea, @bfahr, @ryan-blakley ?

@xiangce

xiangce commented Jun 24, 2022

I like the idea that @koalakangaroo shared with me: before parsing, we could replace these "invalid" characters or words with suitable placeholder words built from the pool of valid characters.

This is quite a good approach for fixing the issue quickly. I think it's feasible, much like the IP obfuscation that we do during collection.
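A rough sketch of that replacement idea, assuming each character outside string.printable is swapped for an ASCII token derived from its code point. The function name `mask_non_printable` and the `uXXXX` token format are assumptions made for illustration, not something proposed in this thread, and such tokens could collide with text that already looks like them.

```python
import string

PRINTABLE = set(string.printable)


def mask_non_printable(text):
    """Replace each character outside string.printable with an ASCII token.

    Returns the masked text and a token -> original character table, so the
    substitution could be reversed after parsing if needed.
    """
    table = {}
    pieces = []
    for ch in text:
        if ch in PRINTABLE:
            pieces.append(ch)
        else:
            token = "u{:04x}".format(ord(ch))  # hypothetical token format
            table[token] = ch
            pieces.append(token)
    return "".join(pieces), table


masked, table = mask_non_printable('datacenter = "1-测试部"')
print(masked)  # contains only characters from string.printable
```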

@bfahr, @ryan-blakley, Thoughts?

@ryan-blakley

@xiangce after playing around with the symbols, I found that the unidecode Python module can convert Unicode characters to ASCII characters. However, that module doesn't seem to be available in RHEL. You mentioned replacing the characters; do you know of an easier way to replace Unicode characters?

@xiangce

xiangce commented Jun 30, 2022

@ryan-blakley - Nope, I have no idea about this either.

I'm also not sure whether unidecode is suitable for this case. As the following example shows, the conversion adds a blank space to the proper noun "北京", which becomes "Bei Jing". At the very beginning, I was only thinking of "replacing", not "translating".

>>> from unidecode import unidecode
>>> city = "北京"
>>> print(unidecode(city))
Bei Jing

@ryan-blakley

@xiangce - Yeah, I noticed the space; that was another reason I figured it wouldn't work. If you're good with replacing, I think the following may work.

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'/1-测试部/xxxxxxxxx').encode('ascii', 'replace')
'/1-???/xxxxxxxxx'
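On Python 3, `str.encode` returns `bytes`, so the same expression prints with a `b` prefix; decoding the result gives a `str` again. A minimal sketch of the same replacement wrapped in a helper follows. The name `to_ascii_with_placeholders` is hypothetical and is not taken from the merged fix.

```python
import unicodedata


def to_ascii_with_placeholders(text):
    """Replace every character that has no ASCII form with '?'."""
    normalized = unicodedata.normalize("NFKD", text)
    # 'replace' substitutes '?' for each code point ASCII cannot encode.
    return normalized.encode("ascii", "replace").decode("ascii")


print(to_ascii_with_placeholders("/1-测试部/xxxxxxxxx"))  # /1-???/xxxxxxxxx
```

Each CJK character maps to exactly one `?`, so the length of the value is preserved, but the original text cannot be recovered from the result.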

xiangce pushed a commit that referenced this issue Jul 13, 2022
* Iniparser is only setup to parse string.printable characters. This
  doesn't include non ascii characters from various languages, which
  causes an exception when they're in a config file. So replace the
  characters with question marks.
* Fixes #3450

Signed-off-by: Ryan Blakley <[email protected]>
xiangce pushed a commit that referenced this issue Jul 13, 2022
(cherry picked from commit 1fe7320)
xiangce pushed a commit that referenced this issue Sep 6, 2024