The parser IniConfigFile runs into exceptions when the ini file contains Chinese or Japanese strings #3450

shzhou12 · 2022-06-23T03:19:49Z

For the example ini file:

[Global]
secret-name = "vsphere-creds"
secret-namespace = "kube-system"
insecure-flag = "1"


[Workspace]
server = "xxxxxx"
datacenter = "1-测试部"
default-datastore = "xxxxxxxx"
folder = "/1-测试部/xxxxxxxxx"


[VirtualCenter "xxxxxx"]
datacenters = "1-测试部"

The parser IniConfigFile runs into the following exceptions when parsing the above ini file:

  File "/Users/shan/work/ccx/lib/python3.7/site-packages/insights/core/__init__.py", line 1441, in parse_content
    super(IniConfigFile, self).parse_content(content)
  File "/Users/shan/work/ccx/lib/python3.7/site-packages/insights/core/__init__.py", line 349, in parse_content
    self.doc = self.parse_doc(content)
  File "/Users/shan/work/ccx/lib/python3.7/site-packages/insights/core/__init__.py", line 1438, in parse_doc
    return iniparser.parse_doc("\n".join(content), self, return_defaults=True, return_booleans=False)
  File "/Users/shan/work/ccx/lib/python3.7/site-packages/insights/parsr/iniparser.py", line 100, in parse_doc
    res = Entry(children=Top(content), src=ctx)
  File "/Users/shan/work/ccx/lib/python3.7/site-packages/insights/parsr/__init__.py", line 356, in __call__
    raise Exception(err.read())
Exception: At line 10 column 17:
KVPair -> EOL
    Expected EOL. Got '测'.
KVPair -> EOF
    Expected end of input. Got '测'.
KVPair -> any whitespace
    Expected any whitespace. Got '测'.
KVPair -> Sep
    Expected Sep. Got '测'.
any whitespace
    Expected any whitespace. Got '测'.
Literal'#'
    Expected '#'. Got '测'.
Literal';'
    Expected ';'. Got '测'.
KVPair -> any whitespace
    Expected any whitespace. Got '测'.
KVPair -> any whitespace
    Expected any whitespace. Got '测'.
KVPair
    Expected 1 of [' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ';', '<', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '\\', '^', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~']. Got '测'.
any whitespace
    Expected any whitespace. Got '测'.
Literal'#'
    Expected '#'. Got '测'.
Literal';'
    Expected ';'. Got '测'.
Header -> any whitespace
    Expected any whitespace. Got '测'.
Header
    Expected [. Got '测'.
any whitespace
    Expected any whitespace. Got '测'.
EOF
    Expected end of input. Got '测'.


xiangce · 2022-06-23T08:27:53Z

This is caused by the following basic definition in the parsr module:

insights-core/insights/parsr/__init__.py

Lines 1252 to 1254 in bbc2186

    
    SingleQuotedString = Char("'") >> String(set(string.printable) - set("'"), "'") << Char("'")
    DoubleQuotedString = Char('"') >> String(set(string.printable) - set('"'), '"') << Char('"')
    QuotedString = Wrapper(DoubleQuotedString | SingleQuotedString) % "quoted string"

string.printable contains only the following ASCII characters and no Chinese or Japanese characters:

>>> string.printable
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
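
As a quick check of the point above, here is a minimal sketch that lists which characters of the failing value fall outside string.printable. The variable names are illustrative only and are not part of insights-core.

```python
import string

# Value taken from the example ini file in this issue.
value = "1-测试部"

# Collect every character that is not part of string.printable,
# i.e. the characters that the String(...) definition cannot match.
unsupported = [ch for ch in value if ch not in string.printable]

print(unsupported)  # ['测', '试', '部']
```

The first unsupported character is '测', which matches the character reported in the exception.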

Since IniConfigFile should allow other languages, such as Chinese, Japanese, and Korean, in string values, I think it's necessary to accept characters from these languages in QuotedString. But these languages have too many characters to list in the String set, so we may need a feasible way to redefine QuotedString instead of enumerating all the characters.

What's your idea, @bfahr, @ryan-blakley ?

xiangce · 2022-06-24T01:18:15Z

I like the idea that @koalakangaroo shared with me: before parsing, we could replace these "invalid" characters or words with suitable placeholders built from characters in the valid pool.

This is quite a good approach for fixing the issue quickly. I think it's feasible, much like the IP obfuscation that we do during collection.

@bfahr, @ryan-blakley, Thoughts?
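
One way to picture the replacement idea is the sketch below. It is only an illustration under assumptions: the helper name substitute_unsupported and the "uXXXX" token format are hypothetical and do not come from insights-core. Each character outside string.printable is swapped for a token made of valid characters, so distinct values stay distinct, similar in spirit to consistent IP obfuscation.

```python
import string

PRINTABLE = set(string.printable)


def substitute_unsupported(text):
    """Replace each character outside string.printable with a 'uXXXX' token.

    Hypothetical helper: the token format is an assumption chosen so that
    the result contains only characters the parser already accepts.
    """
    return "".join(
        ch if ch in PRINTABLE else "u{:04X}".format(ord(ch))
        for ch in text
    )


line = 'datacenter = "1-测试部"'
cleaned = substitute_unsupported(line)
print(cleaned)
```

The trade-off is that the parsed value no longer matches the original text, so any rule comparing against the real name would need to apply the same substitution.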

ryan-blakley · 2022-06-29T18:02:43Z

@xiangce after playing around with the symbols, I found that the unidecode Python module can convert the Unicode characters to ASCII characters. But it seems that module isn't available in RHEL. You mentioned replacing the characters; do you know of an easier way to replace Unicode characters?

xiangce · 2022-06-30T08:17:47Z

@ryan-blakley - Nope, I have no idea about this either.

And I'm also not sure unidecode is suitable for this case. As the following example shows, the conversion adds a space inside the proper noun "北京", giving "Bei Jing". From the very beginning, I was only thinking about "replacing", not "translating".

>>> from unidecode import unidecode
>>> city = "北京"
>>> print(unidecode(city))
Bei Jing

ryan-blakley · 2022-07-01T18:36:07Z

@xiangce - Yeah, I noticed the space; that was another reason I figured it wouldn't work. If you're good with replacing, I think the below may work.

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'/1-测试部/xxxxxxxxx').encode('ascii', 'replace')
'/1-???/xxxxxxxxx'
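
For reference, the same call could be wrapped and applied to each line before parsing, roughly as sketched below. The helper name to_ascii and the content list are assumptions for illustration and may differ from what the eventual fix does. On Python 3, encode() returns bytes, so the sketch decodes back to str.

```python
import unicodedata


def to_ascii(line):
    """Replace every non-ASCII character in line with a question mark.

    Hypothetical helper built on the normalize/encode call shown above.
    """
    normalized = unicodedata.normalize("NFKD", line)
    # encode(..., "replace") substitutes "?" for anything ASCII cannot hold;
    # decode turns the resulting bytes back into a str on Python 3.
    return normalized.encode("ascii", "replace").decode("ascii")


# Lines taken from the example ini file in this issue.
content = ['datacenter = "1-测试部"', 'folder = "/1-测试部/xxxxxxxxx"']
cleaned = [to_ascii(line) for line in content]

print(cleaned[1])  # folder = "/1-???/xxxxxxxxx"
```

Because every replaced character becomes the same "?", different original values of the same length become indistinguishable after cleaning.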

* Iniparser is only setup to parse string.printable characters. This doesn't include non ascii characters from various languages, which causes an exception when they're in a config file. So replace the characters with question marks. * Fixes #3450 Signed-off-by: Ryan Blakley <[email protected]>

* Iniparser is only setup to parse string.printable characters. This doesn't include non ascii characters from various languages, which causes an exception when they're in a config file. So replace the characters with question marks. * Fixes #3450 Signed-off-by: Ryan Blakley <[email protected]> (cherry picked from commit 1fe7320)


xiangce self-assigned this Jun 23, 2022

ryan-blakley mentioned this issue Jul 11, 2022

fix: Replace non ascii characters with question marks #3464

Merged

3 tasks

xiangce closed this as completed in #3464 Jul 13, 2022
