schema: support encoding=None connections #172

Totktonada · 2020-08-26T03:50:28Z

Several different problems are fixed here, but all have the same root.
When a connection encoding is None (it is default on Python 2 and may be
set explicitly on Python 3), all mp_str values are decoded into bytes,
not Unicode strings (note that bytes is alias for str in Python 2). But
the database schema parsing code have assumptions that _vspace / _vindex
values are Unicode strings.

The resolved problems are the following:

Default encoding in bytes#decode() method is 'ascii', however names
in tarantool can contain symbols beyond ASCII symbol table. Set
'utf-8' for names decoding.
Convert all binary values into Unicode strings before parse or store
them. This allows further correct accesses to the local schema
representation.
Convert binary parameters like space, index or field name into
Unicode strings, when a schema is accessed to don't trigger redundant
schema refetching.

Those problems are briefly mentioned in 1.

Tested manually with Python 2 and Python 3: my testing tarantool
instance has a space with name '©' and after the changes I'm able to
connect to it when the connection encoding is set to None. Also I
verified that schema is not fetched each time when I do
<connection>.select('©') in Python 2 (where such string literal is str /
bytes, not Unicode string).

Added relevant test cases as separate commits within the PR.

Several different problems are fixed here, but all have the same root. When a connection encoding is None (it is default on Python 2 and may be set explicitly on Python 3), all mp_str values are decoded into bytes, not Unicode strings (note that bytes is alias for str in Python 2). But the database schema parsing code have assumptions that _vspace / _vindex values are Unicode strings. The resolved problems are the following: 1. Default encoding in bytes#decode() method is 'ascii', however names in tarantool can contain symbols beyond ASCII symbol table. Set 'utf-8' for names decoding. 2. Convert all binary values into Unicode strings before parse or store them. This allows further correct accesses to the local schema representation. 3. Convert binary parameters like space, index or field name into Unicode strings, when a schema is accessed to don't trigger redundant schema refetching. Those problems are briefly mentioned in [1]. Tested manually with Python 2 and Python 3: my testing tarantool instance has a space with name '©' and after the changes I'm able to connect to it when the connection encoding is set to None. Also I verified that schema is not fetched each time when I do <connection>.select('©') in Python 2 (where such string literal is str / bytes, not Unicode string). Relevant test cases are added in next commits. [1]: #105 (comment)

On Python 3 load_tests() function that is defined in unit/suites/__init__.py is called to filter unittest.TestCase classes to run. On Python 2 the function is not called and all unittest.TestCase classes are subject to run. I plan to define a hierarchy of test cases in the schema reload test and exclude the base class from the list to run, so this behaviour is not desired. The reason of the difference between unittest behaviour on Python 2 and Python 3 is described in the TestLoader#discover() function documentation (see [1]): | Changed in version 3.5: Found packages are now checked for load_tests | regardless of whether their path matches pattern, because it is | impossible for a package name to match the default pattern. The default pattern is 'test*.py'. [1]: https://docs.python.org/3/library/unittest.html#unittest.TestLoader.discover

In Python 2 a binary string and a Unicode string with the same ASCII characters are considered equal: | >>> 'x' == u'x' | True | >>> u'x' == 'x' | True It works seamlessly with dictionary keys too: | >>> d = dict() | >>> d[u'x'] = 42 | >>> d['x'] | 42 This, however, does not work for non-ASCII characters: | >>> 'ч' == u'ч' | False | >>> u'ч' == 'ч' | False | >>> d = dict() | >>> d[u'ч'] = 42 | >>> d['ч'] | Traceback (most recent call last): | File "<stdin>", line 1, in <module> | KeyError: '\xd1\x87' The implemented test cases verify schema loading for a space or an index with a non-ASCII name, when it is passed just as a literal: say, '∞'. The main purpose of the commit is to add test cases that verify that no extra schema fetch is performed for a space / index name given as non-ASCII string literal (which is binary string on Python 2). If a cache is implemented without taking care to this, an already cached value may be missed.

Totktonada · 2020-08-28T16:41:12Z

assert max_depth > 0

@artembo suggested to don't use assert this way: it may be disabled in production use. I agreed and found that SchemaError should be raised here (it is internal error in fact and we'll define it properly in the error hierarhy within #174). I added RecursionError for technical matters: different to_unicode_recursive() calls should give different error messages when the recursion reaches its limit.

Totktonada mentioned this pull request Aug 27, 2020

Support msgpack 1.0.0 #173

Merged

Totktonada force-pushed the Totktonada/schema-support-encoding-none branch from 65498cc to 19e1de4 Compare August 28, 2020 15:27

Totktonada added 5 commits August 28, 2020 19:34

test: schema reload with different connection encodings

a9f00b3

test: verify that no extra schema fetches occurs

973ea60

Totktonada force-pushed the Totktonada/schema-support-encoding-none branch from 19e1de4 to 1de5204 Compare August 28, 2020 16:36

Totktonada merged commit e106479 into master Aug 28, 2020

Totktonada deleted the Totktonada/schema-support-encoding-none branch August 28, 2020 23:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

schema: support encoding=None connections #172

schema: support encoding=None connections #172

Totktonada commented Aug 26, 2020 •

edited

Loading

Totktonada commented Aug 28, 2020

schema: support encoding=None connections #172

schema: support encoding=None connections #172

Conversation

Totktonada commented Aug 26, 2020 • edited Loading

Totktonada commented Aug 28, 2020

Totktonada commented Aug 26, 2020 •

edited

Loading