Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

schema: support encoding=None connections #172

Merged
merged 5 commits into from
Aug 28, 2020

Conversation

Totktonada
Copy link
Member

@Totktonada Totktonada commented Aug 26, 2020

Several different problems are fixed here, but all have the same root.
When a connection encoding is None (it is default on Python 2 and may be
set explicitly on Python 3), all mp_str values are decoded into bytes,
not Unicode strings (note that bytes is alias for str in Python 2). But
the database schema parsing code have assumptions that _vspace / _vindex
values are Unicode strings.

The resolved problems are the following:

  1. Default encoding in bytes#decode() method is 'ascii', however names
    in tarantool can contain symbols beyond ASCII symbol table. Set
    'utf-8' for names decoding.
  2. Convert all binary values into Unicode strings before parse or store
    them. This allows further correct accesses to the local schema
    representation.
  3. Convert binary parameters like space, index or field name into
    Unicode strings, when a schema is accessed to don't trigger redundant
    schema refetching.

Those problems are briefly mentioned in 1.

Tested manually with Python 2 and Python 3: my testing tarantool
instance has a space with name '©' and after the changes I'm able to
connect to it when the connection encoding is set to None. Also I
verified that schema is not fetched each time when I do
<connection>.select('©') in Python 2 (where such string literal is str /
bytes, not Unicode string).

Added relevant test cases as separate commits within the PR.

@Totktonada Totktonada mentioned this pull request Aug 27, 2020
@Totktonada Totktonada force-pushed the Totktonada/schema-support-encoding-none branch from 65498cc to 19e1de4 Compare August 28, 2020 15:27
Several different problems are fixed here, but all have the same root.
When a connection encoding is None (it is default on Python 2 and may be
set explicitly on Python 3), all mp_str values are decoded into bytes,
not Unicode strings (note that bytes is alias for str in Python 2). But
the database schema parsing code have assumptions that _vspace / _vindex
values are Unicode strings.

The resolved problems are the following:

1. Default encoding in bytes#decode() method is 'ascii', however names
   in tarantool can contain symbols beyond ASCII symbol table. Set
   'utf-8' for names decoding.
2. Convert all binary values into Unicode strings before parse or store
   them. This allows further correct accesses to the local schema
   representation.
3. Convert binary parameters like space, index or field name into
   Unicode strings, when a schema is accessed to don't trigger redundant
   schema refetching.

Those problems are briefly mentioned in [1].

Tested manually with Python 2 and Python 3: my testing tarantool
instance has a space with name '©' and after the changes I'm able to
connect to it when the connection encoding is set to None. Also I
verified that schema is not fetched each time when I do
<connection>.select('©') in Python 2 (where such string literal is str /
bytes, not Unicode string).

Relevant test cases are added in next commits.

[1]: #105 (comment)
On Python 3 load_tests() function that is defined in
unit/suites/__init__.py is called to filter unittest.TestCase classes to
run.

On Python 2 the function is not called and all unittest.TestCase classes
are subject to run. I plan to define a hierarchy of test cases in the
schema reload test and exclude the base class from the list to run, so
this behaviour is not desired.

The reason of the difference between unittest behaviour on Python 2 and
Python 3 is described in the TestLoader#discover() function
documentation (see [1]):

 | Changed in version 3.5: Found packages are now checked for load_tests
 | regardless of whether their path matches pattern, because it is
 | impossible for a package name to match the default pattern.

The default pattern is 'test*.py'.

[1]: https://docs.python.org/3/library/unittest.html#unittest.TestLoader.discover
In Python 2 a binary string and a Unicode string with the same
ASCII characters are considered equal:

 | >>> 'x' == u'x'
 | True
 | >>> u'x' == 'x'
 | True

It works seamlessly with dictionary keys too:

 | >>> d = dict()
 | >>> d[u'x'] = 42
 | >>> d['x']
 | 42

This, however, does not work for non-ASCII characters:

 | >>> 'ч' == u'ч'
 | False
 | >>> u'ч' == 'ч'
 | False

 | >>> d = dict()
 | >>> d[u'ч'] = 42
 | >>> d['ч']
 | Traceback (most recent call last):
 |   File "<stdin>", line 1, in <module>
 | KeyError: '\xd1\x87'

The implemented test cases verify schema loading for a space or an index
with a non-ASCII name, when it is passed just as a literal: say, '∞'.

The main purpose of the commit is to add test cases that verify that no
extra schema fetch is performed for a space / index name given as
non-ASCII string literal (which is binary string on Python 2). If a
cache is implemented without taking care to this, an already cached
value may be missed.
@Totktonada Totktonada force-pushed the Totktonada/schema-support-encoding-none branch from 19e1de4 to 1de5204 Compare August 28, 2020 16:36
@Totktonada
Copy link
Member Author

assert max_depth > 0

@artembo suggested to don't use assert this way: it may be disabled in production use. I agreed and found that SchemaError should be raised here (it is internal error in fact and we'll define it properly in the error hierarhy within #174). I added RecursionError for technical matters: different to_unicode_recursive() calls should give different error messages when the recursion reaches its limit.

@Totktonada Totktonada merged commit e106479 into master Aug 28, 2020
@Totktonada Totktonada deleted the Totktonada/schema-support-encoding-none branch August 28, 2020 23:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant