-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
schema: support encoding=None connections #172
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Merged
Totktonada
force-pushed
the
Totktonada/schema-support-encoding-none
branch
from
August 28, 2020 15:27
65498cc
to
19e1de4
Compare
Several different problems are fixed here, but all have the same root. When a connection encoding is None (it is default on Python 2 and may be set explicitly on Python 3), all mp_str values are decoded into bytes, not Unicode strings (note that bytes is alias for str in Python 2). But the database schema parsing code have assumptions that _vspace / _vindex values are Unicode strings. The resolved problems are the following: 1. Default encoding in bytes#decode() method is 'ascii', however names in tarantool can contain symbols beyond ASCII symbol table. Set 'utf-8' for names decoding. 2. Convert all binary values into Unicode strings before parse or store them. This allows further correct accesses to the local schema representation. 3. Convert binary parameters like space, index or field name into Unicode strings, when a schema is accessed to don't trigger redundant schema refetching. Those problems are briefly mentioned in [1]. Tested manually with Python 2 and Python 3: my testing tarantool instance has a space with name '©' and after the changes I'm able to connect to it when the connection encoding is set to None. Also I verified that schema is not fetched each time when I do <connection>.select('©') in Python 2 (where such string literal is str / bytes, not Unicode string). Relevant test cases are added in next commits. [1]: #105 (comment)
On Python 3 load_tests() function that is defined in unit/suites/__init__.py is called to filter unittest.TestCase classes to run. On Python 2 the function is not called and all unittest.TestCase classes are subject to run. I plan to define a hierarchy of test cases in the schema reload test and exclude the base class from the list to run, so this behaviour is not desired. The reason of the difference between unittest behaviour on Python 2 and Python 3 is described in the TestLoader#discover() function documentation (see [1]): | Changed in version 3.5: Found packages are now checked for load_tests | regardless of whether their path matches pattern, because it is | impossible for a package name to match the default pattern. The default pattern is 'test*.py'. [1]: https://docs.python.org/3/library/unittest.html#unittest.TestLoader.discover
In Python 2 a binary string and a Unicode string with the same ASCII characters are considered equal: | >>> 'x' == u'x' | True | >>> u'x' == 'x' | True It works seamlessly with dictionary keys too: | >>> d = dict() | >>> d[u'x'] = 42 | >>> d['x'] | 42 This, however, does not work for non-ASCII characters: | >>> 'ч' == u'ч' | False | >>> u'ч' == 'ч' | False | >>> d = dict() | >>> d[u'ч'] = 42 | >>> d['ч'] | Traceback (most recent call last): | File "<stdin>", line 1, in <module> | KeyError: '\xd1\x87' The implemented test cases verify schema loading for a space or an index with a non-ASCII name, when it is passed just as a literal: say, '∞'. The main purpose of the commit is to add test cases that verify that no extra schema fetch is performed for a space / index name given as non-ASCII string literal (which is binary string on Python 2). If a cache is implemented without taking care to this, an already cached value may be missed.
Totktonada
force-pushed
the
Totktonada/schema-support-encoding-none
branch
from
August 28, 2020 16:36
19e1de4
to
1de5204
Compare
assert max_depth > 0 @artembo suggested to don't use |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Several different problems are fixed here, but all have the same root.
When a connection encoding is None (it is default on Python 2 and may be
set explicitly on Python 3), all mp_str values are decoded into bytes,
not Unicode strings (note that bytes is alias for str in Python 2). But
the database schema parsing code have assumptions that _vspace / _vindex
values are Unicode strings.
The resolved problems are the following:
in tarantool can contain symbols beyond ASCII symbol table. Set
'utf-8' for names decoding.
them. This allows further correct accesses to the local schema
representation.
Unicode strings, when a schema is accessed to don't trigger redundant
schema refetching.
Those problems are briefly mentioned in 1.
Tested manually with Python 2 and Python 3: my testing tarantool
instance has a space with name '©' and after the changes I'm able to
connect to it when the connection encoding is set to None. Also I
verified that schema is not fetched each time when I do
<connection>.select('©') in Python 2 (where such string literal is str /
bytes, not Unicode string).
Added relevant test cases as separate commits within the PR.