Make representation consistent for all Delphix-provided strings #387

Open
mothslaw opened this issue Jun 29, 2021 · 0 comments

Is your feature request related to a problem? Please describe.
The representation of engine-provided strings is not consistent, which makes it hard for plugins to get searching/matching correct when dealing with non-ASCII strings.

Engine-provided strings are sent to the container using a Protobuf message, which is then unpacked by the Delphix wrapper code. This gives us string objects which we pass along to the plugin code as-is. According to the Protobuf documentation, these strings can be in the following formats:

  1. If all of the characters in the string are ASCII-representable, then the string object will be of type str and will contain the ASCII-encoded bytes that represent the string.
  2. If there is at least one non-ASCII-representable character in the string, then the string object can be one of two types (it's not guaranteed which one we might get):
    a) A unicode object, containing the characters in the string.
    b) A str object, containing the UTF-8-encoded bytes that represent the string.

So, imagine a string that begins with the character ë. And, imagine a plugin wants to check that, indeed, the string begins with that character. You might think the plugin could just do this:

import re

pattern = re.compile(u'ë')
pattern.match(the_string)

This will work fine for case (2a). But, it will not work for case (2b). After all, in case (2b) we've only got a str object. The str object does not contain characters; it contains bytes. So, the first two bytes here are c3-ab (the UTF-8 encoding of our character ë).
Also, there's no way for the re module to know what encoding might be in play. So, the re module cannot know that c3-ab should be interpreted as ë. So, for case (2b), the plugin would need to do something like this:

pattern = re.compile(u'ë')
uni_string = the_string.decode(u'utf-8')
pattern.match(uni_string)
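
To make the byte-level picture concrete, here is a minimal sketch of case (2b). It does not use any Delphix API; the sample string and variable names are made up for illustration, and the character is written as an escape (\u00eb) so the snippet does not depend on the source file's encoding.

```python
import re

E_DIAERESIS = u"\u00eb"            # the character ë
text = E_DIAERESIS + u"xample"     # hypothetical engine string, as text
encoded = text.encode("utf-8")     # what case (2b) would hand to the plugin

# ë occupies two bytes in UTF-8, so the byte string is one longer than the text.
first_two_bytes = encoded[:2]

# Decoding first turns the bytes back into characters, so the pattern can match.
pattern = re.compile(E_DIAERESIS)
decoded = encoded.decode("utf-8")
match = pattern.match(decoded)
```

Here first_two_bytes is the c3-ab pair described above, and match is only found because the string was decoded before matching.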

But, of course, this code does not work for case (2a). So, now the plugin needs to have special code to do different things for cases (2a) and (2b). For example, a plugin author could write a function like this and call it for every single string the plugin ever receives from the engine:

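def force_engine_string_to_unicode(engine_string):
  # Case (2a): already a unicode object, so return it unchanged.
  if isinstance(engine_string, unicode):
    return engine_string
  # Cases (1) and (2b): a str of ASCII or UTF-8 bytes, so decode it.
  else:
    return engine_string.decode(u"utf-8")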

Describe the solution you'd like
The plugin shouldn't have to jump through hoops like the above just to do string searching. It'd be better if the Delphix wrappers could give a consistent string representation to the plugin.

I think the rules should be:

  1. When the wrapper provides a string to the plugin, it will always supply a unicode string to the plugin. Never a str string.
  2. When the plugin provides a string to the wrapper, the wrapper will accept either a unicode string, or an ASCII- or UTF8-encoded str string. (The wrapper already supports this)
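
As a rough sketch of what rule 1 could look like, the wrapper might normalize every string before handing data to the plugin. This is not the actual wrapper code: to_text and normalize_strings are hypothetical names, and the real wrapper works with Protobuf messages rather than the plain dicts and lists assumed here. The isinstance check against bytes is used because on Python 2 bytes is the same type as str.

```python
def to_text(value, encoding="utf-8"):
    """Return byte strings decoded to text; return anything else unchanged."""
    if isinstance(value, bytes):  # on Python 2, `bytes` is an alias for `str`
        return value.decode(encoding)
    return value


def normalize_strings(obj):
    """Recursively convert every byte string inside dicts and lists to text."""
    if isinstance(obj, dict):
        return {normalize_strings(k): normalize_strings(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [normalize_strings(item) for item in obj]
    return to_text(obj)


# Hypothetical engine data mixing cases (2b), (2a), and a non-string value.
engine_data = {b"name": [b"\xc3\xabx", u"\u00ebx", 3]}
plugin_data = normalize_strings(engine_data)
```

With something like this in the wrapper, the plugin would see the same text string for both representations and would not need its own conversion helper.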

Describe alternatives you've considered
Another alternative would be for the wrapper to always provide UTF8-encoded str objects. At least that would be consistent. However, this still makes searching/matching a bit cumbersome, since now the plugin needs to worry about encoding and decoding rules.
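
A small illustration of why matching against UTF-8 bytes stays cumbersome, using only the standard re module (the variable names are made up for this example): the plugin has to encode its patterns too, and constructs like "." operate on single bytes rather than whole characters.

```python
import re

encoded = u"\u00eb".encode("utf-8")   # the character ë as UTF-8 bytes

# The pattern itself must be encoded with the same encoding as the data.
byte_pattern = re.compile(u"\u00eb".encode("utf-8"))
literal_match = byte_pattern.match(encoded)

# "." matches one byte, so it captures only the first half of ë.
any_char = re.match(b".", encoded).group()
```

Here literal_match succeeds, but any_char holds only the lone c3 byte, which is not a valid character on its own.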
