Make representation consistent for all Delphix-provided strings #387

mothslaw · 2021-06-29T14:05:53Z

Is your feature request related to a problem? Please describe.
The representation of engine-provided strings is not consistent, which makes it hard for plugins to get searching/matching correct when dealing with non-ASCII strings.

Engine-provided strings are sent to the container using a Protobuf message, which is then unpacked by the Delphix wrapper code. This gives us string objects which we pass along to the plugin code as-is. According to the Protobuf documentation, these strings can be in the following formats:

1. If all of the characters in the string are ASCII-representable, then the string object will be of type str and will contain the ASCII-encoded bytes that represent the string.
2. If there is at least one non-ASCII-representable character in the string, then the string object can be in one of two types (it's not guaranteed which one we might get):
   a) A unicode object, containing the characters in the string.
   b) A str object, containing the UTF8-encoded bytes that represent the string.
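To make the difference between cases (2a) and (2b) concrete, here is a minimal sketch of the same text in both representations. The variable names are illustrative only. The snippet is written so that it runs on Python 2.7 and Python 3; on Python 2, `text` is a unicode object and `encoded` is a str object.

```python
# Illustrative only: the same text in the two representations described above.
text = u"\u00ebxyz"              # characters: ë, x, y, z   (case 2a)
encoded = text.encode("utf-8")   # UTF-8 encoded bytes      (case 2b)

print(len(text))                   # 4 characters
print(len(encoded))                # 5 bytes, because ë takes two bytes in UTF-8
print(encoded[:2] == b"\xc3\xab")  # True: the first two bytes are c3 ab

# Decoding the bytes recovers the original characters.
print(encoded.decode("utf-8") == text)  # True
```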

So, imagine a string that begins with the character ë. And, imagine a plugin wants to check that, indeed, the string begins with that character. You might think the plugin could just do this:

import re

pattern = re.compile(u'ë')
pattern.match(the_string)

This will work fine for case (2a). But, it will not work for case (2b). After all, in case (2b) we've only got a str object. The str object does not contain characters; it contains bytes. So, the first two bytes here are c3-ab (the UTF-8 encoding for our character ë).
Also, there's no way for the re module to know what encoding might be in play. So, the re module cannot know that c3-ab should be interpreted as ë. So, for case (2b), the plugin would need to do something like this:

pattern = re.compile(u'ë')
uni_string = the_string.decode(u'utf-8')
pattern.match(uni_string)

But, of course, this code does not work for case (2a). So, now the plugin needs to have special code to do different things for cases (2a) and (2b). For example, they could write a function like this that they call for every single string that they ever receive from the engine:

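def force_engine_string_to_unicode(engine_string):
  if isinstance(engine_string, unicode):
    # Case (2a): already a unicode object, so return it unchanged.
    return engine_string
  else:
    # Cases (1) and (2b): a str of ASCII or UTF-8 bytes, so decode it.
    return engine_string.decode(u"utf-8")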
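As a rough usage sketch, a helper of this kind lets the earlier regex check behave the same for both representations. The name `to_text` is hypothetical, and the check against `bytes` is an assumption made so that the snippet runs on both Python 2.7 (where `bytes` is an alias for `str`) and Python 3.

```python
import re

def to_text(engine_string):
    """Return engine_string as text, decoding UTF-8 bytes if necessary."""
    if isinstance(engine_string, bytes):  # on Python 2, bytes is str
        return engine_string.decode("utf-8")
    return engine_string

pattern = re.compile(u"\u00eb")       # the character ë

as_text = u"\u00ebxyz"                # case (2a)
as_bytes = as_text.encode("utf-8")    # case (2b)

# After normalization, the same pattern matches both representations.
for value in (as_text, as_bytes):
    print(pattern.match(to_text(value)) is not None)  # True, True
```

The cost is that the plugin has to remember to call the helper on every string it receives from the engine.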

Describe the solution you'd like
The plugin shouldn't have to jump through hoops like the above just to do string searching. It'd be better if the Delphix wrappers could give a consistent string representation to the plugin.

I think the rules should be:

When the wrapper provides a string to the plugin, it will always supply a unicode string to the plugin. Never a str string.
When the plugin provides a string to the wrapper, the wrapper will accept either a unicode string, or an ASCII- or UTF8-encoded str string. (The wrapper already supports this)
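One way the first rule might be applied on the wrapper side is sketched below. This is only an illustration under stated assumptions: `normalize_strings` is a hypothetical name, and plain dicts, lists, and tuples stand in for whatever structures the wrapper actually builds from the Protobuf message. It is not the wrapper's real code.

```python
def normalize_strings(value):
    """Recursively convert every byte string inside value to text.

    Hypothetical helper: assumes engine data has already been unpacked
    into plain dicts, lists, tuples, and scalar values.
    """
    if isinstance(value, bytes):  # on Python 2, bytes is str
        return value.decode("utf-8")
    if isinstance(value, dict):
        return {normalize_strings(k): normalize_strings(v)
                for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return type(value)(normalize_strings(item) for item in value)
    return value  # numbers, None, and text pass through unchanged

# Made-up input that mixes byte strings and text.
engine_params = {
    b"name": b"\xc3\xabxyz",
    "paths": [b"/tmp/a", u"/tmp/\u00eb"],
    "port": 5432,
}
plugin_params = normalize_strings(engine_params)
print(plugin_params["name"] == u"\u00ebxyz")  # True
```

With normalization done once at the boundary, plugin code could treat every engine-provided string as text and would not need a per-string helper.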

Describe alternatives you've considered
Another alternative would be for the wrapper to always provide UTF8-encoded str objects. At least that would be consistent. However, this still makes searching/matching a bit cumbersome, since now the plugin needs to worry about encoding and decoding rules.
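To illustrate why this alternative stays cumbersome, here is a small sketch of matching directly against UTF-8 bytes. The data is made up for the example. The snippet runs on Python 2.7 and Python 3.

```python
import re

data = u"\u00ebxyz".encode("utf-8")  # what the plugin would receive

# The pattern has to be encoded the same way as the data.
byte_pattern = re.compile(u"\u00eb".encode("utf-8"))
print(byte_pattern.match(data) is not None)  # True

# Byte patterns work on bytes, not characters: "." consumes only the
# first of the two bytes that make up ë.
print(re.match(b".xyz", data) is not None)   # False
print(re.match(b"..xyz", data) is not None)  # True

# Decoding first restores character semantics.
print(re.match(u".xyz", data.decode("utf-8")) is not None)  # True
```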
