Support for time, datetime, json, real, double types #202

Closed · wants to merge 10 commits

Conversation

lpoulain (Member)

This PR adds support for the time, datetime, json, real and double Trino types when querying rows.

The time and datetime types required some data massaging (illustrated in the sketch below):

  • Python's datetime stores fractional seconds with at most 6 digits (microseconds), so any extra digits (such as the last 3 digits of a datetime(9)) are dropped
  • The time in Python does not carry a timezone here, so the timezone is dropped
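
A minimal illustration of those two adjustments; the literals and parsing calls below are illustrative only, not the PR's actual code:

from datetime import datetime
import re

# Truncate a timestamp(9) string to Python's 6-digit (microsecond) precision
raw = "2022-07-18 22:19:03.123456789"
date_part, frac = raw.split(".")
parsed = datetime.strptime(f"{date_part}.{frac[:6]}", "%Y-%m-%d %H:%M:%S.%f")
# -> datetime(2022, 7, 18, 22, 19, 3, 123456); the trailing "789" is dropped

# Drop the zone offset from a "time with time zone" value before parsing it
raw_time = "15:30:00.123+02:00"
naive = re.sub(r"[+-]\d{2}:\d{2}$", "", raw_time)
t = datetime.strptime(naive, "%H:%M:%S.%f").time()  # -> time(15, 30, 0, 123000)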

@cla-bot cla-bot bot added the cla-signed label Jul 18, 2022
@lpoulain lpoulain requested review from mdesmet, hovaesco and aalbu July 18, 2022 22:19
mdesmet (Contributor) commented Jul 19, 2022

The goal of this code is similar to what is implemented for experimental_python_types. The idea behind using this flag is that it does not impact any existing code relying on the string output of the Trino API (backwards compatibility).

Could you review the existing mapping code below and correct any shortcomings in it that your implementation may already have fixed?

def _map_to_python_type(cls, item: Tuple[Any, Dict]) -> Any:
    (value, data_type) = item

    if value is None:
        return None

    raw_type = data_type["typeSignature"]["rawType"]

    try:
        if isinstance(value, list):
            if raw_type == "array":
                # Map every element using the array's element type
                raw_type = {
                    "typeSignature": data_type["typeSignature"]["arguments"][0]["value"]
                }
                return [cls._map_to_python_type((array_item, raw_type)) for array_item in value]
            if raw_type == "row":
                # Map each field of the row with its own type and return a tuple
                raw_types = map(lambda arg: arg["value"], data_type["typeSignature"]["arguments"])
                return tuple(
                    cls._map_to_python_type((array_item, raw_type))
                    for (array_item, raw_type) in zip(value, raw_types)
                )
            return value
        if isinstance(value, dict):
            # Trino map type: convert keys and values using their respective types
            raw_key_type = {
                "typeSignature": data_type["typeSignature"]["arguments"][0]["value"]
            }
            raw_value_type = {
                "typeSignature": data_type["typeSignature"]["arguments"][1]["value"]
            }
            return {
                cls._map_to_python_type((key, raw_key_type)):
                    cls._map_to_python_type((value[key], raw_value_type))
                for key in value
            }
        elif "decimal" in raw_type:
            return Decimal(value)
        elif raw_type == "double":
            if value == 'Infinity':
                return INF
            elif value == '-Infinity':
                return NEGATIVE_INF
            elif value == 'NaN':
                return NAN
            return value
        elif raw_type == "date":
            return datetime.strptime(value, "%Y-%m-%d").date()
        elif raw_type == "timestamp with time zone":
            dt, tz = value.rsplit(' ', 1)
            if tz.startswith('+') or tz.startswith('-'):
                return datetime.strptime(value, "%Y-%m-%d %H:%M:%S.%f %z")
            return datetime.strptime(dt, "%Y-%m-%d %H:%M:%S.%f").replace(tzinfo=pytz.timezone(tz))
        elif "timestamp" in raw_type:
            return datetime.strptime(value, "%Y-%m-%d %H:%M:%S.%f")
        elif "time with time zone" in raw_type:
            # Split the value into the time part and the +HH:MM / -HH:MM offset
            matches = re.match(r'^(.*)([\+\-])(\d{2}):(\d{2})$', value)
            assert matches is not None
            assert len(matches.groups()) == 4
            if matches.group(2) == '-':
                tz = -timedelta(hours=int(matches.group(3)), minutes=int(matches.group(4)))
            else:
                tz = timedelta(hours=int(matches.group(3)), minutes=int(matches.group(4)))
            return datetime.strptime(matches.group(1), "%H:%M:%S.%f").time().replace(tzinfo=timezone(tz))
        elif "time" in raw_type:
            return datetime.strptime(value, "%H:%M:%S.%f").time()
        else:
            return value
    except ValueError as e:
        error_str = f"Could not convert '{value}' into the associated python type for '{raw_type}'"
        raise trino.exceptions.TrinoDataError(error_str) from e

AFAIK we didn't have unit tests for the types, so that's probably something we should keep.

        None
    ]

def test_types():
Contributor:

Please split the tests by type; that will make it easier to troubleshoot when things go wrong.
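
A hedged sketch of what splitting the tests by type could look like; the trino_connection fixture name and the SQL literals are assumptions rather than the PR's actual tests, and they assume the new Python-type mapping is in effect:

from datetime import date

def test_date_type(trino_connection):
    cur = trino_connection.cursor()
    cur.execute("SELECT DATE '2022-07-18'")
    assert cur.fetchone()[0] == date(2022, 7, 18)

def test_double_type(trino_connection):
    cur = trino_connection.cursor()
    cur.execute("SELECT CAST(1.5 AS DOUBLE)")
    assert cur.fetchone()[0] == 1.5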

Contributor:

Also test with the experimental_python_types flag set to True, in combination with the TrinoResult class.

Member Author:

_map_to_python_type() works slightly differently: it analyzes the type and returns the converted value for every single value in every row. __col_func() is instead called once per column and returns a lambda, which is then called for each value.
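
A minimal sketch of that per-column approach; the names below are illustrative, not the PR's exact code. The column type is inspected once, and the returned lambda is applied to every value in that column:

from datetime import datetime
from decimal import Decimal

def _col_func(column):
    col_type = column["type"]
    if col_type.startswith("decimal"):
        return lambda val: Decimal(val) if val is not None else None
    if col_type == "date":
        return lambda val: datetime.strptime(val, "%Y-%m-%d").date() if val is not None else None
    return lambda val: val  # fall back to the raw value

def _process_rows(rows, columns):
    col_funcs = [_col_func(col) for col in columns]  # computed once per result set
    return [[f(v) for f, v in zip(col_funcs, row)] for row in rows]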

Contributor:

That's actually a great optimisation and definitely welcome. Can you merge this optimisation into the existing logic of experimental_python_types?

Member Author:

Done

trino/client.py (outdated)
@@ -226,6 +227,66 @@ def __repr__(self):
            )
        )

+    def __process_rows(self, rows, columns):

Contributor:

This should probably just be a regular private method (single leading underscore) instead of using the double leading underscore, which Python reserves for name mangling.
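
For illustration only (the class name below is made up, not the class in trino/client.py): a double leading underscore triggers name mangling, which exists to avoid attribute clashes in subclasses rather than to mark a method as private.

class SomeResult:
    def __process_rows(self):   # mangled to _SomeResult__process_rows
        return []

    def _process_rows(self):    # "private by convention", no mangling
        return []

r = SomeResult()
r._process_rows()               # straightforward to call and to test
r._SomeResult__process_rows()   # the mangled name needed to reach the first method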

Member Author:

Fixed


    @property
    def response_headers(self):
        return self._query.response_headers

    @classmethod
-    def _map_row(cls, experimental_python_types, row, columns):
+    def _map_row(cls, experimental_python_types, row, col_mapping_func):

Contributor:

All the logic from _map_row and _map_to_python_types can probably be moved into the col_mapping_func itself. That way this class no longer needs to know about experimental_python_types and can just call the mapping function instead.

Member Author:

It could, but then the whole body of _map_to_python_type() would be repeated inside each mapping function.

Contributor:

Sorry, I meant _map_to_python_type. You could use experimental_python_types in the col_mapping_funcs themselves; that way the TrinoResult class only needs to know about the col_mapping_funcs.

@@ -229,6 +231,84 @@ def __repr__(self):
            )
        )

+    def _col_func(self, column):

Contributor:

If I look at the comments above I read: the status of a query is represented by TrinoStatus.

I wonder if it would be better to put the logic in a RowMapperFactory class that returns an array of mapper functions. The question then is where this RowMapperFactory should be called and where the row mapping should happen. Maybe TrinoQuery is a better candidate?
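
A hedged sketch of that idea; every name and signature below is illustrative, not the library's actual API. The factory inspects the columns once, and the resulting RowMapper is applied to the rows downstream (for example from TrinoQuery):

from decimal import Decimal

class RowMapper:
    def __init__(self, col_funcs):
        self._col_funcs = col_funcs

    def map(self, rows):
        return [[f(v) for f, v in zip(self._col_funcs, row)] for row in rows]

class RowMapperFactory:
    def create(self, columns, experimental_python_types):
        if not experimental_python_types:
            # keep the raw values coming from the Trino API untouched
            return RowMapper([lambda val: val for _ in columns])
        return RowMapper([self._col_func(col) for col in columns])

    @staticmethod
    def _col_func(column):
        col_type = column["type"]
        if col_type.startswith("decimal"):
            return lambda val: Decimal(val) if val is not None else None
        return lambda val: val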

Member Author:

Processing the columns section (to get a list of mapping lambdas) and processing the rows can be done in multiple locations. I performed the former in the TrinoStatus class because this is where the data is first received. Processing the rows is done downstream, so I left it there.

As for the best candidate to perform both operations, I don't have any strong opinion.

Contributor:

Extracting it into its own class would allow for a separation of concerns and let us put the logic in the most appropriate place (composition).

Member Author:

Fair enough. But there is still the question of where the mapping function generation and execution should take place.

mdesmet (Contributor) commented Jul 20, 2022

As a general remark, please rebase instead of merging changes from master; otherwise you will introduce already-implemented changes into your PR's commits.

@@ -294,7 +294,7 @@ def test_datetime_with_utc_time_zone_query_param(trino_connection):
     rows = cur.fetchall()

     assert rows[0][0] == params
-    assert cur.description[0][1] == "timestamp with time zone"
+    assert cur.description[0][1] == "timestamp(6) with time zone"

Contributor:

Note that only milliseconds are currently supported. If you changed the test case to a value more precise than milliseconds, e.g. params = datetime(2020, 1, 1, 16, 43, 22, 320258), you would see that the 258 part is lost, because of #42.

IMHO providing a value with a precision that is not actually supported right now does not seem very useful.
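
To make the loss concrete (assuming the parameter is serialized with only millisecond precision, as discussed in #42; this is an illustration, not the client's actual serialization code):

from datetime import datetime

params = datetime(2020, 1, 1, 16, 43, 22, 320258)
serialized = params.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]  # keep only milliseconds
# '2020-01-01 16:43:22.320' -> the trailing 258 microseconds never reach the server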

Member Author:

Actually, all 6 digits are preserved now.

            return lambda val: Decimal(val)
        elif col_type.startswith('double') or col_type.startswith('real'):
            return lambda val: float('inf') if val == 'Infinity' \
                else -float('inf') if val == '-Infinity' \

Contributor:

Use the existing constants for performance.

INF = float("inf")
NEGATIVE_INF = float("-inf")
NAN = float("nan")

@@ -339,6 +421,7 @@ def __init__(
         self._http_session = self.http.Session()
         self._http_session.verify = verify
         self._http_session.headers.update(self.http_headers)
+        self._http_session.headers['x-trino-client-capabilities'] = 'PARAMETRIC_DATETIME'

Contributor:

This change should probably go into another PR. See #114

lpoulain (Member Author)

Closing to split the PR into smaller chunks.

@lpoulain lpoulain closed this Jul 21, 2022