Support for time, datetime, json, real, double types #202

Closed · wants to merge 10 commits

Conversation

lpoulain (Member)

This PR adds support for the time, datetime, json, real and double Trino types when querying rows.

The time and datetime types required some data massaging (illustrated in the sketch below):

  • Python's datetime stores fractional seconds with at most 6 digits (microseconds), so any extra digits (such as the last 3 digits of a datetime(9)) are dropped
  • The time in Python does not carry a timezone here, so the timezone is dropped
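
A minimal illustration of those two adjustments; the literals and parsing calls below are illustrative only, not the PR's actual code:

from datetime import datetime
import re

# Truncate a timestamp(9) string to Python's 6-digit (microsecond) precision
raw = "2022-07-18 22:19:03.123456789"
date_part, frac = raw.split(".")
parsed = datetime.strptime(f"{date_part}.{frac[:6]}", "%Y-%m-%d %H:%M:%S.%f")
# -> datetime(2022, 7, 18, 22, 19, 3, 123456); the trailing "789" is dropped

# Drop the zone offset from a "time with time zone" value before parsing it
raw_time = "15:30:00.123+02:00"
naive = re.sub(r"[+-]\d{2}:\d{2}$", "", raw_time)
t = datetime.strptime(naive, "%H:%M:%S.%f").time()  # -> time(15, 30, 0, 123000)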

@cla-bot cla-bot bot added the cla-signed label Jul 18, 2022
@lpoulain lpoulain requested review from mdesmet, hovaesco and aalbu July 18, 2022 22:19
mdesmet (Contributor) commented Jul 19, 2022

The goal of this code is similar to what is implemented for experimental_python_types. The idea behind using this flag is that it does not impact any existing code relying on the string output of the Trino API (backwards compatibility).

Could you review the existing mapping code below and correct any shortcomings in it that your implementation may already have fixed?

def _map_to_python_type(cls, item: Tuple[Any, Dict]) -> Any:
    (value, data_type) = item

    if value is None:
        return None

    raw_type = data_type["typeSignature"]["rawType"]

    try:
        if isinstance(value, list):
            if raw_type == "array":
                # Map every element using the array's element type
                raw_type = {
                    "typeSignature": data_type["typeSignature"]["arguments"][0]["value"]
                }
                return [cls._map_to_python_type((array_item, raw_type)) for array_item in value]
            if raw_type == "row":
                # Map each field of the row with its own type and return a tuple
                raw_types = map(lambda arg: arg["value"], data_type["typeSignature"]["arguments"])
                return tuple(
                    cls._map_to_python_type((array_item, raw_type))
                    for (array_item, raw_type) in zip(value, raw_types)
                )
            return value
        if isinstance(value, dict):
            # Trino map type: convert keys and values using their respective types
            raw_key_type = {
                "typeSignature": data_type["typeSignature"]["arguments"][0]["value"]
            }
            raw_value_type = {
                "typeSignature": data_type["typeSignature"]["arguments"][1]["value"]
            }
            return {
                cls._map_to_python_type((key, raw_key_type)):
                    cls._map_to_python_type((value[key], raw_value_type))
                for key in value
            }
        elif "decimal" in raw_type:
            return Decimal(value)
        elif raw_type == "double":
            if value == 'Infinity':
                return INF
            elif value == '-Infinity':
                return NEGATIVE_INF
            elif value == 'NaN':
                return NAN
            return value
        elif raw_type == "date":
            return datetime.strptime(value, "%Y-%m-%d").date()
        elif raw_type == "timestamp with time zone":
            dt, tz = value.rsplit(' ', 1)
            if tz.startswith('+') or tz.startswith('-'):
                return datetime.strptime(value, "%Y-%m-%d %H:%M:%S.%f %z")
            return datetime.strptime(dt, "%Y-%m-%d %H:%M:%S.%f").replace(tzinfo=pytz.timezone(tz))
        elif "timestamp" in raw_type:
            return datetime.strptime(value, "%Y-%m-%d %H:%M:%S.%f")
        elif "time with time zone" in raw_type:
            # Split the value into the time part and the +HH:MM / -HH:MM offset
            matches = re.match(r'^(.*)([\+\-])(\d{2}):(\d{2})$', value)
            assert matches is not None
            assert len(matches.groups()) == 4
            if matches.group(2) == '-':
                tz = -timedelta(hours=int(matches.group(3)), minutes=int(matches.group(4)))
            else:
                tz = timedelta(hours=int(matches.group(3)), minutes=int(matches.group(4)))
            return datetime.strptime(matches.group(1), "%H:%M:%S.%f").time().replace(tzinfo=timezone(tz))
        elif "time" in raw_type:
            return datetime.strptime(value, "%H:%M:%S.%f").time()
        else:
            return value
    except ValueError as e:
        error_str = f"Could not convert '{value}' into the associated python type for '{raw_type}'"
        raise trino.exceptions.TrinoDataError(error_str) from e

AFAIK we didn't have unit tests for the types, so that's probably something we should keep.

        None
    ]

def test_types():
Contributor:

Please split the tests by type; that will make it easier to troubleshoot when things go wrong.
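
A hedged sketch of what splitting the tests by type could look like; the trino_connection fixture name and the SQL literals are assumptions rather than the PR's actual tests, and they assume the new Python-type mapping is in effect:

from datetime import date

def test_date_type(trino_connection):
    cur = trino_connection.cursor()
    cur.execute("SELECT DATE '2022-07-18'")
    assert cur.fetchone()[0] == date(2022, 7, 18)

def test_double_type(trino_connection):
    cur = trino_connection.cursor()
    cur.execute("SELECT CAST(1.5 AS DOUBLE)")
    assert cur.fetchone()[0] == 1.5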

Contributor:

Also test with the experimental_python_types flag set to True, in combination with the TrinoResult class.

Member Author:

_map_to_python_type() works slightly differently: it analyzes the type and returns the converted value for every single value in every row. __col_func() is instead called once per column and returns a lambda, which is then called for each value.
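
A minimal sketch of that per-column approach; the names below are illustrative, not the PR's exact code. The column type is inspected once, and the returned lambda is applied to every value in that column:

from datetime import datetime
from decimal import Decimal

def _col_func(column):
    col_type = column["type"]
    if col_type.startswith("decimal"):
        return lambda val: Decimal(val) if val is not None else None
    if col_type == "date":
        return lambda val: datetime.strptime(val, "%Y-%m-%d").date() if val is not None else None
    return lambda val: val  # fall back to the raw value

def _process_rows(rows, columns):
    col_funcs = [_col_func(col) for col in columns]  # computed once per result set
    return [[f(v) for f, v in zip(col_funcs, row)] for row in rows]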

Contributor:

That's actually a great optimisation and definitely welcome. Can you merge this optimisation into the existing logic of experimental_python_types?

Member Author:

Done

trino/client.py (outdated)
@@ -226,6 +227,66 @@ def __repr__(self):
            )
        )

+    def __process_rows(self, rows, columns):

Contributor:

This should probably just be a regular private method (single leading underscore) instead of using the double leading underscore, which Python reserves for name mangling.
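
For illustration only (the class name below is made up, not the class in trino/client.py): a double leading underscore triggers name mangling, which exists to avoid attribute clashes in subclasses rather than to mark a method as private.

class SomeResult:
    def __process_rows(self):   # mangled to _SomeResult__process_rows
        return []

    def _process_rows(self):    # "private by convention", no mangling
        return []

r = SomeResult()
r._process_rows()               # straightforward to call and to test
r._SomeResult__process_rows()   # the mangled name needed to reach the first method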

Member Author:

Fixed


    @property
    def response_headers(self):
        return self._query.response_headers

    @classmethod
-    def _map_row(cls, experimental_python_types, row, columns):
+    def _map_row(cls, experimental_python_types, row, col_mapping_func):

Contributor:

All the logic from _map_row and _map_to_python_types can probably be moved into the col_mapping_func itself. That way this class no longer needs to know about experimental_python_types and can just call the mapping function instead.

Member Author:

It could, but then the whole body of _map_to_python_type() would be repeated inside each mapping function.

Contributor:

Sorry, I meant _map_to_python_type. You could use experimental_python_types in the col_mapping_funcs themselves; that way the TrinoResult class only needs to know about the col_mapping_funcs.

@@ -229,6 +231,84 @@ def __repr__(self):
            )
        )

+    def _col_func(self, column):

Contributor:

If I look at the comments above I read: the status of a query is represented by TrinoStatus.

I wonder if it would be better to put the logic in a RowMapperFactory class that returns an array of mapper functions. The question then is where this RowMapperFactory should be called and where the row mapping should happen. Maybe TrinoQuery is a better candidate?
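
A hedged sketch of that idea; every name and signature below is illustrative, not the library's actual API. The factory inspects the columns once, and the resulting RowMapper is applied to the rows downstream (for example from TrinoQuery):

from decimal import Decimal

class RowMapper:
    def __init__(self, col_funcs):
        self._col_funcs = col_funcs

    def map(self, rows):
        return [[f(v) for f, v in zip(self._col_funcs, row)] for row in rows]

class RowMapperFactory:
    def create(self, columns, experimental_python_types):
        if not experimental_python_types:
            # keep the raw values coming from the Trino API untouched
            return RowMapper([lambda val: val for _ in columns])
        return RowMapper([self._col_func(col) for col in columns])

    @staticmethod
    def _col_func(column):
        col_type = column["type"]
        if col_type.startswith("decimal"):
            return lambda val: Decimal(val) if val is not None else None
        return lambda val: val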

Member Author:

Processing the columns section (to get a list of mapping lambdas) and processing the rows can be done in multiple locations. I performed the former in the TrinoStatus class because this is where the data is first received. Processing the rows is done downstream, so I left it there.

As for the best candidate to perform both operations, I don't have any strong opinion.

Contributor:

Extracting it into its own class would allow for a separation of concerns and let us put the logic in the most appropriate place (composition).

Member Author:

Fair enough. But there is still the question of where the mapping function generation and execution should take place.

mdesmet (Contributor) commented Jul 20, 2022

As a general remark, please rebase instead of merging changes from master; otherwise you will introduce already-implemented changes into your PR's commits.

@@ -294,7 +294,7 @@ def test_datetime_with_utc_time_zone_query_param(trino_connection):
     rows = cur.fetchall()

     assert rows[0][0] == params
-    assert cur.description[0][1] == "timestamp with time zone"
+    assert cur.description[0][1] == "timestamp(6) with time zone"

Contributor:

Note that only milliseconds are currently supported. If you changed the test case to a value more precise than milliseconds, e.g. params = datetime(2020, 1, 1, 16, 43, 22, 320258), you would see that the 258 part is lost, because of #42.

IMHO providing a value with a precision that is not actually supported right now does not seem very useful.
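
To make the loss concrete (assuming the parameter is serialized with only millisecond precision, as discussed in #42; this is an illustration, not the client's actual serialization code):

from datetime import datetime

params = datetime(2020, 1, 1, 16, 43, 22, 320258)
serialized = params.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]  # keep only milliseconds
# '2020-01-01 16:43:22.320' -> the trailing 258 microseconds never reach the server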

Member Author:

Actually, all 6 digits are preserved now.

            return lambda val: Decimal(val)
        elif col_type.startswith('double') or col_type.startswith('real'):
            return lambda val: float('inf') if val == 'Infinity' \
                else -float('inf') if val == '-Infinity' \

Contributor:

Use the existing constants for performance.

INF = float("inf")
NEGATIVE_INF = float("-inf")
NAN = float("nan")

@@ -339,6 +421,7 @@ def __init__(
         self._http_session = self.http.Session()
         self._http_session.verify = verify
         self._http_session.headers.update(self.http_headers)
+        self._http_session.headers['x-trino-client-capabilities'] = 'PARAMETRIC_DATETIME'

Contributor:

This change should probably go into another PR. See #114

lpoulain (Member Author)

Closing to split the PR into smaller chunks.

@lpoulain lpoulain closed this Jul 21, 2022