Add type mapping section #17573

Merged 1 commit on Jul 24, 2023
72 changes: 72 additions & 0 deletions docs/src/main/sphinx/connector/avro-decoder.fragment
@@ -0,0 +1,72 @@
Avro decoder
""""""""""""

The Avro decoder converts the bytes representing a message or key in Avro format
based on a schema. The message must have the Avro schema embedded. Trino does
not support schemaless Avro decoding.

The ``dataSchema`` attribute must be defined for any key or message using the
Avro decoder. It must point to the location of a valid Avro schema file for the
message to be decoded. This location can be a remote web server (e.g.:
``dataSchema: 'http://example.org/schema/avro_data.avsc'``) or a local file
system (e.g.: ``dataSchema: '/usr/local/schema/avro_data.avsc'``). The decoder
fails if this location is not accessible from the Trino cluster.

The following attributes are supported:

* ``name`` - Name of the column in the Trino table.
* ``type`` - Trino data type of column.
* ``mapping`` - A slash-separated list of field names to select a field from the
Avro schema. If the field specified in ``mapping`` does not exist in the
original Avro schema, a read operation returns ``NULL``.
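
For example, the following sketch shows how these attributes might appear in
the ``message`` section of a table definition file. All names are hypothetical
placeholders, and connector-specific properties such as the topic or stream
name are omitted; the slash-separated ``mapping`` selects a nested Avro field:

.. code-block:: json

    {
        "tableName": "example_table",
        "schemaName": "example_schema",
        "message": {
            "dataFormat": "avro",
            "dataSchema": "/usr/local/schema/avro_data.avsc",
            "fields": [
                {
                    "name": "field1",
                    "type": "BIGINT",
                    "mapping": "level1/field1"
                }
            ]
        }
    }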

The following table lists the supported Trino types that can be used in ``type``
for the equivalent Avro field types:

.. list-table::
:widths: 40, 60
:header-rows: 1

* - Trino data type
- Allowed Avro data type
* - ``BIGINT``
- ``INT``, ``LONG``
* - ``DOUBLE``
- ``DOUBLE``, ``FLOAT``
* - ``BOOLEAN``
- ``BOOLEAN``
* - ``VARCHAR`` / ``VARCHAR(x)``
- ``STRING``
* - ``VARBINARY``
- ``FIXED``, ``BYTES``
* - ``ARRAY``
- ``ARRAY``
* - ``MAP``
- ``MAP``

No other types are supported.

Avro schema evolution
+++++++++++++++++++++

The Avro decoder supports schema evolution with backward compatibility. With
backward compatibility, a newer schema can be used to read Avro data created
with an older schema. Any change in the Avro schema must also be reflected in
Trino's topic definition file. Newly added or renamed fields must have a
default value in the Avro schema file.

The schema evolution behavior is as follows:

* Column added in new schema: Data created with an older schema produces a
*default* value when the table is using the new schema.

* Column removed in new schema: Data created with an older schema no longer
outputs the data from the column that was removed.

* Column is renamed in the new schema: This is equivalent to removing the column
and adding a new one, and data created with an older schema produces a
*default* value when the table is using the new schema.

* Changing type of column in the new schema: If the type coercion is supported
by Avro, then the conversion happens. An error is thrown for incompatible
types.
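
To illustrate the default-value requirement, the following is a hedged sketch
of an evolved Avro schema with hypothetical record and field names: the newly
added ``comment`` field is nullable and declares a default, so data written
with the older schema still reads correctly:

.. code-block:: json

    {
        "type": "record",
        "name": "example_record",
        "fields": [
            {"name": "id", "type": "long"},
            {"name": "comment", "type": ["null", "string"], "default": null}
        ]
    }
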
34 changes: 34 additions & 0 deletions docs/src/main/sphinx/connector/csv-decoder.fragment
@@ -0,0 +1,34 @@
CSV decoder
"""""""""""

The CSV decoder converts the bytes representing a message or key into a string
using UTF-8 encoding, and interprets the result as a line of comma-separated
values.

For fields, the ``type`` and ``mapping`` attributes must be defined:

* ``type`` - Trino data type. See the following table for a list of supported
data types.

* ``mapping`` - The index of the field in the CSV record.

The ``dataFormat`` and ``formatHint`` attributes are not supported and must be
omitted.

.. list-table::
:widths: 40, 60
:header-rows: 1

* - Trino data type
- Decoding rules
* - ``BIGINT``, ``INTEGER``, ``SMALLINT``, ``TINYINT``
- Decoded using Java ``Long.parseLong()``
* - ``DOUBLE``
- Decoded using Java ``Double.parseDouble()``
* - ``BOOLEAN``
- "true" character sequence maps to ``true``. Other character sequences map
to ``false``
* - ``VARCHAR`` / ``VARCHAR(x)``
- Used as is

No other types are supported.
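
For example, the following sketch shows a CSV message definition that maps the
first two values of each record by index. The table, schema, and field names
are hypothetical placeholders, and connector-specific properties are omitted:

.. code-block:: json

    {
        "tableName": "example_table",
        "schemaName": "example_schema",
        "message": {
            "dataFormat": "csv",
            "fields": [
                {
                    "name": "id",
                    "type": "BIGINT",
                    "mapping": "0"
                },
                {
                    "name": "name",
                    "type": "VARCHAR",
                    "mapping": "1"
                }
            ]
        }
    }
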
76 changes: 76 additions & 0 deletions docs/src/main/sphinx/connector/json-decoder.fragment
@@ -0,0 +1,76 @@
JSON decoder
""""""""""""

The JSON decoder converts the bytes representing a message or key into
JavaScript Object Notation (JSON) according to :rfc:`4627`. The message or key
must convert into a JSON object, not an array or simple type.

For fields, the following attributes are supported:

* ``type`` - Trino data type of column.
* ``dataFormat`` - Field decoder to be used for column.
* ``mapping`` - Slash-separated list of field names to select a field from the
JSON object.
* ``formatHint`` - Only for ``custom-date-time``.
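
For example, the following sketch shows a JSON message definition with a
hypothetical nested field: the slash-separated ``mapping`` value ``user/id``
selects the ``id`` field inside the ``user`` object. Connector-specific
properties are omitted:

.. code-block:: json

    {
        "tableName": "example_table",
        "schemaName": "example_schema",
        "message": {
            "dataFormat": "json",
            "fields": [
                {
                    "name": "user_id",
                    "type": "BIGINT",
                    "mapping": "user/id"
                }
            ]
        }
    }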

The JSON decoder supports multiple field decoders: ``_default`` is used for
standard table columns, and a number of decoders handle date- and time-based
types.

The following table lists the Trino data types that can be used in ``type``,
and the matching field decoders that can be specified via the ``dataFormat``
attribute:

.. list-table::
:widths: 40, 60
:header-rows: 1

* - Trino data type
- Allowed ``dataFormat`` values
* - ``BIGINT``, ``INTEGER``, ``SMALLINT``, ``TINYINT``, ``DOUBLE``,
``BOOLEAN``, ``VARCHAR``, ``VARCHAR(x)``
- Default field decoder (omitted ``dataFormat`` attribute)
* - ``DATE``
- ``custom-date-time``, ``iso8601``
* - ``TIME``
- ``custom-date-time``, ``iso8601``, ``milliseconds-since-epoch``,
``seconds-since-epoch``
* - ``TIME WITH TIME ZONE``
- ``custom-date-time``, ``iso8601``
* - ``TIMESTAMP``
- ``custom-date-time``, ``iso8601``, ``rfc2822``,
``milliseconds-since-epoch``, ``seconds-since-epoch``
* - ``TIMESTAMP WITH TIME ZONE``
- ``custom-date-time``, ``iso8601``, ``rfc2822``,
``milliseconds-since-epoch``, ``seconds-since-epoch``

No other types are supported.

Default field decoder
+++++++++++++++++++++

This is the standard field decoder. It supports all the Trino physical data
types. A field value is transformed under JSON conversion rules into boolean,
long, double, or string values. This decoder should be used for columns that are
not date or time based.

Date and time decoders
++++++++++++++++++++++

To convert values from JSON objects to Trino ``DATE``, ``TIME``, ``TIME WITH
TIME ZONE``, ``TIMESTAMP`` or ``TIMESTAMP WITH TIME ZONE`` columns, select
special decoders using the ``dataFormat`` attribute of a field definition.

* ``iso8601`` - Text based, parses a text field as an ISO 8601 timestamp.
* ``rfc2822`` - Text based, parses a text field as an :rfc:`2822` timestamp.
* ``custom-date-time`` - Text based, parses a text field according to the Joda
  format pattern specified via the ``formatHint`` attribute. The format pattern
  should conform to
  https://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html.
* ``milliseconds-since-epoch`` - Number-based, interprets a text or number as
  the number of milliseconds since the epoch.
* ``seconds-since-epoch`` - Number-based, interprets a text or number as the
  number of seconds since the epoch.

For ``TIMESTAMP WITH TIME ZONE`` and ``TIME WITH TIME ZONE`` data types, if
time zone information is present in the decoded value, it is used as the Trino
value. Otherwise, the result time zone is set to ``UTC``.
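
For example, a hedged sketch of a field definition that parses a hypothetical
``createdAt`` text field using a Joda pattern supplied via ``formatHint``:

.. code-block:: json

    {
        "name": "created_at",
        "type": "TIMESTAMP",
        "dataFormat": "custom-date-time",
        "formatHint": "yyyy-MM-dd HH:mm:ss",
        "mapping": "createdAt"
    }
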
38 changes: 38 additions & 0 deletions docs/src/main/sphinx/connector/kinesis.rst
@@ -261,6 +261,44 @@ and if it is a more complex type (JSON array or JSON object) then the JSON itself

There is no limit on field descriptions for either key or message.

.. _kinesis-type-mapping:

Type mapping
------------

Because Trino and Kinesis each support types that the other does not, this
connector :ref:`maps some types <type-mapping-overview>` when reading data. Type
mapping depends on the message format: RAW, CSV, JSON, or AVRO.

Row decoding
^^^^^^^^^^^^

A decoder is used to map data to table columns.

The connector contains the following decoders:

* ``raw``: Message is not interpreted; ranges of raw message bytes are mapped
to table columns.
* ``csv``: Message is interpreted as comma-separated values, and fields are
  mapped to table columns.
* ``json``: Message is parsed as JSON, and JSON fields are mapped to table
columns.
* ``avro``: Message is parsed based on an Avro schema, and Avro fields are
mapped to table columns.

.. note::

If no table definition file exists for a table, the ``dummy`` decoder is
used, which does not expose any columns.
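
As a hedged sketch, a Kinesis table definition file that selects the ``json``
decoder might look like the following; the table, schema, stream, and field
names are hypothetical placeholders:

.. code-block:: json

    {
        "tableName": "example_table",
        "schemaName": "example_schema",
        "streamName": "example_stream",
        "message": {
            "dataFormat": "json",
            "fields": [
                {
                    "name": "id",
                    "type": "BIGINT",
                    "mapping": "id"
                }
            ]
        }
    }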

.. include:: raw-decoder.fragment

.. include:: csv-decoder.fragment

.. include:: json-decoder.fragment

.. include:: avro-decoder.fragment

.. _kinesis-sql-support:

SQL support
84 changes: 84 additions & 0 deletions docs/src/main/sphinx/connector/raw-decoder.fragment
@@ -0,0 +1,84 @@
Raw decoder
"""""""""""

The raw decoder supports reading of raw byte-based values from a message or
key, and converting them into Trino columns.

For fields, the following attributes are supported:

* ``dataFormat`` - Selects the width of the data type converted.
* ``type`` - Trino data type. See the following table for a list of supported
data types.
* ``mapping`` - ``<start>[:<end>]`` - Start and end position of bytes to convert
(optional).

The ``dataFormat`` attribute selects the number of bytes converted. If absent,
``BYTE`` is assumed. All values are signed.

Supported values are:

* ``BYTE`` - one byte
* ``SHORT`` - two bytes (big-endian)
* ``INT`` - four bytes (big-endian)
* ``LONG`` - eight bytes (big-endian)
* ``FLOAT`` - four bytes (IEEE 754 format)
* ``DOUBLE`` - eight bytes (IEEE 754 format)

The ``type`` attribute defines the Trino data type on which the value is mapped.

Depending on the Trino type assigned to a column, different values of
``dataFormat`` can be used:

.. list-table::
:widths: 40, 60
:header-rows: 1

* - Trino data type
- Allowed ``dataFormat`` values
* - ``BIGINT``
- ``BYTE``, ``SHORT``, ``INT``, ``LONG``
* - ``INTEGER``
- ``BYTE``, ``SHORT``, ``INT``
* - ``SMALLINT``
- ``BYTE``, ``SHORT``
* - ``DOUBLE``
- ``DOUBLE``, ``FLOAT``
* - ``BOOLEAN``
- ``BYTE``, ``SHORT``, ``INT``, ``LONG``
* - ``VARCHAR`` / ``VARCHAR(x)``
- ``BYTE``

No other types are supported.

The ``mapping`` attribute specifies the range of the bytes in a key or message
used for decoding. It can be one or two numbers separated by a colon
(``<start>[:<end>]``).

If only a start position is given:

* For fixed width types, the column uses the appropriate number of bytes for
the specified ``dataFormat`` (see above).
* When a ``VARCHAR`` value is decoded, all bytes from the start position to
  the end of the message are used.

If start and end position are given:

* For fixed width types, the size must be equal to the number of bytes used by
specified ``dataFormat``.
* For the ``VARCHAR`` data type, all bytes between start (inclusive) and end
  (exclusive) are used.

If no ``mapping`` attribute is specified, it is equivalent to setting the start
position to 0 and leaving the end position undefined.
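
For illustration, the following sketch decodes two hypothetical fields: the
first reads eight bytes starting at position 0 as a big-endian ``LONG``, and
the second reads bytes 8 (inclusive) through 16 (exclusive) as a UTF-8 string.
Connector-specific properties are omitted:

.. code-block:: json

    {
        "tableName": "example_table",
        "schemaName": "example_schema",
        "message": {
            "dataFormat": "raw",
            "fields": [
                {
                    "name": "sequence",
                    "type": "BIGINT",
                    "dataFormat": "LONG",
                    "mapping": "0"
                },
                {
                    "name": "tag",
                    "type": "VARCHAR",
                    "dataFormat": "BYTE",
                    "mapping": "8:16"
                }
            ]
        }
    }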

The decoding scheme of numeric data types (``BIGINT``, ``INTEGER``,
``SMALLINT``, ``TINYINT``, ``DOUBLE``) is straightforward. A sequence of bytes
is read from the input message and decoded according to either:

* big-endian encoding (for integer types)
* IEEE 754 format (for ``DOUBLE``).

The length of a decoded byte sequence is implied by the ``dataFormat``.

For the ``VARCHAR`` data type, a sequence of bytes is interpreted according to
UTF-8 encoding.
36 changes: 36 additions & 0 deletions docs/src/main/sphinx/connector/redis.rst
@@ -265,6 +265,42 @@ In addition to the above Kafka types, the Redis connector supports ``hash`` type

.. _Kafka connector: ./kafka.html

Type mapping
------------

Because Trino and Redis each support types that the other does not, this
connector :ref:`maps some types <type-mapping-overview>` when reading data. Type
mapping depends on the message format: RAW, CSV, JSON, or AVRO.

Row decoding
^^^^^^^^^^^^

A decoder is used to map data to table columns.

The connector contains the following decoders:

* ``raw``: Message is not interpreted; ranges of raw message bytes are mapped
to table columns.
* ``csv``: Message is interpreted as comma-separated values, and fields are
  mapped to table columns.
* ``json``: Message is parsed as JSON, and JSON fields are mapped to table
columns.
* ``avro``: Message is parsed based on an Avro schema, and Avro fields are
mapped to table columns.

.. note::

If no table definition file exists for a table, the ``dummy`` decoder is
used, which does not expose any columns.
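
As a hedged sketch, a Redis table definition file might combine a ``raw`` key
decoder with a ``json`` value decoder; all names are hypothetical placeholders:

.. code-block:: json

    {
        "tableName": "example_table",
        "schemaName": "example_schema",
        "key": {
            "dataFormat": "raw",
            "fields": [
                {
                    "name": "redis_key",
                    "type": "VARCHAR",
                    "dataFormat": "BYTE",
                    "mapping": "0"
                }
            ]
        },
        "value": {
            "dataFormat": "json",
            "fields": [
                {
                    "name": "id",
                    "type": "BIGINT",
                    "mapping": "id"
                }
            ]
        }
    }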

.. include:: raw-decoder.fragment

.. include:: csv-decoder.fragment

.. include:: json-decoder.fragment

.. include:: avro-decoder.fragment

.. _redis-sql-support:

SQL support