From 0afc37da5989f306305375500a53a73095c2dab1 Mon Sep 17 00:00:00 2001
From: Jessica
Date: Fri, 19 May 2023 14:35:09 -0400
Subject: [PATCH] Add type mapping section

---
 .../sphinx/connector/avro-decoder.fragment | 72 ++++++++++++++++
 .../sphinx/connector/csv-decoder.fragment  | 34 ++++++++
 .../sphinx/connector/json-decoder.fragment | 76 +++++++++++++++++
 docs/src/main/sphinx/connector/kinesis.rst | 38 +++++++++
 .../sphinx/connector/raw-decoder.fragment  | 84 +++++++++++++++++++
 docs/src/main/sphinx/connector/redis.rst   | 36 ++++++++
 6 files changed, 340 insertions(+)
 create mode 100644 docs/src/main/sphinx/connector/avro-decoder.fragment
 create mode 100644 docs/src/main/sphinx/connector/csv-decoder.fragment
 create mode 100644 docs/src/main/sphinx/connector/json-decoder.fragment
 create mode 100644 docs/src/main/sphinx/connector/raw-decoder.fragment

diff --git a/docs/src/main/sphinx/connector/avro-decoder.fragment b/docs/src/main/sphinx/connector/avro-decoder.fragment
new file mode 100644
index 000000000000..dfd358f60ee0
--- /dev/null
+++ b/docs/src/main/sphinx/connector/avro-decoder.fragment
@@ -0,0 +1,72 @@
Avro decoder
""""""""""""

The Avro decoder converts the bytes representing a message or key in Avro
format based on a schema. The message must have the Avro schema embedded. Trino
does not support schemaless Avro decoding.

The ``dataSchema`` must be defined for any key or message using the ``Avro``
decoder. It must point to the location of a valid Avro schema file for the
message to be decoded. This location can be a remote web server
(e.g.: ``dataSchema: 'http://example.org/schema/avro_data.avsc'``) or a local
file system (e.g.: ``dataSchema: '/usr/local/schema/avro_data.avsc'``). The
decoder fails if this location is not accessible from the Trino cluster.

The following attributes are supported:

* ``name`` - Name of the column in the Trino table.
* ``type`` - Trino data type of the column.
* ``mapping`` - A slash-separated list of field names to select a field from the
  Avro schema. If the field specified in ``mapping`` does not exist in the
  original Avro schema, a read operation returns ``NULL``.

The following table lists the supported Trino types that can be used in ``type``
for the equivalent Avro field types:

.. list-table::
   :widths: 40, 60
   :header-rows: 1

   * - Trino data type
     - Allowed Avro data type
   * - ``BIGINT``
     - ``INT``, ``LONG``
   * - ``DOUBLE``
     - ``DOUBLE``, ``FLOAT``
   * - ``BOOLEAN``
     - ``BOOLEAN``
   * - ``VARCHAR`` / ``VARCHAR(x)``
     - ``STRING``
   * - ``VARBINARY``
     - ``FIXED``, ``BYTES``
   * - ``ARRAY``
     - ``ARRAY``
   * - ``MAP``
     - ``MAP``

No other types are supported.
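For example, the following sketch shows how these attributes combine in a
table definition file. The table name, field names, and mapping paths are
illustrative assumptions; the schema location reuses the example path above:

.. code-block:: json

    {
        "tableName": "example_table",
        "schemaName": "default",
        "message": {
            "dataFormat": "avro",
            "dataSchema": "/usr/local/schema/avro_data.avsc",
            "fields": [
                {
                    "name": "user_id",
                    "type": "BIGINT",
                    "mapping": "id"
                },
                {
                    "name": "user_name",
                    "type": "VARCHAR",
                    "mapping": "profile/name"
                }
            ]
        }
    }

Here ``"mapping": "profile/name"`` selects the nested ``name`` field from a
hypothetical ``profile`` record in the Avro schema; if that field does not
exist in the schema, reads return ``NULL``.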
Avro schema evolution
+++++++++++++++++++++

The Avro decoder supports schema evolution with backward compatibility. With
backward compatibility, a newer schema can be used to read Avro data created
with an older schema. Any change in the Avro schema must also be reflected in
Trino's topic definition file. Newly added or renamed fields must have a
default value in the Avro schema file.

The schema evolution behavior is as follows:

* Column added in new schema: Data created with an older schema produces a
  *default* value when the table is using the new schema.

* Column removed in new schema: Data created with an older schema no longer
  outputs the data from the column that was removed.

* Column is renamed in the new schema: This is equivalent to removing the
  column and adding a new one, and data created with an older schema produces
  a *default* value when the table is using the new schema.

* Changing type of column in the new schema: If the type coercion is supported
  by Avro, then the conversion happens. An error is thrown for incompatible
  types.

diff --git a/docs/src/main/sphinx/connector/csv-decoder.fragment b/docs/src/main/sphinx/connector/csv-decoder.fragment
new file mode 100644
index 000000000000..845110213f38
--- /dev/null
+++ b/docs/src/main/sphinx/connector/csv-decoder.fragment
@@ -0,0 +1,34 @@
CSV decoder
"""""""""""

The CSV decoder converts the bytes representing a message or key into a string
using UTF-8 encoding, and interprets the result as a line of comma-separated
values.

For fields, the ``type`` and ``mapping`` attributes must be defined:

* ``type`` - Trino data type. See the following table for a list of supported
  data types.

* ``mapping`` - The index of the field in the CSV record.

The ``dataFormat`` and ``formatHint`` attributes are not supported and must be
omitted.

.. list-table::
   :widths: 40, 60
   :header-rows: 1

   * - Trino data type
     - Decoding rules
   * - ``BIGINT``, ``INTEGER``, ``SMALLINT``, ``TINYINT``
     - Decoded using Java ``Long.parseLong()``
   * - ``DOUBLE``
     - Decoded using Java ``Double.parseDouble()``
   * - ``BOOLEAN``
     - The "true" character sequence maps to ``true``. Other character
       sequences map to ``false``
   * - ``VARCHAR`` / ``VARCHAR(x)``
     - Used as is

No other types are supported.
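As a sketch only — the table and field names are placeholders — a message
definition using the CSV decoder maps each column to a field index in the
record:

.. code-block:: json

    {
        "tableName": "example_table",
        "schemaName": "default",
        "message": {
            "dataFormat": "csv",
            "fields": [
                {
                    "name": "order_id",
                    "type": "BIGINT",
                    "mapping": "0"
                },
                {
                    "name": "total",
                    "type": "DOUBLE",
                    "mapping": "1"
                },
                {
                    "name": "comment",
                    "type": "VARCHAR",
                    "mapping": "2"
                }
            ]
        }
    }

A message such as ``42,19.99,paid`` would then decode into a ``BIGINT``, a
``DOUBLE``, and a ``VARCHAR`` column, following the decoding rules in the
table above.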
diff --git a/docs/src/main/sphinx/connector/json-decoder.fragment b/docs/src/main/sphinx/connector/json-decoder.fragment
new file mode 100644
index 000000000000..d3c402013d0e
--- /dev/null
+++ b/docs/src/main/sphinx/connector/json-decoder.fragment
@@ -0,0 +1,76 @@
JSON decoder
""""""""""""

The JSON decoder converts the bytes representing a message or key into
JavaScript Object Notation (JSON) according to :rfc:`4627`. The message or key
must convert into a JSON object, not an array or simple type.

For fields, the following attributes are supported:

* ``type`` - Trino data type of the column.
* ``dataFormat`` - Field decoder to be used for the column.
* ``mapping`` - Slash-separated list of field names to select a field from the
  JSON object.
* ``formatHint`` - Only for ``custom-date-time``.

The JSON decoder supports multiple field decoders: ``_default`` is used for
standard table columns, and a number of decoders handle date- and time-based
types.

The following table lists the Trino data types that can be used in ``type``,
and the matching field decoders that can be specified via the ``dataFormat``
attribute:

.. list-table::
   :widths: 40, 60
   :header-rows: 1

   * - Trino data type
     - Allowed ``dataFormat`` values
   * - ``BIGINT``, ``INTEGER``, ``SMALLINT``, ``TINYINT``, ``DOUBLE``,
       ``BOOLEAN``, ``VARCHAR``, ``VARCHAR(x)``
     - Default field decoder (omitted ``dataFormat`` attribute)
   * - ``DATE``
     - ``custom-date-time``, ``iso8601``
   * - ``TIME``
     - ``custom-date-time``, ``iso8601``, ``milliseconds-since-epoch``,
       ``seconds-since-epoch``
   * - ``TIME WITH TIME ZONE``
     - ``custom-date-time``, ``iso8601``
   * - ``TIMESTAMP``
     - ``custom-date-time``, ``iso8601``, ``rfc2822``,
       ``milliseconds-since-epoch``, ``seconds-since-epoch``
   * - ``TIMESTAMP WITH TIME ZONE``
     - ``custom-date-time``, ``iso8601``, ``rfc2822``,
       ``milliseconds-since-epoch``, ``seconds-since-epoch``

No other types are supported.

Default field decoder
+++++++++++++++++++++

This is the standard field decoder. It supports all the Trino physical data
types. A field value is transformed under JSON conversion rules into boolean,
long, double, or string values. This decoder should be used for columns that
are not date or time based.

Date and time decoders
++++++++++++++++++++++

To convert values from JSON objects to Trino ``DATE``, ``TIME``, ``TIME WITH
TIME ZONE``, ``TIMESTAMP`` or ``TIMESTAMP WITH TIME ZONE`` columns, select
special decoders using the ``dataFormat`` attribute of a field definition.

* ``iso8601`` - Text based, parses a text field as an ISO 8601 timestamp.
* ``rfc2822`` - Text based, parses a text field as an :rfc:`2822` timestamp.
* ``custom-date-time`` - Text based, parses a text field according to the Joda
  format pattern specified via the ``formatHint`` attribute. The format pattern
  should conform to
  https://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html.
* ``milliseconds-since-epoch`` - Number-based, interprets a text or number as
  the number of milliseconds since the epoch.
* ``seconds-since-epoch`` - Number-based, interprets a text or number as the
  number of seconds since the epoch.

For ``TIMESTAMP WITH TIME ZONE`` and ``TIME WITH TIME ZONE`` data types, if
timezone information is present in the decoded value, it is used as the Trino
value. Otherwise, the result time zone is set to ``UTC``.
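For example, a field definition selecting a nested JSON timestamp with the
``custom-date-time`` decoder might look like the following sketch. The field
name, mapping path, and format pattern are assumptions for illustration:

.. code-block:: json

    {
        "name": "created_at",
        "type": "TIMESTAMP",
        "dataFormat": "custom-date-time",
        "formatHint": "yyyy-MM-dd HH:mm:ss",
        "mapping": "event/created_at"
    }

A JSON message containing ``{"event": {"created_at": "2023-05-19 14:35:09"}}``
would then decode into the corresponding ``TIMESTAMP`` value.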
diff --git a/docs/src/main/sphinx/connector/kinesis.rst b/docs/src/main/sphinx/connector/kinesis.rst
index ce11ef92779c..a27ed6fa1c5e 100644
--- a/docs/src/main/sphinx/connector/kinesis.rst
+++ b/docs/src/main/sphinx/connector/kinesis.rst
@@ -261,6 +261,44 @@ and if it is a more complex type (JSON array or JSON object) then the JSON itself

There is no limit on field descriptions for either key or message.

.. _kinesis-type-mapping:

Type mapping
------------

Because Trino and Kinesis each support types that the other does not, this
connector :ref:`maps some types <type-mapping-overview>` when reading data.
Type mapping depends on the message format: RAW, CSV, JSON, or AVRO.

Row decoding
^^^^^^^^^^^^

A decoder is used to map data to table columns.

The connector contains the following decoders:

* ``raw``: Message is not interpreted; ranges of raw message bytes are mapped
  to table columns.
* ``csv``: Message is interpreted as comma-separated values, and fields are
  mapped to table columns.
* ``json``: Message is parsed as JSON, and JSON fields are mapped to table
  columns.
* ``avro``: Message is parsed based on an Avro schema, and Avro fields are
  mapped to table columns.

.. note::

   If no table definition file exists for a table, the ``dummy`` decoder is
   used, which does not expose any columns.

.. include:: raw-decoder.fragment

.. include:: csv-decoder.fragment

.. include:: json-decoder.fragment

.. include:: avro-decoder.fragment

.. _kinesis-sql-support:

SQL support

diff --git a/docs/src/main/sphinx/connector/raw-decoder.fragment b/docs/src/main/sphinx/connector/raw-decoder.fragment
new file mode 100644
index 000000000000..02366be8eea9
--- /dev/null
+++ b/docs/src/main/sphinx/connector/raw-decoder.fragment
@@ -0,0 +1,84 @@
Raw decoder
"""""""""""

The raw decoder supports reading of raw byte-based values from a message or
key, and converting them into Trino columns.

For fields, the following attributes are supported:

* ``dataFormat`` - Selects the width of the data type converted.
* ``type`` - Trino data type. See the following table for a list of supported
  data types.
* ``mapping`` - ``<start>[:<end>]`` - Start and end position of bytes to
  convert (optional).

The ``dataFormat`` attribute selects the number of bytes converted. If absent,
``BYTE`` is assumed. All values are signed.

Supported values are:

* ``BYTE`` - one byte
* ``SHORT`` - two bytes (big-endian)
* ``INT`` - four bytes (big-endian)
* ``LONG`` - eight bytes (big-endian)
* ``FLOAT`` - four bytes (IEEE 754 format)
* ``DOUBLE`` - eight bytes (IEEE 754 format)

The ``type`` attribute defines the Trino data type to which the value is
mapped.

Depending on the Trino type assigned to a column, different values of
``dataFormat`` can be used:

.. list-table::
   :widths: 40, 60
   :header-rows: 1

   * - Trino data type
     - Allowed ``dataFormat`` values
   * - ``BIGINT``
     - ``BYTE``, ``SHORT``, ``INT``, ``LONG``
   * - ``INTEGER``
     - ``BYTE``, ``SHORT``, ``INT``
   * - ``SMALLINT``
     - ``BYTE``, ``SHORT``
   * - ``TINYINT``
     - ``BYTE``
   * - ``DOUBLE``
     - ``DOUBLE``, ``FLOAT``
   * - ``BOOLEAN``
     - ``BYTE``, ``SHORT``, ``INT``, ``LONG``
   * - ``VARCHAR`` / ``VARCHAR(x)``
     - ``BYTE``

No other types are supported.

The ``mapping`` attribute specifies the range of bytes in a key or message
used for decoding. It can be one or two numbers separated by a colon
(``<start>[:<end>]``).

If only a start position is given:

* For fixed-width types, the column uses the appropriate number of bytes for
  the specified ``dataFormat`` (see above).
* When a ``VARCHAR`` value is decoded, all bytes from the start position to
  the end of the message are used.

If start and end position are given:

* For fixed-width types, the size must be equal to the number of bytes used by
  the specified ``dataFormat``.
* For the ``VARCHAR`` data type, all bytes between start (inclusive) and end
  (exclusive) are used.

If no ``mapping`` attribute is specified, it is equivalent to setting the
start position to 0 and leaving the end position undefined.

The decoding scheme for numeric data types (``BIGINT``, ``INTEGER``,
``SMALLINT``, ``TINYINT``, ``DOUBLE``) is straightforward. A sequence of bytes
is read from the input message and decoded according to either:

* big-endian encoding (for integer types), or
* IEEE 754 format (for ``DOUBLE``).

The length of the decoded byte sequence is implied by the ``dataFormat``.

For the ``VARCHAR`` data type, a sequence of bytes is interpreted according to
UTF-8 encoding.
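As an illustrative sketch — the field names and byte offsets are assumptions —
the following message definition reads an eight-byte big-endian ``LONG`` from
position 0, followed by sixteen bytes decoded as a UTF-8 string:

.. code-block:: json

    {
        "tableName": "example_table",
        "schemaName": "default",
        "message": {
            "dataFormat": "raw",
            "fields": [
                {
                    "name": "id",
                    "type": "BIGINT",
                    "dataFormat": "LONG",
                    "mapping": "0"
                },
                {
                    "name": "name",
                    "type": "VARCHAR",
                    "dataFormat": "BYTE",
                    "mapping": "8:24"
                }
            ]
        }
    }

The ``8:24`` range covers bytes 8 (inclusive) through 24 (exclusive), in line
with the ``mapping`` rules above.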
diff --git a/docs/src/main/sphinx/connector/redis.rst b/docs/src/main/sphinx/connector/redis.rst
index ccacc8ecd8b0..5a34845980a4 100644
--- a/docs/src/main/sphinx/connector/redis.rst
+++ b/docs/src/main/sphinx/connector/redis.rst
@@ -265,6 +265,42 @@ In addition to the above Kafka types, the Redis connector supports ``hash`` type

.. _Kafka connector: ./kafka.html

Type mapping
------------

Because Trino and Redis each support types that the other does not, this
connector :ref:`maps some types <type-mapping-overview>` when reading data.
Type mapping depends on the message format: RAW, CSV, JSON, or AVRO.

Row decoding
^^^^^^^^^^^^

A decoder is used to map data to table columns.

The connector contains the following decoders:

* ``raw``: Message is not interpreted; ranges of raw message bytes are mapped
  to table columns.
* ``csv``: Message is interpreted as comma-separated values, and fields are
  mapped to table columns.
* ``json``: Message is parsed as JSON, and JSON fields are mapped to table
  columns.
* ``avro``: Message is parsed based on an Avro schema, and Avro fields are
  mapped to table columns.

.. note::

   If no table definition file exists for a table, the ``dummy`` decoder is
   used, which does not expose any columns.

.. include:: raw-decoder.fragment

.. include:: csv-decoder.fragment

.. include:: json-decoder.fragment

.. include:: avro-decoder.fragment

.. _redis-sql-support:

SQL support