From 0afc37da5989f306305375500a53a73095c2dab1 Mon Sep 17 00:00:00 2001
From: Jessica
Date: Fri, 19 May 2023 14:35:09 -0400
Subject: [PATCH] Add type mapping section

---
 .../sphinx/connector/avro-decoder.fragment | 72 ++++++++++++++++
 .../sphinx/connector/csv-decoder.fragment  | 34 ++++++++
 .../sphinx/connector/json-decoder.fragment | 76 +++++++++++++++++
 docs/src/main/sphinx/connector/kinesis.rst | 38 +++++++++
 .../sphinx/connector/raw-decoder.fragment  | 84 +++++++++++++++++++
 docs/src/main/sphinx/connector/redis.rst   | 36 ++++++++
 6 files changed, 340 insertions(+)
 create mode 100644 docs/src/main/sphinx/connector/avro-decoder.fragment
 create mode 100644 docs/src/main/sphinx/connector/csv-decoder.fragment
 create mode 100644 docs/src/main/sphinx/connector/json-decoder.fragment
 create mode 100644 docs/src/main/sphinx/connector/raw-decoder.fragment

diff --git a/docs/src/main/sphinx/connector/avro-decoder.fragment b/docs/src/main/sphinx/connector/avro-decoder.fragment
new file mode 100644
index 000000000000..dfd358f60ee0
--- /dev/null
+++ b/docs/src/main/sphinx/connector/avro-decoder.fragment
@@ -0,0 +1,72 @@
Avro decoder
""""""""""""

The Avro decoder converts the bytes representing a message or key in Avro
format based on a schema. The message must have the Avro schema embedded. Trino
does not support schemaless Avro decoding.

The ``dataSchema`` must be defined for any key or message using the ``Avro``
decoder. It must point to the location of a valid Avro schema file for the
message to be decoded. This location can be a remote web server
(e.g.: ``dataSchema: 'http://example.org/schema/avro_data.avsc'``) or a local
file system (e.g.: ``dataSchema: '/usr/local/schema/avro_data.avsc'``). The
decoder fails if this location is not accessible from the Trino cluster.

The following attributes are supported:

* ``name`` - Name of the column in the Trino table.
* ``type`` - Trino data type of the column.
* ``mapping`` - A slash-separated list of field names to select a field from the
  Avro schema. If the field specified in ``mapping`` does not exist in the
  original Avro schema, a read operation returns ``NULL``.

The following table lists the supported Trino types that can be used in ``type``
for the equivalent Avro field types:

.. list-table::
   :widths: 40, 60
   :header-rows: 1

   * - Trino data type
     - Allowed Avro data type
   * - ``BIGINT``
     - ``INT``, ``LONG``
   * - ``DOUBLE``
     - ``DOUBLE``, ``FLOAT``
   * - ``BOOLEAN``
     - ``BOOLEAN``
   * - ``VARCHAR`` / ``VARCHAR(x)``
     - ``STRING``
   * - ``VARBINARY``
     - ``FIXED``, ``BYTES``
   * - ``ARRAY``
     - ``ARRAY``
   * - ``MAP``
     - ``MAP``

No other types are supported.
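For example, the following sketch shows how these attributes combine in a
table definition file. The table name, field names, and mapping paths are
illustrative assumptions; the schema location reuses the example path above:

.. code-block:: json

    {
        "tableName": "example_table",
        "schemaName": "default",
        "message": {
            "dataFormat": "avro",
            "dataSchema": "/usr/local/schema/avro_data.avsc",
            "fields": [
                {
                    "name": "user_id",
                    "type": "BIGINT",
                    "mapping": "id"
                },
                {
                    "name": "user_name",
                    "type": "VARCHAR",
                    "mapping": "profile/name"
                }
            ]
        }
    }

Here ``"mapping": "profile/name"`` selects the nested ``name`` field from a
hypothetical ``profile`` record in the Avro schema; if that field does not
exist in the schema, reads return ``NULL``.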
Avro schema evolution
+++++++++++++++++++++

The Avro decoder supports schema evolution with backward compatibility. With
backward compatibility, a newer schema can be used to read Avro data created
with an older schema. Any change in the Avro schema must also be reflected in
Trino's topic definition file. Newly added or renamed fields must have a
default value in the Avro schema file.

The schema evolution behavior is as follows:

* Column added in new schema: Data created with an older schema produces a
  *default* value when the table is using the new schema.

* Column removed in new schema: Data created with an older schema no longer
  outputs the data from the column that was removed.

* Column is renamed in the new schema: This is equivalent to removing the
  column and adding a new one, and data created with an older schema produces
  a *default* value when the table is using the new schema.

* Changing type of column in the new schema: If the type coercion is supported
  by Avro, then the conversion happens. An error is thrown for incompatible
  types.

diff --git a/docs/src/main/sphinx/connector/csv-decoder.fragment b/docs/src/main/sphinx/connector/csv-decoder.fragment
new file mode 100644
index 000000000000..845110213f38
--- /dev/null
+++ b/docs/src/main/sphinx/connector/csv-decoder.fragment
@@ -0,0 +1,34 @@
CSV decoder
"""""""""""

The CSV decoder converts the bytes representing a message or key into a string
using UTF-8 encoding, and interprets the result as a line of comma-separated
values.

For fields, the ``type`` and ``mapping`` attributes must be defined:

* ``type`` - Trino data type. See the following table for a list of supported
  data types.

* ``mapping`` - The index of the field in the CSV record.

The ``dataFormat`` and ``formatHint`` attributes are not supported and must be
omitted.

.. list-table::
   :widths: 40, 60
   :header-rows: 1

   * - Trino data type
     - Decoding rules
   * - ``BIGINT``, ``INTEGER``, ``SMALLINT``, ``TINYINT``
     - Decoded using Java ``Long.parseLong()``
   * - ``DOUBLE``
     - Decoded using Java ``Double.parseDouble()``
   * - ``BOOLEAN``
     - The "true" character sequence maps to ``true``. Other character
       sequences map to ``false``
   * - ``VARCHAR`` / ``VARCHAR(x)``
     - Used as is

No other types are supported.
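As a sketch only — the table and field names are placeholders — a message
definition using the CSV decoder maps each column to a field index in the
record:

.. code-block:: json

    {
        "tableName": "example_table",
        "schemaName": "default",
        "message": {
            "dataFormat": "csv",
            "fields": [
                {
                    "name": "order_id",
                    "type": "BIGINT",
                    "mapping": "0"
                },
                {
                    "name": "total",
                    "type": "DOUBLE",
                    "mapping": "1"
                },
                {
                    "name": "comment",
                    "type": "VARCHAR",
                    "mapping": "2"
                }
            ]
        }
    }

A message such as ``42,19.99,paid`` would then decode into a ``BIGINT``, a
``DOUBLE``, and a ``VARCHAR`` column, following the decoding rules in the
table above.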
diff --git a/docs/src/main/sphinx/connector/json-decoder.fragment b/docs/src/main/sphinx/connector/json-decoder.fragment
new file mode 100644
index 000000000000..d3c402013d0e
--- /dev/null
+++ b/docs/src/main/sphinx/connector/json-decoder.fragment
@@ -0,0 +1,76 @@
JSON decoder
""""""""""""

The JSON decoder converts the bytes representing a message or key into
JavaScript Object Notation (JSON) according to :rfc:`4627`. The message or key
must convert into a JSON object, not an array or simple type.

For fields, the following attributes are supported:

* ``type`` - Trino data type of the column.
* ``dataFormat`` - Field decoder to be used for the column.
* ``mapping`` - Slash-separated list of field names to select a field from the
  JSON object.
* ``formatHint`` - Only for ``custom-date-time``.

The JSON decoder supports multiple field decoders: ``_default`` is used for
standard table columns, and a number of decoders handle date- and time-based
types.

The following table lists the Trino data types that can be used in ``type``,
and the matching field decoders that can be specified via the ``dataFormat``
attribute:

.. list-table::
   :widths: 40, 60
   :header-rows: 1

   * - Trino data type
     - Allowed ``dataFormat`` values
   * - ``BIGINT``, ``INTEGER``, ``SMALLINT``, ``TINYINT``, ``DOUBLE``,
       ``BOOLEAN``, ``VARCHAR``, ``VARCHAR(x)``
     - Default field decoder (omitted ``dataFormat`` attribute)
   * - ``DATE``
     - ``custom-date-time``, ``iso8601``
   * - ``TIME``
     - ``custom-date-time``, ``iso8601``, ``milliseconds-since-epoch``,
       ``seconds-since-epoch``
   * - ``TIME WITH TIME ZONE``
     - ``custom-date-time``, ``iso8601``
   * - ``TIMESTAMP``
     - ``custom-date-time``, ``iso8601``, ``rfc2822``,
       ``milliseconds-since-epoch``, ``seconds-since-epoch``
   * - ``TIMESTAMP WITH TIME ZONE``
     - ``custom-date-time``, ``iso8601``, ``rfc2822``,
       ``milliseconds-since-epoch``, ``seconds-since-epoch``

No other types are supported.

Default field decoder
+++++++++++++++++++++

This is the standard field decoder. It supports all the Trino physical data
types. A field value is transformed under JSON conversion rules into boolean,
long, double, or string values. This decoder should be used for columns that
are not date or time based.

Date and time decoders
++++++++++++++++++++++

To convert values from JSON objects to Trino ``DATE``, ``TIME``, ``TIME WITH
TIME ZONE``, ``TIMESTAMP`` or ``TIMESTAMP WITH TIME ZONE`` columns, select
special decoders using the ``dataFormat`` attribute of a field definition.

* ``iso8601`` - Text based, parses a text field as an ISO 8601 timestamp.
* ``rfc2822`` - Text based, parses a text field as an :rfc:`2822` timestamp.
* ``custom-date-time`` - Text based, parses a text field according to the Joda
  format pattern specified via the ``formatHint`` attribute. The format pattern
  should conform to
  https://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html.
* ``milliseconds-since-epoch`` - Number-based, interprets a text or number as
  the number of milliseconds since the epoch.
* ``seconds-since-epoch`` - Number-based, interprets a text or number as the
  number of seconds since the epoch.

For ``TIMESTAMP WITH TIME ZONE`` and ``TIME WITH TIME ZONE`` data types, if
timezone information is present in the decoded value, it is used as the Trino
value. Otherwise, the result time zone is set to ``UTC``.
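For example, a field definition selecting a nested JSON timestamp with the
``custom-date-time`` decoder might look like the following sketch. The field
name, mapping path, and format pattern are assumptions for illustration:

.. code-block:: json

    {
        "name": "created_at",
        "type": "TIMESTAMP",
        "dataFormat": "custom-date-time",
        "formatHint": "yyyy-MM-dd HH:mm:ss",
        "mapping": "event/created_at"
    }

A JSON message containing ``{"event": {"created_at": "2023-05-19 14:35:09"}}``
would then decode into the corresponding ``TIMESTAMP`` value.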
diff --git a/docs/src/main/sphinx/connector/kinesis.rst b/docs/src/main/sphinx/connector/kinesis.rst
index ce11ef92779c..a27ed6fa1c5e 100644
--- a/docs/src/main/sphinx/connector/kinesis.rst
+++ b/docs/src/main/sphinx/connector/kinesis.rst
@@ -261,6 +261,44 @@ and if it is a more complex type (JSON array or JSON object) then the JSON itself

There is no limit on field descriptions for either key or message.

.. _kinesis-type-mapping:

Type mapping
------------

Because Trino and Kinesis each support types that the other does not, this
connector :ref:`maps some types <type-mapping-overview>` when reading data.
Type mapping depends on the message format: RAW, CSV, JSON, or AVRO.

Row decoding
^^^^^^^^^^^^

A decoder is used to map data to table columns.

The connector contains the following decoders:

* ``raw``: Message is not interpreted; ranges of raw message bytes are mapped
  to table columns.
* ``csv``: Message is interpreted as comma-separated values, and fields are
  mapped to table columns.
* ``json``: Message is parsed as JSON, and JSON fields are mapped to table
  columns.
* ``avro``: Message is parsed based on an Avro schema, and Avro fields are
  mapped to table columns.

.. note::

   If no table definition file exists for a table, the ``dummy`` decoder is
   used, which does not expose any columns.

.. include:: raw-decoder.fragment

.. include:: csv-decoder.fragment

.. include:: json-decoder.fragment

.. include:: avro-decoder.fragment

.. _kinesis-sql-support:

SQL support

diff --git a/docs/src/main/sphinx/connector/raw-decoder.fragment b/docs/src/main/sphinx/connector/raw-decoder.fragment
new file mode 100644
index 000000000000..02366be8eea9
--- /dev/null
+++ b/docs/src/main/sphinx/connector/raw-decoder.fragment
@@ -0,0 +1,84 @@
Raw decoder
"""""""""""

The raw decoder supports reading of raw byte-based values from a message or
key, and converting them into Trino columns.

For fields, the following attributes are supported:

* ``dataFormat`` - Selects the width of the data type converted.
* ``type`` - Trino data type. See the following table for a list of supported
  data types.
* ``mapping`` - ``<start>[:<end>]`` - Start and end position of bytes to
  convert (optional).

The ``dataFormat`` attribute selects the number of bytes converted. If absent,
``BYTE`` is assumed. All values are signed.

Supported values are:

* ``BYTE`` - one byte
* ``SHORT`` - two bytes (big-endian)
* ``INT`` - four bytes (big-endian)
* ``LONG`` - eight bytes (big-endian)
* ``FLOAT`` - four bytes (IEEE 754 format)
* ``DOUBLE`` - eight bytes (IEEE 754 format)

The ``type`` attribute defines the Trino data type to which the value is
mapped.

Depending on the Trino type assigned to a column, different values of
``dataFormat`` can be used:

.. list-table::
   :widths: 40, 60
   :header-rows: 1

   * - Trino data type
     - Allowed ``dataFormat`` values
   * - ``BIGINT``
     - ``BYTE``, ``SHORT``, ``INT``, ``LONG``
   * - ``INTEGER``
     - ``BYTE``, ``SHORT``, ``INT``
   * - ``SMALLINT``
     - ``BYTE``, ``SHORT``
   * - ``TINYINT``
     - ``BYTE``
   * - ``DOUBLE``
     - ``DOUBLE``, ``FLOAT``
   * - ``BOOLEAN``
     - ``BYTE``, ``SHORT``, ``INT``, ``LONG``
   * - ``VARCHAR`` / ``VARCHAR(x)``
     - ``BYTE``

No other types are supported.

The ``mapping`` attribute specifies the range of bytes in a key or message
used for decoding. It can be one or two numbers separated by a colon
(``<start>[:<end>]``).

If only a start position is given:

* For fixed-width types, the column uses the appropriate number of bytes for
  the specified ``dataFormat`` (see above).
* When a ``VARCHAR`` value is decoded, all bytes from the start position to
  the end of the message are used.

If start and end position are given:

* For fixed-width types, the size must be equal to the number of bytes used by
  the specified ``dataFormat``.
* For the ``VARCHAR`` data type, all bytes between start (inclusive) and end
  (exclusive) are used.

If no ``mapping`` attribute is specified, it is equivalent to setting the
start position to 0 and leaving the end position undefined.

The decoding scheme for numeric data types (``BIGINT``, ``INTEGER``,
``SMALLINT``, ``TINYINT``, ``DOUBLE``) is straightforward. A sequence of bytes
is read from the input message and decoded according to either:

* big-endian encoding (for integer types), or
* IEEE 754 format (for ``DOUBLE``).

The length of the decoded byte sequence is implied by the ``dataFormat``.

For the ``VARCHAR`` data type, a sequence of bytes is interpreted according to
UTF-8 encoding.
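As an illustrative sketch — the field names and byte offsets are assumptions —
the following message definition reads an eight-byte big-endian ``LONG`` from
position 0, followed by sixteen bytes decoded as a UTF-8 string:

.. code-block:: json

    {
        "tableName": "example_table",
        "schemaName": "default",
        "message": {
            "dataFormat": "raw",
            "fields": [
                {
                    "name": "id",
                    "type": "BIGINT",
                    "dataFormat": "LONG",
                    "mapping": "0"
                },
                {
                    "name": "name",
                    "type": "VARCHAR",
                    "dataFormat": "BYTE",
                    "mapping": "8:24"
                }
            ]
        }
    }

The ``8:24`` range covers bytes 8 (inclusive) through 24 (exclusive), in line
with the ``mapping`` rules above.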
diff --git a/docs/src/main/sphinx/connector/redis.rst b/docs/src/main/sphinx/connector/redis.rst
index ccacc8ecd8b0..5a34845980a4 100644
--- a/docs/src/main/sphinx/connector/redis.rst
+++ b/docs/src/main/sphinx/connector/redis.rst
@@ -265,6 +265,42 @@ In addition to the above Kafka types, the Redis connector supports ``hash`` type

.. _Kafka connector: ./kafka.html

Type mapping
------------

Because Trino and Redis each support types that the other does not, this
connector :ref:`maps some types <type-mapping-overview>` when reading data.
Type mapping depends on the message format: RAW, CSV, JSON, or AVRO.

Row decoding
^^^^^^^^^^^^

A decoder is used to map data to table columns.

The connector contains the following decoders:

* ``raw``: Message is not interpreted; ranges of raw message bytes are mapped
  to table columns.
* ``csv``: Message is interpreted as comma-separated values, and fields are
  mapped to table columns.
* ``json``: Message is parsed as JSON, and JSON fields are mapped to table
  columns.
* ``avro``: Message is parsed based on an Avro schema, and Avro fields are
  mapped to table columns.

.. note::

   If no table definition file exists for a table, the ``dummy`` decoder is
   used, which does not expose any columns.

.. include:: raw-decoder.fragment

.. include:: csv-decoder.fragment

.. include:: json-decoder.fragment

.. include:: avro-decoder.fragment

.. _redis-sql-support:

SQL support