From 28a5031547e4606d5304df931d8d97faff29123b Mon Sep 17 00:00:00 2001
From: Manfred Moser
Date: Mon, 3 Jun 2024 14:49:30 -0700
Subject: [PATCH] Add docs for OpenX_JSON serde

- Also rejig some file format stuff as side effect
---
 docs/src/main/sphinx/connector/hive.md        | 25 ++++++++++++-------
 .../sphinx/object-storage/file-formats.md     | 10 +-------
 2 files changed, 17 insertions(+), 18 deletions(-)

diff --git a/docs/src/main/sphinx/connector/hive.md b/docs/src/main/sphinx/connector/hive.md
index 8d0ba30ec879..98c2aa04f50e 100644
--- a/docs/src/main/sphinx/connector/hive.md
+++ b/docs/src/main/sphinx/connector/hive.md
@@ -34,17 +34,24 @@ The coordinator and all workers must have network access to the Hive metastore
 and the storage system. Hive metastore access with the Thrift protocol defaults
 to using port 9083.

-Data files must be in a supported file format. Some file formats can be
-configured using file format configuration properties per catalog:
+Data files must be in a supported file format. File formats can be
+configured using the [`format` table property](hive-table-properties)
+and other specific properties:

 - {ref}`ORC `
 - {ref}`Parquet `
 - Avro
-- RCText (RCFile using ColumnarSerDe)
-- RCBinary (RCFile using LazyBinaryColumnarSerDe)
+
+In the case of serializable formats, only specific
+[SerDes](https://www.wikipedia.org/wiki/SerDes) are allowed:
+
+- RCText - RCFile using `ColumnarSerDe`
+- RCBinary - RCFile using `LazyBinaryColumnarSerDe`
 - SequenceFile
-- JSON (using org.apache.hive.hcatalog.data.JsonSerDe)
-- CSV (using org.apache.hadoop.hive.serde2.OpenCSVSerde)
+- CSV - using `org.apache.hadoop.hive.serde2.OpenCSVSerde`
+- JSON - using `org.apache.hive.hcatalog.data.JsonSerDe`
+- OPENX_JSON - OpenX JSON SerDe from `org.openx.data.jsonserde.JsonSerDe`. Find
+  more [details about the Trino implementation in the source repository](https://github.com/trinodb/trino/tree/master/lib/trino-hive-formats/src/main/java/io/trino/hive/formats/line/openxjson/README.md).
 - TextFile

 (hive-configuration)=
@@ -783,9 +790,9 @@ WITH (format='CSV',
   -
 * - `format`
   - The table file format. Valid values include `ORC`, `PARQUET`, `AVRO`,
-    `RCBINARY`, `RCTEXT`, `SEQUENCEFILE`, `JSON`, `TEXTFILE`, `CSV`, and
-    `REGEX`. The catalog property `hive.storage-format` sets the default value
-    and can change it to a different default.
+    `RCBINARY`, `RCTEXT`, `SEQUENCEFILE`, `JSON`, `OPENX_JSON`, `TEXTFILE`,
+    `CSV`, and `REGEX`. The catalog property `hive.storage-format` sets the
+    default value and can change it to a different default.
   -
 * - `null_format`
   - The serialization format for `NULL` value. Requires TextFile, RCText, or
diff --git a/docs/src/main/sphinx/object-storage/file-formats.md b/docs/src/main/sphinx/object-storage/file-formats.md
index 406aa31a7c68..bc5102dd55c0 100644
--- a/docs/src/main/sphinx/object-storage/file-formats.md
+++ b/docs/src/main/sphinx/object-storage/file-formats.md
@@ -3,14 +3,6 @@
 Object storage connectors support one or more file formats specified by the
 underlying data source.

-In the case of serializable formats, only specific
-[SerDes](https://www.wikipedia.org/wiki/SerDes) are allowed:
-
-- RCText - RCFile `ColumnarSerDe`
-- RCBinary - RCFile `LazyBinaryColumnarSerDe`
-- JSON - `org.apache.hive.hcatalog.data.JsonSerDe`
-- CSV - `org.apache.hadoop.hive.serde2.OpenCSVSerde`
-
 (hive-orc-configuration)=

 ## ORC format configuration properties
@@ -108,7 +100,7 @@ with Parquet files performed by supported object storage connectors:
     `parquet_small_file_threshold`.
   - `3MB`
 * - `parquet.experimental.vectorized-decoding.enabled`
-  - Enable using Java Vector API for faster decoding of parquet files.
+  - Enable using Java Vector API (SIMD) for faster decoding of parquet files.
     The equivalent catalog session property is `parquet_vectorized_decoding_enabled`.
   - `true`