Add docs for OpenX_JSON serde
- Also rejig some file format stuff as side effect
mosabua committed Jun 3, 2024
1 parent 8e5cfb5 commit 28a5031
Showing 2 changed files with 17 additions and 18 deletions.
25 changes: 16 additions & 9 deletions docs/src/main/sphinx/connector/hive.md
@@ -34,17 +34,24 @@ The coordinator and all workers must have network access to the Hive metastore
 and the storage system. Hive metastore access with the Thrift protocol defaults
 to using port 9083.

-Data files must be in a supported file format. Some file formats can be
-configured using file format configuration properties per catalog:
+Data files must be in a supported file format. File formats can be
+configured using the [`format` table property](hive-table-properties)
+and other specific properties:

 - {ref}`ORC <hive-orc-configuration>`
 - {ref}`Parquet <hive-parquet-configuration>`
 - Avro
-- RCText (RCFile using ColumnarSerDe)
-- RCBinary (RCFile using LazyBinaryColumnarSerDe)
+
+In the case of serializable formats, only specific
+[SerDes](https://www.wikipedia.org/wiki/SerDes) are allowed:
+
+- RCText - RCFile using `ColumnarSerDe`
+- RCBinary - RCFile using `LazyBinaryColumnarSerDe`
 - SequenceFile
-- JSON (using org.apache.hive.hcatalog.data.JsonSerDe)
-- CSV (using org.apache.hadoop.hive.serde2.OpenCSVSerde)
+- CSV - using `org.apache.hadoop.hive.serde2.OpenCSVSerde`
+- JSON - using `org.apache.hive.hcatalog.data.JsonSerDe`
+- OPENX_JSON - OpenX JSON SerDe from `org.openx.data.jsonserde.JsonSerDe`. Find
+more [details about the Trino implementation in the source repository](https://github.com/trinodb/trino/tree/master/lib/trino-hive-formats/src/main/java/io/trino/hive/formats/line/openxjson/README.md).
 - TextFile

 (hive-configuration)=
@@ -783,9 +790,9 @@ WITH (format='CSV',
 -
 * - `format`
 - The table file format. Valid values include `ORC`, `PARQUET`, `AVRO`,
-`RCBINARY`, `RCTEXT`, `SEQUENCEFILE`, `JSON`, `TEXTFILE`, `CSV`, and
-`REGEX`. The catalog property `hive.storage-format` sets the default value
-and can change it to a different default.
+`RCBINARY`, `RCTEXT`, `SEQUENCEFILE`, `JSON`, `OPENX_JSON`, `TEXTFILE`,
+`CSV`, and `REGEX`. The catalog property `hive.storage-format` sets the
+default value and can change it to a different default.
 -
 * - `null_format`
 - The serialization format for `NULL` value. Requires TextFile, RCText, or
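As an illustration of the `format` table property documented in this change, the following is a minimal sketch of creating a table that uses the newly listed `OPENX_JSON` value. The catalog name `example`, the schema `web`, and the table and column names are hypothetical placeholders, not part of the commit.

```sql
-- Hypothetical catalog, schema, table, and columns; substitute your own.
CREATE TABLE example.web.request_logs (
    request_time timestamp,
    url varchar,
    status integer
)
WITH (
    -- One of the valid values listed for the `format` table property.
    format = 'OPENX_JSON'
);
```

If the `format` property is omitted, the default set by the catalog property `hive.storage-format` applies, as the property description states.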
10 changes: 1 addition & 9 deletions docs/src/main/sphinx/object-storage/file-formats.md
@@ -3,14 +3,6 @@
 Object storage connectors support one or more file formats specified by the
 underlying data source.

-In the case of serializable formats, only specific
-[SerDes](https://www.wikipedia.org/wiki/SerDes) are allowed:
-
-- RCText - RCFile `ColumnarSerDe`
-- RCBinary - RCFile `LazyBinaryColumnarSerDe`
-- JSON - `org.apache.hive.hcatalog.data.JsonSerDe`
-- CSV - `org.apache.hadoop.hive.serde2.OpenCSVSerde`
-
 (hive-orc-configuration)=
 ## ORC format configuration properties

@@ -108,7 +100,7 @@ with Parquet files performed by supported object storage connectors:
 `parquet_small_file_threshold`.
 - `3MB`
 * - `parquet.experimental.vectorized-decoding.enabled`
-- Enable using Java Vector API for faster decoding of parquet files.
+- Enable using Java Vector API (SIMD) for faster decoding of parquet files.
 The equivalent catalog session property is
 `parquet_vectorized_decoding_enabled`.
 - `true`
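The catalog session property named in the hunk above can be toggled per session. This is a minimal sketch, assuming a catalog named `example` (a hypothetical placeholder) that is backed by an object storage connector supporting this property.

```sql
-- `example` is a hypothetical catalog name; substitute your own.
-- Disable vectorized Parquet decoding for the current session only.
SET SESSION example.parquet_vectorized_decoding_enabled = false;

-- Return to the default configured for the catalog.
RESET SESSION example.parquet_vectorized_decoding_enabled;
```

The session property affects only the current session; the catalog configuration property `parquet.experimental.vectorized-decoding.enabled` sets the default for all sessions.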
