From 25fb75550eed7998698e795c184d6eb883ba7729 Mon Sep 17 00:00:00 2001
From: deshanxiao
Date: Tue, 16 May 2023 13:19:24 -0700
Subject: [PATCH] ORC-1409: [Docs] Add stream order description in ORC spec

### What changes were proposed in this pull request?

This PR is aimed to add more description about stream order in ORC spec.

### Why are the changes needed?

There are many users who are misled by the order of the document table, in fact the stream has no fixed order. #1450

### How was this patch tested?

Closes #1465 from deshanxiao/add-order-description.

Authored-by: deshanxiao
Signed-off-by: Dongjoon Hyun
---
 site/specification/ORCv0.md | 28 +++++++++++++++++++++++++++-
 site/specification/ORCv1.md | 27 ++++++++++++++++++++++++++-
 site/specification/ORCv2.md | 27 ++++++++++++++++++++++++++-
 3 files changed, 79 insertions(+), 3 deletions(-)

diff --git a/site/specification/ORCv0.md b/site/specification/ORCv0.md
index 3ca4772123..de3e4b512e 100644
--- a/site/specification/ORCv0.md
+++ b/site/specification/ORCv0.md
@@ -501,6 +501,24 @@ uses three streams PRESENT, DATA, and LENGTH, which stores the length of each
 value. The details of each type will be presented in the following
 subsections.
 
+There is a general order for index and data streams:
+* Index streams are always placed together in the beginning of the stripe.
+* Data streams are placed together after index streams (if any).
+* Inside index streams or data streams, the unencrypted streams should be
+  placed first and then followed by streams grouped by each encryption variant.
+
+There is no fixed order within each unencrypted or encryption variant in the
+index and data streams:
+* Different stream kinds of the same column can be placed in any order.
+* Streams from different columns can even be placed in any order.
+  To get the precise information (a.k.a stream kind, column id and location) of
+  a stream within a stripe, the streams field in the StripeFooter described below
+  is the single source of truth.
+
+In the example of the integer column mentioned above, the order of the
+PRESENT stream and the DATA stream cannot be determined in advance.
+We need to get the precise information by **StripeFooter**.
+
 ## Stripe Footer
 
 The stripe footer contains the encoding of each column and the
@@ -566,7 +584,7 @@ message ColumnEncoding {
 }
 ```
 
-# Column Encodings
+# Column Encodings
 
 ## SmallInt, Int, and BigInt Columns
 
@@ -581,6 +599,8 @@ Encoding | Stream Kind | Optional | Contents
 DIRECT | PRESENT | Yes | Boolean RLE
  | DATA | No | Signed Integer RLE v1
 
+> Note that the order of the Stream is not fixed. It also applies to other Column types.
+
 ## Float and Double Columns
 
 Floating point types are stored using IEEE 754 floating point bit
@@ -789,3 +809,9 @@ indexes error-prone.
 Because dictionaries are accessed randomly, there is not a position to
 record for the dictionary and the entire dictionary must be read even
 if only part of a stripe is being read.
+
+Note that for columns with multiple streams, the order of stream
+positions in the RowIndex is **fixed**, which may be different to
+the actual data stream placement, and it is the same as
+[Column Encodings](#column-encoding-section) section we described above.
+
diff --git a/site/specification/ORCv1.md b/site/specification/ORCv1.md
index fd18ae0b8a..cb99f6081f 100644
--- a/site/specification/ORCv1.md
+++ b/site/specification/ORCv1.md
@@ -895,6 +895,24 @@ The layout of each stripe looks like:
    * encryption variant 1..N
 * stripe footer
 
+There is a general order for index and data streams:
+* Index streams are always placed together in the beginning of the stripe.
+* Data streams are placed together after index streams (if any).
+* Inside index streams or data streams, the unencrypted streams should be
+  placed first and then followed by streams grouped by each encryption variant.
+
+There is no fixed order within each unencrypted or encryption variant in the
+index and data streams:
+* Different stream kinds of the same column can be placed in any order.
+* Streams from different columns can even be placed in any order.
+  To get the precise information (a.k.a stream kind, column id and location) of
+  a stream within a stripe, the streams field in the StripeFooter described below
+  is the single source of truth.
+
+In the example of the integer column mentioned above, the order of the
+PRESENT stream and the DATA stream cannot be determined in advance.
+We need to get the precise information by **StripeFooter**.
+
 ## Stripe Footer
 
 The stripe footer contains the encoding of each column and the
@@ -993,7 +1011,7 @@ message ColumnEncoding {
 }
 ```
 
-# Column Encodings
+# Column Encodings
 
 ## SmallInt, Int, and BigInt Columns
 
@@ -1010,6 +1028,8 @@ DIRECT | PRESENT | Yes | Boolean RLE
 DIRECT_V2 | PRESENT | Yes | Boolean RLE
  | DATA | No | Signed Integer RLE v2
 
+> Note that the order of the Stream is not fixed. It also applies to other Column types.
+
 ## Float and Double Columns
 
 Floating point types are stored using IEEE 754 floating point bit
@@ -1241,6 +1261,11 @@ Because dictionaries are accessed randomly, there is not a position to
 record for the dictionary and the entire dictionary must be read even
 if only part of a stripe is being read.
 
+Note that for columns with multiple streams, the order of stream
+positions in the RowIndex is **fixed**, which may be different to
+the actual data stream placement, and it is the same as
+[Column Encodings](#column-encoding-section) section we described above.
+
 ## Bloom Filter Index
 
 Bloom Filters are added to ORC indexes from Hive 1.2.0 onwards.
diff --git a/site/specification/ORCv2.md b/site/specification/ORCv2.md
index 73d89cde44..6d82e9e969 100644
--- a/site/specification/ORCv2.md
+++ b/site/specification/ORCv2.md
@@ -914,6 +914,24 @@ The layout of each stripe looks like:
    * encryption variant 1..N
 * stripe footer
 
+There is a general order for index and data streams:
+* Index streams are always placed together in the beginning of the stripe.
+* Data streams are placed together after index streams (if any).
+* Inside index streams or data streams, the unencrypted streams should be
+  placed first and then followed by streams grouped by each encryption variant.
+
+There is no fixed order within each unencrypted or encryption variant in the
+index and data streams:
+* Different stream kinds of the same column can be placed in any order.
+* Streams from different columns can even be placed in any order.
+  To get the precise information (a.k.a stream kind, column id and location) of
+  a stream within a stripe, the streams field in the StripeFooter described below
+  is the single source of truth.
+
+In the example of the integer column mentioned above, the order of the
+PRESENT stream and the DATA stream cannot be determined in advance.
+We need to get the precise information by **StripeFooter**.
+
 ## Stripe Footer
 
 The stripe footer contains the encoding of each column and the
@@ -1012,7 +1030,7 @@ message ColumnEncoding {
 }
 ```
 
-# Column Encodings
+# Column Encodings
 
 ## SmallInt, Int, and BigInt Columns
 
@@ -1029,6 +1047,8 @@ DIRECT | PRESENT | Yes | Boolean RLE
 DIRECT_V2 | PRESENT | Yes | Boolean RLE
  | DATA | No | Signed Integer RLE v2
 
+> Note that the order of the Stream is not fixed. It also applies to other Column types.
+
 ## Float and Double Columns
 
 Floating point types are stored using IEEE 754 floating point bit
@@ -1257,6 +1277,11 @@ Because dictionaries are accessed randomly, there is not a position to
 record for the dictionary and the entire dictionary must be read even
 if only part of a stripe is being read.
 
+Note that for columns with multiple streams, the order of stream
+positions in the RowIndex is **fixed**, which may be different to
+the actual data stream placement, and it is the same as
+[Column Encodings](#column-encoding-section) section we described above.
+
 ## Bloom Filter Index
 
 Bloom Filters are added to ORC indexes from Hive 1.2.0 onwards.
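
Reviewer note (not part of the patch): the rule this patch documents, that the StripeFooter stream list rather than the table order determines where each stream lives, can be illustrated with a minimal Python sketch. It assumes the footer lists streams in their physical order, so that each stream's offset is the stripe offset plus the lengths of the streams before it. `StreamEntry` and `locate_streams` are hypothetical names invented for this illustration, not part of any ORC library, and the lengths are made-up values.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class StreamEntry:
    """Hypothetical stand-in for one entry of the StripeFooter streams field."""
    kind: str    # e.g. "ROW_INDEX", "PRESENT", "DATA"
    column: int  # column id
    length: int  # stream length in bytes


def locate_streams(
    stripe_offset: int, streams: List[StreamEntry]
) -> Dict[Tuple[int, str], Tuple[int, int]]:
    """Map (column, kind) to (offset, length) by walking the footer list.

    Assumption: streams are laid out back to back in footer order, so the
    offset of each stream is the running sum of the preceding lengths.
    """
    locations: Dict[Tuple[int, str], Tuple[int, int]] = {}
    offset = stripe_offset
    for stream in streams:
        locations[(stream.column, stream.kind)] = (offset, stream.length)
        offset += stream.length
    return locations


# Illustrative footer in which DATA precedes PRESENT for column 1,
# i.e. the reverse of the order shown in the Column Encodings table.
footer_streams = [
    StreamEntry("ROW_INDEX", 1, 20),
    StreamEntry("DATA", 1, 100),
    StreamEntry("PRESENT", 1, 8),
]
locations = locate_streams(1000, footer_streams)
```

A reader that looks streams up by `(column, kind)` in such a map never depends on the table order, which is the behavior the added spec text asks for.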