ORC-1409: [Docs] Add stream order description in ORC spec
### What changes were proposed in this pull request?
This PR aims to add more description of stream order to the ORC spec.

### Why are the changes needed?
Many users are misled by the order of the rows in the spec's tables; in fact, the streams have no fixed order.

#1450

### How was this patch tested?

Closes #1465 from deshanxiao/add-order-description.

Authored-by: deshanxiao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
deshanxiao authored and dongjoon-hyun committed May 16, 2023
1 parent d2e9b72 commit 25fb755
Showing 3 changed files with 79 additions and 3 deletions.
28 changes: 27 additions & 1 deletion site/specification/ORCv0.md
@@ -501,6 +501,24 @@ uses three streams PRESENT, DATA, and LENGTH, which stores the length
of each value. The details of each type will be presented in the
following subsections.

There is a general order for index and data streams:
* Index streams are always placed together at the beginning of the stripe.
* Data streams are placed together after index streams (if any).
* Inside index streams or data streams, the unencrypted streams should be
placed first, followed by streams grouped by each encryption variant.

Within the unencrypted group and within each encryption variant, the index
and data streams have no fixed order:
* Different stream kinds of the same column can be placed in any order.
* Streams from different columns can also be placed in any order.

The streams field in the StripeFooter described below is the single source of
truth for the precise information (i.e., stream kind, column id, and location)
of a stream within a stripe.

In the example of the integer column mentioned above, the order of the
PRESENT stream and the DATA stream cannot be determined in advance.
A reader must obtain it from the **StripeFooter**.
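
As a rough illustration of reading stream locations from the stripe footer rather than from the tables in this document, the sketch below walks a footer's stream list and accumulates lengths to get each stream's byte range. It assumes that streams are stored back to back in the order the footer lists them, starting at the stripe's offset. The `Stream` class, the `locate_streams` helper, and the sample lengths are hypothetical names and values chosen for this example, not part of any ORC library.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Stream:
    """Hypothetical stand-in for one entry of the footer's streams field."""
    kind: str    # e.g. "ROW_INDEX", "PRESENT", "DATA"
    column: int  # column id
    length: int  # number of bytes in the stream


def locate_streams(stripe_offset: int,
                   streams: List[Stream]) -> Dict[Tuple[int, str], Tuple[int, int]]:
    """Map (column id, stream kind) to (start offset, length).

    Assumption: streams are laid out contiguously in footer order,
    so each start offset is the running sum of the preceding lengths.
    """
    locations = {}
    offset = stripe_offset
    for stream in streams:
        locations[(stream.column, stream.kind)] = (offset, stream.length)
        offset += stream.length
    return locations


# Made-up footer in which DATA is written before PRESENT for column 1.
footer_streams = [
    Stream("ROW_INDEX", 1, 20),
    Stream("DATA", 1, 100),
    Stream("PRESENT", 1, 10),
]
locations = locate_streams(3, footer_streams)
print(locations[(1, "PRESENT")])  # start offset and length of PRESENT
```

Looking streams up by (column id, stream kind) keeps the reader independent of whichever physical order the writer happened to choose.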

## Stripe Footer

The stripe footer contains the encoding of each column and the
@@ -566,7 +584,7 @@ message ColumnEncoding {
}
```

# Column Encodings
# <a id="column-encoding-section">Column Encodings</a>

## SmallInt, Int, and BigInt Columns

@@ -581,6 +599,8 @@ Encoding | Stream Kind | Optional | Contents
DIRECT | PRESENT | Yes | Boolean RLE
| DATA | No | Signed Integer RLE v1

> Note that the order of the streams is not fixed. This also applies to other column types.

## Float and Double Columns

Floating point types are stored using IEEE 754 floating point bit
@@ -789,3 +809,9 @@ indexes error-prone.
Because dictionaries are accessed randomly, there is not a position to
record for the dictionary and the entire dictionary must be read even
if only part of a stripe is being read.

Note that for columns with multiple streams, the order of stream
positions in the RowIndex is **fixed**: it follows the order given in the
[Column Encodings](#column-encoding-section) section above, which may
differ from the actual data stream placement.
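
To make the distinction concrete, here is a minimal sketch that splits one row index entry's flat list of positions into per-stream groups using the fixed order from the Column Encodings tables. The `split_positions` helper is hypothetical, and the number of positions assigned to each stream below is illustrative only: the real count depends on the stream kind and on whether compression is used.

```python
from typing import Dict, List, Tuple


def split_positions(positions: List[int],
                    layout: List[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Slice a flat positions list into per-stream groups.

    `layout` lists (stream kind, number of positions) in the fixed
    order of the Column Encodings table, not in the order the streams
    are physically written in the stripe.
    """
    result = {}
    cursor = 0
    for kind, count in layout:
        result[kind] = positions[cursor:cursor + count]
        cursor += count
    if cursor != len(positions):
        raise ValueError("layout does not account for every position")
    return result


# Illustrative counts for an integer column: PRESENT first, then DATA,
# even if the DATA stream precedes the PRESENT stream on disk.
layout = [("PRESENT", 3), ("DATA", 2)]
by_stream = split_positions([0, 5, 2, 40, 7], layout)
print(by_stream)
```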

27 changes: 26 additions & 1 deletion site/specification/ORCv1.md
@@ -895,6 +895,24 @@ The layout of each stripe looks like:
* encryption variant 1..N
* stripe footer

There is a general order for index and data streams:
* Index streams are always placed together at the beginning of the stripe.
* Data streams are placed together after index streams (if any).
* Inside index streams or data streams, the unencrypted streams should be
placed first, followed by streams grouped by each encryption variant.

Within the unencrypted group and within each encryption variant, the index
and data streams have no fixed order:
* Different stream kinds of the same column can be placed in any order.
* Streams from different columns can also be placed in any order.

The streams field in the StripeFooter described below is the single source of
truth for the precise information (i.e., stream kind, column id, and location)
of a stream within a stripe.

In the example of the integer column mentioned above, the order of the
PRESENT stream and the DATA stream cannot be determined in advance.
A reader must obtain it from the **StripeFooter**.

## Stripe Footer

The stripe footer contains the encoding of each column and the
@@ -993,7 +1011,7 @@ message ColumnEncoding {
}
```

# Column Encodings
# <a id="column-encoding-section">Column Encodings</a>

## SmallInt, Int, and BigInt Columns

@@ -1010,6 +1028,8 @@ DIRECT | PRESENT | Yes | Boolean RLE
DIRECT_V2 | PRESENT | Yes | Boolean RLE
| DATA | No | Signed Integer RLE v2

> Note that the order of the streams is not fixed. This also applies to other column types.

## Float and Double Columns

Floating point types are stored using IEEE 754 floating point bit
@@ -1241,6 +1261,11 @@ Because dictionaries are accessed randomly, there is not a position to
record for the dictionary and the entire dictionary must be read even
if only part of a stripe is being read.

Note that for columns with multiple streams, the order of stream
positions in the RowIndex is **fixed**: it follows the order given in the
[Column Encodings](#column-encoding-section) section above, which may
differ from the actual data stream placement.

## Bloom Filter Index

Bloom Filters are added to ORC indexes from Hive 1.2.0 onwards.
27 changes: 26 additions & 1 deletion site/specification/ORCv2.md
@@ -914,6 +914,24 @@ The layout of each stripe looks like:
* encryption variant 1..N
* stripe footer

There is a general order for index and data streams:
* Index streams are always placed together at the beginning of the stripe.
* Data streams are placed together after index streams (if any).
* Inside index streams or data streams, the unencrypted streams should be
placed first, followed by streams grouped by each encryption variant.

Within the unencrypted group and within each encryption variant, the index
and data streams have no fixed order:
* Different stream kinds of the same column can be placed in any order.
* Streams from different columns can also be placed in any order.

The streams field in the StripeFooter described below is the single source of
truth for the precise information (i.e., stream kind, column id, and location)
of a stream within a stripe.

In the example of the integer column mentioned above, the order of the
PRESENT stream and the DATA stream cannot be determined in advance.
A reader must obtain it from the **StripeFooter**.

## Stripe Footer

The stripe footer contains the encoding of each column and the
@@ -1012,7 +1030,7 @@ message ColumnEncoding {
}
```

# Column Encodings
# <a id="column-encoding-section">Column Encodings</a>

## SmallInt, Int, and BigInt Columns

@@ -1029,6 +1047,8 @@ DIRECT | PRESENT | Yes | Boolean RLE
DIRECT_V2 | PRESENT | Yes | Boolean RLE
| DATA | No | Signed Integer RLE v2

> Note that the order of the streams is not fixed. This also applies to other column types.

## Float and Double Columns

Floating point types are stored using IEEE 754 floating point bit
@@ -1257,6 +1277,11 @@ Because dictionaries are accessed randomly, there is not a position to
record for the dictionary and the entire dictionary must be read even
if only part of a stripe is being read.

Note that for columns with multiple streams, the order of stream
positions in the RowIndex is **fixed**: it follows the order given in the
[Column Encodings](#column-encoding-section) section above, which may
differ from the actual data stream placement.

## Bloom Filter Index

Bloom Filters are added to ORC indexes from Hive 1.2.0 onwards.
