-
Notifications
You must be signed in to change notification settings - Fork 486
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ORC-1409: [Docs] Add stream order description in ORC spec. #1465
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please file a JIRA, @deshanxiao in Apache ORC community.
Every ASF project has its own policy and this is the way in the Apache ORC community.
site/specification/ORCv1.md
Outdated
Note that the order of these streams is **not fixed**. In the example | ||
of the integer column mentioned above, the order of the PRESENT stream | ||
and the DATA stream cannot be determined in advance. Instead, we need | ||
to determine the type of stream based on the Stream Kind, rather than |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we mention the following:
- index and data streams cannot be interleaved.
- the order of streams from different columns is not fixed as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @wgtmac . Updated.
Mark the PR as draft because I may add more details about the stream order. |
The PR is ready for review. Please take a look at it. Thank you~ |
site/specification/ORCv1.md
Outdated
* For a specific column type, the order of streams is **not fixed**. | ||
* Index and data streams cannot be **interleaved**. | ||
* The order of streams from different columns is **not fixed** as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* For a specific column type, the order of streams is **not fixed**. | |
* Index and data streams cannot be **interleaved**. | |
* The order of streams from different columns is **not fixed** as well. | |
There is a general order for index and data streams: | |
* Index streams are always placed together in the beginning of the stripe. | |
* Data streams are placed together after index streams (if any). | |
* Inside index streams or data streams, the unencrypted streams should be placed first and then followed by streams grouped by each encryption variant. | |
There is no fixed order within each unencrypted or encryption variant in the index and data streams: | |
* Different stream kinds of the same column can be placed in any order. | |
* Streams from different columns can even be placed in any order. | |
To get the precise information (a.k.a stream kind, column id and location) of a stream within a stripe, the streams field in the StripeFooter described below is the single source of truth. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These statements look a little bit vague to me, I have changed them to provide fixed order and flexible order. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I would prefer to move them to below line 907.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @wgtmac . It has been updated.
site/specification/ORCv2.md
Outdated
@@ -1257,6 +1277,19 @@ Because dictionaries are accessed randomly, there is not a position to | |||
record for the dictionary and the entire dictionary must be read even | |||
if only part of a stripe is being read. | |||
|
|||
Note that for columns with multiple streams, the order of these | |||
streams is **fixed**, which is different from data stream we mentioned above. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems the doc has a line limit that we need to follow.
site/specification/ORCv2.md
Outdated
Note that for columns with multiple streams, the order of these | ||
streams is **fixed**, which is different from data stream we mentioned above. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that for columns with multiple streams, the order of these | |
streams is **fixed**, which is different from data stream we mentioned above. | |
Note that for columns with multiple streams, the order of stream | |
positions in the RowIndex is **fixed**, which may be different to | |
the actual data stream placement. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also mention that the order is the same as what Column Encodings
section has described for each column type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks resonable and will add more description/remove the chart.
site/specification/ORCv2.md
Outdated
The following table shows the order of types that contain multiple streams: | ||
|
||
Type | The First Postion Stream Type | The Second Postion Stream Type | Contents | | ||
:----------|:--------------------------------|:-------------------------------|:----------------| | ||
| Binary | DATA | LENGTH | | | ||
| String | DATA | LENGTH | Direct encoding | | ||
| Decimal | DATA | SECONDARY | | | ||
| Timestamp | DATA | SECONDARY | | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This table is duplicated to what Column Encodings
has described. And here it doesn't include PRESENT stream. I'd prefer to remove this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good on my side. Thanks @deshanxiao!
@dongjoon-hyun @williamhyun Do you want to check another pass?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM, too. Thank you, @deshanxiao and @wgtmac .
I also published to Apache ORC website. |
Thank you @dongjoon-hyun ! |
### What changes were proposed in this pull request? This PR is aimed to add more description about stream order in ORC spec. ### Why are the changes needed? There are many users who are misled by the order of the document table, in fact the stream has no fixed order. apache#1450 ### How was this patch tested? Closes apache#1465 from deshanxiao/add-order-description. Authored-by: deshanxiao <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
This PR is aimed to add more description about stream order in ORC spec.
Why are the changes needed?
There are many users who are misled by the order of the document table, in fact the stream has no fixed order.
#1450
How was this patch tested?