Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ORC-1409: [Docs] Add stream order description in ORC spec. #1465

Closed
wants to merge 13 commits into from
Closed

ORC-1409: [Docs] Add stream order description in ORC spec. #1465

wants to merge 13 commits into from

Conversation

deshanxiao
Copy link
Contributor

@deshanxiao deshanxiao commented Apr 11, 2023

What changes were proposed in this pull request?

This PR is aimed to add more description about stream order in ORC spec.

Why are the changes needed?

There are many users who are misled by the order of the document table, in fact the stream has no fixed order.

#1450

How was this patch tested?

@github-actions github-actions bot added the DOCS label Apr 11, 2023
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please file a JIRA, @deshanxiao in Apache ORC community.
Every ASF project has its own policy and this is the way in the Apache ORC community.

@deshanxiao deshanxiao changed the title [Docs] Add stream order description in ORC spec. ORC-1409 [Docs] Add stream order description in ORC spec. Apr 13, 2023
@deshanxiao deshanxiao changed the title ORC-1409 [Docs] Add stream order description in ORC spec. ORC-1409: [Docs] Add stream order description in ORC spec. Apr 13, 2023
Note that the order of these streams is **not fixed**. In the example
of the integer column mentioned above, the order of the PRESENT stream
and the DATA stream cannot be determined in advance. Instead, we need
to determine the type of stream based on the Stream Kind, rather than
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we mention the following:

  • index and data streams cannot be interleaved.
  • the order of streams from different columns is not fixed as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @wgtmac . Updated.

@deshanxiao
Copy link
Contributor Author

cc @dongjoon-hyun

@deshanxiao deshanxiao changed the title ORC-1409: [Docs] Add stream order description in ORC spec. ORC-1409: [Docs][Draft] Add stream order description in ORC spec. Apr 26, 2023
@deshanxiao
Copy link
Contributor Author

Mark the PR as draft because I may add more details about the stream order.

@deshanxiao deshanxiao changed the title ORC-1409: [Docs][Draft] Add stream order description in ORC spec. ORC-1409: [Docs] Add stream order description in ORC spec. May 10, 2023
@deshanxiao
Copy link
Contributor Author

Hi @dongjoon-hyun @wgtmac

The PR is ready for review. Please take a look at it. Thank you~

Comment on lines 891 to 893
* For a specific column type, the order of streams is **not fixed**.
* Index and data streams cannot be **interleaved**.
* The order of streams from different columns is **not fixed** as well.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* For a specific column type, the order of streams is **not fixed**.
* Index and data streams cannot be **interleaved**.
* The order of streams from different columns is **not fixed** as well.
There is a general order for index and data streams:
* Index streams are always placed together in the beginning of the stripe.
* Data streams are placed together after index streams (if any).
* Inside index streams or data streams, the unencrypted streams should be placed first and then followed by streams grouped by each encryption variant.
There is no fixed order within each unencrypted or encryption variant in the index and data streams:
* Different stream kinds of the same column can be placed in any order.
* Streams from different columns can even be placed in any order.
To get the precise information (a.k.a stream kind, column id and location) of a stream within a stripe, the streams field in the StripeFooter described below is the single source of truth.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These statements look a little bit vague to me, I have changed them to provide fixed order and flexible order. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I would prefer to move them to below line 907.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @wgtmac . It has been updated.

@@ -1257,6 +1277,19 @@ Because dictionaries are accessed randomly, there is not a position to
record for the dictionary and the entire dictionary must be read even
if only part of a stripe is being read.

Note that for columns with multiple streams, the order of these
streams is **fixed**, which is different from data stream we mentioned above.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems the doc has a line limit that we need to follow.

Comment on lines 1280 to 1281
Note that for columns with multiple streams, the order of these
streams is **fixed**, which is different from data stream we mentioned above.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Note that for columns with multiple streams, the order of these
streams is **fixed**, which is different from data stream we mentioned above.
Note that for columns with multiple streams, the order of stream
positions in the RowIndex is **fixed**, which may be different to
the actual data stream placement.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also mention that the order is the same as what Column Encodings section has described for each column type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks resonable and will add more description/remove the chart.

Comment on lines 1283 to 1290
The following table shows the order of types that contain multiple streams:

Type | The First Postion Stream Type | The Second Postion Stream Type | Contents |
:----------|:--------------------------------|:-------------------------------|:----------------|
| Binary | DATA | LENGTH | |
| String | DATA | LENGTH | Direct encoding |
| Decimal | DATA | SECONDARY | |
| Timestamp | DATA | SECONDARY | |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This table is duplicated to what Column Encodings has described. And here it doesn't include PRESENT stream. I'd prefer to remove this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good on my side. Thanks @deshanxiao!

@dongjoon-hyun @williamhyun Do you want to check another pass?

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM, too. Thank you, @deshanxiao and @wgtmac .

@dongjoon-hyun dongjoon-hyun added this to the 2.0.0 milestone May 16, 2023
@dongjoon-hyun
Copy link
Member

I also published to Apache ORC website.

@wgtmac
Copy link
Member

wgtmac commented May 17, 2023

Thank you @dongjoon-hyun !

cxzl25 pushed a commit to cxzl25/orc that referenced this pull request Jan 11, 2024
### What changes were proposed in this pull request?
This PR is aimed to add more description about stream order in ORC spec.

### Why are the changes needed?
There are many users who are misled by the order of the document table, in fact the stream has no fixed order.

apache#1450

### How was this patch tested?

Closes apache#1465 from deshanxiao/add-order-description.

Authored-by: deshanxiao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants