Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decouple OrcPageSource from HiveColumnHandle #1557

Merged
merged 1 commit into from
Sep 23, 2019

Conversation

lxynov
Copy link
Member

@lxynov lxynov commented Sep 19, 2019

This is a counterpart of #1552 , also part of the efforts to remove HiveColumnHandle from Iceberg Connector (#1324 ). (We don't use OrcPageSource in Iceberg Connector now but will do in the future)

@cla-bot cla-bot bot added the cla-signed label Sep 19, 2019
Copy link
Member

@dain dain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Only one style comment.

@lxynov lxynov force-pushed the orc-hive-column-handle branch from 797eb74 to f37659e Compare September 19, 2019 06:18
@lxynov
Copy link
Member Author

lxynov commented Sep 19, 2019

@dain Thanks. Updated.

ImmutableList.Builder<Type> typesBuilder = ImmutableList.builder();
for (int columnIndex = 0; columnIndex < columns.size(); columnIndex++) {
HiveColumnHandle column = columns.get(columnIndex);
checkState(column.getColumnType() == REGULAR, "column type must be regular");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where did this logic go?
Am I correct this is handled by HivePageSourceProvider?

If so, I'd name this commit differently. Something signalling you're simply reusing information already collected (removing unnecessary code). Or, you may include this information as part of commit message body.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The useful logic in this block is the checking done by recordReader.isColumnPresent. I moved it to getNextPage.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Block[] blocks = new Block[hiveColumnIndexes.length];
for (int fieldId = 0; fieldId < blocks.length; fieldId++) {
if (constantBlocks[fieldId] != null) {
blocks[fieldId] = constantBlocks[fieldId].getRegion(0, batchSize);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're no longer reusing constantBlocks. I think this is OK. @dain

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For both this and the streaming, I'm not super concerned as this code is only executed once per page. That said, we haven't measured the impact, and the new code isn't dramatically more readable, so I'd would have left is as is to keep risk down. I don't have a strong preference, so I open to either solution.

@@ -66,43 +55,15 @@
public OrcPageSource(
OrcRecordReader recordReader,
OrcDataSource orcDataSource,
List<HiveColumnHandle> columns,
TypeManager typeManager,
Map<Integer, Type> includedColumns,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the usage of Map here.

  • it's very non-obvious what map keys are
  • this is not used as a map. It's just a list of pairs (hive column index, column type) (... or triples (block no, hive column index, column type) where block no is implied from iteration order)

A List would be better.

I know this is pre-existing, but I think we should improve (might be a follow-up).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In #1552 you seem to be using List<Type> types, List<Optional<Field>> fields for similar purpose.
I'm ok with two-list approach.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the usage of Map here.

  • it's very non-obvious what map keys are
  • this is not used as a map. It's just a list of pairs (hive column index, column type) (... or triples (block no, hive column index, column type) where block no is implied from iteration order)

A List would be better.

I know this is pre-existing, but I think we should improve (might be a follow-up).

Your description is precise. It's essentially a list of triples. I think we can do the improvement as a follow-up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the API we already use in OrcRecordReader, so it makes sense to me.

That said, we are going to need some better abstractions around ORC reader columns in general to deal with features like ACID, and I think we should deal with the new API with that PR instead.

@lxynov lxynov force-pushed the orc-hive-column-handle branch from f37659e to 18402fe Compare September 19, 2019 23:35
@lxynov
Copy link
Member Author

lxynov commented Sep 19, 2019

@dain @findepi Thanks for the review. I've updated the PR.

@electrum electrum merged commit b142e8a into trinodb:master Sep 23, 2019
@lxynov lxynov deleted the orc-hive-column-handle branch September 23, 2019 22:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

4 participants