Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Metrics mode when creating/writing Iceberg tables #9938

Merged
merged 1 commit into from
Jan 19, 2022

Conversation

liqinrae
Copy link
Contributor

Added support for all metrics modes in orc format as mentioned in #9791

  1. Default metrics mode is changed to truncate(16)
  2. Allow for configuring default mode to NONE so that we can skip all metrics by default except the ones we want
  3. Allows per column metrics mode configuration to support different modes: counts, none, truncate(length) and full

@cla-bot
Copy link

cla-bot bot commented Nov 11, 2021

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: q-li.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@cla-bot
Copy link

cla-bot bot commented Nov 11, 2021

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to [email protected]. For more information, see https://github.com/trinodb/cla.

2 similar comments
@cla-bot
Copy link

cla-bot bot commented Nov 11, 2021

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to [email protected]. For more information, see https://github.com/trinodb/cla.

@cla-bot
Copy link

cla-bot bot commented Nov 11, 2021

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to [email protected]. For more information, see https://github.com/trinodb/cla.

Comment on lines 137 to 141
if (metricsMode != MetricsModes.Counts.get()) {
toIcebergMinMax(orcColumnStats, icebergField.type(), metricsMode).ifPresent(minMax -> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we not handle metrics mode truncate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdsr yes we do, we handle truncate in the IcebergMinMax

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see we added support for this below

@@ -431,7 +432,8 @@ public ConnectorOutputTableHandle beginCreateTable(ConnectorSession session, Con
getColumns(transaction.table().schema(), typeManager),
transaction.table().location(),
getFileFormat(transaction.table()),
transaction.table().properties());
transaction.table().properties(),
Collections.emptyMap());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the whole table property map is already passed in as a part of the writable table handle, why do we need to construct the metrics config as a separated field?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jackye1995 yes you are right. There is no need to construct the metrics config since the table properties have already been passed to the writable table handle. I checked the code and we already have functions in place to retrieve metrics related properties so we can make this even simpler. Will add a fix to this, thanks!

@@ -128,6 +131,7 @@ public IcebergPageSink(
this.jsonCodec = requireNonNull(jsonCodec, "jsonCodec is null");
this.session = requireNonNull(session, "session is null");
this.fileFormat = requireNonNull(fileFormat, "fileFormat is null");
this.metricsConfig = MetricsConfig.fromProperties(requireNonNull(storageProperties, "metricsConfig is null"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error message does not match.

Btw, I think it makes more sense to pass in the MetricsConfig to the constructor instead of storageProperties, because file format is also derived from storageProperties and we are passing it in separately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jackye1995 Sorry you meant passing in the MetricsConfig to the pagesink or IcebergWritableTableHandle constructor?

The file format is derived from the property map and then passed to the writable table handle as a separate component, which is how we passed inMetricsConfig initially.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to the page sink. But I don't have a strong opinion on this, there are many duplicated information passed across the codebase as of today. We can just fix the error message first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liqinrae can you please check the Ci build results?

@findepi Yes seems like there is new changes from other PRs. Will merge them and fix the CI build.

@findepi
Copy link
Member

findepi commented Nov 30, 2021

@liqinrae can you please check the Ci build results?

@liqinrae liqinrae requested a review from jackye1995 November 30, 2021 21:10
Copy link
Member

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me

Copy link
Member

@findepi findepi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please squash your commits.

When doing so, try not to change the base commit:

git rebase -i $(git merge-base head master)

upperBoundsBuilder.put(icebergId, minMax.getMax());
});

if (metricsMode != MetricsModes.Counts.get()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MetricsMode is an interface, open for extensions. The fact that something isn't Counts doesn't mean it's Full.

Also, there is no javadoc indicating .get() returns same instance every time, so using .equals() would be more appropriate here.

this should be handled in a defensive manner:

if (metricsMode.equals(MetricsModes.None.get())) {
 // nothing to do
}
else if (metricsMode.equals(MetricsModes.Counts.get()) {
 .. collect counts 
}
else if (metricsMode.equals( full )) {
  ...
}
else {
    throw new UnsupportedOperationException("Unsupported metrics mode: " + metricsMode);
}

Copy link
Contributor Author

@liqinrae liqinrae Dec 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@findepi The conditional statement here handles the case where the metricsMode is truncate(length) or full. Both of these two cases need to store min/max except that truncate need to apply truncation if needed.

All of the metrics modes other than None need to collect counts. There would be duplication of code if we are to handle each mode separately. Would adding a verify function here more feasible if we want to handle it more defensively?

this.max = Conversions.toByteBuffer(type, max);
// Out of the two types which could be truncated, string or binary, ORC only supports string bounds.
// Therefore, truncation will be applied if needed only on string type.
if (metricsMode instanceof MetricsModes.Truncate && type.typeId() == org.apache.iceberg.types.Type.TypeID.STRING) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

binary (varbinary) and fixed too

https://github.com/apache/iceberg/blob/2e4684740b56f6d80328f50b7e7fef0863c262e7/parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java#L269

(we don't support creating tables with fixed, but i think we do support writes to such tables)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@findepi I was referring to the OrcMetrics implementation here: https://github.com/apache/iceberg/blob/ddc5aff9167c7e367e2b4313b91e55451f09ef16/orc/src/main/java/org/apache/iceberg/orc/OrcMetrics.java#L281
Seems like it only added truncation for string bound in the orc format.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because they don't produce any stats for binary for ORC, see


right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still, not having an if for varbinary doesn't seem future proof, does it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@findepi yes I agree. I'll add varbinary here.

Comment on lines 290 to 291
this.min = Conversions.toByteBuffer(type, min);
this.max = Conversions.toByteBuffer(type, max);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if MetricsMode is != Full (eg Counts, None), no min/max should be stored.

again, this should be handled in defensive manner, instead of if-ing on a Full class

@@ -129,7 +249,7 @@ public void testBasic()
assertQuery("SELECT min(orderpriority) FROM tpch.tiny.orders", "VALUES '" + lowerBounds.get(6) + "'");
assertQuery("SELECT min(clerk) FROM tpch.tiny.orders", "VALUES '" + lowerBounds.get(7) + "'");
assertQuery("SELECT min(shippriority) FROM tpch.tiny.orders", "VALUES " + lowerBounds.get(8));
assertQuery("SELECT min(comment) FROM tpch.tiny.orders", "VALUES '" + lowerBounds.get(9) + "'");
assertQuery("SELECT substring(min(comment), 1, 16) FROM tpch.tiny.orders", "VALUES '" + lowerBounds.get(9) + "'");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lowerBounds.get(9) is the actual, while SELECT substring(min(comment), 1, 16) FROM tpch.tiny.orders is the expected. The expected/actual are reversed here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@findepi I tried reversing the order for all the assertQuery in the test function but encountered failures when using SELECT substring(min(comment), 1, 16) FROM tpch.tiny.orders as the expected sql statement. Checked other test files and looks like they placed all select statement as the expected and result as the actual.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we shouldn't use assertQuery at all, when we want to assert a value that we have at hand.

this is preexisting, so ok to followup

@liqinrae
Copy link
Contributor Author

liqinrae commented Dec 9, 2021

Hi @findepi @jackye1995 I addressed the comments and left a few replies on this PR. Please let me know your thoughts. Thanks!

@@ -129,7 +249,7 @@ public void testBasic()
assertQuery("SELECT min(orderpriority) FROM tpch.tiny.orders", "VALUES '" + lowerBounds.get(6) + "'");
assertQuery("SELECT min(clerk) FROM tpch.tiny.orders", "VALUES '" + lowerBounds.get(7) + "'");
assertQuery("SELECT min(shippriority) FROM tpch.tiny.orders", "VALUES " + lowerBounds.get(8));
assertQuery("SELECT min(comment) FROM tpch.tiny.orders", "VALUES '" + lowerBounds.get(9) + "'");
assertQuery("SELECT substring(min(comment), 1, 16) FROM tpch.tiny.orders", "VALUES '" + lowerBounds.get(9) + "'");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we shouldn't use assertQuery at all, when we want to assert a value that we have at hand.

this is preexisting, so ok to followup

@@ -141,7 +261,8 @@ public void testBasic()
assertQuery("SELECT max(orderpriority) FROM tpch.tiny.orders", "VALUES '" + upperBounds.get(6) + "'");
assertQuery("SELECT max(clerk) FROM tpch.tiny.orders", "VALUES '" + upperBounds.get(7) + "'");
assertQuery("SELECT max(shippriority) FROM tpch.tiny.orders", "VALUES " + upperBounds.get(8));
assertQuery("SELECT max(comment) FROM tpch.tiny.orders", "VALUES '" + upperBounds.get(9) + "'");
// the default truncate(16) mode would round up the last character of the maximum bound while keeping the first 15 characters identical
assertQuery("SELECT substring(max(comment), 1, 15) FROM tpch.tiny.orders", "VALUES '" + upperBounds.get(9).substring(0, 15) + "'");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert on upperBounds.get(9) without taking substring

on the left side, use substring(max(comment), 1, 15) || '...' to construct the expected value

@findepi
Copy link
Member

findepi commented Jan 18, 2022

@liqinrae have you had time to update this for my comments?

@liqinrae
Copy link
Contributor Author

@liqinrae have you had time to update this for my comments?

@findepi Hey thanks for reviewing my changes. Updated the PR for your comment and also replaced assertQuery in the preexisting test class with the actual min/max value. Please review and let me know your thoughts. Thanks.

@liqinrae liqinrae requested a review from findepi January 18, 2022 22:46
@liqinrae liqinrae force-pushed the metrics_mode branch 2 times, most recently from 7f52b71 to f319189 Compare January 19, 2022 06:34
@findepi findepi merged commit 835438a into trinodb:master Jan 19, 2022
@findepi findepi mentioned this pull request Jan 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

5 participants