-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle sort order with nested columns on iceberg table #22099
base: master
Are you sure you want to change the base?
Conversation
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
522d9d1
to
f05098a
Compare
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
Just for notice; I've already sent the cla. Maybe there is some time to sync. |
f05098a
to
feefe77
Compare
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mattheusv can we have test coverage for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@krvikash we already have a test case BaseIcebergConnectorTest.testSortingOnNestedField:1413
that expects an exception (it's failing right now), would make test to just change it to expect a success instead of an error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, yes.
This should be the purpose of this PR right? Allowing to define sort on nested fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the test. Can you folks please take a look? @findinpath @krvikash
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/SortFieldUtils.java
Outdated
Show resolved
Hide resolved
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
4193c7a
to
e6025eb
Compare
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
Set<Integer> baseColumnFieldIds = schema.columns().stream() | ||
.map(Types.NestedField::fieldId) | ||
.collect(toImmutableSet()); | ||
Set<Integer> baseColumnFieldIds = collectColumnFieldIds(schema); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we could use Iceberg library to get the field ids of all the fields org.apache.iceberg.types.TypeUtil.indexNameById(schema.asStruct())
?
assertUpdate("CREATE TABLE " + tableName + " (nationkey BIGINT, row_t ROW(name VARCHAR, regionkey BIGINT, comment VARCHAR)) " + | ||
"WITH (sorted_by = ARRAY['\"row_t.comment\"'])"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add test case where we verify the table files is sorted when using nested field. something like testSortedNationTable
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@krvikash I'm trying to add a test case for this and I having some problems/questions.
The test case that I've created is the following:
try (TestTable table = new TestTable(
getQueryRunner()::execute,
"test_sorted_table_using_nested_fields",
" (id INT, row_t ROW(name VARCHAR)) WITH (format = '" + format.name() + "', sorted_by = ARRAY[ '\"row_t.name\"' ])")) {
assertUpdate(
withSmallRowGroups,
"INSERT INTO " + table.getName() + "(id, row_t)" +
"SELECT id, ROW(CONCAT('v', CAST(id as VARCHAR))) as row_t FROM UNNEST(sequence(1, 500)) AS t(id)",
500);
for (Object filePath : computeActual("SELECT file_path from \"" + table.getName() + "$files\"").getOnlyColumnAsSet()) {
assertThat(isFileSorted(Location.of((String) filePath), "name")).isTrue();
}
assertQuery("SELECT * FROM " + table.getName(), "SELECT * FROM " + table.getName() + " ORDER BY id");
The method isFileSorted
fails because it don' support nested fields. The IcebergTestUtils.checkParquetFileSorting
method fails here because the column.getPath().iterator()
returns two elements when the column object is nested, in this case it returns <row_t, name>
which raise the following exception:
java.lang.IllegalArgumentException: expected one element but was: <row_t, name>
at com.google.common.collect.Iterators.getOnlyElement(Iterators.java:322)
at io.trino.plugin.iceberg.IcebergTestUtils.lambda$3(IcebergTestUtils.java:144)
at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:193)
at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1709)
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:556)
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:546)
at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:265)
at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:702)
at io.trino.plugin.iceberg.IcebergTestUtils.checkParquetFileSorting(IcebergTestUtils.java:145)
at io.trino.plugin.iceberg.catalog.jdbc.TestIcebergJdbcCatalogConnectorSmokeTest.isFileSorted(TestIcebergJdbcCatalogConnectorSmokeTest.java:188)
at io.trino.plugin.iceberg.BaseIcebergConnectorSmokeTest.testSortedTableUsingNestedField(BaseIcebergConnectorSmokeTest.java:551)
at java.base/java.lang.reflect.Method.invoke(Method.java:580)
at java.base/java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:194)
at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:507)
at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1491)
at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:2073)
at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:2035)
at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:187)
The assert assertQuery("SELECT * FROM " + table.getName(), "SELECT * FROM " + table.getName() + " ORDER BY id");
also fails because the expected sql returns returns an error that the table don't exists
java.lang.AssertionError: Execution of 'expected' query failed: SELECT * FROM test_sorted_table_using_nested_fields15ugt0uecl ORDER BY id
at io.trino.testing.QueryAssertions.assertDistributedQuery(QueryAssertions.java:322)
at io.trino.testing.QueryAssertions.assertQuery(QueryAssertions.java:187)
at io.trino.testing.QueryAssertions.assertQuery(QueryAssertions.java:160)
at io.trino.testing.AbstractTestQueryFramework.assertQuery(AbstractTestQueryFramework.java:350)
at io.trino.plugin.iceberg.BaseIcebergConnectorSmokeTest.testSortedTableUsingNestedField(BaseIcebergConnectorSmokeTest.java:553)
at java.base/java.lang.reflect.Method.invoke(Method.java:580)
at java.base/java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:194)
at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:507)
at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1491)
at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:2073)
at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:2035)
at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:187)
Caused by: org.jdbi.v3.core.statement.UnableToCreateStatementException: org.h2.jdbc.JdbcSQLSyntaxErrorException: Table "TEST_SORTED_TABLE_USING_NESTED_FIELDS15UGT0UECL" not found; SQL statement:
Seems that the expected and the actual query is executed in different query runners; The expected is executed on h2QueryRunner
and the actual is executed on queryRunner
from AbstractTestQueryFramework
class.
My question is how can I test this properly? Should I update the checkParquetFileSorting
method to support nested fields or there is a better way to write this test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Set<Integer> ids = new HashSet<Integer>(); | ||
schema.columns().forEach(column -> addNestedField(ids, column)); | ||
return ids; | ||
} | ||
|
||
private static void addNestedField(Set<Integer> baseColumnFieldIds, NestedField field) | ||
{ | ||
baseColumnFieldIds.add(field.fieldId()); | ||
|
||
Type type = field.type(); | ||
if (type.isNestedType()) { | ||
NestedType nestedType = type.asNestedType(); | ||
for (NestedField nestedField : nestedType.fields()) { | ||
addNestedField(baseColumnFieldIds, nestedField); | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can do this with com.google.common.graph.Traverser
, without need for helper method.
This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua |
@cla-bot check |
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
The cla-bot has been summoned, and re-checked this pull request! |
e6025eb
to
4385b97
Compare
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
4385b97
to
af13ddf
Compare
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
af13ddf
to
4602fca
Compare
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
4602fca
to
0528c46
Compare
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
0528c46
to
6cdab53
Compare
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
6cdab53
to
52afa49
Compare
5892ee6
to
bf763d2
Compare
@mattheusv This is in my list still, I am just catching up on things after vacation. Don't hesitate with any questions. |
Thanks @bitsondatadev I'll just give some context of some changes: As I mentioned here, the IcebergTestUtils#checkParquetFileSorting method was not handling nested columns, so I've changed the filter to find the columns to consider nested columns, so we can call the isFileSorted method like The problem now seems that the file is not being sorted, because the Another problem that I've notice is that if I execute trino> CREATE TABLE IF NOT EXISTS iceberg.test.t2 (
-> id INT,
-> row_t ROW(name VARCHAR)
-> ) WITH (
-> format = 'PARQUET',
-> sorted_by = ARRAY ['"row_t.name"']
-> );
CREATE TABLE
trino> insert into iceberg.test.t2(id, row_t) SELECT id, ROW(CONCAT('v', cast(id as varchar))) as row_t FROM UNNEST(sequence(1, 30)) AS t(id);
INSERT: 30 rows
trino> alter table iceberg.test.t2 execute optimize;
Query 20240701_125829_00002_62u7s, FAILED, 1 node
Splits: 12 total, 2 done (16,67%)
1,04 [21 rows, 856B] [20 rows/s, 826B/s]
Query 20240701_125829_00002_62u7s failed: Index -1 out of bounds for length 2 Server logs
It seems that we need to make some other changes to make the sorted_by using nested columns works properly, and I don't know if we should change all the related code on this PR or create multiple PRs, WYT? |
bf763d2
to
7367483
Compare
I've tried to get a little deep on this error and the problem is how Trino search for columns on a table schema when performing the sorting. The class So when I don't have many ideas on how this can be fixed, since several classes used in this process use the |
Previously the `parseSortFields` was only collecting the field id from the top level columns don't considering nested fields of nested types, so in case a query with a `sorted_by` property use a nested field of a nested type trino would throw an expcetion that the column don't exists, because the field id of the nested column don't exists on `baseColumnFieldIds` set. This commit fix this issue by recursively collecting the field ids from table columns which the column type is a nested type. Fix: trinodb#19620
7367483
to
8ad3745
Compare
This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua |
Description
Previously the
parseSortFields
fromSortFieldUtils
was only collecting the field id from the top level columns don't considering nested fields of nested types, so in case a query with asorted_by
property use a nested field of a nested type trino would throw an expcetion that the column don't exists, because the field id of the nested column don't exists onbaseColumnFieldIds
set.This PR fix this issue by recursively collecting the field ids from table columns which the column type is a nested type.
Fix: #19620
Additional context and related issues
This is my first time contributing to trino code base, so I'm not 100% sure that this is correct, so please let me know if anything is wrong.
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: