
Refine Parquet schema mismatch message #12550

Merged 1 commit into prestodb:master on Aug 30, 2019

Conversation

zhenxiao (Collaborator)
@nezihyigitbasi nezihyigitbasi left a comment


Can you please add a unit test?

return columnReader.readPrimitive(field);
}
catch (UnsupportedOperationException e) {
throw new PrestoException(INVALID_SCHEMA_PROPERTY, format("There is a mismatch between file schema and partition schema. The column %s in file %s is declared as type %s but parquet file declared column type as %s.",
Contributor

How about:

The column %s is declared as type %s, but the Parquet file %s declares the column as type %s

Collaborator Author

sure

try {
return columnReader.readPrimitive(field);
}
catch (UnsupportedOperationException e) {
Contributor

UnsupportedOperationException is a broad exception to catch here. Maybe we should throw a specific Parquet exception and catch it here. What do you think?

Collaborator Author

Sure. The UnsupportedOperationException actually comes from type.writeLong etc., when the lower-level ColumnReader tries to read Parquet values. How about throwing ParquetCorruptionException there and catching it here?
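The pattern being proposed here — throw a dedicated exception deep in the reader and translate it into a user-facing message near the page source — could be sketched roughly as follows. All class and method names below are illustrative stand-ins, not Presto's actual API:

```java
// A dedicated exception for schema/decoding problems, analogous in spirit
// to the ParquetCorruptionException discussed above.
class ParquetMismatchException extends RuntimeException {
    ParquetMismatchException(String message) {
        super(message);
    }
}

class SchemaMismatchExample {
    // Stand-in for the low-level read path (ColumnReader.readPrimitive):
    // it throws the dedicated exception instead of UnsupportedOperationException.
    static long readPrimitive(String declaredType, String fileType) {
        if (!declaredType.equals(fileType)) {
            throw new ParquetMismatchException(
                    String.format("declared type %s does not match file type %s", declaredType, fileType));
        }
        return 42L;
    }

    // Stand-in for the caller that translates the low-level failure into a
    // user-facing schema-mismatch message.
    static String readColumn(String declaredType, String fileType) {
        try {
            return "value=" + readPrimitive(declaredType, fileType);
        }
        catch (ParquetMismatchException e) {
            return "schema mismatch: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(readColumn("bigint", "bigint")); // value=42
        System.out.println(readColumn("bigint", "double")); // schema mismatch: ...
    }
}
```

The advantage of a dedicated exception over catching UnsupportedOperationException is precision: the broad exception can also come from unrelated code paths, so catching it risks mislabeling other bugs as schema mismatches.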

@zhenxiao zhenxiao left a comment

Thank you, @nezihyigitbasi.
I will get the comments addressed.


@zhenxiao (Collaborator Author)

@nezihyigitbasi I got the comments addressed.
Could you please review?


zhenxiao commented Jul 1, 2019

kindly ping @nezihyigitbasi

@nezihyigitbasi (Contributor)

@arhimondr @rschlussel can you please take a look?

@arhimondr arhimondr self-assigned this Aug 8, 2019

arhimondr commented Aug 10, 2019

From what I understand, this PR tries to address the confusing error message when there is a schema mismatch between a file and the partition schema.

@zhenxiao Could you please elaborate a little bit more why it was decided to implement it this way? (by catching the exception in the very last moment).

For example the schema validation can be done early, when getting the Parquet Type:
https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetPageSourceFactory.java#L163
https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetPageSourceFactory.java#L255
The HiveColumnHandle contains the Hive Schema type, so it is possible to check if types match.

The problem is that the mere fact that a method in Type is implemented (e.g. Type#writeSlice) does not necessarily mean the types are the same or compatible. For example, both Decimal and Varchar use Slice as a value, and Integer and Float both use long (writeLong).

@nezihyigitbasi Do you know how something similar is done in ORC?
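A tiny self-contained illustration of the point above — two logically different types sharing one physical representation, so the mere presence of a write method proves nothing about compatibility. The MiniType interface here is a made-up stand-in for Presto's Type, not the real SPI:

```java
class PhysicalRepresentationExample {
    interface MiniType {
        String name();
        // Both an integer-like and a float-like type accept a long here, so
        // being able to call the long-based writer proves nothing.
        String render(long physicalValue);
    }

    static final MiniType INTEGER = new MiniType() {
        public String name() { return "integer"; }
        public String render(long v) { return Long.toString(v); }
    };

    static final MiniType REAL = new MiniType() {
        public String name() { return "real"; }
        // Floats are carried as their raw IEEE-754 bits inside a long.
        public String render(long v) { return Float.toString(Float.intBitsToFloat((int) v)); }
    };

    public static void main(String[] args) {
        long bits = Float.floatToIntBits(1.5f);
        // The same physical long means completely different logical values.
        System.out.println(INTEGER.render(bits)); // 1069547520
        System.out.println(REAL.render(bits));    // 1.5
    }
}
```

This is why an explicit schema comparison (as this PR adds) is needed rather than relying on which write methods happen to be implemented.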

@zhenxiao zhenxiao force-pushed the parquet-msg branch 3 times, most recently from 58cac73 to 36d2855 Compare August 15, 2019 14:54
@zhenxiao (Collaborator Author)

Thank you, @arhimondr.
I got the comments addressed. How about this approach?

@arhimondr arhimondr left a comment

I like the current approach. Could you please implement it for the complex types (ARRAY, MAP, ROW) as well?

private static boolean schemaMatch(org.apache.parquet.schema.Type parquetType, HiveColumnHandle column)
{
String prestoType = column.getTypeSignature().getBase();
if (prestoType.equals(MAP) || prestoType.equals(ARRAY) || prestoType.equals(ROW)) {
Member

Could you please check the signature for the complex types recursively?

Member

I would suggest passing here the resolved Type object, so the convenient Type#getTypeParameters can be used to recurse into the nested types.

Member

e.g.:

Type type = typeManager.getType(column.getTypeSignature());
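The recursive check being suggested could be sketched like this. TypeNode is a hypothetical stand-in for the resolved Presto Type (mirroring Type#getTypeParameters); the real code would compare against the Parquet GroupType on the other side:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical resolved type with nested parameters, mirroring
// Type#getTypeParameters in shape only.
class TypeNode {
    final String base;
    final List<TypeNode> parameters;

    TypeNode(String base, TypeNode... parameters) {
        this.base = base;
        this.parameters = Arrays.asList(parameters);
    }

    List<TypeNode> getTypeParameters() {
        return parameters;
    }
}

class RecursiveMatchExample {
    // Compares base names, then recurses into each nested type parameter.
    static boolean schemaMatch(TypeNode expected, TypeNode actual) {
        if (!expected.base.equals(actual.base)) {
            return false;
        }
        List<TypeNode> e = expected.getTypeParameters();
        List<TypeNode> a = actual.getTypeParameters();
        if (e.size() != a.size()) {
            return false;
        }
        for (int i = 0; i < e.size(); i++) {
            if (!schemaMatch(e.get(i), a.get(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        TypeNode mapLongLong = new TypeNode("map", new TypeNode("bigint"), new TypeNode("bigint"));
        TypeNode mapLongDouble = new TypeNode("map", new TypeNode("bigint"), new TypeNode("double"));
        System.out.println(schemaMatch(mapLongLong, mapLongLong));   // true
        System.out.println(schemaMatch(mapLongLong, mapLongDouble)); // false
    }
}
```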

}
if (parquetType.isPrimitive()) {
PrimitiveTypeName parquetTypeName = parquetType.asPrimitiveType().getPrimitiveTypeName();
return ((parquetTypeName == PrimitiveTypeName.INT32 && (prestoType.equals(INTEGER) || prestoType.equals(SMALLINT) || prestoType.equals(DATE) || prestoType.equals(DECIMAL) || prestoType.equals(TINYINT))) ||
Member

nit: switch seems to be more readable

switch (parquetTypeName) {
    case INT64:
        return prestoType.equals(BIGINT) || prestoType.equals(DECIMAL);
    case INT32:
        return prestoType.equals(INTEGER) || prestoType.equals(SMALLINT) || prestoType.equals(DATE) || prestoType.equals(DECIMAL) || prestoType.equals(TINYINT);
    case BOOLEAN:
        return prestoType.equals(BOOLEAN);
    case BINARY:
        return prestoType.equals(VARBINARY) || prestoType.equals(VARCHAR) || prestoType.startsWith(CHAR);
    case FLOAT:
        return prestoType.equals(REAL);
    case DOUBLE:
        return prestoType.equals(DOUBLE);
    case INT96:
        return prestoType.equals(TIMESTAMP);
    case FIXED_LEN_BYTE_ARRAY:
        return prestoType.equals(DECIMAL);
    default:
        throw new IllegalArgumentException("Unexpected parquet type name: " + parquetTypeName);
}

@zhenxiao (Collaborator Author)

Thank you, @arhimondr. I got the comments addressed.
I just found that the schema checking should only apply to Parquet index-based access. For Parquet name-based access, the fields inside structs could change order, or primitive columns could change order, so we should not check the schema in that case.
Could you please review when you are free?

@arhimondr arhimondr left a comment

LGTM % question

Also, I'm not quite sure how good the test coverage for Parquet is. Do you have any internal test suites? Have you tried to replay some real workload with this patch applied?

}

// name based access could support schema evolution in Parquet
if (useParquetColumnNames) {
Member

Why don't we check for this case?

Could you please elaborate a little bit more on what schema evolution means here? Is it the table -> partition schema migration (changing the columns of a table with existing partitions)?

What if there's a mismatch? Wouldn't it fail the old way when reading?

Collaborator Author

Hi @arhimondr
Yes, production queries were tested with this patch.
Schema evolution in our case:

  1. The table schema is unchanged.
  2. The Parquet file schema changes (mostly in struct fields), e.g.:
    new fields are added to a struct: s<a, b, c> becomes s<a, b, c, d>, and the partition schema is updated to s<a, b, c, d>. In this case, selecting s.d from old Parquet files returns null.
    fields are reordered: s<a, b, c> becomes s<b, c, a>, and the partition schema is updated to s<b, c, a>. In this case, we can still get the s.a, s.b, and s.c values if use-parquet-column-names is turned on.
  3. Old Parquet files are not changed.
  4. Field renames and type changes are not allowed.

Based on the above, when use-parquet-column-names is enabled, Parquet can use the field name to get the corresponding Parquet type, so there is no need to check. (It is still possible for the partition schema to declare a field as double while the Parquet schema has float; that is a schema mismatch and will fail as before.)

We need this check only for index-based access in Parquet (use-parquet-column-names set to false), so that struct field types and primitive column types can be checked by field/column order before decoding the data.
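The difference between the two access modes can be shown with a minimal sketch (the helpers below are hypothetical illustrations, not Presto's code):

```java
import java.util.Arrays;
import java.util.List;

class ColumnAccessExample {
    // Index-based access: position i of the reader schema is assumed to be
    // position i of the file schema.
    static String byIndex(List<String> fileColumns, int index) {
        return index < fileColumns.size() ? fileColumns.get(index) : null;
    }

    // Name-based access: look the column up by name, tolerating reorders.
    static String byName(List<String> fileColumns, String name) {
        return fileColumns.contains(name) ? name : null;
    }

    public static void main(String[] args) {
        // The file was written with columns (a, b, c); the partition schema
        // was later reordered to (b, c, a).
        List<String> fileColumns = Arrays.asList("a", "b", "c");
        // Index 0 in the reordered partition schema means "b", but positional
        // access reads "a" from the file -- the mismatch the check must catch.
        System.out.println(byIndex(fileColumns, 0)); // prints "a"
        // Name-based access still finds the right column after the reorder.
        System.out.println(byName(fileColumns, "b")); // prints "b"
    }
}
```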

Member

for Parquet name based access, the fields inside structs could change order

From what I understand it is mostly about structs (since there are two modes of accessing fields).

However it seems that structs are always matched in the ordinal way (by index): https://github.com/prestodb/presto/blob/master/presto-parquet/src/main/java/com/facebook/presto/parquet/reader/ParquetReader.java#L176

Does it mean that we can safely check the schema for both access types ("by name" and "by index")?

Collaborator Author

Nice catch. That is because we are running a slightly different version of Presto, with our customized subfield pruning patch, which only scans the necessary fields from a struct instead of scanning the whole struct. Part of that is in #13271.

Let me double-check with name-based access.

Collaborator Author

Hi @arhimondr, you are correct: we can turn the check on for both index-based access and name-based access.
I've made the corresponding changes.

Member

Thanks.

@nezihyigitbasi, I think it is ready to go. Do you have any additional comments?

@nezihyigitbasi nezihyigitbasi left a comment

LGTM % comments. Thanks @zhenxiao and @arhimondr!

String prestoType = type.getTypeSignature().getBase();
if (parquetType instanceof GroupType) {
GroupType groupType = parquetType.asGroupType();
if (prestoType.equals(StandardTypes.ROW)) {
Contributor

Why not use a switch here?

}

if (!schemaMatch(type, prestoType)) {
String parquetTypeName = null;
Contributor

This null assignment is redundant.

}

if (type == null) {
return null;
Contributor

I think in a separate PR it would be good to update this method to return Optional instead of nulls.

return type;
}

private static boolean schemaMatch(org.apache.parquet.schema.Type parquetType, Type type)
Contributor

nit: we may want to rename this method. Some options are checkSchemaMatch, isSchemaMatching, isSchemaCompatible, etc.

String prestoType = type.getTypeSignature().getBase();
if (parquetType instanceof GroupType) {
GroupType groupType = parquetType.asGroupType();
if (prestoType.equals(StandardTypes.ROW)) {
Contributor

You can static import StandardTypes.X here and below.

return type;
}

private static boolean schemaMatch(org.apache.parquet.schema.Type parquetType, Type type)
Contributor

Do we have tests to test this method for different backward compatibility rules defined in this doc?

Collaborator Author

Not yet, we could add tests in another PR

throws Exception
{
TestColumn floatColumn = new TestColumn("column_name", javaFloatObjectInspector, 5.1f, 5.1f);
TestColumn doubleColumn = new TestColumn("column_name", javaDoubleObjectInspector, 5.1, 5.1);
Contributor

Does this cover all the types in ParquetPageSourceFactory::schemaMatch(), e.g., the different int types and fixed-length byte arrays?

Also, this test's coverage for nested complex types is thin. It would be good to have some comprehensive testing for that case, e.g., can we automate testing of nested complex types up to a certain nesting level and have systematic testing instead of a point test like the one here with nestColumn?

Collaborator Author

nice catch, I will add comprehensive tests in another PR.

mapBlockOf(createUnboundedVarcharType(), new ArrayType(RowType.anonymous(ImmutableList.of(INTEGER))),
"test", arrayBlockOf(RowType.anonymous(ImmutableList.of(INTEGER)), rowBlockOf(ImmutableList.of(INTEGER), 1L))));

HiveErrorCode expectedErrorCode = HiveErrorCode.HIVE_PARTITION_SCHEMA_MISMATCH;
Contributor

static import HIVE_PARTITION_SCHEMA_MISMATCH.

.withSession(parquetPageSourceSession)
.isFailingForPageSource(new ParquetPageSourceFactory(TYPE_MANAGER, HDFS_ENVIRONMENT, STATS), expectedErrorCode, expectedMessageMapLongLong);

String expectedMessageMapLongMapDouble = "The column column_name is declared as type map<bigint,bigint>, but the Parquet file declares the column as type optional group column_name (MAP) {\n"
Contributor

nit: The whitespace before optional looks a bit weird.

.withSession(parquetPageSourceSession)
.isFailingForPageSource(new ParquetPageSourceFactory(TYPE_MANAGER, HDFS_ENVIRONMENT, STATS), expectedErrorCode, expectedMessageMapLongMapDouble);

String expectedMessageArrayStringArrayBoolean = "The column column_name is declared as type array<string>, but the Parquet file declares the column as type optional group column_name (LIST) {\n"
Contributor

nit: The whitespace before optional looks a bit weird. I guess that's the indentation used when doing group.writeToStringBuilder(...).

Collaborator Author

yes, let me fix it

@zhenxiao zhenxiao left a comment

Thank you, @nezihyigitbasi.
I got the comments addressed; some need an extra PR to address, as noted inline.


@arhimondr arhimondr merged commit 5d18a1c into prestodb:master Aug 30, 2019
@zhenxiao zhenxiao deleted the parquet-msg branch March 3, 2020 22:15