Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Parquet] Ignore corrupted or invalid statistics #34139

Open
wgtmac opened this issue Feb 11, 2023 · 5 comments
Open

[C++][Parquet] Ignore corrupted or invalid statistics #34139

wgtmac opened this issue Feb 11, 2023 · 5 comments

Comments

@wgtmac
Copy link
Member

wgtmac commented Feb 11, 2023

Describe the bug, including details regarding any error messages, version, and platform.

#34112 fixes reading from stats.min_value and stats.max_value where applicable. However, these are cases where statistics are invalid or corrupted. Please check parquet-mr for reference: https://github.com/apache/parquet-mr/blob/5290bd5e0ee5dc30db0576e2bfc6eea335c465cf/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L797

    if (formatStats != null) {
      // Use the new V2 min-max statistics over the former one if it is filled
      if (formatStats.isSetMin_value() && formatStats.isSetMax_value()) {
        byte[] min = formatStats.min_value.array();
        byte[] max = formatStats.max_value.array();
        if (isMinMaxStatsSupported(type) || Arrays.equals(min, max)) {
          statsBuilder.withMin(min);
          statsBuilder.withMax(max);
        }
      } else {
        boolean isSet = formatStats.isSetMax() && formatStats.isSetMin();
        boolean maxEqualsMin = isSet ? Arrays.equals(formatStats.getMin(), formatStats.getMax()) : false;
        boolean sortOrdersMatch = SortOrder.SIGNED == typeSortOrder;
        // NOTE: See docs in CorruptStatistics for explanation of why this check is needed
        // The sort order is checked to avoid returning min/max stats that are not
        // valid with the type's sort order. In previous releases, all stats were
        // aggregated using a signed byte-wise ordering, which isn't valid for all the
        // types (e.g. strings, decimals etc.).
        if (!CorruptStatistics.shouldIgnoreStatistics(createdBy, type.getPrimitiveTypeName()) &&
            (sortOrdersMatch || maxEqualsMin)) {
          if (isSet) {
            statsBuilder.withMin(formatStats.min.array());
            statsBuilder.withMax(formatStats.max.array());
          }
        }
      }

Component(s)

C++, Parquet

@mapleFU
Copy link
Member

mapleFU commented Jul 25, 2023

  1. Does this problem solved by GH-34138: [C++][Parquet] Fix parsing stats from min_value/max_value #34112
  2. As mentioned in https://github.com/apache/arrow/pull/36814/files , can we ignore min-max when sort-order is not signed ?

@wgtmac

@mapleFU
Copy link
Member

mapleFU commented Jul 25, 2023

take

@mapleFU
Copy link
Member

mapleFU commented Aug 11, 2023

@pitrou @wgtmac Can we close this issue now?

@pitrou
Copy link
Member

pitrou commented Aug 14, 2023

Yes, sorry!

@pitrou
Copy link
Member

pitrou commented Aug 14, 2023

Closing as already fixed (see #36866 for discussion).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants