Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2417: Add support for geometry logical type #2971

Open
wants to merge 31 commits into
base: master
Choose a base branch
from

Conversation

zhangfengcdt
Copy link

This PR is to provide a POC to support the proposed changes to the parquet-format to add geometry type to parquet.

Here is the proposal: apache/parquet-format#240

Jira

Tests

  • My PR adds the following unit tests: TestGeometryTypeRoundTrip

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

@zhangfengcdt zhangfengcdt marked this pull request as draft July 26, 2024 15:39
@zhangfengcdt
Copy link
Author

CC: @jiayuasu @Kontinuation @wgtmac

parquet-column/pom.xml Outdated Show resolved Hide resolved
@zhangfengcdt zhangfengcdt marked this pull request as ready for review August 12, 2024 15:57
Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update! I have left some comments. I think we are reaching the finish line!

parquet-hadoop/pom.xml Outdated Show resolved Hide resolved
}

@Test
public void testEPSG4326BasicReadWriteGeometryValue() throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding these tests!

I think we are missing tests in following cases:

  • verify geometry type metadata is well preserved.
  • verify all kinds of geometry stats are preserved, including bbox, covering and geometry types.
  • verify geo stats in the column index have been generated.

I can do these later.

}

@Override
void update(Geometry geom) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between EnvelopeCovering and BBox stats when the edge == planar? I think they are the same. I thought we agreed that: when the edge = planar, we only generate BBox stats; when the edge = spherical, we generate both BBox and covering stats (but we omit the covering stats since we don't have the tool to calculate it).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense. We could also omit Covering stats since there is no downstream consumer at the moment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably the only reason to write a Covering would be to test its serialization/deserialization to thrift and/or make sure it is implemented in at least two places for the spec vote? At least in C++ it is not difficult to accumulate the bounding box and only when it is about to be serialized generate the WKB portion of the covering.

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay. I have left a few comments regarding to the statistics. Please take a look. Thanks!


@Override
public void setKind(String kind) {
if (kind == null || kind.isEmpty()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks weird to me since it includes more information than kind. From the specs, crs and edges should be deduced from the geometry logical type metadata.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that the spec only persist the following metadata for each covering to the parquet meta.

struct Covering {
  /**
   * A type of covering. Currently accepted values: "WKB".
   */
  1: required string kind;
  /**
   * A payload specific to kind. Below are the supported values:
   * - WKB: well-known binary of a POLYGON or MULTI-POLYGON that completely
   *   covers the contents. This will be interpreted according to the same CRS
   *   and edges defined by the logical type.
   */
  2: required binary value;
}

The crs and edge are unique to each covering, so we have to save them with each covering metadata as a part of the kind.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to that text, a WKB covering with a different CRS and/or edges is not currently permitted. If we need this, we should reword the spec!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @paleolimbot that this should be consistent with the spec.


@Override
public String getKind() {
return kind + "|" + crs + "|" + edges;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment below this one. This seems to break the contract from base class.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants