PARQUET-2417: Add support for geometry logical type #2971
This PR is copied from apache#1379.
Thanks for the update! I have left some comments. I think we are reaching the finish line!
parquet-hadoop/src/test/java/org/apache/parquet/statistics/TestGeometryTypeRoundTrip.java
}

@Test
public void testEPSG4326BasicReadWriteGeometryValue() throws Exception {
Thanks for adding these tests!
I think we are missing tests for the following cases:
- verify geometry type metadata is well preserved.
- verify all kinds of geometry stats are preserved, including bbox, covering and geometry types.
- verify geo stats in the column index have been generated.
I can do these later.
}

@Override
void update(Geometry geom) {
What is the difference between EnvelopeCovering and BBox stats when the edges are planar? I think they are the same. I thought we agreed that when the edges are planar, we only generate BBox stats, and when the edges are spherical, we generate both BBox and covering stats (but we omit the covering stats for now since we don't have a tool to calculate them).
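The rule described in this comment can be sketched as a small decision function. This is only an illustration under stated assumptions: `GeoStatsPolicy`, `Edges`, `StatKind`, and the `canComputeSphericalCovering` flag are hypothetical names and are not part of parquet-java or this PR.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical names throughout; none of these types come from parquet-java.
final class GeoStatsPolicy {
    enum Edges { PLANAR, SPHERICAL }
    enum StatKind { BBOX, COVERING }

    /**
     * Returns the statistics a writer would emit for the given edge interpretation.
     * canComputeSphericalCovering models whether a tool for spherical coverings exists.
     */
    static Set<StatKind> statsToWrite(Edges edges, boolean canComputeSphericalCovering) {
        // The bounding box is written in both cases.
        EnumSet<StatKind> kinds = EnumSet.of(StatKind.BBOX);
        // A covering is only worth writing for spherical edges, and only if it can be computed.
        if (edges == Edges.SPHERICAL && canComputeSphericalCovering) {
            kinds.add(StatKind.COVERING);
        }
        return kinds;
    }
}
```

With the flag set to false, as the comment suggests is the case today, both edge interpretations yield only the bounding box.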
Makes sense. We could also omit Covering stats since there is no downstream consumer at the moment.
Probably the only reason to write a Covering would be to test its serialization/deserialization to thrift and/or make sure it is implemented in at least two places for the spec vote? At least in C++ it is not difficult to accumulate the bounding box and only when it is about to be serialized generate the WKB portion of the covering.
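The approach mentioned for C++ (accumulate the bounding box, then generate the WKB only at serialization time) might look roughly like this in Java. It is a sketch, not the code in this PR: `LazyBBoxCovering` and its methods are hypothetical, it handles only X/Y coordinates, and it assumes a planar interpretation in which the box itself is a valid covering.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical class; not the parquet-java or C++ implementation.
final class LazyBBoxCovering {
    private double xMin = Double.POSITIVE_INFINITY, yMin = Double.POSITIVE_INFINITY;
    private double xMax = Double.NEGATIVE_INFINITY, yMax = Double.NEGATIVE_INFINITY;

    /** Cheap per-coordinate update: four comparisons, no WKB work. */
    void update(double x, double y) {
        xMin = Math.min(xMin, x); xMax = Math.max(xMax, x);
        yMin = Math.min(yMin, y); yMax = Math.max(yMax, y);
    }

    boolean isEmpty() { return xMin > xMax; }

    /** Builds a little-endian WKB POLYGON for the box; meant to be called once, at serialization. */
    byte[] toWkbPolygon() {
        if (isEmpty()) throw new IllegalStateException("no coordinates accumulated");
        // 1 byte order + 4 type + 4 ring count + 4 point count + 5 points * 2 doubles
        ByteBuffer buf = ByteBuffer.allocate(1 + 4 + 4 + 4 + 5 * 16).order(ByteOrder.LITTLE_ENDIAN);
        buf.put((byte) 1);   // byte-order flag: 1 = little endian
        buf.putInt(3);       // WKB geometry type 3 = Polygon
        buf.putInt(1);       // one exterior ring
        buf.putInt(5);       // closed ring: first point repeated at the end
        double[][] ring = {{xMin, yMin}, {xMax, yMin}, {xMax, yMax}, {xMin, yMax}, {xMin, yMin}};
        for (double[] p : ring) { buf.putDouble(p[0]); buf.putDouble(p[1]); }
        return buf.array();
    }
}
```

The point of the design is that `update` stays cheap on the write path, while the WKB allocation happens once per statistics object rather than once per value.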
Sorry for the delay. I have left a few comments regarding the statistics. Please take a look. Thanks!
@Override
public void setKind(String kind) {
  if (kind == null || kind.isEmpty()) {
This looks weird to me since it includes more information than `kind`. From the specs, `crs` and `edges` should be deduced from the geometry logical type metadata.
The issue is that the spec only persists the following metadata for each covering in the Parquet metadata:
struct Covering {
  /**
   * A type of covering. Currently accepted values: "WKB".
   */
  1: required string kind;
  /**
   * A payload specific to kind. Below are the supported values:
   * - WKB: well-known binary of a POLYGON or MULTI-POLYGON that completely
   *   covers the contents. This will be interpreted according to the same CRS
   *   and edges defined by the logical type.
   */
  2: required binary value;
}
The CRS and edges are unique to each covering, so we have to save them with each covering's metadata as part of the kind.
According to that text, a WKB covering with a different CRS and/or edges is not currently permitted. If we need this, we should reword the spec!
I agree with @paleolimbot that this should be consistent with the spec.
@Override
public String getKind() {
  return kind + "|" + crs + "|" + edges;
See my comment below this one. This seems to break the contract from base class.
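One way to keep `getKind()` consistent with the quoted spec text is to return only the kind and to read the CRS and edges from the logical type instead of encoding them into the kind string. The sketch below is an assumption about how that could look; `GeometryTypeInfo` and `WkbCovering` are hypothetical types and not the classes in this PR.

```java
import java.util.Objects;

// Hypothetical stand-in for the geometry logical type metadata.
final class GeometryTypeInfo {
    final String crs;    // may be null if the logical type does not set one
    final String edges;  // e.g. "planar" or "spherical"

    GeometryTypeInfo(String crs, String edges) {
        this.crs = crs;
        this.edges = Objects.requireNonNull(edges, "edges");
    }
}

// Hypothetical covering that never encodes CRS or edges into its kind.
final class WkbCovering {
    static final String KIND = "WKB"; // the only kind the quoted spec text accepts

    private final byte[] value;
    private final GeometryTypeInfo type;

    WkbCovering(byte[] value, GeometryTypeInfo type) {
        this.value = Objects.requireNonNull(value, "value").clone();
        this.type = Objects.requireNonNull(type, "type");
    }

    /** Returns exactly the kind, matching what would be persisted in the Covering struct. */
    String getKind() { return KIND; }

    byte[] getValue() { return value.clone(); }

    /** CRS and edges are looked up from the logical type, not stored per covering. */
    String getCrs() { return type.crs; }

    String getEdges() { return type.edges; }
}
```

If a covering with its own CRS or edges turns out to be necessary, this shape no longer works and the spec wording would need to change, as noted in the thread.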
This PR provides a proof of concept (POC) for the proposed parquet-format changes that add a geometry type to Parquet.
Here is the proposal: apache/parquet-format#240