-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-2471: Add support for geometry logical type #1379
Conversation
Assert.assertNotNull(geometryStatistics); | ||
System.out.println("GeometryStatistics"); | ||
System.out.println(geometryStatistics); | ||
System.out.println(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have just finished the PoC to read/write geometry values in the form of WKB and verify the GeometryStatistics
have been collected as expected. This is the output of this minimal test.
Footer
ParquetMetaData{FileMetaData{schema: message msg {
required binary col_geom (GEOMETRY(WKB,PLANAR,EPSG:4326));
}
, metadata: {writer.model.name=example}}, blocks: [BlockMetaData{2, 73, rowIndexOffset = 0 [ColumnMetaData{UNCOMPRESSED [col_geom] required binary col_geom (GEOMETRY(WKB,PLANAR,EPSG:4326)) [BIT_PACKED, PLAIN], 4}]}]}
Statistics
min: POINT (1 1), max: POINT (2 2), num_nulls: 0
GeometryStatistics
GeometryStatistics{boundingBox=BoundingBox{xMin=1.0, xMax=2.0, yMin=1.0, yMax=2.0, zMin=NaN, zMax=NaN, mMin=NaN, mMax=NaN}, covering=Covering{geometry=POLYGON ((1 1, 1 2, 2 2, 2 1, 1 1)), edges=PLANAR}, geometryTypes=GeometryTypes{types=[1]}}
@szehon-ho @jiayuasu @paleolimbot @emkornfield
Please let me know what do you think. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you help take an initial review to see if this looks promising? @gszadovszky @julienledem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wgtmac, I am not really familiar with this new geometry logical type. So, my comments are more general.
I can see a TODO for the comparator but I cannot see any specification in the format PR about the total ordering. (BTW, I cannot see a proper spec in LogicalTypes.md. It is for the other PR, though.)
Could you please double-check the licensing of org.locationtech.jts
whether we are able to distribute parquet-java
as is? I am not expert of this, though.
I think, it is not a good idea to include external objects (org.locationtech.jts.geom.Geometry
) into our public API. In our public API we should stick to the primitive type (Binary
).
Otherwise, it looks good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @gszadovszky for a quick review! Please see my inline comments below.
I can see a TODO for the comparator but I cannot see any specification in the format PR about the total ordering.
I thought we'd better make the ColumnOrder
of geometry logical type as undefined and do not populate min/max values. Here is the related line in the format PR: https://github.com/apache/parquet-format/pull/240/files#diff-834c5a8d91719350b20995ad99d1cb6d8d68332b9ac35694f40e375bdb2d3e7cR1083
Could you please double-check the licensing of org.locationtech.jts whether we are able to distribute parquet-java as is?
It has dual licenses:
- Eclipse Public License 2.0
- Eclipse Distribution License 1.0 (a BSD Style License).
I have also checked Apache Sedona which includes org.locationtech.jts/jts-core
in its dependency: https://github.com/apache/sedona/blob/119682321f5301e6ef776315a8f2cd53a8ff63b0/pom.xml#L168-L188. So I assume we are safe here.
@jiayuasu may provide more context as the domain expert.
I think, it is not a good idea to include external objects (org.locationtech.jts.geom.Geometry) into our public API. In our public API we should stick to the primitive type (Binary).
I agree that it is not a good idea to expose org.locationtech.jts.geom.Geometry
as a public API. Let me switch it to internal method for now. If downstream projects like Apache Iceberg (which has customized column writer implementation) could benefit from directly passing a org.locationtech.jts.geom.Geometry
to build stats without unnecessary conversion, we can revisit this later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the answers, @wgtmac!
The TODO comment about the comparator can be removed, then. With the moving of the Geometry
references into non-public scopes I'm OK with this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please let me know what do you think
This looks good to me! (I left some geometry-specific notes on the implementation)
I have also checked Apache Sedona which includes org.locationtech.jts/jts-core in its dependency
I believe Calcite also uses JTS for its spatial component and it may be worth checking there to see to what extent those types are used in various APIs and/or if it's ever caused any issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am pretty confident that JTS is fully compatible with ASF license: https://www.apache.org/legal/resolved.html
This is also why Sedona added it years ago
|
||
public class GeometryStatistics { | ||
|
||
private final BoundingBox boundingBox; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GeometryStatistics
is something that I can hardly find a way to fit into an extension
type. Please let me know if you have any better idea. @pitrou
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a look through for geometry things...this is very cool!
if (typeId == UNKNOWN_TYPE_ID) { | ||
return UNKNOWN_TYPE_ID; | ||
} | ||
Coordinate coordinate = geometry.getCoordinate(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm surprised that a Geometry doesn't have a getDimensions or similar
for (Coordinate coordinate : coordinates) { | ||
update(coordinate.getX(), coordinate.getY(), coordinate.getZ(), coordinate.getM()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am surprised that JTS doesn't have an optimized method for this (it almost certainly has at least an internal one for building indexes internally)
formatStats.setGeometry_types( | ||
new ArrayList<Integer>(stats.getGeometryTypes().getTypes())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be worth sorting this here (if the hash set used in the geometry types has non-deterministic ordering)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense. Fixed.
try { | ||
if (geometry != EMPTY) { | ||
Geometry envelope = reader.read(geometry.array()); | ||
geometry = ByteBuffer.wrap(writer.write(envelope.union(geom).getEnvelope())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a little unclear as to the utility of the EnvelopeCovering
...it seems like the optimal strategy would be to always merge BoundingBox
es and build a geometry at the end. A union in a loop is not usually efficient and should perhaps be more difficult to accidentally do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. Perhaps I can call Envelope.expandToInclude(Envelope)
to maintain a running envelope.
Thanks @paleolimbot for the quick review! I've addressed your feedback and also added page index support for geometry stats. BTW, it is a convention that we need PoC on another Parquet implementation to move forward the proposal. I plan to add similar things to parquet-cpp. Do you happen to know any JTS-parity on the C++ side? Or do you have a plan to implement the PoC? I guess there should be a lot of existing code on the GeoParquet side. |
@wgtmac The JTS-parity on C++ side is GEOS, which is a C++ port of JTS: https://github.com/libgeos/geos Note that: GEOS is LGPL 2.1, which is NOT compatible with ASF |
return mMax; | ||
} | ||
|
||
void update(Geometry geom) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not the correct way to update Geometry in JTS. You need to use CoordinateSequenceFilter. Sedona has many examples here: https://github.com/apache/sedona/blob/0a1db3d35fd7b5721301aa3ce5d39aa8cd828661/common/src/main/java/org/apache/sedona/common/Functions.java#L294
public class BoundingBox { | ||
|
||
private double xMin = Double.MAX_VALUE; | ||
private double xMax = Double.MIN_VALUE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the init value of xMax, yMax, ..., please use -Double.MAX_VALUE
because Double.MIN_VALUE is not guaranteed to be negative. See https://stackoverflow.com/questions/3884793/why-is-double-min-value-in-not-negative
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Double.POSITIVE_INFINITY
and Double.NEGATIVE_INFINITY
are also valid choices for initial min/max values.
For extracting the bounds for each ordinate, org.locationtech.jts.io.twkb.BoundsExtractor
in the JTS code base is also a good example.
new Coordinate(envelope.getMaxX(), envelope.getMinY()), | ||
new Coordinate(envelope.getMinX(), envelope.getMinY()) | ||
}); | ||
geometry = ByteBuffer.wrap(writer.write(polygon)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pay attention to a trap here: JTS WKB writer has a bug that does not write M value. There is a PR on JTS to fix this but it hasn't been merged: locationtech/jts#734.
I think we also need to be a bit more explicit here about which WKB standard we are supporting.
Standard WKB format does not allow Z and M dimension. ISO WKB allow Z and M dimension. JTS follows ISO WKB but its writer has the aforementioned bug. See this article from GEOS: https://libgeos.org/specifications/wkb/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the advise! BTW, it seems that JTS envelope has only X and Y axises. I'm not sure if this is another issue here.
Core sedona contributors @Kontinuation and @zhangfengcdt , can you also review this? @wgtmac Thanks for putting together this PR. I guess you might need some help from the Sedona community regarding the Java geometry implementation together with JTS. We can work together to implement this POC. |
Thanks @jiayuasu for your valuable feedback! Yes, it would be great if we can get direct help from geospatial experts! |
As Jia mentioned, it's GEOS, but this is not something the C++ implementation can/should take on as a dependency.
GDAL has quite a lot to draw from ( https://github.com/OSGeo/gdal/tree/master/ogr/ogrsf_frmts/parquet ), although I think pretty much all we need to do is walk WKB for bounding boxes and geometry type ( https://github.com/OSGeo/gdal/blob/12dab86ca2d8b1a65c4c085e137c62294682ac1d/ogr/ogr_wkb.cpp#L387-L584 ).
I am happy to give it a shot (I'll make a start this week). |
public class BoundingBox { | ||
|
||
private double xMin = Double.MAX_VALUE; | ||
private double xMax = Double.MIN_VALUE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Double.POSITIVE_INFINITY
and Double.NEGATIVE_INFINITY
are also valid choices for initial min/max values.
For extracting the bounds for each ordinate, org.locationtech.jts.io.twkb.BoundsExtractor
in the JTS code base is also a good example.
Geometry polygon = factory.createPolygon(new Coordinate[] { | ||
new Coordinate(envelope.getMinX(), envelope.getMinY()), | ||
new Coordinate(envelope.getMinX(), envelope.getMaxY()), | ||
new Coordinate(envelope.getMaxX(), envelope.getMaxY()), | ||
new Coordinate(envelope.getMaxX(), envelope.getMinY()), | ||
new Coordinate(envelope.getMinX(), envelope.getMinY()) | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand it correctly, EnvelopeCovering repliates BoundingBox, so can we reuse BoundingBox here to handle all ordinates correctly?
If we still use geom.getEnvelopeInternal
, we can replace this with a shorter version factory.toGeometry(envelope)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EnvelopeCovering
is just a possible implementation of the Covering
. The idea behind putting Covering
and BoundingBox
at the same time originates from the suggestion of apache/parquet-format#240 (comment) and apache/parquet-format#240 (comment)
if (crs != null && !crs.isEmpty()) { | ||
sb.append(","); | ||
sb.append(crs); | ||
} | ||
if (metadata != null) { | ||
sb.append(","); | ||
sb.append(metadata); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a high probability that crs
itself contains comma, so this may introduce ambiguity to the generated type parameters.
metadata
will be appended as something like java.nio.HeapByteBuffer[pos=_ lim=_ cap=_]
to the type parameter. Does this have any bad implications?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is definitely something that I need to improve :)
This PR is copied form this place: apache#1379
Close this in favor of #2971 |
This is the PoC of apache/parquet-format#240
Jira
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
the ASF 3rd Party License Policy.
Tests
Commits
from "How to write a good git commit message":
Style
mvn spotless:apply -Pvector-plugins
Documentation