PARQUET-2417: Add support for geometry logical type #2971

zhangfengcdt · 2024-07-26T15:31:30Z

This PR provides a proof of concept (POC) for the proposed parquet-format changes that add a geometry logical type to Parquet.

Here is the proposal: apache/parquet-format#240

Jira

My PR addresses the following Parquet Jira issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
- https://issues.apache.org/jira/browse/PARQUET-2471

Tests

My PR adds the following unit tests: TestGeometryTypeRoundTrip

Commits

My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference)
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Documentation

In case of new functionality, my PR adds documentation that describes how to use it.
- All public functions and classes in the PR have Javadoc that explains what they do

This PR is copied from apache#1379.

zhangfengcdt · 2024-07-26T15:42:53Z

CC: @jiayuasu @Kontinuation @wgtmac

wgtmac

Thanks for the update! I have left some comments. I think we are reaching the finish line!

wgtmac · 2024-08-17T07:36:23Z

parquet-hadoop/src/test/java/org/apache/parquet/statistics/TestGeometryTypeRoundTrip.java

+  }
+
+  @Test
+  public void testEPSG4326BasicReadWriteGeometryValue() throws Exception {


Thanks for adding these tests!

I think we are missing tests for the following cases:
- Verify that geometry type metadata is preserved.
- Verify that all kinds of geometry stats are preserved, including bbox, covering, and geometry types.
- Verify that geo stats are generated in the column index.

I can do these later.

jiayuasu · 2025-02-13T05:46:44Z

@wgtmac please take a look :-)

wgtmac · 2025-02-13T06:01:39Z

Sure, I will take a look. Thanks!

Kontinuation · 2025-02-17T10:37:44Z

I am depending on this PR to build geo support for Iceberg. I got a lot of test failures when building this branch locally:

java.lang.NullPointerException
	at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:965)
	at org.apache.parquet.format.converter.ParquetMetadataConverter.buildColumnChunkMetaData(ParquetMetadataConverter.java:1750)
	at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:1848)
	at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:1728)
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:629)
	at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:934)
	at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:925)
	at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:698)

An NPE is thrown when reading Parquet files that have no geo columns. Can we apply the following patch to resolve this problem?

diff --git a/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java b/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
index 3efc9345..22e51783 100644
--- a/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
+++ b/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
@@ -961,6 +961,9 @@ public class ParquetMetadataConverter {
 
   static org.apache.parquet.column.statistics.geometry.GeospatialStatistics fromParquetStatistics(
       GeospatialStatistics formatGeomStats, PrimitiveType type) {
+    if (formatGeomStats == null) {
+      return null;
+    }
     org.apache.parquet.column.statistics.geometry.BoundingBox bbox = null;
     if (formatGeomStats.isSetBbox()) {
       BoundingBox formatBbox = formatGeomStats.getBbox();
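
The guard in the patch above can be illustrated in isolation. This is only a minimal sketch, not the PR's code: `FormatGeoStats` and `GeoStats` are hypothetical stand-ins for the Thrift-generated and parquet-column statistics classes, and the single `xmin` field is assumed purely for illustration.

```java
public class NullGuardSketch {
  // Hypothetical stand-in for the Thrift-generated statistics struct.
  static final class FormatGeoStats {
    final Double xmin; // null models an unset optional field

    FormatGeoStats(Double xmin) {
      this.xmin = xmin;
    }

    boolean isSetXmin() {
      return xmin != null;
    }
  }

  // Hypothetical stand-in for the parquet-column statistics class.
  static final class GeoStats {
    final Double xmin;

    GeoStats(Double xmin) {
      this.xmin = xmin;
    }
  }

  // A column chunk without geo data carries no geospatial statistics,
  // so the converter has to tolerate a null argument instead of dereferencing it.
  static GeoStats fromFormat(FormatGeoStats formatStats) {
    if (formatStats == null) {
      return null;
    }
    return new GeoStats(formatStats.isSetXmin() ? formatStats.xmin : null);
  }

  public static void main(String[] args) {
    System.out.println(fromFormat(null) == null);
    System.out.println(fromFormat(new FormatGeoStats(1.5)).xmin);
  }
}
```

Returning null keeps the caller's existing "no statistics" path unchanged; whether an empty statistics object would be preferable is a design choice this sketch does not settle.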

Kontinuation · 2025-02-19T14:08:42Z

parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java

+
+  public static class GeographyLogicalTypeAnnotation extends LogicalTypeAnnotation {
+    private final String crs;
+    private final String edgeAlgorithm;


Should we define an enum for edge interpolation algorithms?

They are already auto-generated from the Thrift definition as EdgeInterpolationAlgorithm.

It is not used here because doing so would create a circular dependency.

parquet-column defines its own enums for the ones defined by Thrift, for example enum Encoding and enum CompressionCodecName. I'm not sure whether we should follow the same convention for EdgeInterpolationAlgorithm.
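
The convention mentioned here (a module-local enum that mirrors the Thrift-generated one, with conversion done by name) could look roughly like the sketch below. Both enums are hypothetical stand-ins written for this example; the constant names are assumed to match the Thrift EdgeInterpolationAlgorithm values and should be checked against the spec.

```java
public class EdgeAlgorithmSketch {
  // Stand-in for the Thrift-generated enum, which the schema module cannot reference
  // without creating the circular dependency discussed above.
  enum FormatEdgeAlgorithm { SPHERICAL, VINCENTY, THOMAS, ANDOYER, KARNEY }

  // Hypothetical enum owned by the schema module; constant names are an assumption.
  enum EdgeAlgorithm { SPHERICAL, VINCENTY, THOMAS, ANDOYER, KARNEY }

  // Conversion by name lives on the converter side, which can see both enums.
  static FormatEdgeAlgorithm toFormat(EdgeAlgorithm algorithm) {
    return algorithm == null ? null : FormatEdgeAlgorithm.valueOf(algorithm.name());
  }

  static EdgeAlgorithm fromFormat(FormatEdgeAlgorithm algorithm) {
    return algorithm == null ? null : EdgeAlgorithm.valueOf(algorithm.name());
  }

  public static void main(String[] args) {
    System.out.println(toFormat(EdgeAlgorithm.KARNEY));
    System.out.println(fromFormat(FormatEdgeAlgorithm.SPHERICAL));
  }
}
```

With an enum on the schema side, comparing with `==` is safe and invalid algorithm names fail early in `valueOf`, at the cost of keeping two enums in sync.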

Kontinuation · 2025-02-20T03:29:04Z

parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java

+      GeographyLogicalTypeAnnotation other = (GeographyLogicalTypeAnnotation) obj;
+      return crs.equals(other.crs) && edgeAlgorithm == other.edgeAlgorithm;


We should use equals instead of == to compare Strings, even when the String instances happen to be reused. I think this was missed when edgeAlgorithm was reverted from an enum to a String.
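
As a small illustration of the difference (a standalone sketch, not code from the PR; the helper names are made up for this example):

```java
import java.util.Objects;

public class StringEqualitySketch {
  // Reference comparison: true only when both variables point at the same object.
  static boolean sameByReference(String a, String b) {
    return a == b;
  }

  // Value comparison: true when the contents match; also handles nulls.
  static boolean sameByValue(String a, String b) {
    return Objects.equals(a, b);
  }

  public static void main(String[] args) {
    String literal = "spherical";
    // new String(...) always allocates a distinct object, as a deserialized value would.
    String built = new String("spherical");
    System.out.println(sameByReference(literal, built));
    System.out.println(sameByValue(literal, built));
  }
}
```

`Objects.equals` is used here so that a null edgeAlgorithm on either side does not throw.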

Kontinuation · 2025-02-20T03:33:08Z

parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java

+
+    @Override
+    public int hashCode() {
+      return Objects.hash(crs);


We should hash both crs and edgeAlgorithm.
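
A sketch of what a consistent equals/hashCode pair over both fields might look like. `GeographyKey` is a hypothetical class used only for illustration; it is not the PR's GeographyLogicalTypeAnnotation.

```java
import java.util.Objects;

public class GeographyKeySketch {
  // Hypothetical value class holding the same two fields as the annotation.
  static final class GeographyKey {
    private final String crs;
    private final String edgeAlgorithm;

    GeographyKey(String crs, String edgeAlgorithm) {
      this.crs = crs;
      this.edgeAlgorithm = edgeAlgorithm;
    }

    @Override
    public boolean equals(Object obj) {
      if (this == obj) {
        return true;
      }
      if (!(obj instanceof GeographyKey)) {
        return false;
      }
      GeographyKey other = (GeographyKey) obj;
      // Value comparison on both fields, null-safe.
      return Objects.equals(crs, other.crs) && Objects.equals(edgeAlgorithm, other.edgeAlgorithm);
    }

    @Override
    public int hashCode() {
      // Hash the same fields that equals compares.
      return Objects.hash(crs, edgeAlgorithm);
    }
  }

  public static void main(String[] args) {
    GeographyKey a = new GeographyKey("OGC:CRS84", "spherical");
    GeographyKey b = new GeographyKey("OGC:CRS84", new String("spherical"));
    System.out.println(a.equals(b) && a.hashCode() == b.hashCode());
  }
}
```

Hashing only crs would not break the equals/hashCode contract, but annotations that differ only in edgeAlgorithm would always collide in hash-based collections.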

zhangfengcdt added 11 commits July 22, 2024 17:05

PARQUET-2471: Add support for geometry logical type

6a2e051

This PR is copied form this place: apache#1379

fix types

d354012

Refactor BoundingBox and GeometryTypes

969b696

revert naming changes

87ee8ea

revert TestDecimalUtils

d67e03b

revert more

ccf1c4a

refactor statistics

35342c2

modify EnvelopeCovering

bcfefb7

add more unit tests

cf615f6

update comments

c6ae733

add comment for envelope converging expand calculation

80a629e

zhangfengcdt marked this pull request as draft July 26, 2024 15:39

jiayuasu suggested changes Jul 26, 2024

parquet-column/src/main/java/org/apache/parquet/column/statistics/geometry/BoundingBox.java (outdated)

Fix the boundingbox initial values in constructor

e0ec9ef

wgtmac reviewed Aug 7, 2024

...uet-column/src/main/java/org/apache/parquet/column/statistics/geometry/EnvelopeCovering.java (outdated)

jiayuasu reviewed Aug 7, 2024

...uet-column/src/main/java/org/apache/parquet/column/statistics/geometry/EnvelopeCovering.java (outdated)

Update the poc implementation for the changes to the spec

7c728b1

jiayuasu reviewed Aug 7, 2024

parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java (outdated)

zhangfengcdt added 2 commits August 7, 2024 15:02

remove print

fa36d06

implement evelop covering for spherical coordinates

2a1c62a

wgtmac reviewed Aug 9, 2024

parquet-column/pom.xml (outdated)

zhangfengcdt added 3 commits August 12, 2024 08:52

throw a not-implemented exception for the covering statistics when the spherical edge is specified.

d0e7d3d

Merge branch 'master' of github.com:apache/parquet-java into feature-apache-parquet-2417-geospatial

30d64be

remove unused comment codes

29a86b5

zhangfengcdt marked this pull request as ready for review August 12, 2024 15:57

wgtmac reviewed Aug 17, 2024

zhangfengcdt added 3 commits August 19, 2024 16:35

address some review comments

0c4b8b4

revert changes that are not desired

a56e9ad

refactor toString and remove test scope of jts-core in parquet-hadoop module

1ae3d99

wgtmac mentioned this pull request Sep 18, 2024

PARQUET-2471: Add GEOMETRY and GEOGRAPHY logical types apache/parquet-format#240

Merged

address comments to remove string parsing (to be consistent with spec)

a378a9a

wgtmac mentioned this pull request Sep 30, 2024

PARQUET-2471: Add support for geometry logical type #1379

Closed

5 tasks

zhangfengcdt added 13 commits October 17, 2024 11:20

update according to the changes to the upstream pqrquet-format pr

698325a

remove coverings

d65ba8e

add GeometryStatistics to ColumnMetaData

1b0f5b9

more code cleanup for covering

342e400

add toParquetGeometryStatistics

3cc9b68

fix check errors

dc05cfd

Merge branch 'master' of github.com:apache/parquet-java into feature-apache-parquet-2417-geospatial

6f1d586

Merge branch 'master' of github.com:apache/parquet-java into feature-apache-parquet-2417-geospatial

e4e3cae

change and remove the encoding and edges from geometry type (spec change)

69d950f

fix unit tests

d296c6a

handle the wraparound case for X values

93e28b7

support GEOGRAPHY type

e688d05

revert import changes

01ac560

Kontinuation reviewed Feb 17, 2025

parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java

Kontinuation reviewed Feb 18, 2025

Kontinuation reviewed Feb 19, 2025

parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java (outdated)

Kontinuation reviewed Feb 19, 2025

parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveType.java (outdated)

Kontinuation reviewed Feb 19, 2025

parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java (outdated)

Kontinuation reviewed Feb 19, 2025

zhangfengcdt added 4 commits February 19, 2025 08:05

address pr review comments

f9585cd

fix formatting issue

b805cf4

refactor geography logic type

6aa7028

revert the edge algorithm to use string to avoid loop dependency

4536e5f

Kontinuation reviewed Feb 20, 2025
