This Pull Request continues the work started in #95.
Geospatial Relations read by Magellan are automatically indexed now.
The schema of such relations includes two new fields: the list of ZOrderCurves that cover the geometry, along with the Relation. A Relation is one of Contains, Within, Disjoint, or Intersects, and indicates the relationship of the ZOrderCurve to the given geometry.
The option magellan.index.precision governs the precision at which we resolve the ZOrderCurves (GeoHashes in the standard case of lat/long).
DataFrames involving shapes can now also be indexed explicitly by invoking:
df.withColumn("index", $"point" precision 30) etc., where in this case we cover the point with a six-character (30 / 5) geohash.
Similarly, for polygons,
df.withColumn("index", $"polygon" precision 30) gives us all the six-character geohashes that are contained in, contain, or intersect this polygon.
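To make the precision arithmetic concrete (30 bits at 5 bits per base-32 character gives 6 characters), here is a minimal, dependency-free sketch of standard geohash encoding. It is not Magellan's implementation; the object name GeohashSketch and the encode signature are hypothetical and exist only for illustration.

```scala
// Illustrative only: a plain-Scala geohash encoder, NOT Magellan's code.
// GeohashSketch and encode are hypothetical names.
object GeohashSketch {
  // Standard geohash base-32 alphabet (omits a, i, l, o).
  private val Base32 = "0123456789bcdefghjkmnpqrstuvwxyz"

  /** Encodes (lat, lon) using `bits` bits of precision; yields bits / 5 characters. */
  def encode(lat: Double, lon: Double, bits: Int): String = {
    require(bits > 0 && bits % 5 == 0, "bits must be a positive multiple of 5")
    var latLo = -90.0; var latHi = 90.0
    var lonLo = -180.0; var lonHi = 180.0
    val sb = new StringBuilder
    var chunk = 0
    for (i <- 0 until bits) {
      val bit =
        if (i % 2 == 0) { // even bit positions halve the longitude interval
          val mid = (lonLo + lonHi) / 2
          if (lon >= mid) { lonLo = mid; 1 } else { lonHi = mid; 0 }
        } else {          // odd bit positions halve the latitude interval
          val mid = (latLo + latHi) / 2
          if (lat >= mid) { latLo = mid; 1 } else { latHi = mid; 0 }
        }
      chunk = (chunk << 1) | bit
      // Every 5 bits form one base-32 character.
      if (i % 5 == 4) { sb.append(Base32.charAt(chunk)); chunk = 0 }
    }
    sb.toString
  }
}
```

Calling GeohashSketch.encode(57.64911, 10.40744, 30) returns a six-character string, matching the 30 / 5 calculation above.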
A future PR will use these indices to choose a spatial join automatically.
For now, if you don't mind manually rewriting queries to take advantage of spatial indexing in Magellan, you can do the following:
// assuming you have two datasets (points and polygons) that you want to join that have been indexed as above
import magellan.index._
import org.apache.spark.sql.functions.explode
val indexedPoints = points.withColumn("index", explode($"index")).select("point", "index.curve", "index.relation")
val indexedPolygons = polygons.withColumn("index", explode($"index")).select("polygon", "index.curve", "index.relation")
// instead of joined = points.join(polygons).where($"point" within $"polygon") you have
val joined = indexedPoints.join(indexedPolygons, indexedPoints("curve") === indexedPolygons("curve")).where((indexedPolygons("relation") === "Within") or ($"point" within $"polygon"))
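The reasoning behind the rewritten query can be sketched without Spark: join only on a shared covering cell, accept the pair immediately when that cell lies within the polygon, and fall back to the exact predicate otherwise. The sketch below is an assumption-laden toy, not Magellan code: it uses unit grid cells in place of ZOrderCurves and axis-aligned rectangles in place of polygons, and every name in it (IndexedJoinSketch, Rect, cover, and so on) is hypothetical.

```scala
// Illustrative only: unit grid cells stand in for ZOrderCurves and
// rectangles stand in for polygons. All names here are hypothetical.
object IndexedJoinSketch {
  final case class Point(x: Double, y: Double)
  final case class Rect(name: String, xmin: Double, ymin: Double, xmax: Double, ymax: Double) {
    // The exact (expensive, in the real case) point-in-shape predicate.
    def contains(p: Point): Boolean =
      p.x >= xmin && p.x <= xmax && p.y >= ymin && p.y <= ymax
  }
  type Cell = (Int, Int)

  // The single unit cell covering a point.
  def cellOf(p: Point): Cell = (math.floor(p.x).toInt, math.floor(p.y).toInt)

  // All cells overlapping the rectangle, tagged "Within" when the whole
  // cell lies inside the rectangle and "Intersects" otherwise.
  def cover(r: Rect): Seq[(Cell, String)] =
    for {
      cx <- math.floor(r.xmin).toInt to math.floor(r.xmax).toInt
      cy <- math.floor(r.ymin).toInt to math.floor(r.ymax).toInt
    } yield {
      val inside = cx >= r.xmin && cx + 1 <= r.xmax && cy >= r.ymin && cy + 1 <= r.ymax
      ((cx, cy), if (inside) "Within" else "Intersects")
    }

  // Equi-join on the cell, then skip the exact test when relation is "Within".
  def indexedJoin(points: Seq[Point], rects: Seq[Rect]): Set[(Point, String)] = {
    val byCell: Map[Cell, Seq[(Rect, String)]] =
      rects
        .flatMap(r => cover(r).map { case (cell, rel) => (cell, (r, rel)) })
        .groupBy(_._1)
        .map { case (cell, entries) => cell -> entries.map(_._2) }
    (for {
      p <- points
      (r, rel) <- byCell.getOrElse(cellOf(p), Seq.empty)
      if rel == "Within" || r.contains(p)
    } yield (p, r.name)).toSet
  }

  // Reference result: the unindexed cross join with the exact predicate.
  def bruteForce(points: Seq[Point], rects: Seq[Rect]): Set[(Point, String)] =
    (for { p <- points; r <- rects if r.contains(p) } yield (p, r.name)).toSet
}
```

Both functions return the same pairs, but indexedJoin only evaluates the exact predicate for points that fall in boundary ("Intersects") cells, which is the saving the manual rewrite is after.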
A sample Databricks community notebook that illustrates how to set up the indices and perform the spatial join is here:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/137058993011870/2088756965947706/6891974485343070/latest.html