Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Magellan index #95

Merged
merged 8 commits into from
Feb 14, 2017
Merged

Magellan index #95

merged 8 commits into from
Feb 14, 2017

Conversation

halfabrane
Copy link
Contributor

@halfabrane halfabrane commented Feb 14, 2017

Adds spatial indexing capabilities to Magellan.

You can now index points, polygons and other shapes in Magellan by ZOrder Curves.
A general abstraction for a Spatial Index is defined and a Z Order curve implementation is provided.
In particular, Geohashing is implemented as a special case of a Z Order curve with a bounding box of the globe.

example:
df.withColumn($"point" geohash 25) will geohash each Point(long, lat) into a geohash with a 5 character precision.

df.withColumn($"polygon" geohash 25) will return a List of all the geohashes with 5 character length that intersect the polygon.

Spatial Joins can be performed by first creating an index on each dataframe and doing a hash join on the index followed by applying the predicate.

As an example:

pointsdf.join(polygonsdf).where($"point" within $"polygon") can be rewritten as
val indexedpointsdf = pointsdf.withColumn("index", $"point geohash 25).select('point, explode($"index").as("index"))
val indexedpolygonsdf = polygonsdf.withColumn("index", $"polygon" geohash 25).select('polygon, explode($"index").as("index"))
indexedpointsdf.join(indexedpolygonsdf, indexedpointsdf("index") === indexedpolygonsdf("index")).where($"point" within $"polygon")

achieves the same join, except it first creates a geohash index on each shape, then performs a hashjoin on the index followed by eliminating possible mismatches (where the geohash boundaries of the point do intersect the polygon, but the point itself does not) using the where clause.

This is usually faster than the default join in Magellan, which is a cross join.
However, constructing the spatial indices does take time so there is a tradeoff involved.
For sufficiently large datasets this join should outperform the naive cross join significantly.

A future Pull Request will add a join optimization based on this PR such that the user does not have to manually rewrite the join into a hashjoin and the optimizer does this automatically.

@harsha2010 harsha2010 merged commit 0ba8933 into harsha2010:master Feb 14, 2017
halfabrane pushed a commit to halfabrane/magellan that referenced this pull request Mar 18, 2017
@halfabrane halfabrane deleted the MAGELLAN-INDEX branch June 27, 2017 10:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants