Factor table: Connected components of university relations #415
This PR attempts to solve #404 with a GraphX implementation.
It computes connected components on the full raw graph, ignoring deletes, where edges link persons who attended university in overlapping periods of time.
By ignoring deletes, we guarantee that any two nodes that are disconnected in the result have been disconnected throughout the graph's full time span.
This means that different components can be used as samples for ldbc/ldbc_snb_bi#77 (a): guaranteed that no path exists.
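To make the intended semantics concrete, here is a minimal plain-Python sketch of the idea described above. It is not the GraphX code from this PR. The record layout, the sample data, the function names, the "same university" condition, and the inclusive overlap test are all assumptions made for illustration. Each component is labelled with its smallest person ID, which is the convention GraphX's `connectedComponents` uses for vertex IDs.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical study records: (person_id, university_id, start_year, end_year).
# The layout and values are illustrative assumptions, not datagen output.
STUDY_AT = [
    (1, "U1", 2000, 2004),
    (2, "U1", 2003, 2007),
    (3, "U1", 2006, 2010),
    (4, "U2", 2001, 2005),
    (5, "U2", 2006, 2009),
]


def overlaps(a_start, a_end, b_start, b_end):
    """Inclusive interval overlap (inclusive endpoints are an assumption)."""
    return a_start <= b_end and b_start <= a_end


def connected_components(records):
    """Map each person to the smallest person ID in their component."""
    parent = {person: person for person, *_ in records}

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller ID as the root so labels are deterministic.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    # Group attendees per university; deletes are never applied.
    by_university = defaultdict(list)
    for person, university, start, end in records:
        by_university[university].append((person, start, end))

    # Link every pair whose study periods at the same university overlap.
    for attendees in by_university.values():
        for (p, p_start, p_end), (q, q_start, q_end) in combinations(attendees, 2):
            if overlaps(p_start, p_end, q_start, q_end):
                union(p, q)

    return {person: find(person) for person in parent}


components = connected_components(STUDY_AT)
```

In this sample, persons 1 and 3 never studied at the same time but land in the same component through person 2, while persons 4 and 5 stay in separate components. Picking two persons with different component labels therefore gives a pair with no path between them in this graph.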
`spark-graphx`. Have to test whether it is provided on EMR: Yes, it is.
`spark-graphx` into the jar. Maybe a `--with-spark-graphx` or `--include-spark-graphx`. (I am not sure about `with`, as my understanding is that it is often used to denote feature switches, and this is not a feature. It is more like a platform-dependent build directive that would be used to create the same "whole datagen distribution"; no feature is technically added or removed here.)