Readd clustering #281
I'm not quite finished yet, but I have a question and would like you to see my current state. Currently, I've put every setting for the clustering directly inside the JPlagOptions class. This feels pretty messy. Do you have a suggestion?
That is a good question. Currently, I count 14 additional options you added in this PR. The question here would be how many of those the user really modifies. If some are parameters that might be tweaked in the future, but most users will not change the default values, then we should not expose these parameters as options. Even from a usability standpoint, modifying 14 options for clustering alone via the CLI seems excessive and very unlikely (think about the flags you need to define). From a technical standpoint, you could encapsulate the clustering parameters in a data object, but that does not solve the problem of setting these options via the CLI.
I think the defaults are set kind of sane, so I hope most users would not have to change much.
Users would not use all options at the same time. I see two cases in which a user would want to change the options:
A user who had both problems would use 7 options at most. The only thing I can really remove without a bad aftertaste is the option about pruning bad clusters. There does not seem to be a practical reason to look at those.
jplag/src/main/java/de/jplag/clustering/algorithm/AgglomerativeClustering.java
Minor comments :)
float submissionSimilarity = (float) similarityMatrix.getEntry(leftSubmission, rightCluster.get(rightIndex));
similarity = (float) this.accumulator.applyAsDouble(similarity, submissionSimilarity);
Instead of casting, simply use BiFunction<Double, Double, Float> or replace float with double.
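To illustrate the second suggestion, here is a minimal sketch of the accumulation loop done entirely in double. The class name `InterClusterSimilarity`, the method signature, and the plain `double[][]` matrix are assumptions made for this example and are not the PR's actual code.

```java
import java.util.List;
import java.util.function.DoubleBinaryOperator;

/** Hypothetical helper (not the PR's class): folds pairwise similarities without any float casts. */
final class InterClusterSimilarity {
    private InterClusterSimilarity() {
    }

    /**
     * Combines all pairwise similarities between two clusters with the given accumulator,
     * for example Math::max or Math::min.
     */
    static double accumulate(double[][] similarityMatrix, List<Integer> leftCluster, List<Integer> rightCluster,
            double initialValue, DoubleBinaryOperator accumulator) {
        double similarity = initialValue;
        for (int left : leftCluster) {
            for (int right : rightCluster) {
                // Matrix entries, accumulator and result are all double, so no cast is needed.
                similarity = accumulator.applyAsDouble(similarity, similarityMatrix[left][right]);
            }
        }
        return similarity;
    }
}
```

Keeping everything in double matches the `applyAsDouble` call already used in the snippet above, so the casts simply disappear.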
Remove traces of git submodule
Concerning issue #116
This adds the clustering that was removed in #89 again, not including any user interfaces.
Clustering
The clustering is implemented in the de.jplag.clustering package.
It includes two clustering algorithms (spectral and agglomerative), preprocessing, decoupling logic, clustering options and a factory class which can be used to run the clustering in just two statements.
Algorithms
Agglomerative Clustering
Uses a bottom-up approach to successively merge similar clusters. It stops once there are no clusters left that are similar enough to merge. An implementation of this algorithm was originally included in the code base, but was removed in #89. It is still included here because it is a much simpler approach than spectral clustering.
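The bottom-up merging described above can be sketched as follows. This is only an illustration under stated assumptions: it uses average linkage and a fixed similarity threshold as the stopping rule, and the class `AgglomerativeSketch` is hypothetical, not the implementation in this PR.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of agglomerative clustering on a symmetric similarity matrix. */
final class AgglomerativeSketch {
    private AgglomerativeSketch() {
    }

    /** Merges the most similar pair of clusters until no pair reaches the threshold. */
    static List<List<Integer>> cluster(double[][] similarity, double threshold) {
        // Start with one singleton cluster per submission index.
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < similarity.length; i++) {
            List<Integer> singleton = new ArrayList<>();
            singleton.add(i);
            clusters.add(singleton);
        }
        while (clusters.size() > 1) {
            int bestLeft = -1;
            int bestRight = -1;
            double best = Double.NEGATIVE_INFINITY;
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double s = averageSimilarity(similarity, clusters.get(a), clusters.get(b));
                    if (s > best) {
                        best = s;
                        bestLeft = a;
                        bestRight = b;
                    }
                }
            }
            if (best < threshold) {
                break; // no remaining pair is similar enough to merge
            }
            // bestRight > bestLeft, so removing it does not shift bestLeft.
            List<Integer> absorbed = clusters.remove(bestRight);
            clusters.get(bestLeft).addAll(absorbed);
        }
        return clusters;
    }

    /** Average linkage: mean similarity over all cross-cluster pairs (an assumption of this sketch). */
    private static double averageSimilarity(double[][] similarity, List<Integer> left, List<Integer> right) {
        double sum = 0.0;
        for (int l : left) {
            for (int r : right) {
                sum += similarity[l][r];
            }
        }
        return sum / (left.size() * right.size());
    }
}
```

Other linkage rules (minimum or maximum pairwise similarity) only change `averageSimilarity`; the merge loop stays the same.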
Spectral Clustering
Spectral clustering is a clustering approach specifically for graph data. This matches the problem, since the similarities between all submissions can be thought of as a fully connected graph.
Spectral clustering works by computing the Laplacian matrix of the graph and representing the nodes as k-dimensional vectors using its eigenvectors.
At that point the resulting vectors can be clustered using a space-partitioning algorithm; I used k-Means++.
Still, both k-Means and the reduction to k dimensions require the final number of clusters k, which is not known in advance.
In addition, k-Means++ is randomized, so its results vary between runs.
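As a small illustration of the first step, the following sketch builds the unnormalized graph Laplacian L = D - W from a symmetric similarity matrix W, where D is the diagonal matrix of row sums. Which Laplacian variant the PR uses is not stated here, so the unnormalized form and the class name `LaplacianSketch` are assumptions. The eigendecomposition and the k-Means++ step would normally come from a linear-algebra library and are not shown.

```java
/** Hypothetical sketch: unnormalized graph Laplacian of a weighted, undirected graph. */
final class LaplacianSketch {
    private LaplacianSketch() {
    }

    /** Returns L = D - W, where D holds the row sums (weighted degrees) of W on its diagonal. */
    static double[][] unnormalizedLaplacian(double[][] weights) {
        int n = weights.length;
        double[][] laplacian = new double[n][n];
        for (int i = 0; i < n; i++) {
            double degree = 0.0;
            for (int j = 0; j < n; j++) {
                degree += weights[i][j];
            }
            for (int j = 0; j < n; j++) {
                // Diagonal entry: degree minus self-weight; off-diagonal entry: negated weight.
                laplacian[i][j] = (i == j ? degree : 0.0) - weights[i][j];
            }
        }
        return laplacian;
    }
}
```

Every row of the result sums to zero, which is a quick sanity check for an implementation of this step.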
To find a good choice for k and a "good" clustering, I employ Bayesian optimization.
A metric I found in line with my notion of a "good" clustering is:
"The average of the clusters' modularity times their average inner-cluster similarity over the number of the clusters' connections."
By modularity I mean the measure introduced by M. Newman and M. Girvan in "Finding and evaluating community structure in networks" (2004).
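For reference, the modularity ingredient of this metric can be sketched as below, using the standard weighted form Q = sum over clusters c of (e_cc - a_c^2), where e_cc is the fraction of edge weight inside c and a_c is the fraction of edge ends attached to c. This shows only the Newman-Girvan measure, not the combined score quoted above, and the class `ModularitySketch` and its signature are assumptions for the example.

```java
/** Hypothetical sketch: Newman-Girvan modularity for a weighted, undirected graph. */
final class ModularitySketch {
    private ModularitySketch() {
    }

    /**
     * @param weights symmetric weight matrix; diagonal entries are ignored
     * @param clusterOf cluster index of each node, in the range [0, clusterCount)
     */
    static double modularity(double[][] weights, int[] clusterOf, int clusterCount) {
        double total = 0.0; // sum over ordered pairs, i.e. twice the total edge weight
        double[] within = new double[clusterCount]; // weight of edge ends that stay inside the cluster
        double[] degree = new double[clusterCount]; // weight of all edge ends attached to the cluster
        for (int i = 0; i < weights.length; i++) {
            for (int j = 0; j < weights.length; j++) {
                if (i == j) {
                    continue;
                }
                total += weights[i][j];
                degree[clusterOf[i]] += weights[i][j];
                if (clusterOf[i] == clusterOf[j]) {
                    within[clusterOf[i]] += weights[i][j];
                }
            }
        }
        if (total == 0.0) {
            return 0.0; // graph without edges: modularity is defined as 0 in this sketch
        }
        double q = 0.0;
        for (int c = 0; c < clusterCount; c++) {
            double fractionInside = within[c] / total;
            double fractionAttached = degree[c] / total;
            q += fractionInside - fractionAttached * fractionAttached;
        }
        return q;
    }
}
```

Two disconnected pairs of nodes, each pair in its own cluster, give Q = 0.5, while putting all four nodes into one cluster gives Q = 0.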
Spectral clustering is used by default.
Preprocessing
As it can be advantageous to apply some preprocessing before clustering (in particular with Spectral clustering) I included three options for preprocessing.
CDF Preprocessor
Estimates the cumulative distribution function of all similarities and multiplies each similarity with the CDF evaluated at that similarity. This has the effect of driving the lowest similarities close to zero while hardly changing the highest ones.
Since this preprocessor is non-parametric and worked well during my experiments I made it the default.
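A minimal sketch of this idea, assuming the CDF is estimated empirically from the off-diagonal similarities (the fraction of values less than or equal to a given similarity). The class `CdfPreprocessorSketch` is hypothetical, the estimator used in the PR may differ, and the diagonal is simply left at zero here.

```java
import java.util.Arrays;

/** Hypothetical sketch: scales each similarity by the empirical CDF evaluated at that similarity. */
final class CdfPreprocessorSketch {
    private CdfPreprocessorSketch() {
    }

    static double[][] preprocess(double[][] similarity) {
        int n = similarity.length;
        // Collect every pairwise similarity once (upper triangle, no diagonal).
        double[] sorted = new double[n * (n - 1) / 2];
        int k = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                sorted[k++] = similarity[i][j];
            }
        }
        Arrays.sort(sorted);
        double[][] result = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double s = similarity[i][j];
                double cdf = countAtMost(sorted, s) / (double) sorted.length;
                result[i][j] = s * cdf;
                result[j][i] = s * cdf;
            }
        }
        return result;
    }

    /** Number of entries in the sorted array that are less than or equal to value (binary search). */
    private static int countAtMost(double[] sorted, double value) {
        int low = 0;
        int high = sorted.length;
        while (low < high) {
            int mid = (low + high) >>> 1;
            if (sorted[mid] <= value) {
                low = mid + 1;
            } else {
                high = mid;
            }
        }
        return low;
    }
}
```

With three similarities 0.2, 0.5 and 0.9, the CDF values are 1/3, 2/3 and 1, so the lowest similarity shrinks to about 0.067 while the highest stays at 0.9.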
Threshold Preprocessor
Suppresses all similarities below a given threshold. Good values for the threshold vary greatly with the set of input submissions.
Percentile Preprocessor
Is the same as the threshold preprocessor, but the threshold is given as a percentile of the calculated similarities, making it more robust.
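Both threshold-based preprocessors can be sketched together, since the percentile variant only adds a step that derives the threshold from the data. The sketch assumes that "suppress" means setting a similarity to zero and that the percentile is given as a fraction in [0, 1] using the nearest-rank method. The class `ThresholdPreprocessorSketch` and both method names are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical sketch of the threshold and percentile preprocessors. */
final class ThresholdPreprocessorSketch {
    private ThresholdPreprocessorSketch() {
    }

    /** Sets every similarity below the threshold to zero (assumed meaning of "suppress"). */
    static double[][] suppressBelow(double[][] similarity, double threshold) {
        int n = similarity.length;
        double[][] result = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                result[i][j] = similarity[i][j] >= threshold ? similarity[i][j] : 0.0;
            }
        }
        return result;
    }

    /** Nearest-rank percentile of the off-diagonal similarities; percentile is a fraction in [0, 1]. */
    static double percentileThreshold(double[][] similarity, double percentile) {
        int n = similarity.length;
        double[] sorted = new double[n * (n - 1) / 2];
        int k = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                sorted[k++] = similarity[i][j];
            }
        }
        if (sorted.length == 0) {
            return 0.0; // fewer than two submissions: nothing to suppress
        }
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percentile * sorted.length);
        // Clamp the 1-based rank to a valid range before converting it to an index.
        int index = Math.min(Math.max(rank, 1), sorted.length) - 1;
        return sorted[index];
    }
}
```

A percentile preprocessor is then just `suppressBelow(similarity, percentileThreshold(similarity, p))`, which adapts the cut-off to each set of submissions.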
Options and CLI
The (many) options for clustering are all defined in a new ClusteringOptions class. This class also contains sane defaults that should allow users to run the clustering without specifying any additional CLI flags or defining them programmatically.
I refactored the CommandLineArgument enum a little because I needed to add additional optional parameters and that many constructors became confusing. It now works in a builder-pattern-ish fashion. I also added a small class for dealing with groups of arguments (Clustering and Clustering - Preprocessing in the help text).
This is how the help message is now displayed:
Technical Stuff
de.jplag is only coupled to de.jplag.clustering through the ClusteringOptions and ClusteringFactory classes.
de.jplag.clustering is only coupled to de.jplag through the JPlagComparison and Submission classes.
JPlagComparison and Submission are handled in the ClusteringFactory and ClusteringAdapter classes, the latter replacing submissions with integer indices and comparisons with matrices.
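The index-and-matrix translation mentioned in the last point can be sketched as follows. Everything in this example is hypothetical: `AdapterSketch` and its nested `Comparison` type are stand-ins invented for illustration and do not mirror the real ClusteringAdapter, JPlagComparison or Submission APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: maps submissions to integer indices and comparisons to a matrix. */
final class AdapterSketch<S> {
    /** Stand-in for a pairwise comparison result (not the real JPlagComparison). */
    static final class Comparison<S> {
        final S first;
        final S second;
        final double similarity;

        Comparison(S first, S second, double similarity) {
            this.first = first;
            this.second = second;
            this.similarity = similarity;
        }
    }

    private final List<S> submissions = new ArrayList<>();
    private final double[][] matrix;

    AdapterSketch(List<Comparison<S>> comparisons) {
        // Assign each submission the next free index in order of first appearance.
        Map<S, Integer> indexOf = new HashMap<>();
        for (Comparison<S> comparison : comparisons) {
            register(indexOf, comparison.first);
            register(indexOf, comparison.second);
        }
        int n = submissions.size();
        matrix = new double[n][n];
        for (int i = 0; i < n; i++) {
            matrix[i][i] = 1.0; // a submission is fully similar to itself
        }
        for (Comparison<S> comparison : comparisons) {
            int a = indexOf.get(comparison.first);
            int b = indexOf.get(comparison.second);
            matrix[a][b] = comparison.similarity;
            matrix[b][a] = comparison.similarity;
        }
    }

    private void register(Map<S, Integer> indexOf, S submission) {
        if (!indexOf.containsKey(submission)) {
            indexOf.put(submission, submissions.size());
            submissions.add(submission);
        }
    }

    /** Symmetric similarity matrix that index-based algorithms can work on. */
    double[][] similarityMatrix() {
        return matrix;
    }

    /** Translates clusters of indices back into clusters of submissions. */
    List<List<S>> toSubmissions(List<List<Integer>> clusters) {
        List<List<S>> result = new ArrayList<>();
        for (List<Integer> cluster : clusters) {
            List<S> members = new ArrayList<>();
            for (int index : cluster) {
                members.add(submissions.get(index));
            }
            result.add(members);
        }
        return result;
    }
}
```

With this split, the algorithms only ever see integers and a matrix, which is what keeps them independent of the de.jplag types.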