Update documentation #317

Closed
sebinside opened this issue Mar 15, 2022 · 3 comments

@sebinside
Member

The JPlag Documentation and README should contain the new information from the PRs #287 and #281.

@sebinside sebinside added enhancement Issue/PR that involves features, improvements and other changes minor Minor issue/feature/contribution/change labels Mar 15, 2022
@sebinside sebinside added this to the v4.0.0 milestone Mar 15, 2022
@sebinside sebinside self-assigned this Mar 15, 2022
@tsaglam tsaglam moved this to Todo in v4.0.0 Release Aug 22, 2022
@tsaglam
Member

tsaglam commented Sep 16, 2022

Nearly done. I only need the markdown code from SimDing#7, as we do not have edit rights on this fork. Quote reply gives me the markdown, but the table is broken. We should copy it before the release.

@tsaglam tsaglam moved this from Todo to In Progress in v4.0.0 Release Sep 16, 2022
@dfuchss
Member

dfuchss commented Sep 16, 2022

@tsaglam this should be the code :) gh api /repos/SimDing/JPlag/issues/7

Clustering

By default, JPlag is configured to perform a clustering of the submissions.
The clustering partitions the set of submissions into groups of similar submissions.
The clusters found can be used as candidates for potentially colluding groups. Each cluster has a strength score that measures how suspicious the cluster is compared to other clusters.

Disabling Clustering

Clustering can take a long time when there are many submissions.
Users who are not interested in the clustering can safely disable it:

  • Using the CLI: With the --cluster-skip option
  • Programmatically:
    // Configure JPlag for the submissions in the given root directory.
    JPlagOptions options = new JPlagOptions("/path/to/rootDir", LanguageOption.JAVA);
    // Disable clustering entirely.
    options.setClusteringOptions(new ClusteringOptions.Builder().enabled(false).build());

    JPlag jplag = new JPlag(options);

Clustering Configuration

Clustering can be configured either with the CLI options or programmatically with the ClusteringOptions class. Both work analogously and share the same default values.

The clustering is designed to work out of the box for about 50-500 submissions, but it can be tweaked when problems occur. For more submissions, it might be necessary to increase Max-Runs or Bandwidth so that an appropriate number of clusters can be determined.

| Group | Option | Description | Default |
|---|---|---|---|
| General | Enable | Controls whether the clustering is run at all. | true |
| General | Algorithm | Which clustering algorithm to use. Agglomerative Clustering: iteratively merges similar submissions bottom-up; it usually requires manual tuning of its parameters to yield helpful clusters. Spectral Clustering: combined with Bayesian Optimization to execute the k-Means clustering algorithm multiple times, hopefully finding a "good" clustering automatically; its default parameters should work acceptably in most cases. | Spectral Clustering |
| General | Metric | The similarity score between submissions to use during clustering. Each score is expressed in terms of the sizes of the submissions A and B and the size of their matched intersection A ∩ B. AVG (aka Dice's coefficient): AVG = 2 * (A ∩ B) / (A + B). MAX (aka overlap coefficient): MAX = (A ∩ B) / min(A, B); compared to AVG, this prevents obfuscation when a collaborator bloats their submission with unrelated code. MIN (deprecated): MIN = (A ∩ B) / max(A, B). INTERSECTION (experimental): INTERSECTION = A ∩ B. | MAX |
| Spectral | Bandwidth | For Spectral Clustering, Bayesian Optimization is used to determine a fitting number of clusters. If a good clustering result is found during the search, numbers of clusters that differ from it by an amount within the bandwidth are also expected to be good. Low values result in more exploration of the search space, high values in more exploitation of known results. | 20.0 |
| Spectral | Noise | The result of each k-Means run in the search for good clusterings is random. The noise level models the variance in the "worth" of these results. It also acts as a regularization constant. | 0.0025 |
| Spectral | Min-Runs | Minimum number of k-Means executions for spectral clustering. These initial runs explore different clustering sizes. | 5 |
| Spectral | Max-Runs | Maximum number of k-Means executions during spectral clustering. Any execution after the initial (min-) runs tries to balance exploration of unknown clustering sizes against exploitation of clustering sizes known to be good. | 50 |
| Spectral | K-Means Iterations | Maximum number of iterations during each execution of the k-Means algorithm. | 200 |
| Agglomerative | Threshold | Only clusters with an inter-cluster-similarity greater than this threshold are merged during agglomerative clustering. | 0.2 |
| Agglomerative | inter-cluster-similarity | How to measure the similarity of two clusters during agglomerative clustering. MIN (aka complete-linkage): clusters are merged if all their submissions are similar. MAX (aka single-linkage): clusters are merged if there is a similar submission in both. AVERAGE (aka average-linkage): clusters are merged if their submissions are similar on average. | AVERAGE |
| Preprocessing | Pre-Processor | How the similarities are preprocessed prior to clustering. Spectral Clustering will probably not yield good results without it. None: no preprocessing. Cumulative Distribution Function (CDF): before clustering, the value of the cumulative distribution function of all similarities is estimated, and the similarities are multiplied by these estimates; this suppresses similarities that are low compared to other similarities. Percentile: any similarity smaller than the given percentile is suppressed during clustering. Threshold: any similarity smaller than the given threshold is suppressed during clustering. | CDF |
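
To make the Metric formulas above concrete, here is a minimal, self-contained sketch of the four similarity scores. It is not JPlag's implementation: the class and method names are hypothetical, the sizes are assumed to be positive token counts, and `matched` stands for the size of the matched intersection A ∩ B.

```java
// Illustrative sketch only; names are hypothetical and not part of the JPlag API.
final class SimilarityMetricsSketch {

    // AVG (Dice's coefficient): 2 * (A ∩ B) / (A + B)
    static double avg(int sizeA, int sizeB, int matched) {
        return 2.0 * matched / (sizeA + sizeB);
    }

    // MAX (overlap coefficient): (A ∩ B) / min(A, B)
    static double max(int sizeA, int sizeB, int matched) {
        return (double) matched / Math.min(sizeA, sizeB);
    }

    // MIN: (A ∩ B) / max(A, B)
    static double min(int sizeA, int sizeB, int matched) {
        return (double) matched / Math.max(sizeA, sizeB);
    }

    // INTERSECTION: the raw size of the matched part, not normalized
    static int intersection(int matched) {
        return matched;
    }

    public static void main(String[] args) {
        // Two submissions of 100 tokens each, 80 of which match.
        System.out.println("AVG " + avg(100, 100, 80));
        System.out.println("MAX " + max(100, 100, 80));

        // Submission B is bloated to 400 tokens with unrelated code;
        // the matched part stays at 80 tokens.
        System.out.println("AVG (bloated) " + avg(100, 400, 80));
        System.out.println("MAX (bloated) " + max(100, 400, 80));
        System.out.println("MIN (bloated) " + min(100, 400, 80));
    }
}
```

In this example, bloating B lowers AVG from 0.8 to 0.32 and MIN to 0.2, while MAX stays at 0.8 because it normalizes by the smaller submission only.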

@tsaglam
Member

tsaglam commented Sep 16, 2022

✅ Documentation incorporated into the wiki and the repo!

@tsaglam tsaglam closed this as completed Sep 16, 2022
Repository owner moved this from In Progress to Done in v4.0.0 Release Sep 16, 2022