Readd clustering and min/max/avg scores #116

sebinside · 2021-05-31T08:09:28Z

Several functionalities were removed by #89 but should be restored. This includes:

Clustering of multiple submissions based on similarity (groups of plagiarism)
Max similarity to detect plagiarism even when one solution is obfuscated with additional code
Min similarity (lesser priority, maybe not relevant at all)

tsaglam · 2021-10-12T09:31:13Z

These functionalities should be restored and integrated into the result objects (and thus in the report generation). Functionally, they should be roughly the same as in the legacy version of JPlag. How this functionality is implemented, can differ from the legacy implementation.

Clustering

The clustering can be seen in the legacy web report (internal).
The clustering should be accessible via the JPlagResult class.
The legacy implementation can be mainly found in the clustering package.
How does the legacy implementation potentially compare to other clustering methods?

Max similarity

Max similarity can be seen in the legacy web report (internal).
It is important to find out how max similarity was implemented in the legacy version and what remains to do in the current state
The differences between max and avg similarity are described as follows:

JPlag compares programs by trying to cover one of the programs with (preferably big) sequences from the other program. The coverage of a program is defined as the ratio of covered source code to the total size of the source code. The average similarity is defined as the average of both program coverages. This is the default similarity that works in most cases: Matches with a high average similarity indicate that the programs work in a very similar way. The maximum similarity is defined as the maximum of both program coverages. This ranking is especially useful if the programs are very different in size. This can happen when dead code was inserted to disguise the origin of the plagiarized program.

Steps to do:

Carefully re-implement both features
Verify the functionality with test cases
Merge it into the master branch
Document both the clustering and the max similarity in the wiki

SimDing · 2021-11-08T16:33:33Z

The "Max similarity" is really just the maximum of both of the similarities.

There are two other metrics calculated at that place, but omitted in the report:

Min similarity: like max similarity but with minimum
Avg similarity: intersection over union

When looking at it from a vulnerability perspective, both of them are worse than than max similarity. The only advantage over plain similarity I can see in them is that they are symmetric. It seems like IOU was used for clustering. I don't plan to add Min similarity again, I might use IOU for clustering again, but I guess I won't add it to the report.

However I am playing with the thought of adding additional metrics that disregard submission size.

tsaglam · 2021-11-09T07:57:42Z

Good to have that confirmed, then I would say let us focus on clustering first and after that, you can experiment with additional metrics.

I might use IOU for clustering again, but I guess I won't add it to the report.

Yes, you do not need to add it to the report generation, as this is done as part of #192 for the new report anyways.
Just make sure to provide the clustering as part of the result object in some meaningful way.

SimDing · 2021-11-18T15:34:44Z

Should it be possible to use different clusterings in the same JPlag run? For example:

Do hierarchical clustering twice, but use a different threshold and inter-cluster-similarity each time
Cluster with max-similarity, min-similarity and avg-similarity

tsaglam · 2021-11-19T10:08:07Z

Should it be possible to use different clusterings in the same JPlag run? For example:

Here are some thoughts:

At first glance, different types of clusterings seem useful
However, we do not know yet, if it really provides more useful information for the users in the end, however, we can test that
Looking at the clusters in the new report should not be complicated, it should not feel like a parameter-tuning-based task where you need to check different clustering types and see if they make a difference for your JPlag run.
So if we allow multiple clusterings because it is actually useful, we should limit it to maybe 2-3 types max

Do hierarchical clustering twice, but use a different threshold and inter-cluster-similarity each time

This might be a scenario that brings complexity to the users' workflow but might not bring that much information. At least that is my first intuition.

Cluster with max-similarity, min-similarity and avg-similarity

Clustering according to max-similarity could be promising. For min-similarity, I have not even found a use case yet for pairwise comparison.

Tl;dr I think we should first find out if multiple clusterings can be useful for realistic submissions sets, before we should put much time into enabling multiple clusterings. However, the idea sounds interesting.

sebinside · 2021-11-24T09:15:36Z

@SimDing I added the (private) PseudonymizedReports repository with some reports and CSV files that should help you evaluate your clustering. Good luck!

tsaglam · 2021-12-03T08:58:37Z

@SimDing I thought again about our discussion yesterday regarding the inconvenience of having to specify the number of clusters beforehand. After a bit of consultation, I think we should try hierarchical clustering, e.g. agglomerative clustering. There is even a library for that (GitHub/Maven Central).

What do you think?

sebinside · 2021-12-16T13:51:31Z

@SimDing Regarding licenses: It should be just as I said: LGPL in GPL is no problem. Also discussed, e.g., here: https://softwareengineering.stackexchange.com/a/342158

SimDing · 2021-12-16T14:18:43Z

@SimDing I thought again about our discussion yesterday regarding the inconvenience of having to specify the number of clusters beforehand. After a bit of consultation, I think we should try hierarchical clustering, e.g. agglomerative clustering. There is even a library for that (GitHub/Maven Central).

What do you think?

@tsaglam This is what was implemented in the legacy version, spectral clustering usually yields better results, but I will leave it as an option.

@SimDing Regarding licenses: It should be just as I said: LGPL in GPL is no problem. Also discussed, e.g., here: https://softwareengineering.stackexchange.com/a/342158

@sebinside
Great, then I'll just add it as is and put a comment about the origin.

SimDing · 2021-12-16T14:51:20Z

@sebinside @tsaglam I've thought about your suggestions regarding the average similarity inside of the clusters. I think it makes sense to combine both metrics into a single one and order the clusters by the resulting metric.

This is how it would look like (using the 3 legacy reports from the pseudonymized repo):

cs: community strength; criterion for quality of clustering; sum(inner cluster similarity) - sum(inter cluster similarity)^2
ncsm: normalized community strength per member; the metric for evidence strength I suggested
avgSim: average (max-) similarity inside cluster
combined: harmonic mean of ncsm and avgSim

A

cs	ncsm	avgSim	combined	members
0,006	0,339	0,555	0,421	[Student (175), Student (176), Student (329), Student (29)]
0,003	0,179	0,298	0,224	[Student (68), Student (298), Student (279), Student (323)]
0,001	0,134	0,653	0,222	[Student (398), Student (339)]
0,001	0,129	0,632	0,215	[Student (417), Student (299)]
0,003	0,170	0,277	0,211	[Student (87), Student (422), Student (28), Student (314)]
0,133	0,508	0,129	0,206	[Student (203), Student (315), Student (34), Student (161), Student (124), Student (402), Student (262), Student (71), Student (426), Student (39), Student (348), Student (119), Student (190), Student (359), Student (11), Student (407), Student (272), Student (169), Student (229), Student (135), Student (173), Student (127), Student (326), Student (432), Student (420), Student (414), Student (373), Student (437), Student (443), Student (331), Student (178), Student (244), Student (106), Student (282), Student (257), Student (448), Student (133), Student (340), Student (302), Student (193), Student (30), Student (137), Student (72), Student (332), Student (134), Student (109), Student (342), Student (449), Student (386), Student (18), Student (438), Student (361), Student (291), Student (179), Student (202), Student (211), Student (419), Student (46), Student (206), Student (413), Student (324), Student (242), Student (388), Student (445)]
0,004	0,186	0,232	0,206	[Student (141), Student (433), Student (198), Student (385), Student (170)]
0,004	0,179	0,224	0,199	[Student (104), Student (139), Student (399), Student (208), Student (122)]
0,001	0,119	0,582	0,198	[Student (353), Student (140)]
0,005	0,196	0,194	0,195	[Student (70), Student (194), Student (446), Student (116), Student (404), Student (186)]
0,001	0,118	0,574	0,195	[Student (312), Student (295)]
0,001	0,117	0,570	0,194	[Student (336), Student (294)]
0,005	0,191	0,189	0,190	[Student (84), Student (382), Student (123), Student (136), Student (310), Student (360)]
0,007	0,227	0,161	0,188	[Student (145), Student (248), Student (376), Student (355), Student (185), Student (352), Student (344), Student (403)]
0,008	0,227	0,158	0,186	[Student (357), Student (48), Student (412), Student (132), Student (35), Student (440), Student (180), Student (52), Student (102)]
0,001	0,111	0,543	0,185	[Student (93), Student (316)]
0,001	0,111	0,540	0,184	[Student (405), Student (22)]
0,001	0,110	0,539	0,183	[Student (251), Student (101)]
0,008	0,226	0,145	0,176	[Student (321), Student (160), Student (130), Student (350), Student (233), Student (264), Student (43), Student (365), Student (241)]
0,087	0,558	0,095	0,162	[Student (255), Student (283), Student (325), Student (83), Student (85), Student (333), Student (212), Student (177), Student (216), Student (293), Student (254), Student (380), Student (275), Student (78), Student (114), Student (174), Student (129), Student (121), Student (113), Student (21), Student (338), Student (318), Student (210), Student (73), Student (191), Student (197), Student (309), Student (423), Student (49), Student (12), Student (41), Student (243), Student (157), Student (259), Student (263), Student (66), Student (91), Student (27)]
0,010	0,233	0,117	0,155	[Student (54), Student (59), Student (271), Student (370), Student (330), Student (390), Student (364), Student (38), Student (253), Student (81), Student (268)]
0,007	0,196	0,125	0,152	[Student (26), Student (252), Student (256), Student (64), Student (214), Student (384), Student (401), Student (320), Student (143)]
0,027	0,328	0,090	0,141	[Student (42), Student (238), Student (334), Student (213), Student (55), Student (37), Student (273), Student (397), Student (237), Student (144), Student (427), Student (4), Student (171), Student (162), Student (167), Student (90), Student (431), Student (183), Student (218), Student (7)]
0,018	0,238	0,097	0,138	[Student (409), Student (107), Student (125), Student (77), Student (226), Student (14), Student (69), Student (230), Student (222), Student (366), Student (375), Student (209), Student (6), Student (103), Student (15), Student (117), Student (51), Student (285)]
0,005	0,146	0,089	0,111	[Student (356), Student (204), Student (36), Student (436), Student (377), Student (47), Student (265), Student (277), Student (156)]
Community Strength: 0.3465919
Clusters: 25

B

cs	ncsm	avgSim	combined	members
0,026	0,600	0,167	0,261	[Student (329), Student (164), Student (108), Student (224), Student (220), Student (95), Student (230), Student (191), Student (265), Student (67), Student (47)]
0,004	0,214	0,144	0,172	[Student (63), Student (138), Student (319), Student (243), Student (23)]
0,004	0,162	0,086	0,112	[Student (76), Student (158), Student (44), Student (42), Student (119), Student (94)]
0,046	0,447	0,054	0,096	[Student (31), Student (223), Student (143), Student (204), Student (58), Student (267), Student (280), Student (22), Student (38), Student (216), Student (55), Student (246), Student (322), Student (35), Student (256), Student (330), Student (233), Student (162), Student (313), Student (17), Student (206), Student (104), Student (278), Student (157), Student (272), Student (196)]
0,213	0,695	0,045	0,085	[Student (62), Student (21), Student (327), Student (28), Student (168), Student (131), Student (61), Student (24), Student (263), Student (88), Student (74), Student (283), Student (275), Student (1), Student (169), Student (231), Student (98), Student (318), Student (209), Student (323), Student (82), Student (32), Student (200), Student (140), Student (153), Student (102), Student (100), Student (212), Student (205), Student (182), Student (232), Student (213), Student (176), Student (111), Student (304), Student (68), Student (242), Student (279), Student (183), Student (57), Student (185), Student (195), Student (269), Student (308), Student (46), Student (112), Student (277), Student (268), Student (192), Student (235), Student (19), Student (50), Student (303), Student (6), Student (174), Student (237), Student (312), Student (229), Student (290), Student (166), Student (202), Student (96), Student (130), Student (84), Student (152), Student (126), Student (292), Student (154), Student (27), Student (14), Student (18), Student (127), Student (39), Student (325), Student (288), Student (291), Student (156)]
0,065	0,421	0,034	0,062	[Student (99), Student (321), Student (81), Student (109), Student (144), Student (274), Student (139), Student (245), Student (150), Student (266), Student (93), Student (11), Student (289), Student (159), Student (247), Student (214), Student (171), Student (7), Student (89), Student (241), Student (25), Student (117), Student (261), Student (4), Student (5), Student (78), Student (178), Student (128), Student (184), Student (124), Student (226), Student (328), Student (3), Student (160), Student (136), Student (2), Student (188), Student (287), Student (270)]
0,205	0,592	0,028	0,053	[Student (71), Student (251), Student (186), Student (201), Student (60), Student (113), Student (218), Student (309), Student (116), Student (219), Student (271), Student (258), Student (70), Student (79), Student (175), Student (208), Student (324), Student (69), Student (315), Student (203), Student (284), Student (134), Student (273), Student (249), Student (311), Student (135), Student (87), Student (64), Student (262), Student (163), Student (239), Student (48), Student (254), Student (106), Student (299), Student (197), Student (121), Student (210), Student (56), Student (49), Student (110), Student (43), Student (167), Student (77), Student (207), Student (250), Student (221), Student (142), Student (29), Student (302), Student (189), Student (40), Student (51), Student (83), Student (307), Student (297), Student (180), Student (240), Student (45), Student (225), Student (147), Student (20), Student (149), Student (59), Student (125), Student (187), Student (281), Student (151), Student (215), Student (161), Student (90), Student (300), Student (294), Student (122), Student (66), Student (236), Student (75), Student (91), Student (198), Student (276), Student (326), Student (137), Student (173), Student (36), Student (190), Student (145), Student (260)]
Community Strength: 0.56445
Clusters: 7

C

cs	ncsm	avgSim	combined	members
0,013	0,276	0,169	0,210	[Student (28), Student (233), Student (89), Student (35), Student (96), Student (235), Student (227), Student (228)]
0,136	0,553	0,128	0,208	[Student (198), Student (211), Student (167), Student (148), Student (162), Student (123), Student (46), Student (176), Student (106), Student (121), Student (88), Student (54), Student (149), Student (41), Student (136), Student (172), Student (173), Student (39), Student (103), Student (208), Student (184), Student (195), Student (161), Student (207), Student (213), Student (122), Student (221), Student (237), Student (85), Student (140), Student (5), Student (51), Student (62), Student (160), Student (105), Student (86), Student (126), Student (18), Student (31), Student (204), Student (119)]
0,105	0,403	0,094	0,152	[Student (190), Student (222), Student (102), Student (187), Student (183), Student (219), Student (93), Student (40), Student (177), Student (164), Student (59), Student (50), Student (43), Student (80), Student (78), Student (181), Student (117), Student (10), Student (217), Student (42), Student (61), Student (114), Student (139), Student (192), Student (215), Student (155), Student (132), Student (36), Student (104), Student (125), Student (163), Student (81), Student (60), Student (205), Student (38), Student (73), Student (147), Student (194), Student (6), Student (110), Student (112), Student (185), Student (49)]
0,026	0,267	0,084	0,128	[Student (3), Student (47), Student (90), Student (174), Student (220), Student (71), Student (7), Student (74), Student (23), Student (135), Student (186), Student (218), Student (91), Student (133), Student (70), Student (94)]
0,001	0,078	0,325	0,125	[Student (238), Student (146)]
0,004	0,120	0,127	0,123	[Student (13), Student (128), Student (120), Student (107), Student (202)]
0,001	0,076	0,316	0,122	[Student (55), Student (223)]
0,034	0,299	0,074	0,119	[Student (166), Student (212), Student (236), Student (229), Student (26), Student (134), Student (188), Student (216), Student (22), Student (154), Student (33), Student (21), Student (169), Student (171), Student (29), Student (19), Student (137), Student (231), Student (203)]
0,037	0,280	0,063	0,103	[Student (17), Student (4), Student (44), Student (116), Student (151), Student (201), Student (95), Student (32), Student (101), Student (239), Student (52), Student (99), Student (118), Student (9), Student (129), Student (131), Student (24), Student (97), Student (113), Student (145), Student (2), Student (67)]
0,006	0,116	0,073	0,089	[Student (57), Student (64), Student (152), Student (232), Student (79), Student (157), Student (111), Student (168)]
Community Strength: 0.36244968
Clusters: 10

Has this any correspondence to the true positives? Is the average similarity alone better?

sebinside · 2021-12-17T08:50:36Z

@SimDing I evaluated the true positives and do have their pseudo-ID. So make your bet: Which students plagiarised?

Concerning issue #116. This adds the clustering that was removed in #89 again, not including any user interfaces. The clustering is implemented in the de.jplag.clustering package. It includes two clustering algorithms (spectral and agglomerative), preprocessing, decoupling logic, clustering options and a factory class which can be used to run the clustering in just two statements.

sebinside added enhancement Issue/PR that involves features, improvements and other changes major Major issue/feature/contribution/change labels May 31, 2021

tsaglam added the PISE-WS21/22 Tasks for the PISE practical course (WS21/22) label Oct 12, 2021

tsaglam added this to the v3.1.0 milestone Oct 12, 2021

tsaglam assigned SimDing Nov 9, 2021

tsaglam pinned this issue Nov 14, 2021

SimDing mentioned this issue Jan 28, 2022

Readd clustering #281

Merged

tsaglam linked a pull request Feb 9, 2022 that will close this issue

Readd clustering #281

Merged

tsaglam closed this as completed in #281 Feb 24, 2022

sebinside unpinned this issue Mar 18, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Readd clustering and min/max/avg scores #116

Readd clustering and min/max/avg scores #116

sebinside commented May 31, 2021 •

edited by tsaglam

Loading

tsaglam commented Oct 12, 2021 •

edited

Loading

SimDing commented Nov 8, 2021 •

edited

Loading

tsaglam commented Nov 9, 2021 •

edited

Loading

SimDing commented Nov 18, 2021

tsaglam commented Nov 19, 2021 •

edited

Loading

sebinside commented Nov 24, 2021

tsaglam commented Dec 3, 2021

sebinside commented Dec 16, 2021

SimDing commented Dec 16, 2021

SimDing commented Dec 16, 2021

sebinside commented Dec 17, 2021

Readd clustering and min/max/avg scores #116

Readd clustering and min/max/avg scores #116

Comments

sebinside commented May 31, 2021 • edited by tsaglam Loading

tsaglam commented Oct 12, 2021 • edited Loading

SimDing commented Nov 8, 2021 • edited Loading

tsaglam commented Nov 9, 2021 • edited Loading

SimDing commented Nov 18, 2021

tsaglam commented Nov 19, 2021 • edited Loading

sebinside commented Nov 24, 2021

tsaglam commented Dec 3, 2021

sebinside commented Dec 16, 2021

SimDing commented Dec 16, 2021

SimDing commented Dec 16, 2021

sebinside commented Dec 17, 2021

sebinside commented May 31, 2021 •

edited by tsaglam

Loading

tsaglam commented Oct 12, 2021 •

edited

Loading

SimDing commented Nov 8, 2021 •

edited

Loading

tsaglam commented Nov 9, 2021 •

edited

Loading

tsaglam commented Nov 19, 2021 •

edited

Loading