Readd clustering and min/max/avg scores #116

Closed
3 tasks done
sebinside opened this issue May 31, 2021 · 11 comments · Fixed by #281
sebinside (Member) commented May 31, 2021

Several functionalities were removed by #89 but should be restored. This includes:

  • Clustering of multiple submissions based on similarity (groups of plagiarism)
  • Max similarity to detect plagiarism even when one solution is obfuscated with additional code
  • Min similarity (lower priority, maybe not relevant at all)
sebinside added the enhancement and major labels on May 31, 2021
tsaglam added the PISE-WS21/22 label on Oct 12, 2021
tsaglam added this to the v3.1.0 milestone on Oct 12, 2021
tsaglam (Member) commented Oct 12, 2021

These functionalities should be restored and integrated into the result objects (and thus into the report generation). Functionally, they should be roughly the same as in the legacy version of JPlag; how this functionality is implemented can differ from the legacy implementation.

Clustering

  • The clustering can be seen in the legacy web report (internal).
  • The clustering should be accessible via the JPlagResult class.
  • The legacy implementation can be mainly found in the clustering package.
  • How does the legacy implementation compare to other clustering methods?

Max similarity

  • Max similarity can be seen in the legacy web report (internal).
  • It is important to find out how max similarity was implemented in the legacy version and what remains to be done in the current state.
  • The differences between max and avg similarity are described as follows:

JPlag compares programs by trying to cover one of the programs with (preferably big) sequences from the other program. The coverage of a program is defined as the ratio of covered source code to the total size of the source code. The average similarity is defined as the average of both program coverages. This is the default similarity that works in most cases: Matches with a high average similarity indicate that the programs work in a very similar way. The maximum similarity is defined as the maximum of both program coverages. This ranking is especially useful if the programs are very different in size. This can happen when dead code was inserted to disguise the origin of the plagiarized program.
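
For illustration, a minimal sketch of how the two rankings relate to the program coverages described above (the method names are illustrative only, not the JPlag API):

```java
// Illustrative sketch, not the JPlag API: both similarities are derived from
// the two program coverages, i.e. the fraction of each program's source that
// is covered by matched sequences (values in [0, 1]).
static double averageSimilarity(double coverageA, double coverageB) {
    return (coverageA + coverageB) / 2; // default metric, works in most cases
}

static double maximumSimilarity(double coverageA, double coverageB) {
    return Math.max(coverageA, coverageB); // robust when one program is padded with dead code
}
```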

Steps to do:

  • Carefully re-implement both features
  • Verify the functionality with test cases
  • Merge it into the master branch
  • Document both the clustering and the max similarity in the wiki

SimDing (Contributor) commented Nov 8, 2021

The "Max similarity" is really just the maximum of both of the similarities.

There are two other metrics calculated at that point but omitted from the report:

  • Min similarity: like max similarity but with minimum
  • Avg similarity: intersection over union

When looking at it from a vulnerability perspective, both of them are worse than max similarity. The only advantage over plain similarity I can see in them is that they are symmetric. It seems like IOU was used for clustering. I don't plan to add Min similarity again. I might use IOU for clustering again, but I guess I won't add it to the report.
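
For reference, a rough sketch of these three metrics in terms of token counts, assuming the same number of matched tokens on both sides (the names are illustrative, not JPlag identifiers):

```java
// Sketch under assumed inputs: matchedTokens is the number of tokens covered by
// matches, tokensA and tokensB are the token counts of the two submissions.
static double maxSimilarity(int matchedTokens, int tokensA, int tokensB) {
    return (double) matchedTokens / Math.min(tokensA, tokensB); // max of both coverages
}

static double minSimilarity(int matchedTokens, int tokensA, int tokensB) {
    return (double) matchedTokens / Math.max(tokensA, tokensB); // min of both coverages
}

// "Avg similarity" as used here: intersection over union of the token sets.
static double iouSimilarity(int matchedTokens, int tokensA, int tokensB) {
    return (double) matchedTokens / (tokensA + tokensB - matchedTokens);
}
```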

However, I am toying with the idea of adding additional metrics that disregard submission size.

tsaglam (Member) commented Nov 9, 2021

Good to have that confirmed. Then I would say let us focus on clustering first, and after that you can experiment with additional metrics.

> I might use IOU for clustering again, but I guess I won't add it to the report.

Yes, you do not need to add it to the report generation, as this is done as part of #192 for the new report anyway.
Just make sure to provide the clustering as part of the result object in some meaningful way.

tsaglam pinned this issue on Nov 14, 2021
SimDing (Contributor) commented Nov 18, 2021

Should it be possible to use different clusterings in the same JPlag run? For example:

  1. Do hierarchical clustering twice, but use a different threshold and inter-cluster-similarity each time
  2. Cluster with max-similarity, min-similarity and avg-similarity

tsaglam (Member) commented Nov 19, 2021

> Should it be possible to use different clusterings in the same JPlag run? For example:

Here are some thoughts:

  • At first glance, different types of clusterings seem useful.
  • However, we do not yet know whether they really provide more useful information for the users in the end; we can test that.
  • Looking at the clusters in the new report should not be complicated; it should not feel like a parameter-tuning task where you need to check different clustering types and see if they make a difference for your JPlag run.
  • So if we allow multiple clusterings because it is actually useful, we should limit it to maybe 2-3 types max.
> 1. Do hierarchical clustering twice, but use a different threshold and inter-cluster-similarity each time

This might be a scenario that brings complexity to the users' workflow but might not bring that much information. At least that is my first intuition.

> 2. Cluster with max-similarity, min-similarity and avg-similarity

Clustering according to max-similarity could be promising. For min-similarity, I have not even found a use case yet for pairwise comparison.

Tl;dr: I think we should first find out whether multiple clusterings can be useful for realistic submission sets before we put much time into enabling them. However, the idea sounds interesting.

sebinside (Member, Author) commented:

@SimDing I added the (private) PseudonymizedReports repository with some reports and CSV files that should help you evaluate your clustering. Good luck!

tsaglam (Member) commented Dec 3, 2021

@SimDing I thought again about our discussion yesterday regarding the inconvenience of having to specify the number of clusters beforehand. After a bit of consultation, I think we should try hierarchical clustering, e.g. agglomerative clustering. There is even a library for that (GitHub/Maven Central).

What do you think?
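
For context, a rough sketch of the agglomerative idea on a pairwise similarity matrix, independent of any particular library (all names are illustrative, not a proposed API):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of agglomerative clustering over a similarity matrix with
// values in [0, 1]. It repeatedly merges the most similar pair of clusters until
// no inter-cluster similarity exceeds the threshold, so the number of clusters
// does not need to be fixed upfront.
class AgglomerativeSketch {

    static List<List<Integer>> cluster(double[][] similarity, double threshold) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < similarity.length; i++) {
            clusters.add(new ArrayList<>(List.of(i))); // start with singleton clusters
        }
        while (clusters.size() > 1) {
            int bestA = -1, bestB = -1;
            double best = threshold;
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double link = completeLinkage(similarity, clusters.get(a), clusters.get(b));
                    if (link > best) {
                        best = link;
                        bestA = a;
                        bestB = b;
                    }
                }
            }
            if (bestA < 0) {
                break; // no remaining pair is similar enough to merge
            }
            clusters.get(bestA).addAll(clusters.remove(bestB)); // bestB > bestA, so index stays valid
        }
        return clusters;
    }

    // Complete linkage: the similarity of two clusters is the minimum pairwise similarity.
    private static double completeLinkage(double[][] similarity, List<Integer> a, List<Integer> b) {
        double min = 1.0;
        for (int i : a) {
            for (int j : b) {
                min = Math.min(min, similarity[i][j]);
            }
        }
        return min;
    }
}
```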

sebinside (Member, Author) commented:

@SimDing Regarding licenses, it is just as I said: LGPL in GPL is no problem. This is also discussed, e.g., here: https://softwareengineering.stackexchange.com/a/342158

SimDing (Contributor) commented Dec 16, 2021

> @SimDing I thought again about our discussion yesterday regarding the inconvenience of having to specify the number of clusters beforehand. After a bit of consultation, I think we should try hierarchical clustering, e.g. agglomerative clustering. There is even a library for that (GitHub/Maven Central).
>
> What do you think?

@tsaglam This is what was implemented in the legacy version. Spectral clustering usually yields better results, but I will leave agglomerative clustering as an option.

> @SimDing Regarding licenses, it is just as I said: LGPL in GPL is no problem. This is also discussed, e.g., here: https://softwareengineering.stackexchange.com/a/342158

@sebinside Great, then I'll just add it as is and put a comment about the origin.

SimDing (Contributor) commented Dec 16, 2021

@sebinside @tsaglam I've thought about your suggestions regarding the average similarity inside the clusters. I think it makes sense to combine both metrics into a single one and order the clusters by the resulting metric.

This is what it would look like (using the 3 legacy reports from the pseudonymized repo):

  • cs: community strength; criterion for quality of clustering; sum(inner cluster similarity) - sum(inter cluster similarity)^2
  • ncsm: normalized community strength per member; the metric for evidence strength I suggested
  • avgSim: average (max-) similarity inside cluster
  • combined: harmonic mean of ncsm and avgSim
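
A minimal sketch of the combined score, assuming ncsm and avgSim are already computed per cluster (the helper is hypothetical, not part of the JPlag code base):

```java
// Hypothetical helper, not part of the JPlag code base: ranks a cluster by the
// harmonic mean of its normalized community strength per member (ncsm) and its
// average similarity (avgSim).
static double combinedScore(double ncsm, double avgSim) {
    if (ncsm <= 0 || avgSim <= 0) {
        return 0; // harmonic mean degenerates for non-positive inputs
    }
    return 2 * ncsm * avgSim / (ncsm + avgSim);
}
```

For example, combinedScore(0.339, 0.555) ≈ 0.421, which matches the first row of report A below.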

A
cs	ncsm	avgSim	combined	members
0,006	0,339	0,555	0,421	[Student (175), Student (176), Student (329), Student (29)]
0,003	0,179	0,298	0,224	[Student (68), Student (298), Student (279), Student (323)]
0,001	0,134	0,653	0,222	[Student (398), Student (339)]
0,001	0,129	0,632	0,215	[Student (417), Student (299)]
0,003	0,170	0,277	0,211	[Student (87), Student (422), Student (28), Student (314)]
0,133	0,508	0,129	0,206	[Student (203), Student (315), Student (34), Student (161), Student (124), Student (402), Student (262), Student (71), Student (426), Student (39), Student (348), Student (119), Student (190), Student (359), Student (11), Student (407), Student (272), Student (169), Student (229), Student (135), Student (173), Student (127), Student (326), Student (432), Student (420), Student (414), Student (373), Student (437), Student (443), Student (331), Student (178), Student (244), Student (106), Student (282), Student (257), Student (448), Student (133), Student (340), Student (302), Student (193), Student (30), Student (137), Student (72), Student (332), Student (134), Student (109), Student (342), Student (449), Student (386), Student (18), Student (438), Student (361), Student (291), Student (179), Student (202), Student (211), Student (419), Student (46), Student (206), Student (413), Student (324), Student (242), Student (388), Student (445)]
0,004	0,186	0,232	0,206	[Student (141), Student (433), Student (198), Student (385), Student (170)]
0,004	0,179	0,224	0,199	[Student (104), Student (139), Student (399), Student (208), Student (122)]
0,001	0,119	0,582	0,198	[Student (353), Student (140)]
0,005	0,196	0,194	0,195	[Student (70), Student (194), Student (446), Student (116), Student (404), Student (186)]
0,001	0,118	0,574	0,195	[Student (312), Student (295)]
0,001	0,117	0,570	0,194	[Student (336), Student (294)]
0,005	0,191	0,189	0,190	[Student (84), Student (382), Student (123), Student (136), Student (310), Student (360)]
0,007	0,227	0,161	0,188	[Student (145), Student (248), Student (376), Student (355), Student (185), Student (352), Student (344), Student (403)]
0,008	0,227	0,158	0,186	[Student (357), Student (48), Student (412), Student (132), Student (35), Student (440), Student (180), Student (52), Student (102)]
0,001	0,111	0,543	0,185	[Student (93), Student (316)]
0,001	0,111	0,540	0,184	[Student (405), Student (22)]
0,001	0,110	0,539	0,183	[Student (251), Student (101)]
0,008	0,226	0,145	0,176	[Student (321), Student (160), Student (130), Student (350), Student (233), Student (264), Student (43), Student (365), Student (241)]
0,087	0,558	0,095	0,162	[Student (255), Student (283), Student (325), Student (83), Student (85), Student (333), Student (212), Student (177), Student (216), Student (293), Student (254), Student (380), Student (275), Student (78), Student (114), Student (174), Student (129), Student (121), Student (113), Student (21), Student (338), Student (318), Student (210), Student (73), Student (191), Student (197), Student (309), Student (423), Student (49), Student (12), Student (41), Student (243), Student (157), Student (259), Student (263), Student (66), Student (91), Student (27)]
0,010	0,233	0,117	0,155	[Student (54), Student (59), Student (271), Student (370), Student (330), Student (390), Student (364), Student (38), Student (253), Student (81), Student (268)]
0,007	0,196	0,125	0,152	[Student (26), Student (252), Student (256), Student (64), Student (214), Student (384), Student (401), Student (320), Student (143)]
0,027	0,328	0,090	0,141	[Student (42), Student (238), Student (334), Student (213), Student (55), Student (37), Student (273), Student (397), Student (237), Student (144), Student (427), Student (4), Student (171), Student (162), Student (167), Student (90), Student (431), Student (183), Student (218), Student (7)]
0,018	0,238	0,097	0,138	[Student (409), Student (107), Student (125), Student (77), Student (226), Student (14), Student (69), Student (230), Student (222), Student (366), Student (375), Student (209), Student (6), Student (103), Student (15), Student (117), Student (51), Student (285)]
0,005	0,146	0,089	0,111	[Student (356), Student (204), Student (36), Student (436), Student (377), Student (47), Student (265), Student (277), Student (156)]
Community Strength: 0.3465919
Clusters: 25
B
cs	ncsm	avgSim	combined	members
0,026	0,600	0,167	0,261	[Student (329), Student (164), Student (108), Student (224), Student (220), Student (95), Student (230), Student (191), Student (265), Student (67), Student (47)]
0,004	0,214	0,144	0,172	[Student (63), Student (138), Student (319), Student (243), Student (23)]
0,004	0,162	0,086	0,112	[Student (76), Student (158), Student (44), Student (42), Student (119), Student (94)]
0,046	0,447	0,054	0,096	[Student (31), Student (223), Student (143), Student (204), Student (58), Student (267), Student (280), Student (22), Student (38), Student (216), Student (55), Student (246), Student (322), Student (35), Student (256), Student (330), Student (233), Student (162), Student (313), Student (17), Student (206), Student (104), Student (278), Student (157), Student (272), Student (196)]
0,213	0,695	0,045	0,085	[Student (62), Student (21), Student (327), Student (28), Student (168), Student (131), Student (61), Student (24), Student (263), Student (88), Student (74), Student (283), Student (275), Student (1), Student (169), Student (231), Student (98), Student (318), Student (209), Student (323), Student (82), Student (32), Student (200), Student (140), Student (153), Student (102), Student (100), Student (212), Student (205), Student (182), Student (232), Student (213), Student (176), Student (111), Student (304), Student (68), Student (242), Student (279), Student (183), Student (57), Student (185), Student (195), Student (269), Student (308), Student (46), Student (112), Student (277), Student (268), Student (192), Student (235), Student (19), Student (50), Student (303), Student (6), Student (174), Student (237), Student (312), Student (229), Student (290), Student (166), Student (202), Student (96), Student (130), Student (84), Student (152), Student (126), Student (292), Student (154), Student (27), Student (14), Student (18), Student (127), Student (39), Student (325), Student (288), Student (291), Student (156)]
0,065	0,421	0,034	0,062	[Student (99), Student (321), Student (81), Student (109), Student (144), Student (274), Student (139), Student (245), Student (150), Student (266), Student (93), Student (11), Student (289), Student (159), Student (247), Student (214), Student (171), Student (7), Student (89), Student (241), Student (25), Student (117), Student (261), Student (4), Student (5), Student (78), Student (178), Student (128), Student (184), Student (124), Student (226), Student (328), Student (3), Student (160), Student (136), Student (2), Student (188), Student (287), Student (270)]
0,205	0,592	0,028	0,053	[Student (71), Student (251), Student (186), Student (201), Student (60), Student (113), Student (218), Student (309), Student (116), Student (219), Student (271), Student (258), Student (70), Student (79), Student (175), Student (208), Student (324), Student (69), Student (315), Student (203), Student (284), Student (134), Student (273), Student (249), Student (311), Student (135), Student (87), Student (64), Student (262), Student (163), Student (239), Student (48), Student (254), Student (106), Student (299), Student (197), Student (121), Student (210), Student (56), Student (49), Student (110), Student (43), Student (167), Student (77), Student (207), Student (250), Student (221), Student (142), Student (29), Student (302), Student (189), Student (40), Student (51), Student (83), Student (307), Student (297), Student (180), Student (240), Student (45), Student (225), Student (147), Student (20), Student (149), Student (59), Student (125), Student (187), Student (281), Student (151), Student (215), Student (161), Student (90), Student (300), Student (294), Student (122), Student (66), Student (236), Student (75), Student (91), Student (198), Student (276), Student (326), Student (137), Student (173), Student (36), Student (190), Student (145), Student (260)]
Community Strength: 0.56445
Clusters: 7
C
cs	ncsm	avgSim	combined	members
0,013	0,276	0,169	0,210	[Student (28), Student (233), Student (89), Student (35), Student (96), Student (235), Student (227), Student (228)]
0,136	0,553	0,128	0,208	[Student (198), Student (211), Student (167), Student (148), Student (162), Student (123), Student (46), Student (176), Student (106), Student (121), Student (88), Student (54), Student (149), Student (41), Student (136), Student (172), Student (173), Student (39), Student (103), Student (208), Student (184), Student (195), Student (161), Student (207), Student (213), Student (122), Student (221), Student (237), Student (85), Student (140), Student (5), Student (51), Student (62), Student (160), Student (105), Student (86), Student (126), Student (18), Student (31), Student (204), Student (119)]
0,105	0,403	0,094	0,152	[Student (190), Student (222), Student (102), Student (187), Student (183), Student (219), Student (93), Student (40), Student (177), Student (164), Student (59), Student (50), Student (43), Student (80), Student (78), Student (181), Student (117), Student (10), Student (217), Student (42), Student (61), Student (114), Student (139), Student (192), Student (215), Student (155), Student (132), Student (36), Student (104), Student (125), Student (163), Student (81), Student (60), Student (205), Student (38), Student (73), Student (147), Student (194), Student (6), Student (110), Student (112), Student (185), Student (49)]
0,026	0,267	0,084	0,128	[Student (3), Student (47), Student (90), Student (174), Student (220), Student (71), Student (7), Student (74), Student (23), Student (135), Student (186), Student (218), Student (91), Student (133), Student (70), Student (94)]
0,001	0,078	0,325	0,125	[Student (238), Student (146)]
0,004	0,120	0,127	0,123	[Student (13), Student (128), Student (120), Student (107), Student (202)]
0,001	0,076	0,316	0,122	[Student (55), Student (223)]
0,034	0,299	0,074	0,119	[Student (166), Student (212), Student (236), Student (229), Student (26), Student (134), Student (188), Student (216), Student (22), Student (154), Student (33), Student (21), Student (169), Student (171), Student (29), Student (19), Student (137), Student (231), Student (203)]
0,037	0,280	0,063	0,103	[Student (17), Student (4), Student (44), Student (116), Student (151), Student (201), Student (95), Student (32), Student (101), Student (239), Student (52), Student (99), Student (118), Student (9), Student (129), Student (131), Student (24), Student (97), Student (113), Student (145), Student (2), Student (67)]
0,006	0,116	0,073	0,089	[Student (57), Student (64), Student (152), Student (232), Student (79), Student (157), Student (111), Student (168)]
Community Strength: 0.36244968
Clusters: 10

Does this correspond to the true positives? Or is the average similarity alone better?

sebinside (Member, Author) commented:

@SimDing I evaluated the true positives and have their pseudo-IDs. So place your bet: which students plagiarised?

tsaglam linked a pull request on Feb 9, 2022 that will close this issue
tsaglam added a commit that referenced this issue Feb 24, 2022
Concerning issue #116: this re-adds the clustering that was removed in #89, not including any user interfaces.
The clustering is implemented in the de.jplag.clustering package. It includes two clustering algorithms (spectral and agglomerative), preprocessing, decoupling logic, clustering options, and a factory class that can be used to run the clustering in just two statements.
sebinside unpinned this issue on Mar 18, 2022