-
Notifications
You must be signed in to change notification settings - Fork 328
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Readd clustering and min/max/avg scores #116
Comments
These functionalities should be restored and integrated into the result objects (and thus in the report generation). Functionally, they should be roughly the same as in the legacy version of JPlag. How this functionality is implemented, can differ from the legacy implementation. Clustering
Max similarity
Steps to do:
|
The "Max similarity" is really just the maximum of both of the similarities. There are two other metrics calculated at that place, but omitted in the report:
When looking at it from a vulnerability perspective, both of them are worse than than max similarity. The only advantage over plain similarity I can see in them is that they are symmetric. It seems like IOU was used for clustering. I don't plan to add Min similarity again, I might use IOU for clustering again, but I guess I won't add it to the report. However I am playing with the thought of adding additional metrics that disregard submission size. |
Good to have that confirmed, then I would say let us focus on clustering first and after that, you can experiment with additional metrics.
Yes, you do not need to add it to the report generation, as this is done as part of #192 for the new report anyways. |
Should it be possible to use different clusterings in the same JPlag run? For example:
|
Here are some thoughts:
This might be a scenario that brings complexity to the users' workflow but might not bring that much information. At least that is my first intuition.
Clustering according to max-similarity could be promising. For min-similarity, I have not even found a use case yet for pairwise comparison. Tl;dr I think we should first find out if multiple clusterings can be useful for realistic submissions sets, before we should put much time into enabling multiple clusterings. However, the idea sounds interesting. |
@SimDing I added the (private) PseudonymizedReports repository with some reports and CSV files that should help you evaluate your clustering. Good luck! |
@SimDing I thought again about our discussion yesterday regarding the inconvenience of having to specify the number of clusters beforehand. After a bit of consultation, I think we should try hierarchical clustering, e.g. agglomerative clustering. There is even a library for that (GitHub/Maven Central). What do you think? |
@SimDing Regarding licenses: It should be just as I said: LGPL in GPL is no problem. Also discussed, e.g., here: https://softwareengineering.stackexchange.com/a/342158 |
@tsaglam This is what was implemented in the legacy version, spectral clustering usually yields better results, but I will leave it as an option.
@sebinside |
@sebinside @tsaglam I've thought about your suggestions regarding the average similarity inside of the clusters. I think it makes sense to combine both metrics into a single one and order the clusters by the resulting metric. This is how it would look like (using the 3 legacy reports from the pseudonymized repo): cs: community strength; criterion for quality of clustering; sum(inner cluster similarity) - sum(inter cluster similarity)^2 Acs ncsm avgSim combined members 0,006 0,339 0,555 0,421 [Student (175), Student (176), Student (329), Student (29)] 0,003 0,179 0,298 0,224 [Student (68), Student (298), Student (279), Student (323)] 0,001 0,134 0,653 0,222 [Student (398), Student (339)] 0,001 0,129 0,632 0,215 [Student (417), Student (299)] 0,003 0,170 0,277 0,211 [Student (87), Student (422), Student (28), Student (314)] 0,133 0,508 0,129 0,206 [Student (203), Student (315), Student (34), Student (161), Student (124), Student (402), Student (262), Student (71), Student (426), Student (39), Student (348), Student (119), Student (190), Student (359), Student (11), Student (407), Student (272), Student (169), Student (229), Student (135), Student (173), Student (127), Student (326), Student (432), Student (420), Student (414), Student (373), Student (437), Student (443), Student (331), Student (178), Student (244), Student (106), Student (282), Student (257), Student (448), Student (133), Student (340), Student (302), Student (193), Student (30), Student (137), Student (72), Student (332), Student (134), Student (109), Student (342), Student (449), Student (386), Student (18), Student (438), Student (361), Student (291), Student (179), Student (202), Student (211), Student (419), Student (46), Student (206), Student (413), Student (324), Student (242), Student (388), Student (445)] 0,004 0,186 0,232 0,206 [Student (141), Student (433), Student (198), Student (385), Student (170)] 0,004 0,179 0,224 0,199 [Student (104), Student (139), Student (399), Student (208), Student (122)] 0,001 0,119 0,582 0,198 [Student (353), Student (140)] 0,005 0,196 0,194 0,195 [Student (70), Student (194), Student (446), Student (116), Student (404), Student (186)] 0,001 0,118 0,574 0,195 [Student (312), Student (295)] 0,001 0,117 0,570 0,194 [Student (336), Student (294)] 0,005 0,191 0,189 0,190 [Student (84), Student (382), Student (123), Student (136), Student (310), Student (360)] 0,007 0,227 0,161 0,188 [Student (145), Student (248), Student (376), Student (355), Student (185), Student (352), Student (344), Student (403)] 0,008 0,227 0,158 0,186 [Student (357), Student (48), Student (412), Student (132), Student (35), Student (440), Student (180), Student (52), Student (102)] 0,001 0,111 0,543 0,185 [Student (93), Student (316)] 0,001 0,111 0,540 0,184 [Student (405), Student (22)] 0,001 0,110 0,539 0,183 [Student (251), Student (101)] 0,008 0,226 0,145 0,176 [Student (321), Student (160), Student (130), Student (350), Student (233), Student (264), Student (43), Student (365), Student (241)] 0,087 0,558 0,095 0,162 [Student (255), Student (283), Student (325), Student (83), Student (85), Student (333), Student (212), Student (177), Student (216), Student (293), Student (254), Student (380), Student (275), Student (78), Student (114), Student (174), Student (129), Student (121), Student (113), Student (21), Student (338), Student (318), Student (210), Student (73), Student (191), Student (197), Student (309), Student (423), Student (49), Student (12), Student (41), Student (243), Student (157), Student (259), Student (263), Student (66), Student (91), Student (27)] 0,010 0,233 0,117 0,155 [Student (54), Student (59), Student (271), Student (370), Student (330), Student (390), Student (364), Student (38), Student (253), Student (81), Student (268)] 0,007 0,196 0,125 0,152 [Student (26), Student (252), Student (256), Student (64), Student (214), Student (384), Student (401), Student (320), Student (143)] 0,027 0,328 0,090 0,141 [Student (42), Student (238), Student (334), Student (213), Student (55), Student (37), Student (273), Student (397), Student (237), Student (144), Student (427), Student (4), Student (171), Student (162), Student (167), Student (90), Student (431), Student (183), Student (218), Student (7)] 0,018 0,238 0,097 0,138 [Student (409), Student (107), Student (125), Student (77), Student (226), Student (14), Student (69), Student (230), Student (222), Student (366), Student (375), Student (209), Student (6), Student (103), Student (15), Student (117), Student (51), Student (285)] 0,005 0,146 0,089 0,111 [Student (356), Student (204), Student (36), Student (436), Student (377), Student (47), Student (265), Student (277), Student (156)] Community Strength: 0.3465919 Clusters: 25 Bcs ncsm avgSim combined members 0,026 0,600 0,167 0,261 [Student (329), Student (164), Student (108), Student (224), Student (220), Student (95), Student (230), Student (191), Student (265), Student (67), Student (47)] 0,004 0,214 0,144 0,172 [Student (63), Student (138), Student (319), Student (243), Student (23)] 0,004 0,162 0,086 0,112 [Student (76), Student (158), Student (44), Student (42), Student (119), Student (94)] 0,046 0,447 0,054 0,096 [Student (31), Student (223), Student (143), Student (204), Student (58), Student (267), Student (280), Student (22), Student (38), Student (216), Student (55), Student (246), Student (322), Student (35), Student (256), Student (330), Student (233), Student (162), Student (313), Student (17), Student (206), Student (104), Student (278), Student (157), Student (272), Student (196)] 0,213 0,695 0,045 0,085 [Student (62), Student (21), Student (327), Student (28), Student (168), Student (131), Student (61), Student (24), Student (263), Student (88), Student (74), Student (283), Student (275), Student (1), Student (169), Student (231), Student (98), Student (318), Student (209), Student (323), Student (82), Student (32), Student (200), Student (140), Student (153), Student (102), Student (100), Student (212), Student (205), Student (182), Student (232), Student (213), Student (176), Student (111), Student (304), Student (68), Student (242), Student (279), Student (183), Student (57), Student (185), Student (195), Student (269), Student (308), Student (46), Student (112), Student (277), Student (268), Student (192), Student (235), Student (19), Student (50), Student (303), Student (6), Student (174), Student (237), Student (312), Student (229), Student (290), Student (166), Student (202), Student (96), Student (130), Student (84), Student (152), Student (126), Student (292), Student (154), Student (27), Student (14), Student (18), Student (127), Student (39), Student (325), Student (288), Student (291), Student (156)] 0,065 0,421 0,034 0,062 [Student (99), Student (321), Student (81), Student (109), Student (144), Student (274), Student (139), Student (245), Student (150), Student (266), Student (93), Student (11), Student (289), Student (159), Student (247), Student (214), Student (171), Student (7), Student (89), Student (241), Student (25), Student (117), Student (261), Student (4), Student (5), Student (78), Student (178), Student (128), Student (184), Student (124), Student (226), Student (328), Student (3), Student (160), Student (136), Student (2), Student (188), Student (287), Student (270)] 0,205 0,592 0,028 0,053 [Student (71), Student (251), Student (186), Student (201), Student (60), Student (113), Student (218), Student (309), Student (116), Student (219), Student (271), Student (258), Student (70), Student (79), Student (175), Student (208), Student (324), Student (69), Student (315), Student (203), Student (284), Student (134), Student (273), Student (249), Student (311), Student (135), Student (87), Student (64), Student (262), Student (163), Student (239), Student (48), Student (254), Student (106), Student (299), Student (197), Student (121), Student (210), Student (56), Student (49), Student (110), Student (43), Student (167), Student (77), Student (207), Student (250), Student (221), Student (142), Student (29), Student (302), Student (189), Student (40), Student (51), Student (83), Student (307), Student (297), Student (180), Student (240), Student (45), Student (225), Student (147), Student (20), Student (149), Student (59), Student (125), Student (187), Student (281), Student (151), Student (215), Student (161), Student (90), Student (300), Student (294), Student (122), Student (66), Student (236), Student (75), Student (91), Student (198), Student (276), Student (326), Student (137), Student (173), Student (36), Student (190), Student (145), Student (260)] Community Strength: 0.56445 Clusters: 7 Ccs ncsm avgSim combined members 0,013 0,276 0,169 0,210 [Student (28), Student (233), Student (89), Student (35), Student (96), Student (235), Student (227), Student (228)] 0,136 0,553 0,128 0,208 [Student (198), Student (211), Student (167), Student (148), Student (162), Student (123), Student (46), Student (176), Student (106), Student (121), Student (88), Student (54), Student (149), Student (41), Student (136), Student (172), Student (173), Student (39), Student (103), Student (208), Student (184), Student (195), Student (161), Student (207), Student (213), Student (122), Student (221), Student (237), Student (85), Student (140), Student (5), Student (51), Student (62), Student (160), Student (105), Student (86), Student (126), Student (18), Student (31), Student (204), Student (119)] 0,105 0,403 0,094 0,152 [Student (190), Student (222), Student (102), Student (187), Student (183), Student (219), Student (93), Student (40), Student (177), Student (164), Student (59), Student (50), Student (43), Student (80), Student (78), Student (181), Student (117), Student (10), Student (217), Student (42), Student (61), Student (114), Student (139), Student (192), Student (215), Student (155), Student (132), Student (36), Student (104), Student (125), Student (163), Student (81), Student (60), Student (205), Student (38), Student (73), Student (147), Student (194), Student (6), Student (110), Student (112), Student (185), Student (49)] 0,026 0,267 0,084 0,128 [Student (3), Student (47), Student (90), Student (174), Student (220), Student (71), Student (7), Student (74), Student (23), Student (135), Student (186), Student (218), Student (91), Student (133), Student (70), Student (94)] 0,001 0,078 0,325 0,125 [Student (238), Student (146)] 0,004 0,120 0,127 0,123 [Student (13), Student (128), Student (120), Student (107), Student (202)] 0,001 0,076 0,316 0,122 [Student (55), Student (223)] 0,034 0,299 0,074 0,119 [Student (166), Student (212), Student (236), Student (229), Student (26), Student (134), Student (188), Student (216), Student (22), Student (154), Student (33), Student (21), Student (169), Student (171), Student (29), Student (19), Student (137), Student (231), Student (203)] 0,037 0,280 0,063 0,103 [Student (17), Student (4), Student (44), Student (116), Student (151), Student (201), Student (95), Student (32), Student (101), Student (239), Student (52), Student (99), Student (118), Student (9), Student (129), Student (131), Student (24), Student (97), Student (113), Student (145), Student (2), Student (67)] 0,006 0,116 0,073 0,089 [Student (57), Student (64), Student (152), Student (232), Student (79), Student (157), Student (111), Student (168)] Community Strength: 0.36244968 Clusters: 10 Has this any correspondence to the true positives? Is the average similarity alone better? |
@SimDing I evaluated the true positives and do have their pseudo-ID. So make your bet: Which students plagiarised? |
Concerning issue #116. This adds the clustering that was removed in #89 again, not including any user interfaces. The clustering is implemented in the de.jplag.clustering package. It includes two clustering algorithms (spectral and agglomerative), preprocessing, decoupling logic, clustering options and a factory class which can be used to run the clustering in just two statements.
Several functionalities were removed by #89 but should be restored. This includes:
The text was updated successfully, but these errors were encountered: