
EXP Plagiarism Detection for Programming Assignments #274

Closed
AlpacaMax opened this issue Oct 16, 2021 · 9 comments
@AlpacaMax (Collaborator) commented Oct 16, 2021

So I've been researching algorithms used for plagiarism detection in code. Unfortunately, this field of research has barely made any progress over the past 20 years. Most of the algorithms are either not resilient enough against code obfuscation or have serious performance issues. After reading some papers, I've reached the conclusion that a machine-learning-based approach is the most promising way to solve this.

I found three papers on this approach:
https://dl.acm.org/doi/pdf/10.1145/3021460.3021473
https://ieeexplore.ieee.org/abstract/document/8575900
https://www.sciencedirect.com/science/article/abs/pii/S0167739X18315528

The first one looks like a work in progress, and the proposed algorithm doesn't achieve much improvement over MOSS. I haven't read the second one yet. The third one looks super promising, but the proposed model is not really built for plagiarism detection, so we'd need to experiment with it.

The problem with the machine learning approach is always efficiency. I guess we need to make this plagiarism detection system a separate distributed system and let Anubis and it interact through RESTful APIs.
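A rough sketch of what a request to such a separate service might look like. Everything here is hypothetical: no such endpoint exists in Anubis, and the field names (`assignment_id`, `netid`, `repo_url`) are made up for illustration.

```python
import json

def build_compare_request(assignment_id, submissions):
    """Build the JSON body for a hypothetical POST /api/v1/compare call
    to a standalone plagiarism-detection service.

    submissions maps a student's netid to their repo URL; both names
    are illustrative, not part of any real Anubis API.
    """
    return json.dumps({
        "assignment_id": assignment_id,
        "submissions": [
            {"netid": netid, "repo_url": url}
            for netid, url in submissions.items()
        ],
    })

body = build_compare_request(
    "os-hw3",
    {"abc123": "https://github.com/example/os-hw3-abc123"},
)
print(body)
```

Keeping the detection service behind a small, versioned API like this would let it scale (and fail) independently of the main Anubis deployment.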

@AlpacaMax added the "feature", "experimental", and "backend" labels on Oct 16, 2021
@AlpacaMax AlpacaMax self-assigned this Oct 16, 2021
@wabscale wabscale mentioned this issue Dec 16, 2021
@AlpacaMax (Collaborator, Author)

Alright, here's an update on this card:

So I spent the winter break researching this topic. Apparently, the machine learning methods are not mature enough. Of all the ML models I found that try to do plagiarism detection, one does worse than MOSS, one makes too many assumptions about the input, and one doesn't work at all. The author of the last model basically admits in his paper that he tried a bunch of approaches, none of them actually worked, and then he moved on after finishing his PhD, having made zero progress in this field.

After some more digging I found these three papers:
http://ceur-ws.org/Vol-2259/aics_33.pdf
https://hal.archives-ouvertes.fr/hal-00780290/document
http://www.diva-portal.org/smash/get/diva2:548974/FULLTEXT01.pdf

Combined, these should form a complete AST-based plagiarism detection algorithm. I'm currently reading the papers and trying to implement an MVP as soon as possible. Will post updates if anything changes.
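As a very small illustration of the AST-based idea (not an implementation of the algorithms in those papers), Python's `ast` module is enough to show why structural fingerprints survive variable renaming: node types are compared instead of identifiers or raw text.

```python
import ast
from collections import Counter

def node_type_kgrams(source, k=3):
    """Collect k-grams of AST node-type names from an ast.walk traversal.
    Identifiers and literal values are ignored, so renaming variables
    does not change the fingerprint at all."""
    types = [type(n).__name__ for n in ast.walk(ast.parse(source))]
    return Counter(tuple(types[i:i + k]) for i in range(len(types) - k + 1))

def similarity(a, b):
    """Jaccard-style overlap of the two k-gram multisets, in [0, 1]."""
    ga, gb = node_type_kgrams(a), node_type_kgrams(b)
    union = sum((ga | gb).values())
    return sum((ga & gb).values()) / union if union else 0.0

original = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s"
renamed  = "def compute(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc"
print(similarity(original, renamed))  # identical structure -> 1.0
```

A real AST-based detector would also need tree canonicalization (e.g. to catch for/while swaps) and subtree matching, which is what those papers cover; this sketch only shows the rename-invariance property.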

@nickelpro commented Feb 5, 2022

I'm certain you came across this in your research already, but the "cutting edge", such as it is, in open-source plagiarism detection right now is JPlag. It's the only tool I know of that gets results comparable to MOSS and actually has code available for inspection, even if half the comments are in German.

@AlpacaMax (Collaborator, Author)

> I'm certain you came across this in your research already, but the "cutting edge", such as it is, in open-source plagiarism detection right now is JPlag. It's the only tool I know of that gets results comparable to MOSS and actually has code available for inspection, even if half the comments are in German.

I did come across that name in the literature. The problem is we need something better than just "comparable results to MOSS". I ran an experiment on MOSS last semester: I duplicated a C file I wrote, refactored two for loops into while loops, renamed all the variables, and added a bunch of random comments, and MOSS thought the "plagiarised" version was only 14% similar to the original, which is just unbelievable.
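For anyone who wants to reproduce a toy version of this experiment: text-level similarity collapses under exactly those edits. The sketch below uses Python's `difflib` rather than MOSS's winnowing fingerprints, so the number it prints is only illustrative of the effect, not of MOSS's actual score.

```python
import difflib

original = """
int sum(int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) {
        s += a[i];
    }
    return s;
}
"""

# The same function after the obfuscations described above:
# for -> while, every identifier renamed, comments added.
obfuscated = """
/* compute the total of the buffer */
int total(int *buf, int len) {
    int acc = 0;        /* running total */
    int idx = 0;
    while (idx < len) { /* walk the buffer */
        acc += buf[idx];
        idx++;
    }
    return acc;
}
"""

ratio = difflib.SequenceMatcher(None, original, obfuscated).ratio()
print(f"text-level similarity: {ratio:.0%}")
```

The structure of the two functions is identical, but a purely textual comparison can no longer see that, which is the motivation for moving to AST-level comparison.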

I actually don't know why nobody has developed plagiarism detection software since MOSS and JPlag that can replace those two, given that the academic field did make some progress in this area over the past few years. Guess we need to do this ourselves.

@nickelpro commented Feb 5, 2022

Ah, interesting. I would consider 14% to be an incredibly suspect submission; anything over 10% typically sets off my spidey senses. Average student submissions in C tend to show low single-digit similarity percentages, assuming template code has been accounted for.

But yeah, if we can beat MOSS, that would be incredible. A massive boon for way more than just Anubis.

@wabscale (Collaborator) commented Feb 6, 2022

Something else to consider is the format in which we take submissions on Anubis. Everything you're looking at so far assumes you basically only have the final product. With Anubis we have the full, unaltered history written in stone in the git history. You could consider taking the git histories into account in your research.

@AlpacaMax (Collaborator, Author)

> Something else to consider is the format in which we take submissions on Anubis. Everything you're looking at so far assumes you basically only have the final product. With Anubis we have the full, unaltered history written in stone in the git history. You could consider taking the git histories into account in your research.

I actually came up with something similar the other day and shared the idea with my roommate (he took the OS class with me). He pointed out that students can always say "I wrote all the code locally; I just copy-paste it into the Anubis IDE for testing", which is the main reason I didn't go with this method. I guess we could build some visualization tools to highlight commits where the student changed so much at once that it looks suspicious, but overall I don't think this method is reliable.

You said earlier that you plan to build a plugin to detect copy-paste actions. I think that would be a better approach than what you describe here.
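Even as just a triage signal for TAs, the commit-highlighting idea could start as something very simple: flag commits whose size is an outlier for that student. A minimal sketch, where the `(sha, lines_added)` input shape and the thresholds are made up for illustration:

```python
from statistics import median

def flag_suspicious(commits, factor=5, floor=30):
    """commits: list of (sha, lines_added) pairs for one student.

    Flags commits whose added-line count exceeds
    max(floor, factor * median commit size) -- a cheap heuristic for
    "a large block of code appeared all at once". Only a hint for
    human review, never proof of anything on its own.
    """
    sizes = [added for _, added in commits]
    cutoff = max(floor, factor * median(sizes))
    return [sha for sha, added in commits if added > cutoff]

history = [("a1", 12), ("b2", 8), ("c3", 400), ("d4", 15)]
print(flag_suspicious(history))  # -> ['c3']
```

As noted above, this can't distinguish pasted plagiarism from legitimately pasted local work, so it would only ever feed a visualization for TAs, not an automated verdict.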

@GusSand (Collaborator) commented Feb 6, 2022

Max:

Before we go down a rathole, let's consider the goals: the goal should be to identify "potential" duplicates so that TAs can double-check. You are not looking for 100% accuracy.

Also, consider that if even people who were dedicated to this full time didn't find good solutions, it's a really hard problem. Your last source uses convolutional networks; these days the state of the art is transformers, or transformers plus other techniques. I have also been thinking about this problem, and one of the things I was going to check is the new OpenAI code embeddings: https://openai.com/blog/introducing-text-and-code-embeddings/. Happy to chat offline about some ideas if this is of interest.

Also, here's a good survey of the area: "A Survey on the Evaluation of Clone Detection Performance and Benchmarking", https://arxiv.org/pdf/2006.15682.pdf

@AlpacaMax (Collaborator, Author)

I was actually thinking about the same thing last night while reading the AlphaCode paper by DeepMind. Transformers are definitely a tempting choice for plagiarism detection. And yeah, I'd love to talk about this offline.

@AlpacaMax (Collaborator, Author)

The "experiment" phase of plagiarism detection is over, as we now have a working tool, so I'll close this card.
