EXP Plagiarism Detection for Programming Assignments #274
Comments
Alright, here's an update on this card: I spent the winter break researching this topic. It turns out the machine-learning methods aren't mature enough yet. Of the ML models I found that attempt plagiarism detection, one does worse than MOSS, one makes too many assumptions about the input, and one simply doesn't work. The author of that last model basically admits in his paper that he tried a bunch of things, none of them worked, and it's not his problem anymore since he's getting his PhD anyway despite making zero progress in the field. After some more digging I found three papers that, combined, should add up to a complete AST-based plagiarism detection algorithm. I'm currently reading them and trying to implement an MVP as soon as possible. Will post updates if anything changes.
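To make the AST idea concrete before I have anything real, here's a minimal sketch (in Python, and not the papers' algorithm, just the general principle): reduce each submission to its sequence of AST node types, so renamed identifiers and added comments fall away, then compare the sequences.

```python
# Minimal sketch of the AST idea (not the papers' algorithm): parse two
# submissions, keep only the node-type names so that renamed variables and
# added comments don't matter, then compare the sequences with a cheap
# similarity ratio.
import ast
from difflib import SequenceMatcher

def ast_fingerprint(source: str) -> list[str]:
    """Walk the AST in a deterministic order and keep only node types."""
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]

def structural_similarity(src_a: str, src_b: str) -> float:
    """Return a 0..1 score based purely on AST structure."""
    return SequenceMatcher(None, ast_fingerprint(src_a), ast_fingerprint(src_b)).ratio()

a = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
b = "def acc(values):\n    # sum things up\n    result = 0\n    for v in values:\n        result += v\n    return result\n"
print(structural_similarity(a, b))  # ~1.0 despite renaming and added comments
```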
I'm certain you came across this in your research already, but the "cutting edge", such as it is, in open-source plagiarism detection right now is JPlag. It's the only tool I know of that gets results comparable to MOSS and actually has code available for inspection, even if half the comments are in German.
I did come across that name in the literature. The problem is we need something better than just "comparable results to MOSS". I ran an experiment on MOSS last semester: I duplicated a C file I wrote, replaced two for loops with while loops, renamed every variable, and added a bunch of random comments, and MOSS reported the "plagiarised" version as only 14% similar to the original, which is hard to believe. I actually don't know why nobody has developed plagiarism detection software since MOSS and JPlag that could replace them, since the academic field has made some progress in this area over the past few years. Guess we need to do this ourselves.
Ah, interesting, I would consider 14% to be an incredibly suspect submission. Anything over 10% typically sets off my spidey senses. Average student submissions in C tend to show similarity in the low single digits, assuming template code has been accounted for. But yeah, if we can beat out MOSS, that would be incredible. A massive boon for way more than just Anubis.
Something else to consider is the format in which we take submissions on Anubis. Everything you're looking at so far assumes you basically only have the final product. With Anubis we have the full, unaltered history written in stone in the git history. You could consider taking the git histories into account in your research.
I actually came up with something similar the other day and shared the idea with my roommate (he took the OS class with me). He pointed out that students can always say "I wrote all the code locally, I just copy-paste it into the Anubis IDE for testing", which is the main reason I didn't go with this method. I guess we could build some visualization tools to highlight commits where a student changed so much at once that it looks suspicious, but overall I don't think this method is reliable on its own. You said earlier that you plan to build a plugin to detect copy-paste actions; I think that would be a better approach than what you described here.
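If we did want that visualization as a secondary signal, a rough heuristic could look like the sketch below (the threshold factor, repo path, and the `--numstat` parsing are my own assumptions, not anything Anubis does today): flag commits whose added-line count is far above the student's own average.

```python
# Rough heuristic sketch (hypothetical threshold): flag commits whose number
# of added lines is far above the average for the same repo. Parses
# `git log --numstat`, which prints "<added>\t<deleted>\t<path>" per file.
import subprocess
from collections import defaultdict

def added_lines_per_commit(repo_path: str) -> dict[str, int]:
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--numstat", "--format=@%H"],
        capture_output=True, text=True, check=True,
    ).stdout
    added = defaultdict(int)
    current = None
    for line in out.splitlines():
        if line.startswith("@"):
            current = line[1:]          # commit hash
        elif line.strip():
            a, _deleted, _path = line.split("\t", 2)
            if a.isdigit():             # numstat shows "-" for binary files
                added[current] += int(a)
    return dict(added)

def suspicious_commits(repo_path: str, factor: float = 5.0) -> list[str]:
    """Return commit hashes whose added lines exceed factor * repo average."""
    added = added_lines_per_commit(repo_path)
    if not added:
        return []
    mean = sum(added.values()) / len(added)
    return [sha for sha, n in added.items() if n > factor * mean]

print(suspicious_commits("."))
```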
Max: You're not looking for 100% accuracy. Also, we should consider that if even people dedicated to this full time didn't find good solutions, it's a really hard problem. Your last source uses convolutional networks; these days the state of the art is transformers, or transformers plus other components. I've also been thinking about this problem, and one of the things I was going to check out is the new OpenAI code embeddings: https://openai.com/blog/introducing-text-and-code-embeddings/. Happy to chat offline about some ideas if this is of interest. Also, here's a good survey of the area: A Survey on the Evaluation of Clone Detection.
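For what it's worth, the embedding comparison itself is short. Here's a sketch (the model name is an assumption, since OpenAI has rotated its embedding models since that post, and this uses the current Python SDK rather than whatever the blog post showed; the file names are hypothetical): embed both submissions and compare them with cosine similarity.

```python
# Sketch of embedding-based similarity. The model name is an assumption;
# OpenAI has changed its embedding model lineup since the linked blog post.
import math
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def embed(code: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=[code])
    return resp.data[0].embedding

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

submission_a = open("student_a.c").read()   # hypothetical file names
submission_b = open("student_b.c").read()
print(cosine(embed(submission_a), embed(submission_b)))
```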
I was actually thinking about the same thing last night while reading the AlphaCode paper by DeepMind. A transformer is definitely a tempting choice for plagiarism detection. And yeah, I would love to talk about this offline.
The "experiment" phase of plagiarism detection is over as we now have a working tool. So I'll close this card. |
So I've been researching algorithms used for plagiarism detection in code. Unfortunately, this field of research has barely made any progress in the past 20 years. Most of the algorithms are either not resilient enough against code obfuscation or have serious performance issues. After reading some papers, I've reached the conclusion that a machine-learning-based approach is the most promising way to solve this.
I found three papers on this approach:
https://dl.acm.org/doi/pdf/10.1145/3021460.3021473
https://ieeexplore.ieee.org/abstract/document/8575900
https://www.sciencedirect.com/science/article/abs/pii/S0167739X18315528
The first one looks like a work in progress, and the proposed algorithm doesn't achieve much improvement over MOSS. I haven't read the second one yet. The third one looks very promising, but the proposed model isn't built specifically for plagiarism detection, so we'd need to experiment with it.
The problem with a machine-learning approach is always efficiency, so I think we need to make this plagiarism detection system a separate distributed service and have Anubis interact with it over RESTful APIs.
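As a rough sketch of that interface (the endpoint name, payload shape, and the use of Flask are all assumptions, not an existing Anubis API): the detection service exposes a single endpoint that accepts two submissions and returns a similarity score, so Anubis never has to load the model itself.

```python
# Minimal sketch of a standalone detection service (endpoint name and payload
# shape are hypothetical). Anubis would POST two submissions and get a score
# back, keeping the heavy model off the main Anubis deployment.
from flask import Flask, jsonify, request

app = Flask(__name__)

def similarity(code_a: str, code_b: str) -> float:
    """Placeholder for whatever detection model we end up with."""
    return 0.0

@app.route("/compare", methods=["POST"])
def compare():
    payload = request.get_json(force=True)
    score = similarity(payload["submission_a"], payload["submission_b"])
    return jsonify({"similarity": score})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```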