
EXP Plagiarism Detection for Programming Assignments #274

Closed
AlpacaMax opened this issue Oct 16, 2021 · 9 comments
@AlpacaMax (Collaborator) commented Oct 16, 2021

So I've been researching algorithms used for plagiarism detection in code. Unfortunately, this field of research has barely made any progress over the past 20 years. Most of the algorithms are either not resilient enough against code obfuscation or have serious performance issues. After reading some papers, I've reached the conclusion that a machine-learning-based approach is the most promising way to solve this.

I found three papers on this approach:
https://dl.acm.org/doi/pdf/10.1145/3021460.3021473
https://ieeexplore.ieee.org/abstract/document/8575900
https://www.sciencedirect.com/science/article/abs/pii/S0167739X18315528

The first one looks like a work in progress, and the proposed algorithm doesn't achieve much improvement over MOSS. I haven't read the second one yet. The third one looks super promising, but the proposed model is not really built for plagiarism detection, so we'd need to experiment with it.

The problem with the machine learning approach is always efficiency. I guess we need to make this plagiarism detection system a separate distributed system and let Anubis and it interact through RESTful APIs.
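A rough sketch of what a request to such a separate service might look like. Everything here is hypothetical: no such endpoint exists in Anubis, and the field names (`assignment_id`, `netid`, `repo_url`) are made up for illustration.

```python
import json

def build_compare_request(assignment_id, submissions):
    """Build the JSON body for a hypothetical POST /api/v1/compare call
    to a standalone plagiarism-detection service.

    submissions maps a student's netid to their repo URL; both names
    are illustrative, not part of any real Anubis API.
    """
    return json.dumps({
        "assignment_id": assignment_id,
        "submissions": [
            {"netid": netid, "repo_url": url}
            for netid, url in submissions.items()
        ],
    })

body = build_compare_request(
    "os-hw3",
    {"abc123": "https://github.com/example/os-hw3-abc123"},
)
print(body)
```

Keeping the detection service behind a small, versioned API like this would let it scale (and fail) independently of the main Anubis deployment.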

@AlpacaMax added the "feature", "experimental", and "backend" labels on Oct 16, 2021
@AlpacaMax AlpacaMax self-assigned this Oct 16, 2021
@wabscale wabscale mentioned this issue Dec 16, 2021
@AlpacaMax (Collaborator, Author)

Alright, here's an update on this card:

So I spent the winter break researching this topic. Apparently, the machine learning methods are not mature enough. Of all the ML models I found that try to do plagiarism detection, one does worse than MOSS, one makes too many assumptions about the input, and one doesn't work at all. The author of the last model basically admits in his paper that he tried a bunch of approaches, none of them actually worked, and then he moved on after finishing his PhD, having made zero progress in this field.

After some more digging I found these three papers:
http://ceur-ws.org/Vol-2259/aics_33.pdf
https://hal.archives-ouvertes.fr/hal-00780290/document
http://www.diva-portal.org/smash/get/diva2:548974/FULLTEXT01.pdf

Combined, these should form a complete AST-based plagiarism detection algorithm. I'm currently reading the papers and trying to implement an MVP as soon as possible. Will post updates if anything changes.
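As a very small illustration of the AST-based idea (not an implementation of the algorithms in those papers), Python's `ast` module is enough to show why structural fingerprints survive variable renaming: node types are compared instead of identifiers or raw text.

```python
import ast
from collections import Counter

def node_type_kgrams(source, k=3):
    """Collect k-grams of AST node-type names from an ast.walk traversal.
    Identifiers and literal values are ignored, so renaming variables
    does not change the fingerprint at all."""
    types = [type(n).__name__ for n in ast.walk(ast.parse(source))]
    return Counter(tuple(types[i:i + k]) for i in range(len(types) - k + 1))

def similarity(a, b):
    """Jaccard-style overlap of the two k-gram multisets, in [0, 1]."""
    ga, gb = node_type_kgrams(a), node_type_kgrams(b)
    union = sum((ga | gb).values())
    return sum((ga & gb).values()) / union if union else 0.0

original = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s"
renamed  = "def compute(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc"
print(similarity(original, renamed))  # identical structure -> 1.0
```

A real AST-based detector would also need tree canonicalization (e.g. to catch for/while swaps) and subtree matching, which is what those papers cover; this sketch only shows the rename-invariance property.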

@nickelpro commented Feb 5, 2022

I'm certain you came across this in your research already, but the "cutting edge", such as it is, in open-source plagiarism detection right now is JPlag. It's the only tool I know of that gets results comparable to MOSS and actually has code available for inspection, even if half the comments are in German.

@AlpacaMax (Collaborator, Author)

> I'm certain you came across this in your research already, but the "cutting edge", such as it is, in open-source plagiarism detection right now is JPlag. It's the only tool I know of that gets results comparable to MOSS and actually has code available for inspection, even if half the comments are in German.

I did come across that name in the literature. The problem is we need something better than just "comparable results to MOSS". I ran an experiment on MOSS last semester: I duplicated a C file I wrote, refactored two for loops into while loops, renamed all the variables, and added a bunch of random comments, and MOSS thought the "plagiarised" version was only 14% similar to the original, which is just unbelievable.
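For anyone who wants to reproduce a toy version of this experiment: text-level similarity collapses under exactly those edits. The sketch below uses Python's `difflib` rather than MOSS's winnowing fingerprints, so the number it prints is only illustrative of the effect, not of MOSS's actual score.

```python
import difflib

original = """
int sum(int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) {
        s += a[i];
    }
    return s;
}
"""

# The same function after the obfuscations described above:
# for -> while, every identifier renamed, comments added.
obfuscated = """
/* compute the total of the buffer */
int total(int *buf, int len) {
    int acc = 0;        /* running total */
    int idx = 0;
    while (idx < len) { /* walk the buffer */
        acc += buf[idx];
        idx++;
    }
    return acc;
}
"""

ratio = difflib.SequenceMatcher(None, original, obfuscated).ratio()
print(f"text-level similarity: {ratio:.0%}")
```

The structure of the two functions is identical, but a purely textual comparison can no longer see that, which is the motivation for moving to AST-level comparison.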

I actually don't know why nobody has developed plagiarism detection software since MOSS and JPlag that can replace those two, given that the academic field did make some progress in this area over the past few years. Guess we need to do this ourselves.

@nickelpro commented Feb 5, 2022

Ah, interesting. I would consider 14% to be an incredibly suspect submission; anything over 10% typically sets off my spidey senses. Average student submissions in C tend to show low single-digit similarity percentages, assuming template code has been accounted for.

But yeah, if we can beat MOSS, that would be incredible. A massive boon for way more than just Anubis.

@wabscale (Collaborator) commented Feb 6, 2022

Something else to consider is the format in which we take submissions on Anubis. Everything you're looking at so far assumes you basically only have the final product. With Anubis we have the full, unaltered history written in stone in the git history. You could consider taking the git histories into account in your research.

@AlpacaMax (Collaborator, Author)

> Something else to consider is the format in which we take submissions on Anubis. Everything you're looking at so far assumes you basically only have the final product. With Anubis we have the full, unaltered history written in stone in the git history. You could consider taking the git histories into account in your research.

I actually came up with something similar the other day and shared the idea with my roommate (he took the OS class with me). He pointed out that students can always say "I wrote all the code locally; I just copy-paste it into the Anubis IDE for testing", which is the main reason I didn't go with this method. I guess we could build some visualization tools to highlight commits where the student changed so much at once that it looks suspicious, but overall I don't think this method is reliable.

You said earlier that you plan to build a plugin to detect copy-paste actions. I think that would be a better approach than what you describe here.
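Even as just a triage signal for TAs, the commit-highlighting idea could start as something very simple: flag commits whose size is an outlier for that student. A minimal sketch, where the `(sha, lines_added)` input shape and the thresholds are made up for illustration:

```python
from statistics import median

def flag_suspicious(commits, factor=5, floor=30):
    """commits: list of (sha, lines_added) pairs for one student.

    Flags commits whose added-line count exceeds
    max(floor, factor * median commit size) -- a cheap heuristic for
    "a large block of code appeared all at once". Only a hint for
    human review, never proof of anything on its own.
    """
    sizes = [added for _, added in commits]
    cutoff = max(floor, factor * median(sizes))
    return [sha for sha, added in commits if added > cutoff]

history = [("a1", 12), ("b2", 8), ("c3", 400), ("d4", 15)]
print(flag_suspicious(history))  # -> ['c3']
```

As noted above, this can't distinguish pasted plagiarism from legitimately pasted local work, so it would only ever feed a visualization for TAs, not an automated verdict.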

@GusSand (Collaborator) commented Feb 6, 2022

Max:

Before we go down a rathole, let's consider the goals: the goal should be to identify "potential" duplicates so that TAs can double-check. You are not looking for 100% accuracy.

Also, consider that if even people who were dedicated to this full time didn't find good solutions, it's a really hard problem. Your last source uses convolutional networks; these days the state of the art is transformers, or transformers plus other techniques. I have also been thinking about this problem, and one of the things I was going to check is the new OpenAI code embeddings: https://openai.com/blog/introducing-text-and-code-embeddings/. Happy to chat offline about some ideas if this is of interest.

Also, here's a good survey of the area: "A Survey on the Evaluation of Clone Detection Performance and Benchmarking", https://arxiv.org/pdf/2006.15682.pdf

@AlpacaMax (Collaborator, Author)

I was actually thinking about the same thing last night while reading the AlphaCode paper by DeepMind. Transformers are definitely a tempting choice for plagiarism detection. And yeah, I'd love to talk about this offline.

@AlpacaMax (Collaborator, Author)

The "experiment" phase of plagiarism detection is over, as we now have a working tool, so I'll close this card.
