This repository has been archived by the owner on Aug 28, 2019. It is now read-only.

Travis Build: add a plagiarism check #3315

Closed
davcri opened this issue Oct 23, 2017 · 8 comments

Comments

@davcri
Contributor

davcri commented Oct 23, 2017

As discussed in #2503 (comment), we should add a plagiarism check at build time. Options that can be evaluated:

@Ethan-Arrowood
Member

As @QuincyLarson mentioned in #2503 (comment), it might be worth implementing this as a precommit hook to block users from even submitting a plagiarized PR in the first place.

But what if the precommit hook is wrong?

Then there should also be a --not-plagiarism flag that users could add to their commit statement to override the plagiarism checker.
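A minimal sketch of how such an override could be detected, assuming the flag appears as a standalone word in the commit message. The function names `has_override` and `should_block` are hypothetical and not part of any existing hook.

```python
OVERRIDE_FLAG = "--not-plagiarism"  # flag name taken from the proposal above


def has_override(commit_message: str) -> bool:
    """Return True if the flag appears as its own whitespace-separated token."""
    return OVERRIDE_FLAG in commit_message.split()


def should_block(suspected: bool, commit_message: str) -> bool:
    """Block only when the checker is suspicious and the author did not override."""
    return suspected and not has_override(commit_message)
```

Matching whole tokens rather than substrings avoids accepting near misses such as `--not-plagiarism-really`.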

And how can we stop users from abusing the flag?

We should state in our contributing guidelines that a user who submits 3 or more PRs containing blatantly plagiarized content will be blocked from contributing to this project. We could of course set up an appeal process for users who did so accidentally or were unaware of what they did.

The bottom line is that we need to establish a strict no-plagiarism policy and enforce it. Furthermore, the number of PRs continues to grow like wildfire, and I believe that implementing some precommit hooks would limit the number of bad PRs that land on this project. I know this may stifle some new contributors, but it is still important that we maintain this project with good contributing guidelines.

Like I stated before, I can't take the lead on this development at the moment, but I'd be happy to answer any questions and provide feedback on things other FCC contributors propose.

@davcri
Contributor Author

davcri commented Nov 2, 2017

About the pre-commit hook, @Bouncey stated that:

This check would be better as a Travis check due to the number of PRs coming in via the GitHub GUI. Pre-commit hooks only work when committing locally.

@dhcodes
Contributor

dhcodes commented Nov 2, 2017

@davcri re: your question. I'm currently working on a probot-based PR bot that would work as follows:

  • User submits PR
  • PR gets tagged as content by a reviewer, or auto-tagged based on its place in the pages dir. This label triggers the bot
  • Bot runs selected parts of diffed content through a Google Search
  • If a certain percentage is found on another site, the bot automatically adds a comment to the PR that says in nice terms that it's suspected to be plagiarized from <link to original site>, and then auto-tags it with a label accordingly.
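
The steps above can be sketched roughly as follows. This is only an illustration in Python, not the bot's actual code: the search step is injected as a hypothetical `search` callable (it would wrap a real web search), and the sentence splitting, the `min_words` cutoff, and the 50% threshold are all assumptions.

```python
import re
from collections import Counter
from typing import Callable, List, Optional

# Hypothetical search hook: takes a snippet and returns the URL of a page
# containing it, or None. Injected so the sketch runs without network access.
SearchFn = Callable[[str], Optional[str]]


def split_into_snippets(added_text: str, min_words: int = 8) -> List[str]:
    """Split added text into sentences, keeping only ones long enough to be distinctive."""
    sentences = re.split(r"(?<=[.!?])\s+", added_text.strip())
    return [s for s in sentences if len(s.split()) >= min_words]


def review_comment(added_text: str, search: SearchFn,
                   threshold: float = 0.5) -> Optional[str]:
    """Return a polite PR comment if enough snippets are found elsewhere, else None."""
    snippets = split_into_snippets(added_text)
    if not snippets:
        return None
    hits = [url for url in (search(s) for s in snippets) if url is not None]
    if not hits or len(hits) / len(snippets) < threshold:
        return None
    # Report the source that matched the most snippets.
    top_source = Counter(hits).most_common(1)[0][0]
    percent = round(100 * len(hits) / len(snippets))
    return (f"Thanks for contributing! About {percent}% of the added text "
            f"appears to match {top_source}. Could you check that it is "
            f"original or properly cited?")
```

Returning `None` when nothing is suspicious keeps the decision about labels and comments in the caller, so a human reviewer still has the final say.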

I need to think about how to exclude articles that use direct quotes or have adequate references.

I'd love to get your thoughts. As for running the existing PRs through it, we could probably temporarily change its criteria to go through existing PRs, but we'd need to pay a one-time cost to extend the Google Custom Search API beyond its 100/day query limit.
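
A rough sketch of the arithmetic behind that limit. Only the 100/day figure comes from the comment above; the function name `backlog_plan` and the PR and query counts in the usage note are hypothetical.

```python
import math

DAILY_QUERY_LIMIT = 100  # the per-day limit mentioned above


def backlog_plan(open_prs: int, queries_per_pr: int,
                 daily_limit: int = DAILY_QUERY_LIMIT) -> dict:
    """Estimate how many searches a backlog needs and how that compares to the limit."""
    total = open_prs * queries_per_pr
    return {
        "total_queries": total,
        # Days needed if the bot never exceeds the daily limit.
        "days_within_limit": math.ceil(total / daily_limit),
        # Extra queries needed to finish the whole backlog in a single day.
        "queries_over_one_day_limit": max(0, total - daily_limit),
    }
```

For example, a hypothetical backlog of 400 PRs at 5 searches each needs 2000 queries, which is either 20 days at the limit or 1900 queries beyond one day's allowance.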

Right now the bot is running but has none of the functionality above. I'm currently working on only getting diffed text.

@QuincyLarson
Contributor

@dhcodes I think instead of trying to exclude articles that use direct quotes or have adequate references, this should be left to a human reviewer to decide. By raising awareness that the content comes from an external source, it will give PR reviewers a heads-up that they need to make sure things are properly cited. Then they can pass judgement as to whether further citation is necessary themselves.

@davcri
Contributor Author

davcri commented Nov 8, 2017

@QuincyLarson yes, this seems like a perfect balance between the script (Probot in this case) and human effort.

@dhcodes I didn't know about Probot! Is the code (of your bot) hosted here on GitHub? I could try to help, even if I'm not so experienced with JavaScript.

@dhcodes
Contributor

dhcodes commented Nov 8, 2017

@davcri the code is here: https://aromatic-okra.glitch.me

Sorry for taking so long :(. It's most definitely a learning experience.

@jp-sauve
Contributor

jp-sauve commented Nov 8, 2017

Is citation enough? It obviously cures plagiarism, and where the sources have permissive licenses, it's enough. But for copyrighted content, there is a delicate balance between fair use and infringement, and it seems that it would come down to a few factors: whether the FCC guide can be considered commercial, whether the copied material would deprive the copyright owner of profits, and the amount of content copied. For reviewers, the last factor seems to be the important one. Copying a whole page could fall outside of fair use, and so could copying many sections from the same site. In these cases, the method of citation, and whether there are quotes or not, seems irrelevant. This could matter in practice, as I've seen multiple PRs copying from mathisfun.com and techopedia.com.

@QuincyLarson
Contributor

@jp-sauve Rather than delving into the legal questions you've posed, I think we should follow our best judgement and operate under the principle that articles should be primarily original. We are not interested in cross-posting articles from MDN, Stack Overflow, etc. The bulk of these articles should be original. I think it's better to have a short article that is just a single, attributed, paragraph-length quote than nothing. But an article shouldn't consist only of multiple quoted paragraphs. The author should try to add some context.


6 participants