Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend JPlag with checking against prior submissions #289

Merged
merged 13 commits into from
Apr 12, 2022

Conversation

Alberth289346
Copy link
Contributor

@Alberth289346 Alberth289346 commented Feb 10, 2022

JPlag currently perform plagiarism checks for all combinations of submissions. However, as described in #49 and #91 there is a wish to extend checks with an additional set of "old" submissions. These old submissions are also used in the plagiarism checks, but checks between two old submission are skipped.
This improves speed of the checking process, and reduces clutter in the result, as plagiarism between old submissions is not reported.
This patch implements adding such an old submission set.

Notes:

  • In the code I opted for descriptive names plagiarismCheckRootDirectories and priorSubmssionsRootDirectories to distinguish both sets.

  • At the command-line, I opted for -new and -old, as that is reasonably short and clearly different from each other.
    For completeness the variants we considered were

      --new-submissions <foldername>
    
      --checked DIR and --unchecked DIR
    
      --checked-dir DIR and --check-dir DIR   imho too subtle difference
    
      thinking a sentence like "checking ..... submissions against .... submissions" and trying to find
      good words there. Perhaps "unchecked" and ("referenced" or "verified" or "done" or "old")?
      Don't like "unchecked" much, "unknown" perhaps?
    
      Without "unchecked", "checked" might be good at the second dots too.
    
      --old-dir and --target-dir
      --old sounds a bit negative to me, --existing
    
      --target we could have --new-dir
    
      Result:
      --old-dir and --new-dir
    

    Also there is -archive and -prior too from Archival submissions #49 and Added a -prior option. #91 .
    I somewhat added -check and -prior as notions in the code, but since most users don't read source code I am not sure how relevant that is.

  • We should decide about these names, and adapt the code and/or options to that.

  • Clearly, code names and option names are not equal, not sure if that's desired. Programmers and program users are mostly disjoint groups of people.

#91 only added the -prior option, and adapted processing, so this PR can be closed I think.

#49 does additional HTML output things, such changes may need considering for the new HTML report backend.

#206 is implemented except for globbing support of root-directory paths.
EDIT: Not sure how badly we want globbing, but I think we can live without it for a while as the shell provides this functionality already.

#238 is my old discussion PR which is fully covered I think.

Closes #238
Adresses #49 #91 #206

@tsaglam tsaglam added enhancement Issue/PR that involves features, improvements and other changes major Major issue/feature/contribution/change labels Feb 14, 2022
@tsaglam tsaglam added this to the v3.1.0 milestone Feb 14, 2022
@tsaglam tsaglam self-assigned this Feb 14, 2022
This was referenced Feb 21, 2022
@Alberth289346
Copy link
Contributor Author

Rebased.

Note I have no way to test anymore.

  • Java 11 breaks on lack of com.sun.source.tree.SwitchExpressionTree (after reducing java version 14 requirement to 11 in main pom).
  • Deleting the Java frontend before building with 11 breaks on 'linking' the frontend into the executable.
  • If you hack that as well, tests fail on not having the java frontend for testing.
  • Java 16 doesn't even get off the ground:
    $ mvn clean package assembly:single
    [ERROR] Error executing Maven.
    [ERROR] java.lang.IllegalStateException: Unable to load cache item
    [ERROR] Caused by: Unable to load cache item
    [ERROR] Caused by: Could not initialize class com.google.inject.internal.cglib.core.$MethodWrapper
    
  • Java 14 is not available any more.

@Alberth289346
Copy link
Contributor Author

Please give me a head-up when you reached this PR and you want a rebase for a review or merge. I don't get notifications when a conflict happens nor is it much use to rebase the patch only to find more conflicts a few weeks later. I'd rather handle all conflicts at the same time.

@tsaglam
Copy link
Member

tsaglam commented Apr 4, 2022

@Alberth289346 now would be the time for conflict solving. Then we can review this PR and integrate it.
Regarding your Java problem, I can build JPlag with Java 17 without any problems. I cannot really remember the details on why you can't use multiple Java versions on your system, but here are some ideas:

  • Have you tried using something like Jenv?
  • For maven you could use aliases or equivalent solutions to run it with different versions, for example on mac os I have the following in my bash profile for different java versions: alias mvn17="JAVA_HOME=path/to/jdk-17.0.2.jdk/Contents/Home && mvn"

@Alberth289346
Copy link
Contributor Author

Alberth289346 commented Apr 4, 2022

now would be the time for conflict solving. Then we can review this PR and integrate it.

Nice!

Regarding your Java problem, I can build JPlag with Java 17 without any problems.

I can change to newer Java versions, the OS package manager gives me various javas. They just all don't work due to the pasted mvn crash above. The Internet suggests it's a bug in guice: https://stackoverflow.com/a/62809287 (trying to use 0 processors for computing doesn't quite work in Unix).

As for the patch:

I rebased, force-pushed (as my maven isn't fixed yet, I just tried both with java 16 and 17), found a modified name in the new report code due to failing the CI build, and fixed that in d367380 .

One thing that could be wrong now is that reporting about comparison results may now pull in submissions from the getPriorSubmissionsDirectoryNames() collection. No idea if that works, the entire reporting thing is new to me, it seems. Without being able to build it, verifying it is also not quite trivial :(
The only thing I can think of now is to ditch maven, and build a binary by plain javac. Not sure how much work that is.

@Alberth289346
Copy link
Contributor Author

After a night of sleep, I think the better solution is to merge all the root directories before giving them to the reporter. That should just work, while the user still has the benefit of a reduced number of plagiarism checks.

Future work is perhaps to make the distinction between "new" and "old" work more clear in the report.

@tsaglam
Copy link
Member

tsaglam commented Apr 5, 2022

After a night of sleep, I think the better solution is to merge all the root directories before giving them to the reporter. That should just work, while the user still has the benefit of a reduced number of plagiarism checks.

Yes, that does sound like a good solution.

Future work is perhaps to make the distinction between "new" and "old" work more clear in the report.

Of course, that is something we need to do, but that is not your responsibility. The new report view has a few things we need to optimize anyways. If we can provide the information required to differ between old and new in the result data structure (maybe a field in the submission class or similar), it would be ideal. However, we can always implement that later as well.

@tsaglam
Copy link
Member

tsaglam commented Apr 6, 2022

Btw @Alberth289346 what is the output of mvn -version for you? Maybe the right combination of the maven version + Java version is required for you to resolve the Unable to load cache item issue.

@Alberth289346
Copy link
Contributor Author

Alberth289346 commented Apr 7, 2022

Btw @Alberth289346 what is the output of mvn -version for you? Maybe the right combination of the maven version + Java version is required for you to resolve the Unable to load cache item issue.

In case you're not familiar with Linuxes, apt is the package manager that provides all my installed software (both the OS and all applications. As you can see below you can also query what it can provide.

update-java-alternative is a generic version system-wide switching mechanism so you can configure the system java version.
build.sh is script wrapping the mvn build command-line so I don't have to type it each time.

#!/bin/sh
set -e -u -x

mvn clean package assembly:single
Click to expand!
# Available javas, didn't install 8 and 13.
$ apt list openjdk-\*-jdk | grep amd64
openjdk-11-jdk/focal-updates,focal-security,now 11.0.14.1+1-0ubuntu1~20.04 amd64 [installed]
openjdk-13-jdk/focal-updates 13.0.7+5-0ubuntu1~20.04 amd64
openjdk-16-jdk/focal-updates,focal-security,now 16.0.1+9-1~20.04 amd64 [installed]
openjdk-17-jdk/focal-updates,focal-security,now 17.0.2+8-1~20.04 amd64 [installed]
openjdk-8-jdk/focal-updates,focal-security 8u312-b07-0ubuntu1~20.04 amd64

# Available mavens.
$ apt list maven
maven/focal,focal,now 3.6.3-1 all [installed]

# and it's there.
$ mvn --version
Apache Maven 3.6.3
Maven home: /usr/share/maven
Java version: 17.0.2, vendor: Private Build, runtime: /usr/lib/jvm/java-17-openjdk-amd64
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "5.4.0-107-generic", arch: "amd64", family: "unix"

# Trying java 11:
$ sudo update-java-alternatives -s /usr/lib/jvm/java-1.11.0-openjdk-amd64

$  java -version                                                          
openjdk version "11.0.14.1" 2022-02-08
OpenJDK Runtime Environment (build 11.0.14.1+1-Ubuntu-0ubuntu1.20.04)
OpenJDK 64-Bit Server VM (build 11.0.14.1+1-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)

$ ../build.sh                                                             
+ mvn clean package assembly:single
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.google.inject.internal.cglib.core.$ReflectUtils$1 (file:/usr/share/maven/lib/guice.jar) to method java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int,java.security.ProtectionDomain)
WARNING: Please consider reporting this to the maintainers of com.google.inject.internal.cglib.core.$ReflectUtils$1
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
[INFO] Scanning for projects...

# Lots of log deleted here

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for JPlag Plagiarism Detector 4.0.0-SNAPSHOT:
[INFO] 
[INFO] JPlag Plagiarism Detector .......................... SUCCESS [  4.945 s]
[INFO] frontend-utils ..................................... FAILURE [  1.155 s]
[INFO] chars .............................................. SKIPPED
[INFO] cpp ................................................ SKIPPED
[INFO] csharp-1.2 ......................................... SKIPPED
[INFO] java ............................................... SKIPPED
[INFO] python-3 ........................................... SKIPPED
[INFO] scheme ............................................. SKIPPED
[INFO] text ............................................... SKIPPED
[INFO] jplag .............................................. SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  8.342 s
[INFO] Finished at: 2022-04-07T09:12:38+02:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project frontend-utils: Fatal error compiling: error: invalid target release: 14 -> [Help 1]
[ERROR] 

So Java 11 works, but the JPlag project doesn't like it. I worked around this for some time by taking out the NewFeatureTest code and hacking the root pom.xml switching to java 11 as requirement.

However, as more and more 12+ java code is being added to the project, this doesn't work any more.

Click to expand!
# Trying java 16.
$ sudo update-java-alternatives -s /usr/lib/jvm/java-1.16.0-openjdk-amd64

$ + mvn clean package assembly:single
[ERROR] Error executing Maven.
[ERROR] java.lang.IllegalStateException: Unable to load cache item
[ERROR] Caused by: Unable to load cache item
[ERROR] Caused by: Could not initialize class com.google.inject.internal.cglib.core.$MethodWrapper

# Trying java 17.
$ sudo update-java-alternatives -s /usr/lib/jvm/java-1.17.0-openjdk-amd64

# Forgot that above, but java version does change.
$  java -version                                                          
openjdk version "17.0.2" 2022-01-18
OpenJDK Runtime Environment (build 17.0.2+8-Ubuntu-120.04)
OpenJDK 64-Bit Server VM (build 17.0.2+8-Ubuntu-120.04, mixed mode, sharing)

$ ./build.sh                                                             
+ mvn clean package assembly:single
[ERROR] Error executing Maven.
[ERROR] java.lang.IllegalStateException: Unable to load cache item
[ERROR] Caused by: Unable to load cache item
[ERROR] Caused by: Could not initialize class com.google.inject.internal.cglib.core.$MethodWrapper

I am using an LTS distribution for my OS (don't want to do re-installs every 6 months), and it makes sense to me they then also pick an LTS java version (ie 11). Unfortunately, the JPlag project decided to use a newer Java.

But the CI does run the tests afaik, and Github claims they are OK, so the code should be correct, wouldn't it?

@tsaglam
Copy link
Member

tsaglam commented Apr 7, 2022

In case you're not familiar with Linuxes, apt is the package manager that provides all my installed software (both the OS and all applications. As you can see below you can also query what it can provide.

I do not use Linux frequently but I am vaguely familiar with it.

# Available mavens.
$ apt list maven
maven/focal,focal,now 3.6.3-1 all [installed]

# and it's there.
$ mvn --version
Apache Maven 3.6.3

This is was I suspected. After a short google search I found some people claiming that the unable to load cache item error is an issue of some mvn versions with newer Java versions. Problematically, some package managers do not provide a newer mvn version than 3.6.3 which seems to be affected by this issue (this might be the case for you). This is why I was asking for your mvn version.
This posts suggests updating to a newer mvn version manually.
In this post the user also has the problem with mvn 3.6.3 and java 17. Updating to a newer mvn version seemed to work there as well.

I am using an LTS distribution for my OS (don't want to do re-installs every 6 months), and it makes sense to me they then also pick an LTS java version (ie 11). Unfortunately, the JPlag project decided to use a newer Java.

That makes totally sense. However, as the Java frontend of JPlag is designed based on the JavaC API, it can only work with newer Java code if it is built with a newer java version. Thus, we are required to move on to newer Java versions more often than usually, as it is the only way to allow users to check the code of students of these newer Java versions.

Regarding LTS versions, we are already planning to Java 17 soon, which then is again an LTS version.

But the CI does run the tests afaik, and Github claims they are OK, so the code should be correct, wouldn't it?

Yes, but knowing that you cannot run your code locally means you cannot jar your feature version of JPlag and play around with it locally, thus allowing some more manual tests than the unit tests. This means I need to do this after the code review, which takes more time on my side (of which I have very little right now) and prolongs the process of merging the PR. This is why I was also interested in fixing your mvn issues.

@Alberth289346
Copy link
Contributor Author

However, as the Java frontend of JPlag is designed based on the JavaC API, it can only work with newer Java code if it is built with a newer java version. Thus, we are required to move on to newer Java versions more often than usually, as it is the only way to allow users to check the code of students of these newer Java versions.

Two things that popped up in my mind:

  1. JPlag-core code could be separate from frontend code, and have different minimum Java versions (at least in theory).
    In my case, I don't even use the Java-frontend, other than the JPlag-tests that picked that frontend for their tests. So if I could disable certain frontends, I could use an older Java.

  2. Does the JavaC API change between versions?
    Obviously when you want to check submisions in Java 17, you need a Java 17 checker (so you don't have to have your own parser). But if I just want to check Java 11 code, can you build the java frontend with Java 11 then? The JavaC API would come from Java 11 too then, ie it would adapt itself to the used Java compiler?

Likely none of this will fly right now, but it could be a path into the (far?) future.

@tsaglam
Copy link
Member

tsaglam commented Apr 7, 2022

Regarding 1: Yes, this might be possible, but I am not sure if it will work. Moreover, project restructurings like that do not come for free, so this is something that would need to be planned carefully (e.g. think of the CI pipeline to ensure consistency among the repos).

  1. Does the JavaC API change between versions?

Yes, unfortunately, it does! For example, the tree scanner class receives additional methods for new language features. If you compile that with an older java version it will tell you that it cannot find this method in the super class tree scanner, as it was added with a newer java version. And that is only one example.

Likely none of this will fly right now, but it could be a path into the (far?) future.

Yes, those are things we can certainly consider for long-term planning.

Regarding this PR now, just tell me when it is ready to review. If you want to make some more changes that's fine too.

@Alberth289346
Copy link
Contributor Author

Alberth289346 commented Apr 7, 2022

You can review it when you have time. It just doesn't fly since you cannot specify more than 1 root directory currently due to #354 .

EDIT: Basically it adds a mustBeCheckedForPlagiarism flag to each submission, switches to the comparison tuple list both for normal and parallel strategies, and only adds pairs where at least one of the submission has that flag set.

Copy link
Member

@tsaglam tsaglam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to find clear, concise, and consistent naming of concepts and use it throughout the code. See the comments for details.

jplag/src/main/java/de/jplag/Submission.java Outdated Show resolved Hide resolved
jplag/src/main/java/de/jplag/Submission.java Outdated Show resolved Hide resolved
jplag/src/main/java/de/jplag/Submission.java Outdated Show resolved Hide resolved
jplag/src/main/java/de/jplag/SubmissionSetBuilder.java Outdated Show resolved Hide resolved
jplag/src/main/java/de/jplag/SubmissionSetBuilder.java Outdated Show resolved Hide resolved
jplag/src/main/java/de/jplag/options/JPlagOptions.java Outdated Show resolved Hide resolved
jplag/src/main/java/de/jplag/options/JPlagOptions.java Outdated Show resolved Hide resolved
jplag/src/main/java/de/jplag/options/JPlagOptions.java Outdated Show resolved Hide resolved
@Alberth289346
Copy link
Contributor Author

We need to find clear, concise, and consistent naming of concepts and use it throughout the code.

Yep, that was the still open question that I discussed in my very first comment. As I already stated there, I agree we need consistent naming. I just didn't know the direction.

In my view, the disadvantage of new and old is that they are tightly connected to the dominant use-case. For example, if you need to re-run a check that you performed 2 years ago, or you have a different use-case where new and old are not related to time but eg to source where you obtained them, the option names become less intuitive.
On the other hand, it is the (very) dominant use-case, and new and old are nicely short names.

So it's a matter of opinion how good the new / old option names are. I am pretty much fine with anything, mostly because I don't have overwhelmingly better alternatives.

@Alberth289346
Copy link
Contributor Author

Changed the names as you suggested.

I purposely didn't reformat the code to avoid complicated diffs (it's an extensive change so I used a plain editor to get the same names changed in the same way across all files for different variables).

The CI insists on reformatting apparently, but running spotless is non-fixable for me. It needs maven, so I switched to java 11, then it started barking about unsupported input in the frontend sources so I eventually just deleted all frontend source locally so it could do just jplag main code (where the changes are). Then it barked about unsupported input in the jplag code and I was out of options.

I either have a running maven and no source code support, or a working spotless but not maven.

@Alberth289346
Copy link
Contributor Author

Thank you for applying formatting.

@Alberth289346
Copy link
Contributor Author

To be clear, I am done processing review comments. Please review the changes.

Copy link
Member

@tsaglam tsaglam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I know the multi-root/archival features are currently not really compatible with the new report, but we will have a student working on fixing the new report viewer starting at the end of April. See #357, also feel free to comment there if something is missing. Thus I would say and merge this PR. This will prevent it from going out of date. Further improvements can always be made in separate PRs. I had no time yet to manually play around with the feature, but I will do that in the future.
@Alberth289346 are you okay with merging?

@Alberth289346
Copy link
Contributor Author

Yes, please merge so the feature is consolidated.

I am done for now I think. I'll discuss internally how to proceed from here. I'll keep an eye on the project.

For now, thanks for your time, assistance, and kind words and behavior, it was a pleasure working with you.
Perhaps we might meet again.

Albert

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Issue/PR that involves features, improvements and other changes major Major issue/feature/contribution/change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants