Enhance the new Report Viewer #357

Closed · 14 of 30 tasks
sebinside opened this issue Apr 11, 2022 · 20 comments

Labels: bug (Issue/PR that involves a bug), enhancement (Issue/PR that involves features, improvements and other changes), major (Major issue/feature/contribution/change), PISE-SS22 (Tasks for the PISE practical course (SS22)), report-viewer (PR / Issue deals (partly) with the report viewer and thus involves web-dev technologies)

Comments

@sebinside (Member) commented Apr 11, 2022

In #287, we added a new web-based report viewer developed from scratch. However, many enhancements and fixes remain before the viewer is release-ready for version 4.0:

General

JSON Generator

Fixes

Report Viewer

Fixes

Enhancements

  • (Additionally) zip all files after the creation
  • Switching between avg and max metric freezes the UI if there are many submissions in the overview
  • Built report viewer files shall be deployed with results (see Replace the outdated HTML report generation #192)
  • Include and enforce eslint
  • Accept multiple files as input
  • Show a warning if some/all comparison files are missing
  • Add a "back to overview" button to the comparison view
  • Reverse sorting order of the histogram (and fix intervals being closed)
  • Change the color scheme of matches (something without implicit semantics)
  • Enhance file comparison view: Find a way to better display the sidebar (currently quite hidden)
  • Change comparison view file order, e.g., based on packages rather than alphabetical
  • Add button "collapse all" to the comparison view
  • Highlight base code, e.g., gray
  • Enable collapsing by clicking on the file header in the comparison view
  • Dark mode (e.g., by user choice or even by default)
  • Enhance overall frontend design/impression
  • The CD deployed version should have sample files ready to go (added in the deployment pipeline)
sebinside added the bug, enhancement, major, PISE-SS22, and report-viewer labels on Apr 11, 2022
sebinside added this to the v4.0.0 milestone on Apr 11, 2022
tsaglam pinned this issue on May 6, 2022
@tsaglam (Member) commented May 11, 2022

Just a few things that I noticed (I did not edit the post above so there is no confusion about what has changed):

  • Special characters are not supported in the source code view (Russian or Korean characters for example)
  • The code comparison view always shows "submission 1" and "submission 2" but should show the actual submission names, except when the names are hidden.
  • The main screen where the files are dropped should contain a line of text that states that no files get uploaded anywhere

@tsaglam (Member) commented Jun 10, 2022

Currently, there is no error message when files are missing. This will be explicitly the case after merging #452.
We should add an error message if files are missing or a warning if the user clicks a comparison that has not been exported because of the CLI argument -n.

@PascalKrieg (Contributor) commented:

When trying to open a zip file that does not contain an overview.json at the top level, a white page is shown.
This can only be resolved by a full reload. Using the "previous page" browser button does not fix the viewer. This is similar behaviour to missing comparison files from the previous comment.

Currently, there is no error message when files are missing. This will be explicitly the case after merging #452.
We should add an error message if files are missing or a warning if the user clicks a comparison that has not been exported because of the CLI argument -n.

If a correct and complete zip has been used before without reloading the page, the content of the previous zip is shown instead without warning.

There should be error handling when opening a zip file that checks if it contains the required files in the correct place.
If necessary, the zip should be rejected with an error message explaining the issue.
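
A rough sketch of the kind of check that could be added (the viewer itself is web-based, so the real check would live in its frontend code; this Java version only illustrates the logic, and the class and method names are made up):

import java.io.File;
import java.io.IOException;
import java.util.zip.ZipFile;

// Illustrative sketch only: reject a report zip that has no overview.json at
// the top level instead of silently showing a white page.
public class ReportZipValidator {

    public static void requireOverview(File reportZip) throws IOException {
        try (ZipFile zip = new ZipFile(reportZip)) {
            if (zip.getEntry("overview.json") == null) {
                throw new IllegalArgumentException(
                        "Invalid report: " + reportZip.getName() + " contains no overview.json at the top level.");
            }
        }
    }
}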

@nestabentum (Contributor) commented Jun 13, 2022

About 'Include max metric in JSON files and in the frontend'.

What max metric exactly is this about?
SimilarityMetrics is an attribute in JPlagOptions.java as well as ClusteringOptions.java. I guess this is about the JPlagOptions attribute as this is also displayed in the ReportViewer (as seen in the following picture):

[Screenshot: report viewer, 2022-06-13]

Now what I am wondering about is that the SimilarityMetric of JPlagOptions cannot be configured by the user via the CLI flags. And the code itself doesn't change the JPlagOptions#similarityMetric either - it is always the default value SimilarityMetric#AVG.
Another mismatch I cannot explain is that the ReportViewer JSON model expects a list of metric Objects (currently only filled with the one metric from the JPlagOptions in ReportObjectFactory#getMetrics).

What exactly is the task/expected behaviour here? Should the user be enabled to run JPlag with multiple metrics he can define via CLI flags?

@tsaglam (Member) commented Jun 13, 2022

What max metric exactly is this about?

"The maximum similarity is defined as the maximum of both program coverages. This ranking is especially useful if the programs are very different in size. This can happen when dead code was inserted to disguise the origin of the plagiarized program." (from the legacy report)

Imagine student A creates a copy of the submission of student B and then inserts random methods/classes/fields. If they add enough LOC, the average similarity will be pretty low; however, the maximum similarity will still be pretty high. The maximum similarity can be retrieved by default from any JPlagComparison.
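
To make this concrete, here is a small illustrative calculation (not necessarily JPlag's exact formula; it assumes each submission's coverage is its matched tokens divided by its total tokens, AVG is the mean of the two coverages, and MAX is the larger one; all numbers are made up):

// Illustrative only: shows how padding one submission with dead code lowers
// the average similarity while the maximum stays high.
public class AvgVsMaxExample {
    public static void main(String[] args) {
        int tokensA = 200;       // original submission
        int tokensB = 1800;      // copy padded with dead code
        int matchedTokens = 180; // tokens covered by matches

        double coverageA = (double) matchedTokens / tokensA; // 0.90
        double coverageB = (double) matchedTokens / tokensB; // 0.10
        double avg = (coverageA + coverageB) / 2;            // 0.50
        double max = Math.max(coverageA, coverageB);         // 0.90

        System.out.printf("AVG = %.2f, MAX = %.2f%n", avg, max);
    }
}

Here the padding halves the average similarity, while the maximum still clearly flags the copy.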

SimilarityMetrics is an attribute in JPlagOptions.java as well as ClusteringOptions.java. I guess this is about the JPlagOptions attribute as this is also displayed in the ReportViewer (as seen in the following picture)?

Actually, it is not about the attribute in the JPlagOptions, but rather about the method in JPlagComparison.

Now what I am wondering about is that the SimilarityMetric of JPlagOptions cannot be configured by the user via the CLI flags. And the code itself doesn't change the JPlagOptions#similarityMetric either - it is always the default value SimilarityMetric#AVG. Another mismatch I cannot explain is that the ReportViewer JSON model expects a list of metric Objects (currently only filled with the one metric from the JPlagOptions in ReportObjectFactory#getMetrics).

You are right, the similarity metric cannot be changed via the CLI, but it can be set when using JPlag programmatically (e.g., as in Artemis). It controls which similarity metric is used for the similarity threshold -m.

I think the JSON model was designed in a way so it is open for extension. The list of metrics should (hopefully) control the buttons at the top of your screenshot. I am not too familiar with the web report generation, so this might not be true and the button at the top is hardcoded.

What exactly is the task/expected behaviour here? Should the user be enabled to run JPlag with multiple metrics he can define via CLI flags?

The user should not be required to do anything via the CLI. The report viewer should show two buttons AVG and MAX (which is hopefully controlled by ReportObjectFactory#getMetrics?), and when the user clicks one of the buttons, the report should change which metric is used in the histogram and the list of tuples at the right side of the screen.
The question mark on the button should offer an explanation for each metric (can be taken from here), but currently this does not even work for the AVG button.
On a comparison page of a tuple, the average similarity is currently shown titled "Match %". Here, toggling should probably not change anything. However, we should clarify that it is the avg similarity and then show the similarities of both submissions, as seen in the legacy example report:
[Screenshot: legacy example report showing the similarities of both submissions]

For the clustering, the similarity metric probably defines which similarity is used to calculate the clustering. This is where the enum SimilarityMetric is used apart from the options. The enum constants then map to the methods in the JPlagComparison class.
In the long term, the similarity metric buttons should also affect the clustering shown, meaning there should always be two clusterings calculated and thus persisted in the JSON files. This, however, has low priority, as avg similarity clustering in the report would be sufficient as a first step.

On a side note, the clustering options are a bit overkill and most of them will rarely be used by a CLI user. Maybe in the future we will hide some of them in the CLI.

@nestabentum (Contributor) commented:

Thanks, Timur - those were already some pretty helpful hints!
I still have a few questions left regarding the max metric:

  • What is the relation between SimilarityMetric#MAX and the "max metric" I'm trying to incorporate into the UI? Suppose I run JPlag with JPlagOptions#similarityMetric set to MAX - would the resulting top comparisons and the distribution be identical to what the desired Max Metric Button should show?
  • Suppose I run JPlag with JPlagOptions#similarityMetric set to AVG with a threshold of x%. What threshold is the Max Metric Button supposed to show? None?

@tsaglam (Member) commented Jun 28, 2022

@nestabentum

What is the relation between SimilarityMetric#MAX and the "max metric" I'm trying to incorporate into the UI?

Semantically, they are the same metric on paper. A comparison of a submission tuple has two similarities, one based on the length of each submission in the tuple. We can then take either the average of those two or the maximum.

However, SimilarityMetric was introduced by a student for the clustering, to control which similarity is taken for the cluster calculation. Besides that, this enum only affects the -m CLI argument, which does not directly affect the report and thus does not need to be considered by the viewer. So there is no relation between SimilarityMetric#MAX and the "max metric" you are trying to incorporate into the UI.

Suppose I run JPlag with JPlagOptions#similarityMetric set to MAX - would the resulting top comparisons and the distribution be identical to what the desired Max Metric Button should show?

No, JPlagOptions#similarityMetric should barely affect the report. It should only affect the clustering and which metric submission tuples are filtered by in the JPlag core. Your max metric in the report should just show the values of JPlagComparison#maximalSimilarity() instead of JPlagComparison#similarity(). This means both values need to be persisted in the JSON files for each comparison, and the metric button in the report just toggles which one is shown.
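
A minimal sketch of what persisting both values could look like (the record and field names are assumptions, not the actual report schema):

// Hypothetical DTO sketch, not the actual report schema: both similarities are
// written once per comparison so the AVG/MAX buttons in the viewer only toggle
// which stored value is displayed.
public record ComparisonDto(
        String firstSubmission,
        String secondSubmission,
        double averageSimilarity, // value of JPlagComparison#similarity()
        double maximumSimilarity  // value of JPlagComparison#maximalSimilarity()
) { }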

Suppose I run JPlag with JPlagOptions#similarityMetric set to AVG with a threshold of x%. What threshold is the Max Metric Button supposed to show? None?

This does not need to be considered in the report viewer, as this configuration just affects the JPlagResult object. The report should not show the -m threshold.

Hope that helps.

@nestabentum (Contributor) commented:

@tsaglam that helps a lot, thanks!

I have another follow-up question: is JPlagComparison#similarity semantically the same as the "average metric"?
If so, the report viewer DTO creation is faulty right now, as the name for the button is fetched from JPlagResult#getOptions#getSimilarityMetric#name and the corresponding top comparisons and distribution are calculated from JPlagComparison#similarity.
So if JPlagOptions#similarityMetric were set to MAX, for example, the button would be named "MAX" but the distribution and top comparisons would display the average data. Fetching the name for the button directly from SimilarityMetric#AVG#name would solve that.
Did I get this right?

@tsaglam (Member) commented Jun 29, 2022

@nestabentum

I have another follow-up question: is JPlagComparison#similarity semantically the same as the "average metric"? If so, the report viewer DTO creation is faulty right now, as the name for the button is fetched from JPlagResult#getOptions#getSimilarityMetric#name and the corresponding top comparisons and distribution are calculated from JPlagComparison#similarity.

Yes, JPlagComparison#similarity is the average similarity. The SimilarityMetric enum just maps to the functions of the JPlagComparison class:

AVG(JPlagComparison::similarity),
MIN(JPlagComparison::minimalSimilarity),
MAX(JPlagComparison::maximalSimilarity),
INTERSECTION(it -> (float) it.getNumberOfMatchedTokens());

This means you are right, this is faulty! Sorry, this bug might have caused some confusion. Basically, the button names should not be queried from JPlagResult#getOptions#getSimilarityMetric. Maybe we just need to remove this option altogether.

So if JPlagOptions#similarityMetric were set to MAX, for example, the button would be named "MAX" but the distribution and top comparisons would display the average data. Fetching the name for the button directly from SimilarityMetric#AVG#name would solve that. Did I get this right?

Yes, exactly!

@nestabentum (Contributor) commented:

About 'Revisit JSON file generation when comparing large submissions and remove code redundancy in JSON files'

I implemented an approach to reduce the report's file size: Code lines in a comparison file are not saved as plain text any more, but as numbers. These numbers are indices to a lookup table. This lookup table is global, i.e., it contains an index -> code line mapping, for every code line across all report files. This lookup table is persisted in an additional file lookupTable.json and will have to be used by the report viewer to resolve code lines before displaying them.

Here are some example files.
If you want to, you can have a look at the implementation here.
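
For reference, here is a minimal sketch of how such a global lookup table could work (this is not the linked implementation; the class and method names are made up):

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch, not the actual implementation: each distinct code line is
// stored exactly once, comparison files keep only the indices, and the table
// itself is what would be written to lookupTable.json.
public class LineLookupTable {
    private final Map<String, Integer> indexByLine = new LinkedHashMap<>();

    // Returns the index of the given code line, registering it if unseen.
    public int indexOf(String codeLine) {
        return indexByLine.computeIfAbsent(codeLine, line -> indexByLine.size());
    }

    // Replaces the lines of one file by their lookup indices.
    public List<Integer> encode(List<String> fileLines) {
        return fileLines.stream().map(this::indexOf).toList();
    }

    // The table to persist: entry i is the code line with index i.
    public List<String> toTable() {
        return List.copyOf(indexByLine.keySet());
    }
}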

First of all, of course, I would appreciate your feedback on this approach.
Second, I wonder: What are the requirements concerning backwards compatibility? If I were to change the file format so that the UI would have to resolve code lines before displaying them - would it have to retain the ability to process the old file format that needs no resolving? In general, how backwards compatible should the UI be whenever there are changes to the file format?

@tsaglam (Member) commented Jul 4, 2022

First of all, of course, I would appreciate your feedback on this approach.

@sebinside

Second, I wonder: What are the requirements concerning backwards compatibility? If I were to change the file format so that the UI would have to resolve code lines before displaying them - would it have to retain the ability to process the old file format that needs no resolving? In general, how backwards compatible should the UI be whenever there are changes to the file format?

@nestabentum you mean right now? Or in general?
Right now, we can change anything and do not need to provide any kind of backward compatibility, as we have not released the new report viewer yet. When we release it with v4.0.0 we should be more careful about backward compatibility and only break it if required, and then we need to release a new major version.

@dfuchss (Member) commented Jul 4, 2022

I implemented an approach to reduce the report's file size: Code lines in a comparison file are not saved as plain text any more, but as numbers. These numbers are indices to a lookup table. This lookup table is global, i.e., it contains an index -> code line mapping, for every code line across all report files. This lookup table is persisted in an additional file lookupTable.json and will have to be used by the report viewer to resolve code lines before displaying them.

Just two questions :)

  1. How shall the report viewer use multiple files? For now it can only handle the one json afaik, or am I wrong? What would be the workflow?
  2. The idea with the indices is good from my point of view! 🙂 Nevertheless, why do you need a lookup table? From my point of view, the json / data has to contain the complete submissions (at least the files with matches, but maybe even all files) since we (as instructors of a course) are also interested in files that have no matches / the parts of the files that do not match. (I would like to keep this feature)

And one comment / something we could think about:
To reduce the amount of data in the JSON, I would suggest compressing the submission data (the students' source code) and saving it in a (maybe base64-encoded) blob in the JSON. Nevertheless, this has some disadvantages (e.g., blobs in JSON :( ).
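
For illustration, a sketch of that idea (purely hypothetical, not an agreed-on format): gzip the source code and store the resulting base64 string in the JSON.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPOutputStream;

// Purely illustrative sketch of the suggestion above: compress a submission's
// source code with gzip and encode it as a base64 blob for embedding in JSON.
public class SourceBlobEncoder {

    public static String toBase64Blob(String sourceCode) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(sourceCode.getBytes(StandardCharsets.UTF_8));
        }
        return Base64.getEncoder().encodeToString(buffer.toByteArray());
    }
}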

@sebinside (Member, Author) commented:

@nestabentum I also welcome the idea of the lookup table but would propose creating one table per submission to keep the file structure clearer. The only argument against this would be that the overlap is so high that merging them significantly reduces the file size, and not only for our fake testing data but for real submission sets.

Regarding backwards compatibility, as Timur said, we can ignore this right now.

@nestabentum (Contributor) commented:

@dfuchss

  1. How shall the report viewer use multiple files? For now it can only handle the one json afaik, or am I wrong? What would be the workflow?

Right now, the report viewer can handle a zip file that contains an overview.json and multiple comparison files that follow the naming scheme submission1-submission2.json. A comparison file contains the complete contents of all files of both submissions, leading to great redundancy, as each submission is compared to all other submissions.
In addition to handling a zip folder, the report viewer can also handle a drop of only a single overview.json or submission1-submission2.json, displaying only the corresponding overview or comparison.
This would not be possible anymore with the new lookup format, @sebinside.

  1. The idea with the indices is good from my point of view! 🙂 Nevertheless, why do you need a lookup table? From my point of view, the json / data has to contain the complete submissions (at least the files with matches, but maybe even all files) since we (as instructors of a course) are also interested in files that have no matches / the parts of the files that do not match. (I would like to keep this feature)

As previously mentioned, a submission1-submission2.json comparison file contains all contents of all files of both submissions. This would still be the case after my proposal, but instead of the actual contents, the JSON model of a submission's file would contain references to the lookup table.

And one comment / something we could think about:
To reduce the amount of data in the JSON, I would suggest compressing the submission data (the students' source code) and saving it in a (maybe base64-encoded) blob in the JSON. Nevertheless, this has some disadvantages (e.g., blobs in JSON :( ).

I have little expertise to offer here; what's your opinion, @sebinside?

@dfuchss (Member) commented Jul 8, 2022

I would suggest to discuss this in the development meeting next week as well :)

@nestabentum (Contributor) commented:

@sebinside

@nestabentum I also welcome the idea of the lookup table but would propose creating one table per submission to keep the file structure clearer. The only argument against this would be that the overlap is so high that merging them significantly reduces the file size, and not only for our fake testing data but for real submission sets.

Yes, I thought about that too... I thought it would be a nice-to-have to get a file structure that is somewhat more clearly organized. But then again, submission files do have quite an overlap, concerning e.g. imports and base code, so I went with the global table.

@tsaglam (Member) commented Aug 22, 2022

@nestabentum @sebinside should the clustering be working right now? I can't get JPlag to export clusters via the overview.json file.

@nestabentum (Contributor) commented:

@tsaglam it should work. Do you have some more information on your concrete case?

@tsaglam (Member) commented Aug 22, 2022

I compared a set of metamodels with the newly added EMF frontend. Unfortunately, I cannot share my files, as these are student submissions. I realize now this might be a problem with the EMF frontend and thus my own doing. I will test it with another frontend later, but there it will probably work.

EDIT:
@nestabentum, sadly the problem also persists with Java files.
JPlag version: 1c401de (current master)
Input: PartialPlagiarism.zip
Result: result.zip (overview.json contains no clusters)
CLI parameters tried:
java -jar jplag.jar ./PartialPlagiarism
and
java -jar jplag.jar ./PartialPlagiarism --cluster-alg AGGLOMERATIVE --cluster-metric AVG

@sebinside (Member, Author) commented:

This issue has been replaced by #710.

sebinside unpinned this issue on Sep 29, 2022