Enhance the new Report Viewer #357

Closed · 14 of 30 tasks
sebinside opened this issue Apr 11, 2022 · 20 comments

Labels: bug (Issue/PR that involves a bug), enhancement (Issue/PR that involves features, improvements and other changes), major (Major issue/feature/contribution/change), PISE-SS22 (Tasks for the PISE practical course (SS22)), report-viewer (PR / Issue deals (partly) with the report viewer and thus involves web-dev technologies)

Comments

@sebinside (Member) commented Apr 11, 2022

In #287, we added a new web-based report viewer developed from scratch. However, many enhancements and fixes remain before the viewer is release-ready for version 4.0:

General

JSON Generator

Fixes

Report Viewer

Fixes

Enhancements

  • (Additionally) zip all files after the creation
  • Switching between avg and max metric freezes the UI if there are many submissions in the overview
  • Built report viewer files shall be deployed with results (see Replace the outdated HTML report generation #192)
  • Include and enforce eslint
  • Accept multiple files as input
  • Show a warning if some/all comparison files are missing
  • Add a "back to overview" button to the comparison view
  • Reverse sorting order of the histogram (and fix intervals being closed)
  • Change the color scheme of matches (something without implicit semantics)
  • Enhance file comparison view: Find a way to better display the sidebar (currently quite hidden)
  • Change comparison view file order, e.g., based on packages rather than alphabetical
  • Add button "collapse all" to the comparison view
  • Highlight base code, e.g., gray
  • Enable collapsing by clicking on the file header in the comparison view
  • Dark mode (e.g., by user choice or even by default)
  • Enhance overall frontend design/impression
  • The CD deployed version should have sample files ready to go (added in the deployment pipeline)
sebinside added the bug, enhancement, major, PISE-SS22, and report-viewer labels on Apr 11, 2022
sebinside added this to the v4.0.0 milestone on Apr 11, 2022
tsaglam pinned this issue on May 6, 2022
@tsaglam (Member) commented May 11, 2022

Just a few things that I noticed (I did not edit the post above so there is no confusion about what has changed):

  • Special characters are not supported in the source code view (Russian or Korean characters for example)
  • The code comparison view always shows "submission 1" and "submission 2" but should show the actual submission names, except when the names are hidden.
  • The main screen where the files are dropped should contain a line of text that states that no files get uploaded anywhere

@tsaglam (Member) commented Jun 10, 2022

Currently, there is no error message when files are missing. This will be explicitly the case after merging #452.
We should add an error message if files are missing or a warning if the user clicks a comparison that has not been exported because of the CLI argument -n.

@PascalKrieg (Contributor) commented:

When trying to open a zip file that does not contain an overview.json at the top level, a white page is shown.
This can only be resolved by a full reload. Using the "previous page" browser button does not fix the viewer. This is similar behaviour to missing comparison files from the previous comment.

Currently, there is no error message when files are missing. This will be explicitly the case after merging #452.
We should add an error message if files are missing or a warning if the user clicks a comparison that has not been exported because of the CLI argument -n.

If a correct and complete zip has been used before without reloading the page, the content of the previous zip is shown instead without warning.

There should be error handling when opening a zip file that checks if it contains the required files in the correct place.
If necessary, the zip should be rejected with an error message explaining the issue.
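
A rough sketch of the kind of check that could be added (the viewer itself is web-based, so the real check would live in its frontend code; this Java version only illustrates the logic, and the class and method names are made up):

import java.io.File;
import java.io.IOException;
import java.util.zip.ZipFile;

// Illustrative sketch only: reject a report zip that has no overview.json at
// the top level instead of silently showing a white page.
public class ReportZipValidator {

    public static void requireOverview(File reportZip) throws IOException {
        try (ZipFile zip = new ZipFile(reportZip)) {
            if (zip.getEntry("overview.json") == null) {
                throw new IllegalArgumentException(
                        "Invalid report: " + reportZip.getName() + " contains no overview.json at the top level.");
            }
        }
    }
}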

@nestabentum (Contributor) commented Jun 13, 2022

About 'Include max metric in JSON files and in the frontend'.

What max metric exactly is this about?
SimilarityMetrics is an attribute in JPlagOptions.java as well as ClusteringOptions.java. I guess this is about the JPlagOptions attribute as this is also displayed in the ReportViewer (as seen in the following picture):

[Screenshot: report viewer, 2022-06-13]

Now what I am wondering about is that the SimilarityMetric of JPlagOptions cannot be configured by the user via the CLI flags. And the code itself doesn't change the JPlagOptions#similarityMetric either - it is always the default value SimilarityMetric#AVG.
Another mismatch I cannot explain is that the ReportViewer JSON model expects a list of metric Objects (currently only filled with the one metric from the JPlagOptions in ReportObjectFactory#getMetrics).

What exactly is the task/expected behaviour here? Should the user be enabled to run JPlag with multiple metrics he can define via CLI flags?

@tsaglam (Member) commented Jun 13, 2022

What max metric exactly is this about?

"The maximum similarity is defined as the maximum of both program coverages. This ranking is especially useful if the programs are very different in size. This can happen when dead code was inserted to disguise the origin of the plagiarized program." (from the legacy report)

Imagine student A creates a copy of the submission of student B and then inserts random methods/classes/fields. If they add enough LOC, the average similarity will be pretty low; however, the maximum similarity will still be pretty high. The maximum similarity can be retrieved by default from any JPlagComparison.
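
To make this concrete, here is a small illustrative calculation (not necessarily JPlag's exact formula; it assumes each submission's coverage is its matched tokens divided by its total tokens, AVG is the mean of the two coverages, and MAX is the larger one; all numbers are made up):

// Illustrative only: shows how padding one submission with dead code lowers
// the average similarity while the maximum stays high.
public class AvgVsMaxExample {
    public static void main(String[] args) {
        int tokensA = 200;       // original submission
        int tokensB = 1800;      // copy padded with dead code
        int matchedTokens = 180; // tokens covered by matches

        double coverageA = (double) matchedTokens / tokensA; // 0.90
        double coverageB = (double) matchedTokens / tokensB; // 0.10
        double avg = (coverageA + coverageB) / 2;            // 0.50
        double max = Math.max(coverageA, coverageB);         // 0.90

        System.out.printf("AVG = %.2f, MAX = %.2f%n", avg, max);
    }
}

Here the padding halves the average similarity, while the maximum still clearly flags the copy.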

SimilarityMetrics is an attribute in JPlagOptions.java as well as ClusteringOptions.java. I guess this is about the JPlagOptions attribute as this is also displayed in the ReportViewer (as seen in the following picture)?

Actually, it is not about the attribute in the JPlagOptions, but rather about the method in JPlagComparison.

Now what I am wondering about is that the SimilarityMetric of JPlagOptions cannot be configured by the user via the CLI flags. And the code itself doesn't change the JPlagOptions#similarityMetric either - it is always the default value SimilarityMetric#AVG. Another mismatch I cannot explain is that the ReportViewer JSON model expects a list of metric Objects (currently only filled with the one metric from the JPlagOptions in ReportObjectFactory#getMetrics).

You are right, the similarity metric cannot be changed via the CLI, but it can be set when using JPlag programmatically (e.g., as in Artemis). It controls which similarity metric is used for the similarity threshold -m.

I think the JSON model was designed in a way so it is open for extension. The list of metrics should (hopefully) control the buttons at the top of your screenshot. I am not too familiar with the web report generation, so this might not be true and the button at the top is hardcoded.

What exactly is the task/expected behaviour here? Should the user be enabled to run JPlag with multiple metrics he can define via CLI flags?

The user should not be required to do anything via the CLI. The report viewer should show two buttons AVG and MAX (which is hopefully controlled by ReportObjectFactory#getMetrics?), and when the user clicks one of the buttons, the report should change which metric is used in the histogram and the list of tuples at the right side of the screen.
The question mark on the button should offer an explanation for each metric (can be taken from here), but currently this does not even work for the AVG button.
On a comparison page of a tuple, the average similarity is currently shown titled "Match %". Here, toggling should probably not change anything. However, we should clarify that it is the avg similarity and then show the similarities of both submissions, as seen in the legacy example report:
[Screenshot: legacy example report showing the similarities of both submissions]

For the clustering, the similarity metric probably defines which similarity is used to calculate the clustering. This is where the enum SimilarityMetric is used apart from the options. The enum constants then map to the methods in the JPlagComparison class.
In the long term, the similarity metric buttons should also affect the clustering shown, meaning there should always be two clusterings calculated and thus persisted in the JSON files. This, however, has low priority, as avg similarity clustering in the report would be sufficient as a first step.

On a side note, the clustering options are a bit overkill and most of them will rarely be used by a CLI user. Maybe in the future we will hide some of them in the CLI.

@nestabentum (Contributor) commented:

Thanks, Timur - those were already some pretty helpful hints!
I still have a few questions left regarding the max metric:

  • What is the relation between SimilarityMetric#MAX and the "max metric" I'm trying to incorporate into the UI? Suppose I run JPlag with JPlagOptions#similarityMetric set to MAX - would the resulting top comparisons and the distribution be identical to what the desired Max Metric Button should show?
  • Suppose I run JPlag with JPlagOptions#similarityMetric set to AVG with a threshold of x%. What threshold is the Max Metric Button supposed to show? None?

@tsaglam (Member) commented Jun 28, 2022

@nestabentum

What is the relation between SimilarityMetric#MAX and the "max metric" I'm trying to incorporate into the UI?

Semantically, they are the same metric on paper. A comparison of a submission tuple has two similarities, one based on the length of each submission in the tuple. We can then take either the average of those two or the maximum.

However, SimilarityMetric was introduced by a student for the clustering, to control which similarity is taken for the cluster calculation. Besides that, this enum only affects the -m CLI argument, which does not directly affect the report and thus does not need to be considered by the viewer. So there is no relation between SimilarityMetric#MAX and the "max metric" you are trying to incorporate into the UI.

Suppose I run JPlag with JPlagOptions#similarityMetric set to MAX - would the resulting top comparisons and the distribution be identical to what the desired Max Metric Button should show?

No, JPlagOptions#similarityMetric should barely affect the report. It should only affect the clustering and which metric submission tuples are filtered by in the JPlag core. Your max metric in the report should just show the values of JPlagComparison#maximalSimilarity() instead of JPlagComparison#similarity(). This means both values need to be persisted in the JSON files for each comparison, and the metric button in the report just toggles which one is shown.
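
A minimal sketch of what persisting both values could look like (the record and field names are assumptions, not the actual report schema):

// Hypothetical DTO sketch, not the actual report schema: both similarities are
// written once per comparison so the AVG/MAX buttons in the viewer only toggle
// which stored value is displayed.
public record ComparisonDto(
        String firstSubmission,
        String secondSubmission,
        double averageSimilarity, // value of JPlagComparison#similarity()
        double maximumSimilarity  // value of JPlagComparison#maximalSimilarity()
) { }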

Suppose I run JPlag with JPlagOptions#similarityMetric set to AVG with a threshold of x%. What threshold is the Max Metric Button supposed to show? None?

This does not need to be considered in the report viewer, as this configuration just affects the JPlagResult object. The report should not show the -m threshold.

Hope that helps.

@nestabentum (Contributor) commented:

@tsaglam that helps a lot, thanks!

I have another follow-up question: is JPlagComparison#similarity semantically the same as the "average metric"?
If so, the report viewer DTO creation is faulty right now, as the name for the button is fetched from JPlagResult#getOptions#getSimilarityMetric#name and the corresponding top comparisons and distribution are calculated from JPlagComparison#similarity.
So if JPlagOptions#similarityMetric were set to MAX, for example, the button would be named "MAX" but the distribution and top comparisons would display the average data. Fetching the name for the button directly from SimilarityMetric#AVG#name would solve that.
Did I get this right?

@tsaglam (Member) commented Jun 29, 2022

@nestabentum

I have another follow-up question: is JPlagComparison#similarity semantically the same as the "average metric"? If so, the report viewer DTO creation is faulty right now, as the name for the button is fetched from JPlagResult#getOptions#getSimilarityMetric#name and the corresponding top comparisons and distribution are calculated from JPlagComparison#similarity.

Yes, JPlagComparison#similarity is the average similarity. The SimilarityMetric enum just maps to the functions of the JPlagComparison class:

AVG(JPlagComparison::similarity),
MIN(JPlagComparison::minimalSimilarity),
MAX(JPlagComparison::maximalSimilarity),
INTERSECTION(it -> (float) it.getNumberOfMatchedTokens());

This means you are right, this is faulty! Sorry, this bug might have caused some confusion. Basically, the button names should not be queried from JPlagResult#getOptions#getSimilarityMetric. Maybe we just need to remove this option altogether.

So if JPlagOptions#similarityMetric were set to MAX, for example, the button would be named "MAX" but the distribution and top comparisons would display the average data. Fetching the name for the button directly from SimilarityMetric#AVG#name would solve that. Did I get this right?

Yes, exactly!

@nestabentum (Contributor) commented:

About 'Revisit JSON file generation when comparing large submissions and remove code redundancy in JSON files'

I implemented an approach to reduce the report's file size: Code lines in a comparison file are not saved as plain text any more, but as numbers. These numbers are indices to a lookup table. This lookup table is global, i.e., it contains an index -> code line mapping, for every code line across all report files. This lookup table is persisted in an additional file lookupTable.json and will have to be used by the report viewer to resolve code lines before displaying them.

Here are some example files.
If you want to, you can have a look at the implementation here.
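
For reference, here is a minimal sketch of how such a global lookup table could work (this is not the linked implementation; the class and method names are made up):

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch, not the actual implementation: each distinct code line is
// stored exactly once, comparison files keep only the indices, and the table
// itself is what would be written to lookupTable.json.
public class LineLookupTable {
    private final Map<String, Integer> indexByLine = new LinkedHashMap<>();

    // Returns the index of the given code line, registering it if unseen.
    public int indexOf(String codeLine) {
        return indexByLine.computeIfAbsent(codeLine, line -> indexByLine.size());
    }

    // Replaces the lines of one file by their lookup indices.
    public List<Integer> encode(List<String> fileLines) {
        return fileLines.stream().map(this::indexOf).toList();
    }

    // The table to persist: entry i is the code line with index i.
    public List<String> toTable() {
        return List.copyOf(indexByLine.keySet());
    }
}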

First of all, of course, I would appreciate your feedback on this approach.
Second, I wonder: What are the requirements concerning backwards compatibility? If I were to change the file format so that the UI would have to resolve code lines before displaying them - would it have to retain the ability to process the old file format that needs no resolving? In general, how backwards compatible should the UI be whenever there are changes to the file format?

@tsaglam (Member) commented Jul 4, 2022

First of all, of course, I would appreciate your feedback on this approach.

@sebinside

Second, I wonder: What are the requirements concerning backwards compatibility? If I were to change the file format so that the UI would have to resolve code lines before displaying them - would it have to retain the ability to process the old file format that needs no resolving? In general, how backwards compatible should the UI be whenever there are changes to the file format?

@nestabentum you mean right now? Or in general?
Right now, we can change anything and do not need to provide any kind of backward compatibility, as we have not released the new report viewer yet. When we release it with v4.0.0 we should be more careful about backward compatibility and only break it if required, and then we need to release a new major version.

@dfuchss (Member) commented Jul 4, 2022

I implemented an approach to reduce the report's file size: Code lines in a comparison file are not saved as plain text any more, but as numbers. These numbers are indices to a lookup table. This lookup table is global, i.e., it contains an index -> code line mapping, for every code line across all report files. This lookup table is persisted in an additional file lookupTable.json and will have to be used by the report viewer to resolve code lines before displaying them.

Just two questions :)

  1. How shall the report viewer use multiple files? For now it can only handle the one json afaik, or am I wrong? What would be the workflow?
  2. The idea with the indices is good from my point of view! 🙂 Nevertheless, why do you need a lookup table? From my point of view, the json / data has to contain the complete submissions (at least the files with matches, but maybe even all files) since we (as instructors of a course) are also interested in files that have no matches / the parts of the files that do not match. (I would like to keep this feature)

And one comment / something we could think about:
To reduce the amount of data in the JSON, I would suggest compressing the submission data (the students' source code) and saving it in a (maybe base64-encoded) blob in the JSON. Nevertheless, this has some disadvantages (e.g., blobs in JSON :( ).
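
For illustration, a sketch of that idea (purely hypothetical, not an agreed-on format): gzip the source code and store the resulting base64 string in the JSON.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPOutputStream;

// Purely illustrative sketch of the suggestion above: compress a submission's
// source code with gzip and encode it as a base64 blob for embedding in JSON.
public class SourceBlobEncoder {

    public static String toBase64Blob(String sourceCode) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(sourceCode.getBytes(StandardCharsets.UTF_8));
        }
        return Base64.getEncoder().encodeToString(buffer.toByteArray());
    }
}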

@sebinside (Member, Author) commented:

@nestabentum I also welcome the idea of the lookup table but would propose creating one table per submission to keep the file structure clearer. The only argument against this would be that the overlap is so high that merging them significantly reduces the file size, and not only for our fake testing data but for real submission sets.

Regarding backwards compatibility, as Timur said, we can ignore this right now.

@nestabentum (Contributor) commented:

@dfuchss

  1. How shall the report viewer use multiple files? For now it can only handle the one json afaik, or am I wrong? What would be the workflow?

Right now, the report viewer can handle a zip file that contains an overview.json and multiple comparison files that follow the naming scheme submission1-submission2.json. A comparison file contains the complete contents of all files of both submissions, leading to great redundancy, as each submission is compared to all other submissions.
In addition to handling a zip folder, the report viewer can also handle a drop of only a single overview.json or submission1-submission2.json, displaying only the corresponding overview or comparison.
This would not be possible anymore with the new lookup format, @sebinside.

  1. The idea with the indices is good from my point of view! 🙂 Nevertheless, why do you need a lookup table? From my point of view, the json / data has to contain the complete submissions (at least the files with matches, but maybe even all files) since we (as instructors of a course) are also interested in files that have no matches / the parts of the files that do not match. (I would like to keep this feature)

As previously mentioned, a submission1-submission2.json comparison file contains all contents of all files of both submissions. This would still be the case after my proposal, but instead of the actual contents, the JSON model of a submission's file would contain references to the lookup table.

And one comment / something we could think about:
To reduce the amount of data in the JSON, I would suggest compressing the submission data (the students' source code) and saving it in a (maybe base64-encoded) blob in the JSON. Nevertheless, this has some disadvantages (e.g., blobs in JSON :( ).

I have little expertise to offer here; what's your opinion, @sebinside?

@dfuchss (Member) commented Jul 8, 2022

I would suggest to discuss this in the development meeting next week as well :)

@nestabentum (Contributor) commented:

@sebinside

@nestabentum I also welcome the idea of the lookup table but would propose creating one table per submission to keep the file structure clearer. The only argument against this would be that the overlap is so high that merging them significantly reduces the file size, and not only for our fake testing data but for real submission sets.

Yes, I thought about that too... I thought it would be a nice-to-have to get a file structure that is somewhat more clearly organized. But then again, submission files do have quite an overlap, concerning e.g. imports and base code, so I went with the global table.

@tsaglam (Member) commented Aug 22, 2022

@nestabentum @sebinside should the clustering be working right now? I can't get JPlag to export clusters via the overview.json file.

@nestabentum (Contributor) commented:

@tsaglam it should work. Do you have some more information on your concrete case?

@tsaglam (Member) commented Aug 22, 2022

I compared a set of metamodels with the newly added EMF frontend. Unfortunately, I cannot share my files, as these are student submissions. I realize now this might be a problem with the EMF frontend and thus my own doing. I will test it with another frontend later, but there it will probably work.

EDIT:
@nestabentum, sadly the problem also persists with Java files.
JPlag version: 1c401de (current master)
Input: PartialPlagiarism.zip
Result: result.zip (overview.json contains no clusters)
CLI parameters tried:
java -jar jplag.jar ./PartialPlagiarism
and
java -jar jplag.jar ./PartialPlagiarism --cluster-alg AGGLOMERATIVE --cluster-metric AVG

@sebinside (Member, Author) commented:

This issue has been replaced by #710.

sebinside unpinned this issue on Sep 29, 2022