Feat request: multiple programming languages #1546

Open
euberdeveloper opened this issue Feb 7, 2024 · 7 comments
Labels
enhancement Issue/PR that involves features, improvements and other changes language PR / Issue deals (partly) with new and/or existing languages for JPlag major Major issue/feature/contribution/change

Comments

@euberdeveloper

As of now it seems that JPlag supports multiple programming languages, but only in a homogeneous way.

This means I can compare two submissions that are both in Java or both in Python, but not one in Java and one in Python.

Comparing across languages may seem pointless, but translating a program from one language to another can itself be a form of obfuscation.

Java and Python may not be the perfect example, but for languages such as Java, Kotlin, and Scala, which all run on the JVM, this issue becomes more relevant.

@tsaglam tsaglam added enhancement Issue/PR that involves features, improvements and other changes major Major issue/feature/contribution/change language PR / Issue deals (partly) with new and/or existing languages for JPlag labels Feb 7, 2024
@tsaglam
Member

tsaglam commented Feb 7, 2024

Good point, this relates to cross-language plagiarism detection. While there has been some research in that area, there are (to my knowledge) no usable tools for it. In the future, we may want to introduce this by creating a shared set of token types for concepts that are common between languages. Language modules could then reuse these token types, which would allow for cross-language support.
On a similar note, we may consider polyglot support, meaning parsing multi-language submissions by delegating the different files to different language modules.
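
A minimal sketch of the polyglot idea, assuming files are routed by extension. `PolyglotDispatcher` and the module names are hypothetical and not part of JPlag's API; a real implementation would hand each group of files to the matching language module rather than return names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical helper: routes the files of one submission to language modules by extension.
final class PolyglotDispatcher {
    // Maps a lower-case file extension (including the dot) to a module name.
    private final Map<String, String> moduleByExtension = new TreeMap<>();

    void register(String moduleName, String... extensions) {
        for (String extension : extensions) {
            moduleByExtension.put(extension.toLowerCase(Locale.ROOT), moduleName);
        }
    }

    // Returns the module responsible for a file, or empty if no module claims it.
    Optional<String> moduleFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return Optional.empty();
        }
        String extension = fileName.substring(dot).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(moduleByExtension.get(extension));
    }

    // Groups the files of a submission by module; unclaimed files are skipped.
    Map<String, List<String>> groupByModule(List<String> fileNames) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String fileName : fileNames) {
            moduleFor(fileName).ifPresent(
                    module -> grouped.computeIfAbsent(module, key -> new ArrayList<>()).add(fileName));
        }
        return grouped;
    }
}
```

Routing by extension is only the simplest possible choice here; files without a registered extension are ignored in this sketch.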

@euberdeveloper
Author

Hello, this has been done in this fork: https://github.com/euberdeveloper/JPlag/tree/feature/multilanguage-plagiarism-detection

A pull request will follow.

@tsaglam
Member

tsaglam commented Jul 15, 2024

We have our own ideas for that, but we are happy to look at yours. Keep in mind that these might be major changes that need to take other upcoming changes and API considerations into account, and that must not break existing features (e.g. token sequence normalization or match merging).

@euberdeveloper
Author

I think what I've done is more of a proof of concept.
The pros so far are:

  • The code shows examples of the changes needed to accept a set of languages as input instead of a single one
  • Each language interface gains the method "supportCrossPlagiarism", which specifies whether that language supports the feature
  • Each language that supports the feature has an additional parser that produces general tokens
  • The code shows that no major changes are needed on the report side
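
The first three points could be sketched roughly as follows. Only the method name "supportCrossPlagiarism" comes from this comment; its signature, the `CrossLanguageCapable` interface, `GeneralTokenType`, `DemoLanguage` and `LanguageSelection` are assumptions made for illustration and are not taken from the fork or from JPlag.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical language-agnostic token types.
enum GeneralTokenType { TYPE_DECLARATION, METHOD_DECLARATION, ASSIGNMENT, LOOP, CONDITIONAL }

// Hypothetical stand-in for a language interface extended with a cross-language flag.
interface CrossLanguageCapable {
    String name();

    // Method name from the comment above; the boolean signature is assumed.
    default boolean supportCrossPlagiarism() {
        return false;
    }

    // Hypothetical additional parser that produces general tokens.
    default List<GeneralTokenType> parseToGeneralTokens(String source) {
        throw new UnsupportedOperationException(name() + " has no general-token parser");
    }
}

// Trivial demo implementation so the selection logic can be exercised.
final class DemoLanguage implements CrossLanguageCapable {
    private final String name;
    private final boolean crossLanguage;

    DemoLanguage(String name, boolean crossLanguage) {
        this.name = name;
        this.crossLanguage = crossLanguage;
    }

    @Override
    public String name() {
        return name;
    }

    @Override
    public boolean supportCrossPlagiarism() {
        return crossLanguage;
    }
}

final class LanguageSelection {
    // Accepts a set of languages instead of one; a mixed run requires every language to opt in.
    static Set<String> validate(List<CrossLanguageCapable> languages) {
        if (languages.isEmpty()) {
            throw new IllegalArgumentException("At least one language is required");
        }
        Set<String> names = new TreeSet<>();
        for (CrossLanguageCapable language : languages) {
            if (languages.size() > 1 && !language.supportCrossPlagiarism()) {
                throw new IllegalArgumentException(language.name() + " does not support cross-language runs");
            }
            names.add(language.name());
        }
        return names;
    }
}
```

A single-language run stays valid for every module in this sketch, so existing behaviour would not depend on the new flag.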

To speed up the process, I made the single-language front ends first produce their default language-specific tokens and then added a converter that turns those tokens into general ones. I don't recommend this approach: the results are not good, and many issues could be fixed by obtaining language-agnostic tokens directly, parsing the source code from scratch. I will implement this improvement soon.
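
For reference, the two-stage shortcut described above has roughly this shape. The token names and the `TokenConverter` class are made up for illustration and do not reflect the fork's actual tokens; the sketch only shows the structure of the approach, not why its results were poor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical converter from language-specific token names to general ones.
final class TokenConverter {
    // Illustrative mapping; several specific tokens may share one general token.
    static final Map<String, String> JAVA_TO_GENERAL = Map.of(
            "JAVA_CLASS_BEGIN", "TYPE_BEGIN",
            "JAVA_FOR_BEGIN", "LOOP_BEGIN",
            "JAVA_WHILE_BEGIN", "LOOP_BEGIN",
            "JAVA_ASSIGN", "ASSIGN");

    // Second stage: converts already extracted specific tokens, keeping their order.
    static List<String> convert(List<String> specificTokens, Map<String, String> mapping) {
        List<String> generalTokens = new ArrayList<>();
        for (String token : specificTokens) {
            String mapped = mapping.get(token);
            if (mapped != null) {
                generalTokens.add(mapped);
            }
            // Tokens without a general counterpart are dropped in this sketch.
        }
        return generalTokens;
    }
}
```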

@euberdeveloper
Author

Another improvement I want to make is to make the language-agnostic tokens dynamic. Each language would override/implement methods such as "supportsClasses" or "has variable declarations". For example, C would return false for the first method and true for the second. Python would return true for the first and false for the second. Java would return true for both.

Then, the language-agnostic tokenizer for each language would receive the full set of languages for this run as an additional parameter and change its behaviour based on what those languages support. For example, if Java, Python and C are provided, the Java tokenizer will discard class tokens. If only Java and Python are provided as possible languages for this run, the Java tokenizer will emit class tokens.
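
A rough sketch of this capability-based filtering, assuming a token is emitted only if every language in the run supports the feature it needs. The `Feature` enum, the `CapabilityFilter` class, the token names and the lookup tables are hypothetical; the capability values for C, Python and Java follow the example above.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical features a language may or may not support.
enum Feature { CLASSES, VARIABLE_DECLARATIONS }

final class CapabilityFilter {
    // Capabilities as described in the comment: C has no classes, Python has no variable declarations.
    static final Map<String, Set<Feature>> CAPABILITIES = Map.of(
            "c", EnumSet.of(Feature.VARIABLE_DECLARATIONS),
            "python", EnumSet.of(Feature.CLASSES),
            "java", EnumSet.of(Feature.CLASSES, Feature.VARIABLE_DECLARATIONS));

    // Illustrative token names; tokens not listed here need no particular feature.
    static final Map<String, Feature> REQUIRED_FEATURE = Map.of(
            "CLASS", Feature.CLASSES,
            "VARIABLE_DECLARATION", Feature.VARIABLE_DECLARATIONS);

    // Features supported by every language taking part in the run.
    static Set<Feature> commonFeatures(Set<String> languagesInRun) {
        Set<Feature> common = EnumSet.allOf(Feature.class);
        for (String language : languagesInRun) {
            Set<Feature> supported = CAPABILITIES.get(language);
            if (supported == null) {
                throw new IllegalArgumentException("Unknown language: " + language);
            }
            common.retainAll(supported);
        }
        return common;
    }

    // Keeps only the tokens whose required feature is common to all languages in the run.
    static List<String> filter(List<String> tokens, Set<String> languagesInRun) {
        Set<Feature> common = commonFeatures(languagesInRun);
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            Feature needed = REQUIRED_FEATURE.get(token);
            if (needed == null || common.contains(needed)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

With the tokens `CLASS`, `VARIABLE_DECLARATION`, `ASSIGN` from a Java submission, a run over Java, Python and C keeps only `ASSIGN`, while a run over Java and Python keeps `CLASS` and `ASSIGN`.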

@euberdeveloper
Author

I have some work in progress on this.

@tsaglam
Member

tsaglam commented Sep 6, 2024

Note that we have our own plans here that might conflict with yours. But we are always happy to look at your ideas for inspiration.
