Encountered error of preprocess data #127

Open
yingdehuijin opened this issue Jun 30, 2022 · 8 comments

Comments

@yingdehuijin

Hi Uri,
I am using code2seq on the newest EMSE-DeepCom datasets (https://github.com/xing-hu/EMSE-DeepCom). I followed your suggestions and ran the preprocess.sh script, but I encountered errors on the test/val/train datasets. The error_log.txt and stdout show the following information:
b'java.util.concurrent.ExecutionException: com.github.javaparser.ParseProblemException: Encountered unexpected token: ">" ">"\n at line 2, column 407.\n\nWas expecting one of:\n\n
The number of examples also decreased: 20000 test methods decreased to 17060, 20000 validation methods decreased to 17043, and 480000 training methods decreased to 380001. Is there something wrong with the datasets?
Looking forward to your reply!
Wcc
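
To put the reported counts in perspective, here is a small Python sketch that computes how many methods were dropped per split. The split names and the `drop_stats` helper are illustrative only; the counts are the ones quoted above, and treating the 480000-method set as the training split is an assumption.

```python
# Method counts before and after preprocessing, as reported in this issue.
counts = {
    "test": (20000, 17060),
    "val": (20000, 17043),
    "train": (480000, 380001),  # assumed to be the training split
}


def drop_stats(before, after):
    """Return (number dropped, percentage dropped) for one split."""
    dropped = before - after
    return dropped, 100.0 * dropped / before


for split, (before, after) in counts.items():
    dropped, pct = drop_stats(before, after)
    print(f"{split}: dropped {dropped} of {before} ({pct:.1f}%)")
```

Roughly 15% of the test and validation methods and about 21% of the training methods are lost, so the parse failures appear to affect a sizable share of the dataset rather than a handful of examples.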

@urialon
Contributor

urialon commented Jul 3, 2022

Hi @yingdehuijin ,
Thank you for your interest in our work!

I don't know if there is anything wrong with this dataset; I have never used it.

However, it does seem that the files there do not parse. Are they raw Java files? Maybe they have a different format? Our preprocessing pipeline expects raw Java files.

Can you provide a single example from the dataset?

Best,
Uri

@yingdehuijin
Author

Thank you for your reply.
A single example from the dataset looks like this:
code:
public static DecomposableMatchBuilder1 < Float , Float > caseFloat ( MatchesAny f ) { List < Matcher < Object > > matchers = new ArrayList < > ( ) ; matchers . add ( any ( ) ) ; return new DecomposableMatchBuilder1 < > ( matchers , NUM_ , new PrimitiveFieldExtractor < > ( Float . class ) ) ; }
nl:
matches a float .

@urialon
Contributor

urialon commented Jul 14, 2022

Is the "nl: matches a float" line part of the same file?
Our JavaExtractor expects pure Java files and extracts the method names as the labels.
You can replace the existing method name (caseFloat in your example) with a unique ID, remove the "nl: matches a float" line, and later replace the unique ID in the processed files with the natural language sequence that you wish to generate.

See also: #45

Best,
Uri
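
The workaround described in this comment could be sketched in Python roughly as follows. This is only an illustration under several assumptions, not the project's own tooling: the helper names (`index_to_id`, `replace_method_name`, `restore_labels`) and the `mid` prefix are hypothetical; the IDs use lowercase letters only, on the assumption that the extractor may normalize method names; and the processed files are assumed to hold one example per line, with the label as the first space-separated field and its subtokens joined by `|`. In the sample from this thread, the method name is `caseFloat`, and `DecomposableMatchBuilder1` is its return type.

```python
import re
import string


def index_to_id(index, width=6):
    """Encode a non-negative integer as a lowercase, letters-only identifier.

    Letters only, on the assumption that the extractor may normalize
    method names (for example by splitting on digits or case changes).
    """
    letters = []
    for _ in range(width):
        index, remainder = divmod(index, 26)
        letters.append(string.ascii_lowercase[remainder])
    return "mid" + "".join(reversed(letters))


# Heuristic: the method name is the first identifier directly followed by "(".
# This can misfire, e.g. on a signature preceded by an annotation with arguments.
_METHOD_NAME = re.compile(r"[A-Za-z_$][\w$]*(?=\s*\()")


def replace_method_name(code, new_name):
    """Replace the first identifier that precedes "(" with new_name."""
    return _METHOD_NAME.sub(lambda _match: new_name, code, count=1)


def restore_labels(processed_lines, id_to_nl):
    """Swap each unique-ID label back for its natural language sequence.

    Assumes the label is the first space-separated field of a line and that
    target subtokens are joined with "|"; lines with unknown labels are kept.
    """
    restored = []
    for line in processed_lines:
        label, sep, rest = line.partition(" ")
        nl = id_to_nl.get(label)
        if nl is None:
            restored.append(line)
        else:
            restored.append("|".join(nl.split()) + sep + rest)
    return restored


if __name__ == "__main__":
    code = ("public static DecomposableMatchBuilder1 < Float , Float > "
            "caseFloat ( MatchesAny f ) { return null ; }")
    unique_id = index_to_id(0)
    print(replace_method_name(code, unique_id))
    # "ctx1 ctx2" stands in for the extracted path contexts.
    print(restore_labels([unique_id + " ctx1 ctx2"],
                         {unique_id: "matches a float ."}))
```

Before preprocessing, each method would be written to the Java input with its ID in place of the name, and the ID-to-description mapping kept on the side; after preprocessing, `restore_labels` would be run over the generated files. It is worth checking a few output lines by hand to confirm that the IDs survive the extractor unchanged.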

@lidiancracy

Hello, I encountered the same issue while preprocessing the files. Does the original JAR package handle exceptions, such as skipping files that do not meet the format requirements without preprocessing them? I'm using it to process my own dataset, but it's throwing errors. I'm not sure if it will keep getting stuck there.

@urialon
Contributor

urialon commented Sep 17, 2023

Hi @lidiancracy ,
Thank you for your interest in our work.

The truth is that I don't remember; this code was written about 5 years ago. If you wish to debug it, go ahead: the entire Java code is available in this repo.

But I recommend using newer models such as PolyCoder:
https://github.com/VHellendoorn/Code-LMs
https://arxiv.org/pdf/2202.13169.pdf

Best,
Uri

@lidiancracy

lidiancracy commented Sep 18, 2023

@urialon Thank you for your timely reply. My .sh file now terminates normally and has produced 4 files with the .c2s extension, so I think the logic in the JAR package is probably fine. By the way, can I continue training an already well-trained model on a new dataset, similar to transfer learning or incremental training? I did not find any relevant information in the README. Did I miss something? Thank you in advance.

@lidiancracy

Sorry to bother you. I trained the model using the default parameters, but now only the dictionary remains, as shown in the picture. Is this normal?
image

@urialon
Contributor

urialon commented Sep 19, 2023

Yes


3 participants