Retraining #6
Hi Zhangyin, you may use this command to train the best model reported in the NL2Bash paper:
Let me know if this helps.
@todpole3 Sorry, it doesn't help. You can try to re-execute it yourself. This is the result on dev from my run. Now I don't know what I should do. Thanks
Hi Zhangyin, I'm very sorry about the confusion caused. First of all, the "Average top k BLEU Score" and "Average top k Template Match Score" you obtained are comparable to and slightly higher than the ones we reported in Appendix C of the paper, hence I believe you have already retrained the model and obtained the predicted commands correctly. The question remains why you are getting lower "Top k Match (template-only)" and "Top k Match (whole-string)" scores compared to those reported in Table 8. The scores in Table 8 are accuracies resulting from manual evaluation. I have already uploaded all our manual evaluations here. The evaluation function needs to read those in order to properly output the manual evaluation accuracy. If for some reason it failed to read those files, the resulting accuracy would be significantly lower, since the model output contains many false negatives (correct Bash commands that are not included in our data collection). I will check if the pointer to the "manual_judgements" directory is correctly set up and get back to you.
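(To make the false-negative issue concrete: conceptually, the manual judgements act as a cache keyed by the natural-language description and the predicted command, and any prediction missing from the cache is counted as incorrect. The sketch below is not the repository's actual code; the directory layout, file format, and function names are assumptions made purely for illustration.)

```python
import csv
import os


def load_manual_judgements(judgement_dir):
    """Load (description, command) -> True/False judgements.

    The file extension and column order assumed here (description,
    command, y/n label) are illustrative; the real annotation files
    in the repository may be organized differently.
    """
    judgements = {}
    for fname in os.listdir(judgement_dir):
        if not fname.endswith('.csv'):
            continue
        with open(os.path.join(judgement_dir, fname)) as f:
            for row in csv.reader(f):
                if len(row) < 3:
                    continue
                nl, cmd, label = row[0], row[1], row[2]
                judgements[(nl.strip(), cmd.strip())] = label.strip().lower() == 'y'
    return judgements


def top_k_accuracy(predictions, judgements, k=3):
    """Fraction of examples with a judged-correct command in the top k.

    Predictions absent from the judgement cache count as incorrect,
    which is exactly where the false negatives come from if the cache
    fails to load.
    """
    hits = sum(
        any(judgements.get((nl, cmd), False) for cmd in cmds[:k])
        for nl, cmds in predictions.items()
    )
    return hits / max(len(predictions), 1)
```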
Thank you very much. Thanks again.
Have you checked the code? Thanks.
Sorry about the delay. I pushed a few fixes which I think should address the issue to a great extent. A) I reproduced the dev set manual evaluation results of the "Tellina" model (Table 8, last row) by running the following commands:
B) I reproduced the dev set manual evaluation results of the "Sub-Token CopyNet" model (Table 8, second to last row) by running the following commands, plus the additional effort of inputting a few manual judgements myself.
There are two reasons why you got lower evaluation scores initially. First, before the fixes above, the evaluation code was not reading the manual judgement files correctly, which produced many false negatives. Second, due to the randomness of the NN implementation, the models may output different Bash commands across different runs, and some newly generated commands were not in the manual judgements we had already collected. There are false negatives among those too, and they need to be re-judged. (I did not fix the random seed for TensorFlow correctly, hence I still observe differences in the predictions across runs, although the evaluation results do not change significantly.)
The script prints the manual evaluation metrics at the end.
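(For reference on the run-to-run variation mentioned above: pinning the seeds in a TensorFlow 1.x style codebase typically looks like the sketch below. This is a generic illustration, not the repository's code, and some GPU kernels remain non-deterministic even with all seeds set.)

```python
import random

import numpy as np
import tensorflow as tf

SEED = 42  # arbitrary value, for illustration only

# Seed every source of randomness the training pipeline touches.
random.seed(SEED)         # Python's built-in RNG (e.g. data shuffling)
np.random.seed(SEED)      # NumPy RNG (e.g. batch sampling, initializers)
tf.set_random_seed(SEED)  # TF 1.x graph-level seed; use tf.random.set_seed in TF 2.x

# Note: the graph-level seed must be set before the graph is built, and
# per-op seeds plus non-deterministic GPU reductions can still introduce
# small differences across runs.
```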
The caveat is that to develop with our codebase, one needs to constantly redo the manual judgement step, since any model change will produce new predictions that were previously unseen. This makes the dev evaluation tedious and subjective, as the researchers themselves may not be proficient enough in Bash to do the judgement, and different researchers may use different annotation standards. The problem is even more serious for testing, since the researcher would need to rerun the 3-annotator manual evaluation experiment to generate numbers comparable to our paper, which does not generalize well. Hence my current suggestions are 1) to use the automatic evaluation metrics proposed in Appendix C as coarse guidance for development, keeping in mind that they are not strictly correlated with the manual evaluation metrics (I would also encourage you to think of additional automatic evaluation methods), and 2) if you are reporting new manual evaluation results on the test set, to have the annotators judge both our system output and your system output so that the numbers are comparable (different annotators may have different standards, which makes scores produced by different sets of annotators incomparable). Meanwhile, I will think about better automatic evaluation methodology and how to build a common platform for test evaluation. Suggestions are welcome. Thank you for drawing this to our attention!
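(As an unofficial example of the kind of coarse automatic check suggested in point 1, the sketch below scores top-k predictions with sentence-level BLEU and a crude template match that masks out arguments. The normalization is a deliberate simplification of the paper's template extraction, and the function names are made up for illustration.)

```python
import re

from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu


def command_template(cmd):
    """Crude template: keep flags and lowercase words, mask other tokens.

    A rough approximation only; the paper's template extraction handles
    Bash syntax (pipes, quoting, nested commands) much more carefully.
    """
    tokens = []
    for tok in cmd.split():
        if tok.startswith('-') or re.fullmatch(r'[a-z]+', tok):
            tokens.append(tok)
        else:
            tokens.append('ARG')
    return ' '.join(tokens)


def top_k_bleu(reference_cmds, predicted_cmds, k=3):
    """Best sentence-level BLEU among the top-k predicted commands."""
    smooth = SmoothingFunction().method4
    refs = [r.split() for r in reference_cmds]
    return max(
        sentence_bleu(refs, p.split(), smoothing_function=smooth)
        for p in predicted_cmds[:k]
    )


def top_k_template_match(reference_cmds, predicted_cmds, k=3):
    """1.0 if any top-k prediction matches a reference template, else 0.0."""
    ref_templates = {command_template(r) for r in reference_cmds}
    return float(any(command_template(p) in ref_templates
                     for p in predicted_cmds[:k]))
```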
When I retrain the model, the program ends automatically after four epochs, and the results are far below those in the paper. I followed the instructions completely, so why can't I reproduce the results?