Hi again,
A few months ago we started looking at the probabilistic dictionaries used in bicleaner and found them quite noisy, with many wrong word alignments.
So we developed a new approach to building the probabilistic dictionaries, based on awesome-align, which resulted in better accuracy of the bicleaner score.
Here is how we did it:
# Split the tab-separated corpus into one file per language
cut -f1 bigcorpus.en-is > bigcorpus.en-is.en
cut -f2 bigcorpus.en-is > bigcorpus.en-is.is
# Tokenize each side with the Moses tokenizer
moses/tokenizer/tokenizer.perl -l en -no-escape < bigcorpus.en-is.en > bigcorpus.en-is.tok.en
moses/tokenizer/tokenizer.perl -l is -no-escape < bigcorpus.en-is.is > bigcorpus.en-is.tok.is
# Lowercase both sides (\L is a GNU sed extension)
sed 's/[[:upper:]]*/\L&/g' < bigcorpus.en-is.tok.en > bigcorpus.en-is.tok.low.en
sed 's/[[:upper:]]*/\L&/g' < bigcorpus.en-is.tok.is > bigcorpus.en-is.tok.low.is
# Join the two sides into one "source ||| target" line per sentence pair
paste bigcorpus.en-is.tok.low.en bigcorpus.en-is.tok.low.is | sed 's/\t/ ||| /' > bigcorpus.en-is.clean
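A line with an empty source or target side would still pass through the commands above and end up in bigcorpus.en-is.clean with nothing on one side of the separator. The following is a minimal Python sketch of a sanity check one could run on that file before alignment. The function names `is_valid_pair` and `filter_pairs` are hypothetical helpers for illustration, not part of bicleaner or awesome-align.

```python
def is_valid_pair(line):
    """Return True if the line has exactly one ' ||| ' separator
    and non-empty text on both sides of it."""
    parts = line.rstrip("\n").split(" ||| ")
    return len(parts) == 2 and all(part.strip() for part in parts)


def filter_pairs(lines):
    """Keep only well-formed 'source ||| target' lines (hypothetical helper)."""
    return [line.rstrip("\n") for line in lines if is_valid_pair(line)]


if __name__ == "__main__":
    sample = [
        "hello world ||| halló heimur\n",
        "missing target ||| \n",
        "no separator at all\n",
    ]
    for kept in filter_pairs(sample):
        print(kept)
```

In practice the same check could be applied while streaming the file line by line, so the whole corpus never has to fit in memory.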
Thanks for the recommendation. I've added a link to this issue on the wiki in case someone needs it. Unfortunately, this is not useful for bicleaner-ai, because it does not use any of the statistical features and is fully based on neural networks.
Then build the alignment file:
And finally, build the probabilistic dictionaries (using PySpark):
Rename create_Bicleaner_dic.txt to create_bicleaner_dic.py.
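To illustrate the idea behind the dictionary-building step, here is a minimal pure-Python sketch (not the attached PySpark script) that turns word alignments into translation probabilities by relative frequency. It assumes the alignment file has one line per sentence pair in Pharaoh format (zero-based `i-j` source-target index pairs) and that the corpus uses the `source ||| target` layout from above. The function name `build_prob_dict` and the `min_prob` threshold are hypothetical, and the output layout bicleaner expects is not covered here.

```python
from collections import Counter, defaultdict


def build_prob_dict(sentence_pairs, alignments, min_prob=0.0):
    """Estimate p(target word | source word) from aligned sentence pairs.

    sentence_pairs: iterable of 'source ||| target' strings (tokenized).
    alignments: iterable of Pharaoh-style strings such as '0-0 1-2'
                (assumed format: zero-based source-target index pairs).
    min_prob: hypothetical threshold; entries below it are dropped.
    """
    pair_counts = defaultdict(Counter)
    for pair, align in zip(sentence_pairs, alignments):
        src, tgt = pair.split(" ||| ")
        src_toks, tgt_toks = src.split(), tgt.split()
        for link in align.split():
            i, j = map(int, link.split("-"))
            pair_counts[src_toks[i]][tgt_toks[j]] += 1

    prob_dict = {}
    for src_word, counts in pair_counts.items():
        total = sum(counts.values())
        # Relative frequency of each target word among all links of src_word
        prob_dict[src_word] = {
            tgt_word: count / total
            for tgt_word, count in counts.items()
            if count / total >= min_prob
        }
    return prob_dict


if __name__ == "__main__":
    pairs = [
        "house ||| hús",
        "house ||| hús",
        "house ||| heimili",
        "red house ||| rautt hús",
    ]
    aligns = ["0-0", "0-0", "0-0", "0-0 1-1"]
    for src_word, translations in build_prob_dict(pairs, aligns).items():
        print(src_word, translations)
```

Swapping the two sides of each pair and each alignment link gives the dictionary for the opposite direction. On a large corpus the counting step is what would be distributed, which is presumably why PySpark was used.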
This could also be used for bicleaner-ai.
I would be very happy to hear your feedback and the results of your tests.
Thanks again for all your work,