Improving probabilistic dictionary #66

Closed
jgcb00 opened this issue Feb 10, 2022 · 2 comments

jgcb00 commented Feb 10, 2022

Hi again,
A few months ago we started looking at the probabilistic dictionaries used in bicleaner and found them quite noisy, with many wrong word alignments.
So we developed a new approach to building the probabilistic dictionaries, based on awesome-align, which improved the accuracy of the bicleaner score.

Here is how we did it:

cut -f1 bigcorpus.en-is > bigcorpus.en-is.en
cut -f2 bigcorpus.en-is > bigcorpus.en-is.is

moses/tokenizer/tokenizer.perl -l en -no-escape < bigcorpus.en-is.en > bigcorpus.en-is.tok.en
moses/tokenizer/tokenizer.perl -l is -no-escape < bigcorpus.en-is.is > bigcorpus.en-is.tok.is

sed 's/[[:upper:]]*/\L&/g' < bigcorpus.en-is.tok.en > bigcorpus.en-is.tok.low.en
sed 's/[[:upper:]]*/\L&/g' < bigcorpus.en-is.tok.is > bigcorpus.en-is.tok.low.is

paste bigcorpus.en-is.tok.low.en bigcorpus.en-is.tok.low.is | sed 's/\t/ ||| /' > bigcorpus.en-is.clean
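
The resulting file should hold one sentence pair per line in the form `source ||| target`, which is the input layout awesome-align expects. Before starting a long alignment run it can be worth checking for lines where one side came out empty. The following is a minimal sketch of such a check; the function names are hypothetical and not part of bicleaner or awesome-align.

```python
SEPARATOR = " ||| "  # separator written by the paste/sed step above


def is_well_formed(line: str) -> bool:
    """Return True if the line has exactly one separator and two non-empty sides."""
    parts = line.rstrip("\n").split(SEPARATOR)
    return len(parts) == 2 and all(part.strip() for part in parts)


def malformed_line_numbers(lines):
    """Return the 1-based numbers of lines that are not 'source ||| target'."""
    return [i for i, line in enumerate(lines, start=1) if not is_well_formed(line)]


if __name__ == "__main__":
    import sys

    # Usage: python3 check_pairs.py bigcorpus.en-is.clean
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8") as handle:
            print(malformed_line_numbers(handle))
```

Lines reported by this check could be dropped from the corpus before alignment.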

Then build the alignment file:

CUDA_VISIBLE_DEVICES=0 awesome-align \
    --output_file=/dev/null \
    --output_word_file=alignement_big_corpus.en-is.clean \
    --model_name_or_path=model_without_co \
    --data_file=bigcorpus.en-is.clean \
    --extraction 'softmax' \
    --batch_size 32
[create_Bicleaner_dic.py.txt](https://github.com/bitextor/bicleaner/files/8042712/create_Bicleaner_dic.py.txt)

And finally, build the probabilistic dictionaries (using pyspark) with the attached script, after renaming create_Bicleaner_dic.py.txt to create_Bicleaner_dic.py:

python3 create_Bicleaner_dic.py alignement_big_corpus.en-is.clean
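
For readers without the attachment, the idea behind this step can be sketched in plain Python. This is not the attached pyspark script, only an illustration under stated assumptions: each line of the word-alignment file is assumed to contain space-separated `source<sep>target` pairs, and the probability of a target word given a source word is estimated as a relative frequency. The function names and the `min_prob` threshold are hypothetical.

```python
from collections import Counter, defaultdict

SEP = "<sep>"  # assumed pair separator in the word-alignment output


def count_pairs(lines):
    """Count how often each (source, target) word pair was aligned."""
    pair_counts = Counter()
    for line in lines:
        for token in line.split():
            src, sep, tgt = token.partition(SEP)
            if not sep or not src or not tgt:
                continue  # skip tokens that are not a complete pair
            pair_counts[(src, tgt)] += 1
    return pair_counts


def build_dictionary(pair_counts, min_prob=0.0):
    """Estimate P(target | source) as count(source, target) / count(source)."""
    src_totals = Counter()
    for (src, _tgt), n in pair_counts.items():
        src_totals[src] += n
    probs = defaultdict(dict)
    for (src, tgt), n in pair_counts.items():
        p = n / src_totals[src]
        if p >= min_prob:  # hypothetical pruning of rare alignments
            probs[src][tgt] = p
    return dict(probs)


if __name__ == "__main__":
    sample = ["house<sep>hús is<sep>er", "house<sep>húsið", "house<sep>hús"]
    for src, targets in build_dictionary(count_pairs(sample)).items():
        for tgt, p in targets.items():
            print(src, tgt, round(p, 3))
```

The reverse direction can be obtained by swapping source and target when counting. The on-disk layout bicleaner expects for its dictionaries should be checked against its documentation before writing these probabilities out.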

This could also be used for bicleaner-ai.
I would be very happy to hear your feedback and the results of your tests.

Thanks again for all your work,


ZJaume commented Feb 10, 2022

Thanks for the recommendation. I've added a link to this issue on the wiki in case someone needs it. Unfortunately, this is not useful for bicleaner-ai, because it does not use any of the statistical features and is fully based on neural networks.

ZJaume closed this as completed Feb 10, 2022

jgcb00 commented Feb 10, 2022

Indeed, I misread the wiki page of bicleaner-ai.
