Prepare new data for NLLB-200 #24

ibtiRaj · 2023-02-09T11:43:52Z

Hi, I'm trying to fine-tune the NLLB-200 model on new bilingual data, so I need to prepare my data using the prepare_data pipeline: https://github.com/facebookresearch/stopes/tree/main/stopes/pipelines/prepare_data
Here are my config files:

My output directory is the following:

But I encountered a problem when fine-tuning NLLB-200:
File "/home/admin/khadija/fairseq/slurm_snapshot_code/2023-02-08T14_51_26.242208/fairseq/data/dictionary.py", line 238, in add_from_file
with open(PathManager.get_local_path(f), "r", encoding="utf-8") as fd:
FileNotFoundError: [Errno 2] No such file or directory: '/home/admin/khadija/prepare_data_output/data_bin/shard000/dict.ary_Arab.txt'
srun: error: slurmnode1: tasks 0-2: Exited with exit code 1
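The traceback shows fairseq looking for a per-language dictionary file (`dict.ary_Arab.txt`) inside the shard directory. A minimal sketch for checking which of these files are missing before launching training follows. It assumes only the `dict.<lang>.txt` naming visible in the error above; the helper name `missing_dict_files` and the second language code `eng_Latn` are hypothetical placeholders.

```python
from pathlib import Path
from typing import Iterable, List


def missing_dict_files(shard_dir: Path, langs: Iterable[str]) -> List[Path]:
    """Return the dict.<lang>.txt paths that do not exist in shard_dir."""
    expected = [shard_dir / f"dict.{lang}.txt" for lang in langs]
    return [path for path in expected if not path.is_file()]


if __name__ == "__main__":
    # Path taken from the traceback; the language list is an assumption,
    # so replace it with the languages in your own config.
    shard = Path("/home/admin/khadija/prepare_data_output/data_bin/shard000")
    for path in missing_dict_files(shard, ["ary_Arab", "eng_Latn"]):
        print(f"missing: {path}")
```

If a file is reported missing, comparing the listing of the shard directory with the file names the training code requests may show whether the two versions use different naming schemes.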

Is Fairseq compatible with the new version of Stopes?
@Mortimerp9 @kauterry @gwenzek Can you help me please?

robotsp · 2023-02-20T08:55:35Z

@ibtiRaj have you solved your problem?

ibtiRaj · 2023-02-20T09:21:14Z

@robotsp No, I didn't, I'm sorry.

robotsp · 2023-02-20T09:26:54Z

> @robotsp No, I didn't, I'm sorry.

No worries. BTW, may I ask whether the model file and vocab file in your configs are the same as the original ones from NLLB?
I just downloaded one from https://github.com/facebookresearch/fairseq/tree/nllb, but its vocabulary size is 255997, which is different from your 256200. I wonder why?
@ibtiRaj

ibtiRaj · 2023-02-21T09:12:47Z

@robotsp Yes, you are right, the vocabulary size is 255997, but when I run the fine-tuning I get a vocabulary size mismatch error:

That's why I thought of adding 200 tokens to the original vocabulary.
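For reference, the two sizes quoted in this thread differ by 203, not exactly 200. A rough sketch for counting the entries in a fairseq-style dictionary file and comparing them with a target size is given below. It relies on the fact that fairseq's `Dictionary` adds four special symbols (`<s>`, `<pad>`, `</s>`, `<unk>`) by default; the helper names are hypothetical, and what fills the remaining slots (for example language tokens added at training time) is an assumption not confirmed in this thread.

```python
from pathlib import Path

# fairseq's Dictionary adds <s>, <pad>, </s>, <unk> by default.
NUM_SPECIAL_SYMBOLS = 4


def count_dict_entries(dict_path: Path) -> int:
    """Count non-empty lines in a dictionary file (one "token count" pair per line)."""
    with dict_path.open("r", encoding="utf-8") as fd:
        return sum(1 for line in fd if line.strip())


def vocab_gap(file_entries: int, target_size: int,
              num_special: int = NUM_SPECIAL_SYMBOLS) -> int:
    """Slots in target_size not covered by the file entries plus special symbols."""
    return target_size - (file_entries + num_special)


if __name__ == "__main__":
    # Sizes quoted in this thread.
    print(256200 - 255997)            # raw difference: 203
    print(vocab_gap(255997, 256200))  # after the 4 special symbols: 199
```

Running `count_dict_entries` on the downloaded dictionary and comparing the result with the size reported in the mismatch error may show how many extra symbols the checkpoint expects, instead of guessing a round number.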

kauterry · 2023-02-21T18:10:04Z

Hi @ibtiRaj! The stopes/pipelines/prepare_data pipeline has been completely refactored. Could you pull the latest version of the code and change your config format to be compatible with the new code? Here is the README explaining how to write a prepare_data config: https://github.com/facebookresearch/stopes/tree/main/stopes/pipelines/prepare_data

Once you re-prepare your data with the latest code and the changed config, let me know if you still face any issues.

robotsp · 2023-02-22T04:07:59Z

@kauterry Would you please have a look at facebookresearch/fairseq#4989?
I prepared my data with the latest stopes code and the updated config, but ran into a new issue: "Can't instantiate abstract class TrainModule with abstract methods requirements".

ibtiRaj · 2023-02-22T08:41:33Z

Hi @kauterry, thank you for your answer.

When I prepare my data with the new version of stopes, I always get two errors:

The first one is the same as in facebookresearch/fairseq#4989 (Try to finetune NLLB but got an error: "Can't instantiate abstract class TrainModule with abstract methods requirements"). I solved this error by using the old version of stopes.
The second is the following:

What do you think?

And what about the mismatch error? Is it true that 200 new words can be added to the original vocabulary?

robotsp · 2023-02-22T12:33:57Z

I couldn't find the nllb module in fairseq/examples in version 0.12.1, which is the version recommended by the new version of Stopes (https://github.com/facebookresearch/stopes/tree/main). But when I reinstalled the nllb branch of fairseq, some conflicts between hydra-core and fairseq occurred. I think this is the root cause. Do you know why? @kauterry @ibtiRaj

ibtiRaj · 2023-02-22T14:16:14Z

Hi @robotsp, I solved the problem by following the NLLB installation guide here: https://github.com/facebookresearch/fairseq/blob/nllb/INSTALL.md.

robotsp mentioned this issue Feb 22, 2023

Try to finetune NLLB but got an error: "Can't instantiate abstract class TrainModule with abstract methods requirements" facebookresearch/fairseq#4989
