Some MarianMT models broken and output garbage #26216

fergusq · 2023-09-18T10:50:33Z

System Info

transformers version: 4.30.2
Platform: macOS-13.4.1-arm64-arm-64bit
Python version: 3.11.4
Huggingface_hub version: 0.16.3
Safetensors version: 0.3.1
PyTorch version (GPU?): 2.1.0.dev20230413 (False)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: no
Using distributed or parallel set-up in script?: no

Who can help?

@ArthurZucker @younesbelkada @gante

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

Some of the MarianMT models I use like Helsinki-NLP/opus-mt-tc-big-fi-en are currently broken. This bug is reproduced in the Huggingface Hub inference widget: https://huggingface.co/Helsinki-NLP/opus-mt-tc-big-fi-en?text=Kissa+k%C3%A4velee+kadulla You can see that it gives a garbage translation.

This model worked just last week! I also see that the model hasn't been updated in the Hub for over a month, so the bug must be in the transformers library.

Simple code to reproduce locally:

>>> tokenizer = transformers.AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-tc-big-fi-en")
>>> model = transformers.AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-tc-big-fi-en")
>>> model.generate(**tokenizer("kissa kävelee kadulla", return_tensors="pt"))
tensor([[57829, 16542, 16542, 16542, 16542, 16542, 16542, 16542, 16542, 16542,
         16542, 16542, 16542, 16542, 16542, 16542, 16542, 16542, 16542, 16542,
         16542, 16542, 16542, 16542, 16542, 16542, 16542, 16542, 16542, 16542,
         16542, 16542, 16542, 16542, 16542, 16542, 16542, 16542, 16542, 16542,
         16542, 16542, 16542, 16542, 16542, 16542, 16542, 19074, 19074, 19074,
         19074, 19074, 19074, 19074, 19074, 19074, 19074, 19074, 19074, 19074,
         19074, 19074, 19074, 19074, 19074, 19074, 19074, 19074, 19074, 19074,
         19074, 19074, 19074, 19074, 19074, 19074, 19074, 19074, 19074, 19074,
         19074, 19074, 19074, 19074, 19074, 19074, 19074, 19074, 19074, 19074,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 41756]])

Expected behavior

The model should give a proper translation.

The text was updated successfully, but these errors were encountered:

fergusq · 2023-09-18T12:58:42Z

I can confirm that the bug exists even after reinstalling the models and updating transformers to 4.33.2.

ArthurZucker · 2023-09-18T23:20:32Z

Hey! As I answered on the issue that was opened here, the Helsinki team should merge the PR we opened to fix the models on the hub!

ArthurZucker · 2023-09-18T23:20:55Z

In the mean time use model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-tatoeba-en-ja", revision = "refs/pr/3" 😉

fergusq · 2023-09-21T13:12:02Z

The Finnish–English model doesn't seem to have a similar PR as in the English–Japanese model. Could you share the script used to create the fixed version?

ArthurZucker · 2023-09-22T01:34:06Z

I'll upload PRs to all models, working on it!

thfrkielikone · 2023-10-02T07:21:12Z

Hi. Is the updated script available somewhere? We are using some models that have not been published to huggingface and would need to convert them ourselves.

ArthurZucker · 2023-10-02T07:22:59Z

I think that the conversion script works as expected, but some models have to be converted again! Really sorry I could not go through this yet. Will update here soon!

thfrkielikone · 2023-10-02T10:35:13Z

I attempted to convert a model per the master branch of the transformers library with the following steps, but the result was broken:

Cloned transformers and the tatoeba challenge repo
Installed transformers via pip install . and then the packages that it was missing with torch wget sentencepiece gitpython.
Ran the conversion with python src/transformers/models/marian/convert_marian_tatoeba_to_pytorch.py --models fin-eng --save_dir converted, resulting in converted/opus-mt-fin-eng was created without issue.
Loaded the model into a pipeline and ran it with:

from transformers import pipeline
p = pipeline("translation", model="/path/to/converted/opus-mt-fin-eng", max_new_tokens=50)
p("Mitähän tein väärin?")

The result it gave was [{'translation_text': ',,,,,,, W W W W W W W W,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,'}]. It also gave the warning

spaces_between_special_tokens is deprecated and will be removed in transformers v5. It was adding spaces between added_tokens, not special tokens, and does not exist in our fast implementation. Future tokenizers will handle the decoding process on a per-model rule.

Am I doing something wrong, or is there something broken in the conversion script itself? (Am I perhaps using a wrong version?)

ArthurZucker · 2023-10-02T13:01:01Z

No I think the warning can be safely ignored in that case. Mostprobably an issue in the conversion. If you can push the raw model to thehub it will help me in debugging!

thfrkielikone · 2023-10-02T13:03:45Z

The model is the one downloaded by the comand above. Do you want the converted result or for me to fish out whatever the command downloaded (it is mt-tc-big-fi-en that is publicly available)?

github-actions · 2023-10-27T08:05:28Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

thfrkielikone · 2023-10-27T13:34:56Z

Thank you @artzucker for uploading the fixed version of the Finnish-English model.

Converting the as of yet not on the Huggingface hub Finnish-Swedish model still gives garbage results, but so does (for me) converting the Finnish-English model. However, using the revision of the Finnish-English model from the pr like so gives correct results:

p = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-fi-en", revision = "refs/pr/6", max_new_tokens=50)

This leads me to believe that there's something wrong (at least different) with the steps I am taking to convert it. Is the correct way to convert the models documented somewhere? I know of this: https://github.com/huggingface/transformers/blob/main/scripts/tatoeba/README.md and followed it (plus also installing torch and SentencePiece which is not mentioned there) Before this breaking change, I used these steps to successfully convert the Swedish model once.

I will contact the Helsinki people and ask, why the Finnish-Swedish model hasn't been uploaded to the Huggingface hub (or if it has been and I missed it, where). If it can be put there and you could take a look at it, that'd be great.

In the meantime, if you want to check if there is something surprising about the Finnish-Swedish model itself after all, it is this one: https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/fin-swe#opustcv20210807bt-2021-12-08zip

ArthurZucker · 2023-10-27T14:15:51Z

Hey! I think I need to update the conversion script : readmes etc. Might not have time next week (will be off) but will definitely work on fixing this once and forall!
Feel free to share whatever you can (repo with both weights for example) that would help!

ArthurZucker · 2023-11-21T13:45:53Z

I think that the last I'll keep open to track the conversion script update 😉

ArthurZucker · 2023-12-16T16:39:20Z

No time to tackle this yet sorry, adding the good difficult issue tag!

PinzhenChen · 2024-03-13T23:00:06Z

We converted some Marian models here unfortunately with the old scripts. As of now, a solution worked for us is to downgrade transformers, e.g. to 4.28.

Adding the revision="xyz" can work for specific models on HF but not universally. I wonder if there is a plan to update the conversion script? Also is there a plan to make the conversion script as well as the Marian class more "universal" so that it 1) can work with any Marian-trained models (not just OPUS models from Helsinki), and 2) can let user specify the paths to Marian vocabulary file, model file, decoder file when converting weights.

ArthurZucker · 2024-03-25T07:02:41Z

It would be very nice but I'm super low on bandwidth for that. Maybe @zucchini-nlp could be a nice challenge! Otherwise community contributions are really welcome!

fergusq · 2024-04-23T13:10:44Z

Related to this issue, I made a conversion script that can correctly convert non-OPUS/Tatoeba Marian models to the transformers format: https://gist.github.com/fergusq/bc8e4f2d3d74027df9232f05d08f2ce4

Currently, convert_marian_to_pytorch.py is Opus/Tatoeba-specific and cannot really be used for other models, such as the HLPT Marian models added recently to the hub. Somehow, even these other models have the same issue with tied weights. I'm not sure why. While the issue is being corrected, the linked script can be used to convert arbitrary Marian models to the correct format so that the weights that are tied are correct.

fergusq changed the title ~~Some MarianMT broken and output garbage~~ Some MarianMT models broken and output garbage Sep 18, 2023

ArthurZucker mentioned this issue Sep 19, 2023

Marian models broken with latest transformers #26271

Closed

4 tasks

ArthurZucker mentioned this issue Oct 2, 2023

Wrong translation of the Helsinki-NLP/opus-mt-es-en encoder and decoder in ONNX #26523

Closed

4 tasks

Disastorm mentioned this issue Oct 7, 2023

At least one model's inference seems to have broken from transformers 4.29.2 -> 4.30.* #24657

Closed

4 tasks

huggingface deleted a comment from github-actions bot Nov 21, 2023

huggingface deleted a comment from github-actions bot Dec 16, 2023

ArthurZucker added the Good Difficult Issue label Dec 16, 2023

zucchini-nlp mentioned this issue Apr 11, 2024

Fix Marian model conversion #30173

Merged

zucchini-nlp closed this as completed in #30173 May 1, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Some MarianMT models broken and output garbage #26216

Some MarianMT models broken and output garbage #26216

fergusq commented Sep 18, 2023

fergusq commented Sep 18, 2023

ArthurZucker commented Sep 18, 2023

ArthurZucker commented Sep 18, 2023

fergusq commented Sep 21, 2023

ArthurZucker commented Sep 22, 2023

thfrkielikone commented Oct 2, 2023

ArthurZucker commented Oct 2, 2023

thfrkielikone commented Oct 2, 2023 •

edited

Loading

ArthurZucker commented Oct 2, 2023

thfrkielikone commented Oct 2, 2023

github-actions bot commented Oct 27, 2023

thfrkielikone commented Oct 27, 2023 •

edited

Loading

ArthurZucker commented Oct 27, 2023

ArthurZucker commented Nov 21, 2023

ArthurZucker commented Dec 16, 2023

PinzhenChen commented Mar 13, 2024 •

edited

Loading

ArthurZucker commented Mar 25, 2024

fergusq commented Apr 23, 2024

Some MarianMT models broken and output garbage #26216

Some MarianMT models broken and output garbage #26216

Comments

fergusq commented Sep 18, 2023

System Info

Who can help?

Information

Tasks

Reproduction

Expected behavior

fergusq commented Sep 18, 2023

ArthurZucker commented Sep 18, 2023

ArthurZucker commented Sep 18, 2023

fergusq commented Sep 21, 2023

ArthurZucker commented Sep 22, 2023

thfrkielikone commented Oct 2, 2023

ArthurZucker commented Oct 2, 2023

thfrkielikone commented Oct 2, 2023 • edited Loading

ArthurZucker commented Oct 2, 2023

thfrkielikone commented Oct 2, 2023

github-actions bot commented Oct 27, 2023

thfrkielikone commented Oct 27, 2023 • edited Loading

ArthurZucker commented Oct 27, 2023

ArthurZucker commented Nov 21, 2023

ArthurZucker commented Dec 16, 2023

PinzhenChen commented Mar 13, 2024 • edited Loading

ArthurZucker commented Mar 25, 2024

fergusq commented Apr 23, 2024

thfrkielikone commented Oct 2, 2023 •

edited

Loading

thfrkielikone commented Oct 27, 2023 •

edited

Loading

PinzhenChen commented Mar 13, 2024 •

edited

Loading