
add rope alibi to encoder #1687

Open · wants to merge 3 commits into base: master

Conversation

vince62s
Member

No description provided.

@LynxPDA commented May 22, 2024

@vince62s I updated CTranslate2 and the conversion went through without errors.
The only thing is that I had to add the "gated-gelu" line to _SUPPORTED_ACTIVATIONS:

_SUPPORTED_ACTIVATIONS = {
    "gelu": common_spec.Activation.GELU,
    "fast_gelu": common_spec.Activation.GELUTanh,
    "relu": common_spec.Activation.RELU,
    "silu": common_spec.Activation.SWISH,
    "gated-gelu": common_spec.Activation.GELU,
}

Without this, it still gave the error:

- Option --pos_ffn_activation_fn gated-gelu is not supported (supported activations are: gelu, fast_gelu, relu, silu)
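
For context, "gated-gelu" refers to a GeGLU feed-forward block: two input projections, one passed through GELU and used to gate the other. A minimal, generic sketch (illustrative only, not the OpenNMT-py or CTranslate2 code):

import torch
import torch.nn as nn


class GatedGELUFeedForward(nn.Module):
    """Generic GeGLU ("gated-gelu") feed-forward block, for illustration only."""

    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff)  # projection passed through GELU
        self.w_up = nn.Linear(d_model, d_ff)    # plain linear projection
        self.w_down = nn.Linear(d_ff, d_model)  # back to the model dimension

    def forward(self, x):
        # GELU(gate) multiplies the linear branch elementwise, then project down.
        return self.w_down(nn.functional.gelu(self.w_gate(x)) * self.w_up(x))

The activation itself is still a GELU; the gating only changes the feed-forward weight layout, which is presumably why mapping "gated-gelu" onto common_spec.Activation.GELU is enough on the converter side.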

@vince62s (Member Author) commented May 22, 2024

Yes, you're correct. Were you able to run inference in CT2 without any issue?

I am not merging yet because ALiBi requires additional changes in the C++ code.
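
For reference, ALiBi adds a per-head linear bias on query-key distance directly to the attention logits, which is why encoder support touches the attention code rather than just the converter. A minimal sketch of the bias (illustrative only; it assumes a symmetric distance for bidirectional encoder attention and a power-of-two head count, and is not the CTranslate2 implementation):

import torch


def alibi_bias(num_heads, seq_len):
    """Per-head linear penalties on |query - key| distance, added to attention logits."""
    # Geometric head slopes as in the ALiBi paper (assumes a power-of-two head count).
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).abs().float()  # (seq_len, seq_len)
    return -slopes[:, None, None] * distance[None, :, :]    # (num_heads, seq_len, seq_len)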

@LynxPDA commented May 23, 2024

So far I have only tried inference with gated-gelu activation on CTranslate2 v3.20.0, no issues. I plan to try RoPE next month, after training the corresponding model.

@lecoqnicolas left a comment

Hello,
I would like to insert a line between lines 8 and 9 to support gated-gelu activation:
"gated-gelu": common_spec.Activation.GELU,
Also, gated-gelu does not feature in the "transformers.py" script. You might want to add it after line 30 and modify line 1308 to include gated-gelu.

@lecoqnicolas

Hello, I would like to insert a line between lines 8 and 9 to support gated-gelu activation: "gated-gelu": common_spec.Activation.GELU, Also, gated-gelu does not feature in the "transformers.py" script. You might want to add it after line 30 and modify line 1308 to include gated-gelu.

Forget the transformers.py script: Gemma in Transformers implements GeGLU, but with a GELUTanh approximation (I just read an article about it), so no need to update.
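
For reference, the GELUTanh approximation mentioned above (the one behind common_spec.Activation.GELUTanh) is the standard tanh-based approximation of GELU; a minimal sketch:

import math

import torch


def gelu_tanh(x):
    """Tanh approximation of GELU (the variant used by Gemma-style GeGLU blocks)."""
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x.pow(3))))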

@LynxPDA commented Jun 9, 2024

Were you able to run inference in CT2 without any issue?

Yes, I confirm. I successfully trained a model with gated-GELU and RoPE and ran inference with LibreTranslate (CTranslate2 v3.20.0).

@lecoqnicolas commented Jun 11, 2024

I tried updating CT2 to 4.2.1 and launching a training. It breaks at the validation step with the errors below. At first I thought it was the "None" values set in the fix dated March 12, so I downgraded onmt to 3.5.0 and further down to the original pinned version (3.4.1). But in the best case I get the same errors (in the worst, training aborts at step 1 or doesn't start at all).

I tried pretty much every version of CT2 4.x and onmt 3.5.x with compatible torch/CUDA. I also tried different data to check this out, and a different population method. Do you have any idea?

From the error, I think this is related to a "filtertoolong" transform that is systematically inserted in the data, but I am not sure.

[2024-06-11 09:42:02,000 INFO] Start training loop and validate every 100 steps...
[2024-06-11 09:42:02,015 INFO] Scoring with: ['sentencepiece', 'filtertoolong', 'prefix']
[2024-06-11 09:48:24,332 INFO] Step 50/32000; acc: 0.4; ppl: 31126.4; xent: 10.3; lr: 0.00000; sents:  295284; bsz: 6628/7296/236; 21670/23854 tok/s;    382 sec;
[2024-06-11 09:53:54,287 INFO] Step 100/32000; acc: 4.7; ppl: 25540.5; xent: 10.1; lr: 0.00000; sents:  292443; bsz: 6604/7244/234; 25018/27441 tok/s;    712 sec;
[2024-06-11 09:54:36,382 INFO] valid stats calculation
                           took: 42.094335317611694 s.
Traceback (most recent call last):
  File "C:\Program Files\Python39\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\Python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Program Files\Python39\Scripts\onmt_train.exe\__main__.py", line 7, in <module>
  File "C:\Program Files\Python39\lib\site-packages\onmt\bin\train.py", line 67, in main
    train(opt)
  File "C:\Program Files\Python39\lib\site-packages\onmt\bin\train.py", line 52, in train
    train_process(opt, device_id=0)
  File "C:\Program Files\Python39\lib\site-packages\onmt\train_single.py", line 238, in main
    trainer.train(
  File "C:\Program Files\Python39\lib\site-packages\onmt\trainer.py", line 332, in train
    valid_stats = self.validate(
  File "C:\Program Files\Python39\lib\site-packages\onmt\trainer.py", line 420, in validate
    preds, texts_ref = self.scoring_preparator.translate(
  File "C:\Program Files\Python39\lib\site-packages\onmt\utils\scoring_utils.py", line 111, in translate
    _, preds = translator._translate(
  File "C:\Program Files\Python39\lib\site-packages\onmt\translate\translator.py", line 494, in _translate
    for batch, bucket_idx in infer_iter:
  File "C:\Program Files\Python39\lib\site-packages\onmt\inputters\dynamic_iterator.py", line 341, in __iter__
    for bucket, bucket_idx in self._bucketing():
  File "C:\Program Files\Python39\lib\site-packages\onmt\inputters\dynamic_iterator.py", line 286, in _bucketing
    yield (self._tuple_to_json_with_tokIDs(bucket), self.bucket_idx)
  File "C:\Program Files\Python39\lib\site-packages\onmt\inputters\dynamic_iterator.py", line 247, in _tuple_to_json_with_tokIDs
    tuple_bucket = process(self.task, tuple_bucket)
  File "C:\Program Files\Python39\lib\site-packages\onmt\inputters\text_utils.py", line 95, in process
    transf_bucket = transform.batch_apply(
  File "C:\Program Files\Python39\lib\site-packages\onmt\transforms\transform.py", line 232, in batch_apply
    batch = transform.batch_apply(
  File "C:\Program Files\Python39\lib\site-packages\onmt\transforms\transform.py", line 70, in batch_apply
    example = self.apply(example, is_train=is_train, **kwargs)
  File "C:\Program Files\Python39\lib\site-packages\onmt\transforms\misc.py", line 56, in apply
    or len(example["tgt"]) > self.tgt_seq_length - 2
TypeError: object of type 'NoneType' has no len()
Total checkpoints: 0

@lecoqnicolas commented Jun 11, 2024

I tried pretty much every version of CT2 4.x and onmt 3.5.x with compatible torch/CUDA. I also tried different data to check this out, and a different population method. Do you have any idea?

Well, what I did not try was to install two seemingly incompatible versions of CT2 (4.2.1) and ONMT-py (3.4.3). @LynxPDA had me update this way, and now it does work (at least with absolute PE, and probably RPE).

Although... when I tried RoPE, upon converting the checkpoint to a CT2 model, I got an "unexpected argument" error, explicit enough for me to comment out lines 343 & 344 of the transformer_spec.py script and make it through with a seemingly working model (that was a toy config though; I still have to run a full training).
[screenshot of the "unexpected argument" error]

The point now is that I do not know whether these specs really work, or even whether RoPE is sufficiently implemented in OpenNMT-py 3.4.3 to perform as intended.

So we'll have to work on a fix to use onmt-py 3.5.x, I guess. Does the above-mentioned bug come from the applied transforms? There are also empty "tgt_prefix" prefixes on top of the filtertoolong; could that be the issue? I attach the corresponding config file; since you have developed a lot of the opennmt-py code as well, you surely know better.
config.txt

@lecoqnicolas commented Jun 12, 2024

I think I've found the bug... it happens while calculating validation metrics at the first validation step. At that point, training repeatedly calls the scoring utils, which manipulate "example" strings.
In onmt-py version 3.4, the string is split with the default () argument, so empty resulting strings (None typed) are dumped.
In onmt-py version 3.5, the strings are split on an explicit (" ") argument, so if two whitespaces follow each other in the validation set, the result is an empty string, whose len value is None, hence the error. There may be other occurrences that I've not thought of.
I checked the flores200-devtest: in English it has one occurrence of a double whitespace, in German another, unrelated...
I have a RoPE training in progress now (not really conclusive, by the way); I'll stop it to check whether cleaning the validation set works.
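
For reference, the difference between the two split calls is plain Python behavior:

s = "two  spaces"
print(s.split())     # ['two', 'spaces']      default split collapses runs of whitespace
print(s.split(" "))  # ['two', '', 'spaces']  explicit separator keeps empty strings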

The split() function was updated in the commit you pulled on Jan. 18 this year: fix "\n" tokenization + phi-2 new layer names (OpenNMT/OpenNMT-py#2552). I do not understand exactly how it fixes tokenizing \n, but could you tell me whether inserting the code below in the misc.py file would help? This is something I could do myself.

    def apply(self, example, is_train=False, stats=None, **kwargs):
        """Return None if too long or empty else return as is."""
        if (
            # check for None first, so len() is never called on a missing side
            example["src"] is None
            or example["tgt"] is None
            or len(example["src"]) > self.src_seq_length
            or len(example["tgt"]) > self.tgt_seq_length - 2
        ):
            if stats is not None:
                stats.update(FilterTooLongStats())
            return None
        else:
            return example

@vince62s (Member Author)

@lecoqnicolas sorry for not responding in a timely manner, but we switched to https://github.com/eole-nlp/eole,
which is a spin-off of onmt-py; all new dev will be done in that framework. I encourage you to switch. There is no CT2 converter at the moment, but we will implement one.
Specifically, the easiest fix is to NOT include filtertoolong in the validset transforms (there is no reason to include it).
BUT yes, there is a bug, and it will be fixed in Eole.

@lecoqnicolas commented Jun 13, 2024

OK; I'll follow eole to see when we can switch. Too much focus on LLMs in OpenNMT-py lately.

As for the transform, well, there's no real need for it at validation, but I guess the bug will also appear during training and may break it should an empty token appear.
I reviewed commit #2552, by the way, and I can't get the rationale behind the explicit whitespace separator. I mean, once the "\n" argument is replaced by the default argument in the strip() method, \n is automatically passed on for tokenization, so why add the explicit separator not only in the split that follows, but in every other one?

I'm not a code guru, so I have surely missed something, but I'm thinking of switching back to the split() default as a fix for this issue. Is that possible without further bugs?

@vince62s (Member Author)

I don't think the double space is your issue.
Your issue is that at scoring we force ['tgt'] to None because we translate, hence if you use filtertoolong it will break, because unlike other transforms there is a missing test with if ex['tgt'] is not None.
#2552 was fixing edge cases where we had special whitespaces in the text.
Anyway, we need to work on this space tokenization to remove it altogether.
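
In other words, the fix is to skip the target-side length check when ['tgt'] has been forced to None for scoring, rather than to filter such examples out. A sketch of that guard (illustrative only, not the actual OpenNMT-py code; FilterTooLongStats and the seq-length attributes are the ones from misc.py quoted above):

    def apply(self, example, is_train=False, stats=None, **kwargs):
        """Return None if too long, else return as is; tolerate a missing target side."""
        too_long = len(example["src"]) > self.src_seq_length
        if example["tgt"] is not None:  # at scoring time the target is forced to None
            too_long = too_long or len(example["tgt"]) > self.tgt_seq_length - 2
        if too_long:
            if stats is not None:
                stats.update(FilterTooLongStats())
            return None
        return example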

@lecoqnicolas

I tried to remove it earlier this morning, then launched a test and went on an errand; you are right, it's something else. I'll remove filtertoolong altogether, because if there's any such call during training, it will abort too.
Since it was automatically inserted by the training script, I never paid attention to it, but I figured out recently that it skips some long sentences I inserted on purpose in my datasets (because they are high-quality human translations of legal texts). So I'll probably end up better off without it.
Are the arguments "src_seq_length" and "tgt_seq_length" used in any other context? Just in case.

@vince62s (Member Author)

You still need it for training, otherwise you risk going OOM.

@lecoqnicolas commented Jun 13, 2024

Isn't the risk of aborting training on some empty token bigger? I've implemented it, but I keep my fingers crossed...

Then, you might want to rebase your PR on version 4.3.1 of CTranslate2: the error on the "original_max_position_embeddings" argument was due to incomplete specs in the attention layer, which have been updated in merged commit #1700.
Alongside this, the opennmt-py converter has been modified, so I had to merge it manually (the transformer_specs are still OK).

About the eole project, are you completely abandoning opennmt and ctranslate2, or will you still contribute to CT2?

Last question: should I specify the arguments "max_position_embeddings" and "original_max_position_embeddings", or is the default fine for a seq2seq transformer? Su specified 512 in his article if I'm not mistaken, but that was a BERT-style model.
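
For what it's worth, RoPE derives its rotation angles directly from the position index at run time, so a maximum length mostly matters for cached tables or extrapolation/scaling tricks rather than for correctness at training lengths. A minimal rotate-half sketch (illustrative only, not the CTranslate2 or OpenNMT-py implementation):

import torch


def apply_rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (..., seq_len, dim), rotate-half layout."""
    seq_len, dim = x.shape[-2], x.shape[-1]
    half = dim // 2
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), inv_freq)  # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x2 * cos + x1 * sin], dim=-1)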

@lecoqnicolas commented Jun 18, 2024

Hello,
After updating to OpenNMT-py 3.5.1 and CT2 4.3.1 and applying the PR, I cannot obtain a functional model that uses RoPE, and even with basic position encoding there are some issues. Curiously, models that use RPE 20 train and convert normally.

ONMT training metrics look good and, moreover, they are coherent between the models that work and those that don't.
[screenshot of training metrics]

Conversion goes without a glitch, but the models simply do not translate accurately: sentences are rendered into single words ("Ms") or phrases ("that is the case") having little to do with the original text, and BLEU on any evaluation dataset is 0 or close to it.

In case it came from some issue with CUDA or PyTorch, I checked several versions, but nothing worked. Also, I cannot use flash_attention 2 on the Tesla V100S GPUs at my disposal; could this be an issue?

@lecoqnicolas

Actually, this is a flash_attn2 issue: I (carefully) merged your PR's modifications into a CT2 version prior to the flash_attention2 implementation, reverted to onmt-py 3.4.1, which doesn't use it either, and successfully, if slowly, trained a RoPE-encoded model with gated-gelu activation.

Flash_attn2 may be all the hype, but there are lots of legacy GPUs still out there working.
