
Maximum load of the first transformer iteration #5

Open
deanAirre opened this issue Aug 27, 2024 · 4 comments

deanAirre commented Aug 27, 2024

Good evening,

I am very interested in research on making Transformers more approachable to the public, especially for communities without good GPUs.

I already solved my previous question, but I want to ask another one. Is the first iteration of transformer training supposed to be heavy in load because it still carries all the unpruned tokens? I know the first training iteration is necessary for determining which tokens to focus on in the pruning stage, but how large is the maximum load of that first iteration compared to the final pruned model with good accuracy?
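To make "load" concrete, here is the rough per-block FLOP count I have in mind as a function of token count; the dimension and keep ratio below are illustrative guesses on my part, not SPViT's actual settings:

```python
# Rough per-block cost of a ViT encoder layer as a function of token count N.
# Attention scales as O(N^2 * d), the MLP as O(N * d^2). Dimensions are
# ViT-Small-like (d = 384, N = 197 for a 224x224 input with 16x16 patches).

def block_flops(n_tokens: int, dim: int, mlp_ratio: int = 4) -> int:
    qkv = 3 * n_tokens * dim * dim                 # Q, K, V projections
    attn = 2 * n_tokens * n_tokens * dim           # QK^T and (attn @ V)
    proj = n_tokens * dim * dim                    # attention output projection
    mlp = 2 * n_tokens * dim * (mlp_ratio * dim)   # two MLP matmuls
    return qkv + attn + proj + mlp

full = block_flops(197, 384)    # first iteration: all tokens, nothing pruned
pruned = block_flops(99, 384)   # hypothetical block keeping ~50% of tokens
print(f"pruned / unpruned cost per block: {pruned / full:.2f}")
```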

Thanks in advance,
Regards,
Sean.

deanAirre reopened this Sep 17, 2024
deanAirre changed the title from "Confidence applying this within CPU only learning" to "Maximum load of the first transformer iteration" Sep 17, 2024
ZLKong (Collaborator) commented Oct 10, 2024

Hi Sean,

By "the first iteration of transformer," do you mean the first transformer block, or the first iteration during training?

deanAirre (Author) commented

Dear PeiyanFlying,

Yes, the first iteration during training, before pruning happens. Also, in case it is relevant: have you heard of a method to actually 'infuse' a trained model into the first iteration blocks of the transformer, so it doesn't have to train from scratch?

Thanks in advance, best regards,
Sean.

ZLKong (Collaborator) commented Oct 23, 2024

Hi Sean,

The first iteration of transformer training should be heavy in load because pruning has not started yet, but it should be a similar load to that of the original ViT.

Regarding "infuse," I am not sure about this. I assume it is similar to distillation, or to the lottery ticket method, where you get good initial weights for the layers and then do fine-tuning or training?
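If that is what you mean, a minimal sketch of the warm-start idea could look like the following. The student here is a plain DeiT stand-in (the real one would be your pruning variant with a DeiT-compatible backbone), so the model choices are assumptions, not code from this repo:

```python
# Sketch: warm-start a pruning model from a pretrained DeiT checkpoint instead
# of training from scratch. The "student" below is a plain DeiT stand-in; in
# practice it would be the SPViT variant with a DeiT-compatible backbone.
import timm

pretrained = timm.create_model("deit_small_patch16_224", pretrained=True)
student = timm.create_model("deit_small_patch16_224", pretrained=False)

# strict=False copies every matching backbone weight and would leave any
# pruning-specific parameters (token selectors, gates, ...) randomly initialized.
result = student.load_state_dict(pretrained.state_dict(), strict=False)
print("missing keys (stay random):", result.missing_keys)
print("unexpected keys (ignored):", result.unexpected_keys)
```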

deanAirre (Author) commented Oct 24, 2024

Dear PeiyanFlying,

Yes, the first iteration should be as heavy as the original ViT because no pruning has been done yet, so I was looking for a way to 'infuse' a pretrained model so that it doesn't have to be as heavy as the original transformer. Since it is confirmed that it will be as heavy, I will look for a way, maybe distillation or the lottery ticket method, to make SPViT even lighter.

But then I wonder how your 'adaptive pruning' method will 'see where it is suitable to stop' if it doesn't hold the embedding table from the first iteration of ViT training. Do you think it will still work if I 'distilled' a model into the first SPViT training layer so that it goes straight to pruning?
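For reference, the distillation objective I have in mind is the standard logit-matching loss sketched below; the temperature and weighting are my own illustrative choices, not anything taken from SPViT:

```python
# Sketch of logit distillation: a frozen pretrained ViT teacher guides the
# pruned student, so the student does not have to learn from scratch.
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target term: KL between temperature-softened distributions,
    # scaled by T^2 as in Hinton et al.'s formulation.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    ce = F.cross_entropy(student_logits, labels)  # hard-label term
    return alpha * kd + (1 - alpha) * ce
```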

Thanks in advance, the discussion has been very helpful,
Sean
