Training budget estimation #20

Open
QiuJYWX opened this issue Jun 27, 2024 · 3 comments
Comments

@QiuJYWX

QiuJYWX commented Jun 27, 2024

We trained our model using AnghaBench compilation results across four optimization levels (O0~O3), selecting samples under 1024 tokens. That gave us a total of 534,564 samples per level, and we trained for 2 epochs on a cluster of 8 Nvidia A100 GPUs.

As for the training times, they were 10 hours for the 1.3B model, 85 hours for the 6.7B model, and 440 hours for the 33B model.

Let me know if you need more info!

Originally posted by @rocky-lq in #3 (comment)

Hi @rocky-lq @albertan017 ,

We are estimating the training budget for reproducing LLM4Decompile. In your previous response (quoted above), _with 534,564 samples per level and a cluster of 8 Nvidia A100 GPUs, training took 10 hours for the 1.3B model, 85 hours for the 6.7B model, and 440 hours for the 33B model_.

In the paper updated on June 19, fine-tuning the 1.3B and 6.7B LLM4Decompile-End models takes 12 and 61 days on 8×A100, respectively, given 7.2 million compilable samples and 1.6 million executable samples. These figures are far larger than the ones quoted above, so we are unsure which numbers to use for our estimate.

Could you please provide more information about the training budget? Also, is all of the training fully supervised fine-tuning?
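To make the gap concrete, here is a minimal sketch that converts both sets of reported figures to hours and GPU-hours and compares them. The numbers are the ones quoted in this thread; the names `v1_hours`, `paper_hours`, and `gpu_hours` are only illustrative, and the sketch assumes both setups used all 8 GPUs for the full reported duration.

```python
GPUS = 8  # both setups are reported on 8x A100

# Wall-clock training time in hours, as quoted from the earlier issue response.
v1_hours = {"1.3B": 10, "6.7B": 85, "33B": 440}

# The updated paper reports days for LLM4Decompile-End; converted to hours here.
paper_hours = {"1.3B": 12 * 24, "6.7B": 61 * 24}


def gpu_hours(wall_clock_hours, gpus=GPUS):
    """Total GPU-hours, assuming every GPU is busy for the whole run."""
    return wall_clock_hours * gpus


for size, hours in paper_hours.items():
    ratio = hours / v1_hours[size]
    print(
        f"{size}: earlier {gpu_hours(v1_hours[size])} GPU-h, "
        f"paper {gpu_hours(hours)} GPU-h, ratio {ratio:.1f}x"
    )
```

With these inputs the ratio comes out at about 28.8x for the 1.3B model and about 17.2x for the 6.7B model.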

@albertan017
Owner

albertan017 commented Jun 27, 2024

In V1, the maximum sequence length is set to 1,024 tokens, whereas in V1.5 it is increased to 4,096. Computational cost rises quadratically with sequence length (in theory, for the attention calculation; in practice, with accelerations, it may not be that much). V2 also uses a larger dataset that has undergone significant deduplication. Together, these factors lead to a 30x increase in training costs.
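As a rough illustration of the sequence-length effect alone, the sketch below computes how the cost of a single maximum-length sample scales when the limit goes from 1,024 to 4,096 tokens. It is a back-of-the-envelope bound, not a measurement: `cost_ratio` is a hypothetical helper, it assumes every sample reaches the maximum length, and it ignores dataset size, epochs, and any attention acceleration.

```python
def cost_ratio(old_len, new_len):
    """Per-sample cost scaling when the sequence length changes.

    "linear" covers work that grows with the number of tokens
    (for example the feed-forward layers); "attention" covers the
    quadratic term of standard self-attention.
    """
    r = new_len / old_len
    return {"linear": r, "attention": r ** 2}


print(cost_ratio(1024, 4096))  # {'linear': 4.0, 'attention': 16.0}
```

The real per-sample increase should fall somewhere between these two factors, depending on how much of the compute goes into attention and on how long the samples actually are.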

@cmberryau

Are you training on a single node or multiple nodes out of interest?

@albertan017
Owner

Are you training on a single node or multiple nodes out of interest?

We train the 1.3B model on a single node. Larger models are typically trained across multiple nodes, although the 6.7B model can still be trained on a single node, depending on the budget.
