Latest torch titan changes #6

Merged
merged 7 commits into from
Aug 24, 2024

philippguevorguian
Collaborator

No description provided.

tianyu-l and others added 7 commits August 22, 2024 16:06
ghstack-source-id: ab6a7cec6ba4f4690f5834d22bc16d8d9f2bdba8
Pull Request resolved: #555
In this PR, we mainly measured the performance and loss curves for the 405B
model with some optimization techniques we recently developed. We also
want to log the actual peak TFLOPs used for the MFU calculation, for
cross-validation. In addition, we should get device information from the
system rather than from the device name, because the device name does not
contain "NVL" or "SXM".
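
For readers cross-checking the logged numbers, the MFU calculation mentioned above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the names `mfu_percent` and `format_mfu_log`, their parameters, and the log format are all hypothetical, and the peak FLOPs value is assumed to be supplied by the caller.

```python
def mfu_percent(flops_per_token: float, tokens_per_second: float, peak_flops: float) -> float:
    """Model FLOPs utilization as a percentage of the device's peak FLOPs.

    All three arguments are assumptions of this sketch: the caller is
    expected to know the model's FLOPs per token, the measured throughput,
    and the peak FLOPs of the device in use.
    """
    if peak_flops <= 0:
        raise ValueError("peak_flops must be positive")
    achieved_flops = flops_per_token * tokens_per_second
    return 100.0 * achieved_flops / peak_flops


def format_mfu_log(mfu: float, peak_flops: float) -> str:
    """Hypothetical log line that reports the peak value next to the MFU,
    so the denominator can be verified after the fact."""
    return f"mfu: {mfu:.2f}% (peak: {peak_flops / 1e12:.0f} TFLOPs)"
```

Logging the peak value alongside the MFU makes it possible to recompute the percentage by hand when the assumed device peak is in doubt.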

<img width="496" alt="image"
src="https://github.com/user-attachments/assets/ba822de5-cf23-4ecd-b29c-70f9aac38290">
As title. We have updated the peak FLOPs for H100, so we need to use the
correct number here.
The `lspci` command is part of the `pciutils` package, which provides
tools for listing and querying PCI devices. However, `pciutils` is
not installed on the CI machines. This PR first unblocks the CI failure;
we can then decide whether to make `pciutils` a requirement for Titan.
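
The fallback described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the change made in this PR: `describe_gpu`, `_run_lspci`, and the injectable `run_lspci` parameter are hypothetical names, and filtering the output on the substring "NVIDIA" is an assumption about what the relevant `lspci` lines contain.

```python
import subprocess
from typing import Callable


def _run_lspci() -> str:
    """Return raw `lspci` output. Raises FileNotFoundError when the
    binary is absent, e.g. when `pciutils` is not installed."""
    result = subprocess.run(["lspci"], stdout=subprocess.PIPE, text=True)
    return result.stdout


def describe_gpu(device_name: str, run_lspci: Callable[[], str] = _run_lspci) -> str:
    """Prefer the system's PCI listing; fall back to `device_name`.

    `run_lspci` is injectable so the fallback path can be exercised on
    machines that do have `lspci` available.
    """
    try:
        output = run_lspci()
    except FileNotFoundError:
        # lspci is unavailable: keep going with the name we already have.
        return device_name
    # Assumption: lines describing the GPU mention "NVIDIA".
    matches = [line for line in output.splitlines() if "NVIDIA" in line]
    # An empty listing also falls back to the device name.
    return " ".join(matches) or device_name
```

Catching only `FileNotFoundError` keeps the fallback narrow: a missing binary is tolerated, while other failures still surface.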
During a rebase, the legacy float8 enabling flag was left in the
405B toml. Let's remove it. This does not affect the perf numbers we
obtained, because the old flag is a no-op after the rebase.
ghstack-source-id: 3ece57ae6d8dbf7ff66e3c41f1804ddb08078ba4
Pull Request resolved: #525
@philippguevorguian philippguevorguian merged commit 2e55278 into YerevaNN:sync_torch_titan Aug 24, 2024
1 of 4 checks passed
3 participants