Updated the UI for TPU support #201
base: main
Conversation
👷 Deploy request for code-generator pending review. 🔨 Explore the source changes: 460772b
@ydcjeff can you help with that please? How to make the following:
@sayantan1410 first comment about the UI: basically we cannot and should not choose the backend if distributed training is not specified => "Choose a Backend" should be a subpart of "Distributed Training" and should be active only if distributed training is selected.
@vfdev-5 Okay, will change that!
@vfdev-5 I have made the changes as you suggested; now the "Choose a Backend" option is shown only when distributed training is selected.
Thanks for the update @sayantan1410
Now let's move on with the content update once a backend is selected.
@vfdev-5 can you please guide me on how to do that?
@sayantan1410 we are running a coding sprint right now (please check our Discord, #start-contributing channel). If you can join it, it could be a good opportunity to learn more about the projects and be guided.
@vfdev-5 Sorry for the delay; I have made the changes requested above. Can you please guide me on how to update the content once a backend is selected?
@sayantan1410 no worries and thanks for the update! For example, here are the content changes we would like to make for each template. Let me illustrate with the vision classification template. In the README we give some launch commands depending on the config, for example:

```
python -m torch.distributed.launch \
    --nproc_per_node #:::= nproc_per_node :::# \
    --nnodes #:::= it.nnodes :::# \
    --node_rank 0 \
    --master_addr #:::= it.master_addr :::# \
    --master_port #:::= it.master_port :::# \
    --use_env main.py \
    --backend nccl
```

We can ... and I think that's it for the GLOO backend... By the way, we need to replace ...
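To give an idea, here is a rough sketch (assuming main.py follows the usual ignite.distributed idiom; this is not the actual template content): with the xla-tpu backend no torch.distributed.launch command is needed, because idist.Parallel can spawn the TPU processes itself.

```python
# Rough sketch, not the generated template itself: ignite.distributed spawns
# the TPU processes directly, so there is no torch.distributed.launch step.
import ignite.distributed as idist


def training(local_rank, config):
    device = idist.device()  # resolves to an XLA device on TPU
    print(f"rank {idist.get_rank()} runs on {device}")


if __name__ == "__main__":
    config = {}
    # 8 processes, one per TPU core on a v2-8 / v3-8 device
    with idist.Parallel(backend="xla-tpu", nproc_per_node=8) as parallel:
        parallel.run(training, config)
```

In that case the README launch instruction could reduce to a plain `python main.py`.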
@vfdev-5 Okay, then I will make another draft PR with the updates you suggested, and we can take that forward. Also, should I close this PR or keep it as it is?
Sounds good for another draft PR, and let's keep this one open.
@sayantan1410 can you now sync this PR against main?
@vfdev-5 Yeah sure, doing it!
@vfdev-5 Added the XLA-TPU option to the backend dropdown.
Sounds good, so what's next? :)
@vfdev-5 I have a very stupid question: let's say someone selects XLA-TPU as the backend. Do we then have to check whether XLA is installed on the system and whether a TPU is available, and give them the template code accordingly?
We do not check the infrastructure in the code-generator app. For example, when nccl and distributed training with 1000 nodes and 10000 processes are specified, we just say in the README how to launch it, and that's it.
@vfdev-5 Okay, I will try it and keep you updated.
Description - Adding support for TPU
Fix #173
I have updated the UI but couldn't figure out the next step: if XLA-TPU is selected, training should be distributed across 8 processes.
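A tiny sketch of that rule (the helper name and config shape are made up for illustration; this is not the app's actual code):

```python
# Hypothetical helper, not the code-generator's real config logic: when the
# xla-tpu backend is selected, the process count is fixed at 8 (one per TPU
# core) instead of whatever the user requested.
def resolve_nproc_per_node(backend: str, requested: int) -> int:
    if backend == "xla-tpu":
        return 8
    return requested


assert resolve_nproc_per_node("xla-tpu", 1) == 8
assert resolve_nproc_per_node("nccl", 4) == 4
```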
Here's a screenshot of the UI