Reenable the distributed checkpointing test #8424

JackCaoG · 2024-11-27T18:43:14Z

This is a follow-up to #8386.

In the previous PR I found that sometimes, during fallback, PyTorch will try to update an existing XLATensor with a CPU tensor of a different shape. In that case we need to remove the sharding spec, otherwise there will be a shape mismatch. However, I found that in distributed checkpointing we swap the existing XLATensor with the CPU tensor, and it seems like we want to keep the sharding spec in that case.
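
To make the two cases concrete, here is a rough pure-Python sketch of the rule as I read it. It is not the torch_xla implementation: `ToyTensor` and `update_from_cpu` are hypothetical names, and the "keep the spec only when the shape is unchanged" condition is an assumption used for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ToyTensor:
    """Hypothetical stand-in for an XLATensor: a shape plus an optional sharding spec."""
    shape: Tuple[int, ...]
    sharding_spec: Optional[str] = None


def update_from_cpu(dst: ToyTensor, cpu_shape: Tuple[int, ...]) -> ToyTensor:
    """Model overwriting `dst` with CPU data of shape `cpu_shape`.

    Assumed rule (for illustration only):
    - shape changed: drop the sharding spec, since it describes the old shape;
    - shape unchanged: keep the sharding spec, as in the checkpoint-restore swap.
    """
    if cpu_shape != dst.shape:
        return ToyTensor(shape=cpu_shape, sharding_spec=None)
    return ToyTensor(shape=cpu_shape, sharding_spec=dst.sharding_spec)


# Fallback case: the CPU tensor has a different shape, so the spec is cleared.
fallback = update_from_cpu(ToyTensor((8, 4), "sharded_on_dim0"), (4, 4))

# Checkpoint-restore case: same shape, so the spec is preserved.
restored = update_from_cpu(ToyTensor((8, 4), "sharded_on_dim0"), (8, 4))
```

In this toy model `fallback.sharding_spec` ends up as `None`, while `restored.sharding_spec` stays `"sharded_on_dim0"`.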

@jonb377 one concern I have is that the test only covers the single-host case. I feel like in an actual multi-host case the CPU tensor will have a different (sharded) shape than the sharding spec expects? I am not sure if we have such a test somewhere. Even if we clear the sharding spec, after a torch_xla.sync() the tensor will be moved to the device, but most likely replicated. I am a bit worried that I am breaking distributed checkpointing here.

Reenable the distributed checkpointing test

bf771b0

JackCaoG added the tpuci label Nov 27, 2024

JackCaoG marked this pull request as ready for review November 28, 2024 09:06

JackCaoG requested a review from tengyifei November 28, 2024 09:06

tengyifei approved these changes Dec 2, 2024

JackCaoG merged commit 591c397 into master Dec 2, 2024
12 checks passed
