DID loop split on reshaped IDs #3875

Merged · Priya2698 merged 14 commits into main from pm/reshape_propagate on Mar 6, 2025
Conversation

@Priya2698 (Collaborator) commented Feb 11, 2025

This PR updates propagateReshapeTransform to support DID loop split.

When the loop split is applied to the iter domains being reshaped, the logical reshaped iter domain is no longer present in the loop domain, since it has been split. In this case, we check whether there is a sharded loop ID and compare the logical reshape iter domain against the producer of that DID split.
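
As a rough illustration of that check, here is a minimal sketch (not the exact PR code; `sharded_axis` and the use of the split's inner output are assumptions based on the snippets quoted in the review below):

    // Try to find the logical reshape ID directly in the loop domain.
    auto find_it = std::find(
        tv->getLoopDomain().begin(), tv->getLoopDomain().end(), logical_id);

    // If it is not found and the TV has a DID-parallelized loop ID, the
    // logical ID may have been loop-split for sharding. Check whether the
    // sharded loop ID's definition is a Split whose input is the logical
    // reshape ID; if so, use that split's inner output as a stand-in.
    if (find_it == tv->getLoopDomain().end() && sharded_axis >= 0) {
      auto* split = dynamic_cast<Split*>(
          tv->getLoopDomain().at(sharded_axis)->definition());
      if (split != nullptr && split->in() == logical_id) {
        find_it = std::find(
            tv->getLoopDomain().begin(),
            tv->getLoopDomain().end(),
            split->inner());
      }
    }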

@Priya2698 (Collaborator, Author):

!test

github-actions bot commented Feb 11, 2025

Review updated until commit de6a4b2

Description

  • Updated propagateReshapeTransform to handle DID loop splits.

  • Added test for sharded split reshape IDs.

  • Improved reordering of reshape dimensions in loop domain.


Changes walkthrough 📝

Relevant files:

Enhancement · csrc/scheduler/utils.cpp (+38/-10) · Improve reshape dimension reordering

  • Added logic to find all reachable IDs between the logical reshape IDs and the loop domain.

  • Reordered reshape dimensions to the front of the domain.

  • Improved error handling for missing logical IDs in the loop domain.

Tests · tests/cpp/test_multidevice_sharding.cpp (+62/-0) · Add test for sharded split reshape

  • Added test case for sharded split reshape IDs.

  • Demonstrated propagation of DID loop splits.

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review

    Complexity

    The new logic for finding reachable IDs and reordering them might introduce additional computational overhead. Consider profiling this part to ensure it does not negatively impact performance.

    // Find all reachable ids between the logical id and the loop domain.
    // If the ids are in the loop domain, reorder them to the front.
    auto transforms = DependencyCheck::getAllExprsBetween(
        {logical_id},
        {tv->getLoopDomain().begin(), tv->getLoopDomain().end()});
    std::unordered_set<IterDomain*> reachable_ids;
    // Add the logical id for the case where it is directly in the loop
    // domain.
    reachable_ids.insert(logical_id);
    
    for (auto expr : transforms) {
      auto outputs = ir_utils::filterByType<IterDomain>(expr->outputs());
      reachable_ids.insert(outputs.begin(), outputs.end());
    }
    
    bool has_reachable_loop_id = false;
    for (auto loop_idx :
         c10::irange(static_cast<int64_t>(tv->getLoopDomain().size()))) {
      if (reachable_ids.count(tv->axis(loop_idx)) == 0) {
        continue;
      }
      has_reachable_loop_id = true;
      // Reorder the reshape dimensions to the front of the domain
      old2new[loop_idx] = (int64_t)old2new.size();
    }

    Error Handling

    The error message in NVF_ERROR could be more descriptive to help diagnose issues when has_reachable_loop_id is false.

          has_reachable_loop_id,
          "Require ",
          logical_id,
          " is in the active domain of ",
          tv->toString(),
          " for view propagation.");
    }

    Test Coverage

    Ensure that the new test case covers all edge cases, including scenarios where the reshaped dimensions are not sharded or where the loop split does not align with the reshaped dimensions.

    TEST_F(MultiDeviceTest, ShardedSplitReshapeIds) {
      auto fusion = std::make_unique<Fusion>();
      FusionGuard fg(fusion.get());
    
      const int d = communicator_->size();
      const int64_t b = 2, s = 2, h = 4, e = 3;
    
      TensorView* tv0 = makeContigConcreteTensor(
          {b, s, d * h * e}); // in: loop domain: {b, s, d*h*e}
      TensorView* tv1 = reshape(
          tv0,
          {b, s, d * h * e},
          {b, s, d * h, e}); // out: loop domain: {b, s, d*h, e}
    
      fusion->addInput(tv0);
      fusion->addOutput(tv1);
    
      auto mesh = DeviceMesh::createForNumDevices(d);
    
      // Propagate transform from reshaped output to input.
      // Without this propagation, the two DID axes on `in` and `out` will not be
      // mapped together in the ID model. This causes scheduling to fail due to
      // resharding.
      TransformPropagator propagator_c2p(tv1);
      MaxLogicalDomainInfoSpanningTree(tv1).traverse(&propagator_c2p);
      // in: loop domain: {b, s, d*h, e} after transform propagation
    
      // Loop split and parallelize input
      tv0->setDeviceMesh(mesh);
      tv0->split(-2, d, /*inner_split=*/false);
      tv0->axis(-3)->parallelize(ParallelType::DIDx);
      // in: loop domain: {b, s, DIDx{d}, h, e}
    
      // Propagate DID loop split to output
      TransformPropagator propagator_p2c(tv0);
      MaxLogicalDomainInfoSpanningTree(tv0).traverse(&propagator_p2c);
      // out: loop domain: {b, s, d, h, e} after transform propagation
    
      // Parallelize output
      scheduler_utils::parallelizeAllLike(
          tv0,
          /*pos=*/-1,
          /*selected_tv=*/{tv1});
      // out: loop domain: {b, s, DIDx{d}, h, e} after parallelization
    
      tv0->setAllocationDomain(tv0->getLoopDomain(), true);
      tv1->setAllocationDomain(tv1->getLoopDomain(), true);
    
      FusionExecutorCache executor_cache(std::move(fusion));
      at::Tensor inp = at::randn({b, s, d * h * e}, tensor_options);
      at::Tensor sharded_inp = shardTensor(inp, tv0);
      at::Tensor nvf_out =
          executor_cache.runFusionWithInputs({sharded_inp})[0].as<at::Tensor>();
      testValidate(
          executor_cache.fusion(),
          {nvf_out},
          {sharded_inp},
          {sharded_inp.view({b, s, h, e})},
          __LINE__,
          __FILE__);
    }
    
    } // namespace nvfuser

    @Priya2698 Priya2698 marked this pull request as ready for review February 12, 2025 22:34
    @Priya2698 Priya2698 force-pushed the pm/reshape_propagate branch from d3c602d to 52a7a0c Compare February 12, 2025 22:57
    @Priya2698 Priya2698 requested a review from wujingyue February 12, 2025 22:58
    @Priya2698 Priya2698 force-pushed the pm/reshape_propagate branch from 858f9fc to 90ab5ee Compare February 13, 2025 00:57
    @wujingyue wujingyue requested a review from naoyam February 13, 2025 04:01
    @Priya2698 (Collaborator, Author):

    !test


    // Reorder the reshape dimensions to the front of the domain
    Collaborator:

    @naoyam I'm quite confused by this pre-existing logic, which I need to understand before I can make sense of this PR. Why is it necessary to move reshape dimensions to the front of the loop domain? It can conflict with the pre-existing assumption that DIDs have to be at the front as well.

    Collaborator:

    I think this is related to the propagation done at line 2279. IIRC, it propagates the outermost N dimensions, where N is old2new.size() in this case. Since here we just want to propagate the transformations related to the rfactor, this is how we limit the propagation.

    We can probably just reorder tv back after line 2279.
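
    A minimal sketch of that reorder-then-restore idea (assumed code, not the PR's actual implementation; `old2new` is the map built in the snippet quoted earlier):

        // Reorder the reshape IDs to the front so that only the outermost
        // old2new.size() dimensions participate in the propagation.
        tv->reorder(old2new);

        // ... run the pre-existing reshape-transform propagation (the logic
        // around line 2279) ...

        // Then restore the original order by inverting old2new.
        std::unordered_map<int64_t, int64_t> new2old;
        for (const auto& [old_pos, new_pos] : old2new) {
          new2old[new_pos] = old_pos;
        }
        tv->reorder(new2old);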

    auto find_it = std::find(
    tv->getLoopDomain().begin(), tv->getLoopDomain().end(), logical_id);

    // If not found directly and there is a sharded loop ID,
    Collaborator:

    I think I see what the below part is trying to do and why, which seems to make sense, but can you expand the comment and elaborate a little more?

    Collaborator:

    This bears several assumptions that will break in the foreseeable future.

    With context parallelism, the sequence dimension s will be split into [tp, iDIDy{cp}, s/tp/cp], so the code below won't be able to find tp and s/tp/cp. Similarly, with overlapping, the sequence dimension s will be split into [sp, iDID{tp}, s/sp/tp], where sp is the stream parallelization factor. See this test for the idea.

    I understand this change does fix some narrow cases that we care about at this very moment, but I'll have to think more about how to fix the broader issue...
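
    For concreteness, a hedged sketch of the context-parallel split described above (the axis position and the `tp`/`cp` factors are assumptions for illustration only):

        const int64_t tp = 2;  // tensor-parallel factor (illustrative)
        const int64_t cp = 4;  // context-parallel factor (illustrative)
        // s -> [tp, s/tp] -> [tp, cp, s/(tp*cp)], with the cp factor
        // parallelized on DIDy. The check in this PR only inspects the single
        // Split that directly produces the DID loop ID; its input here is
        // s/tp rather than the logical s, so comparing against the logical
        // reshape ID would fail.
        tv->split(/*axis=*/1, tp, /*inner_split=*/false); // s -> [tp, s/tp]
        tv->split(/*axis=*/2, cp, /*inner_split=*/false); // s/tp -> [cp, s/(tp*cp)]
        tv->axis(2)->parallelize(ParallelType::DIDy);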

    Collaborator:

    (I still haven't given up on improving ID model.)

    If we have to do graph traversal like this, we may want to do it in a place where the logic can be generalized and reused (and therefore in the ID model). At this moment, there are two use cases:

    1. splitting reshape: [h]=>[d, h/d] and [h]=>[d,a/d,h/a]
    2. merging reshape: [a,h/a]=>[d,a/d,h/a] and [a,h/a]=>[h]=>[d,h/d]

    We want the ID model to map the d's in both cases so these reshapes won't be considered resharding.

    How much harder is it to make ID model support these cases than working around using reshape transformation? I suspect the latter has a bigger blast radius because the former is local to ID model and the latter changes TensorViews.

    Collaborator (Author):

    I realized the same limitation for DID loop split on slice:

      auto fusion = std::make_unique<Fusion>();
      FusionGuard fg(fusion.get());
    
      const int d = communicator_->size();
      const int64_t b = 2, s = 2, h = 4;
    
      TensorView* in = makeContigConcreteTensor(
          {b, s, 3 * d * h});
      // Slice the last dimension from 3*d*h down to d*h.
      TensorView* out = slice(
          in,
          {0, 0, 0},
          {b, s, d * h});

      fusion->addInput(in);
      fusion->addOutput(out);

      auto mesh = DeviceMesh::createForNumDevices(d);
      // DID loop split both input and output on the (sliced) last dimension.
      for (auto* tv : {in, out}) {
        tv->setDeviceMesh(mesh);
        tv->split(-1, d, /*inner_split=*/false);
        tv->axis(-2)->parallelize(ParallelType::DIDx);
        tv->setAllocationDomain(tv->getLoopDomain(), true);
      }
    

    I was trying to manually handle the case of SliceOp in hasDifferentShardings but it would make certain assumptions about the parallelization patterns and can easily break.

    Collaborator (Author):

    > I understand this change does fix some narrow cases that we care about at this very moment, but I'll have to think more about how to fix the broader issue...

    Yes, I agree. I wanted to add an example to demonstrate how reshapes can be loop split but it certainly does not cover all the cases.

    Collaborator:

    > I still haven't given up on improving ID model.

    (Sorry -- I wish I knew more about IdModel to be more constructive.)

    Another use case to consider is manual sharding -- the user wants to manually shard a subset of TVs to improve perf when our sharding propagation is suboptimal.

    They may well annotate [b,s,h]=>(reshape)=>[b,s,a,h/a] as follows

    in: [b,s,h] => [b,s,d,h/d]
    out: [b,s,a,h/a] => [b,s,d,a/d,h/a]
    

    and expect nvFuser to recognize this reshape is local. In this case, it's hard to replay the reshape on the input because h there is already split by d.
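
    A hedged sketch of that manual annotation with the scheduling API (the `in`/`out` names and axis positions are assumptions for illustration):

        // in: [b, s, h] -> [b, s, DIDx{d}, h/d]
        in->split(2, d, /*inner_split=*/false);
        in->axis(2)->parallelize(ParallelType::DIDx);

        // out: [b, s, a, h/a] -> [b, s, DIDx{d}, a/d, h/a]
        out->split(2, d, /*inner_split=*/false);
        out->axis(2)->parallelize(ParallelType::DIDx);

        // For the reshape [b,s,h] -> [b,s,a,h/a] to be treated as local, the
        // two DIDx{d} IDs need to be mapped, even though on `in` the DID
        // split consumes h directly while on `out` it consumes a.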

    auto split = dynamic_cast<Split*>(
    tv->getLoopDomain().at(sharded_axis)->definition());
    if (split != nullptr && split->in() == logical_id) {
    find_it = std::find(
    Collaborator:

    While I understand your intention, I don't know the implications of doing this for TransformPropagator. E.g.

    root=[b, s, h], logical=[b, s, a, h/a], loop=[b, s, d, a/d, h/a]
    

    This will move a/d and h/a to the front so the new loop domain becomes [a/d, h/a, b, s, d] and later ask TransformPropagator to replay at replayed_pos_ 2. What is TransformPropagator supposed to do with that? The first two loop IDs (a/d and h/a) don't even form a split in this TV.

    cc @naoyam

    Collaborator (Author):

    I see your point. In the case of reshape with DID loop split, we have already propagated the reshape upwards, so the TransformPropagator only reorders the axes when called later. In the absence of the earlier reshape propagation before the loop split, the behavior could be erroneous since those loop IDs don't form a split.

    Although, since the reshape has already been propagated, and, as @naoyam mentioned above, the tv is reordered back, maybe this propagation can be skipped altogether.
    Let me think about it more and see what the schedulers expect from this propagateReshapeTransform.

    However, this may not work for the manual sharding case you mentioned in the above comment.

    Collaborator (Author):

    One caveat for the above to hold: the reshape transform must have been applied to all TensorViews preceding the reshape in the given fusion segment.

    In the transformer forward, for the split-reshape, the order is linear -> slice -> reshape -> permute -> SDPA. Since linear will be its own segment, that leaves slice and reshape. I am uncertain whether this will be a single segment, since it can potentially depend on how the sharding on slice is represented.

    For the merge-reshape after SDPA, the order is SDPA -> permute -> reshape -> linear. Again, SDPA and linear will be their own segments.

    More generally, the boundary up to which the reshape transform is propagated upwards is important, since we may see different patterns appear in other models.

    FWIW, in my tests, I found that TransformPropagator can propagate the split-reshapes upwards after DID loop split as well, but we lose the DID parallelization. This might be similar to the comment here.

    An orthogonal issue I see here is resharding at the boundary up to which the reshape has been propagated upwards. At that boundary, we will go from [h] -> [a, h/a] and should hit the same resharding error.

    Collaborator:

    > but we lose the DID parallelization

    This is expected, and it is the reason for parallelizeAllLike and functions like that.
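
    In other words, after propagating the transform, the DID parallelization is re-applied explicitly, as the test added in this PR does (a short sketch; tv0/tv1 refer to the test quoted in the reviewer guide above):

        // Propagate the DID loop split from the reference TV, then re-apply
        // the DID parallelization, which propagation alone does not carry over.
        TransformPropagator propagator(tv0);
        MaxLogicalDomainInfoSpanningTree(tv0).traverse(&propagator);
        scheduler_utils::parallelizeAllLike(
            tv0, /*pos=*/-1, /*selected_tv=*/{tv1});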

    @Priya2698 (Collaborator, Author):

    @wujingyue should we move forward with this PR?
    It fixes the case where we are using transform propagator for reshape before the DID loop split.
    PR #3953 currently uses hardcoding and we may want to merge it after PR #3482.
    Wdyt?

    @Priya2698 (Collaborator, Author):

    !test

    auto split = dynamic_cast<Split*>(
    tv->getLoopDomain().at(sharded_axis)->definition());
    if (split != nullptr && split->in() == logical_id) {
    // Move the DIDx dimension to the front
    Collaborator:

    Would it be more general to move all loop IDs reachable from logical_id to the front? Still, DIDx needs to be at the very front. That way, you don't need to assume that DIDx is the immediate outer split of logical_id, and the code can probably be made simpler.

    Collaborator (Author):

    > Still, DIDx needs to be the very front.

    Is this for the schedulers? In that case, we don't have to worry about DID being at the front here. reorderDIDToFront is called after this function within the scheduler.

    > Will it be more general to move all loop domains reachable from logical_id to the front?

    Yes, that should work. Are you aware of any direct utilities for this? Otherwise, I should be able to use getExprsBetween to find the relevant transforms and collect the loop IDs from their outputs.

    Collaborator:

    > getExprsBetween

    That's about right. I suspect getInputsTo would also work. I didn't try enough to understand their differences. Many of the graph traversal utilities seem to overlap and/or be redundant...

    Collaborator (Author):

    I am using getExprsBetween as follows:

     auto transforms = StmtSort::getExprsBetween(
         {logical_id},
         {tv->getLoopDomain().begin(), tv->getLoopDomain().end()});

    For the newly added test, the reshaped TensorView is: [i{b}, i{s}, i{a}, i{h/a}] (I am using TE notation here; the test uses a different notation to guarantee divisibility).
    For the logical ID i{a}, I also see the split h -> [a, h/a], whereas I expected to only see the DID split. Similarly, for the logical ID i{h/a}, I expected to see no transforms/exprs in between since it is found directly, but I see both the h -> [a, h/a] and a -> [d, a/d] splits.
    Getting only the expected transforms is required since:

    1. There are also transforms that are not necessarily on the reshaped IDs (for example, in the test case ViewWithSplit, we will see the split creating DIDx even though it is not on a reshaped ID), and these should not be reordered or propagated.
    2. It is difficult to tell whether there is at least one loop iter domain reachable from a particular reshaped logical ID.
    Logical ID: iS7{4}rf
    
    Expr: Outer split: iS7{4}rf by factor 1 -> ideviceIdx.x13{1}, iS14{4}
    Output: ideviceIdx.x13{1}
    Output: iS14{4}
    
    Expr: Outer split: iS6{48}rf by factor 4 -> iS7{4}rf, iS8{12}rf
    Output: iS7{4}rf
    Output: iS8{12}rf
    

    Any suggestions on what I may be missing in this function? I have not used it from a specific ID like this before, only between entire domains.

    @Priya2698 (Collaborator, Author) commented Feb 28, 2025:

    I should be using DependencyCheck::getAllExprsBetween!
    That gives me the expected transforms.

    Collaborator:

    TIL!

    @Priya2698 (Collaborator, Author):

    !test

    @Priya2698 (Collaborator, Author):

    !test.

    }

    bool has_reachable_loop_id = false;
    for (auto id : reachable_ids) {
    Collaborator:

    I don't know whether reachable_ids will be ordered -- it really depends on the implementation of getAllExprsBetween. Therefore, I'd instead loop over the loop domain and try to find a match in reachable_ids (which should probably be a set instead of a vector). It's roughly the same logic, but more deterministic and more aligned with the existing order in the loop domain.

    @wujingyue (Collaborator) left a comment:

    LGTM overall

    @Priya2698 Priya2698 requested a review from wujingyue March 4, 2025 23:47
    @Priya2698 (Collaborator, Author):

    @wujingyue The CI failures seem like script failures; I will re-run. The PR is ready for another review.

    @wujingyue (Collaborator) left a comment:

    LGTM with comments

    @Priya2698 (Collaborator, Author):

    !test

    @Priya2698 (Collaborator, Author):

    !test

    @@ -2257,7 +2257,8 @@ void propagateReshapeTransforms(Fusion* fusion, const ComputeAtMap& ca_map) {
    }

    bool has_reachable_loop_id = false;
    -    for (auto loop_idx : c10::irange(static_cast<int64_t>(tv->getLoopDomain().size()))) {
    +    for (auto loop_idx :
    +         c10::irange(static_cast<int64_t>(tv->getLoopDomain().size()))) {
    Collaborator:

    Suggested change:
    -         c10::irange(static_cast<int64_t>(tv->getLoopDomain().size()))) {
    +         c10::irange(std::ssize(tv->getLoopDomain()))) {

    FYI. Don't bother changing this if you are about to submit the PR.

    Collaborator (Author):

    Will make this change in the other PR!

    @Priya2698 Priya2698 merged commit 1bbc745 into main Mar 6, 2025
    48 of 49 checks passed
    @Priya2698 Priya2698 deleted the pm/reshape_propagate branch March 6, 2025 03:25