
[CC] [autodiff] Support AdStack on C backend #1752

Merged: 6 commits into taichi-dev:master on Aug 29, 2020

Conversation


@archibate requested a review from k-ye on August 23, 2020 05:26
static inline Ti_AdStackPtr Ti_ad_stack_top_primal(Ti_AdStackPtr stack,
                                                   Ti_u32 element_size) {
  Ti_u32 *n = Ti_ad_stack_n(stack);
  return Ti_ad_stack_data(stack) + (*n - 1) * 2 * element_size;
@archibate (Collaborator, Author) commented Aug 23, 2020

I copied this from ad_stack.metal.h; can you tell me why it's n - 1 here? @k-ye
Sometimes n can be 0, and the unsigned subtraction underflows, resulting in a serious segfault when the pointer is 64-bit (-1 wraps to 0xffffffff). But it somehow passes silently on Metal, whose pointers are 32-bit?
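
For illustration, a minimal standalone C sketch of the underflow described here; the element size and printout are illustrative assumptions, not code from this PR:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
  uint32_t n = 0;             /* empty stack: nothing has been pushed */
  uint32_t element_size = 4;  /* e.g. a 4-byte f32 element */
  /* (n - 1) underflows in unsigned arithmetic: 0 - 1 == 0xFFFFFFFF,
     so the byte offset becomes 0xFFFFFFF8 instead of a small number. */
  uint32_t offset = (n - 1) * 2 * element_size;
  printf("offset = 0x%08" PRIX32 "\n", offset);  /* 0xFFFFFFF8 */
  /* On a 64-bit target this u32 offset is zero-extended before the
     pointer addition, so data + offset points ~4 GiB past the stack
     and segfaults. In a 32-bit address space (as on Metal) the pointer
     addition itself wraps modulo 2^32 and can silently land back in
     mapped memory, which would be consistent with the Metal tests
     passing silently. */
  return 0;
}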

@xumingkuan (Contributor)

I think when n is 0, it's pretty OK to have a segfault here; it's just like this in C++:

std::stack<int> s;
s.top();  // undefined behavior: top() on an empty stack

@k-ye (Member)

Exactly, this is just l[len(l) - 1]. As mentioned, accessing the top without a push sounds like a bug.

Ti_i32 *data = (Ti_i32 *)Ti_ad_stack_data(stack);
data[0] = 0;
data[1] = 0;
*n = 1;
@archibate (Collaborator, Author)

I have to add this mock for L108 to prevent the overflow when Ti_ad_stack_top_primal is called without a prior Ti_ad_stack_push. Do you have the same issue on Metal?
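
For context, a self-contained sketch of what this mock amounts to, assuming the [count | (primal, adjoint) pairs] layout implied by the snippets in this thread; Ti_ad_stack_init is a hypothetical name, not the merged code:

#include <stdint.h>

typedef uint8_t   Ti_u8;
typedef uint32_t  Ti_u32;
typedef int32_t   Ti_i32;
typedef Ti_u8    *Ti_AdStackPtr;

/* Assumed layout: a u32 element count, then (primal, adjoint) pairs. */
static inline Ti_u32 *Ti_ad_stack_n(Ti_AdStackPtr stack) {
  return (Ti_u32 *)stack;
}

static inline Ti_AdStackPtr Ti_ad_stack_data(Ti_AdStackPtr stack) {
  return stack + sizeof(Ti_u32);
}

/* Hypothetical init applying the mock above: pre-push one dummy
   (primal, adjoint) pair so that a stray top-of-stack read before any
   push hits the dummy slot instead of an underflowed offset. */
static inline void Ti_ad_stack_init(Ti_AdStackPtr stack) {
  Ti_u32 *n = Ti_ad_stack_n(stack);
  Ti_i32 *data = (Ti_i32 *)Ti_ad_stack_data(stack);
  data[0] = 0; /* dummy primal */
  data[1] = 0; /* dummy adjoint */
  *n = 1;      /* pretend one entry is already pushed */
}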

@k-ye (Member)

do you have the same issue on Metal?

No, I don't remember seeing such an issue. It sounds like a bug if top_primal is called before push, and it probably isn't limited to Metal. Could you provide a test to repro this?

@archibate (Collaborator, Author)

Sorry, I have no knowledge of the autodiff system. I just know that test_ad_for.py fails while test_ad_if doesn't, and it only occurs on CC, not x64. I'll try to find a minimal repro based on the test later.

@archibate requested a review from xumingkuan on August 23, 2020 05:31
codecov bot commented Aug 23, 2020

Codecov Report

Merging #1752 into master will decrease coverage by 0.05%.
The diff coverage is 0.00%.


@@            Coverage Diff             @@
##           master    #1752      +/-   ##
==========================================
- Coverage   42.50%   42.45%   -0.06%     
==========================================
  Files          44       44              
  Lines        6194     6202       +8     
  Branches     1073     1073              
==========================================
  Hits         2633     2633              
- Misses       3406     3414       +8     
  Partials      155      155              
Impacted Files               Coverage         Δ
python/taichi/core/util.py   0.37% <0.00%>    (-0.02%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7fa0d9d...bdc1559.

@xumingkuan (Contributor) left a comment

LGTM!

// Copied from Metal:
typedef Ti_u8 *Ti_AdStackPtr;

static inline Ti_u32 *Ti_ad_stack_n(Ti_AdStackPtr stack) {
@xumingkuan (Contributor)

Well... I would suggest not starting a function name with a capital letter. I haven't looked at the C backend before, so is there a reason the Ti_ prefix is used? I would suggest a cc_ prefix to show that it's the C backend (and probably Cc for classes).

@archibate (Collaborator, Author)

I think when n is 0, it's pretty OK to have a segfault here; it's just like this in C++:

But it makes test_ad_for.py fail. It seems no push is executed before the top.

so is there any reason that the Ti_ prefix is used?

It's a namespace prefix; it prevents possible name conflicts when users export the source code into their own projects.

I would suggest cc_ prefix to show that it's the C backend.

Yes, cc_ is clear within the Taichi repo codebase, but not so clear when exported to a third-party shared object; Ti_ immediately hints that this is a Taichi runtime function.
What's more, we may further support exporting kernels from the LLVM backend, which could use the same naming rule for portability.

@xumingkuan (Contributor)

I think when n is 0, it's pretty OK to have a segfault here; it's just like this in C++:

But it makes test_ad_for.py fail. It seems no push is executed before the top.

Interesting... Looks like a bug in autodiff or the optimization passes.

@archibate (Collaborator, Author)

I think when n is 0, it's pretty OK to have a segfault here; it's just like this in C++:

But it makes test_ad_for.py fail. It seems no push is executed before the top.

Interesting... Looks like a bug in autodiff or the optimization passes.

What's more, it silently passes the x64 and Metal tests. Any idea?

@archibate (Collaborator, Author)

Can we confirm whether this is an issue in the autodiff system or in the C backend?

@xumingkuan (Contributor)

Can we confirm whether this is an issue in the autodiff system or in the C backend?

I'll take a look tomorrow.

@xumingkuan (Contributor)

It's an issue in autodiff.

Test case: test_ad_fibonacci_index() in test_ad_for.py
Log:

[I 08/26/20 19:11:30.900] [compile_to_offloads.cpp:taichi::lang::irpass::`anonymous-namespace'::make_pass_printer::<lambda_fe1d620add3df83d4ee306f9c0ab10ca>::operator ()@18] [fib_c5_0_grad_grad] Simplified I:
kernel {
  <i32 x1> $0 = const [5]
  <i32 x1> $1 = const [0]
  <i32 x1> $2 = const [1]
  <i32 x1> $3 = const [10]
  $4 : for in range($1, $3) (vectorize 1) block_dim=adaptive {
    <i32 x1> $5 = loop $4 index 0
    <f32*x1> $6 = global ptr [S6place_f32], index [$5] activate=true
    <f32 x1> $7 = global load $6
    <f32*x1> $8 = global ptr [S10place_f32], index [] activate=true
    <f32 x1> $9 = atomic add($8, $7)
  }
  $10 : for in range($1, $0) (vectorize 1) block_dim=adaptive {
    <i32 x1> $11 = alloca
    <i32 x1> $12 = alloca
    <i32 x1> $13 : local store [$12 <- $2]
    $14 : for in range($1, $0) (vectorize 1) block_dim=adaptive {
      <i32 x1> $15 = local load [ [$12[0]]]
      <i32 x1> $16 = local load [ [$11[0]]]
      <i32 x1> $17 = add $16 $15
      <i32 x1> $18 : local store [$11 <- $15]
      <i32 x1> $19 : local store [$12 <- $17]
      <f32*x1> $20 = global ptr [S2place_f32], index [$17] activate=true
      <f32 x1> $21 = global load $20
      <f32*x1> $22 = global ptr [S6place_f32], index [$17] activate=true
      <f32 x1> $23 = atomic add($22, $21)
    }
  }
}
[I 08/26/20 19:11:30.903] [compile_to_offloads.cpp:taichi::lang::irpass::`anonymous-namespace'::make_pass_printer::<lambda_fe1d620add3df83d4ee306f9c0ab10ca>::operator ()@18] [fib_c5_0_grad_grad] Gradient:
kernel {
  <i32 x1> $0 = const [5]
  <i32 x1> $1 = const [0]
  <i32 x1> $2 = const [1]
  <i32 x1> $3 = const [10]
  $4 : for in range($1, $3) (vectorize 1) block_dim=adaptive {
    <i32 x1> $5 = loop $4 index 0
    <f32*x1> $6 = global ptr [S6place_f32], index [$5] activate=true
    <f32*x1> $7 = global ptr [S10place_f32], index [] activate=true
    <f32*x1> $8 = global ptr [S12place_f32], index [] activate=true
    <f32 x1> $9 = global load $8
    <f32*x1> $10 = global ptr [S8place_f32], index [$5] activate=true
    <f32 x1> $11 = atomic add($10, $9)
  }
  $12 : for in range($1, $0) (vectorize 1) block_dim=adaptive {
    <f32 x1> $13 = stack alloc (max_size=16)
    <i32 x1> $14 = stack alloc (max_size=16)
    <i32 x1> $15 = stack alloc (max_size=16)
    <i32 x1> $16 = stack alloc (max_size=16)
    <i32 x1> $17 : stack push $16, val = $2
    $18 : for in range($1, $0) (vectorize 1) block_dim=adaptive {
      <i32 x1> $19 = stack load top $16
      <i32 x1> $20 = stack load top $15 // <------------------------------ empty stack!
      <i32 x1> $21 = add $20 $19
      <i32 x1> $22 : stack push $14, val = $21
      <i32 x1> $23 = stack load top $14
      <i32 x1> $24 : stack push $15, val = $19
      <i32 x1> $25 : stack push $16, val = $23
      <f32*x1> $26 = global ptr [S2place_f32], index [$23] activate=true
      <f32 x1> $27 = global load $26
      <f32 x1> $28 : stack push $13, val = $27
      <f32*x1> $29 = global ptr [S6place_f32], index [$23] activate=true
    }
    $30 : reversed for in range($1, $0) (vectorize 1) block_dim=adaptive {
      <i32 x1> $31 = stack load top $14
      <f32*x1> $32 = global ptr [S8place_f32], index [$31] activate=true
      <f32 x1> $33 = global load $32
      <f32 x1> $34 : stack acc adj $13, val = $33
      <f32 x1> $35 = stack load top adj $13
      <f32 x1> $36 : stack pop $13
      <f32*x1> $37 = global ptr [S4place_f32], index [$31] activate=true
      <f32 x1> $38 = atomic add($37, $35)
      <i32 x1> $39 : stack pop $16
      <i32 x1> $40 : stack pop $15
      <i32 x1> $41 : stack pop $14
    }
    <i32 x1> $42 : stack pop $16
  }
}
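
For reference, a debug-build guard along the following lines would trap the empty-stack load marked at $20 instead of returning a pointer ~4 GiB out of bounds. This is a sketch assuming the [count | (primal, adjoint) pairs] layout implied by the snippets above, not code from this PR:

#include <assert.h>
#include <stdint.h>

typedef uint8_t   Ti_u8;
typedef uint32_t  Ti_u32;
typedef Ti_u8    *Ti_AdStackPtr;

static inline Ti_u32 *Ti_ad_stack_n(Ti_AdStackPtr stack) {
  return (Ti_u32 *)stack;
}

static inline Ti_AdStackPtr Ti_ad_stack_data(Ti_AdStackPtr stack) {
  return stack + sizeof(Ti_u32);
}

/* Checked variant of Ti_ad_stack_top_primal: fail loudly (in debug
   builds) when the stack is empty, before the offset can underflow. */
static inline Ti_AdStackPtr Ti_ad_stack_top_primal(Ti_AdStackPtr stack,
                                                   Ti_u32 element_size) {
  Ti_u32 *n = Ti_ad_stack_n(stack);
  assert(*n > 0 && "stack load top on an empty AdStack");
  return Ti_ad_stack_data(stack) + (*n - 1) * 2 * element_size;
}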

@archibate (Collaborator, Author)

Great, so will we merge this PR before or after that issue is resolved?

@xumingkuan (Contributor)

*n = 1; looks too hacky to me... I'd prefer to merge this after that issue is resolved, if that won't take too much time.

@archibate requested a review from xumingkuan on August 28, 2020 09:35
@xumingkuan (Contributor) left a comment

Cool! LGTM.

@archibate added the LGTM label on Aug 28, 2020
@archibate merged commit d94c2dd into taichi-dev:master on Aug 29, 2020
@yuanming-hu mentioned this pull request on Sep 1, 2020
@archibate mentioned this pull request on Sep 6, 2020
Successfully merging this pull request may close these issues.

[AutoDiff] [CC] Add ti.extension.adstack support to C backend