
[type] [Opt] Bit_array vectorization (Step 1) #2101

Merged
merged 22 commits into taichi-dev:master
Dec 25, 2020

Conversation


@TH3CHARLie TH3CHARLie commented Dec 17, 2020

@TH3CHARLie TH3CHARLie marked this pull request as draft December 17, 2020 16:32
@TH3CHARLie
Collaborator Author

before:

[I 12/22/20 01:15:00.160] [compile_to_offloads.cpp:operator()@18] [assign_c4_0] Bit Loop Vectorized:
kernel {
  $0 : struct for in S2bit_array<ba(cu1x32)> (vectorize 1) (bit_vectorize 32) noneblock_dim=adaptive {
    <i32> $1 = loop $0 index 0
    <i32> $2 = loop $0 index 1
    <*cu1> $3 = global ptr [S3place<cu1><bit>], index [$1, $2] activate=true
    <u32> $4 = global load $3
    <*cu1> $5 = global ptr [S6place<cu1><bit>], index [$1, $2] activate=true
    $6 : global store [$5 <- $4]
  }
}

after bit_loop_vectorize:

[I 12/22/20 03:19:25.579] [compile_to_offloads.cpp:operator()@18] [assign_c4_0] Bit Loop Vectorized:
kernel {
  $0 : struct for in S2bit_array<ba(cu1x32)> (vectorize 1) (bit_vectorize 32) noneblock_dim=adaptive {
    <i32> $1 = loop $0 index 0
    <i32> $2 = loop $0 index 1
    <*u32> $3 = global ptr [S3place<cu1><bit>], index [$1, $2] activate=true
    <u32> $4 = global load $3
    <*u32> $5 = global ptr [S6place<cu1><bit>], index [$1, $2] activate=true
    $6 : global store [$5 <- $4]
  }
}

@TH3CHARLie
Collaborator Author

TH3CHARLie commented Dec 21, 2020

TODO:

  • fix lower_access to get a byte pointer (bit struct pointer) when GlobalPtrStmt is tagged with the bit_vectorized mark.
  • fix code generation of LoopIndexStmt in bit-vectorized cases (should be covered by for i in x.parent)
  • fix the broken range-for

@TH3CHARLie TH3CHARLie requested review from taichi-gardener and removed request for taichi-gardener December 23, 2020 16:31
@TH3CHARLie
Collaborator Author

TH3CHARLie commented Dec 23, 2020

IR changes after some significant updates:

Avoid listgen on the bit_array node:

before:

$9 = offloaded listgen S1pointer->S2dense
$10 = offloaded  
body {
  $11 = clear_list S3bit_array<ba(cu1x32)>
}
$12 = offloaded listgen S2dense->S3bit_array<ba(cu1x32)>
$13 = offloaded struct_for(S3bit_array<ba(cu1x32)>) grid_dim=0 block_dim=32 bls=none 

after:

$9 = offloaded listgen S1pointer->S2dense
$10 = offloaded struct_for(S3bit_array<ba(cu1x32)>) grid_dim=0 block_dim=32 bls=none 

Fix lower_access (including passing the typecheck pass):

before:

  <*u32> $21 = global ptr [S4place<cu1><bit>], index [$7, $8] activate=false
  <i32> $22 = shuffle $7[0]
  <i32> $23 = shuffle $8[0]
  <*gen> $24 = get root
  <i32> $25 = linearized(ind {}, stride {})
  <*gen> $26 = [S0root][root]::lookup($24, $25) activate = false
  <*gen> $27 = get child [S0root->S1pointer] $26
  <i32> $28 = bit_extract($22) bit_range=[10, 12)
  <i32> $29 = bit_extract($23) bit_range=[10, 12)
  <i32> $30 = linearized(ind {$28, $29}, stride {4, 4})
  <*gen> $31 = [S1pointer][pointer]::lookup($27, $30) activate = false
  <*gen> $32 = get child [S1pointer->S2dense] $31
  <i32> $33 = bit_extract($22) bit_range=[0, 10)
  <i32> $34 = bit_extract($23) bit_range=[5, 10)
  <i32> $35 = linearized(ind {$33, $34}, stride {1024, 32})
  <*gen> $36 = [S2dense][dense]::lookup($32, $35) activate = false
  <*ba(cu1x32)> $37 = get child [S2dense->S3bit_array<ba(cu1x32)>] $36
  <i32> $38 = bit_extract($22) bit_range=[0, 0)
  <i32> $39 = bit_extract($23) bit_range=[0, 5)
  <i32> $40 = linearized(ind {$38, $39}, stride {1, 32})
  <^cu1> $41 = [S3bit_array<ba(cu1x32)>][bit_array]::lookup($37, $40) activate = false
  <^cu1> $42 = get child [S3bit_array<ba(cu1x32)>->S4place<cu1><bit>] $41
  <^cu1> $43 = shuffle $42[0]
  <u32> $44 = global load $43

after:

<*u32> $21 = global ptr [S4place<cu1><bit>], index [$7, $8] activate=false
<i32> $22 = shuffle $7[0]
<i32> $23 = shuffle $8[0]
<*gen> $24 = get root
<i32> $25 = linearized(ind {}, stride {})
<*gen> $26 = [S0root][root]::lookup($24, $25) activate = false
<*gen> $27 = get child [S0root->S1pointer] $26
<i32> $28 = bit_extract($22) bit_range=[10, 12)
<i32> $29 = bit_extract($23) bit_range=[10, 12)
<i32> $30 = linearized(ind {$28, $29}, stride {4, 4})
<*gen> $31 = [S1pointer][pointer]::lookup($27, $30) activate = false
<*gen> $32 = get child [S1pointer->S2dense] $31
<i32> $33 = bit_extract($22) bit_range=[0, 10)
<i32> $34 = bit_extract($23) bit_range=[5, 10)
<i32> $35 = linearized(ind {$33, $34}, stride {1024, 32})
<*gen> $36 = [S2dense][dense]::lookup($32, $35) activate = false
<*u32> $37 = get child [S2dense->S3bit_array<ba(cu1x32)>] $36
<*u32> $38 = shuffle $37[0]
<u32> $39 = global load $38

@TH3CHARLie TH3CHARLie marked this pull request as ready for review December 24, 2020 16:34
@TH3CHARLie
Collaborator Author

Marking this as ready for review, as it passes misc/test_bit_array_vectorized.py.

Some performance numbers: for a 4096 × 4096 2D bit array, the two implementations assign_naive and assign_vectorized yield:

[      %     total   count |      min       avg       max   ] Kernel name
[ 51.26%   0.379 s      1x |  378.887   378.887   378.887 ms] init_c4_0_kernel_0_range_for
[ 42.83%   0.317 s      1x |  316.606   316.606   316.606 ms] assign_naive_c8_0_kernel_1_range_for
[  5.91%   0.044 s      1x |   43.715    43.715    43.715 ms] verify_c10_0_kernel_2_range_for
[      %     total   count |      min       avg       max   ] Kernel name
[ 88.72%   0.383 s      1x |  383.436   383.436   383.436 ms] init_c4_0_kernel_0_range_for
[  9.93%   0.043 s      1x |   42.899    42.899    42.899 ms] verify_c10_0_kernel_6_range_for
[  0.98%   0.004 s      1x |    4.236     4.236     4.236 ms] assign_vectorized_c6_0_kernel_5_struct_for
[  0.34%   0.001 s      1x |    1.464     1.464     1.464 ms] assign_vectorized_c6_0_kernel_4_listgen_S2dense
[  0.03%   0.000 s      1x |    0.144     0.144     0.144 ms] assign_vectorized_c6_0_kernel_2_listgen_S1pointer
[  0.00%   0.000 s      1x |    0.001     0.001     0.001 ms] assign_vectorized_c6_0_kernel_3_serial
[  0.00%   0.000 s      1x |    0.000     0.000     0.000 ms] assign_vectorized_c6_0_kernel_1_serial

There is some minor listgen cost since we are using a sparse data structure, but the speedup in the for-loop is significant.

taichi/codegen/codegen_llvm.cpp
if (is_bit_vectorized && snode->type == SNodeType::bit_array &&
    i == length - 1 && snodes[i - 1]->type == SNodeType::dense) {
  continue;
}
Collaborator Author

Core change: do not generate lookup/get_ch for the bit_array SNode when it is the last SNode and its parent is a dense node.

taichi/transforms/lower_access.cpp
taichi/transforms/offload.cpp
auto ptr_ret_type =
    TypeFactory::get_instance().get_pointer_type(physical_type);
stmt->ret_type = DataType(ptr_ret_type);
return;
}
Collaborator Author

Core change: make sure the tagged statements pass the typecheck pass.

Member

@yuanming-hu yuanming-hu left a comment

Great!! It's nice to see a 100x speed up. Some early comments for now. I'll take a closer look after lunch :-)

Thanks!

misc/test_bit_array_vectorization.py (outdated)
taichi/transforms/compile_to_offloads.cpp (outdated)
taichi/codegen/codegen_llvm.cpp (outdated)
taichi/codegen/codegen_llvm.cpp
misc/test_bit_array_vectorization.py (outdated)
taichi/codegen/codegen_llvm.cpp (outdated)
Member

@yuanming-hu yuanming-hu left a comment

Awesome!! All LGTM now. Feel free to merge after the final comment is addressed. Thanks for the great implementation! Can't wait to see the more powerful version for GoL :-)

Comment on lines 48 to 49
// TODO: Do we need to explicitly make the load stmt's return type the same
// as the physical type? For now, this seems to hold under the demo code.
Member

Good question. Let's assert the bit width of the physical type equals bit_vectorize for safety.

@TH3CHARLie TH3CHARLie merged commit 6cf7300 into taichi-dev:master Dec 25, 2020
@TH3CHARLie
Collaborator Author

Thanks for the review and guidance!
