
[type] [Opt] Bit_array vectorization (Step 1) #2101

Merged
merged 22 commits into taichi-dev:master
Dec 25, 2020

Conversation


@TH3CHARLie TH3CHARLie commented Dec 17, 2020

@TH3CHARLie TH3CHARLie marked this pull request as draft December 17, 2020 16:32
@TH3CHARLie
Collaborator Author

before:

[I 12/22/20 01:15:00.160] [compile_to_offloads.cpp:operator()@18] [assign_c4_0] Bit Loop Vectorized:
kernel {
  $0 : struct for in S2bit_array<ba(cu1x32)> (vectorize 1) (bit_vectorize 32) noneblock_dim=adaptive {
    <i32> $1 = loop $0 index 0
    <i32> $2 = loop $0 index 1
    <*cu1> $3 = global ptr [S3place<cu1><bit>], index [$1, $2] activate=true
    <u32> $4 = global load $3
    <*cu1> $5 = global ptr [S6place<cu1><bit>], index [$1, $2] activate=true
    $6 : global store [$5 <- $4]
  }
}

after bit_loop_vectorize:

[I 12/22/20 03:19:25.579] [compile_to_offloads.cpp:operator()@18] [assign_c4_0] Bit Loop Vectorized:
kernel {
  $0 : struct for in S2bit_array<ba(cu1x32)> (vectorize 1) (bit_vectorize 32) noneblock_dim=adaptive {
    <i32> $1 = loop $0 index 0
    <i32> $2 = loop $0 index 1
    <*u32> $3 = global ptr [S3place<cu1><bit>], index [$1, $2] activate=true
    <u32> $4 = global load $3
    <*u32> $5 = global ptr [S6place<cu1><bit>], index [$1, $2] activate=true
    $6 : global store [$5 <- $4]
  }
}

@TH3CHARLie
Collaborator Author

TH3CHARLie commented Dec 21, 2020

TODO:

  • fix lower_access to get a byte pointer (bit struct pointer) when GlobalPtrStmt is tagged with the bit_vectorized mark.
  • fix code generation of LoopIndexStmt in bit-vectorized cases (should be covered by for i in x.parent)
  • fix the broken range-for

@TH3CHARLie TH3CHARLie requested review from taichi-gardener and removed request for taichi-gardener December 23, 2020 16:31
@TH3CHARLie
Collaborator Author

TH3CHARLie commented Dec 23, 2020

IR changes after some significant updates:

Avoid listgen on the bit_array node:

before:

$9 = offloaded listgen S1pointer->S2dense
$10 = offloaded  
body {
  $11 = clear_list S3bit_array<ba(cu1x32)>
}
$12 = offloaded listgen S2dense->S3bit_array<ba(cu1x32)>
$13 = offloaded struct_for(S3bit_array<ba(cu1x32)>) grid_dim=0 block_dim=32 bls=none 

after:

$9 = offloaded listgen S1pointer->S2dense
$10 = offloaded struct_for(S3bit_array<ba(cu1x32)>) grid_dim=0 block_dim=32 bls=none 

Fix lower_access (including passing the typecheck pass):

before:

  <*u32> $21 = global ptr [S4place<cu1><bit>], index [$7, $8] activate=false
  <i32> $22 = shuffle $7[0]
  <i32> $23 = shuffle $8[0]
  <*gen> $24 = get root
  <i32> $25 = linearized(ind {}, stride {})
  <*gen> $26 = [S0root][root]::lookup($24, $25) activate = false
  <*gen> $27 = get child [S0root->S1pointer] $26
  <i32> $28 = bit_extract($22) bit_range=[10, 12)
  <i32> $29 = bit_extract($23) bit_range=[10, 12)
  <i32> $30 = linearized(ind {$28, $29}, stride {4, 4})
  <*gen> $31 = [S1pointer][pointer]::lookup($27, $30) activate = false
  <*gen> $32 = get child [S1pointer->S2dense] $31
  <i32> $33 = bit_extract($22) bit_range=[0, 10)
  <i32> $34 = bit_extract($23) bit_range=[5, 10)
  <i32> $35 = linearized(ind {$33, $34}, stride {1024, 32})
  <*gen> $36 = [S2dense][dense]::lookup($32, $35) activate = false
  <*ba(cu1x32)> $37 = get child [S2dense->S3bit_array<ba(cu1x32)>] $36
  <i32> $38 = bit_extract($22) bit_range=[0, 0)
  <i32> $39 = bit_extract($23) bit_range=[0, 5)
  <i32> $40 = linearized(ind {$38, $39}, stride {1, 32})
  <^cu1> $41 = [S3bit_array<ba(cu1x32)>][bit_array]::lookup($37, $40) activate = false
  <^cu1> $42 = get child [S3bit_array<ba(cu1x32)>->S4place<cu1><bit>] $41
  <^cu1> $43 = shuffle $42[0]
  <u32> $44 = global load $43

after:

<*u32> $21 = global ptr [S4place<cu1><bit>], index [$7, $8] activate=false
<i32> $22 = shuffle $7[0]
<i32> $23 = shuffle $8[0]
<*gen> $24 = get root
<i32> $25 = linearized(ind {}, stride {})
<*gen> $26 = [S0root][root]::lookup($24, $25) activate = false
<*gen> $27 = get child [S0root->S1pointer] $26
<i32> $28 = bit_extract($22) bit_range=[10, 12)
<i32> $29 = bit_extract($23) bit_range=[10, 12)
<i32> $30 = linearized(ind {$28, $29}, stride {4, 4})
<*gen> $31 = [S1pointer][pointer]::lookup($27, $30) activate = false
<*gen> $32 = get child [S1pointer->S2dense] $31
<i32> $33 = bit_extract($22) bit_range=[0, 10)
<i32> $34 = bit_extract($23) bit_range=[5, 10)
<i32> $35 = linearized(ind {$33, $34}, stride {1024, 32})
<*gen> $36 = [S2dense][dense]::lookup($32, $35) activate = false
<*u32> $37 = get child [S2dense->S3bit_array<ba(cu1x32)>] $36
<*u32> $38 = shuffle $37[0]
<u32> $39 = global load $38

@TH3CHARLie TH3CHARLie marked this pull request as ready for review December 24, 2020 16:34
@TH3CHARLie
Collaborator Author

Marking this as ready for review, as it passes misc/test_bit_array_vectorized.py.

Some performance numbers: for a 4096 × 4096 2D bit array, the two implementations assign_naive and assign_vectorized yield:

[      %     total   count |      min       avg       max   ] Kernel name
[ 51.26%   0.379 s      1x |  378.887   378.887   378.887 ms] init_c4_0_kernel_0_range_for
[ 42.83%   0.317 s      1x |  316.606   316.606   316.606 ms] assign_naive_c8_0_kernel_1_range_for
[  5.91%   0.044 s      1x |   43.715    43.715    43.715 ms] verify_c10_0_kernel_2_range_for
[      %     total   count |      min       avg       max   ] Kernel name
[ 88.72%   0.383 s      1x |  383.436   383.436   383.436 ms] init_c4_0_kernel_0_range_for
[  9.93%   0.043 s      1x |   42.899    42.899    42.899 ms] verify_c10_0_kernel_6_range_for
[  0.98%   0.004 s      1x |    4.236     4.236     4.236 ms] assign_vectorized_c6_0_kernel_5_struct_for
[  0.34%   0.001 s      1x |    1.464     1.464     1.464 ms] assign_vectorized_c6_0_kernel_4_listgen_S2dense
[  0.03%   0.000 s      1x |    0.144     0.144     0.144 ms] assign_vectorized_c6_0_kernel_2_listgen_S1pointer
[  0.00%   0.000 s      1x |    0.001     0.001     0.001 ms] assign_vectorized_c6_0_kernel_3_serial
[  0.00%   0.000 s      1x |    0.000     0.000     0.000 ms] assign_vectorized_c6_0_kernel_1_serial

There is some minor listgen cost since we are using a sparse data structure, but the speedup in the for-loop is significant.

taichi/codegen/codegen_llvm.cpp
if (is_bit_vectorized && snode->type == SNodeType::bit_array &&
    i == length - 1 && snodes[i - 1]->type == SNodeType::dense) {
  continue;
}
Collaborator Author

Core change: do not generate lookup/get_ch for the bit_array SNode when it is the last SNode and its parent is a dense node.

taichi/transforms/lower_access.cpp
taichi/transforms/offload.cpp
auto ptr_ret_type =
    TypeFactory::get_instance().get_pointer_type(physical_type);
stmt->ret_type = DataType(ptr_ret_type);
return;
}
Collaborator Author

Core change: make sure the tagged statements pass the typecheck pass.

Member

@yuanming-hu yuanming-hu left a comment

Great!! It's nice to see a 100x speed up. Some early comments for now. I'll take a closer look after lunch :-)

Thanks!

misc/test_bit_array_vectorization.py (outdated)
taichi/transforms/compile_to_offloads.cpp (outdated)
taichi/codegen/codegen_llvm.cpp (outdated)
taichi/codegen/codegen_llvm.cpp
misc/test_bit_array_vectorization.py (outdated)
taichi/codegen/codegen_llvm.cpp (outdated)
Member

@yuanming-hu yuanming-hu left a comment

Awesome!! All LGTM now. Feel free to merge after the final comment is addressed. Thanks for the great implementation! Can't wait to see the more powerful version for GoL :-)

Comment on lines 48 to 49
// TODO: Do we need to explicitly make the load stmt's return type the same
// as the physical type? For now, this seems to hold under the demo code.
Member

Good question. Let's assert the bit width of the physical type equals bit_vectorize for safety.

@TH3CHARLie TH3CHARLie merged commit 6cf7300 into taichi-dev:master Dec 25, 2020
@TH3CHARLie
Collaborator Author

Thanks for the review and guidance!
