
[Perf] Very slow CPU memcpy performance on M1. #6771

Open
turbo0628 opened this issue Nov 30, 2022 · 3 comments
Labels: advanced optimization, perf
@turbo0628
Member

While working on some benchmarks of Taichi vs. clang 15, I found a surprising performance gap on Apple M1.

The C baseline function for clang++:

// DTYPE is float for the 1 GB benchmark below.
void serial_memcpy(DTYPE* dst, DTYPE* src, int len) {
    for (int i = 0; i < len; ++i) {
        dst[i] = src[i];
    }
}

Testing against a 1 GB float array, it reached ~50 GB/s on my MacBook. The compilation option is -Og, with no multithreading.
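For reference, the same measurement can be sketched in plain Python with NumPy. This is my illustration of the methodology, not the harness used for the numbers above, and counting both read and write traffic is my assumption:

```python
import time
import numpy as np

# Smaller than the 1 GB array in the issue so the sketch runs quickly;
# the methodology is the same.
N = 1 << 24  # 16M float32 elements = 64 MB
src = np.ones(N, dtype=np.float32)
dst = np.empty_like(src)

t0 = time.perf_counter()
np.copyto(dst, src)  # one memcpy-like pass
t1 = time.perf_counter()

bytes_moved = 2 * src.nbytes  # assumption: count read + write traffic
print(f"{bytes_moved / (t1 - t0) / 1e9:.1f} GB/s")
```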

Taichi counterpart:

@ti.kernel
def ti_memcpy(dst: ti.types.ndarray(), src: ti.types.ndarray()):
    for I in ti.grouped(src):
        dst[I] = src[I]

The performance is 7.6 GB/s single-threaded and 22 GB/s multi-threaded. There may be some low-hanging fruit to improve CPU performance.
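To see how a multi-threaded copy over the same buffer can be structured, here is a minimal Python sketch using a hypothetical `parallel_copy` helper over contiguous chunks. It assumes NumPy releases the GIL during large contiguous copies, so the chunks can overlap; it is an illustration, not Taichi's scheduling policy:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def parallel_copy(dst: np.ndarray, src: np.ndarray, nthreads: int = 4) -> None:
    # Hypothetical helper: split a 1-D copy into contiguous chunks,
    # one per thread. NumPy generally releases the GIL during large
    # contiguous copies, so the chunks can proceed concurrently.
    bounds = np.linspace(0, len(src), nthreads + 1, dtype=np.int64)
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        for lo, hi in zip(bounds[:-1], bounds[1:]):
            pool.submit(np.copyto, dst[lo:hi], src[lo:hi])

src = np.arange(1 << 22, dtype=np.float32)
dst = np.empty_like(src)
parallel_copy(dst, src)
```

Whether this scales depends on how many memory channels a single core can already saturate, which is exactly the question the M1 numbers raise.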

@turbo0628 turbo0628 added advanced optimization The issue or bug is related to advanced optimization perf labels Nov 30, 2022
@turbo0628 turbo0628 self-assigned this Nov 30, 2022
@taichi-gardener taichi-gardener moved this to Untriaged in Taichi Lang Nov 30, 2022
@strongoier strongoier moved this from Untriaged to Todo in Taichi Lang Dec 2, 2022
@bobcao3
Collaborator

bobcao3 commented Dec 2, 2022

My suspicion is that NDArrays have heavy indexing-related overhead which drags down perf.

@turbo0628
Member Author

> My suspicion is that NDArrays have heavy indexing related overhead which drags down perf

The real timing is ~70 ms vs. ~35 ms. Indexing overhead is a problem, but it should not account for a 30 ms difference.
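For scale, the timings convert to bandwidth as follows. The conversion assumes the figures in this thread count both the 1 GB read and the 1 GB write per copy pass; that accounting is my assumption, not stated above:

```python
def bandwidth_gb_s(seconds: float, traffic_gb: float = 2.0) -> float:
    # 1 GB read + 1 GB written per copy pass (my assumption about
    # how the figures in this thread are counted)
    return traffic_gb / seconds

print(f"{bandwidth_gb_s(0.035):.1f} GB/s")  # ~57 GB/s for the ~35 ms baseline
```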

@turbo0628
Member Author

turbo0628 commented Dec 5, 2022

Interesting finding: this code snippet achieves the expected bandwidth (~55 GB/s):

@ti.kernel
def ti_memcpy(dst: ti.types.ndarray(), src: ti.types.ndarray()):
    # Only the outermost loop in a Taichi kernel is parallelized, so the
    # range(1) wrapper makes the inner copy loop run serially.
    for _ in range(1):
        for I in range(N):
            dst[I] = src[I]

Further clang experiments:

  • -O0 matches the single-threaded Taichi parallel-for performance (7-8 GB/s).
  • -O1 or -Og reaches the expected bandwidth (50-60 GB/s).
  • -O0 and -O1 emit identical LLVM IR.

==================================================
Update: the Taichi-emitted LLVM IR is ALREADY vectorized. I suspect our thread pool or scheduling policy is the problem.


; Function Attrs: nofree norecurse nosync nounwind
define internal void @function_body(ptr nocapture readonly %0, ptr nocapture readnone %1, i32 %2) #1 {
allocs:
  %3 = getelementptr inbounds %struct.RuntimeContext, ptr %0, i64 0, i32 1, i64 1
  %4 = load i64, ptr %3, align 8
  %5 = inttoptr i64 %4 to ptr
  %6 = getelementptr inbounds %struct.RuntimeContext, ptr %0, i64 0, i32 1, i64 0
  %7 = load i64, ptr %6, align 8
  %8 = inttoptr i64 %7 to ptr
  %9 = sub i64 %7, %4
  %diff.check = icmp ult i64 %9, 32
  br i1 %diff.check, label %for_loop_body.preheader, label %vector.body.preheader

vector.body.preheader:                            ; preds = %allocs
  %uglygep11 = getelementptr i8, ptr %5, i64 16
  %uglygep15 = getelementptr i8, ptr %8, i64 16
  br label %vector.body

for_loop_body.preheader:                          ; preds = %allocs
  br label %for_loop_body

vector.body:                                      ; preds = %vector.body.preheader, %vector.body
  %lsr.iv19 = phi i64 [ 268435456, %vector.body.preheader ], [ %lsr.iv.next20, %vector.body ]
  %lsr.iv16 = phi ptr [ %uglygep15, %vector.body.preheader ], [ %uglygep17, %vector.body ]
  %lsr.iv12 = phi ptr [ %uglygep11, %vector.body.preheader ], [ %uglygep13, %vector.body ]
  %uglygep14 = getelementptr i8, ptr %lsr.iv12, i64 -16
  %wide.load = load <4 x float>, ptr %uglygep14, align 4
  %wide.load5 = load <4 x float>, ptr %lsr.iv12, align 4
  %uglygep18 = getelementptr i8, ptr %lsr.iv16, i64 -16
  store <4 x float> %wide.load, ptr %uglygep18, align 4
  store <4 x float> %wide.load5, ptr %lsr.iv16, align 4
  %uglygep13 = getelementptr i8, ptr %lsr.iv12, i64 32
  %uglygep17 = getelementptr i8, ptr %lsr.iv16, i64 32
  %lsr.iv.next20 = add nsw i64 %lsr.iv19, -8
  %10 = icmp eq i64 %lsr.iv.next20, 0
  br i1 %10, label %after_for.loopexit7, label %vector.body, !llvm.loop !5

for_loop_body:                                    ; preds = %for_loop_body.preheader, %for_loop_body
  %lsr.iv10 = phi i64 [ 268435456, %for_loop_body.preheader ], [ %lsr.iv.next, %for_loop_body ]
  %lsr.iv8 = phi ptr [ %5, %for_loop_body.preheader ], [ %uglygep9, %for_loop_body ]
  %lsr.iv = phi ptr [ %8, %for_loop_body.preheader ], [ %uglygep, %for_loop_body ]
  %11 = load float, ptr %lsr.iv8, align 4
  store float %11, ptr %lsr.iv, align 4
  %uglygep = getelementptr i8, ptr %lsr.iv, i64 4
  %uglygep9 = getelementptr i8, ptr %lsr.iv8, i64 4
  %lsr.iv.next = add nsw i64 %lsr.iv10, -1
  %exitcond.not = icmp eq i64 %lsr.iv.next, 0
  br i1 %exitcond.not, label %after_for.loopexit, label %for_loop_body, !llvm.loop !7

after_for.loopexit:                               ; preds = %for_loop_body
  br label %after_for

after_for.loopexit7:                              ; preds = %vector.body
  br label %after_for

after_for:                                        ; preds = %after_for.loopexit7, %after_for.loopexit
  ret void
}

Project status: In Progress