
Why is compile time related to input shape? #7343

Closed

leosongwei opened this issue Feb 12, 2023 · 18 comments
Assignees: strongoier
Labels: question (Question on using Taichi)

Comments

@leosongwei

leosongwei commented Feb 12, 2023

I wrote a snippet of code to compare the Levenshtein distance between every two strings in an array:

  • The number of strings is denoted NUM, and the length of each string is denoted LEN.
  • I ran the compare kernel 3 times to measure the duration.
  • I disabled the offline cache.
  • It doesn't seem to matter whether I use the CPU or the GPU.

Here are the timings as the input length varies; you can easily see that the compile time quickly becomes very high:

[Taichi] version 1.4.1, llvm 15.0.4, commit e67c674e, linux, python 3.10.6
[Taichi] Starting on arch=x64

when LEN = 20
time (warm up jit): 0.9031398296356201s
time: 6.103515625e-05s
time: 6.031990051269531e-05s

when LEN = 30
time (warm up jit): 3.604764223098755s
time: 9.799003601074219e-05s
time: 0.00010275840759277344s

when LEN = 40
time (warm up jit): 10.099121809005737s
time: 8.916854858398438e-05s
time: 7.414817810058594e-05s

when LEN = 50
time (warm up jit): 24.58306908607483s
time: 0.00010180473327636719s
time: 9.274482727050781e-05s

when LEN = 60
time (warm up jit): 45.84676432609558s
time: 0.00012636184692382812s
time: 0.00011277198791503906s

when LEN = 70
time (warm up jit): 85.22769165039062s
time: 0.00015425682067871094s
time: 0.000141143798828125s

Here is my code:

import taichi as ti
import taichi.math as tm
import time
import taichi.types
import numpy as np
import time

ti.init(arch=ti.cpu, offline_cache=False)
ti.set_logging_level(ti.ERROR)

NUM = 4
LEN = 70
print(f"when LEN = {LEN}")

u8string = taichi.types.vector(LEN, ti.u8)

strings = ti.ndarray(dtype=u8string, shape=NUM)
results = ti.ndarray(dtype=ti.u16, shape=(NUM, NUM))

def gen_rand_with_np():
    return np.random.randint(0, 4, (NUM, LEN)) # ATCG

@ti.func
def levenshtein_distance(s1: u8string, s2: u8string):
    d = ti.Matrix.zero(ti.u16, LEN + 1, LEN + 1)
    for i in range(LEN + 1):
        d[i, 0] = ti.cast(i, ti.u16)
    for j in range(LEN + 1):
        d[0, j] = ti.cast(j, ti.u16)

    ti.loop_config(serialize=True)
    for i in range(1, LEN + 1):
        ti.loop_config(serialize=True)
        for j in range(1, LEN + 1):
            cost = ti.u16(0)
            if s1[i - 1] == s2[j - 1]:
                cost = ti.u16(0)
            else:
                cost = ti.u16(1)
            d[i,j] = ti.min(
                d[i-1, j] + ti.u16(1),
                d[i, j-1] + ti.u16(1),
                d[i-1, j-1] + cost
            )
    #print(d)
    return d[LEN, LEN]

@ti.kernel
def compare_each(
    strings: taichi.types.ndarray(dtype=u8string, ndim=1),
    results: taichi.types.ndarray(dtype=ti.u16, ndim=2)
    ):
    N = results.shape[0]
    for i in range(N):
        for j in range(i+1, N):
            s1 = strings[i]
            s2 = strings[j]
            result = levenshtein_distance(s1, s2)
            results[i, j] = result


strings.from_numpy(gen_rand_with_np())
t0 = time.time()
compare_each(strings, results)
t1 = time.time()
# print(strings.to_numpy())
# print(results.to_numpy())
print(f"time (warm up jit): {t1-t0}s")
results.fill(0)

strings.from_numpy(gen_rand_with_np())
t0 = time.time()
compare_each(strings, results)
t1 = time.time()
# print(strings.to_numpy())
# print(results.to_numpy())
print(f"time: {t1-t0}s")
results.fill(0)

strings.from_numpy(gen_rand_with_np())
t0 = time.time()
compare_each(strings, results)
t1 = time.time()
# print(strings.to_numpy())
# print(results.to_numpy())
print(f"time: {t1-t0}s")
results.fill(0)
@strongoier
Contributor

strongoier commented Feb 13, 2023

Hi @leosongwei. The problem here is that d = ti.Matrix.zero(ti.u16, LEN + 1, LEN + 1) creates on the order of LEN * LEN instructions, and compiler performance is very sensitive to that. Considering that re-initializing the matrix to 0 each time is unnecessary in your code, and that allocating a large local array is not recommended in GPU programming, I suggest having a global d = ti.ndarray(dtype=ti.u16, shape=(NUM, LEN + 1, LEN + 1)) so that in for i in range(N) each i can simply use a separate slice of the global array (d[i, ..., ...]).
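
A minimal sketch of this suggestion (not from the original thread): the scratch buffer is passed in as a third ndarray argument and indexed by the outer loop variable i; the name dist_buf and the argument layout are illustrative assumptions.

import taichi as ti
import numpy as np

ti.init(arch=ti.cpu, offline_cache=False)

NUM, LEN = 4, 70
u8string = ti.types.vector(LEN, ti.u8)

strings = ti.ndarray(dtype=u8string, shape=NUM)
results = ti.ndarray(dtype=ti.u16, shape=(NUM, NUM))
# One (LEN + 1) x (LEN + 1) scratch matrix per outer-loop index i; only the
# borders need initialization, the interior is overwritten before it is read.
dist_buf = ti.ndarray(dtype=ti.u16, shape=(NUM, LEN + 1, LEN + 1))

@ti.kernel
def compare_each(strings: ti.types.ndarray(dtype=u8string, ndim=1),
                 results: ti.types.ndarray(dtype=ti.u16, ndim=2),
                 d: ti.types.ndarray(dtype=ti.u16, ndim=3)):
    N = results.shape[0]
    for i in range(N):  # parallelized outermost loop
        for j in range(i + 1, N):
            s1 = strings[i]
            s2 = strings[j]
            for k in range(LEN + 1):
                d[i, k, 0] = ti.cast(k, ti.u16)
                d[i, 0, k] = ti.cast(k, ti.u16)
            for a in range(1, LEN + 1):
                for b in range(1, LEN + 1):
                    cost = ti.u16(1)
                    if s1[a - 1] == s2[b - 1]:
                        cost = ti.u16(0)
                    d[i, a, b] = ti.min(d[i, a - 1, b] + ti.u16(1),
                                        d[i, a, b - 1] + ti.u16(1),
                                        d[i, a - 1, b - 1] + cost)
            results[i, j] = d[i, LEN, LEN]

strings.from_numpy(np.random.randint(0, 4, (NUM, LEN)).astype(np.uint8))
compare_each(strings, results, dist_buf)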

@leosongwei
Author

I suggest having a global d = ti.ndarray(dtype=ti.u16, shape=(NUM, LEN + 1, LEN + 1)) so that in for i in range(N) each i can simply use a separate slice of the global array (d[i, ..., ...]).

Thanks for the reply. Although a NUM * LEN * LEN array seems simple, I will quickly run out of GPU RAM if my input length increases. Is there any way to make the intermediate array size linear in the number of threads I have?

@leosongwei
Author

leosongwei commented Feb 13, 2023

Now I have something like this, and the result seems to be mysteriously correct. But I'm not sure I'm doing the right thing, because I don't know whether levenshtein_distance is executed in a single thread, and I didn't find a way to ensure that, so I don't know whether global_thread_idx will stay consistent throughout the function.

import taichi as ti
import taichi.math as tm
import time
import taichi.types
import numpy as np
import time

ti.init(arch=ti.cpu, offline_cache=False)
ti.set_logging_level(ti.ERROR)

NUM = 4
LEN = 10
print(f"when LEN = {LEN}")

u8string = taichi.types.vector(LEN, ti.u8)

strings = ti.ndarray(dtype=u8string, shape=NUM)
results = ti.ndarray(dtype=ti.u16, shape=(NUM, NUM))

PARALLEL = 128
ldist_arrays = ti.field(dtype=ti.u8, shape=(PARALLEL, LEN+1, LEN+1))

def gen_rand_with_np():
    return np.random.randint(0, 4, (NUM, LEN)) # ATCG

@ti.func
def levenshtein_distance(s1: u8string, s2: u8string):
    index = ti.global_thread_idx()

    for i in range(LEN + 1):
        for j in range(LEN + 1):
            ldist_arrays[index, i, j] = ti.u16(0)
    
    #d = ldist_arrays[index]
    for i in range(LEN + 1):
        ldist_arrays[index, i, 0] = ti.cast(i, ti.u16)
    for j in range(LEN + 1):
        ldist_arrays[index, 0, j] = ti.cast(j, ti.u16)

    ti.loop_config(serialize=True)
    for i in range(1, LEN + 1):
        ti.loop_config(serialize=True)
        for j in range(1, LEN + 1):
            cost = ti.u16(0)
            if s1[i - 1] == s2[j - 1]:
                cost = ti.u16(0)
            else:
                cost = ti.u16(1)
            ldist_arrays[index, i,j] = ti.min(
                ldist_arrays[index, i-1, j] + ti.u16(1),
                ldist_arrays[index, i, j-1] + ti.u16(1),
                ldist_arrays[index, i-1, j-1] + cost
            )
    #print(d)
    return ldist_arrays[index, LEN, LEN]

@ti.kernel
def compare_each(
    strings: taichi.types.ndarray(dtype=u8string, ndim=1),
    results: taichi.types.ndarray(dtype=ti.u16, ndim=2),
    ):
    N = results.shape[0]

    ti.loop_config(block_dim=PARALLEL)
    for i in range(N):
        for j in range(i+1, N):
            s1 = strings[i]
            s2 = strings[j]
            result = levenshtein_distance(s1, s2)
            results[i, j] = result


input_array = np.array([
    [1,2,3,4,5,6,7,8,9,0],
    [1,2,3,4,5,6,7,8,9,0],
    [1,2,3,4,5,6,1,1,1,1],
    [1,2,3,4,5,6,7,1,1,1],
])

strings.from_numpy(input_array)
t0 = time.time()
compare_each(strings, results)
t1 = time.time()
print(strings.to_numpy())
print(results.to_numpy())
print(f"time (warm up jit): {t1-t0}s")
results.fill(0)

@leosongwei
Author

😂 Even if I do this, the compile time still seems to increase with the input size LEN. To make it useful, I would want NUM > 1000 and LEN > 3000, which seems very hard.

@strongoier
Contributor

  1. You should use ti.loop_config(parallelize=PARALLEL) to set the number of threads. Then ti.global_thread_idx() will return different values on different threads.
  2. You can further remove the use of vectors to make the compilation time independent of LEN. (Both points are sketched below.)
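
A rough sketch combining both points, assuming the CPU backend; the DP body is elided because it is the same as in the snippets above, just indexed by the thread id t:

import taichi as ti

ti.init(arch=ti.cpu, offline_cache=False)

NUM, LEN, PARALLEL = 4, 70, 16
# Scratch sized by the number of CPU threads rather than by NUM.
ldist = ti.field(dtype=ti.u16, shape=(PARALLEL, LEN + 1, LEN + 1))

@ti.kernel
def compare_each(strings: ti.types.ndarray(dtype=ti.u8, ndim=2),   # plain 2-D u8 array, no vector type
                 results: ti.types.ndarray(dtype=ti.u16, ndim=2)):
    N = results.shape[0]
    ti.loop_config(parallelize=PARALLEL)  # limit the CPU launch to PARALLEL threads
    for i in range(N):
        t = ti.global_thread_idx()        # in [0, PARALLEL) on the CPU backend
        for j in range(i + 1, N):
            # ... run the Levenshtein DP here, reading strings[i, a - 1] and
            # strings[j, b - 1] and using ldist[t, a, b] as this thread's scratch ...
            results[i, j] = ldist[t, LEN, LEN]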

@leosongwei
Author

Thank you for the reply.

  1. You should use ti.loop_config(parallelize=PARALLEL) to set the number of threads. Then ti.global_thread_idx() will return different values on different threads.

Am I able to ensure that the whole levenshtein_distance function runs in the same thread?

@strongoier
Contributor

Am I able to ensure that the whole levenshtein_distance function runs in the same thread?

Yes. According to https://docs.taichi-lang.org/docs/hello_world#parallel-for-loops, only the for loop at the outermost scope in a kernel (for i in range(N): in your code snippet) is parallelized.
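
A tiny illustration of this rule, with made-up names (foo, acc):

import taichi as ti

ti.init(arch=ti.cpu)

acc = ti.field(ti.i32, shape=16)

@ti.kernel
def foo():
    for i in range(16):      # outermost scope: parallelized, one thread per i
        for j in range(16):  # nested loop: runs serially within iteration i
            acc[i] += 1

foo()
assert int(acc.to_numpy().sum()) == 16 * 16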

@leosongwei
Author

leosongwei commented Feb 20, 2023

only the for loop at the outermost scope in a kernel (for i in range(N): in your code snippet) is parallelized.

Interesting. But in this case, since I have two nested loops, if only the outermost level is parallelized, the parallelization will be unbalanced (some threads will run only 1 inner iteration, while others will run about NUM).

So I have more questions:

  1. Is there any means to address the unbalanced parallelization? (Or is it even unbalanced?)
  2. Why does adjusting block_dim on the inner loop seem to have no effect at all?
    ti.loop_config(block_dim=1)
    for i in range(N):
        ti.loop_config(block_dim=1)
        for j in range(i + 1, N):
  3. Why is block_dim=1 the fastest config? Could it be related to the unbalanced parallelization?

@strongoier
Contributor

Is there any means to address the unbalanced parallelization? (Or is it even unbalanced?)

You can write for i, j in ti.ndrange(N, N): to loop over all (i, j) pairs at the outermost level.

Why does adjusting block_dim on the inner loop seem to have no effect at all?

As only the outermost loop is parallelized, ti.loop_config() only takes effect on the outermost loop.

Why is block_dim=1 the fastest config? Could it be related to the unbalanced parallelization?

I'm not sure whether you are talking about the CPU or the GPU case. On the CPU, block_dim should be irrelevant.

@leosongwei
Author

Thank you so much. Therefore:

  1. ONLY the outermost loop can be parallelized? And I have NO way to configure whether an inner loop should be parallelized or not?
  2. So I should write something like for i, j in ti.ndrange(N, N): to flatten the loops into the outermost one, so things can be parallelized properly without imbalance?
  3. I'm on the GPU.

@leosongwei
Author

leosongwei commented Feb 20, 2023

Hmmm, interesting: after the change it got much faster, but now I get a strange CUDA error, while on the CPU backend it always works.

  1. If I set result_type = ti.u32 it crashes. If I set result_type = ti.u16, it somehow works.
  2. Even with result_type = ti.u16, it crashes when NUM >= 75.

The new code looks like this:

import taichi as ti
import taichi.math as tm
import time
import taichi.types
import numpy as np
import time

ti.init(arch=ti.gpu, offline_cache=False)

NUM = 74
LEN = 300
print(f"when LEN = {LEN}")

char_type = ti.u8
result_type = ti.u16

strings = ti.ndarray(dtype=char_type, shape=(NUM, LEN))
results = ti.ndarray(dtype=result_type, shape=(NUM, NUM))

PARALLEL = 128
ldist_arrays = ti.field(dtype=result_type, shape=(PARALLEL, LEN + 1, LEN + 1))


@ti.func
def levenshtein_distance(
    strings: taichi.types.ndarray(dtype=char_type, ndim=2),
    s1: ti.i32,
    s2: ti.i32,
):
    thread_id = ti.global_thread_idx()

    for i in range(LEN + 1):
        ldist_arrays[thread_id, i, 0] = ti.cast(i, result_type)
    for j in range(LEN + 1):
        ldist_arrays[thread_id, 0, j] = ti.cast(j, result_type)

    for i in range(1, LEN + 1):
        for j in range(1, LEN + 1):
            cost = result_type(0)
            if strings[s1, i - 1] == strings[s2, j - 1]:
                cost = result_type(0)
            else:
                cost = result_type(1)
            ldist_arrays[thread_id, i, j] = ti.min(
                ldist_arrays[thread_id, i - 1, j] + result_type(1),
                ldist_arrays[thread_id, i, j - 1] + result_type(1),
                ldist_arrays[thread_id, i - 1, j - 1] + cost,
            )
    return ldist_arrays[thread_id, LEN, LEN]


@ti.kernel
def compare_each(
    strings: taichi.types.ndarray(dtype=char_type, ndim=2),
    results: taichi.types.ndarray(dtype=result_type, ndim=2),
):
    N = results.shape[0]

    ti.loop_config(block_dim=PARALLEL)
    for i, j in ti.ndrange(N, N):
        if j < i + 1:
            continue
        result = levenshtein_distance(strings, i, j)
        results[i, j] = result


def gen_rand_with_np():
    return np.random.randint(0, 4, (NUM, LEN))  # ATCG


def gen_test():
    a = np.zeros((NUM, LEN))
    for i in range(NUM):
        for j in range(i, LEN):
            a[i, j] = j
    return a


input_array = gen_test()

strings.from_numpy(input_array)
t0 = time.time()
compare_each(strings, results)
t1 = time.time()
print(strings.to_numpy())
print(results.to_numpy())
print(f"time (warm up jit): {t1-t0}s")
results.fill(0)

When result_type=ti.u32:

[Taichi] version 1.4.1, llvm 15.0.4, commit e67c674e, linux, python 3.10.6
[Taichi] Starting on arch=cuda
when LEN = 300
[W 02/20/23 23:30:20.748 127493] [type_check.cpp:type_check_store@36] [$48] Global store may lose precision: u8 <- f64
File "/home/leo/learn/taichi/.venv/lib/python3.10/site-packages/taichi/_kernels.py", line 140, in ext_arr_to_ndarray:
        ndarray[I] = arr[I]
        ^^^^^^^^^^^^^^^^^^^

[E 02/20/23 23:30:20.994 127493] [cuda_driver.h:operator()@88] CUDA Error CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered while calling module_load_data_ex (cuModuleLoadDataEx)


Traceback (most recent call last):
  File "/home/leo/learn/taichi/edit_distance_taichi.py", line 85, in <module>
    print(strings.to_numpy())
  File "/home/leo/learn/taichi/.venv/lib/python3.10/site-packages/taichi/lang/util.py", line 310, in wrapped
    return func(*args, **kwargs)
  File "/home/leo/learn/taichi/.venv/lib/python3.10/site-packages/taichi/lang/_ndarray.py", line 255, in to_numpy
    return self._ndarray_to_numpy()
  File "/home/leo/learn/taichi/.venv/lib/python3.10/site-packages/taichi/lang/util.py", line 310, in wrapped
    return func(*args, **kwargs)
  File "/home/leo/learn/taichi/.venv/lib/python3.10/site-packages/taichi/lang/_ndarray.py", line 89, in _ndarray_to_numpy
    ndarray_to_ext_arr(self, arr)
  File "/home/leo/learn/taichi/.venv/lib/python3.10/site-packages/taichi/lang/kernel_impl.py", line 974, in wrapped
    return primal(*args, **kwargs)
  File "/home/leo/learn/taichi/.venv/lib/python3.10/site-packages/taichi/lang/kernel_impl.py", line 901, in __call__
    return self.runtime.compiled_functions[key](*args)
  File "/home/leo/learn/taichi/.venv/lib/python3.10/site-packages/taichi/lang/kernel_impl.py", line 826, in func__
    raise e from None
  File "/home/leo/learn/taichi/.venv/lib/python3.10/site-packages/taichi/lang/kernel_impl.py", line 823, in func__
    t_kernel(launch_ctx)
RuntimeError: [cuda_driver.h:operator()@88] CUDA Error CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered while calling module_load_data_ex (cuModuleLoadDataEx)
[E 02/20/23 23:30:21.013 127493] [cuda_driver.h:operator()@88] CUDA Error CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered while calling stream_synchronize (cuStreamSynchronize)


terminate called after throwing an instance of 'std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >'
Aborted (core dumped)

When NUM > 74:

[Taichi] version 1.4.1, llvm 15.0.4, commit e67c674e, linux, python 3.10.6
[Taichi] Starting on arch=cuda
when LEN = 300
[W 02/20/23 23:31:03.907 127725] [type_check.cpp:type_check_store@36] [$48] Global store may lose precision: u8 <- f64
File "/home/leo/learn/taichi/.venv/lib/python3.10/site-packages/taichi/_kernels.py", line 140, in ext_arr_to_ndarray:
        ndarray[I] = arr[I]
        ^^^^^^^^^^^^^^^^^^^

[E 02/20/23 23:31:04.163 127725] [cuda_driver.h:operator()@88] CUDA Error CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered while calling module_load_data_ex (cuModuleLoadDataEx)


Traceback (most recent call last):
  File "/home/leo/learn/taichi/edit_distance_taichi.py", line 85, in <module>
    print(strings.to_numpy())
  File "/home/leo/learn/taichi/.venv/lib/python3.10/site-packages/taichi/lang/util.py", line 310, in wrapped
    return func(*args, **kwargs)
  File "/home/leo/learn/taichi/.venv/lib/python3.10/site-packages/taichi/lang/_ndarray.py", line 255, in to_numpy
    return self._ndarray_to_numpy()
  File "/home/leo/learn/taichi/.venv/lib/python3.10/site-packages/taichi/lang/util.py", line 310, in wrapped
    return func(*args, **kwargs)
  File "/home/leo/learn/taichi/.venv/lib/python3.10/site-packages/taichi/lang/_ndarray.py", line 89, in _ndarray_to_numpy
    ndarray_to_ext_arr(self, arr)
  File "/home/leo/learn/taichi/.venv/lib/python3.10/site-packages/taichi/lang/kernel_impl.py", line 974, in wrapped
    return primal(*args, **kwargs)
  File "/home/leo/learn/taichi/.venv/lib/python3.10/site-packages/taichi/lang/kernel_impl.py", line 901, in __call__
    return self.runtime.compiled_functions[key](*args)
  File "/home/leo/learn/taichi/.venv/lib/python3.10/site-packages/taichi/lang/kernel_impl.py", line 826, in func__
    raise e from None
  File "/home/leo/learn/taichi/.venv/lib/python3.10/site-packages/taichi/lang/kernel_impl.py", line 823, in func__
    t_kernel(launch_ctx)
RuntimeError: [cuda_driver.h:operator()@88] CUDA Error CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered while calling module_load_data_ex (cuModuleLoadDataEx)
[E 02/20/23 23:31:04.185 127725] [cuda_driver.h:operator()@88] CUDA Error CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered while calling stream_synchronize (cuStreamSynchronize)


terminate called after throwing an instance of 'std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >'
Aborted (core dumped)

@strongoier
Contributor

  1. ONLY the outermost loop can be parallelized? And I have NO way to configure whether an inner loop should be parallelized or not?

Yes.

  2. So I should write something like for i, j in ti.ndrange(N, N): to flatten the loops into the outermost one, so things can be parallelized properly without imbalance?

Yes.

  3. I'm on the GPU.

The previous solution (manually setting the number of threads) only works on the CPU. On the GPU, for i, j in ti.ndrange(N, N): always corresponds to N * N threads, and your ldist_arrays is not large enough.
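
This is not from the thread, but one possible workaround consistent with the constraint above is to launch the kernel over fixed-size chunks of (i, j) pairs from the host, so the scratch field never needs more than PARALLEL rows. The names (compare_chunk, scratch, all_pairs) and the chunking scheme are my own assumptions:

import taichi as ti
import numpy as np

ti.init(arch=ti.gpu, offline_cache=False)

NUM, LEN, PARALLEL = 200, 300, 128
strings = ti.ndarray(dtype=ti.u8, shape=(NUM, LEN))
results = ti.ndarray(dtype=ti.u16, shape=(NUM, NUM))
scratch = ti.field(dtype=ti.u16, shape=(PARALLEL, LEN + 1, LEN + 1))

@ti.kernel
def compare_chunk(strings: ti.types.ndarray(dtype=ti.u8, ndim=2),
                  results: ti.types.ndarray(dtype=ti.u16, ndim=2),
                  pairs: ti.types.ndarray(dtype=ti.i32, ndim=2)):
    # One thread per pair in this chunk; thread t owns scratch[t, :, :].
    for t in range(pairs.shape[0]):
        i = pairs[t, 0]
        j = pairs[t, 1]
        for k in range(LEN + 1):
            scratch[t, k, 0] = ti.cast(k, ti.u16)
            scratch[t, 0, k] = ti.cast(k, ti.u16)
        for a in range(1, LEN + 1):
            for b in range(1, LEN + 1):
                cost = ti.u16(1)
                if strings[i, a - 1] == strings[j, b - 1]:
                    cost = ti.u16(0)
                scratch[t, a, b] = ti.min(scratch[t, a - 1, b] + ti.u16(1),
                                          scratch[t, a, b - 1] + ti.u16(1),
                                          scratch[t, a - 1, b - 1] + cost)
        results[i, j] = scratch[t, LEN, LEN]

strings.from_numpy(np.random.randint(0, 4, (NUM, LEN)).astype(np.uint8))
# Host side: enumerate all (i, j) pairs with j > i and feed them in chunks of PARALLEL.
all_pairs = np.array([(i, j) for i in range(NUM) for j in range(i + 1, NUM)], dtype=np.int32)
for start in range(0, len(all_pairs), PARALLEL):
    chunk = np.ascontiguousarray(all_pairs[start:start + PARALLEL])
    pairs_nd = ti.ndarray(dtype=ti.i32, shape=chunk.shape)
    pairs_nd.from_numpy(chunk)
    compare_chunk(strings, results, pairs_nd)

The extra kernel launches add some overhead, but the scratch memory stays at PARALLEL * (LEN + 1) * (LEN + 1) regardless of NUM.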

@leosongwei
Author

leosongwei commented Feb 21, 2023

your ldist_arrays is not large enough.

😂 my grief stages [1]:

  1. Denial: Since N may be very large, the GPU memory consumption of ldist_arrays would be N * N * LEN * LEN, which is unacceptably large and unnecessary.

  2. Anger: It looks hopeless to allocate an ldist_arrays whose space consumption matches the actual parallelism I have. Say I have N = 10000 docs; I don't believe my GPU can run 10000 * 10000 threads at the same time.

  3. Bargaining: It seems that for now I only have ti.global_thread_idx(), which is block_id * block_dim + thread_id. But in the runtime ./taichi/runtime/llvm/runtime_module/runtime.cpp, it looks like this:

i32 linear_thread_idx(RuntimeContext *context) {
#if ARCH_cuda || ARCH_amdgpu
  return block_idx() * block_dim() + thread_idx();
#else
  return context->cpu_thread_id;
#endif
}

Is it possible to somehow use the GPU thread_idx() plus some block-private memory allocation to do that? For example, have an ldist_arrays sized block_dim x (LEN + 1) x (LEN + 1) per block, so that each thread within a block can use thread_idx() to access its own intermediate array.

Meanwhile, the BLS described in the docs seems to be only a caching mechanism that cannot be used for this purpose, so I'm wondering whether this is possible at all.

I'm also wondering what the status of #2100 is.

[1] The Stages of Grief: Accepting the Unacceptable

@strongoier
Contributor

From the algorithm side, as you only use ldist_arrays[index, i, ...] and ldist_arrays[index, i - 1, ...] to calculate ldist_arrays[index, i, ...], you only need an array of 2 * LEN entries per thread instead of LEN * LEN. Regarding GPU programming advice, cc @turbo0628.
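
A minimal sketch of this two-row idea (the rolling-row scheme and names like rows are illustrative, not from the thread; only the outermost pair loop is parallelized, and per-thread scratch drops from (LEN + 1)^2 to 2 * (LEN + 1) entries):

import taichi as ti
import numpy as np

ti.init(arch=ti.gpu, offline_cache=False)

NUM, LEN = 74, 300
strings = ti.ndarray(dtype=ti.u8, shape=(NUM, LEN))
results = ti.ndarray(dtype=ti.u16, shape=(NUM, NUM))
# Two DP rows per (i, j) pair: 2 * (LEN + 1) entries instead of (LEN + 1)^2.
rows = ti.field(dtype=ti.u16, shape=(NUM, NUM, 2, LEN + 1))

@ti.kernel
def compare_each(strings: ti.types.ndarray(dtype=ti.u8, ndim=2),
                 results: ti.types.ndarray(dtype=ti.u16, ndim=2)):
    N = results.shape[0]
    for i, j in ti.ndrange(N, N):  # parallelized outermost loop, one thread per pair
        if j < i + 1:
            continue
        for b in range(LEN + 1):   # DP row 0
            rows[i, j, 0, b] = ti.cast(b, ti.u16)
        for a in range(1, LEN + 1):
            cur = a % 2            # the two rows alternate between slots 0 and 1
            prv = (a - 1) % 2
            rows[i, j, cur, 0] = ti.cast(a, ti.u16)
            for b in range(1, LEN + 1):
                cost = ti.u16(1)
                if strings[i, a - 1] == strings[j, b - 1]:
                    cost = ti.u16(0)
                rows[i, j, cur, b] = ti.min(rows[i, j, prv, b] + ti.u16(1),
                                            rows[i, j, cur, b - 1] + ti.u16(1),
                                            rows[i, j, prv, b - 1] + cost)
        results[i, j] = rows[i, j, LEN % 2, LEN]

strings.from_numpy(np.random.randint(0, 4, (NUM, LEN)).astype(np.uint8))
compare_each(strings, results)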

@turbo0628
Member

turbo0628 commented Feb 22, 2023

Hi @leosongwei! It seems that you are calculating the edit distance for a group of strings and have run into a really memory-hungry case by assigning each string pair to a single GPU thread. However, this is not a proper way to utilize GPUs: the memory and register resources of each GPU thread are insufficient for this scale. To use GPUs efficiently, you have to think in parallel, even with Taichi.

To reduce the memory requirements for each thread, you can 1) take @strongoier's advice and compute with two rows instead of the entire matrix (see the wiki page for details), or 2) use a thread block, or even an entire kernel, for each edit-distance computation instead of parallelizing by pairs.
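
For the second suggestion, one way to devote an entire kernel launch to a single edit distance (my own sketch, not from the thread) is a wavefront over anti-diagonals: all cells with a + b == s depend only on earlier anti-diagonals, so each diagonal can be computed by one parallel launch while the host loops over s. Names such as sweep_diagonal are illustrative.

import taichi as ti
import numpy as np

ti.init(arch=ti.gpu, offline_cache=False)

LEN = 3000
d = ti.field(dtype=ti.i32, shape=(LEN + 1, LEN + 1))

@ti.kernel
def init_borders():
    for k in range(LEN + 1):
        d[k, 0] = k
        d[0, k] = k

@ti.kernel
def sweep_diagonal(s1: ti.types.ndarray(dtype=ti.u8, ndim=1),
                   s2: ti.types.ndarray(dtype=ti.u8, ndim=1),
                   s: ti.i32):
    for a in range(1, LEN + 1):    # parallel over the cells of one anti-diagonal
        b = s - a
        if b >= 1 and b <= LEN:
            cost = 0
            if s1[a - 1] != s2[b - 1]:
                cost = 1
            d[a, b] = ti.min(d[a - 1, b] + 1, d[a, b - 1] + 1, d[a - 1, b - 1] + cost)

s1 = ti.ndarray(dtype=ti.u8, shape=(LEN,))
s2 = ti.ndarray(dtype=ti.u8, shape=(LEN,))
s1.from_numpy(np.random.randint(0, 4, LEN).astype(np.uint8))
s2.from_numpy(np.random.randint(0, 4, LEN).astype(np.uint8))

init_borders()
for s in range(2, 2 * LEN + 1):    # host loop over anti-diagonals
    sweep_diagonal(s1, s2, s)
print(d[LEN, LEN])

The 2 * LEN - 1 kernel launches add overhead, so this only pays off for long strings or when batched with other work.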

Regarding the thread index, we can add threadIdx and blockIdx equivalents in Taichi; they are just not there yet. For the time being, you can get the local thread index by taking the global thread index modulo the block dimension.

Example:

import taichi as ti

ti.init(arch=ti.gpu)

N = 1024
block_dim = 128

@ti.kernel
def foo():
    ti.loop_config(block_dim=block_dim)
    for i in range(N):
        g_tid = ti.global_thread_idx()  # block_idx * block_dim + thread_idx on CUDA
        tid = g_tid % block_dim         # the local thread index within the block

It's somewhat inconvenient for now; I'll add the APIs later.

@turbo0628
Member

turbo0628 commented Feb 22, 2023

Some further questions regarding your problem:

Do all strings share the same length?
Do you need a small Taichi example to properly calculate edit distance? Even with some hardware-specific features that are not so easy to understand?

@leosongwei
Author

Do all strings share the same length?

Not necessarily. I just assume that padding them to the same length would be easier.

Do you need a small Taichi example to properly calculate edit distance?

Well, this is just a case I met at work, and we have since moved on to some different approaches. I'm continuing it to learn Taichi and to build an understanding of what kinds of tasks are a good fit for it. Now I feel that edit distance may not be a very good case :-(

So instead of telling me the exact approach, directional advice would be more helpful, as I might not be on the right track at all.

Now, my feeling is that:

  1. Taichi is very different from ordinary CPU multi-threaded programming, so I should adapt to the new pattern.
  2. The compiler also seems not that strong, so I should avoid abstraction-heavy programming patterns (like starting out by defining a u8string vector type).
  3. Probably I should turn to a use case of spatially sparse parallel computing, which Taichi is designed for.

Even with some hardware-specific features that are not so easy to understand?

So I'm very interested in the hardware-specific features. (I guess I simply cannot do this kind of GPU programming without learning CUDA directly.)

@turbo0628
Member

Yes, Taichi is very different from CPU multithreading, as it cannot override hardware limitations. GPU threads are meant to be lightweight and highly parallel in order to maximize throughput, so things get tricky when processing problems with strong data dependencies. To put this on the right track, you have to think about how to properly divide the computation into blocks and assign fine-grained tasks to each GPU thread. This is the key problem that prevents GPUs from replacing CPUs in "real" general-purpose tasks: you have to parallelize the algorithms, but many of them do not parallelize naturally.

The hardware-specific features are warp-shuffle instructions, which can permute data among GPU threads, and shared memory. Utilizing such features assumes prior knowledge of CUDA GPUs, but the resulting code can be highly efficient.
