Why is compile time related to input shape? #7343
Comments
Hi @leosongwei. The problem here is that
Thanks for the reply. Although a NUM * LEN * LEN buffer seems simple, I will quickly run out of GPU RAM if my input length increases. Is there any way to make the intermediate array size linear in the number of threads I have?
Now I have something like this. The result seems to be mysteriously correct, but I'm not sure I'm doing the right thing, because I don't know whether levenshtein_distance is executed in a single thread, and I didn't find a way to ensure that, so I don't know whether global_thread_idx stays consistent throughout the function.

import taichi as ti
import taichi.math as tm
import taichi.types
import numpy as np
import time

ti.init(arch=ti.cpu, offline_cache=False)
ti.set_logging_level(ti.ERROR)

NUM = 4
LEN = 10
print(f"when LEN = {LEN}")

u8string = taichi.types.vector(LEN, ti.u8)
strings = ti.ndarray(dtype=u8string, shape=NUM)
results = ti.ndarray(dtype=ti.u16, shape=(NUM, NUM))

PARALLEL = 128
# One (LEN + 1) x (LEN + 1) DP matrix per thread slot. Note that the u16 values
# written below are implicitly narrowed to u8, which is fine while LEN <= 255.
ldist_arrays = ti.field(dtype=ti.u8, shape=(PARALLEL, LEN + 1, LEN + 1))

def gen_rand_with_np():
    return np.random.randint(0, 4, (NUM, LEN))  # ATCG

@ti.func
def levenshtein_distance(s1: u8string, s2: u8string):
    index = ti.global_thread_idx()
    # Clear this thread's DP matrix.
    for i in range(LEN + 1):
        for j in range(LEN + 1):
            ldist_arrays[index, i, j] = ti.u16(0)
    # Base cases: distance to the empty string.
    for i in range(LEN + 1):
        ldist_arrays[index, i, 0] = ti.cast(i, ti.u16)
    for j in range(LEN + 1):
        ldist_arrays[index, 0, j] = ti.cast(j, ti.u16)
    ti.loop_config(serialize=True)
    for i in range(1, LEN + 1):
        ti.loop_config(serialize=True)
        for j in range(1, LEN + 1):
            cost = ti.u16(0)
            if s1[i - 1] == s2[j - 1]:
                cost = ti.u16(0)
            else:
                cost = ti.u16(1)
            ldist_arrays[index, i, j] = ti.min(
                ldist_arrays[index, i - 1, j] + ti.u16(1),
                ldist_arrays[index, i, j - 1] + ti.u16(1),
                ldist_arrays[index, i - 1, j - 1] + cost,
            )
    return ldist_arrays[index, LEN, LEN]

@ti.kernel
def compare_each(
    strings: taichi.types.ndarray(dtype=u8string, ndim=1),
    results: taichi.types.ndarray(dtype=ti.u16, ndim=2),
):
    N = results.shape[0]
    ti.loop_config(block_dim=PARALLEL)
    for i in range(N):
        for j in range(i + 1, N):
            s1 = strings[i]
            s2 = strings[j]
            result = levenshtein_distance(s1, s2)
            results[i, j] = result

input_array = np.array([
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
    [1, 2, 3, 4, 5, 6, 1, 1, 1, 1],
    [1, 2, 3, 4, 5, 6, 7, 1, 1, 1],
])
strings.from_numpy(input_array)

t0 = time.time()
compare_each(strings, results)
t1 = time.time()
print(strings.to_numpy())
print(results.to_numpy())
print(f"time (warm up jit): {t1-t0}s")
results.fill(0)
😂 Even if I do this, the compile time still seems to increase with the input size LEN. To make it useful, I would want NUM > 1000 and LEN > 3000, which seems very hard.
Thank you for the reply.
Am I able to ensure that the whole levenshtein_distance function runs in the same thread?
Yes. According to https://docs.taichi-lang.org/docs/hello_world#parallel-for-loops, only the for loop at the outermost scope of a kernel is parallelized.
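As a quick illustration (a minimal sketch of my own, not code from this thread): in the kernel below only the loop over i is parallelized; the inner loop is ordinary serial code executed by whichever thread owns that i.

import taichi as ti

ti.init(arch=ti.cpu)

acc = ti.field(ti.i32, shape=16)

@ti.kernel
def outer_only():
    for i in range(16):          # outermost scope: parallelized across threads
        for j in range(100):     # inner scope: runs serially within one thread
            acc[i] += j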
Interesting, but in this case, since I have two nested loops, if only the outermost level is parallelized, the parallelization will be unbalanced (some outer iterations have only 1 inner iteration, and some have up to NUM). So I have more questions:

ti.loop_config(block_dim=1)
for i in range(N):
    ti.loop_config(block_dim=1)
    for j in range(i + 1, N):
You can write
As only the outermost loop is parallelized,
I'm not sure whether you are talking about the case under CPU or GPU. Under CPU
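For reference, a sketch of my own (not part of the original reply): on the CPU backend the thread count can be capped either globally at init time or per loop.

import taichi as ti

# Global cap on the CPU thread pool (4 threads is just an example value).
ti.init(arch=ti.cpu, cpu_max_num_threads=4)

x = ti.field(ti.i32, shape=1024)

@ti.kernel
def fill():
    ti.loop_config(parallelize=4)  # request 4 CPU threads for this loop only
    for i in range(1024):
        x[i] = i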
Thank you so much. Therefore:
Hmm, interesting: after the change it got much faster, but I get a strange CUDA error, while on the CPU backend it always works.
The new code looks like this:

import taichi as ti
import taichi.math as tm
import taichi.types
import numpy as np
import time

ti.init(arch=ti.gpu, offline_cache=False)

NUM = 74
LEN = 300
print(f"when LEN = {LEN}")

char_type = ti.u8
result_type = ti.u16

strings = ti.ndarray(dtype=char_type, shape=(NUM, LEN))
results = ti.ndarray(dtype=result_type, shape=(NUM, NUM))

PARALLEL = 128
# One (LEN + 1) x (LEN + 1) DP matrix per thread slot.
ldist_arrays = ti.field(dtype=result_type, shape=(PARALLEL, LEN + 1, LEN + 1))

@ti.func
def levenshtein_distance(
    strings: taichi.types.ndarray(dtype=char_type, ndim=2),
    s1: ti.i32,
    s2: ti.i32,
):
    thread_id = ti.global_thread_idx()
    # Base cases: distance to the empty string.
    for i in range(LEN + 1):
        ldist_arrays[thread_id, i, 0] = ti.cast(i, result_type)
    for j in range(LEN + 1):
        ldist_arrays[thread_id, 0, j] = ti.cast(j, result_type)
    for i in range(1, LEN + 1):
        for j in range(1, LEN + 1):
            cost = result_type(0)
            if strings[s1, i - 1] == strings[s2, j - 1]:
                cost = result_type(0)
            else:
                cost = result_type(1)
            ldist_arrays[thread_id, i, j] = ti.min(
                ldist_arrays[thread_id, i - 1, j] + result_type(1),
                ldist_arrays[thread_id, i, j - 1] + result_type(1),
                ldist_arrays[thread_id, i - 1, j - 1] + cost,
            )
    return ldist_arrays[thread_id, LEN, LEN]

@ti.kernel
def compare_each(
    strings: taichi.types.ndarray(dtype=char_type, ndim=2),
    results: taichi.types.ndarray(dtype=result_type, ndim=2),
):
    N = results.shape[0]
    ti.loop_config(block_dim=PARALLEL)
    for i, j in ti.ndrange(N, N):
        if j < i + 1:
            continue
        result = levenshtein_distance(strings, i, j)
        results[i, j] = result

def gen_rand_with_np():
    return np.random.randint(0, 4, (NUM, LEN))  # ATCG

def gen_test():
    a = np.zeros((NUM, LEN))
    for i in range(NUM):
        for j in range(i, LEN):
            a[i, j] = j
    return a

input_array = gen_test()
strings.from_numpy(input_array)

t0 = time.time()
compare_each(strings, results)
t1 = time.time()
print(strings.to_numpy())
print(results.to_numpy())
print(f"time (warm up jit): {t1-t0}s")
results.fill(0)

When result_type=ti.u32:

When NUM > 74:
Yes.
Yes.
The previous solution (manually setting the number of threads) only works for CPU. On GPU,
😂 My grief stages [1]:

i32 linear_thread_idx(RuntimeContext *context) {
#if ARCH_cuda || ARCH_amdgpu
  return block_idx() * block_dim() + thread_idx();
#else
  return context->cpu_thread_id;
#endif
}

Is it possible to somehow use the GPU shared memory?

While the BLS in the doc seems to be only a caching mechanism and cannot be used for this purpose, I'm wondering whether this is possible at all. Also wondering what's the status of #2100.
From the algorithm side, as you are only using values from the previous row and the current row, you can keep just two rows instead of the whole (LEN + 1) x (LEN + 1) matrix.
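A minimal sketch of that idea (my own illustration, not code from this thread; compare_each_two_rows, rows, and the slot-per-i layout are my assumptions): each parallel outer iteration i owns one two-row scratch slot, so the intermediate storage grows as NUM * 2 * (LEN + 1) instead of quadratically in LEN per thread.

import taichi as ti
import numpy as np

ti.init(arch=ti.cpu)

NUM, LEN = 4, 10
char_type, result_type = ti.u8, ti.u16
strings = ti.ndarray(dtype=char_type, shape=(NUM, LEN))
results = ti.ndarray(dtype=result_type, shape=(NUM, NUM))
# One two-row scratch slot per outer index i.
rows = ti.field(dtype=result_type, shape=(NUM, 2, LEN + 1))

@ti.func
def levenshtein_two_rows(strings: ti.types.ndarray(dtype=char_type, ndim=2),
                         slot: ti.i32, s1: ti.i32, s2: ti.i32):
    for j in range(LEN + 1):
        rows[slot, 0, j] = ti.cast(j, result_type)
    cur = 0  # row holding distances for the previous prefix of s1
    for i in range(1, LEN + 1):
        nxt = 1 - cur
        rows[slot, nxt, 0] = ti.cast(i, result_type)
        for j in range(1, LEN + 1):
            cost = result_type(1)
            if strings[s1, i - 1] == strings[s2, j - 1]:
                cost = result_type(0)
            rows[slot, nxt, j] = ti.min(
                rows[slot, cur, j] + result_type(1),      # deletion
                rows[slot, nxt, j - 1] + result_type(1),  # insertion
                rows[slot, cur, j - 1] + cost)            # substitution
        cur = nxt
    return rows[slot, cur, LEN]

@ti.kernel
def compare_each_two_rows(strings: ti.types.ndarray(dtype=char_type, ndim=2),
                          results: ti.types.ndarray(dtype=result_type, ndim=2)):
    N = results.shape[0]
    for i in range(N):             # only this outermost loop is parallelized
        for j in range(i + 1, N):  # serial within the thread owning slot i
            results[i, j] = levenshtein_two_rows(strings, i, i, j)

strings.from_numpy(np.random.randint(0, 4, (NUM, LEN)).astype(np.uint8))
compare_each_two_rows(strings, results)
print(results.to_numpy())

The inner j loop is serial within each thread, so the per-i work is unbalanced, as noted earlier in the thread, but the scratch memory stays small and each thread owns its own slot.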
Hi @leosongwei! It seems that you are calculating the edit distance for a group of strings and have run into a really memory-hungry case by assigning each string pair to a single GPU thread. However, this is not a proper way to utilize GPUs: the memory and register resources available to each GPU thread are insufficient to support this at scale. To use GPUs efficiently, you have to think in parallel, even with Taichi. To reduce the memory requirements per thread, 1) you can take @strongoier's advice to calculate with two rows instead of the entire matrix (see the wiki page for details), or 2) use a thread block or even an entire kernel for each edit-distance computation, instead of parallelizing by pairs. Regarding the thread index, we can add threadIdx and blockIdx equivalents in Taichi; they are just not there yet. For the time being, you can get the local thread index by computing the residual against the block dimension, e.g. as sketched below.
It's somewhat inconvenient. I'll add the APIs some days later.
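A minimal sketch of that workaround (my own illustration, not the original example; it assumes the block_dim value passed to ti.loop_config is known in Python scope):

import taichi as ti

ti.init(arch=ti.cuda)

BLOCK_DIM = 128
N = 1024
local_ids = ti.field(ti.i32, shape=N)

@ti.kernel
def record_local_ids():
    ti.loop_config(block_dim=BLOCK_DIM)
    for i in range(N):
        # "Residual against the block dimension": the global thread index modulo
        # the block size plays the role of threadIdx.x on the CUDA backend.
        local_ids[i] = ti.global_thread_idx() % BLOCK_DIM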
Some further questions regarding your problem: do all strings share the same length?
Not necessarily. I just assumed that padding them to the same length would be easier.
Well, this is just a case I met at work; we have since moved on to some different approaches. I'm continuing it to learn Taichi and to build an understanding of what kinds of tasks it is suitable for. Now I feel that edit distance may not be a very good case :-( So instead of telling me the exact approach, more directional advice could be more helpful, as I might not be on the right track at all. Now, my feeling is that:
So I'm very interested in the hardware-specific features. (I guess I simply cannot do CUDA programming without learning CUDA programming directly.)
Yes, Taichi is very different from CPU multithreading, as it cannot override hardware limitations. GPU threads are meant to be lightweight and highly parallel in order to maximize throughput, so it feels tricky to handle problems with strong data dependencies. To put this on the right track, you might have to think about how to properly divide the computation into blocks and assign fine-grained tasks to each GPU thread. This is the key problem that prevents GPUs from replacing CPUs in "real" general-purpose tasks: you have to parallelize the algorithms, but many of them cannot be parallelized naturally. The hardware-specific features are warp-shuffle instructions, which can permute data among GPU threads, and shared memory. Using such features assumes prior knowledge of CUDA GPUs, but the resulting code can be highly efficient.
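For what it's worth, here is a small sketch of explicitly managed shared memory (it relies on the ti.simt.block intrinsics exposed by recent Taichi versions on the CUDA backend; treat the exact API as my assumption rather than something confirmed in this thread):

import taichi as ti

ti.init(arch=ti.cuda)

N, BLOCK = 1024, 128
src = ti.field(ti.f32, shape=N)
dst = ti.field(ti.f32, shape=N)

@ti.kernel
def stage_through_shared_memory():
    ti.loop_config(block_dim=BLOCK)
    for i in range(N):
        tile = ti.simt.block.SharedArray((BLOCK,), ti.f32)  # per-block scratch
        tid = ti.global_thread_idx() % BLOCK
        tile[tid] = src[i]
        ti.simt.block.sync()  # make every thread's write visible within the block
        # Each thread reads its own staged value plus one staged by a neighbour.
        dst[i] = tile[tid] + tile[(tid + 1) % BLOCK]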
I wrote a snippet of code to compare the Levenshtein distance between every two strings in an array:
Here is the time duration as it varies with the input length; it is easy to see that the compile time quickly becomes very high:
Here is my code: