
[async] Support constant folding in async mode #1778

Merged · 4 commits · Aug 26, 2020

Conversation

@k-ye (Member) commented on Aug 25, 2020

Constant folding (CF) was previously disabled in async mode because it caused deadlocks. The problem: when a user kernel gets compiled, the CF pass adds an evaluator kernel to the execution queue, but that evaluator never gets scheduled, because the queue is blocked waiting for the user kernel to finish...

We can fix this by making the CF kernels always run synchronously (see Kernel::operator()). However, we still need two more changes:

  1. I added Program::device_synchronize(), which is more primitive than synchronize() and always calls directly into the targeted GPU backend's device/stream sync method.
  2. We have to remove synchronize() from Program::fetch_result. The implication is that whenever we fetch a result, we must make sure a proper sync has already been done. The only exception I could find is in Program::initialize_runtime_system:

     ```cpp
     llvm_runtime = fetch_result<void *>(taichi_result_buffer_ret_value_id);
     TI_TRACE("LLVMRuntime pointer fetched");
     if (arch_use_host_memory(config.arch) || config.use_unified_memory) {
       runtime->call<void *>("runtime_get_mem_req_queue", llvm_runtime);
       auto mem_req_queue =
           fetch_result<void *>(taichi_result_buffer_ret_value_id);
     ```

     But I don't think we need a sync here anyway, @yuanming-hu?

FYI, here are the fetch_result calls I could find:

  • Kernel::get_ret_int, Kernel::get_ret_float.
    • If this is called from within an SNode reader kernel, then it's already synced in taichi/taichi/ir/snode.cpp, lines 159 to 160 at 5c39e84:

      ```cpp
      get_current_program().synchronize();
      auto ret = reader_kernel->get_ret_float(0);
      ```
    • Otherwise, this is called from a kernel that returns a value. I've reordered kernel.py so that a sync is done before we fetch the result.
  • Program::check_runtime_error: a sync is done at the beginning:

    ```cpp
    void Program::check_runtime_error() {
      synchronize();
    ```
  • Program::initialize_runtime_system: explained above.
  • The constant folding pass: fixed in this PR.

Related issue = #N/A



Update 08/26/2020:

@yuanming-hu discovered that I missed the fetch_result call in Program::runtime_query, which caused failures during GC on CUDA:

```cpp
T runtime_query(const std::string &key, Args... args) {
  TI_ASSERT(arch_uses_llvm(config.arch));
  TaichiLLVMContext *tlctx = nullptr;
  if (llvm_context_device) {
    tlctx = llvm_context_device.get();
  } else {
    tlctx = llvm_context_host.get();
  }
  auto runtime = tlctx->runtime_jit_module;
  runtime->call<void *, Args...>("runtime_" + key, llvm_runtime,
                                 std::forward<Args>(args)...);
  return fetch_result<T>(taichi_result_buffer_runtime_query_id);
}
```

As a result, we added device_synchronize() back into fetch_result.

codecov bot commented on Aug 25, 2020

Codecov Report

Merging #1778 into master will increase coverage by 0.02%.
The diff coverage is 100.00%.


```
@@            Coverage Diff             @@
##           master    #1778      +/-   ##
==========================================
+ Coverage   42.58%   42.61%   +0.02%
==========================================
  Files          44       44
  Lines        6185     6183       -2
  Branches     1072     1071       -1
==========================================
+ Hits         2634     2635       +1
+ Misses       3397     3394       -3
  Partials      154      154
```
Impacted Files                 | Coverage Δ
python/taichi/lang/kernel.py   | 71.03% <100.00%> (+0.07%) ⬆️
python/taichi/code_format.py   | 12.37% <0.00%> (+0.37%) ⬆️

Last update 6937b1e...d17391f.

@yuanming-hu (Member) left a comment:

Just a random comment. I'll push my local commit later.

```diff
@@ -175,7 +175,7 @@ void ExecutionQueue::enqueue(KernelLaunchRecord &&ker) {
   auto config = kernel->program.config;
   auto ir = stmt;
   offload_to_executable(
-      ir, config, /*verbose=*/false,
+      ir, config, /*verbose=*/config.print_ir,
```
A Member commented on this diff line:

Unfortunately setting this to true makes parallel IR printing crash :-( #1750

Let's stick to the original version for now. I'll revert this in my commit in a minute.

@yuanming-hu (Member) left a comment:

Sorry about the force push... I fixed a small issue on CUDA. Please feel free to merge if my changes are reasonable. Thanks!

@k-ye k-ye merged commit d38e83c into taichi-dev:master Aug 26, 2020
@k-ye k-ye deleted the constant-fold branch August 26, 2020 10:31
@yuanming-hu yuanming-hu mentioned this pull request Sep 1, 2020