-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[TIR] Upstream Sync2 #40
Comments
During rebase, I find a very strange problem. This problem happens when we construct a new tir object(e.g. Block, BlockRealize) in python. In C++, the constructor will create a new object and everything is OK. In python, however, once finish the Another weird thing is that the problem only happens in Ubuntu 16.04. On macOS and Ubuntu 18.04, everything is OK. No matter the GCC and Python version, that issue happens every time on 16.04. And the bug has existed for a long time (from the beginning of tir). It did not be discovered only because we always use macOS or Ubuntu 18.04 to code. Do you have any idea about this? @tqchen |
I've done several experiments on the wild pointer issue. I use the same GCC and Python version on 16.04 as what I use on 18.04. The problem still occurs. Specifically, see https://github.com/merrymercy/tvm-tensorir/blob/dev/src/tir/api.cc TVM_REGISTER_GLOBAL("make.BlockRealize")
.set_body_typed<BlockRealize(Array<PrimExpr>, PrimExpr, Block)>(
[](Array<PrimExpr> values, PrimExpr predicate, Block block) {
if (!predicate.dtype().is_bool()) {
// To support python ir_builder
CHECK(is_one(predicate));
predicate = IntImm(DataType::Bool(), 1);
}
return BlockRealize(std::move(values), std::move(predicate), std::move(block));
}); After return, the newly created BlockRealize object will be deleted. I found a way to fix this, but I don't know why it works. I change make.BlockReailze as below Stmt BlockRealizeNode::make(Array<PrimExpr> values, PrimExpr predicate, Block block) {
CHECK_EQ(block->iter_vars.size(), values.size());
ObjectPtr<BlockRealizeNode> node = make_object<BlockRealizeNode>();
node->binding_values = std::move(values);
if (!predicate.dtype().is_bool()) {
// To support python ir_builder
CHECK(is_one(predicate));
predicate = IntImm(DataType::Bool(), 1);
}
node->predicate = std::move(predicate);
node->block = std::move(block);
return Stmt(node);
}
TVM_REGISTER_GLOBAL("make.BlockRealize")
.set_body_typed(BlockRealizeNode::make); which is the same style as the original nodes on AST. And now it's all ok. |
It would be great if we can create a minimum repro, is it only for the new tensor ir objects, or it also happens for upstream if we use a similar style(eg pass things via constructor). Would be interesting to dig into the issue a bit, if we do gdb or valgrind, or try to see exactly at which pt the deallocate get called. |
If you can get a clean repro— eg create a dummy object that is similar and registration function, and call via py. Which version of gcc. I can take a look as wel |
From what we I can see the problem might occurs in set_body_typed. But am not exactly sure where |
OK, here is my guess, there might be something wrong with the type deduction -
Add ->Stmt hint to the function signatureThis will make the function signature the same as the change @spectrometerHBH TVM_REGISTER_GLOBAL("make.BlockRealize")
.set_body_typed<BlockRealize(Array<PrimExpr>, PrimExpr, Block)>(
[](Array<PrimExpr> values, PrimExpr predicate, Block block) -> Stmt {
if (!predicate.dtype().is_bool()) {
// To support python ir_builder
CHECK(is_one(predicate));
predicate = IntImm(DataType::Bool(), 1);
}
return BlockRealize(std::move(values), std::move(predicate), std::move(block));
}); Change ->Stmt to ->BlockRealize |
I change the signature to @tqchen TVM_REGISTER_GLOBAL("make.BlockRealize")
.set_body_typed<Stmt(Array<PrimExpr>, PrimExpr, Block)>(
[](Array<PrimExpr> values, PrimExpr predicate, Block block) -> Stmt {
if (!predicate.dtype().is_bool()) {
// To support python ir_builder
CHECK(is_one(predicate));
predicate = IntImm(DataType::Bool(), 1);
}
return BlockRealize(std::move(values), std::move(predicate), std::move(block));
}); but it doesn't work. |
Post an experiment result here. template<typename R, int index, typename F>
struct unpack_call_dispatcher<R, 0, index, F> {
template<typename ...Args>
static void run(const F& f,
const TVMArgs& args_pack,
TVMRetValue* rv,
Args&&... unpacked_args) {
std::cout << "?" << std::endl; // here
*rv = R(f(std::forward<Args>(unpacked_args)...));
std::cout << "??" << std::endl; // and here
}
}; and inline void Object::IncRef() {
std::cout << "IncRef " << GetTypeKey() << std::endl; // here
ref_counter_.fetch_add(1, std::memory_order_relaxed);
}
inline void Object::DecRef() {
if (ref_counter_.fetch_sub(1, std::memory_order_release) == 1) {
std::cout << "Delete " << GetTypeKey() << std::endl; // here
std::atomic_thread_fence(std::memory_order_acquire);
if (this->deleter_ != nullptr) {
(*this->deleter_)(this);
}
} else {
std::cout << "DecRef " << GetTypeKey() << std::endl; // here
}
} and TVM_REGISTER_GLOBAL("make.BlockRealize")
.set_body_typed<BlockRealize(Array<PrimExpr>, PrimExpr, Block)>(
[](Array<PrimExpr> values, PrimExpr predicate, Block block) {
if (!predicate.dtype().is_bool()) {
// To support python ir_builder
CHECK(is_one(predicate));
predicate = IntImm(DataType::Bool(), 1);
}
return BlockRealize(std::move(values), std::move(predicate), std::move(block));
});
BlockRealize::BlockRealize(Array<PrimExpr> values, PrimExpr predicate, Block block) {
CHECK_EQ(block->iter_vars.size(), values.size());
ObjectPtr<BlockRealizeNode> node = make_object<BlockRealizeNode>();
node->binding_values = std::move(values);
node->predicate = std::move(predicate);
node->block = std::move(block);
data_ = std::move(node);
std::cout << "OK" << std::endl;
} Here's the test.py import tvm
from tvm import tir
target = tvm.make.Block(
[], [], [], tvm.make.SeqStmt([]),
[], [], 'target')
br = tvm.make.BlockRealize([], 1, target)
print(br) The last few lines of the running result of test.py are
|
I've created a repo for debugging use, and I have invited you @tqchen @Hzfengsy. https://github.com/spectrometerHBH/tvm-tir-debug I've removed the whole tir directory under python, now tir part only contains definitions, make and functors of tir nodes. I'll try to remove more in the following commits. |
Good job confirming the details. There are still ways to dig deeper to see where the Delete BlockRealize happens in the line of
If the error is reproducible, i think we should be able to get to the truth |
Some more experiment results: 1. BlockRealize contains an Stmt.Correct Stmt br_make(Stmt block) {
ObjectPtr<BlockRealizeNode> node = make_object<BlockRealizeNode>();
node->block = block;
return Stmt(node);
}
TVM_REGISTER_GLOBAL("make.BlockRealize")
.set_body_typed(br_make); Wrong TVM_REGISTER_GLOBAL("make.BlockRealize")
.set_body_typed<Stmt(Stmt)>(
[](Stmt block) -> Stmt {
ObjectPtr<BlockRealizeNode> node = make_object<BlockRealizeNode>();
node->block = std::move(block);
return Stmt(node);
}); 2. If BlockRealize is empty thenCorrect Stmt br_make() {
ObjectPtr<BlockRealizeNode> node = make_object<BlockRealizeNode>();
return Stmt(node);
}
TVM_REGISTER_GLOBAL("make.BlockRealize")
.set_body_typed(br_make); Correct TVM_REGISTER_GLOBAL("make.BlockRealize")
.set_body_typed<Stmt()>(
[]() -> Stmt {
ObjectPtr<BlockRealizeNode> node = make_object<BlockRealizeNode>();
return Stmt(node);
}); |
I change TVM_REGISTER_GLOBAL("tir.BlockRealize")
.set_body([](TVMArgs args, TVMRetValue* rv) {
Array<PrimExpr> values = args[0];
PrimExpr predicate = args[1];
Block block = args[2];
if (!predicate.dtype().is_bool()) {
// To support python ir_builder
CHECK(is_one(predicate));
predicate = IntImm(DataType::Bool(), 1);
}
*rv = BlockRealize(values, predicate, block);
}); |
Previously my gcc version=7.4.0, but my g++ version=5.4.0, after I change g++=7.4.0. Everything is OK. Upd: siyuan tried clang=9. It also works. |
Ok, then it could due a mismatch of compiler and libcxx |
synced @spectrometerHBH @tqchen |
FYI @spectrometerHBH @Hzfengsy I went back to look at the issue of BlockRealize and find the cause. The main reason was due to we are passing the FLambda signature to the set body typed. In particular, if we change the following code .set_body_typed<BlockRealize(Array<PrimExpr>, PrimExpr, Block)>(
[](Array<PrimExpr> values, PrimExpr predicate, Block block) {
..
}); to the following one .set_body_typed(
[](Array<PrimExpr> values, PrimExpr predicate, Block block) {
..
}); The then segfault disappears. Here is my guess about what was happening: In the new code(without the FLambda signature), the closure is directly passed into set_body_typed and being captured. The compiler knows that we need to keep the function align. On the version with Flambda though, the closure is first converted to type |
The take away is that we should not pass type signatures to the argument of the set_body_typed |
The python side has been updated to reflect the final form of the namespace organization
Update Guide apache/tvm#4647
cc @Hzfengsy
The text was updated successfully, but these errors were encountered: