-
Notifications
You must be signed in to change notification settings - Fork 613
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
image.rotate
Segfault with TensorFlow 2.2.0rc0 and tf-nightly
#1298
Comments
We're working on the 2.2 release right now, so it would be good to determine if this is a problem in tensorflow, or a problem in tfa. |
Thanks @MarkDaoust we've also seen issue with users running tf-nightly. I'll have some time to look into this on the weekend. @yifeif @gunan @perfinion @angerson Does anything come to mind that would break compaitbility. TFA links against libtesorflow of TF2.1 but I believe we'll still be shipping manylinux2010 whls built with dt7 for 2.2? Not sure why we would be getting segfaults. |
@seanpmorgan Since the ABI is not compatible from TF 2.1.0 to 2.2.0, isn't that completely normal that we get segfaults? We compile the nightly against 2.1, so it won't work with 2.2. We just need to recompile right? Or I am missing something here? When running the notebook with
So I don't really get why this segfault is a problem. |
So to my knowledge TF2.2 and TF2.1 should be compiled in the same way (devtoolset7/manylinux2010 compliant). Linking to the same ops shouldn't raise a segfault. We would like our package to be compatible with multiple versions of TF so that for example: our build matrix 0.8.3 can be compatible with For the moment the version check is the best way to prevent a poor user experience, but it shouldn't be required. We didn't have this issue when moving from 2.0 -> 2.1. The reason for the incompatibility at that point was we started using the |
I see, I though that even if the same toolchain was used, there was no garantee that the ABI would stay compatible if the C++ code in tensorflow changed. Thanks for the explanation. |
Yeah no ABI compatibility between compiler toolchains. When someone uses TFA with conda compiled TF we get a symbol mismatch because the ops are mangled differently (among other possible issues). If the C++ code in tensorflow is changed it could break if not backwards compatible, but the ops should all be public and stable between minor versions. |
Another reason why there should be a standard way(which IMO should be |
Agree tensorflow/community#133 is the fix for that but haven't heard any updates in a while. If that doesn't happen though then conda first would be great, but there's still too many python users not on conda to be excluded |
So if that helps: I tried with tensorflow-cpu to see if we could get the segfault. pip install tensorflow-cpu==2.1.0
python configure.py --no-deps
bash tools/install_so_files.sh
pip install tensorflow-cpu==2.2.0rc0
pytest tensorflow_addons/image
# segfault here But recompiling works: pip install tensorflow-cpu==2.2.0rc0
python configure.py --no-deps
bash tools/install_so_files.sh
pytest tensorflow_addons/image
# tests are passing I opened a pull request to allow us to transition gracefully to 2.2.0 without having to make an enormous pull request at once: #1303 I hope this help, my knowledge of C++/compiling is limited. |
Thanks very helpful! If I'm unable to find a resolution to this we'll likely need to backport the strict version check to 0.8 prior to a 0.9 release pinned to 2.2 |
So I tested using gdb for an image op and activation op and they both are crash on the same call. Unsure if this is just the first call or the only cause (both are ABSL string calls):
yields:
In order to confirm that that linking to
@mihaimaruseac I was wondering if you could weight in on this?
|
FWIW I checked the activation ops compiled against TF2.0 and they work successfully with TF2.0 or TF2.1 but fail on TF2.2rc0. At this point I'm pretty confident there was a breaking change in the 2.2 libtensorflow |
Afaik we don't have compatibility guarantees for libtensorflow. So we cannot link against libtensorflow built at a different version. |
Thanks! That's rather unfortunate because it means users who want new features from Addons will be required to upgrade their TF core version which is a tough ask for many users. I'll make a post on developers group discussion to describe this issue and see what we can do going forward. |
@mihaimaruseac if I understand correctly, tensorflow won't give any backward compatibilities garantees about the ABI until tensorflow/community#133 is implemented, is that right? Is that the case too when the minor version in increased? If we compile against TF 2.1.0, will it work with 2.1.1? |
@gabrieldemarmiesse patch number increase should be compatible There are multiple RFCs for modularization and to add ABI guarantees. For example, I'm working on tensorflow/community#101 and the general ideas are in tensorflow/community#77 |
Just to be clear, if there are breaking changes to libtensorflow as minor versions increase then ABI stable kernels (tensorflow/community#133) will still crash right? We're seeing this issue when compiling with the same toolchain so the ABI should be the same it's just that underlying op has changed |
Agreed, especially if the user has compiled TF from source |
They could change something in their public API, too. For example, TFA creates a helper object declared in the "outdated" TF 2.1 headers but TF 2.2 doesn't expect it of course. C++ name mangling does not handle it. |
I'm going to close this issue as the bug was adressed and we can't do anything about it. I'm going to open another one about how to make a better UX for the users affected by the unstability of the ABI. |
Just for reference: |
System information
OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
TensorFlow version and how it was installed (source or binary):
TensorFlow-Addons version and how it was installed (source or binary):
Python version:
Is GPU used? (yes/no):
Describe the bug
tfa.image.rotate
segfaults.It's the only tfa op I tried.
Code to reproduce the issue
Colab:
https://drive.google.com/open?id=1T5le-3aLhzsxa62uFGOsc6UpzzTEwN3a
Easy to reproduce locally on my macbook.
Other info / logs
See notebook.
I followed it in python up to this line: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/eager/execute.py#L59
The text was updated successfully, but these errors were encountered: