Statically-linked libraries in TF binary can cause symbol collisions #9525
Comments
I think this relates to #9391. |
I'm trying to solve this in #9391. Follow up there. |
@nkhdiscovery @drpngx Should we dedup this with #9391? @nkhdiscovery We're planning to fix this for good using more restricted exports. In particular, protos should not appear in the public API once we're done. Do you mind if I close that other bug? |
@girving Thanks for closing that one, I would be happy if I could help. I already tried hiding symbols by adding a version script to the cc project, but I couldn't figure out a clean way to write the regex matching the unnecessary symbols. I would do that if you give me a hint on which API functions should be made public (global) while everything else is hidden (local) (_TF* didn't work as global), or a hint on how to hide symbols coming from specific headers. I googled a lot and found almost nothing good enough on using version scripts to solve this. |
Version scripts are not so good for C++-based projects, because it's hard to control what needs to be public and what doesn't. This is why visibility attributes were added. Everything except symbol versions is available as attributes :( A version script is a good starting point (I would prefer to have symbol versions). |
We'll be marking exported symbols with a TF_EXPORT macro, but there's a bunch of upfront work to do to minimize the API surface area before we do that. |
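For illustration, an export macro of this kind usually pairs a default-hidden build with an explicit annotation on the public API. The sketch below is generic: the MYLIB_* names are hypothetical, and this is not necessarily TensorFlow's exact TF_EXPORT definition.

```cpp
// Illustrative sketch of an export macro (hypothetical MYLIB_* names; not
// necessarily TensorFlow's actual TF_EXPORT definition). The library is
// compiled with -fvisibility=hidden so symbols default to hidden, and only
// annotated declarations are exported from the shared object.
#if defined(_WIN32)
#if defined(MYLIB_COMPILE_LIBRARY)
#define MYLIB_EXPORT __declspec(dllexport)  // building the DLL itself
#else
#define MYLIB_EXPORT __declspec(dllimport)  // consuming the DLL
#endif
#else
#define MYLIB_EXPORT __attribute__((visibility("default")))
#endif

MYLIB_EXPORT int CreateSession();  // intended public API: exported
int InternalHelper();              // hidden under -fvisibility=hidden
```

Shrinking the exported set this way is also what would keep a Windows .def file under the 65535-symbol limit mentioned below.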
@davidlt Are you suggesting to put … @girving First, as I understood from your comment, this TF_EXPORT macro will then help us recognize what has to be exported and what doesn't, right? I mean, at least I will be able to put every single function which has that macro in my version script and temporarily solve the problem for my own usage. Am I right? Thanks for your replies, guys. |
@nkhdiscovery Once we're done, the code will be compiled with |
@girving Thanks for your answer, I just understood what you are doing as the solution. Is there any way I can contribute to accelerate this? Isn't it just enough to add this macro to all API functions? |
Most of the complexity is refactoring the code so that protos don't need to be exposed, since we don't have control over those. I'm not sure how to parallelize the required refactoring, and unfortunately a good chunk of the complexity is making sure said refactoring doesn't break non-opensource code. |
I'm wondering, will the approach described here also help with the problem of building a debug-mode DLL for Windows via CMake? Currently the issue is that the .def file generated by create_def_file.py contains more symbols than the 65535 limit. |
@adennie Yes, we should be able to fit within that limit. Rather embarrassing that we can't yet. :) |
@girving I just noticed "non-opensource"? Which parts are non-opensource? Did you just mean 3rd parties? I thought TensorFlow is all open source! |
TensorFlow is all open source, but there is a lot of downstream Google code that uses it. Some of it is tightly integrated and needs refactoring too. |
Any updates on this issue? My project is kind of blocked by the inability to build a debug tensorflow DLL. |
Still working on it, but no usable progress yet. |
Transferring issue ownership. |
It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly. |
@adennie take a look at the last commits of the fd-devel branch in our fork. Hope it helps. Let me know the result. |
Nagging Assignee: It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly. |
Cross-referencing #16104. We now have better support for dynamic loading into split libraries, so it's close. |
Nagging Assignee @drpngx: It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly. |
TensorFlow is no longer exporting any symbols globally, which sounds like the original issue? The protobuf refactoring is still useful and ongoing AFAIK, but turned out to be tangential. And we'd still like to split more of our dependencies into separate shared objects, also ongoing. I'll close this, but @skye feel free to reopen if you had something else in mind. |
@allenlavoie Great news to hear, thanks! I would like to help splitting, let me know if you have any open issues ... |
@nkhdiscovery the next step IMO would be to split off a shared object with the implementations of our protocol buffers (libtensorflow_protobufs.so?). The main benefit would be that users of the C++ API would no longer need to link against libtensorflow_framework.so (or build libtensorflow_cc statically) for protocol buffer symbols, and so would run into fewer symbol conflicts. So basically #14267; it's closed at the moment, but you could re-open it and work on it. We have workarounds but no great solution for C++ API users who want to use OpenCV and use custom ops (custom ops won't work with static libtensorflow_cc, OpenCV won't work with dynamic libtensorflow_cc). There are two things to be moved: one is the static variables for protocol buffer registration (…). Steps I think the split would include:
Happy to chat more if this sounds interesting. Sending an email to [email protected] with a rough plan and discussing would be a good start (@gunan and others are working on a related effort, so coordinating would be important). |
Hi, I need to load a model in a "plugin" for a C++ program; the code is basically a header which calculates a matrix and returns it to the main program. There I need to:
The main program uses the OpenCV functions imread, imencode, and imwrite; the latter causes a segfault if TensorFlow headers are included and tensorflow_cc and tensorflow_framework are linked dynamically. Doing a monolithic build disables the ability to hide GPU devices from the TensorFlow session, but otherwise it runs smoothly. I compiled using master, a particular commit of 1.8, and r1.9 (not all at the same time, and I tried all of them on the same machine). Configure: Bazel command: My problems:
Possible solutions (may break other things):
|
@JosephIWB
How are you hiding them? CUDA_VISIBLE_DEVICES? I have no idea why this wouldn't work, but if you have a quick repro someone can take a look. There's also the "add yet another shared object" workaround for the OpenCV symbol conflict. |
I can't use CUDA_VISIBLE_DEVICES because the program is multi-threaded and is supposed to work on various GPUs separately (you can configure the program to process multiple video streams and then decide which GPU each thread should use), so the GPU a thread uses has to be chosen at runtime. I use this code to generate the session configuration options:
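The code block from this comment was not preserved in the thread. As a stand-in, here is a minimal sketch of per-session GPU selection with the TensorFlow C++ API, assuming visible_device_list and per_process_gpu_memory_fraction were the settings used; it is not the poster's original code, the function name is hypothetical, and the 0.3 fraction is a placeholder for the hard-coded value mentioned in the next comment.

```cpp
// Minimal sketch (not the poster's original code): pin a TensorFlow C++
// session to one GPU and cap its memory use via ConfigProto.
#include <string>

#include "tensorflow/core/protobuf/config.pb.h"
#include "tensorflow/core/public/session.h"

tensorflow::Session* CreateSessionOnGpu(const std::string& gpu_id) {
  tensorflow::SessionOptions options;
  // Expose only the chosen GPU to this session (e.g. gpu_id = "1").
  options.config.mutable_gpu_options()->set_visible_device_list(gpu_id);
  // Hard-coded fraction as a placeholder; could be passed in instead.
  options.config.mutable_gpu_options()->set_per_process_gpu_memory_fraction(0.3);

  tensorflow::Session* session = nullptr;
  tensorflow::Status status = tensorflow::NewSession(options, &session);
  if (!status.ok()) {
    return nullptr;  // Error handling elided for brevity.
  }
  return session;
}
```

Because these options live in ConfigProto, a protocol buffer, compiling code like this is exactly where the missing protobuf symbols discussed below show up at link time.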
The gpu_memory_fraction will be changed in the future to be passed in by the function, and not hardcoded like that. When you do it like that, the program complains about linking issues with protobuf, and linking protobuf (-lprotobuf) didn't solve the problem. Doing the trick I stated above made the program work correctly when using imread, imencode, and imwrite from OpenCV. Another fix I think should work would be compiling TensorFlow and OpenCV against the same dependencies, so they share the same symbols and don't generate any conflicts between them, but that takes a lot of time, and as the workaround I found in an issue here worked, I think we will not be trying that (it is a lot more work to tell the compiler to use the same dependencies; Bazel and CMake seem not to like each other very much). Thanks for your response. |
Oh I see, the issue is that the C++ API doesn't include protobuf symbols. You need to link against libtensorflow_framework.so for those (unfortunately a known issue). Do you think the fvisibility change is submittable? It may be worth running the tests (e.g. bazel test -c opt //tensorflow/core/... //tensorflow/python/...), and if they pass, making a pull request out of it (I'm happy to review). If we don't need the symbols which conflict with OpenCV, we should stop exporting them. |
I don't think so, I'm not very savvy anyway. In issue #14627 @ruanjiandong proposed that workaround, and it worked for me. A quote from what was said there regarding the probable use of the fvisibility change:
Currently I have a lot of dependency problems and I'm also very short on time to run these tests on my machine (I also have GPU driver problems, among other things), so I probably won't be able to help. I posted here to let others know that this workaround works, although other problems may arise because of it. Anyway, the compatibility problems between OpenCV and TensorFlow may come from the image libraries and the protobuf library; a good solution would probably involve testing whether compiling against the same dependencies works, so both OpenCV and TensorFlow work together (this is probably a common issue among people who work on computer vision). Maybe I will do some tests on the weekend; I will keep you posted. |
I tried running the tests suggested by @allenlavoie. Unfortunately, my workaround breaks the test build: k8-py3-opt/bin/_solib_local/libtensorflow_Score_Slibjpeg_Uinternal.so needs those jpeg symbols exported. The failing link step:
external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -o bazel-out/k8-py3-opt/bin/tensorflow/core/grappler/costs/utils_test '-Wl,-rpath,$ORIGIN/../../../../_solib_local/' '-Wl,-rpath,$ORIGIN/../../../../_solib_local/_U_S_Stensorflow_Score_Sgrappler_Scosts_Cutils_Utest___Utensorflow' '-Wl,-rpath,$ORIGIN/../../../../_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccublas___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib' '-Wl,-rpath,$ORIGIN/../../../../_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccusolver___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib' '-Wl,-rpath,$ORIGIN/../../../../_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccudart___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib' -Lbazel-out/k8-py3-opt/bin/_solib_local/_U_S_Stensorflow_Score_Sgrappler_Scosts_Cutils_Utest___Utensorflow -Lbazel-out/k8-py3-opt/bin/_solib_local -Lbazel-out/k8-py3-opt/bin/_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccublas___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib -Lbazel-out/k8-py3-opt/bin/_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccusolver___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib -Lbazel-out/k8-py3-opt/bin/_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccudart___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib '-Wl,-rpath,$ORIGIN/,-rpath,$ORIGIN/..,-rpath,$ORIGIN/../..,-rpath,$ORIGIN/../../..' -Wl,-z,muldefs -Wl,-z,notext -Wl,-z,notext -Wl,-z,notext -Wl,-z,notext -Wl,-z,notext -Wl,-z,notext -Wl,-z,notext -pthread -Wl,-z,notext -Wl,-z,notext -Wl,-z,notext -Wl,-z,notext -Wl,-z,notext -Wl,-z,notext -Wl,-z,notext -Wl,-rpath,../local_config_cuda/cuda/lib64 -Wl,-rpath,../local_config_cuda/cuda/extras/CUPTI/lib64 -pthread -Wl,-no-as-needed -B/usr/bin/ -pie -Wl,-z,relro,-z,now -no-canonical-prefixes -pass-exit-codes '-Wl,--build-id=md5' '-Wl,--hash-style=gnu' -Wl,--gc-sections -Wl,@bazel-out/k8-py3-opt/bin/tensorflow/core/grappler/costs/utils_test-2.params) |
Thanks @ruanjiandong! I guess not super surprising, but was worth a try. So the options are still (1) split out proto symbols so people don't need to link in libtensorflow_framework.so, (2) move libjpeg to the language bindings / colocated with the kernel. Possibly (2) is easier? |
@allenlavoie , I took another look at the build failure. Those tests actually need jpeg_* symbols from libjpeg.so, not libtensorflow_framework.so. Without my change, bazel will produce both a static and a dynamic libjpeg library for the test build. I made a new change which uses an ld version script to selectively hide jpeg symbols when linking libtensorflow_framework.so. With the new change, all the tests passed except for 3 grpc tests (related to my test environment). The new change works only for Linux. For OS X, I don't know how to selectively hide symbols using the "-exported_symbols_list" option. I will create a pull request for the new change. |
@allenlavoie , could you please review pull request #19966? |
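For readers unfamiliar with the technique: a GNU ld version script that demotes only the libjpeg symbols to local visibility might look roughly like the sketch below. The file name and the exact symbol patterns are assumptions for illustration; the script in the actual pull request may differ.

```
/* hide_jpeg.lds (hypothetical file name), passed to the linker as
   -Wl,--version-script=hide_jpeg.lds when building
   libtensorflow_framework.so. Symbols matching the patterns under
   "local:" become local (hidden); symbols not listed keep their default,
   exported visibility. GNU ld only; macOS's -exported_symbols_list works
   differently, hence the OS X caveat above. */
{
  local:
    jpeg_*;   /* public libjpeg API, e.g. jpeg_read_header */
    jinit_*;  /* libjpeg internal initializers */
};
```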
@gnossen gave a great overview in grpc/grpc#24992 of the overall problem. A Python process that uses both protobuf _and_ another native library linking in libprotobuf can frequently crash. This seems to frequently affect tensorflow as well: tensorflow/tensorflow#8394, tensorflow/tensorflow#9525 (comment), tensorflow/tensorflow#24976, tensorflow/tensorflow#35573, https://github.com/tensorflow/tensorflow/blob/v2.0.0/tensorflow/contrib/makefile/rename_protobuf.sh, tensorflow/tensorflow#16104. Testing locally, this fixes both crashes when linking in multiple versions of protobuf and `DescriptorPool` clashes as well (e.g. Python and native code importing different versions of the same message). Co-authored-by: Roy Williams <[email protected]>
TensorFlow currently statically links all dependencies. This sometimes causes hard-to-diagnose crashes (e.g. segfaults) when another version of a dependency is loaded into the process. This can even happen within TensorFlow if separate TensorFlow .so's are loaded into the same Python process.
Possible solutions would be to reduce the visibility of these symbols, dynamically link common libraries, or run TF in a separate process.
Known problematic libraries:
Other related issues: