
Statically-linked libraries in TF binary can cause symbol collisions #9525

Closed
skye opened this issue Apr 28, 2017 · 38 comments

@skye
Member

skye commented Apr 28, 2017

TensorFlow currently statically links all dependencies. This sometimes causes hard-to-diagnose crashes (e.g. segfaults) when another version of a dependency is loaded into the process. This can even happen within TensorFlow if separate TensorFlow .so's are loaded into the same Python process.

Possible solutions would be to reduce the visibility of these symbols, dynamically link common libraries, or run TF in a separate process.

Known problematic libraries:

Other related issues:

@skye skye added the stat:awaiting tensorflower and type:bug labels Apr 28, 2017
@davidlt

davidlt commented May 3, 2017

I think this relates to #9391.
This is needed if people want to integrate TensorFlow binaries into their existing software.

@nkhdiscovery

I'm trying to solve this in #9391. Follow up there.

@girving
Contributor

girving commented May 10, 2017

@nkhdiscovery @drpngx Should we dedup this with #9391?

@nkhdiscovery We're planning to fix this for good using more restricted exports. In particular, protos should not appear in the public API once we're done. Do you mind if I close that other bug?

@girving girving added and removed the stat:awaiting tensorflower label May 10, 2017
@nkhdiscovery

nkhdiscovery commented May 11, 2017

@girving Thanks for closing that one; I would be happy to help if I can. I already tried hiding symbols by adding a version script to the cc project, but I couldn't figure out a clean way to write the regex matching the unnecessary symbols. I could do it if you give me a hint about which API functions should be made public (global) while everything else is hidden (local) (_TF* didn't work as a global pattern), or a hint on how to hide symbols coming from specific headers. I Googled a lot and found almost nothing good enough on using version scripts to solve this.

@davidlt

davidlt commented May 11, 2017

Version scripts are not so good for C++-based projects, because it's hard to control what needs to be public and what doesn't. This is why visibility attributes were added. Everything except symbol versions is available as an attribute :( A version script is a good starting point, though I would prefer to have symbol versions.
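For illustration, a minimal version script might look something like the sketch below. The symbol patterns here are placeholders for whatever ends up being the public surface, not TensorFlow's actual export list:

    /* hypothetical tf_exports.lds -- export the C API, hide everything else */
    {
      global:
        TF_*;      /* C API entry points */
      local:
        *;         /* everything else stays local to the .so */
    };

The catch for C++ is exactly the one mentioned above: mangled C++ names are hard to match reliably with glob patterns, which is why attribute-based visibility tends to be more maintainable.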

@girving
Contributor

girving commented May 11, 2017

We'll be marking exported symbols with a TF_EXPORT macro, but there's a bunch of upfront work to do to minimize the API surface area before we do that.

@nkhdiscovery

@davidlt Are you suggesting putting __attribute__ ((visibility ("default"))) in the code on everything that should stay visible? That would require lots of additions in third_party headers, e.g. in protobuf itself, which would add further cost whenever one of those third_party libraries is upgraded. I didn't consider that a solution because it felt like creating another mess; even repackaging those libraries under new names and dynamically linking against those specific versions seems much cleaner to me.

@girving First, as I understand your comment, this TF_EXPORT macro will help us recognize what has to be exported and what does not, right? At the very least I would be able to put every function carrying that macro into my version script and temporarily solve the problem for my own usage. Am I right?
Second, I see you want to do this later to avoid redundant work (putting the macro everywhere and then removing whole parts of the API during minimization would make no sense, for sure); but if that is the only reason for not doing it now, I can do that redundant work in a fork or a temporary branch so that the parts remaining after minimization can still be used. I have to solve this for myself as soon as possible, so let's do it in a way that is helpful for further contribution. Can you give me any hint on how to do that? I saw the usage in tensorflow/core/framework/types.h on master; I think I can give it a try if I know exactly what has to be exported.

Thanks for your replies guys.

@girving
Contributor

girving commented May 13, 2017

@nkhdiscovery Once we're done, the code will be compiled with -fvisibility=hidden, and only symbols marked with TF_EXPORT will be exported. Unfortunately I don't understand the details of version scripts, so I don't know enough to say what you could do that would be useful. Most of the work is restructuring the actual code to make it easy to restrict exports, not doing the actual export restriction.
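To sketch the general pattern (this is the usual platform-neutral visibility idiom, not necessarily the final shape of TF_EXPORT):

    // Hypothetical export macro following the standard visibility idiom.
    #if defined(_WIN32)
    #if defined(TF_COMPILE_LIBRARY)
    #define TF_EXPORT __declspec(dllexport)
    #else
    #define TF_EXPORT __declspec(dllimport)
    #endif
    #else
    #define TF_EXPORT __attribute__((visibility("default")))
    #endif

    // With -fvisibility=hidden, only annotated declarations are exported:
    TF_EXPORT void SomePublicFunction();

Everything without the annotation stays local to the shared object, which is what keeps statically linked dependencies (protobuf, libjpeg, etc.) from colliding with other copies loaded into the same process.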

@nkhdiscovery

@girving Thanks for your answer; I now understand what you are working toward as the solution. Is there any way I can contribute to accelerate this? Isn't it enough to just add this macro to all API functions?

@girving
Contributor

girving commented May 15, 2017

Most of the complexity is refactoring the code so that protos don't need to be exposed, since we don't have control over those. I'm not sure how to parallelize the required refactoring, and unfortunately a good chunk of the complexity is making sure said refactoring doesn't break non-opensource code.

@adennie

adennie commented May 18, 2017

I'm wondering, will the approach described here also help with the problem of building a debug-mode DLL for Windows via CMake? Currently the issue is that the .def file generated by create_def_file.py contains more symbols than the 65535 limit.

@girving
Contributor

girving commented May 18, 2017

@adennie Yes, we should be able to fit within that limit. Rather embarrassing that we can't yet. :)

@nkhdiscovery

@girving I just noticed: "non-opensource"? Which parts are non-opensource? Did you just mean third parties? I thought TensorFlow was all open source!

@girving
Contributor

girving commented May 21, 2017

TensorFlow is all open source, but there is a lot of downstream Google code that uses it. Some of it is tightly integrated and needs refactoring too.

@adennie

adennie commented Jun 16, 2017

Any updates on this issue? My project is kind of blocked by the inability to build a debug tensorflow DLL.

@girving
Contributor

girving commented Jun 16, 2017

Still working on it, but no usable progress yet.

@girving girving assigned drpngx and unassigned girving Jun 29, 2017
@girving
Contributor

girving commented Jun 29, 2017

Transferring issue ownership.

@tensorflowbutler
Member

It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@nkhdiscovery

@adennie take a look at the latest commits on the fd-devel branch of our fork:
https://github.com/Faraadid/tensorflow/tree/fd-devel?files=1

Hope it helps. Let me know the result.

@tensorflowbutler
Member

Nagging Assignee: It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.


@drpngx
Contributor

drpngx commented Mar 6, 2018

Cross-referencing #16104.

We now have better support for dynamic loading into split libraries, so it's close.

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower label Mar 7, 2018
@tensorflowbutler
Member

Nagging Assignee @drpngx: It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.


@drpngx drpngx assigned allenlavoie and unassigned drpngx Apr 26, 2018
@allenlavoie
Member

TensorFlow is no longer exporting any symbols globally, which sounds like it addresses the original issue? The protobuf refactoring is still useful and ongoing AFAIK, but it turned out to be tangential. And we'd still like to split more of our dependencies into separate shared objects, which is also ongoing.

I'll close this, but @skye feel free to reopen if you had something else in mind.

@nkhdiscovery

nkhdiscovery commented Apr 29, 2018

@allenlavoie Great news to hear, thanks! I would like to help with the splitting; let me know if you have any open issues ...

@allenlavoie
Member

@nkhdiscovery the next step IMO would be to split off a shared object with the implementations of our protocol buffers (libtensorflow_protobufs.so?).

The main benefit would be that users of the C++ API would no longer need to link against libtensorflow_framework.so (or build libtensorflow_cc statically) for protocol buffer symbols, and so would run into fewer symbol conflicts. So basically #14267; it's closed at the moment, but you could re-open it and work on it. We have workarounds but no great solution for C++ API users who want to use OpenCV and use custom ops (custom ops won't work with static libtensorflow_cc, OpenCV won't work with dynamic libtensorflow_cc).

There are two things to be moved: one is the static variables for protocol buffer registration (@protobuf_archive//:protobuf), the other is the generated implementations of TensorFlow's protocol buffers. My thought is that these should stay together for now.

Steps I think the split would include:

  1. Hacking around with build rules until bazel query 'somepath(//tensorflow:libtensorflow_framework.so, @protobuf_archive//:protobuf)' and bazel query 'somepath(//tensorflow:libtensorflow_framework.so, //tensorflow/core:protos_all_cc_impl)' return empty results
  2. Include these explicitly in a new //tensorflow:libtensorflow_protobuf.so rule (near libtensorflow_framework.so), and include //tensorflow:libtensorflow_protobuf.so in tf_binary_additional_srcs (see the sketch at the end of this comment).
  3. Make sure all the tests pass :). Easy to have undefined symbols when messing with linking.
  4. The final step would be dealing with packaging issues, such as for the Java bindings (which I believe still hard-code the names of TensorFlow libraries).

Happy to chat more if this sounds interesting. Sending an email to [email protected] with a rough plan and discussing would be a good start (@gunan and others are working on a related effort, so coordinating would be important).
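For reference, a rough BUILD-level sketch of what step 2 might look like (the rule shape and deps here are illustrative only, not a final design):

    # Hypothetical rule, placed near libtensorflow_framework.so in tensorflow/BUILD.
    cc_binary(
        name = "libtensorflow_protobuf.so",
        linkshared = 1,
        deps = [
            "//tensorflow/core:protos_all_cc_impl",
            "@protobuf_archive//:protobuf",
        ],
    )

The resulting .so would then be added to tf_binary_additional_srcs so that every TensorFlow binary links against it instead of embedding its own copy of the proto symbols.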

@ghost

ghost commented Jun 7, 2018

Hi,

I need to load a model in a "plugin" for a C++ program; the code is basically a header which calculates a matrix and returns it to the main program. There I need to:

  1. select the GPU on which to load the model; no other GPU should be seen by TensorFlow, and this needs to be done at runtime, as the gpu_id is a parameter of the constructor of the class which runs the TensorFlow model.
  2. load the model and run.

The main program uses the OpenCV functions imread, imencode and imwrite, the latter causing a segfault when including TensorFlow headers and dynamically linking tensorflow_cc and tensorflow_framework.

Doing a monolithic build disables the ability to hide GPU devices from the TensorFlow session; otherwise it runs smoothly.

I compiled using master, a particular commit of 1.8, and r1.9 (not all at the same time; I tried all of them on the same machine).

Configure:
cuda 8.0,
cudnn 7.1.

Bazel command:
bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-msse4.2 //tensorflow:libtensorflow_cc.so

My problems:

  1. using TensorFlow headers and OpenCV imwrite, imread and imencode doesn't work unless you use the monolithic build.
  2. using the monolithic build disables setting GPU visible devices, which is a very important issue for some, as it needs protocol buffer symbols, and linking libprotobuf doesn't work.

Possible solutions: (may break other things)

  1. Possible fix:
    someone in another issue suggested doing a change around line 41 in third_party/jpeg/jpeg.BUILD; "-fPIC" is needed once "-fvisibility=hidden" is added.
    "//conditions:default": [
    "-fvisibility=hidden -fPIC"
    ],
    This enables the use of OpenCV imread, imwrite and imencode, but it probably breaks other things.
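For context, a cleaner way to express that change inside the copts select of third_party/jpeg/jpeg.BUILD would be roughly the following (an approximate fragment; other branches of the select are omitted, and the flags are written as separate list entries):

    # Approximate sketch of the default branch of the copts select.
    copts = select({
        "//conditions:default": [
            "-fvisibility=hidden",  # compile libjpeg with hidden visibility
            "-fPIC",                # still required for the shared build
        ],
        # ... platform-specific branches left unchanged ...
    }),

Whether the two flags are written as one string or as separate entries, the effect is the same: libjpeg is compiled with hidden visibility, so its symbols no longer clash with the copy inside OpenCV.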

@allenlavoie
Member

@JosephIWB

Doing a monolithic build disables the ability to hide gpu devices from tensorflow session,

How are you hiding them? CUDA_VISIBLE_DEVICES? I have no idea why this wouldn't work, but if you have a quick repro someone can take a look.

There's also the "add yet another shared object" workaround for the OpenCV symbol conflict.

@ghost

ghost commented Jun 11, 2018

I can't use CUDA_VISIBLE_DEVICES because the program is multi-threaded and is supposed to work on various GPUs separately (you can configure the program to process multiple video streams and also decide which GPU each thread should use), so CUDA_VISIBLE_DEVICES doesn't work, since you need to decide which GPU a thread uses at runtime.

I use this code to generate the session configuration options:

    // Restrict this session to a single GPU chosen at runtime via gpu_id.
    tf::Session* session_ptr;
    auto options = tf::SessionOptions();
    const std::string gpu_vis_device = std::to_string(gpu_id);
    options.config.mutable_gpu_options()->set_visible_device_list(gpu_vis_device);
    options.config.mutable_gpu_options()->set_per_process_gpu_memory_fraction(0.025);
    options.config.mutable_gpu_options()->set_allow_growth(true);
    // NewSession takes a SessionOptions whose config is a ConfigProto, which is
    // why the protobuf symbols need to be linked in.
    auto status = NewSession(options, &session_ptr);
    if (!status.ok()) {
        std::cout << status.ToString() << "\n";
        return false;
    }
    session.reset(session_ptr);  // session is a smart-pointer member of the surrounding class

The gpu_memory_fraction will be changed in the future to be passed in as a parameter rather than hardcoded like that.

When you do it like that, the program complains about linking issues with protobuf, and linking protobuf directly (-lprotobuf) didn't solve the problem.

Doing the trick I stated above made the program work correctly when using imread, imencode and imwrite from opencv.

Another fix that I think should work would be compiling TensorFlow and OpenCV against the same dependencies, so they share the same symbols and don't generate any conflicts between them. But that takes a lot of time, and since the workaround I found in an issue here worked, I think we will not be trying that (it is a lot more work to tell the compiler to use the same dependencies; Bazel and CMake do not seem to like each other very much).

Thanks for your response.

@allenlavoie
Member

Oh I see, the issue is that the C++ API doesn't include protobuf symbols. You need to link against libtensorflow_framework.so for those (unfortunately a known issue).

Do you think the fvisibility change is submittable? It may be worth running the tests (e.g. bazel test -c opt //tensorflow/core/... //tensorflow/python/...) and, if they pass, making a pull request out of it (I'm happy to review). If we don't need the symbols which conflict with OpenCV, we should stop exporting them.

@ghost

ghost commented Jun 11, 2018

I don't think so, I'm not very savvy anyway.

In issue #14627, @ruanjiandong proposed that workaround, and it worked for me.

A quote from what was said there regarding the probable use of the fvisibility change:

There are some linking warnings for undefined dynamic symbol in tensorflow/contrib/lite/toco/toco, which I don't use.

Currently I have a lot of dependency problems and I'm also very short on time to run these tests on my machine (I also have gpu driver problems, among other things), so I probably won't be able to help.

I posted here to let others know that this workaround works, although other problems may arise because of it.

Anyway, the compatibility problems between OpenCV and TensorFlow probably come from the image libraries and the protobuf library. A good solution would probably involve testing whether compiling both against the same dependencies works, so that OpenCV and TensorFlow can be used together (this is probably a common issue among people who work on computer vision).

Maybe I will do some tests over the weekend; I will keep you posted.

@ruanjiandong
Contributor

ruanjiandong commented Jun 12, 2018

I tried running the tests suggested by @allenlavoie. Unfortunately, my workaround breaks the test build: k8-py3-opt/bin/_solib_local/libtensorflow_Score_Slibjpeg_Uinternal.so needs those jpeg symbols exported.

external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -o bazel-out/k8-py3-opt/bin/tensorflow/core/grappler/costs/utils_test '-Wl,-rpath,$ORIGIN/../../../../_solib_local/' '-Wl,-rpath,$ORIGIN/../../../../_solib_local/_U_S_Stensorflow_Score_Sgrappler_Scosts_Cutils_Utest___Utensorflow' '-Wl,-rpath,$ORIGIN/../../../../_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccublas___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib' '-Wl,-rpath,$ORIGIN/../../../../_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccusolver___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib' '-Wl,-rpath,$ORIGIN/../../../../_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccudart___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib' -Lbazel-out/k8-py3-opt/bin/_solib_local/_U_S_Stensorflow_Score_Sgrappler_Scosts_Cutils_Utest___Utensorflow -Lbazel-out/k8-py3-opt/bin/_solib_local -Lbazel-out/k8-py3-opt/bin/_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccublas___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib -Lbazel-out/k8-py3-opt/bin/_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccusolver___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib -Lbazel-out/k8-py3-opt/bin/_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccudart___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib '-Wl,-rpath,$ORIGIN/,-rpath,$ORIGIN/..,-rpath,$ORIGIN/../..,-rpath,$ORIGIN/../../..' -Wl,-z,muldefs -Wl,-z,notext -Wl,-z,notext -Wl,-z,notext -Wl,-z,notext -Wl,-z,notext -Wl,-z,notext -Wl,-z,notext -pthread -Wl,-z,notext -Wl,-z,notext -Wl,-z,notext -Wl,-z,notext -Wl,-z,notext -Wl,-z,notext -Wl,-z,notext -Wl,-rpath,../local_config_cuda/cuda/lib64 -Wl,-rpath,../local_config_cuda/cuda/extras/CUPTI/lib64 -pthread -Wl,-no-as-needed -B/usr/bin/ -pie -Wl,-z,relro,-z,now -no-canonical-prefixes -pass-exit-codes '-Wl,--build-id=md5' '-Wl,--hash-style=gnu' -Wl,--gc-sections -Wl,@bazel-out/k8-py3-opt/bin/tensorflow/core/grappler/costs/utils_test-2.params)
bazel-out/k8-py3-opt/bin/_solib_local/libtensorflow_Score_Slibjpeg_Uinternal.so: undefined reference to `jpeg_abort'
bazel-out/k8-py3-opt/bin/_solib_local/libtensorflow_Score_Slibjpeg_Uinternal.so: undefined reference to `jpeg_set_defaults'
...

@allenlavoie
Member

Thanks @ruanjiandong! I guess that's not super surprising, but it was worth a try. So the options are still (1) split out the proto symbols so people don't need to link in libtensorflow_framework.so, or (2) move libjpeg into the language bindings / colocate it with the kernels. Possibly (2) is easier?

@ruanjiandong
Contributor

@allenlavoie, I took another look at the build failure. Those tests actually need the jpeg_* symbols from libjpeg.so, not from libtensorflow_framework.so. Without my change, bazel produces both static and dynamic libjpeg libraries for the test build.

I made a new change which uses an ld version script to selectively hide the jpeg symbols when linking libtensorflow_framework.so. With the new change, all the tests passed except for 3 grpc tests (related to my test environment). The new change works only on Linux; for OS X, I don't know how to selectively hide symbols using the -exported_symbols_list option.

I will create a pull request for the new change.
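Roughly, the idea looks like this (a sketch of the approach only, not the exact contents of the pull request):

    /* hypothetical jpeg_hide.lds -- keep libjpeg's symbols out of the dynamic symbol table */
    {
      local:
        jpeg_*;
      global:
        *;
    };

passed to the linker via something like -Wl,--version-script=jpeg_hide.lds in the linkopts of the libtensorflow_framework.so rule. GNU ld gives the bare * pattern lower precedence than more specific wildcards, so the jpeg_* symbols still end up local while everything else stays exported.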

@ruanjiandong
Contributor

@allenlavoie, could you please review pull request #19966?

rowillia pushed a commit to rowillia/protobuf that referenced this issue Feb 25, 2021
@gnossen gave a great overview in grpc/grpc#24992 of the overall problem.

A Python process that uses both protobuf _and_ another native library linking in libprotobuf can frequently crash. This seems to frequently affect tensorflow as well:

tensorflow/tensorflow#8394,
tensorflow/tensorflow#9525 (comment)
tensorflow/tensorflow#24976,
tensorflow/tensorflow#35573,
https://github.com/tensorflow/tensorflow/blob/v2.0.0/tensorflow/contrib/makefile/rename_protobuf.sh,
tensorflow/tensorflow#16104

Testing locally this fixes both crashes when linking in multiple versions of protobuf
and fixes `DescriptorPool` clashes as well (e.g. Python and Native code import different versions of the same message).
TeBoring pushed a commit to protocolbuffers/protobuf that referenced this issue Apr 24, 2021
acozzette pushed a commit to acozzette/protobuf that referenced this issue Jan 22, 2022
acozzette added a commit to protocolbuffers/protobuf that referenced this issue Jan 25, 2022