
Building (and using) libtensorflow.so #316

Closed
sebpuetz opened this issue Feb 9, 2019 · 26 comments
Labels: enhancement (New feature or request)

sebpuetz commented Feb 9, 2019


System information

  • OS Platform and Distribution: Linux Mint 19.1
  • TensorFlow installed from (source or binary): Source
  • TensorFlow version: 1.12
  • Python version: 3.6.7
  • Installed using virtualenv? pip? conda?: pyenv
  • Bazel version (if compiling from source): 0.16.0, 0.19.2 and 0.21.0
  • GCC/Compiler version (if compiling from source): 7.3.0
  • ROCm version: 2.1
  • GPU model and memory: Radeon VII, 16GB

Describe the problem
I want to use TensorFlow from Rust; to do so I need to build the libtensorflow.so shared library. Compilation succeeds on r1.12, but when I try to execute a graph I get a runtime exception (see the other info/logs section).

I don't encounter any issues with TensorFlow in Python; running a graph and training a model works like a charm there. That install, however, was not compiled from source but installed from PyPI.

Provide the exact sequence of commands / steps that you executed before running into the problem

Install Bazel 0.19.2 as recommended in #304
git clone -b r1.12-rocm [email protected]:ROCmSoftwarePlatform/tensorflow-upstream
cd tensorflow-upstream
./configure (answer n for everything except ROCm support)
bazel build --config=opt --config=rocm --action_env=HIP_PLATFORM=hcc tensorflow:libtensorflow.so

Any other info / logs

Runtime exception:

2019-02-09 11:14:33.291267: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1531] Found device 0 with properties: 
name: Device 66af
AMDGPU ISA: gfx906
memoryClockRate (GHz) 1.802
pciBusID 0000:28:00.0
Total memory: 15.98GiB
Free memory: 15.73GiB
2019-02-09 11:14:33.291334: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1642] Adding visible gpu devices: 0
2019-02-09 11:14:33.291371: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-09 11:14:33.291383: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1059]      0 
2019-02-09 11:14:33.291391: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1072] 0:   N 
2019-02-09 11:14:33.291489: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1189] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15306 MB memory) -> physical GPU (device: 0, name: Device 66af, pci bus id: 0000:28:00.0)
terminate called after throwing an instance of 'std::runtime_error'
  what():  Missing metadata for __global__ function: _ZN10tensorflow7functor28FillPhiloxRandomKernelLaunchINS_6random19UniformDistributionINS2_12PhiloxRandomEfEEEEvS4_PNT_17ResultElementTypeExS6_
[1]    11952 abort (core dumped)  LD_PRELOAD="/home/seb/.libtf/libtensorflow.so" 

hipconfig

HIP version  : 1.5.19025

== hipconfig
HIP_PATH     : /opt/rocm/hip
HIP_PLATFORM : hcc
CPP_CONFIG   :  -D__HIP_PLATFORM_HCC__=   -I/opt/rocm/hip/include -I/opt/rocm/hcc/include

== hcc
HSA_PATH     : /opt/rocm/hsa
HCC_HOME     : /opt/rocm/hcc
HCC clang version 8.0.0 (ssh://gerritgit/compute/ec/hcc-tot/clang 683c680a6bff215baa3bd9d3099ba1a43e24cf2e) (ssh://gerritgit/lightning/ec/llvm 6e349ce344586b4254654aea8f34444a13aedb67) (based on HCC 1.3.19045-fea3e2b-683c680-6e349ce )
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm/hcc/bin
LLVM (http://llvm.org/):
  LLVM version 8.0.0svn
  Optimized build.
  Default target: x86_64-unknown-linux-gnu
  Host CPU: znver1

  Registered Targets:
    amdgcn - AMD GCN GPUs
    r600   - AMD GPUs HD2XXX-HD6XXX
    x86    - 32-bit X86: Pentium-Pro and above
    x86-64 - 64-bit X86: EM64T and AMD64
HCC-cxxflags :  -hc -std=c++amp -I/opt/rocm/hcc/include
HCC-ldflags  :  -hc -std=c++amp -L/opt/rocm/hcc/lib -Wl,--rpath=/opt/rocm/hcc/lib -ldl -lm -lpthread -lhc_am -Wl,--whole-archive -lmcwamp -Wl,--no-whole-archive

=== Environment Variables
PATH=/opt/rocm/hcc/bin:/opt/rocm/hip/bin:/home/seb/.pyenv/shims:/home/seb/.cargo/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/opt/rocm/bin:/opt/rocm/profiler/bin:/opt/rocm/opencl/bin/x86_64:/home/seb/.pyenv/bin
LD_LIBRARY_PATH=/opt/rocm/opencl/lib
HIP_PATH=/opt/rocm/hip
HCC_HOME=/opt/rocm/hcc

== Linux Kernel
Hostname     : seb-desktop
Linux seb-desktop 4.15.0-45-generic #48-Ubuntu SMP Tue Jan 29 16:28:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
No LSB modules are available.
Distributor ID:	LinuxMint
Description:	Linux Mint 19.1 Tessa
Release:	19.1
Codename:	tessa

hcc --version

HCC clang version 8.0.0 (ssh://gerritgit/compute/ec/hcc-tot/clang 683c680a6bff215baa3bd9d3099ba1a43e24cf2e) (ssh://gerritgit/lightning/ec/llvm 6e349ce344586b4254654aea8f34444a13aedb67) (based on HCC 1.3.19045-fea3e2b-683c680-6e349ce )
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm/hcc/bin

rocminfo

=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (number of timestamp)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 7 2700X Eight-Core Processor
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0                                  
  Queue Min Size:          0                                  
  Queue Max Size:          0                                  
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768KB                            
  Chip ID:                 0                                  
  Cacheline Size:          64                                 
  Max Clock Frequency (MHz):3700                               
  BDFID:                   0                                  
  Compute Unit:            16                                 
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    49448920KB                         
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    49448920KB                         
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        TRUE                               
  ISA Info:                
    N/A                      
*******                  
Agent 2                  
*******                  
  Name:                    gfx906                             
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128                                
  Queue Min Size:          4096                               
  Queue Max Size:          131072                             
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16KB                               
  Chip ID:                 26287                              
  Cacheline Size:          64                                 
  Max Clock Frequency (MHz):1802                               
  BDFID:                   10240                              
  Compute Unit:            60                                 
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      FALSE                              
  Wavefront Size:          64                                 
  Workgroup Max Size:      1024                               
  Workgroup Max Size Per Dimension:
    Dim[0]:                  67109888                           
    Dim[1]:                  671089664                          
    Dim[2]:                  0                                  
  Grid Max Size:           4294967295                         
  Waves Per CU:            40                                 
  Max Work-item Per CU:    2560                               
  Grid Max Size per Dimension:
    Dim[0]:                  4294967295                         
    Dim[1]:                  4294967295                         
    Dim[2]:                  4294967295                         
  Max number Of fbarriers Per Workgroup:32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16760832KB                         
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64KB                               
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Acessible by all:        FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx906          
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Dimension: 
        Dim[0]:                  67109888                           
        Dim[1]:                  1024                               
        Dim[2]:                  16777217                           
      Workgroup Max Size:      1024                               
      Grid Max Dimension:      
        x                        4294967295                         
        y                        4294967295                         
        z                        4294967295                         
      Grid Max Size:           4294967295                         
      FBarrier Max Size:       32                                 
*** Done ***            
whchung (Collaborator) commented Feb 9, 2019

Hi, I'm wondering: could you try building _pywrap_tensorflow_internal.so as well?

In the current setup, libtensorflow.so won't contain any GPU kernels, which is why you are seeing this issue.

Also, could you shed some light on how Rust builds/links/invokes TensorFlow?

whchung self-assigned this Feb 9, 2019
sebpuetz (Author) commented Feb 9, 2019

Hi, thanks for your reply!

Hi I’m wondering could you try build _pywrap_tensorflow_internal.so as well?

I'm currently building with the command below; that's the closest I could find to _pywrap_tensorflow_internal.so. Once this goes through, what would you recommend doing with the built file?

bazel build --config=opt --config=rocm --action_env=HIP_PLATFORM=hcc tensorflow/python:pywrap_tensorflow_internal --verbose_failures

edit: The build was successful; I can import the library in Python without errors.

Unfortunately I'm neither experienced in using bazel nor in building tensorflow from sources.

Also could you shed some lights on how rust builds/links/invokes Tensorflow?

Rust calls the TensorFlow C API, so it requires building libtensorflow.so.

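To make the workflow concrete, here is a minimal sketch of how the Rust bindings are typically used, assuming the `tensorflow` crate and a graph serialized from Python as `model.pb` (the file name and the `input`/`output` op names are hypothetical, not from this issue). It only links if libtensorflow.so is available, which is exactly the artifact this issue is about:

```rust
use std::fs::File;
use std::io::Read;
use tensorflow::{Graph, ImportGraphDefOptions, Session, SessionOptions, SessionRunArgs, Tensor};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load a GraphDef that was defined and serialized on the Python side.
    let mut proto = Vec::new();
    File::open("model.pb")?.read_to_end(&mut proto)?;

    let mut graph = Graph::new();
    graph.import_graph_def(&proto, &ImportGraphDefOptions::new())?;

    // Creating the session is where libtensorflow.so initializes devices,
    // i.e. where the GPU log lines in this thread are emitted.
    let session = Session::new(&SessionOptions::new(), &graph)?;

    // Feed a tensor into the (hypothetical) "input" op and fetch "output".
    let input = Tensor::new(&[1, 2]).with_values(&[1.0f32, 2.0])?;
    let mut args = SessionRunArgs::new();
    args.add_feed(&graph.operation_by_name_required("input")?, 0, &input);
    let token = args.request_fetch(&graph.operation_by_name_required("output")?, 0);
    session.run(&mut args)?;

    let result: Tensor<f32> = args.fetch(token)?;
    println!("{:?}", &result[..]);
    Ok(())
}
```

The crash in this issue happens inside `session.run`, when the first GPU kernel is launched.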
whchung (Collaborator) commented Feb 9, 2019

Let me do some studying and get back to you. Thus far the applications I've encountered are either Python-based (works fine) or C++ applications built inside the TensorFlow tree (also fine, with some known limitations). For other high-level languages that use the TF C API, I'll need some more information.

Could you provide some pointers to the Rust binding for TensorFlow and how to build/use it?

sebpuetz (Author) commented Feb 9, 2019

The Rust bindings for TF can be found here, including the directions to build libtensorflow.so.

The general usage is to define a graph using the high-level methods in Python, serialize it, and execute it through the Rust bindings of the C API.

Thanks again for looking into this!

sunway513 added the enhancement (New feature or request) label Feb 9, 2019
whchung (Collaborator) commented Feb 11, 2019

libtensorflow.so does have GPU kernels inside; it's libtensorflow_framework.so that does not contain GPU kernels. My earlier comments were a bit incorrect.

I'll check the TensorFlow Rust binding and see how it loads libtensorflow.so. I now suspect it hits the "with some known limitations" caveat from my earlier comment about C++ applications.

We have an internal ticket tracking this and I'm working on the fix.
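Given the "Missing metadata" abort, a quick sanity check is whether the kernel symbol from the error message is even exported by the library that was built. This is a generic binutils check (not a step anyone in this thread ran); the library path is the one from the logs above and should be adjusted:

```shell
# Demangle the kernel name from the abort message (c++filt ships with binutils):
MANGLED='_ZN10tensorflow7functor28FillPhiloxRandomKernelLaunchINS_6random19UniformDistributionINS2_12PhiloxRandomEfEEEEvS4_PNT_17ResultElementTypeExS6_'
echo "$MANGLED" | c++filt

# Count occurrences of that kernel among the library's dynamic symbols
# (adjust the path to wherever libtensorflow.so was installed):
LIBTF="$HOME/.libtf/libtensorflow.so"
if [ -f "$LIBTF" ]; then
    nm -D "$LIBTF" | grep -c FillPhiloxRandomKernelLaunch
fi
```

If the symbol is present but the runtime still aborts, that points at kernel registration/metadata in the HIP runtime rather than at a kernel missing from the build, consistent with the fix described here.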

pricebenjamin commented:

Compilation goes through on r1.12 but when trying to execute the graph I get a runtime exception (see other info/logs section).

Runtime exception:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Missing metadata for __global__ function: _ZN10tensorflow7functor28FillPhiloxRandomKernelLaunchINS_6random19UniformDistributionINS2_12PhiloxRandomEfEEEEvS4_PNT_17ResultElementTypeExS6_
[1]    11952 abort (core dumped)  LD_PRELOAD="/home/seb/.libtf/libtensorflow.so" 

I've just encountered the same runtime error after building r1.12-rocm from source and attempting to execute a simple graph in Python (3.6.7).

import tensorflow as tf
x = tf.random.uniform(shape=(10, 10))

with tf.Session() as sess:
    print(sess.run(x))

whchung (Collaborator) commented Feb 12, 2019

@pricebenjamin Since you are using Python, please refer to this article on how to build TensorFlow for ROCm:
https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/rocm_docs/tensorflow-build-from-source.md

Once the pip package is built and installed, import tensorflow as tf will correctly load all GPU kernels within TensorFlow.

On the other hand, from your log it seems you want to run your application by other means:

[1]    11952 abort (core dumped)  LD_PRELOAD="/home/seb/.libtf/libtensorflow.so" 

Could you share more details on how you plan to load and run your TensorFlow-based applications?

We have an internal ticket tracking a bug in the HIP runtime: GPU kernels within shared libraries loaded after TensorFlow is initialized won't be properly identified. Should a fix be devised, it may help resolve the issue from @sebpuetz, but I'd like to understand your application better to see if it's the same issue.

sebpuetz (Author) commented Feb 13, 2019

I can confirm what @pricebenjamin describes. I built a wheel from r1.13-rocm-rc1 through the build_rocm_python3 script, installed it through pip, tried to run a graph in Python, and got the same error as above. Below is the example @pricebenjamin posted, with the output:

Python 3.6.7 (default, Feb 11 2019, 18:10:44) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.2.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import tensorflow as tf 
   ...: x = tf.random.uniform(shape=(10, 10)) 
   ...:  
   ...: with tf.Session() as sess: 
   ...:     print(sess.run(x)) 
   ...:                                                                                                                                    
2019-02-13 01:05:37.987912: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1531] Found device 0 with properties: 
name: Device 66af
AMDGPU ISA: gfx906
memoryClockRate (GHz) 1.802
pciBusID 0000:28:00.0
Total memory: 15.98GiB
Free memory: 15.73GiB
2019-02-13 01:05:37.987955: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1642] Adding visible gpu devices: 0
2019-02-13 01:05:37.987982: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 01:05:37.987990: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1059]      0 
2019-02-13 01:05:37.987996: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1072] 0:   N 
2019-02-13 01:05:37.988057: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1189] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15306 MB memory) -> physical GPU (device: 0, name: Device 66af, pci bus id: 0000:28:00.0)
terminate called after throwing an instance of 'std::runtime_error'
  what():  Missing metadata for __global__ function: _ZN10tensorflow7functor28FillPhiloxRandomKernelLaunchINS_6random19UniformDistributionINS2_12PhiloxRandomEfEEEEvS4_PNT_17ResultElementTypeExS6_
[1]    12486 abort (core dumped)  ipython

I am building a wheel from r1.12-rocm now to see if it's related to r1.13-rocm-rc1 and will edit this post with the results.

edit: building on r1.12-rocm leads to the same behaviour:

Python 3.6.7 (default, Feb 11 2019, 18:10:44) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.2.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import tensorflow as tf 
   ...: x = tf.random.uniform(shape=(10, 10)) 
   ...:  
   ...: with tf.Session() as sess: 
   ...:     print(sess.run(x)) 
   ...:                                                                                                                                                                                        
WARNING:tensorflow:From /home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/distributions/distribution.py:265: ReparameterizationType.__init__ (from tensorflow.python.ops.distributions.distribution) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
WARNING:tensorflow:From /home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/distributions/bernoulli.py:169: RegisterKL.__init__ (from tensorflow.python.ops.distributions.kullback_leibler) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
2019-02-13 01:49:49.209993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1530] Found device 0 with properties: 
name: Device 66af
AMDGPU ISA: gfx906
memoryClockRate (GHz) 1.802
pciBusID 0000:28:00.0
Total memory: 15.98GiB
Free memory: 15.73GiB
2019-02-13 01:49:49.210018: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Adding visible gpu devices: 0
2019-02-13 01:49:49.210035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 01:49:49.210041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1057]      0 
2019-02-13 01:49:49.210046: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1070] 0:   N 
2019-02-13 01:49:49.210077: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1189] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15306 MB memory) -> physical GPU (device: 0, name: Device 66af, pci bus id: 0000:28:00.0)
terminate called after throwing an instance of 'std::runtime_error'
  what():  Missing metadata for __global__ function: _ZN10tensorflow7functor28FillPhiloxRandomKernelLaunchINS_6random19UniformDistributionINS2_12PhiloxRandomEfEEEEvS4_PNT_17ResultElementTypeExS6_
[1]    28830 abort (core dumped)  ipython

pricebenjamin commented:

@whchung I've just rebuilt the package in a Docker container using a minimally modified version of Dockerfile.rocm. The major differences are:

  1. FROM ubuntu:bionic instead of FROM ubuntu:xenial
  2. clang-3.9 instead of clang-3.8
  3. Did not add external repository ppa:george-edison55/cmake-3.x (no bionic release)
  4. Removed installation of nvinfer from install_deb_packages.sh
  5. Skipped easy_install of pip in install_pip_packages.sh

I'm able to build and install the wheel successfully, but the same runtime error is thrown. To be clear, the tensorflow-rocm package from PyPI works just fine, and the xenial-based Docker image also works fine. I think the issues @sebpuetz and I are running into have the same root cause.
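The modifications listed above, sketched as a hypothetical fragment relative to the upstream Dockerfile.rocm (not the actual file, which lives in the repository linked below):

```dockerfile
# 1. bionic base instead of xenial
FROM ubuntu:bionic

# 2./3. clang-3.9 instead of clang-3.8; no ppa:george-edison55/cmake-3.x,
#       since bionic's own repositories already ship cmake >= 3.10
RUN apt-get update && apt-get install -y clang-3.9 cmake

# 4. install_deb_packages.sh edited to drop the nvinfer installation
# 5. install_pip_packages.sh edited to skip the easy_install of pip
```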

whchung (Collaborator) commented Feb 13, 2019

@pricebenjamin is it possible to share your Dockerfile somewhere so I can reproduce it?

I'm currently working on #318, which changes how ROCm components get loaded by TensorFlow, to cope with upcoming changes in future versions of TensorFlow. I'm now at the point where the issue in this ticket is blocking me, so I'm actively looking into getting a fix implemented. Please stay tuned.

pricebenjamin commented Feb 14, 2019

@whchung Sure thing.

Edit: Hopefully you saw the hyperlink there. I just realized that it's not obvious on some screens: https://github.com/pricebenjamin/tf-bionic-rocm-docker

whchung (Collaborator) commented Feb 15, 2019

The HIP runtime has to be changed, and all user-level ROCm components, including TensorFlow ROCm, would need to be rebuilt.

A work-in-progress branch is at: https://github.com/ROCm-Developer-Tools/HIP/compare/feature_maybe_dlopen . It doesn't solve all the corner cases yet and we're still working on it.

whchung (Collaborator) commented Feb 20, 2019

The branch in the HIP runtime is maturing and passing some initial tests. Checking the Rust binding now.

sebpuetz (Author) commented:

Thanks for the update!

whchung (Collaborator) commented Feb 20, 2019

@sebpuetz I'm new to Rust; could you check if this looks right to you:

$ cargo test

    Finished dev [unoptimized + debuginfo] target(s) in 0.08s
     Running target/debug/deps/tensorflow-20425da1cb4cec5b

running 37 tests
test buffer::tests::basic ... ok
test io::tests::writer_identical_to_python ... ok
test tests::tensor_clone ... ok
test tests::tensor_eq ... ok
test tests::test_bfloat16 ... ok
test tests::tensor_display ... ok
test tests::test_set_target ... ok
test tests::test_set_config ... ok
test tests::test_tensor ... ok
test tests::test_tensor_native_type_zero ... ok
test tests::test_get_registered_kernels_for_op ... ok
2019-02-20 21:20:18.768873: I tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: test_resources/regression-model
test graph::tests::graph_get_op_def ... ok
test graph::tests::graph_versions ... ok
test graph::tests::test_get_tensor_shape ... ok
test graph::tests::smoke ... ok
test graph::tests::graph_generate_operation_name ... ok
test graph::tests::test_import_graph_def ... ok
test graph::tests::import_graph_def_results_missing_unused_input_mappings ... ok
2019-02-20 21:20:18.772631: I tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve train }
2019-02-20 21:20:18.775770: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1562] Found device 0 with properties:
name: Device 66a1
AMDGPU ISA: gfx906
memoryClockRate (GHz) 1.701
pciBusID 0000:09:00.0
Total memory: 31.98GiB
Free memory: 31.73GiB
2019-02-20 21:20:18.776075: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1562] Found device 1 with properties:
name: Device 66a1
AMDGPU ISA: gfx906
memoryClockRate (GHz) 1.701
pciBusID 0000:0c:00.0
Total memory: 31.98GiB
Free memory: 31.73GiB
2019-02-20 21:20:18.776166: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1673] Adding visible gpu devices: 0, 1
2019-02-20 21:20:18.776218: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-20 21:20:18.776235: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1090]      0 1
2019-02-20 21:20:18.776249: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] 0:   N Y
2019-02-20 21:20:18.776262: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] 1:   Y N
2019-02-20 21:20:18.776400: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1220] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30871 MB memory) -> physical GPU (device: 0, name: Device 66a1, pci bus id: 0000:09:00.0)
test graph::tests::import_graph_def_results_return_operations ... ok
test graph::tests::import_graph_def_results_return_outputs ... ok
test graph::tests::operation_attributes ... ok
test graph::tests::import_graph_def_uniquify_prefix ... ok
test graph::tests::import_graph_def_uniquify_names ... ok
test graph::tests::graph_to_function ... ok
test while_loop::tests::generated_name_while ... ok
test graph::tests::graph_add_gradients ... ok
test tests::test_get_all_registered_kernels ... ok
2019-02-20 21:20:18.821814: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1220] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30871 MB memory) -> physical GPU (device: 1, name: Device 66a1, pci bus id: 0000:0c:00.0)
2019-02-20 21:20:18.869854: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1673] Adding visible gpu devices: 0, 1
2019-02-20 21:20:18.869988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-20 21:20:18.870004: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1090]      0 1
2019-02-20 21:20:18.870015: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] 0:   N Y
2019-02-20 21:20:18.870025: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] 1:   Y N
2019-02-20 21:20:18.870105: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1220] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30871 MB memory) -> physical GPU (device: 0, name: Device 66a1, pci bus id: 0000:09:00.0)
2019-02-20 21:20:18.870340: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1220] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30871 MB memory) -> physical GPU (device: 1, name: Device 66a1, pci bus id: 0000:0c:00.0)
2019-02-20 21:20:18.871007: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1673] Adding visible gpu devices: 0, 1
2019-02-20 21:20:18.871110: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-20 21:20:18.871129: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1090]      0 1
2019-02-20 21:20:18.871143: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] 0:   N Y
2019-02-20 21:20:18.871154: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] 1:   Y N
2019-02-20 21:20:18.871231: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1220] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30871 MB memory) -> physical GPU (device: 0, name: Device 66a1, pci bus id: 0000:09:00.0)
2019-02-20 21:20:18.871777: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1220] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30871 MB memory) -> physical GPU (device: 1, name: Device 66a1, pci bus id: 0000:0c:00.0)
2019-02-20 21:20:18.872197: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1673] Adding visible gpu devices: 0, 1
2019-02-20 21:20:18.872273: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-20 21:20:18.872306: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1090]      0 1
2019-02-20 21:20:18.872319: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] 0:   N Y
2019-02-20 21:20:18.872331: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] 1:   Y N
2019-02-20 21:20:18.872394: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1220] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30871 MB memory) -> physical GPU (device: 0, name: Device 66a1, pci bus id: 0000:09:00.0)
2019-02-20 21:20:18.872728: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1220] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30871 MB memory) -> physical GPU (device: 1, name: Device 66a1, pci bus id: 0000:0c:00.0)
2019-02-20 21:20:18.873065: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1673] Adding visible gpu devices: 0, 1
2019-02-20 21:20:18.873135: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-20 21:20:18.873159: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1090]      0 1
2019-02-20 21:20:18.873171: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] 0:   N Y
2019-02-20 21:20:18.873181: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] 1:   Y N
2019-02-20 21:20:18.873264: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1220] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30871 MB memory) -> physical GPU (device: 0, name: Device 66a1, pci bus id: 0000:09:00.0)
2019-02-20 21:20:18.873890: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1220] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30871 MB memory) -> physical GPU (device: 1, name: Device 66a1, pci bus id: 0000:0c:00.0)
test tests::smoke ... ok
test tests::test_close ... ok
test session::tests::test_close ... ok
2019-02-20 21:20:18.874284: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1673] Adding visible gpu devices: 0, 1
2019-02-20 21:20:18.874342: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-20 21:20:18.874367: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1090]      0 1
2019-02-20 21:20:18.874379: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] 0:   N Y
2019-02-20 21:20:18.874388: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] 1:   Y N
2019-02-20 21:20:18.874443: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1220] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30871 MB memory) -> physical GPU (device: 0, name: Device 66a1, pci bus id: 0000:09:00.0)
2019-02-20 21:20:18.874666: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1220] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30871 MB memory) -> physical GPU (device: 1, name: Device 66a1, pci bus id: 0000:0c:00.0)
2019-02-20 21:20:18.874945: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1673] Adding visible gpu devices: 0, 1
2019-02-20 21:20:18.875010: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-20 21:20:18.875032: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1090]      0 1
2019-02-20 21:20:18.875044: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] 0:   N Y
2019-02-20 21:20:18.875055: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] 1:   Y N
2019-02-20 21:20:18.875130: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1220] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30871 MB memory) -> physical GPU (device: 0, name: Device 66a1, pci bus id: 0000:09:00.0)
test session::tests::smoke ... ok
2019-02-20 21:20:18.875431: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1220] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30871 MB memory) -> physical GPU (device: 1, name: Device 66a1, pci bus id: 0000:0c:00.0)
2019-02-20 21:20:18.875794: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1673] Adding visible gpu devices: 0, 1
2019-02-20 21:20:18.875862: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-20 21:20:18.875884: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1090]      0 1
2019-02-20 21:20:18.875897: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] 0:   N Y
2019-02-20 21:20:18.875913: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] 1:   Y N
2019-02-20 21:20:18.875975: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1220] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30871 MB memory) -> physical GPU (device: 0, name: Device 66a1, pci bus id: 0000:09:00.0)
2019-02-20 21:20:18.876195: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1220] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30871 MB memory) -> physical GPU (device: 1, name: Device 66a1, pci bus id: 0000:0c:00.0)
2019-02-20 21:20:18.876509: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1673] Adding visible gpu devices: 0, 1
2019-02-20 21:20:18.876581: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-20 21:20:18.876604: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1090]      0 1
2019-02-20 21:20:18.876617: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] 0:   N Y
2019-02-20 21:20:18.876632: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] 1:   Y N
2019-02-20 21:20:18.876696: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1220] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30871 MB memory) -> physical GPU (device: 0, name: Device 66a1, pci bus id: 0000:09:00.0)
2019-02-20 21:20:18.876985: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1220] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30871 MB memory) -> physical GPU (device: 1, name: Device 66a1, pci bus id: 0000:0c:00.0)
2019-02-20 21:20:18.877416: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1673] Adding visible gpu devices: 0, 1
2019-02-20 21:20:18.877479: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-20 21:20:18.877498: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1090]      0 1
2019-02-20 21:20:18.877512: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] 0:   N Y
2019-02-20 21:20:18.877524: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] 1:   Y N
2019-02-20 21:20:18.877580: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1220] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30871 MB memory) -> physical GPU (device: 0, name: Device 66a1, pci bus id: 0000:09:00.0)
test session::tests::test_device_list ... ok
2019-02-20 21:20:18.877889: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1220] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30871 MB memory) -> physical GPU (device: 1, name: Device 66a1, pci bus id: 0000:0c:00.0)
2019-02-20 21:20:18.881304: I tensorflow/cc/saved_model/loader.cc:201] Restoring SavedModel bundle.
test tests::test_strings ... ok
2019-02-20 21:20:24.769851: I tensorflow/cc/saved_model/loader.cc:310] SavedModel load for tags { serve train }; Status: success. Took 6001352 microseconds.
test while_loop::tests::simple_while ... ok
test session::tests::test_run ... ok
test tests::test_run ... ok
test session::tests::test_savedmodelbundle ... ok

test result: ok. 37 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

   Doc-tests tensorflow

running 2 tests
test src/session.rs - session::SessionRunArgs (line 243) ... ignored
test src/lib.rs - Tensor<T>::with_values (line 1132) ... ok

test result: ok. 1 passed; 0 failed; 1 ignored; 0 measured; 0 filtered out

@sebpuetz (Author)

@whchung
This looks fine, but I'm not sure which of these lines show ops actually executing on the GPU and which only show TF initializing and defining a graph. I think the easiest way to verify that the GPU kernels work would be:

cargo build --examples
./target/debug/examples/regression

This might then result in the following error:

./target/debug/examples/regression: error while loading shared libraries: libtensorflow.so: cannot open shared object file: No such file or directory

which can be mitigated by:

LD_PRELOAD=/path/to/libtensorflow.so ./target/debug/examples/regression
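
An alternative that avoids preloading on every invocation is to put the directory that contains libtensorflow.so on the dynamic loader's search path. This is only a sketch: the bazel output path below is hypothetical, so adjust it to wherever your build actually put the library.

```shell
# Hypothetical bazel output path; point this at the directory that
# actually contains libtensorflow.so in your checkout.
export LD_LIBRARY_PATH="$HOME/tensorflow-upstream/bazel-bin/tensorflow:$LD_LIBRARY_PATH"

# The example binary then resolves the library at load time,
# without needing LD_PRELOAD:
if [ -x ./target/debug/examples/regression ]; then
  ./target/debug/examples/regression
fi
```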

examples/regression reads a graph that was defined in python from examples/regression/model.pb and runs the graph in a way that's similar to my use case.
Again, thanks!

@whchung (Collaborator) commented Feb 21, 2019

@sebpuetz my environment got a bit mixed up, so it took me some time to rebuild everything:

$ ./target/debug/examples/regression
2019-02-21 15:52:40.505504: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1562] Found device 0 with properties: 
name: Device 66a1
AMDGPU ISA: gfx906
memoryClockRate (GHz) 1.701
pciBusID 0000:09:00.0
Total memory: 31.98GiB
Free memory: 31.73GiB
2019-02-21 15:52:40.505669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1562] Found device 1 with properties: 
name: Device 66a1
AMDGPU ISA: gfx906
memoryClockRate (GHz) 1.701
pciBusID 0000:0c:00.0
Total memory: 31.98GiB
Free memory: 31.73GiB
2019-02-21 15:52:40.505761: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1673] Adding visible gpu devices: 0, 1
2019-02-21 15:52:40.505811: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-21 15:52:40.505828: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1090]      0 1 
2019-02-21 15:52:40.505842: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] 0:   N Y 
2019-02-21 15:52:40.505855: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] 1:   Y N 
2019-02-21 15:52:40.505976: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1220] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30871 MB memory) -> physical GPU (device: 0, name: Device 66a1, pci bus id: 0000:09:00.0)
2019-02-21 15:52:40.547319: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1220] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30871 MB memory) -> physical GPU (device: 1, name: Device 66a1, pci bus id: 0000:0c:00.0)
Checking w: expected 0.1, got 0.10000001. Success!
Checking b: expected 0.3, got 0.3. Success!

also tried examples/addition, examples/regression, examples/regression_checkpoint, examples/regression_savedmodel and they all seem to execute properly.

Since everything above HIP has to be rebuilt from source (HIP / rocRAND / rocFFT / rocBLAS / MIOpen / TensorFlow), and the change in HIP is still under validation for other ROCm applications, we plan to release this fix in ROCm 2.3, scheduled for (late?) March 2019. Meanwhile, I'm wondering if it's possible to give you a docker container image so you can validate the fix on your end?

@sebpuetz (Author)

> also tried examples/addition, examples/regression, examples/regression_checkpoint, examples/regression_savedmodel and they all seem to execute properly.

Nice!

> Since everything above HIP has to be rebuilt from source (HIP / rocRAND / rocFFT / rocBLAS / MIOpen / TensorFlow), and the change in HIP is still under validation for other ROCm applications, we plan to release this fix in ROCm 2.3, scheduled for (late?) March 2019. Meanwhile, I'm wondering if it's possible to give you a docker container image so you can validate the fix on your end?

Sure, I could check if my program works with the fix. Although I might run into #325 along the way ;)

@whchung (Collaborator) commented Feb 21, 2019

@sebpuetz could you try this tag on dockerhub:

whchung/hiptensorflow:rocm2.1-tf1.12-python3-dynamic-load-dev

I'm still pushing the tag so you'll need to wait a bit until it appears on dockerhub.

Note that you'll need ROCm 2.1 installed on your bare-metal host.

I run it with:

alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $HOME:/data'
drun whchung/hiptensorflow:rocm2.1-tf1.12-python3-dynamic-load-dev

The TensorFlow Rust binding is stored under ~root/rust.

@sebpuetz (Author)

I'll probably get around to checking this tomorrow; I want to switch to the fully supported Ubuntu 18.04 before testing things.

@sebpuetz (Author)

Everything loads properly, everything gets initialized correctly, and training works! Thanks!

@whchung (Collaborator) commented Feb 22, 2019

@pricebenjamin I'm wondering if you could also give the docker container a shot?

whchung/hiptensorflow:rocm2.1-tf1.12-python3-dynamic-load-dev

@pricebenjamin

@whchung Pulling now. I'll test it some time this afternoon.

@pricebenjamin

@whchung The issue appears to be resolved in the develop-upstream branch. I was able to build and run the wheel without issue within your container and within my own Ubuntu 18.04 container.

@sebpuetz (Author) commented Apr 13, 2019

@whchung just wondering whether the changes shipped with ROCm 2.3? I installed 2.3 and tried to compile libtensorflow.so on the r1.13-rocm branch, but I'm still getting the exception:

terminate called after throwing an instance of 'std::exception'
  what():  std::exception
[1]    7667 abort (core dumped)

I was able to execute a graph by copying the libtensorflow.so and libtensorflow_framework.so from the docker image you provided earlier. How were you able to build the shared library?

Thanks in advance!

@sebpuetz (Author) commented Apr 13, 2019

edit: Building in the rocm2.3-tf1.13-python3 docker container worked.

Compiling the TensorFlow wheel for Python also doesn't work on that branch. I replaced the distribution's Python path in build_rocm_python3 with PYTHON_BIN_PATH=$(which python), since I don't want to touch my distribution's Python.

cd tensorflow-upstream
./build_rocm_python3

This failed because it tried to install the wheel for Python 3.5:

Sa 13. Apr 22:19:21 CEST 2019 : === Output wheel file is in: /tmp/tensorflow_pkg
Requirement '/tmp/tensorflow_pkg/tensorflow-1.13.2-cp35-cp35m-linux_x86_64.whl' looks like a filename, but the file does not exist
tensorflow-1.13.2-cp35-cp35m-linux_x86_64.whl is not a supported wheel on this platform.

Installed the .whl manually by:

pip install /tmp/tensorflow_pkg/tensorflow-1.13.2-cp36-cp36m-linux_x86_64.whl
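
For reference, the cpXY pair in the wheel filename is the CPython version tag of the interpreter the build used, which is why the cp35 wheel was rejected on a 3.6 interpreter. A quick generic check of the tag your active interpreter matches (nothing here is specific to the ROCm build scripts):

```shell
# Print the CPython version tag (e.g. cp36) of the interpreter on PATH;
# pip refuses wheels whose cpXY tag differs from this.
python3 -c 'import sys; print("cp%d%d" % (sys.version_info[0], sys.version_info[1]))'
```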

Trying to import tensorflow then throws an exception in Python:

  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/imp.py", line 243, in load_module
    return load_dynamic(name, filename, file)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/imp.py", line 343, in load_dynamic
    return _load(spec)
ImportError: libtensorflow_framework.so: undefined symbol: _ZN8hip_impl8hash_forB5cxx11EP12ihipModule_t


Failed to load the native TensorFlow runtime.

See https://www.tensorflow.org/install/errors

for some common reasons and solutions.  Include the entire stack trace
above this error message when asking for help.
