
Add Lookup Op and unit tests. #3

Merged
merged 1 commit into from
Sep 23, 2019

Conversation

sebpuetz (Member)

This PR adds functionality to do embedding lookups in TensorFlow.

@sebpuetz sebpuetz requested a review from danieldk September 19, 2019 12:17
@sebpuetz (Member Author)

Not sure what's going on here. The OS X build succeeds and local builds work, but the Linux CI build breaks down with a segfault in the graph-mode test.

@sebpuetz (Member Author)

Apparently there's a mismatch between tensorflow-rocm and tensorflow. I also get the segfault when I install tensorflow via pip instead of tensorflow-rocm.

std::vector<float> embedding = lookup->embedding(query(i));
// optionally mask failed lookups and/or the empty string. Generally, the empty string will lead to a failed lookup.
if ((query(i).empty() && mask_empty_string_) || (mask_failed_lookup_ && embedding.empty())) {
std::memset(&output_flat(i * dims), 0., dims * 4);
danieldk (Member)

I guess 4 is supposed to be sizeof(float)? If so, put that here ;).

Also, is the data guaranteed to be row-major? (I guess so)

sebpuetz (Member Author)

> I guess 4 is supposed to be sizeof(float)? If so, put that here ;).

Fixed

> Also, is the data guaranteed to be row-major? (I guess so)

The Eigen documentation is not too nice; I spent some time looking for suggestions on how to move data from vec(s) into tensors, and the recommended way was using copy_n or memcpy.

I found this SO thread from 3 years ago that claims tf data is always row-major:

https://stackoverflow.com/questions/37645308/determine-memory-layout-colmajor-rowmajor-in-tensorflow-custom-op

@danieldk danieldk closed this Sep 19, 2019
@danieldk danieldk reopened this Sep 19, 2019
@danieldk (Member)

Whoops, pressed a button ;).

@sebpuetz (Member Author)

If I comment out the SetShapeFn in ops/FFLookupOps.cc, the segfault goes away. So it looks like something's messed up there.

@sebpuetz (Member Author)

Following this description, it's possible to get shapes through ::tensorflow::shape_inference::InferenceContext. Unfortunately, any usage of c in the following leads to segfaults:

    .SetShapeFn([](::tensorflow::shape_inference::InferenceContext* c) {
      ShapeHandle strings_shape = c->input(1);
      ShapeHandle output_shape;
      int embedding_len;
      TF_RETURN_IF_ERROR(c->GetAttr("embedding_len", &embedding_len));
      TF_RETURN_IF_ERROR(
        c->Concatenate(strings_shape, c->Vector(embedding_len), &output_shape)
      );
      ShapeHandle embeds = c->output(0);
      TF_RETURN_IF_ERROR(c->WithRank(c->input(0), 0, &embeds));
      c->set_output(0, output_shape);
      return Status::OK();
    });

@sebpuetz (Member Author)

Per tensorflow/tensorflow#29951 (comment), downgrading the compiler to g++ 4.8 supposedly fixes it.

@sebpuetz sebpuetz force-pushed the lookup_op branch 5 times, most recently from 9c26671 to 453e4b9 Compare September 19, 2019 19:37
@sebpuetz (Member Author)

sebpuetz commented Sep 19, 2019

A maintainer in the thread suggested following the instructions at https://github.com/tensorflow/custom-op, which means using a Docker image with the correct environment. That image ships with gcc-4.8, which doesn't support make_unique yet. This is quite confusing, since they state that these are the environments under which the official wheels are built, yet the tf codebase contains quite a lot of make_uniques. I'm somewhat confused by this whole situation.

It might make sense to restructure the project and follow the custom-op example. Although that wouldn't solve my confusion about the compiler version / c++ features either.

It would be great if you could also take a quick look at this and give your opinion!

edit: Seems to work both in the custom-op docker container and on Ubuntu with gcc 4.8.4, using set(CMAKE_CXX_STANDARD 11) and replacing the make_uniques with unique_ptr<T>(new T). Making this work on CI requires changing finalfusion-cxx accordingly...

@danieldk (Member)

danieldk commented Sep 20, 2019

> That docker comes with gcc-4.8

That is old. But they probably want to build with an old glibc/gcc for compatibility with older distributions.

> which doesn't support make_unique yet. Which is quite confusing to me since they state that these are the environments under which the official wheels are built, but the tf codebase contains quite a lot of make_uniques. I'm somewhat confused by this whole situation.

They probably get make_unique from abseil:

https://abseil.io/

> It might make sense to restructure the project and follow the custom-op example. Although that wouldn't solve my confusion about the compiler version / c++ features either.

The ABI between compiler versions is not guaranteed to be stable. So, if Tensorflow is compiled with a different version than an op, there may be subtle ABI incompatibilities. You generally don't notice. If you are lucky, you get segmentation faults; if you are unlucky, there is silent data corruption. This is why Rust prefers static linking of units compiled with the same compiler version.

At any rate, when you compile with a newer g++, you could try to force an older ABI with -fabi-version. According to

https://gcc.gnu.org/onlinedocs/gcc/C_002b_002b-Dialect-Options.html

you want -fabi-version=7. Of course, assuming that Tensorflow wasn't compiled with a different ABI version. If setting the ABI version works, I think this is vastly preferred over complicating builds with Docker images (which just plain sucks) or ancient Linux distributions (though Ubuntu 18.04 is not ancient, of course).

@sebpuetz (Member Author)

> The ABI between compiler versions is not guaranteed to be stable. So, if Tensorflow is compiled with a different version than an op, there may be subtle ABI incompatibilities. You generally don't notice. If you are lucky, you get segmentation faults; if you are unlucky, there is silent data corruption. This is why Rust prefers static linking of units compiled with the same compiler version.
>
> At any rate, when you compile with a newer g++, you could try to force an older ABI with -fabi-version. According to
>
> https://gcc.gnu.org/onlinedocs/gcc/C_002b_002b-Dialect-Options.html
>
> you want -fabi-version=7. Of course, assuming that Tensorflow wasn't compiled with a different ABI version. If setting the ABI version works, I think this is vastly preferred over complicating builds with Docker images (which just plain sucks) or ancient Linux distributions (though Ubuntu 18.04 is not ancient, of course).

There already is a CXX11_ABI_FLAG that we are setting. Without setting this to the correct value (we get that from the python package), symbols from libtensorflow_framework aren't found.

I tried setting various -fabi-versions, but none solve the segfaulting. The only approach working so far was using g++-4.8, which worked on both Ubuntu 18.04 and 14.04. Is there any obvious downside to this?

The custom op repo also states that newer (manylinux2010-compatible) wheels are built with Ubuntu 16.04 docker images, which comes with g++-5.4.

Fwiw, not setting the ABI in cmake should allow people to compile against self-compiled tf packages with a compiler of their choice.

@sebpuetz (Member Author)

This AWS project also pins gcc to 4.8 to build their custom ops: https://github.com/aws/sagemaker-tensorflow-extensions

@danieldk (Member)

> There already is a CXX11_ABI_FLAG that we are setting. Without setting this to the correct value (we get that from the python package), symbols from libtensorflow_framework aren't found.

That's another knob. This sets _GLIBCXX_USE_CXX11_ABI, which selects the newer C++11-compatible declarations of some standard-library classes. This changes the C++ standard library API and thus the ABI.

So, the library can change the ABI and the compiler can change the ABI (e.g. by changing alignment preferences).

> I tried setting various -fabi-versions, but none solve the segfaulting.

I think it is standard for Google to compile C++ without exceptions, so besides setting the ABI you might also want to add -fno-exceptions.

> Fwiw, not setting the ABI in cmake should allow people to compile against self-compiled tf packages with a compiler of their choice.

It could just be a CMake option.

@danieldk (Member)

danieldk commented Sep 20, 2019

More info on the upstream library:

$ readelf -p .comment /nix/store/s27l1m937bsw60fcdv3cabzbcdy4bpy4-python3.7-tensorflow-1.14.0/lib/python3.7/site-packages/tensorflow/libtensorflow_framework.so.1

String dump of section '.comment':
  [     1]  GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609

So, they are actually using gcc 5.4.0. (Which confirms the statement.) Did you try to compile on CI with 5.4 and no additional flags?

@sebpuetz (Member Author)

sebpuetz commented Sep 20, 2019

That's different for me:

readelf -p .comment /home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/libtensorflow_framework.so.1 

String dump of section '.comment':
  [     1]  GCC: (Ubuntu 4.8.5-4ubuntu8~14.04.2) 4.8.5

5.4 is what is supposedly used for packages built after August 1st, 2019. I haven't tried 5.4.0 yet.

@danieldk (Member)

> That's different for me:
>
>     readelf -p .comment /home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/libtensorflow_framework.so.1
>
>     String dump of section '.comment':
>       [     1]  GCC: (Ubuntu 4.8.5-4ubuntu8~14.04.2) 4.8.5
>
> 5.4 is what is supposedly used for packages built after August 1st, 2019. I haven't tried 5.4.0 yet.

Maybe they used different build environments for different Python versions? You are on 3.6, while I am on 3.7.

@sebpuetz (Member Author)

> I tried setting various -fabi-versions, but none solve the segfaulting.
>
> I think it is standard for Google to compile C++ without exceptions, so besides setting the ABI you might also want to add -fno-exceptions.

We're throwing an exception in the constructor if constructing the embeddings in Rust fails. So with that flag we can't build unless we introduce some Status type.

From what I can tell there is also no such flag set in the custom op example repo: https://github.com/tensorflow/custom-op/blob/master/Makefile

TF_CFLAGS := $(shell $(PYTHON_BIN_PATH) -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_compile_flags()))')
TF_LFLAGS := $(shell $(PYTHON_BIN_PATH) -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_link_flags()))')

CFLAGS = ${TF_CFLAGS} -fPIC -O2 -std=c++11
LDFLAGS = -shared ${TF_LFLAGS}

ZERO_OUT_TARGET_LIB = tensorflow_zero_out/python/ops/_zero_out_ops.so
# zero_out op for CPU
zero_out_op: $(ZERO_OUT_TARGET_LIB)

$(ZERO_OUT_TARGET_LIB): $(ZERO_OUT_SRCS)
	$(CXX) $(CFLAGS) -o $@ $^ ${LDFLAGS}

> Maybe they used different build environments for different Python versions? You are on 3.6, while I am on 3.7.

I verified that: 3.7.3 is built on 16.04 with g++-5.4.

@danieldk (Member)

So the strictly correct way would probably be to build for Python 3.6 using gcc 4.8 and for Python 3.7 using gcc 5.4. (There are two compiler ABI changes in between; not sure whether they matter in this case.)

@sebpuetz sebpuetz force-pushed the lookup_op branch 7 times, most recently from 6235ea5 to 0c5e049 Compare September 20, 2019 13:36
@sebpuetz (Member Author)

I found some more info on this at tensorflow/tensorflow#27067 and tensorflow/community#77. For now, matching the compiler version and flags seems like the only way to guarantee working builds.

We can get the compiler version from the python package, too:

import tensorflow as tf
tf.version.COMPILER_VERSION

returns the compiler version string, so I guess we can use that to select the correct compiler, too.

With tensorflow/community#77 it would be possible to build packages independent of what was used to compile pip tensorflow.

@sebpuetz (Member Author)

sebpuetz commented Sep 22, 2019

> I found some more info on this at tensorflow/tensorflow#27067 and tensorflow/community#77. For now, matching the compiler version and flags seems like the only way to guarantee working builds.
>
> We can get the compiler version from the python package, too:
>
>     import tensorflow as tf
>     tf.version.COMPILER_VERSION
>
> returns the compiler version string, so I guess we can use that to select the correct compiler, too.
>
> With tensorflow/community#77 it would be possible to build packages independent of what was used to compile pip tensorflow.

I put together a setup.py by following what Uber is doing with horovod (https://github.com/horovod/horovod/blob/master/setup.py): they set a minimum compiler version through tf.version.COMPILER_VERSION and an optional maximum version by checking whether the major version of COMPILER_VERSION is 4. They then select the highest compatible gcc/g++ version found in $PATH and export those as CC/CXX. If either CC or CXX is exported at the invocation of setup.py, the explicitly chosen compiler is used instead.

That'll be part of a future PR, so I think the only question for this PR is how to set up the CI and whether what I did is correct for our purposes.

@danieldk (Member)

> I put together a setup.py by following what Uber is doing with horovod (https://github.com/horovod/horovod/blob/master/setup.py): they set a minimum compiler version through tf.version.COMPILER_VERSION and an optional maximum version by checking whether the major version of COMPILER_VERSION is 4. They then select the highest compatible gcc/g++ version found in $PATH and export those as CC/CXX. If either CC or CXX is exported at the invocation of setup.py, the explicitly chosen compiler is used instead.

Sounds good, especially with the possibility to override the compiler.

> That'll be part of a future PR, so I think the only question for this PR is how to set up the CI and whether what I did is correct for our purposes.

I think for this PR then, it would be good enough to use a fixed compiler version (corresponding to what the Python module needs).

@sebpuetz (Member Author)

> I think for this PR then, it would be good enough to use a fixed compiler version (corresponding to what the Python module needs).

This PR now has two Linux builds, one for Python 3.6 and one for 3.7. For the 3.6 build I export g++-4.8 as CXX; for 3.7 I just leave the default compiler (7.4), since there's no ABI incompatibility. I think setting the compiler version explicitly is fine for building the shared library; we just might want to document this somewhere. Automated detection and setup can be left to setup.py.

Does that sound good to you?

@danieldk (Member)

Excellent!

.travis.yml Outdated
addons:
apt:
sources:
- ubuntu-toolchain-r-test
danieldk (Member)

This is to get an old toolchain?

sebpuetz (Member Author)

Copy-paste from the travis-ci docs. Seems like it doesn't actually work ;)

Disallowing sources: ubuntu-toolchain-r-test

To add unlisted APT sources, follow instructions in https://docs.travis-ci.com/user/installing-dependencies#Installing-Packages-with-the-APT-Addon

I'll push a version without this, shouldn't break the build.

@sebpuetz sebpuetz merged commit 9d912c8 into master Sep 23, 2019
@sebpuetz sebpuetz deleted the lookup_op branch September 23, 2019 08:28