
Add Lookup Op and unit tests. #3

Merged
merged 1 commit into from
Sep 23, 2019

Conversation

sebpuetz (Member)

This PR adds functionality to do embedding lookups in TensorFlow.

@sebpuetz sebpuetz requested a review from danieldk September 19, 2019 12:17
@sebpuetz (Member Author)

Not sure what's going on here. The OS X build succeeds and local builds work, but the Linux CI build breaks down with a segfault in the graph-mode test.

@sebpuetz (Member Author)

Apparently there's a mismatch between tensorflow-rocm and tensorflow. I also get the segfault when I install tensorflow via pip instead of tensorflow-rocm.

std::vector<float> embedding = lookup->embedding(query(i));
// optionally mask failed lookups and/or the empty string. Generally, the empty string will lead to a failed lookup.
if ((query(i).empty() && mask_empty_string_) || (mask_failed_lookup_ && embedding.empty())) {
std::memset(&output_flat(i * dims), 0., dims * 4);
danieldk (Member)

I guess 4 is supposed to be sizeof(float)? If so, put that here ;).

Also, is the data guaranteed to be row-major? (I guess so)

sebpuetz (Member Author)

> I guess 4 is supposed to be sizeof(float)? If so, put that here ;).

Fixed

> Also, is the data guaranteed to be row-major? (I guess so)

The Eigen documentation is not too nice; I spent some time looking for suggestions on how to move data from vec(s) into tensors, and the recommended way was using copy_n or memcpy.

I found this SO thread from 3 years ago that claims tf data is always row-major:

https://stackoverflow.com/questions/37645308/determine-memory-layout-colmajor-rowmajor-in-tensorflow-custom-op

@danieldk danieldk closed this Sep 19, 2019
@danieldk danieldk reopened this Sep 19, 2019
@danieldk (Member)

Whoops, pressed a button ;).

@sebpuetz (Member Author)

If I comment out the SetShapeFn in ops/FFLookupOps.cc, the segfault goes away. So it looks like something's messed up there.

@sebpuetz (Member Author)

Following this description, it's possible to get shapes through ::tensorflow::shape_inference::InferenceContext. Unfortunately, any usage of c in the following leads to segfaults:

    .SetShapeFn([](::tensorflow::shape_inference::InferenceContext* c) {
      ShapeHandle strings_shape = c->input(1);
      ShapeHandle output_shape;
      int embedding_len;
      TF_RETURN_IF_ERROR(c->GetAttr("embedding_len", &embedding_len));
      TF_RETURN_IF_ERROR(
        c->Concatenate(strings_shape, c->Vector(embedding_len), &output_shape)
      );
      ShapeHandle embeds = c->output(0);
      TF_RETURN_IF_ERROR(c->WithRank(c->input(0), 0, &embeds));
      c->set_output(0, output_shape);
      return Status::OK();
    });

@sebpuetz (Member Author)

Per tensorflow/tensorflow#29951 (comment), downgrading the compiler to g++ 4.8 supposedly fixes it.

@sebpuetz sebpuetz force-pushed the lookup_op branch 5 times, most recently from 9c26671 to 453e4b9 Compare September 19, 2019 19:37
@sebpuetz (Member Author)

sebpuetz commented Sep 19, 2019

A maintainer in the thread suggested following the instructions at https://github.com/tensorflow/custom-op, which means using a Docker image with the correct environment. That image ships with gcc-4.8, which doesn't support make_unique yet. This is quite confusing, since they state that these are the environments under which the official wheels are built, yet the tf codebase contains quite a lot of make_uniques. I'm somewhat confused by this whole situation.

It might make sense to restructure the project and follow the custom-op example. Although that wouldn't solve my confusion about the compiler version / c++ features either.

It would be great if you could also take a quick look at this and give your opinion!

edit: Seems to work both in the custom-op docker container and on Ubuntu with gcc 4.8.4, using set(CMAKE_CXX_STANDARD 11) and replacing the make_uniques with unique_ptr<T>(new T). Making this work on CI requires changing finalfusion-cxx accordingly...

@danieldk (Member)

danieldk commented Sep 20, 2019

> That docker comes with gcc-4.8

That is old. But they probably want to build with an old glibc/gcc for compatibility with older distributions.

> which doesn't support make_unique yet. Which is quite confusing to me since they state that these are the environments under which the official wheels are built, but the tf codebase contains quite a lot of make_uniques. I'm somewhat confused by this whole situation.

They probably get make_unique from abseil:

https://abseil.io/

> It might make sense to restructure the project and follow the custom-op example. Although that wouldn't solve my confusion about the compiler version / c++ features either.

The ABI between compiler versions is not guaranteed to be stable. So, if Tensorflow is compiled with a different version than an op, there may be subtle ABI incompatibilities. You generally don't notice. If you are lucky, you get segmentation faults; if you are unlucky, there is silent data corruption. This is why Rust prefers static linking of units compiled with the same compiler version.

At any rate, when you compile with a newer g++, you could try to force an older ABI with -fabi-version. According to

https://gcc.gnu.org/onlinedocs/gcc/C_002b_002b-Dialect-Options.html

you want -fabi-version=7. Of course, assuming that Tensorflow wasn't compiled with a different ABI version. If setting the ABI version works, I think this is vastly preferred over complicating builds with Docker images (which just plain sucks) or ancient Linux distributions (though Ubuntu 18.04 is not ancient, of course).

@sebpuetz (Member Author)

> The ABI between compiler versions is not guaranteed to be stable. So, if Tensorflow is compiled with a different version than an op, there may be subtle ABI incompatibilities. You generally don't notice. If you are lucky, you get segmentation faults; if you are unlucky, there is silent data corruption. This is why Rust prefers static linking of units compiled with the same compiler version.
>
> At any rate, when you compile with a newer g++, you could try to force an older ABI with -fabi-version. According to
>
> https://gcc.gnu.org/onlinedocs/gcc/C_002b_002b-Dialect-Options.html
>
> you want -fabi-version=7. Of course, assuming that Tensorflow wasn't compiled with a different ABI version. If setting the ABI version works, I think this is vastly preferred over complicating builds with Docker images (which just plain sucks) or ancient Linux distributions (though Ubuntu 18.04 is not ancient, of course).

There already is a CXX11_ABI_FLAG that we are setting. Without setting this to the correct value (we get that from the python package), symbols from libtensorflow_framework aren't found.

I tried setting various -fabi-versions, but none solve the segfaulting. The only approach working so far was using g++-4.8, which worked on both Ubuntu 18.04 and 14.04. Is there any obvious downside to this?

The custom op repo also states that newer (manylinux2010-compatible) wheels are built with Ubuntu 16.04 docker images, which comes with g++-5.4.

Fwiw, not setting the ABI in cmake should allow people to compile against self-compiled tf packages with a compiler of their choice.

@sebpuetz (Member Author)

This AWS project also pins gcc to 4.8 to build their custom ops: https://github.com/aws/sagemaker-tensorflow-extensions

@danieldk (Member)

> There already is a CXX11_ABI_FLAG that we are setting. Without setting this to the correct value (we get that from the python package), symbols from libtensorflow_framework aren't found.

That's another knob. This sets _GLIBCXX_USE_CXX11_ABI, which selects the newer C++11-compatible declarations of some standard-library classes. This changes the C++ standard library API and thus the ABI.

So, the library can change the ABI and the compiler can change the ABI (e.g. by changing alignment preferences).

> I tried setting various -fabi-versions, but none solve the segfaulting.

I think it is standard for Google to compile C++ without exceptions, so besides setting the ABI you might also want to add -fno-exceptions.

> Fwiw, not setting the ABI in cmake should allow people to compile against self-compiled tf packages with a compiler of their choice.

It could just be a CMake option.

@danieldk (Member)

danieldk commented Sep 20, 2019

More info on the upstream library:

$ readelf -p .comment /nix/store/s27l1m937bsw60fcdv3cabzbcdy4bpy4-python3.7-tensorflow-1.14.0/lib/python3.7/site-packages/tensorflow/libtensorflow_framework.so.1

String dump of section '.comment':
  [     1]  GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609

So, they are actually using gcc 5.4.0. (Which confirms the statement.) Did you try to compile on CI with 5.4 and no additional flags?

@sebpuetz (Member Author)

sebpuetz commented Sep 20, 2019

That's different for me:

readelf -p .comment /home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/libtensorflow_framework.so.1 

String dump of section '.comment':
  [     1]  GCC: (Ubuntu 4.8.5-4ubuntu8~14.04.2) 4.8.5

5.4 is what is supposedly used for packages built after August 1st, 2019. I haven't tried 5.4.0 yet.

@danieldk (Member)

> That's different for me:
>
>     readelf -p .comment /home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/libtensorflow_framework.so.1
>
>     String dump of section '.comment':
>       [     1]  GCC: (Ubuntu 4.8.5-4ubuntu8~14.04.2) 4.8.5
>
> 5.4 is what is supposedly used for packages built after August 1st, 2019. I haven't tried 5.4.0 yet.

Maybe they used different build environments for different Python versions? You are on 3.6, while I am on 3.7.

@sebpuetz (Member Author)

> I tried setting various -fabi-versions, but none solve the segfaulting.
>
> I think it is standard for Google to compile C++ without exceptions, so besides setting the ABI you might also want to add -fno-exceptions.

We're throwing an exception in the constructor if constructing the embeddings in Rust fails. So with that flag we can't build unless we introduce some Status type.

From what I can tell there is also no such flag set in the custom op example repo: https://github.com/tensorflow/custom-op/blob/master/Makefile

TF_CFLAGS := $(shell $(PYTHON_BIN_PATH) -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_compile_flags()))')
TF_LFLAGS := $(shell $(PYTHON_BIN_PATH) -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_link_flags()))')

CFLAGS = ${TF_CFLAGS} -fPIC -O2 -std=c++11
LDFLAGS = -shared ${TF_LFLAGS}

ZERO_OUT_TARGET_LIB = tensorflow_zero_out/python/ops/_zero_out_ops.so
# zero_out op for CPU
zero_out_op: $(ZERO_OUT_TARGET_LIB)

$(ZERO_OUT_TARGET_LIB): $(ZERO_OUT_SRCS)
	$(CXX) $(CFLAGS) -o $@ $^ ${LDFLAGS}

> Maybe they used different build environments for different Python versions? You are on 3.6, while I am on 3.7.

I verified that: 3.7.3 is built on 16.04 with g++-5.4.

@danieldk (Member)

So the strictly correct way would probably be to build for Python 3.6 using gcc 4.8 and for Python 3.7 using gcc 5.4. (There are two compiler ABI changes in between; not sure whether they matter in this case.)

@sebpuetz sebpuetz force-pushed the lookup_op branch 7 times, most recently from 6235ea5 to 0c5e049 Compare September 20, 2019 13:36
@sebpuetz (Member Author)

I found some more info on this at tensorflow/tensorflow#27067 and tensorflow/community#77. For now, matching the compiler version and flags seems like the only way to guarantee working builds.

We can get the compiler version from the python package, too:

import tensorflow as tf
tf.version.COMPILER_VERSION

returns the compiler version string, so I guess we can use that to select the correct compiler, too.

With tensorflow/community#77 it would be possible to build packages independent of what was used to compile pip tensorflow.

@sebpuetz (Member Author)

sebpuetz commented Sep 22, 2019

> I found some more info on this at tensorflow/tensorflow#27067 and tensorflow/community#77. For now, matching the compiler version and flags seems like the only way to guarantee working builds.
>
> We can get the compiler version from the python package, too:
>
>     import tensorflow as tf
>     tf.version.COMPILER_VERSION
>
> returns the compiler version string, so I guess we can use that to select the correct compiler, too.
>
> With tensorflow/community#77 it would be possible to build packages independent of what was used to compile pip tensorflow.

I put together a setup.py by following what Uber is doing with horovod (https://github.com/horovod/horovod/blob/master/setup.py): they set a minimum compiler version through tf.version.COMPILER_VERSION and an optional maximum version by checking whether the major version of COMPILER_VERSION is 4. They then select the highest compatible gcc/g++ version found in $PATH and export those as CC/CXX. If either CC or CXX is exported at the invocation of setup.py, the explicitly chosen compiler is used instead.

That'll be part of a future PR, so I think the only question for this PR is how to set up the CI and whether what I did is correct for our purposes.

@danieldk (Member)

> I put together a setup.py by following what Uber is doing with horovod (https://github.com/horovod/horovod/blob/master/setup.py): they set a minimum compiler version through tf.version.COMPILER_VERSION and an optional maximum version by checking whether the major version of COMPILER_VERSION is 4. They then select the highest compatible gcc/g++ version found in $PATH and export those as CC/CXX. If either CC or CXX is exported at the invocation of setup.py, the explicitly chosen compiler is used instead.

Sounds good, especially with the possibility to override the compiler.

> That'll be part of a future PR, so I think the only question for this PR is how to set up the CI and whether what I did is correct for our purposes.

I think for this PR then, it would be good enough to use a fixed compiler version (corresponding to what the Python module needs).

@sebpuetz (Member Author)

> I think for this PR then, it would be good enough to use a fixed compiler version (corresponding to what the Python module needs).

This PR now has two Linux builds, one for Python 3.6 and one for 3.7. For the 3.6 build I export g++-4.8 as CXX; for 3.7 I just leave the default compiler (7.4), since there's no ABI incompatibility. I think setting the compiler version explicitly is fine for building the shared library; we just might want to document this somewhere. Automated detection and setup can be left to setup.py.

Does that sound good to you?

@danieldk (Member)

Excellent!

.travis.yml Outdated
addons:
apt:
sources:
- ubuntu-toolchain-r-test
danieldk (Member)

This is to get an old toolchain?

sebpuetz (Member Author)

Copy-paste from the travis-ci docs. Seems like it doesn't actually work ;)

Disallowing sources: ubuntu-toolchain-r-test

To add unlisted APT sources, follow instructions in https://docs.travis-ci.com/user/installing-dependencies#Installing-Packages-with-the-APT-Addon

I'll push a version without this, shouldn't break the build.

@sebpuetz sebpuetz merged commit 9d912c8 into master Sep 23, 2019
@sebpuetz sebpuetz deleted the lookup_op branch September 23, 2019 08:28