S3 filesystem pure virtual method called; terminate called without an active exception #1912

Closed
rivershah opened this issue Jan 1, 2024 · 23 comments · Fixed by #2023

@rivershah

rivershah commented Jan 1, 2024

I am getting a core dump during interpreter teardown when using the S3 filesystem. Could I please get guidance on how to handle this? The script below reproduces the issue inside Docker, starting from this base image:

FROM tensorflow/tensorflow:2.14.0-gpu

The following environment variables are set:

"AWS_ACCESS_KEY_ID": xxx,
"AWS_SECRET_ACCESS_KEY": xxx,
"AWS_ENDPOINT_URL_S3": xxx,
"AWS_REGION": "us-east-1",
"S3_USE_HTTPS": "1",
"S3_VERIFY_SSL": "1",
"S3_DISABLE_MULTI_PART_DOWNLOAD": "0",
"S3_ENDPOINT": xxx,

import os

import tensorflow as tf
import tensorflow_io as tfio  # noqa: F401  (imported only to register the s3:// filesystem)

def illustrate_core_dump():
    print(f"tf version: {tf.__version__}")
    print(f"tfio version: {tfio.__version__}")
    filename = f"{os.environ['CLOUD_MOUNT']}/tmp/test_tfrecord.tfrecord"
    assert filename.startswith("s3://"), "problem appears to be for s3 filesystem only"
    ds = tf.data.TFRecordDataset(filename, "GZIP")

    for i in ds:
        print(f"i.shape: {i.shape}")


if __name__ == "__main__":
    illustrate_core_dump()
    print("reaches here successfully")
    print("something broken during destruction and tf")

    # During interpreter teardown, if the s3 filesystem was used, the process prints:
    #   pure virtual method called
    #   terminate called without an active exception
    #   Aborted (core dumped)

    # gs:// and file:// paths, which don't rely on tfio, do not exhibit this issue.

TF_CPP_MIN_LOG_LEVEL=0 python notebooks/illustrate_core_dump.py
2024-01-01 18:07:11.253238: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-01 18:07:11.253287: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-01 18:07:11.253323: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-01 18:07:11.262384: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
tf version: 2.14.0
tfio version: 0.35.0
2024-01-01 18:07:14.402239: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:14.413303: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:14.416545: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:14.421598: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:14.423868: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:14.426098: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:15.494277: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:15.496519: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:15.498484: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:15.500342: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13589 MB memory:  -> device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5
i.shape: ()
reaches here successfully
something broken during destruction and tf
pure virtual method called
terminate called without an active exception
Aborted (core dumped)
@rivershah
Author

tensorflow-io==0.34.0 # works
tensorflow-io==0.35.0 # crashing

Can we please verify why the latest release is exhibiting this issue? Thank you.

@jpambrun

I had the same issue and it was driving me insane. I have some unrelated custom C++ ops and wasted a day digging into those. I am using S3, and going back to 0.34.0 fixed it.

@saimidu

saimidu commented Feb 6, 2024

I am facing the same issue with tensorflow==2.13, using either tensorflow-io==0.34.0 or tensorflow-io==0.35.0. There is no straightforward root cause, and reverting to tensorflow-io==0.33.0 fixes it.

I've also hit the same error with tensorflow==2.14 and tensorflow-io==0.35.0, which is the only version that supports TF 2.14 according to the compatibility chart in README.md. Reverting to tensorflow-io==0.33.0 seems to fix that as well.
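The version reports so far can be summarized in a small lookup. This is only a sketch of what has been posted in the comments above, not an official compatibility table; `REPORTS` and `reported_status` are hypothetical names introduced here for illustration, and the TensorFlow version for the first report is taken from the log output in the original post.

```python
# User reports collected from the comments above, keyed by
# (tensorflow minor version, tensorflow-io version).
# Assumption: this only mirrors what posters wrote; it is not an official matrix.
REPORTS = {
    ("2.14", "0.34.0"): "ok",      # original post: works
    ("2.14", "0.35.0"): "crash",   # original post: aborts at teardown
    ("2.13", "0.33.0"): "ok",
    ("2.13", "0.34.0"): "crash",
    ("2.13", "0.35.0"): "crash",
    ("2.14", "0.33.0"): "ok",      # reported as "seems to fix it"
}


def reported_status(tf_version: str, tfio_version: str) -> str:
    """Return "ok", "crash", or "unknown" for a version pair (hypothetical helper)."""
    tf_minor = ".".join(tf_version.split(".")[:2])  # "2.14.0" -> "2.14"
    return REPORTS.get((tf_minor, tfio_version), "unknown")


if __name__ == "__main__":
    for pair in [("2.14.0", "0.34.0"), ("2.14.0", "0.35.0"), ("2.15.0", "0.36.0")]:
        print(pair, reported_status(*pair))
```

Pairs that nobody has reported on come back as "unknown" rather than being guessed.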

@saimidu

saimidu commented Feb 12, 2024

As an update, I followed the build instructions for tensorflow-io (Ubuntu 22.04 and then Python Wheels), and discovered that this particular "pure virtual method called" error does not occur when I use a locally built wheel for tensorflow-io.

Note: The link in the docker build instructions is broken - https://github.com/tensorflow/io/blob/master/docs/development.md#docker - and the latest image in tfsigio/tfio is about 2 years old.

@rivershah
Author

@saimidu Is there any chance you could post the steps you took to build? I tried to build but was thwarted by the issues you mentioned.

@saimidu

saimidu commented Feb 13, 2024

@rivershah I pulled the ubuntu:22.04 image from Docker Hub:

docker run --name tfio_builder -itd ubuntu:22.04 bash
docker exec -it tfio_builder bash

and installed all the packages and bazel as instructed in https://github.com/tensorflow/io/blob/master/docs/development.md#ubuntu-2204 (without the sudo)

apt-get -y -qq update
apt-get -y -qq install gcc g++ git unzip curl python3-pip python-is-python3 libntirpc-dev
curl -sSOL https://github.com/bazelbuild/bazelisk/releases/download/v1.11.0/bazelisk-linux-amd64
mv bazelisk-linux-amd64 /usr/local/bin/bazel
chmod +x /usr/local/bin/bazel

python3 --version  # made sure I had python version>=3.9
python3 -m pip install -U pip
git clone https://github.com/tensorflow/io
cd io/
git checkout v0.35.0
pip install "tensorflow==2.14.1"
./configure.sh
export TF_PYTHON_VERSION=3.10
bazel build -s --verbose_failures --copt="-Wno-error=array-parameter=" --copt="-I/usr/include/tirpc" //tensorflow_io/... //tensorflow_io_gcs_filesystem/...

I then followed the instructions at https://github.com/tensorflow/io/blob/master/docs/development.md#python-wheels:

python3 setup.py bdist_wheel --data bazel-bin

Then, within the same container, I was able to validate tf-io's S3 filesystem functionality by trying to checkpoint a model to S3.

I'll need to do some additional work to reproduce the failure I got when copying the generated tf-io wheel out into a different container, since I've terminated all of that setup now.

@rivershah
Author

Bumping this issue. The release build process needs to be looked at to make sure it is handling this correctly.

@rivershah
Author

This problem persists in tensorflow-io==0.37.0. Please fix it; it renders S3-based I/O unusable without resorting to old versions.

@skye
Member

skye commented May 24, 2024

@yongtang would you be able to help here? Sounds like this is a pretty serious issue, so it would be much appreciated!!

@ruomingp

This is blocking us from upgrading the tensorstore version. A quick fix will be much appreciated!

@CecileRobertMichon

+1, also running into this issue

yarri-oss added a commit to yarri-oss/io that referenced this issue May 30, 2024
Bump Ubuntu version for Linux Wheel to address issue tensorflow#1912
@yarri-oss
Contributor

> @yongtang would you be able to help here? Sounds like this is a pretty serious issue, so it would be much appreciated!!

@yongtang per the comment #1912 (comment) above, and assuming my PR #2005 passes, could you please consider a minor release (0.37.1 maybe?) to address the S3 issues discussed above? Thanks!

yongtang pushed a commit that referenced this issue Jun 18, 2024
Bump Ubuntu version for Linux Wheel to address issue #1912
@rivershah
Author

@yongtang Thanks for the fix. So that we can upgrade tensorflow, could you please do a 0.37.1 release as per @yarri-oss's request? Thanks again.

@spolloni

spolloni commented Jul 2, 2024

I am still seeing

pure virtual method called
terminate called without an active exception

on 0.37.1. Anyone else?

@spolloni

spolloni commented Jul 3, 2024

@yarri-oss how is #2005 supposed to fix this issue?

@rivershah
Author

The problem still persists. It is reproducible with the script above:

pure virtual method called
terminate called without an active exception
Aborted (core dumped)

@spolloni

spolloni commented Jul 3, 2024

cc @yongtang -- can we reopen the issue?

@yarri-oss
Contributor

End users have confirmed this issue fixed.

@spolloni If you can post a repro (with an S3 bucket blob), we can investigate further. I would prefer that a new issue be opened for your specific repro, though.

@spolloni

spolloni commented Jul 4, 2024

> End users have confirmed this issue fixed.

Which users?

> If you can post a repro

The repro has not changed; it is the one posted here: #1912 (comment)

@rivershah
Author

Which users? I posted the issue, and it still reproduces as described above.

@txchen

txchen commented Sep 10, 2024

We are still having this issue with tensorflow-io 0.37.1. Please help reopen this issue, @yarri-oss.

import tensorflow as tf
import tensorflow_io as tfio  # noqa: F401  (imported only to register the s3:// filesystem)

tf.io.gfile.glob("s3://mybucket/dir")

my-server ~ > python test_tf.py
2024-09-10 17:55:34.571626: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-09-10 17:55:34.756820: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-09-10 17:55:35.673576: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-10 17:55:35.673613: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-10 17:55:35.679356: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-09-10 17:55:36.191044: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-09-10 17:55:36.193843: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-09-10 17:55:40.594805: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
pure virtual method called
terminate called without an active exception
zsh: IOT instruction  python test_tf.py

my-server ~ > echo $?
134
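For readers unfamiliar with the number: by shell convention, a status above 128 means the process was terminated by signal number `status - 128`, so 134 corresponds to signal 6, SIGABRT. zsh's "IOT instruction" message refers to the same signal, since SIGIOT is a historical alias for SIGABRT on Linux. A minimal sketch of the decoding follows; `describe_shell_status` is a hypothetical helper, and the 128+N rule is only a convention (a program could also return 134 on its own).

```python
import signal


def describe_shell_status(status: int) -> str:
    """Describe a POSIX shell exit status (hypothetical helper for illustration)."""
    if status > 128:
        signum = status - 128  # shell convention: 128 + signal number
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with code {status}"


if __name__ == "__main__":
    print(describe_shell_status(134))
    print(describe_shell_status(0))
```

This is consistent with the "terminate called without an active exception" line: the C++ runtime's terminate handler calls abort(), which raises SIGABRT.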

@CecileRobertMichon

Also still seeing the same issue after upgrading to v0.37.1

@jessicaxiejw

jessicaxiejw commented Nov 14, 2024

With tensorflow-cpu fixed at 2.14.0, I can reproduce the issue by changing tensorflow-io from 0.34.0 to 0.35.0.

This is my script:

import tensorflow as tf
import tensorflow_io as tfio  # noqa: F401  (imported only to register the s3:// filesystem)


def main():
    print(f"tf version: {tf.__version__}")
    print(f"tfio version: {tfio.__version__}")

    filename = "s3://my/s3/dataset/path.tfrecord"
    assert filename.startswith("s3://"), "problem appears to be for s3 filesystem only"
    ds = tf.data.TFRecordDataset(filename)
    for i in ds:
        print(f"i.shape: {i.shape}")


if __name__ == "__main__":
    main()
    print("End of main()")

If I run pip install tensorflow-io==0.34.0 tensorflow-cpu==2.14.0 && python main.py

tf version: 2.14.0
tfio version: 0.34.0
i.shape: ()
End of main()

And the process exited immediately with status 0.

If I run pip install tensorflow-io==0.35.0 tensorflow-cpu==2.14.0 && python main.py

tf version: 2.14.0
tfio version: 0.35.0
i.shape: ()
End of main()
pure virtual method called
terminate called without an active exception
bash: line 1:  1763 Aborted                 (core dumped) python main.py

and the process exited with status 134.
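Because the script prints everything it should and only dies afterwards, comparing versions by eye is easy to get wrong. The sketch below runs a command in a child process and classifies how it ended, which could be used to check each candidate tensorflow-io version in turn. `run_and_classify` is a hypothetical helper, and the demo child is a stand-in that calls `os.abort()` rather than the real `main.py`, so it runs without TensorFlow or S3 access.

```python
import signal
import subprocess
import sys


def run_and_classify(cmd: list[str]) -> str:
    """Run cmd and report a clean exit, a non-zero exit code, or death by signal."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode == 0:
        return "clean exit"
    if proc.returncode < 0:
        # On POSIX, subprocess reports death by signal as the negated signal number.
        return f"killed by {signal.Signals(-proc.returncode).name}"
    return f"exit code {proc.returncode}"


if __name__ == "__main__":
    # Stand-in for `python main.py`: prints its last line, then aborts like the crash does.
    demo = [sys.executable, "-c",
            "import os; print('End of main()', flush=True); os.abort()"]
    print(run_and_classify(demo))
```

To use it for the real case, replace `demo` with `[sys.executable, "main.py"]` after installing the tensorflow-io version under test.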

So, I went through every single commit between 0.34.0 and 0.35.0. This was the only commit that seemed to touch S3: https://github.com/tensorflow/io/pull/1857/files.

Does anybody know how I can build tensorflow-io myself, so I can see whether reverting the commit fixes the issue?

cc @yongtang

Update: I found the doc! https://github.com/tensorflow/io/blob/master/docs/development.md
