Segmentation fault when train the net #17

Open
Harvey-Mei opened this issue Aug 1, 2021 · 3 comments

Comments

Harvey-Mei commented Aug 1, 2021

python train.py --batch_size 24 --experiment_name shapenet-ldif \
  --model_directory $models --model_type "ldif" \
  --dataset_directory $dataset
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
INFO: Making dataset...
INFO: Optimized dataset detected at ./shapenet/optimized
INFO: Mapping...
INFO: is_invalid vs lower_coords: [24, 32, 1] vs [24, 32, 3]
INFO: Post-where lower_coords: [24, 32, 3]
INFO: is_invalid vs sdf coords: [24, 32, 1] vs [24, 32, 1]
INFO: In-out image summaries have been removed.
INFO: The 0-th GPU has 22390 MB free.
INFO: TensorFlow can use up to 93.1397945511389% of the total GPU memory.
INFO: Initializing variables...
INFO: No previous checkpoint detected, training from scratch.
Fatal Python error: Segmentation fault

Thread 0x00007fd78cff9700 (most recent call first):
File "/home/mayo/anaconda3/envs/tf-1.15/lib/python3.8/threading.py", line 302 in wait
File "/home/mayo/anaconda3/envs/tf-1.15/lib/python3.8/queue.py", line 170 in get
File "/home/mayo/anaconda3/envs/tf-1.15/lib/python3.8/site-packages/tensorflow_core/python/summary/writer/event_file_writer.py", line 159 in run
File "/home/mayo/anaconda3/envs/tf-1.15/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/home/mayo/anaconda3/envs/tf-1.15/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007fd9b5258340 (most recent call first):
File "/home/mayo/anaconda3/envs/tf-1.15/lib/python3.8/site-packages/tensorflow_core/python/client/session.py", line 1441 in _call_tf_sessionrun
File "/home/mayo/anaconda3/envs/tf-1.15/lib/python3.8/site-packages/tensorflow_core/python/client/session.py", line 1349 in _run_fn
File "/home/mayo/anaconda3/envs/tf-1.15/lib/python3.8/site-packages/tensorflow_core/python/client/session.py", line 1365 in _do_call
File "/home/mayo/anaconda3/envs/tf-1.15/lib/python3.8/site-packages/tensorflow_core/python/client/session.py", line 1358 in _do_run
File "/home/mayo/anaconda3/envs/tf-1.15/lib/python3.8/site-packages/tensorflow_core/python/client/session.py", line 1179 in _run
File "/home/mayo/anaconda3/envs/tf-1.15/lib/python3.8/site-packages/tensorflow_core/python/client/session.py", line 955 in run
File "train.py", line 263 in main
File "/home/mayo/anaconda3/envs/tf-1.15/lib/python3.8/site-packages/absl/app.py", line 258 in _run_main
File "/home/mayo/anaconda3/envs/tf-1.15/lib/python3.8/site-packages/absl/app.py", line 312 in run
File "train.py", line 283 in <module>
./reproduce_shapenet_autoencoder.sh: line 50: 1295263 Segmentation fault (core dumped) python train.py --batch_size 24 --experiment_name shapenet-ldif --model_directory $models --model_type "ldif" --dataset_directory $dataset

I generated the dataset from the raw ShapeNetCoreV1/03001627 models by converting each .obj file to .ply and then generating a watertight .ply file with the GAPS tools. After that I used the command from the script reproduce_shapenet_autoencoder.sh to build the dataset, and everything completed successfully. But when I tried to train the network on that dataset, training failed with the log shown above.

BTW, my environment is Ubuntu 20.04 with an RTX 3090 and CUDA 11.1, and I run the code on TensorFlow 1.15.
Could you give me some advice on this issue?
Thank you!
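
For a report like this, it can help to attach the versions that the Python process itself sees, since they may differ from what is installed system-wide. The following is a minimal sketch, not part of the project: the function name `environment_report` is hypothetical, and it only uses `platform`, `importlib`, `tf.__version__`, and `tf.test.is_built_with_cuda()`.

```python
import importlib
import platform


def environment_report():
    """Collect version facts that are useful to attach to a crash report.

    Hypothetical helper for illustration; it is not part of the repository.
    """
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    try:
        tf = importlib.import_module("tensorflow")
    except Exception as exc:  # not installed, or the import itself fails
        report["tensorflow"] = "import failed: {}".format(type(exc).__name__)
        return report
    report["tensorflow"] = tf.__version__
    # True when this TensorFlow binary was compiled with CUDA support.
    report["built_with_cuda"] = tf.test.is_built_with_cuda()
    return report
```

Calling `print(environment_report())` inside the same conda environment that runs train.py gives a short dictionary that can be pasted next to the log.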

Harvey-Mei reopened this Aug 9, 2021

Harvey-Mei commented Aug 9, 2021

Also, I have successfully run build_gaps.sh, gaps_is_installed.sh, and build_kernel.sh. With some modifications to suit my environment, the scripts logged the expected output and generated all the required executables.

Harvey-Mei commented

Thread 100 "train.py" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ff865fff700 (LWP 1307527)]
0x00007fff7ca95890 in tensorflow::data::experimental::ParallelInterleaveDatasetOp::Dataset::Iterator::EnsureWorkerThreadsStarted(tensorflow::data::IteratorContext*) ()
from /home/mayo/anaconda3/envs/tf-1.15/lib/python3.8/site-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so

I got this message when debugging with GDB.
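
When attaching GDB is inconvenient, the fatal signal can also be confirmed from a small wrapper. The sketch below is an illustration only (the helper name `run_and_report` is hypothetical). It relies on the documented POSIX behaviour of `subprocess`, where a negative return code -N means the child was terminated by signal N. It identifies the signal but, unlike GDB, does not show which native function crashed.

```python
import signal
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd and return the name of the signal that killed it, or None.

    Hypothetical helper for illustration; POSIX only.
    """
    completed = subprocess.run(cmd)
    if completed.returncode < 0:
        # On POSIX, a negative return code -N means "terminated by signal N".
        return signal.Signals(-completed.returncode).name
    return None
```

For example, `run_and_report([sys.executable, "train.py", "--batch_size", "24"])` would return "SIGSEGV" for a run that ends the way the log above does; the remaining flags are omitted here for brevity.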

susuhu commented Oct 6, 2022

I have the same problem: Segmentation fault (core dumped). I ran build_gaps.sh successfully, but I can't run build_kernel.sh because of the error "unsupported GNU version! gcc versions later than 6 are not supported!". But since that step is optional, it shouldn't affect training, right?
I'm using Ubuntu 20.04 with an RTX 2080 and CUDA 11.3. The environment was created from the YAML file.
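
The compiler error quoted above can be checked ahead of time by comparing the installed gcc major version against the limit named in the message. This is a rough sketch under stated assumptions: the helper names are hypothetical, the limit of 6 is taken only from the quoted error text, and the input is assumed to be the output of `gcc -dumpversion` (for example "9" or "9.4.0").

```python
import re


def gcc_major(version_text):
    """Return the major version from text such as "9" or "9.4.0"."""
    match = re.match(r"\s*(\d+)", version_text)
    if match is None:
        raise ValueError("no version number found in {!r}".format(version_text))
    return int(match.group(1))


def exceeds_limit(version_text, max_major=6):
    """True when the gcc major version is later than max_major.

    The default of 6 comes from the quoted error message, not from any
    documentation of the build script.
    """
    return gcc_major(version_text) > max_major
```

The version string can be obtained with `subprocess.run(["gcc", "-dumpversion"], capture_output=True, text=True).stdout` and passed to `exceeds_limit`.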
