Export to onnx "randomly fails" #78

thistlillo · 2022-06-10T11:10:23Z

I am experiencing a strange issue when I export networks that include a recurrent layer to ONNX: the export sometimes fails. I was able to partially replicate the issue by modifying the text_generation example from the PyEDDL website. The script is at the end of this message.
In all cases (at least with the actual UC5 code), if the first export to ONNX succeeds, the export never fails for the rest of the training.

Why only partially? With my actual code (UC5) I do not get any segmentation fault when I set the eddl_cs_mem parameter to low_mem. With the modified text_generation.py, on the other hand:

the export error occurs with full_mem or mid_mem;
when using low_mem, the process crashes with messages that may differ between two consecutive runs:

[...]
Recurrent net output sequence length=20
munmap_chunk(): invalid pointer
Aborted (core dumped)

[...]
Recurrent net output sequence length=20
Segmentation fault (core dumped)
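
The two messages above correspond to two different signals: `munmap_chunk(): invalid pointer` is printed by the glibc allocator before it aborts the process (SIGABRT), while the second run dies with SIGSEGV. A minimal sketch, assuming the script is launched through Python's `subprocess` module on a POSIX system, of how the return code can be used to tell these cases apart from an ordinary Python exception. `describe_exit` is a hypothetical helper name, not part of PyEDDL.

```python
import signal


def describe_exit(returncode):
    """Turn a subprocess return code into a short label.

    On POSIX, subprocess reports a child killed by signal N as -N.
    """
    if returncode == 0:
        return "ok"
    if returncode < 0:
        try:
            return "killed by " + signal.Signals(-returncode).name
        except ValueError:
            # Signal number not known to this platform's signal module.
            return "killed by signal %d" % -returncode
    # Positive codes are normal exits, e.g. an uncaught Python exception.
    return "exited with code %d" % returncode
```

An uncaught `RuntimeError` such as the ONNX export error would show up here as a positive exit code rather than a signal.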

The logs below show the output of consecutive executions of the script with the --gpu flag, without touching the Python code between runs: five with eddl_cs_mem=full_mem and three with eddl_cs_mem=mid_mem. Once the export fails, it keeps failing for a while, then it runs fine again.
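
Since the failure is intermittent, repeating the run automatically makes it easier to see how often it occurs. The following is a minimal sketch, not the procedure used for the logs in this report (those were launched by hand). `run_repeatedly` is a hypothetical helper name, and the command passed to it is up to the caller.

```python
import subprocess
import sys


def run_repeatedly(cmd, runs):
    """Run cmd `runs` times in a row and return the list of return codes."""
    codes = []
    for i in range(runs):
        # Discard stdout (training logs); stderr stays visible for tracebacks.
        proc = subprocess.run(cmd, stdout=subprocess.DEVNULL)
        codes.append(proc.returncode)
        print("run %d: return code %d" % (i + 1, proc.returncode))
    return codes
```

For the script in this report, a call could look like `run_repeatedly([sys.executable, "text_generation.py", "--gpu", "--mem", "full_mem"], 5)`; any non-zero entry in the returned list marks a failed run.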

FIVE FOR "FULL MEM"

FULL MEM, FIRST:

(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu
Downloading resnet18.onnx
resnet18.onnx ✓
Import ONNX...
Generating Random Table
removing resnetv15_dense0_fwd
Warning: output layer has been removed
CS with full memory setup
Building model
Selecting GPU device 0
EDDL is running on GPU device 0, Tesla V100-SXM2-32GB
CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB
CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB
CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB
-------------------------------------------------------------------------------
model
-------------------------------------------------------------------------------
[CUT DUE TO GITHUB LIMITS]
-------------------------------------------------------------------------------
Total params: 14496868
Trainable params: 14487268
Non-trainable params: 9600

Vec2Seq 1 to 20
Recurrent net output sequence length=20
CS with full memory setup
Building model without initialization
Unroll on device
Recurrent net output sequence length=20
1 epochs of 2 batches of size 24
Epoch 1
[██████████████████████████████████████████████████] 2 out_cnn[loss=4.315 metric=0.271] 1.8908 secs/batch
3.7816 secs/epoch
about to export
[ONNX::Export] Warning: The LSTM layer LSTM1 has mask_zeros=true. This attribute is not supported in ONNX, so the model exported will not have this attribute.
All done

FULL MEM, SECOND:

(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu
Downloading resnet18.onnx
resnet18.onnx ✓
Import ONNX...
Generating Random Table
removing resnetv15_dense0_fwd
Warning: output layer has been removed
CS with full memory setup
Building model
Selecting GPU device 0
EDDL is running on GPU device 0, Tesla V100-SXM2-32GB
CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB
CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB
CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB
-------------------------------------------------------------------------------
model
-------------------------------------------------------------------------------
[CUT DUE TO GITHUB LIMITS]
-------------------------------------------------------------------------------
Total params: 14496868
Trainable params: 14487268
Non-trainable params: 9600

Vec2Seq 1 to 20
Recurrent net output sequence length=20
CS with full memory setup
Building model without initialization
Unroll on device
Recurrent net output sequence length=20
1 epochs of 2 batches of size 24
Epoch 1
[██████████████████████████████████████████████████] 2 out_cnn[loss=4.347 metric=0.257] 1.7958 secs/batch
3.5917 secs/epoch
about to export
[ONNX::Export] Warning: The LSTM layer LSTM1 has mask_zeros=true. This attribute is not supported in ONNX, so the model exported will not have this attribute.
All done

FULL MEM, THIRD:

(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu
Downloading resnet18.onnx
resnet18.onnx ✓
Import ONNX...
Generating Random Table
removing resnetv15_dense0_fwd
Warning: output layer has been removed
CS with full memory setup
Building model
Selecting GPU device 0
EDDL is running on GPU device 0, Tesla V100-SXM2-32GB
CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB
CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB
CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB
-------------------------------------------------------------------------------
model
-------------------------------------------------------------------------------
[CUT DUE TO GITHUB LIMITS]
-------------------------------------------------------------------------------
Total params: 14496868
Trainable params: 14487268
Non-trainable params: 9600

Vec2Seq 1 to 20
Recurrent net output sequence length=20
CS with full memory setup
Building model without initialization
Unroll on device
Recurrent net output sequence length=20
1 epochs of 2 batches of size 24
Epoch 1
[██████████████████████████████████████████████████] 2 out_cnn[loss=4.336 metric=0.242] 1.7833 secs/batch
3.5667 secs/epoch
about to export
[ONNX::Export] Warning: The LSTM layer LSTM1 has mask_zeros=true. This attribute is not supported in ONNX, so the model exported will not have this attribute.
All done

FULL MEM, FOURTH:

(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu
Downloading resnet18.onnx
resnet18.onnx ✓
Import ONNX...
Generating Random Table
removing resnetv15_dense0_fwd
Warning: output layer has been removed
CS with full memory setup
Building model
Selecting GPU device 0
EDDL is running on GPU device 0, Tesla V100-SXM2-32GB
CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB
CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB
CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB
-------------------------------------------------------------------------------
model
-------------------------------------------------------------------------------
[CUT DUE TO GITHUB LIMITS]
-------------------------------------------------------------------------------
Total params: 14496868
Trainable params: 14487268
Non-trainable params: 9600

Vec2Seq 1 to 20
Recurrent net output sequence length=20
CS with full memory setup
Building model without initialization
Unroll on device
Recurrent net output sequence length=20
1 epochs of 2 batches of size 24
Epoch 1
[██████████████████████████████████████████████████] 2 out_cnn[loss=4.357 metric=0.243] 1.9022 secs/batch
3.8044 secs/epoch
about to export
[ONNX::Export] Warning: The LSTM layer LSTM1 has mask_zeros=true. This attribute is not supported in ONNX, so the model exported will not have this attribute.
All done

FULL MEM, FIFTH:

(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu
Downloading resnet18.onnx
resnet18.onnx ✓
Import ONNX...
Generating Random Table
removing resnetv15_dense0_fwd
Warning: output layer has been removed
CS with full memory setup
Building model
Selecting GPU device 0
EDDL is running on GPU device 0, Tesla V100-SXM2-32GB
CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB
CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB
CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB
-------------------------------------------------------------------------------
model
-------------------------------------------------------------------------------
[CUT DUE TO GITHUB LIMITS]
-------------------------------------------------------------------------------
Total params: 14496868
Trainable params: 14487268
Non-trainable params: 9600

Vec2Seq 1 to 20
Recurrent net output sequence length=20
CS with full memory setup
Building model without initialization
Unroll on device
Recurrent net output sequence length=20
1 epochs of 2 batches of size 24
Epoch 1
[██████████████████████████████████████████████████] 2 out_cnn[loss=4.343 metric=0.254] 1.7141 secs/batch
3.4281 secs/epoch
about to export
==================================================================
⚠️  Error exporting the merge layer concat1. To export this model you need to provide the 'seq_len' argument with a value higher than 0 in the export function. (ONNX::ExportNet) ⚠️
==================================================================

Traceback (most recent call last):
  File "text_generation.py", line 84, in <module>
    main(parser.parse_args(sys.argv[1:]))
  File "text_generation.py", line 71, in main
    eddl.save_net_to_onnx_file(net, "img2text.onnx")
  File "/root/miniconda3/envs/eddl2/lib/python3.8/site-packages/pyeddl/eddl.py", line 2894, in save_net_to_onnx_file
    return _eddl.save_net_to_onnx_file(net, path)
RuntimeError: RuntimeError: ONNX::ExportNet

THREE FOR "MID MEM"

With mid_mem the behaviour is similar: the first two runs are OK, the third fails.

MID_MEM, FIRST:

(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu
Downloading resnet18.onnx
resnet18.onnx ✓
Import ONNX...
Generating Random Table
removing resnetv15_dense0_fwd
Warning: output layer has been removed
CS with mid memory setup
Building model
Selecting GPU device 0
EDDL is running on GPU device 0, Tesla V100-SXM2-32GB
CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB
CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB
CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB
-------------------------------------------------------------------------------
model
-------------------------------------------------------------------------------
[CUT DUE TO GITHUB LIMITS]
-------------------------------------------------------------------------------
Total params: 14496868
Trainable params: 14487268
Non-trainable params: 9600

Vec2Seq 1 to 20
Recurrent net output sequence length=20
CS with mid memory setup
Building model without initialization
Unroll on device
Recurrent net output sequence length=20
1 epochs of 2 batches of size 24
Epoch 1
[██████████████████████████████████████████████████] 2 out_cnn[loss=4.347 metric=0.246] 1.7054 secs/batch
3.4108 secs/epoch
about to export
[ONNX::Export] Warning: The LSTM layer LSTM1 has mask_zeros=true. This attribute is not supported in ONNX, so the model exported will not have this attribute.
All done

MID_MEM, SECOND:

(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu
Downloading resnet18.onnx
resnet18.onnx ✓
Import ONNX...
Generating Random Table
removing resnetv15_dense0_fwd
Warning: output layer has been removed
CS with mid memory setup
Building model
Selecting GPU device 0
EDDL is running on GPU device 0, Tesla V100-SXM2-32GB
CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB
CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB
CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB
-------------------------------------------------------------------------------
model
-------------------------------------------------------------------------------
[CUT DUE TO GITHUB LIMITS]
-------------------------------------------------------------------------------
Total params: 14496868
Trainable params: 14487268
Non-trainable params: 9600

Vec2Seq 1 to 20
Recurrent net output sequence length=20
CS with mid memory setup
Building model without initialization
Unroll on device
Recurrent net output sequence length=20
1 epochs of 2 batches of size 24
Epoch 1
[██████████████████████████████████████████████████] 2 out_cnn[loss=4.345 metric=0.272] 1.8341 secs/batch
3.6682 secs/epoch
about to export
[ONNX::Export] Warning: The LSTM layer LSTM1 has mask_zeros=true. This attribute is not supported in ONNX, so the model exported will not have this attribute.
All done

MID_MEM, THIRD:

(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu
Downloading resnet18.onnx
resnet18.onnx ✓
Import ONNX...
Generating Random Table
removing resnetv15_dense0_fwd
Warning: output layer has been removed
CS with mid memory setup
Building model
Selecting GPU device 0
EDDL is running on GPU device 0, Tesla V100-SXM2-32GB
CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB
CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB
CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB
-------------------------------------------------------------------------------
model
-------------------------------------------------------------------------------
[CUT DUE TO GITHUB LIMITS]
-------------------------------------------------------------------------------
Total params: 14496868
Trainable params: 14487268
Non-trainable params: 9600

Vec2Seq 1 to 20
Recurrent net output sequence length=20
CS with mid memory setup
Building model without initialization
Unroll on device
Recurrent net output sequence length=20
1 epochs of 2 batches of size 24
Epoch 1
[██████████████████████████████████████████████████] 2 out_cnn[loss=4.327 metric=0.240] 1.9775 secs/batch
3.9549 secs/epoch
about to export
==================================================================
⚠️  Error exporting the merge layer concat1. To export this model you need to provide the 'seq_len' argument with a value higher than 0 in the export function. (ONNX::ExportNet) ⚠️
==================================================================

Traceback (most recent call last):
  File "text_generation.py", line 84, in <module>
    main(parser.parse_args(sys.argv[1:]))
  File "text_generation.py", line 71, in main
    eddl.save_net_to_onnx_file(net, "img2text.onnx")
  File "/root/miniconda3/envs/eddl2/lib/python3.8/site-packages/pyeddl/eddl.py", line 2894, in save_net_to_onnx_file
    return _eddl.save_net_to_onnx_file(net, path)
RuntimeError: RuntimeError: ONNX::ExportNet
(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api#

SCRIPT (MOD. TEXT GENERATION)

"""\
Text generation (modified).
"""

import argparse
import sys

import pyeddl.eddl as eddl
from pyeddl.tensor import Tensor
import numpy as np

MEM_CHOICES = ("low_mem", "mid_mem", "full_mem")


def main(args):
    epochs = 1
    olength = 20
    outvs = 2000
    embdim = 32

    # True: remove last layers and set new top = flatten
    # new input_size: [3, 256, 256] (from [224, 224, 3])
    net = eddl.download_resnet18(True, [3, 256, 256])
    lreshape = eddl.getLayer(net, "top")
    dense_layer = eddl.HeUniform(eddl.Dense(lreshape, 20, name="out_dense"))
    cnn_out = eddl.Sigmoid(dense_layer, name="cnn_out")
    concat = eddl.Concat([lreshape, cnn_out], name="cnn_concat")

    # create a new model from input output
    image_in = eddl.getLayer(net, "input")

    # Decoder
    ldecin = eddl.Input([outvs])
    ldec = eddl.ReduceArgMax(ldecin, [0])
    ldec = eddl.RandomUniform(
        eddl.Embedding(ldec, outvs, 1, embdim, True), -0.05, 0.05
    )

    ldec = eddl.Concat([ldec, concat])
    layer = eddl.LSTM(ldec, 512, True)
    out = eddl.Softmax(eddl.Dense(layer, outvs), name="out_cnn")

    eddl.setDecoder(ldecin)
    net = eddl.Model([image_in], [out])

    # Build model
    eddl.build(
        net,
        eddl.adam(0.01),
        ["softmax_cross_entropy"],
        ["accuracy"],
        eddl.CS_GPU(mem=args.mem) if args.gpu else eddl.CS_CPU(mem=args.mem)
    )
    eddl.summary(net)

    # Load dataset
    x_train = Tensor.randn([48, 256, 256, 3])  # Tensor.load("flickr_trX.bin", "bin")
    y_train = Tensor.fromarray(np.random.randint(0, 2, (48, 20)))
    
    xtrain = Tensor.permute(x_train, [0, 3, 1, 2])
    y_train = Tensor.onehot(y_train, outvs)
    # batch x timesteps x input_dim
    y_train.reshape_([y_train.shape[0], olength, outvs])

    

    eddl.fit(net, [xtrain], [y_train], args.batch_size, epochs)
    # eddl.save(net, "img2text.bin", "bin")
    print("about to export")
    eddl.save_net_to_onnx_file(net, "img2text.onnx")
    
    # the export call above is where the error is raised
    print("All done")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--batch-size", type=int, metavar="INT", default=24)
    parser.add_argument("--gpu", action="store_true")
    parser.add_argument("--small", action="store_true")
    # crashes with a segfault on low_mem
    parser.add_argument("--mem", metavar="|".join(MEM_CHOICES),
                        choices=MEM_CHOICES, default="full_mem")
    main(parser.parse_args(sys.argv[1:]))

Python:

(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python --version
Python 3.8.6

NVIDIA/CUDA:

NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6

Libraries:

(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# conda env export | grep eddl
name: eddl2
  - eddl-cudnn=1.1b0=h476a1fd_0
  - pyeddl-cudnn=1.3.0=py38hf64f055_0
prefix: /root/miniconda3/envs/eddl2
(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# conda env export | grep ecvl
  - ecvl-cudnn=1.0.3=py38h65a929d_0
  - pyecvl-cudnn=1.3.0=py38hf64f055_0

The tests were run on a Linux pod on the OpenDeepHealth platform:

(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# cat /etc/os-release 
NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.5 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

simleo · 2022-06-10T12:50:37Z

The Python version of save_net_to_onnx_file just calls the corresponding C++ function. The only difference is that in PyEDDL 1.3.0 the seq_len argument was not exposed to the Python interface, but that has no effect on the function's behavior when the argument is not used. You should report this to the EDDL team. You can also try again with PyEDDL 1.3.1 (note it's just been released, so Docker images and Conda packages are not available yet) and see if setting seq_len makes any difference.
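
A minimal sketch of how the same script could be tried on both PyEDDL versions, passing seq_len only when the installed export function accepts it. It assumes that 1.3.1 exposes the argument as a keyword named `seq_len`; `export_onnx` is a hypothetical wrapper, and `export_fn` stands in for `eddl.save_net_to_onnx_file`.

```python
import inspect


def export_onnx(export_fn, net, path, seq_len):
    """Call export_fn, passing seq_len only if its signature accepts it.

    The keyword name `seq_len` is an assumption based on the comment above.
    """
    try:
        params = inspect.signature(export_fn).parameters
    except (TypeError, ValueError):
        # Signature cannot be inspected (e.g. some C extension callables).
        params = {}
    if "seq_len" in params:
        return export_fn(net, path, seq_len=seq_len)
    # Older interface: fall back to the two-argument call.
    return export_fn(net, path)
```

With the script from this issue the call would be `export_onnx(eddl.save_net_to_onnx_file, net, "img2text.onnx", 20)`, where 20 is the `olength` value used there. Whether this avoids the intermittent error has not been verified.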

thistlillo · 2022-06-10T14:48:56Z

Thanks, @simleo. I will open an issue on the EDDL site.

thistlillo mentioned this issue Jun 10, 2022

Export to ONNX randomly fails deephealthproject/eddl#339

Closed
