I am experiencing a strange issue when exporting networks that include a recurrent layer to ONNX: sometimes the export fails. I was partially able to replicate the issue by modifying the text_generation example on the PyEDDL website. You can find the script at the end of this message.
In all cases (at least with the actual UC5 code), if the first export to ONNX succeeds, it never fails for the rest of the training.
Why only partially? With my actual code (UC5) I do not get any segmentation fault when I change the eddl_cs_mem parameter to low_mem. With the modified text_generation.py:
- the export error occurs with full_mem or mid_mem;
- with low_mem, I get a segmentation fault, with messages that may differ between two consecutive runs:
Have a look at the following logs. They are the output of five consecutive runs with eddl_cs_mem=full_mem and three consecutive runs with eddl_cs_mem=mid_mem, all launched with the --gpu flag and without touching the Python code between runs. Once the export fails, it keeps failing for a while, then it runs fine again.
FIVE FOR "FULL MEM"
FULL MEM, FIRST:
(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu
Downloading resnet18.onnx
resnet18.onnx ✓
Import ONNX...
Generating Random Table
removing resnetv15_dense0_fwd
Warning: output layer has been removed
CS with full memory setup
Building model
Selecting GPU device 0
EDDL is running on GPU device 0, Tesla V100-SXM2-32GB
CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB
CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB
CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB
-------------------------------------------------------------------------------
model
-------------------------------------------------------------------------------
[CUT DUE TO GITHUB LIMITS]
-------------------------------------------------------------------------------
Total params: 14496868
Trainable params: 14487268
Non-trainable params: 9600
Vec2Seq 1 to 20
Recurrent net output sequence length=20
CS with full memory setup
Building model without initialization
Unroll on device
Recurrent net output sequence length=20
1 epochs of 2 batches of size 24
Epoch 1
[██████████████████████████████████████████████████] 2 out_cnn[loss=4.315 metric=0.271] 1.8908 secs/batch
3.7816 secs/epoch
about to export
[ONNX::Export] Warning: The LSTM layer LSTM1 has mask_zeros=true. This attribute is not supported in ONNX, so the model exported will not have this attribute.
All done
FULL MEM, SECOND:
(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu
Downloading resnet18.onnx
resnet18.onnx ✓
Import ONNX...
Generating Random Table
removing resnetv15_dense0_fwd
Warning: output layer has been removed
CS with full memory setup
Building model
Selecting GPU device 0
EDDL is running on GPU device 0, Tesla V100-SXM2-32GB
CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB
CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB
CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB
-------------------------------------------------------------------------------
model
-------------------------------------------------------------------------------
[CUT DUE TO GITHUB LIMITS]
-------------------------------------------------------------------------------
Total params: 14496868
Trainable params: 14487268
Non-trainable params: 9600
Vec2Seq 1 to 20
Recurrent net output sequence length=20
CS with full memory setup
Building model without initialization
Unroll on device
Recurrent net output sequence length=20
1 epochs of 2 batches of size 24
Epoch 1
[██████████████████████████████████████████████████] 2 out_cnn[loss=4.347 metric=0.257] 1.7958 secs/batch
3.5917 secs/epoch
about to export
[ONNX::Export] Warning: The LSTM layer LSTM1 has mask_zeros=true. This attribute is not supported in ONNX, so the model exported will not have this attribute.
All done
FULL MEM, THIRD:
(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu
Downloading resnet18.onnx
resnet18.onnx ✓
Import ONNX...
Generating Random Table
removing resnetv15_dense0_fwd
Warning: output layer has been removed
CS with full memory setup
Building model
Selecting GPU device 0
EDDL is running on GPU device 0, Tesla V100-SXM2-32GB
CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB
CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB
CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB
-------------------------------------------------------------------------------
model
-------------------------------------------------------------------------------
[CUT DUE TO GITHUB LIMITS]
-------------------------------------------------------------------------------
Total params: 14496868
Trainable params: 14487268
Non-trainable params: 9600
Vec2Seq 1 to 20
Recurrent net output sequence length=20
CS with full memory setup
Building model without initialization
Unroll on device
Recurrent net output sequence length=20
1 epochs of 2 batches of size 24
Epoch 1
[██████████████████████████████████████████████████] 2 out_cnn[loss=4.336 metric=0.242] 1.7833 secs/batch
3.5667 secs/epoch
about to export
[ONNX::Export] Warning: The LSTM layer LSTM1 has mask_zeros=true. This attribute is not supported in ONNX, so the model exported will not have this attribute.
All done
FULL MEM, FOURTH:
(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu
Downloading resnet18.onnx
resnet18.onnx ✓
Import ONNX...
Generating Random Table
removing resnetv15_dense0_fwd
Warning: output layer has been removed
CS with full memory setup
Building model
Selecting GPU device 0
EDDL is running on GPU device 0, Tesla V100-SXM2-32GB
CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB
CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB
CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB
-------------------------------------------------------------------------------
model
-------------------------------------------------------------------------------
[CUT DUE TO GITHUB LIMITS]
-------------------------------------------------------------------------------
Total params: 14496868
Trainable params: 14487268
Non-trainable params: 9600
Vec2Seq 1 to 20
Recurrent net output sequence length=20
CS with full memory setup
Building model without initialization
Unroll on device
Recurrent net output sequence length=20
1 epochs of 2 batches of size 24
Epoch 1
[██████████████████████████████████████████████████] 2 out_cnn[loss=4.357 metric=0.243] 1.9022 secs/batch
3.8044 secs/epoch
about to export
[ONNX::Export] Warning: The LSTM layer LSTM1 has mask_zeros=true. This attribute is not supported in ONNX, so the model exported will not have this attribute.
All done
FULL MEM, FIFTH:
(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu
Downloading resnet18.onnx
resnet18.onnx ✓
Import ONNX...
Generating Random Table
removing resnetv15_dense0_fwd
Warning: output layer has been removed
CS with full memory setup
Building model
Selecting GPU device 0
EDDL is running on GPU device 0, Tesla V100-SXM2-32GB
CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB
CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB
CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB
-------------------------------------------------------------------------------
model
-------------------------------------------------------------------------------
[CUT DUE TO GITHUB LIMITS]
-------------------------------------------------------------------------------
Total params: 14496868
Trainable params: 14487268
Non-trainable params: 9600
Vec2Seq 1 to 20
Recurrent net output sequence length=20
CS with full memory setup
Building model without initialization
Unroll on device
Recurrent net output sequence length=20
1 epochs of 2 batches of size 24
Epoch 1
[██████████████████████████████████████████████████] 2 out_cnn[loss=4.343 metric=0.254] 1.7141 secs/batch
3.4281 secs/epoch
about to export
==================================================================
⚠️ Error exporting the merge layer concat1. To export this model you need to provide the 'seq_len' argument with a value higher than 0 in the export function. (ONNX::ExportNet) ⚠️
==================================================================
Traceback (most recent call last):
File "text_generation.py", line 84, in <module>
main(parser.parse_args(sys.argv[1:]))
File "text_generation.py", line 71, in main
eddl.save_net_to_onnx_file(net, "img2text.onnx")
File "/root/miniconda3/envs/eddl2/lib/python3.8/site-packages/pyeddl/eddl.py", line 2894, in save_net_to_onnx_file
return _eddl.save_net_to_onnx_file(net, path)
RuntimeError: RuntimeError: ONNX::ExportNet
THREE FOR "MID MEM"
Using MID_MEM the behaviour is the same: the first two runs are ok, the third fails.
MID_MEM, FIRST:
(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu
Downloading resnet18.onnx
resnet18.onnx ✓
Import ONNX...
Generating Random Table
removing resnetv15_dense0_fwd
Warning: output layer has been removed
CS with mid memory setup
Building model
Selecting GPU device 0
EDDL is running on GPU device 0, Tesla V100-SXM2-32GB
CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB
CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB
CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB
-------------------------------------------------------------------------------
model
-------------------------------------------------------------------------------
[CUT DUE TO GITHUB LIMITS]
-------------------------------------------------------------------------------
Total params: 14496868
Trainable params: 14487268
Non-trainable params: 9600
Vec2Seq 1 to 20
Recurrent net output sequence length=20
CS with mid memory setup
Building model without initialization
Unroll on device
Recurrent net output sequence length=20
1 epochs of 2 batches of size 24
Epoch 1
[██████████████████████████████████████████████████] 2 out_cnn[loss=4.347 metric=0.246] 1.7054 secs/batch
3.4108 secs/epoch
about to export
[ONNX::Export] Warning: The LSTM layer LSTM1 has mask_zeros=true. This attribute is not supported in ONNX, so the model exported will not have this attribute.
All done
MID_MEM, SECOND
(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu
Downloading resnet18.onnx
resnet18.onnx ✓
Import ONNX...
Generating Random Table
removing resnetv15_dense0_fwd
Warning: output layer has been removed
CS with mid memory setup
Building model
Selecting GPU device 0
EDDL is running on GPU device 0, Tesla V100-SXM2-32GB
CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB
CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB
CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB
-------------------------------------------------------------------------------
model
-------------------------------------------------------------------------------
[CUT DUE TO GITHUB LIMITS]
-------------------------------------------------------------------------------
Total params: 14496868
Trainable params: 14487268
Non-trainable params: 9600
Vec2Seq 1 to 20
Recurrent net output sequence length=20
CS with mid memory setup
Building model without initialization
Unroll on device
Recurrent net output sequence length=20
1 epochs of 2 batches of size 24
Epoch 1
[██████████████████████████████████████████████████] 2 out_cnn[loss=4.345 metric=0.272] 1.8341 secs/batch
3.6682 secs/epoch
about to export
[ONNX::Export] Warning: The LSTM layer LSTM1 has mask_zeros=true. This attribute is not supported in ONNX, so the model exported will not have this attribute.
All done
MID_MEM, THIRD:
(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu
Downloading resnet18.onnx
resnet18.onnx ✓
Import ONNX...
Generating Random Table
removing resnetv15_dense0_fwd
Warning: output layer has been removed
CS with mid memory setup
Building model
Selecting GPU device 0
EDDL is running on GPU device 0, Tesla V100-SXM2-32GB
CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB
CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB
CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB
-------------------------------------------------------------------------------
model
-------------------------------------------------------------------------------
[CUT DUE TO GITHUB LIMITS]
-------------------------------------------------------------------------------
Total params: 14496868
Trainable params: 14487268
Non-trainable params: 9600
Vec2Seq 1 to 20
Recurrent net output sequence length=20
CS with mid memory setup
Building model without initialization
Unroll on device
Recurrent net output sequence length=20
1 epochs of 2 batches of size 24
Epoch 1
[██████████████████████████████████████████████████] 2 out_cnn[loss=4.327 metric=0.240] 1.9775 secs/batch
3.9549 secs/epoch
about to export
==================================================================
⚠️ Error exporting the merge layer concat1. To export this model you need to provide the 'seq_len' argument with a value higher than 0 in the export function. (ONNX::ExportNet) ⚠️
==================================================================
Traceback (most recent call last):
File "text_generation.py", line 84, in <module>
main(parser.parse_args(sys.argv[1:]))
File "text_generation.py", line 71, in main
eddl.save_net_to_onnx_file(net, "img2text.onnx")
File "/root/miniconda3/envs/eddl2/lib/python3.8/site-packages/pyeddl/eddl.py", line 2894, in save_net_to_onnx_file
return _eddl.save_net_to_onnx_file(net, path)
RuntimeError: RuntimeError: ONNX::ExportNet
(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api#
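Since the failure only shows up on some runs, the repeated manual launches above could be automated. The following is a minimal sketch, not part of the original report: the helper names run_repeatedly and summarize are hypothetical, and it only assumes that the script exits with a non-zero status when the export raises or the process crashes.

```python
import subprocess
import sys


def run_repeatedly(cmd, runs):
    """Run `cmd` `runs` times in a row and return the list of exit codes.

    On POSIX, a negative exit code means the process was killed by a
    signal (for example -11 for a segmentation fault).
    """
    codes = []
    for _ in range(runs):
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        codes.append(proc.returncode)
    return codes


def summarize(codes):
    """Return how many runs were made and which ones (1-based) failed."""
    failed = [i + 1 for i, code in enumerate(codes) if code != 0]
    return {"runs": len(codes), "failed_runs": failed}


# Example invocation (assumes text_generation.py is in the current directory):
# cmd = [sys.executable, "text_generation.py", "--gpu", "--mem", "full_mem"]
# print(summarize(run_repeatedly(cmd, 5)))
```

Recording exit codes rather than parsing the log text keeps the sketch independent of the exact error message, which matters here because the low_mem crash messages reportedly change from run to run.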
SCRIPT (MOD. TEXT GENERATION)
"""\
Text generation (modified).
"""

import argparse
import sys

import pyeddl.eddl as eddl
from pyeddl.tensor import Tensor
import numpy as np

MEM_CHOICES = ("low_mem", "mid_mem", "full_mem")


def main(args):
    epochs = 1
    olength = 20
    outvs = 2000
    embdim = 32

    # True: remove last layers and set new top = flatten
    # new input_size: [3, 256, 256] (from [224, 224, 3])
    net = eddl.download_resnet18(True, [3, 256, 256])
    lreshape = eddl.getLayer(net, "top")
    dense_layer = eddl.HeUniform(eddl.Dense(lreshape, 20, name="out_dense"))
    cnn_out = eddl.Sigmoid(dense_layer, name="cnn_out")
    concat = eddl.Concat([lreshape, cnn_out], name="cnn_concat")

    # create a new model from input output
    image_in = eddl.getLayer(net, "input")

    # Decoder
    ldecin = eddl.Input([outvs])
    ldec = eddl.ReduceArgMax(ldecin, [0])
    ldec = eddl.RandomUniform(
        eddl.Embedding(ldec, outvs, 1, embdim, True), -0.05, 0.05
    )
    ldec = eddl.Concat([ldec, concat])
    layer = eddl.LSTM(ldec, 512, True)
    out = eddl.Softmax(eddl.Dense(layer, outvs), name="out_cnn")
    eddl.setDecoder(ldecin)
    net = eddl.Model([image_in], [out])

    # Build model
    eddl.build(
        net,
        eddl.adam(0.01),
        ["softmax_cross_entropy"],
        ["accuracy"],
        eddl.CS_GPU(mem=args.mem) if args.gpu else eddl.CS_CPU(mem=args.mem)
    )
    eddl.summary(net)

    # Load dataset
    x_train = Tensor.randn([48, 256, 256, 3])  # Tensor.load("flickr_trX.bin", "bin")
    y_train = Tensor.fromarray(np.random.randint(0, 2, (48, 20)))
    xtrain = Tensor.permute(x_train, [0, 3, 1, 2])
    y_train = Tensor.onehot(y_train, outvs)
    # batch x timesteps x input_dim
    y_train.reshape_([y_train.shape[0], olength, outvs])

    eddl.fit(net, [xtrain], [y_train], args.batch_size, epochs)
    # eddl.save(net, "img2text.bin", "bin")
    print("about to export")
    eddl.save_net_to_onnx_file(net, "img2text.onnx")  # error here
    print("All done")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--batch-size", type=int, metavar="INT", default=24)
    parser.add_argument("--gpu", action="store_true")
    parser.add_argument("--small", action="store_true")
    # crashes with a segfault on low_mem
    parser.add_argument("--mem", metavar="|".join(MEM_CHOICES),
                        choices=MEM_CHOICES, default="full_mem")
    main(parser.parse_args(sys.argv[1:]))
The Python version of save_net_to_onnx_file just calls the corresponding C++ function. The only difference is that in PyEDDL 1.3.0 the seq_len argument was not exposed to the Python interface, but that has no effect on the function's behavior when the argument is not used. You should report this to the EDDL team. You can also try again with PyEDDL 1.3.1 (note it's just been released, so Docker images and Conda packages are not available yet) and see if setting seq_len makes any difference.
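To try the seq_len suggestion without breaking on an installation where the argument is not exposed, a small wrapper could pass it only when the installed function accepts it. This is a sketch under assumptions: the helper name export_onnx is hypothetical, and it assumes that the newer release accepts the argument as a keyword named seq_len, which should be checked against the installed PyEDDL.

```python
import inspect


def export_onnx(export_fn, net, path, seq_len):
    """Call `export_fn`, passing `seq_len` only if the function accepts it.

    `export_fn` is expected to be something like
    eddl.save_net_to_onnx_file; it is passed in explicitly so that the
    helper does not depend on a specific PyEDDL version.
    """
    try:
        params = inspect.signature(export_fn).parameters
    except (TypeError, ValueError):
        # Signature not introspectable: fall back to the two-argument call.
        params = {}
    if "seq_len" in params:
        # Assumption: the argument is accepted as a keyword named seq_len.
        return export_fn(net, path, seq_len=seq_len)
    return export_fn(net, path)


# Example use in the script above, where olength is the output sequence length:
# export_onnx(eddl.save_net_to_onnx_file, net, "img2text.onnx", olength)
```

Checking the signature, rather than catching TypeError around the call, avoids hiding a TypeError raised inside the export itself.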
Python:
nVidia/CUDA:
Libraries:
Test run on a linux pod running on the OpenDeepHealth platform: