error on Windows 10 #33

Open
Magotraa opened this issue Apr 13, 2017 · 22 comments
Comments

@Magotraa

Magotraa commented Apr 13, 2017

Hi,
Thank you for the repository.
I installed the requirements and started the process as described, and I was able to prepare the preprocessed data. However, when I run Train.py, I get the error:
"TypeError: Cannot convert Type TensorType(int32, vector) (of Variable <TensorType(int32, vector)>) into Type TensorType(int64, vector). You can try to manually convert <TensorType(int32, vector)> into a TensorType(int64, vector)."

@Magotraa Magotraa changed the title on Apr 13, 2017 to include the full TypeError message quoted above.
@hma02
Contributor

hma02 commented Apr 13, 2017

@AryanBhardwaj
Could you provide the full Traceback of the error? I just want to see the files that produce this error.

@Magotraa
Author

@hma02
Thank you for your reply. I was able to resolve the error by modifying alex_net.py at line 26, where y = T.ivector('y') is defined.
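For anyone hitting the same mismatch: the TypeError says the label vector fed into the graph has one integer width while the graph's input variable declares another. A minimal data-side sketch of the fix, assuming the labels are a NumPy array (the array below is illustrative, not from the repo):

```python
import numpy as np

# Hypothetical label batch; the preprocessed labels may load as int64
labels = np.array([3, 1, 4, 1, 5], dtype=np.int64)

# Cast to int32 so the batch matches a Theano T.ivector('y'),
# which declares an int32 vector
labels_i32 = labels.astype(np.int32)

assert labels_i32.dtype == np.int32
```

The declaration side works too: in Theano, `T.ivector` declares an int32 vector and `T.lvector` an int64 one, so changing the declaration on line 26 to match the labels' actual dtype avoids the cast entirely.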

@hma02
Contributor

hma02 commented Apr 14, 2017

This problem was also mentioned in #32

@Magotraa
Author

Magotraa commented Apr 16, 2017

@hma02
Yes, I did refer to it, thank you. However, I am still getting this issue. Could you please suggest a solution?

Error:
epoch 56: validation loss nan
epoch 56: validation error nan %

The complete output is below:

WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: GeForce GTX 1080 (CNMeM is enabled with initial size: 80.0% of memory, cuDNN 5110)

... building the model
conv (cudnn) layer with shape_in: (3, 227, 227, 256)
conv (cudnn) layer with shape_in: (96, 27, 27, 256)
conv (cudnn) layer with shape_in: (256, 13, 13, 256)
conv (cudnn) layer with shape_in: (384, 13, 13, 256)
conv (cudnn) layer with shape_in: (384, 13, 13, 256)
fc layer with num_in: 9216 num_out: 4096
dropout layer with P_drop: 0.5
fc layer with num_in: 4096 num_out: 4096
dropout layer with P_drop: 0.5
softmax layer with num_in: 4096 num_out: 1000
... training
epoch 1: validation loss nan
epoch 1: validation error nan %
weight saved: W_0_1
weight saved: b_0_1
weight saved: W0_1_1
weight saved: W1_1_1
weight saved: b0_1_1
weight saved: b1_1_1
weight saved: W_2_1
weight saved: b_2_1
weight saved: W0_3_1
weight saved: W1_3_1
weight saved: b0_3_1
weight saved: b1_3_1
weight saved: W0_4_1
weight saved: W1_4_1
weight saved: b0_4_1
weight saved: b1_4_1
weight saved: W_5_1
weight saved: b_5_1
weight saved: W_6_1
weight saved: b_6_1
weight saved: W_7_1
weight saved: b_7_1
epoch 2: validation loss nan
epoch 2: validation error nan %
weight saved: W_0_2
weight saved: b_0_2
weight saved: W0_1_2
weight saved: W1_1_2
weight saved: b0_1_2
weight saved: b1_1_2
weight saved: W_2_2
weight saved: b_2_2
weight saved: W0_3_2
weight saved: W1_3_2
weight saved: b0_3_2
weight saved: b1_3_2
weight saved: W0_4_2
weight saved: W1_4_2
weight saved: b0_4_2
weight saved: b1_4_2
weight saved: W_5_2
weight saved: b_5_2
weight saved: W_6_2
weight saved: b_6_2
weight saved: W_7_2
weight saved: b_7_2
epoch 3: validation loss nan
epoch 3: validation error nan %
weight saved: W_0_3
weight saved: b_0_3
weight saved: W0_1_3
weight saved: W1_1_3
weight saved: b0_1_3
weight saved: b1_1_3
weight saved: W_2_3
weight saved: b_2_3
weight saved: W0_3_3
weight saved: W1_3_3
weight saved: b0_3_3
weight saved: b1_3_3
weight saved: W0_4_3
weight saved: W1_4_3
weight saved: b0_4_3
weight saved: b1_4_3
weight saved: W_5_3
weight saved: b_5_3
weight saved: W_6_3
weight saved: b_6_3
weight saved: W_7_3
weight saved: b_7_3
epoch 4: validation loss nan
epoch 4: validation error nan %
weight saved: W_0_4
weight saved: b_0_4
weight saved: W0_1_4
weight saved: W1_1_4
weight saved: b0_1_4
weight saved: b1_1_4
weight saved: W_2_4
weight saved: b_2_4
weight saved: W0_3_4
weight saved: W1_3_4
weight saved: b0_3_4
weight saved: b1_3_4
weight saved: W0_4_4
weight saved: W1_4_4
weight saved: b0_4_4
weight saved: b1_4_4
weight saved: W_5_4
weight saved: b_5_4
weight saved: W_6_4
weight saved: b_6_4
weight saved: W_7_4
weight saved: b_7_4
epoch 5: validation loss nan
epoch 5: validation error nan %
weight saved: W_0_5
weight saved: b_0_5
weight saved: W0_1_5
weight saved: W1_1_5
weight saved: b0_1_5
weight saved: b1_1_5
weight saved: W_2_5
weight saved: b_2_5
weight saved: W0_3_5
weight saved: W1_3_5
weight saved: b0_3_5
weight saved: b1_3_5
weight saved: W0_4_5
weight saved: W1_4_5
weight saved: b0_4_5
weight saved: b1_4_5
weight saved: W_5_5
weight saved: b_5_5
weight saved: W_6_5
weight saved: b_6_5
weight saved: W_7_5
weight saved: b_7_5
epoch 6: validation loss nan
epoch 6: validation error nan %
weight saved: W_0_6
weight saved: b_0_6
weight saved: W0_1_6
weight saved: W1_1_6
weight saved: b0_1_6
weight saved: b1_1_6
weight saved: W_2_6
weight saved: b_2_6
weight saved: W0_3_6
weight saved: W1_3_6
weight saved: b0_3_6
weight saved: b1_3_6
weight saved: W0_4_6
weight saved: W1_4_6
weight saved: b0_4_6
weight saved: b1_4_6
weight saved: W_5_6
weight saved: b_5_6
weight saved: W_6_6
weight saved: b_6_6
weight saved: W_7_6
weight saved: b_7_6
epoch 7: validation loss nan
epoch 7: validation error nan %
weight saved: W_0_7
weight saved: b_0_7
weight saved: W0_1_7
weight saved: W1_1_7
weight saved: b0_1_7
weight saved: b1_1_7
weight saved: W_2_7
weight saved: b_2_7
weight saved: W0_3_7
weight saved: W1_3_7
weight saved: b0_3_7
weight saved: b1_3_7
weight saved: W0_4_7
weight saved: W1_4_7
weight saved: b0_4_7
weight saved: b1_4_7
weight saved: W_5_7
weight saved: b_5_7
weight saved: W_6_7
weight saved: b_6_7
weight saved: W_7_7
weight saved: b_7_7
epoch 8: validation loss nan
epoch 8: validation error nan %
weight saved: W_0_8
weight saved: b_0_8
weight saved: W0_1_8
weight saved: W1_1_8
weight saved: b0_1_8
weight saved: b1_1_8
weight saved: W_2_8
weight saved: b_2_8
weight saved: W0_3_8
weight saved: W1_3_8
weight saved: b0_3_8
weight saved: b1_3_8
weight saved: W0_4_8
weight saved: W1_4_8
weight saved: b0_4_8
weight saved: b1_4_8
weight saved: W_5_8
weight saved: b_5_8
weight saved: W_6_8
weight saved: b_6_8
weight saved: W_7_8
weight saved: b_7_8
epoch 9: validation loss nan
epoch 9: validation error nan %
weight saved: W_0_9
weight saved: b_0_9
weight saved: W0_1_9
weight saved: W1_1_9
weight saved: b0_1_9
weight saved: b1_1_9
weight saved: W_2_9
weight saved: b_2_9
weight saved: W0_3_9
weight saved: W1_3_9
weight saved: b0_3_9
weight saved: b1_3_9
weight saved: W0_4_9
weight saved: W1_4_9
weight saved: b0_4_9
weight saved: b1_4_9
weight saved: W_5_9
weight saved: b_5_9
weight saved: W_6_9
weight saved: b_6_9
weight saved: W_7_9
weight saved: b_7_9
epoch 10: validation loss nan
epoch 10: validation error nan %
('Learning rate changed to:', array(0.0009999999310821295, dtype=float32))
weight saved: W_0_10
weight saved: b_0_10
weight saved: W0_1_10
weight saved: W1_1_10
weight saved: b0_1_10
weight saved: b1_1_10
weight saved: W_2_10
weight saved: b_2_10
weight saved: W0_3_10
weight saved: W1_3_10
weight saved: b0_3_10
weight saved: b1_3_10
weight saved: W0_4_10
weight saved: W1_4_10
weight saved: b0_4_10
weight saved: b1_4_10
weight saved: W_5_10
weight saved: b_5_10
weight saved: W_6_10
weight saved: b_6_10
weight saved: W_7_10
weight saved: b_7_10
epoch 11: validation loss nan
epoch 11: validation error nan %
weight saved: W_0_11
weight saved: b_0_11
weight saved: W0_1_11
weight saved: W1_1_11
weight saved: b0_1_11
weight saved: b1_1_11
weight saved: W_2_11
weight saved: b_2_11
weight saved: W0_3_11
weight saved: W1_3_11
weight saved: b0_3_11
weight saved: b1_3_11
weight saved: W0_4_11
weight saved: W1_4_11
weight saved: b0_4_11
weight saved: b1_4_11
weight saved: W_5_11
weight saved: b_5_11
weight saved: W_6_11
weight saved: b_6_11
weight saved: W_7_11
weight saved: b_7_11
epoch 12: validation loss nan
epoch 12: validation error nan %
weight saved: W_0_12
weight saved: b_0_12
weight saved: W0_1_12
weight saved: W1_1_12
weight saved: b0_1_12
weight saved: b1_1_12
weight saved: W_2_12
weight saved: b_2_12
weight saved: W0_3_12
weight saved: W1_3_12
weight saved: b0_3_12
weight saved: b1_3_12
weight saved: W0_4_12
weight saved: W1_4_12
weight saved: b0_4_12
weight saved: b1_4_12
weight saved: W_5_12
weight saved: b_5_12
weight saved: W_6_12
weight saved: b_6_12
weight saved: W_7_12
weight saved: b_7_12
epoch 13: validation loss nan
epoch 13: validation error nan %
weight saved: W_0_13
weight saved: b_0_13
weight saved: W0_1_13
weight saved: W1_1_13
weight saved: b0_1_13
weight saved: b1_1_13
weight saved: W_2_13
weight saved: b_2_13
weight saved: W0_3_13
weight saved: W1_3_13
weight saved: b0_3_13
weight saved: b1_3_13
weight saved: W0_4_13
weight saved: W1_4_13
weight saved: b0_4_13
weight saved: b1_4_13
weight saved: W_5_13
weight saved: b_5_13
weight saved: W_6_13
weight saved: b_6_13
weight saved: W_7_13
weight saved: b_7_13
epoch 14: validation loss nan
epoch 14: validation error nan %
weight saved: W_0_14
weight saved: b_0_14
weight saved: W0_1_14
weight saved: W1_1_14
weight saved: b0_1_14
weight saved: b1_1_14
weight saved: W_2_14
weight saved: b_2_14
weight saved: W0_3_14
weight saved: W1_3_14
weight saved: b0_3_14
weight saved: b1_3_14
weight saved: W0_4_14
weight saved: W1_4_14
weight saved: b0_4_14
weight saved: b1_4_14
weight saved: W_5_14
weight saved: b_5_14
weight saved: W_6_14
weight saved: b_6_14
weight saved: W_7_14
weight saved: b_7_14
epoch 15: validation loss nan
epoch 15: validation error nan %
weight saved: W_0_15
weight saved: b_0_15
weight saved: W0_1_15
weight saved: W1_1_15
weight saved: b0_1_15
weight saved: b1_1_15
weight saved: W_2_15
weight saved: b_2_15
weight saved: W0_3_15
weight saved: W1_3_15
weight saved: b0_3_15
weight saved: b1_3_15
weight saved: W0_4_15
weight saved: W1_4_15
weight saved: b0_4_15
weight saved: b1_4_15
weight saved: W_5_15
weight saved: b_5_15
weight saved: W_6_15
weight saved: b_6_15
weight saved: W_7_15
weight saved: b_7_15
epoch 16: validation loss nan
epoch 16: validation error nan %
weight saved: W_0_16
weight saved: b_0_16
weight saved: W0_1_16
weight saved: W1_1_16
weight saved: b0_1_16
weight saved: b1_1_16
weight saved: W_2_16
weight saved: b_2_16
weight saved: W0_3_16
weight saved: W1_3_16
weight saved: b0_3_16
weight saved: b1_3_16
weight saved: W0_4_16
weight saved: W1_4_16
weight saved: b0_4_16
weight saved: b1_4_16
weight saved: W_5_16
weight saved: b_5_16
weight saved: W_6_16
weight saved: b_6_16
weight saved: W_7_16
weight saved: b_7_16
epoch 17: validation loss nan
epoch 17: validation error nan %
weight saved: W_0_17
weight saved: b_0_17
weight saved: W0_1_17
weight saved: W1_1_17
weight saved: b0_1_17
weight saved: b1_1_17
weight saved: W_2_17
weight saved: b_2_17
weight saved: W0_3_17
weight saved: W1_3_17
weight saved: b0_3_17
weight saved: b1_3_17
weight saved: W0_4_17
weight saved: W1_4_17
weight saved: b0_4_17
weight saved: b1_4_17
weight saved: W_5_17
weight saved: b_5_17
weight saved: W_6_17
weight saved: b_6_17
weight saved: W_7_17
weight saved: b_7_17
epoch 18: validation loss nan
epoch 18: validation error nan %
weight saved: W_0_18
weight saved: b_0_18
weight saved: W0_1_18
weight saved: W1_1_18
weight saved: b0_1_18
weight saved: b1_1_18
weight saved: W_2_18
weight saved: b_2_18
weight saved: W0_3_18
weight saved: W1_3_18
weight saved: b0_3_18
weight saved: b1_3_18
weight saved: W0_4_18
weight saved: W1_4_18
weight saved: b0_4_18
weight saved: b1_4_18
weight saved: W_5_18
weight saved: b_5_18
weight saved: W_6_18
weight saved: b_6_18
weight saved: W_7_18
weight saved: b_7_18
epoch 19: validation loss nan
epoch 19: validation error nan %
weight saved: W_0_19
weight saved: b_0_19
weight saved: W0_1_19
weight saved: W1_1_19
weight saved: b0_1_19
weight saved: b1_1_19
weight saved: W_2_19
weight saved: b_2_19
weight saved: W0_3_19
weight saved: W1_3_19
weight saved: b0_3_19
weight saved: b1_3_19
weight saved: W0_4_19
weight saved: W1_4_19
weight saved: b0_4_19
weight saved: b1_4_19
weight saved: W_5_19
weight saved: b_5_19
weight saved: W_6_19
weight saved: b_6_19
weight saved: W_7_19
weight saved: b_7_19
epoch 20: validation loss nan
epoch 20: validation error nan %
('Learning rate changed to:', array(9.99999901978299e-05, dtype=float32))
weight saved: W_0_20
weight saved: b_0_20
weight saved: W0_1_20
weight saved: W1_1_20
weight saved: b0_1_20
weight saved: b1_1_20
weight saved: W_2_20
weight saved: b_2_20
weight saved: W0_3_20
weight saved: W1_3_20
weight saved: b0_3_20
weight saved: b1_3_20
weight saved: W0_4_20
weight saved: W1_4_20
weight saved: b0_4_20
weight saved: b1_4_20
weight saved: W_5_20
weight saved: b_5_20
weight saved: W_6_20
weight saved: b_6_20
weight saved: W_7_20
weight saved: b_7_20
epoch 21: validation loss nan
epoch 21: validation error nan %
weight saved: W_0_21
weight saved: b_0_21
weight saved: W0_1_21
weight saved: W1_1_21
weight saved: b0_1_21
weight saved: b1_1_21
weight saved: W_2_21
weight saved: b_2_21
weight saved: W0_3_21
weight saved: W1_3_21
weight saved: b0_3_21
weight saved: b1_3_21
weight saved: W0_4_21
weight saved: W1_4_21
weight saved: b0_4_21
weight saved: b1_4_21
weight saved: W_5_21
weight saved: b_5_21
weight saved: W_6_21
weight saved: b_6_21
weight saved: W_7_21
weight saved: b_7_21
epoch 22: validation loss nan
epoch 22: validation error nan %
weight saved: W_0_22
weight saved: b_0_22
weight saved: W0_1_22
weight saved: W1_1_22
weight saved: b0_1_22
weight saved: b1_1_22
weight saved: W_2_22
weight saved: b_2_22
weight saved: W0_3_22
weight saved: W1_3_22
weight saved: b0_3_22
weight saved: b1_3_22
weight saved: W0_4_22
weight saved: W1_4_22
weight saved: b0_4_22
weight saved: b1_4_22
weight saved: W_5_22
weight saved: b_5_22
weight saved: W_6_22
weight saved: b_6_22
weight saved: W_7_22
weight saved: b_7_22
epoch 23: validation loss nan
epoch 23: validation error nan %
weight saved: W_0_23
weight saved: b_0_23
weight saved: W0_1_23
weight saved: W1_1_23
weight saved: b0_1_23
weight saved: b1_1_23
weight saved: W_2_23
weight saved: b_2_23
weight saved: W0_3_23
weight saved: W1_3_23
weight saved: b0_3_23
weight saved: b1_3_23
weight saved: W0_4_23
weight saved: W1_4_23
weight saved: b0_4_23
weight saved: b1_4_23
weight saved: W_5_23
weight saved: b_5_23
weight saved: W_6_23
weight saved: b_6_23
weight saved: W_7_23
weight saved: b_7_23
epoch 24: validation loss nan
epoch 24: validation error nan %
weight saved: W_0_24
weight saved: b_0_24
weight saved: W0_1_24
weight saved: W1_1_24
weight saved: b0_1_24
weight saved: b1_1_24
weight saved: W_2_24
weight saved: b_2_24
weight saved: W0_3_24
weight saved: W1_3_24
weight saved: b0_3_24
weight saved: b1_3_24
weight saved: W0_4_24
weight saved: W1_4_24
weight saved: b0_4_24
weight saved: b1_4_24
weight saved: W_5_24
weight saved: b_5_24
weight saved: W_6_24
weight saved: b_6_24
weight saved: W_7_24
weight saved: b_7_24
epoch 25: validation loss nan
epoch 25: validation error nan %
weight saved: W_0_25
weight saved: b_0_25
weight saved: W0_1_25
weight saved: W1_1_25
weight saved: b0_1_25
weight saved: b1_1_25
weight saved: W_2_25
weight saved: b_2_25
weight saved: W0_3_25
weight saved: W1_3_25
weight saved: b0_3_25
weight saved: b1_3_25
weight saved: W0_4_25
weight saved: W1_4_25
weight saved: b0_4_25
weight saved: b1_4_25
weight saved: W_5_25
weight saved: b_5_25
weight saved: W_6_25
weight saved: b_6_25
weight saved: W_7_25
weight saved: b_7_25
epoch 26: validation loss nan
epoch 26: validation error nan %
weight saved: W_0_26
weight saved: b_0_26
weight saved: W0_1_26
weight saved: W1_1_26
weight saved: b0_1_26
weight saved: b1_1_26
weight saved: W_2_26
weight saved: b_2_26
weight saved: W0_3_26
weight saved: W1_3_26
weight saved: b0_3_26
weight saved: b1_3_26
weight saved: W0_4_26
weight saved: W1_4_26
weight saved: b0_4_26
weight saved: b1_4_26
weight saved: W_5_26
weight saved: b_5_26
weight saved: W_6_26
weight saved: b_6_26
weight saved: W_7_26
weight saved: b_7_26
epoch 27: validation loss nan
epoch 27: validation error nan %
weight saved: W_0_27
weight saved: b_0_27
weight saved: W0_1_27
weight saved: W1_1_27
weight saved: b0_1_27
weight saved: b1_1_27
weight saved: W_2_27
weight saved: b_2_27
weight saved: W0_3_27
weight saved: W1_3_27
weight saved: b0_3_27
weight saved: b1_3_27
weight saved: W0_4_27
weight saved: W1_4_27
weight saved: b0_4_27
weight saved: b1_4_27
weight saved: W_5_27
weight saved: b_5_27
weight saved: W_6_27
weight saved: b_6_27
weight saved: W_7_27
weight saved: b_7_27
epoch 28: validation loss nan
epoch 28: validation error nan %
weight saved: W_0_28
weight saved: b_0_28
weight saved: W0_1_28
weight saved: W1_1_28
weight saved: b0_1_28
weight saved: b1_1_28
weight saved: W_2_28
weight saved: b_2_28
weight saved: W0_3_28
weight saved: W1_3_28
weight saved: b0_3_28
weight saved: b1_3_28
weight saved: W0_4_28
weight saved: W1_4_28
weight saved: b0_4_28
weight saved: b1_4_28
weight saved: W_5_28
weight saved: b_5_28
weight saved: W_6_28
weight saved: b_6_28
weight saved: W_7_28
weight saved: b_7_28
epoch 29: validation loss nan
epoch 29: validation error nan %
weight saved: W_0_29
weight saved: b_0_29
weight saved: W0_1_29
weight saved: W1_1_29
weight saved: b0_1_29
weight saved: b1_1_29
weight saved: W_2_29
weight saved: b_2_29
weight saved: W0_3_29
weight saved: W1_3_29
weight saved: b0_3_29
weight saved: b1_3_29
weight saved: W0_4_29
weight saved: W1_4_29
weight saved: b0_4_29
weight saved: b1_4_29
weight saved: W_5_29
weight saved: b_5_29
weight saved: W_6_29
weight saved: b_6_29
weight saved: W_7_29
weight saved: b_7_29
epoch 30: validation loss nan
epoch 30: validation error nan %
weight saved: W_0_30
weight saved: b_0_30
weight saved: W0_1_30
weight saved: W1_1_30
weight saved: b0_1_30
weight saved: b1_1_30
weight saved: W_2_30
weight saved: b_2_30
weight saved: W0_3_30
weight saved: W1_3_30
weight saved: b0_3_30
weight saved: b1_3_30
weight saved: W0_4_30
weight saved: W1_4_30
weight saved: b0_4_30
weight saved: b1_4_30
weight saved: W_5_30
weight saved: b_5_30
weight saved: W_6_30
weight saved: b_6_30
weight saved: W_7_30
weight saved: b_7_30
epoch 31: validation loss nan
epoch 31: validation error nan %
weight saved: W_0_31
weight saved: b_0_31
weight saved: W0_1_31
weight saved: W1_1_31
weight saved: b0_1_31
weight saved: b1_1_31
weight saved: W_2_31
weight saved: b_2_31
weight saved: W0_3_31
weight saved: W1_3_31
weight saved: b0_3_31
weight saved: b1_3_31
weight saved: W0_4_31
weight saved: W1_4_31
weight saved: b0_4_31
weight saved: b1_4_31
weight saved: W_5_31
weight saved: b_5_31
weight saved: W_6_31
weight saved: b_6_31
weight saved: W_7_31
weight saved: b_7_31
epoch 32: validation loss nan
epoch 32: validation error nan %
weight saved: W_0_32
weight saved: b_0_32
weight saved: W0_1_32
weight saved: W1_1_32
weight saved: b0_1_32
weight saved: b1_1_32
weight saved: W_2_32
weight saved: b_2_32
weight saved: W0_3_32
weight saved: W1_3_32
weight saved: b0_3_32
weight saved: b1_3_32
weight saved: W0_4_32
weight saved: W1_4_32
weight saved: b0_4_32
weight saved: b1_4_32
weight saved: W_5_32
weight saved: b_5_32
weight saved: W_6_32
weight saved: b_6_32
weight saved: W_7_32
weight saved: b_7_32
epoch 33: validation loss nan
epoch 33: validation error nan %
weight saved: W_0_33
weight saved: b_0_33
weight saved: W0_1_33
weight saved: W1_1_33
weight saved: b0_1_33
weight saved: b1_1_33
weight saved: W_2_33
weight saved: b_2_33
weight saved: W0_3_33
weight saved: W1_3_33
weight saved: b0_3_33
weight saved: b1_3_33
weight saved: W0_4_33
weight saved: W1_4_33
weight saved: b0_4_33
weight saved: b1_4_33
weight saved: W_5_33
weight saved: b_5_33
weight saved: W_6_33
weight saved: b_6_33
weight saved: W_7_33
weight saved: b_7_33
epoch 34: validation loss nan
epoch 34: validation error nan %
weight saved: W_0_34
weight saved: b_0_34
weight saved: W0_1_34
weight saved: W1_1_34
weight saved: b0_1_34
weight saved: b1_1_34
weight saved: W_2_34
weight saved: b_2_34
weight saved: W0_3_34
weight saved: W1_3_34
weight saved: b0_3_34
weight saved: b1_3_34
weight saved: W0_4_34
weight saved: W1_4_34
weight saved: b0_4_34
weight saved: b1_4_34
weight saved: W_5_34
weight saved: b_5_34
weight saved: W_6_34
weight saved: b_6_34
weight saved: W_7_34
weight saved: b_7_34
epoch 35: validation loss nan
epoch 35: validation error nan %
weight saved: W_0_35
weight saved: b_0_35
weight saved: W0_1_35
weight saved: W1_1_35
weight saved: b0_1_35
weight saved: b1_1_35
weight saved: W_2_35
weight saved: b_2_35
weight saved: W0_3_35
weight saved: W1_3_35
weight saved: b0_3_35
weight saved: b1_3_35
weight saved: W0_4_35
weight saved: W1_4_35
weight saved: b0_4_35
weight saved: b1_4_35
weight saved: W_5_35
weight saved: b_5_35
weight saved: W_6_35
weight saved: b_6_35
weight saved: W_7_35
weight saved: b_7_35
epoch 36: validation loss nan
epoch 36: validation error nan %
weight saved: W_0_36
weight saved: b_0_36
weight saved: W0_1_36
weight saved: W1_1_36
weight saved: b0_1_36
weight saved: b1_1_36
weight saved: W_2_36
weight saved: b_2_36
weight saved: W0_3_36
weight saved: W1_3_36
weight saved: b0_3_36
weight saved: b1_3_36
weight saved: W0_4_36
weight saved: W1_4_36
weight saved: b0_4_36
weight saved: b1_4_36
weight saved: W_5_36
weight saved: b_5_36
weight saved: W_6_36
weight saved: b_6_36
weight saved: W_7_36
weight saved: b_7_36
epoch 37: validation loss nan
epoch 37: validation error nan %
weight saved: W_0_37
weight saved: b_0_37
weight saved: W0_1_37
weight saved: W1_1_37
weight saved: b0_1_37
weight saved: b1_1_37
weight saved: W_2_37
weight saved: b_2_37
weight saved: W0_3_37
weight saved: W1_3_37
weight saved: b0_3_37
weight saved: b1_3_37
weight saved: W0_4_37
weight saved: W1_4_37
weight saved: b0_4_37
weight saved: b1_4_37
weight saved: W_5_37
weight saved: b_5_37
weight saved: W_6_37
weight saved: b_6_37
weight saved: W_7_37
weight saved: b_7_37
epoch 38: validation loss nan
epoch 38: validation error nan %
weight saved: W_0_38
weight saved: b_0_38
weight saved: W0_1_38
weight saved: W1_1_38
weight saved: b0_1_38
weight saved: b1_1_38
weight saved: W_2_38
weight saved: b_2_38
weight saved: W0_3_38
weight saved: W1_3_38
weight saved: b0_3_38
weight saved: b1_3_38
weight saved: W0_4_38
weight saved: W1_4_38
weight saved: b0_4_38
weight saved: b1_4_38
weight saved: W_5_38
weight saved: b_5_38
weight saved: W_6_38
weight saved: b_6_38
weight saved: W_7_38
weight saved: b_7_38
epoch 39: validation loss nan
epoch 39: validation error nan %
weight saved: W_0_39
weight saved: b_0_39
weight saved: W0_1_39
weight saved: W1_1_39
weight saved: b0_1_39
weight saved: b1_1_39
weight saved: W_2_39
weight saved: b_2_39
weight saved: W0_3_39
weight saved: W1_3_39
weight saved: b0_3_39
weight saved: b1_3_39
weight saved: W0_4_39
weight saved: W1_4_39
weight saved: b0_4_39
weight saved: b1_4_39
weight saved: W_5_39
weight saved: b_5_39
weight saved: W_6_39
weight saved: b_6_39
weight saved: W_7_39
weight saved: b_7_39
epoch 40: validation loss nan
epoch 40: validation error nan %
weight saved: W_0_40
weight saved: b_0_40
weight saved: W0_1_40
weight saved: W1_1_40
weight saved: b0_1_40
weight saved: b1_1_40
weight saved: W_2_40
weight saved: b_2_40
weight saved: W0_3_40
weight saved: W1_3_40
weight saved: b0_3_40
weight saved: b1_3_40
weight saved: W0_4_40
weight saved: W1_4_40
weight saved: b0_4_40
weight saved: b1_4_40
weight saved: W_5_40
weight saved: b_5_40
weight saved: W_6_40
weight saved: b_6_40
weight saved: W_7_40
weight saved: b_7_40
epoch 41: validation loss nan
epoch 41: validation error nan %
weight saved: W_0_41
weight saved: b_0_41
weight saved: W0_1_41
weight saved: W1_1_41
weight saved: b0_1_41
weight saved: b1_1_41
weight saved: W_2_41
weight saved: b_2_41
weight saved: W0_3_41
weight saved: W1_3_41
weight saved: b0_3_41
weight saved: b1_3_41
weight saved: W0_4_41
weight saved: W1_4_41
weight saved: b0_4_41
weight saved: b1_4_41
weight saved: W_5_41
weight saved: b_5_41
weight saved: W_6_41
weight saved: b_6_41
weight saved: W_7_41
weight saved: b_7_41
epoch 42: validation loss nan
epoch 42: validation error nan %
weight saved: W_0_42
weight saved: b_0_42
weight saved: W0_1_42
weight saved: W1_1_42
weight saved: b0_1_42
weight saved: b1_1_42
weight saved: W_2_42
weight saved: b_2_42
weight saved: W0_3_42
weight saved: W1_3_42
weight saved: b0_3_42
weight saved: b1_3_42
weight saved: W0_4_42
weight saved: W1_4_42
weight saved: b0_4_42
weight saved: b1_4_42
weight saved: W_5_42
weight saved: b_5_42
weight saved: W_6_42
weight saved: b_6_42
weight saved: W_7_42
weight saved: b_7_42
epoch 43: validation loss nan
epoch 43: validation error nan %
weight saved: W_0_43
weight saved: b_0_43
weight saved: W0_1_43
weight saved: W1_1_43
weight saved: b0_1_43
weight saved: b1_1_43
weight saved: W_2_43
weight saved: b_2_43
weight saved: W0_3_43
weight saved: W1_3_43
weight saved: b0_3_43
weight saved: b1_3_43
weight saved: W0_4_43
weight saved: W1_4_43
weight saved: b0_4_43
weight saved: b1_4_43
weight saved: W_5_43
weight saved: b_5_43
weight saved: W_6_43
weight saved: b_6_43
weight saved: W_7_43
weight saved: b_7_43
epoch 44: validation loss nan
epoch 44: validation error nan %
weight saved: W_0_44
weight saved: b_0_44
weight saved: W0_1_44
weight saved: W1_1_44
weight saved: b0_1_44
weight saved: b1_1_44
weight saved: W_2_44
weight saved: b_2_44
weight saved: W0_3_44
weight saved: W1_3_44
weight saved: b0_3_44
weight saved: b1_3_44
weight saved: W0_4_44
weight saved: W1_4_44
weight saved: b0_4_44
weight saved: b1_4_44
weight saved: W_5_44
weight saved: b_5_44
weight saved: W_6_44
weight saved: b_6_44
weight saved: W_7_44
weight saved: b_7_44
epoch 45: validation loss nan
epoch 45: validation error nan %
weight saved: W_0_45
weight saved: b_0_45
weight saved: W0_1_45
weight saved: W1_1_45
weight saved: b0_1_45
weight saved: b1_1_45
weight saved: W_2_45
weight saved: b_2_45
weight saved: W0_3_45
weight saved: W1_3_45
weight saved: b0_3_45
weight saved: b1_3_45
weight saved: W0_4_45
weight saved: W1_4_45
weight saved: b0_4_45
weight saved: b1_4_45
weight saved: W_5_45
weight saved: b_5_45
weight saved: W_6_45
weight saved: b_6_45
weight saved: W_7_45
weight saved: b_7_45
epoch 46: validation loss nan
epoch 46: validation error nan %
weight saved: W_0_46
weight saved: b_0_46
weight saved: W0_1_46
weight saved: W1_1_46
weight saved: b0_1_46
weight saved: b1_1_46
weight saved: W_2_46
weight saved: b_2_46
weight saved: W0_3_46
weight saved: W1_3_46
weight saved: b0_3_46
weight saved: b1_3_46
weight saved: W0_4_46
weight saved: W1_4_46
weight saved: b0_4_46
weight saved: b1_4_46
weight saved: W_5_46
weight saved: b_5_46
weight saved: W_6_46
weight saved: b_6_46
weight saved: W_7_46
weight saved: b_7_46
epoch 47: validation loss nan
epoch 47: validation error nan %
weight saved: W_0_47
weight saved: b_0_47
weight saved: W0_1_47
weight saved: W1_1_47
weight saved: b0_1_47
weight saved: b1_1_47
weight saved: W_2_47
weight saved: b_2_47
weight saved: W0_3_47
weight saved: W1_3_47
weight saved: b0_3_47
weight saved: b1_3_47
weight saved: W0_4_47
weight saved: W1_4_47
weight saved: b0_4_47
weight saved: b1_4_47
weight saved: W_5_47
weight saved: b_5_47
weight saved: W_6_47
weight saved: b_6_47
weight saved: W_7_47
weight saved: b_7_47
epoch 48: validation loss nan
epoch 48: validation error nan %
weight saved: W_0_48
weight saved: b_0_48
weight saved: W0_1_48
weight saved: W1_1_48
weight saved: b0_1_48
weight saved: b1_1_48
weight saved: W_2_48
weight saved: b_2_48
weight saved: W0_3_48
weight saved: W1_3_48
weight saved: b0_3_48
weight saved: b1_3_48
weight saved: W0_4_48
weight saved: W1_4_48
weight saved: b0_4_48
weight saved: b1_4_48
weight saved: W_5_48
weight saved: b_5_48
weight saved: W_6_48
weight saved: b_6_48
weight saved: W_7_48
weight saved: b_7_48
epoch 49: validation loss nan
epoch 49: validation error nan %
weight saved: W_0_49
weight saved: b_0_49
weight saved: W0_1_49
weight saved: W1_1_49
weight saved: b0_1_49
weight saved: b1_1_49
weight saved: W_2_49
weight saved: b_2_49
weight saved: W0_3_49
weight saved: W1_3_49
weight saved: b0_3_49
weight saved: b1_3_49
weight saved: W0_4_49
weight saved: W1_4_49
weight saved: b0_4_49
weight saved: b1_4_49
weight saved: W_5_49
weight saved: b_5_49
weight saved: W_6_49
weight saved: b_6_49
weight saved: W_7_49
weight saved: b_7_49
epoch 50: validation loss nan
epoch 50: validation error nan %
weight saved: W_0_50
weight saved: b_0_50
weight saved: W0_1_50
weight saved: W1_1_50
weight saved: b0_1_50
weight saved: b1_1_50
weight saved: W_2_50
weight saved: b_2_50
weight saved: W0_3_50
weight saved: W1_3_50
weight saved: b0_3_50
weight saved: b1_3_50
weight saved: W0_4_50
weight saved: W1_4_50
weight saved: b0_4_50
weight saved: b1_4_50
weight saved: W_5_50
weight saved: b_5_50
weight saved: W_6_50
weight saved: b_6_50
weight saved: W_7_50
weight saved: b_7_50
epoch 51: validation loss nan
epoch 51: validation error nan %
weight saved: W_0_51
weight saved: b_0_51
weight saved: W0_1_51
weight saved: W1_1_51
weight saved: b0_1_51
weight saved: b1_1_51
weight saved: W_2_51
weight saved: b_2_51
weight saved: W0_3_51
weight saved: W1_3_51
weight saved: b0_3_51
weight saved: b1_3_51
weight saved: W0_4_51
weight saved: W1_4_51
weight saved: b0_4_51
weight saved: b1_4_51
weight saved: W_5_51
weight saved: b_5_51
weight saved: W_6_51
weight saved: b_6_51
weight saved: W_7_51
weight saved: b_7_51
[... the identical "epoch N: validation loss nan / validation error nan %" and "weight saved" blocks repeat for epochs 52 through 60 ...]
Optimization complete.

PyCUDA ERROR: The context stack was not empty upon module cleanup.

A context was still active when the context stack was being
cleaned up. At this point in our execution, CUDA may already
have been deinitialized, so there is no way we can finish
cleanly. The program will be aborted now.
Use Context.pop() to avoid this problem.

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.

Process finished with exit code 0

Also:

If para_load is set to True, then I get this error:
LogicError: cuIpcGetMemHandle failed: OS call failed or operation not supported on this OS

(C:\Users\arjun\Anaconda2) D:\xxxxxyyyy>python train.py
Process Process-2:
Traceback (most recent call last):
File "C:\Users\arjun\Anaconda2\lib\multiprocessing\process.py", line 258, in _bootstrap
self.run()
File "C:\Users\arjun\Anaconda2\lib\multiprocessing\process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "D:\Rough\random\xxx\xxxxxyyyy\proc_load.py", line 98, in fun_load
sock.bind('tcp://*:{0}'.format(sock_data))
File "zmq/backend/cython/socket.pyx", line 495, in zmq.backend.cython.socket.Socket.bind (zmq\backend\cython\socket.c:5653)
File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc (zmq\backend\cython\socket.c:10014)
raise ZMQError(errno)
ZMQError: Address in use

PyCUDA ERROR: The context stack was not empty upon module cleanup.

A context was still active when the context stack was being
cleaned up. At this point in our execution, CUDA may already
have been deinitialized, so there is no way we can finish
cleanly. The program will be aborted now.
Use Context.pop() to avoid this problem.

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: GeForce GTX 1080 (CNMeM is enabled with initial size: 80.0% of memory, cuDNN 5110)
... building the model
conv (cudnn) layer with shape_in: (3, 227, 227, 256)
conv (cudnn) layer with shape_in: (96, 27, 27, 256)
conv (cudnn) layer with shape_in: (256, 13, 13, 256)
conv (cudnn) layer with shape_in: (384, 13, 13, 256)
conv (cudnn) layer with shape_in: (384, 13, 13, 256)
fc layer with num_in: 9216 num_out: 4096
dropout layer with P_drop: 0.5
fc layer with num_in: 4096 num_out: 4096
dropout layer with P_drop: 0.5
softmax layer with num_in: 4096 num_out: 1000
... training
Process Process-1:
Traceback (most recent call last):
File "C:\Users\arjun\Anaconda2\lib\multiprocessing\process.py", line 258, in _bootstrap
self.run()
File "C:\Users\arjun\Anaconda2\lib\multiprocessing\process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "D:\Rough\random\xxx\xxxxxyyyy\train.py", line 69, in train_net
h = drv.mem_get_ipc_handle(gpuarray_batch.ptr)
LogicError: cuIpcGetMemHandle failed: OS call failed or operation not supported on this OS

PyCUDA ERROR: The context stack was not empty upon module cleanup.

A context was still active when the context stack was being
cleaned up. At this point in our execution, CUDA may already
have been deinitialized, so there is no way we can finish
cleanly. The program will be aborted now.
Use Context.pop() to avoid this problem.

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.

It would be really helpful if I could get some suggestions or an approximate solution. Thank you in advance.

@hma02
Copy link
Contributor

hma02 commented Apr 16, 2017

The "ZMQError: Address in use" error happens when a previous run failed and the socket port it opened was not closed properly, causing a port conflict in the next run. You can find the process holding the port by:

netstat -ltnp

and kill the corresponding process.
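(netstat -ltnp is the Linux form; on Windows, netstat -ano | findstr :<port> lists the owning PID, which taskkill /PID <pid> /F can then terminate.) A quick way to check from Python whether a port is still held before relaunching train.py -- a minimal sketch; the port to pass in is whatever sock_data is set to in your config:

```python
import socket

def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on the given TCP
    port -- the condition behind "ZMQError: Address in use"."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        return s.connect_ex((host, port)) == 0
    finally:
        s.close()
```

If this returns True before training starts, a stale process from a previous run is still holding the data-loading port.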

For the NaN issue: if it happens from the first epoch, it could be caused by the input batch not being fed or preprocessed correctly, or by too large a learning rate. See issue #27.
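As a concrete illustration of the "input batch not fed correctly" case, here is a minimal sanity check one could run on a loaded minibatch before training; the (batch, channel, h, w) layout and the 1000-class label range are assumptions based on the setup above:

```python
import numpy as np

def check_batch(x, y, n_classes=1000):
    """Cheap sanity checks on one minibatch; a NaN cost from the very
    first epoch is often bad input rather than a bad model."""
    assert np.isfinite(x).all(), "non-finite pixel values in batch"
    assert x.std() > 0, "constant batch -- data probably not loaded"
    assert y.min() >= 0 and y.max() < n_classes, "labels out of range"
    return True
```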

@Magotraa
Copy link
Author

@hma02
Thanks for sharing. I am trying the suggested solutions. Is there any solution on Windows 10 for:

LogicError: cuIpcGetMemHandle failed: OS call failed or operation not supported on this OS

@Magotraa
Copy link
Author

@hma02
Thank you for your suggestions on the NaN issue; the problem was that the training data could not be read. It is training now. Next, I want to know how to get good accuracy results.

Can you share after how many iterations I should expect reasonable accuracy? Also, could you share the optimized hyperparameters file config.yml?

current status is:

('training error rate:', array(0.984375))
('training @ iter = ', 2765)
('training cost:', array(6.374232292175293, dtype=float32))
('training error rate:', array(0.99609375))
('training @ iter = ', 2770)
('training cost:', array(6.3500189781188965, dtype=float32))
('training error rate:', array(0.984375))
('training @ iter = ', 2775)
('training cost:', array(6.216220855712891, dtype=float32))
('training error rate:', array(0.98828125))
('training @ iter = ', 2780)
('training cost:', array(6.231907844543457, dtype=float32))
('training error rate:', array(0.98828125))
('training @ iter = ', 2785)
('training cost:', array(6.30079460144043, dtype=float32))
('training error rate:', array(0.99609375))

@Magotraa
Copy link
Author

@hma02
Hi, I have this experiment running with the current results shown below. Can you suggest any improvements to achieve better accuracy and a lower training error?

('training cost:', array(4.295770168304443, dtype=float32))
('training error rate:', array(0.8046875))
('training @ iter = ', 8165)
('training cost:', array(4.224380016326904, dtype=float32))
('training error rate:', array(0.8125))
('training @ iter = ', 8170)
('training cost:', array(4.512507438659668, dtype=float32))
('training error rate:', array(0.90234375))
('training @ iter = ', 8175)
('training cost:', array(4.5337233543396, dtype=float32))
('training error rate:', array(0.8515625))
('training @ iter = ', 8180)
('training cost:', array(4.498597145080566, dtype=float32))
('training error rate:', array(0.82421875))
('training @ iter = ', 8185)
('training cost:', array(4.465353012084961, dtype=float32))
('training error rate:', array(0.84375))
('training @ iter = ', 8190)
('training cost:', array(4.593122482299805, dtype=float32))
('training error rate:', array(0.82421875))

@hma02
Copy link
Contributor

hma02 commented Apr 21, 2017

@AryanBhardwaj ,

Your training cost looks okay so far. Are you training on ImageNet data? If you follow the preprocess steps in this project, you will see 5004 batch files of batch size 256 for single GPU training. That means one epoch will take 5004 iterations. The hyperparams in config.yaml are already the optimized values found so far. That means you need to train for 60 epochs or 60*5004 iterations in total.
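The arithmetic behind that iteration count, spelled out:

```python
# 5004 batch files of 256 images per epoch (single-GPU preprocessing),
# trained for the 60 epochs set in config.yaml.
batches_per_epoch = 5004
n_epochs = 60
total_iterations = batches_per_epoch * n_epochs
print(total_iterations)  # 300240
```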

@Magotraa
Copy link
Author

@hma02
Thank you for the quick reply. Yes, you are correct, though the number of batch files may be slightly different. However, why do we have two training data folders, _hkl_b256_b_128 and train_hkl_b256_b_256? Is there a specific reason for the size-128 folder?

@hma02 hma02 changed the title error on Windows 10:: ERROR"TypeError: Cannot convert Type TensorType(int32, vector) (of Variable <TensorType(int32, vector)>) into Type TensorType(int64, vector). You can try to manually convert <TensorType(int32, vector)> into a TensorType(int64, vector)." error on Windows 10 Apr 24, 2017
@hma02
Copy link
Contributor

hma02 commented Apr 24, 2017

@AryanBhardwaj
This preprocessing setup is for multi-GPU training. Specifically, a single GPU trains with batch_size=256, two GPUs train with batch_size=128 each, four GPUs train with batch_size=64 each, and so on.
This preserves the effective batch size (n_GPUs*batch_size) when scaling to multiple GPUs.
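The scaling rule can be written down directly, using the GPU counts mentioned above:

```python
effective_batch = 256  # kept constant across configurations

for n_gpus in (1, 2, 4):
    per_gpu_batch = effective_batch // n_gpus
    # n_gpus * per_gpu_batch recovers the effective batch size
    assert n_gpus * per_gpu_batch == effective_batch
    print(n_gpus, per_gpu_batch)  # 1 256 / 2 128 / 4 64
```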

@Magotraa
Copy link
Author

@hma02
Thank you for this insight. Just wondering how long the training should take to complete. Also, do you know of a way to understand the weights better, i.e. to read the weight and bias values and interpret them?

I mean visualizing hidden-layer weight and bias values, reading them with some tool, or perhaps a text or reference that explains hidden-layer weights and biases in detail.

@Magotraa
Copy link
Author

Magotraa commented May 2, 2017

@hma02
Is there a specific naming pattern used for the weights of the different layers of the network? Any suggestions to aid my understanding?

Also, could you share some insight on the use of "group" in the convolution layers?

thank you in advance.

@hma02
Copy link
Contributor

hma02 commented May 4, 2017

@AryanBhardwaj

We benchmarked training speed on GTX 1080 and Tesla K80.
For GTX 1080, it takes 0.91h per epoch.
For Tesla K80, it takes 1.96h per epoch.
There are 60 epochs in total, so training takes around 54h on the GTX 1080 and around 120h on the Tesla K80.

We didn't experiment on visualizing weights. You can simply read those weight files using numpy.load().

To visualize the activation like here, you can construct another theano function to output the self.output of each layer and plot them using imshow from matplotlib.

The naming pattern of the saved weights is defined in this function: basically "layer_index" + "epoch". Some weights have a number following W or b, like W0/W1 or b0/b1, because they come from AlexNet's grouped convolution layers. Inside those layers there are two parallel sub-convolutions, each with its own weight.
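For example, a hedged sketch of rescaling a loaded weight file's filters into [0, 1] so imshow can display them; the file path and the (channels, height, width, filters) layout are assumptions to check against your own run:

```python
import numpy as np

def normalize_per_filter(W):
    """Rescale each filter of W (assumed shape: channels, h, w, filters)
    into [0, 1] so it can be displayed with matplotlib's imshow."""
    W = W.astype(np.float64)
    lo = W.min(axis=(0, 1, 2), keepdims=True)
    hi = W.max(axis=(0, 1, 2), keepdims=True)
    return (W - lo) / np.maximum(hi - lo, 1e-12)

# Usage sketch (file name follows the "<param>_<layer>_<epoch>" pattern):
#   W = np.load("W_0_60.npy")
#   plt.imshow(normalize_per_filter(W)[:, :, :, 0].transpose(1, 2, 0))
```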

@Magotraa
Copy link
Author

Magotraa commented May 5, 2017

@hma02
I am able to train the alexnet now, thank you for all the suggestions.

Now I am trying to train on ImageNet using my own network, but the training error and validation error do not improve at all.

Any suggestions?

('training @ iter = ', 61040)
('training cost:', array(6.920103549957275, dtype=float32))
('training error rate:', array(0.9921875))
('training @ iter = ', 61045)
('training cost:', array(6.905889511108398, dtype=float32))
('training error rate:', array(1.0))
('training @ iter = ', 61050)
('training cost:', array(6.9157304763793945, dtype=float32))
('training error rate:', array(1.0))
[... the cost hovers around 6.90 and the error rate around 1.0 for every logged iteration up to 61210 ...]
('training @ iter = ', 61210)
('training cost:', array(6.899809837341309, dtype=float32))
('training error rate:', array(1.0))

@Magotraa
Copy link
Author

@hma02
If possible, please suggest something on the above-mentioned issue. Also, is there any relation between the depth of the network and the learning rate?

@gwding
Copy link
Contributor

gwding commented May 10, 2017

@AryanBhardwaj Usually you can try small learning rates until you see some training progress on the training data (if the training loss does not decrease at all, there is usually a bug, perhaps in the data pipeline), and then try a larger learning rate to learn faster.

@hma02
Copy link
Contributor

hma02 commented May 10, 2017

@AryanBhardwaj
Yes, data pipeline would be the first to check. Verify that your training data matches the training labels.
The cost not decreasing issue could be due to a bad network initialization as well. For example, try tweaking the mean and std of your gaussian initializer. You can follow some of the standard ways of initializing weights like here.
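As one standard alternative to hand-tuning a Gaussian initializer, a hedged sketch of Glorot-style initialization; the (channels_in, h, w, channels_out) weight layout is an assumption to match against your own layer code:

```python
import numpy as np

def glorot_normal(shape, rng=None):
    """Glorot (Xavier) normal initialization for a conv weight of
    assumed shape (channels_in, h, w, channels_out): std is scaled by
    the fan-in and fan-out so activations keep a stable variance."""
    rng = rng or np.random.default_rng()
    n_in, h, w, n_out = shape
    fan_in, fan_out = n_in * h * w, n_out * h * w
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=shape).astype(np.float32)
```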

You can also monitor the gradient flow during training to see whether the gradients have a reasonable magnitude (e.g. 1e-1 to 1e-3). Try constructing a theano function that outputs self.grads.
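A small helper for that last suggestion: given the gradient arrays returned by such a theano function, report each one's RMS and flag values outside the quoted range (the bounds below are the ones suggested above):

```python
import numpy as np

def grad_report(grads, lo=1e-3, hi=1e-1):
    """Return (index, rms, healthy) for each gradient array, where
    healthy means the RMS magnitude falls inside [lo, hi]."""
    out = []
    for i, g in enumerate(grads):
        rms = float(np.sqrt(np.mean(np.square(g))))
        out.append((i, rms, lo <= rms <= hi))
    return out
```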

@Magotraa
Copy link
Author

@gwding and @hma02
Thank you, I will try to find the solution on these directions.

@Magotraa
Copy link
Author

Magotraa commented Jun 28, 2017

@hma02 and @gwding

I want to thank you both for your suggestions; they were helpful.
I am currently trying to test the results using actual images from Google, to see whether the learned weights can label those images correctly. Is there an existing sample to refer to? Otherwise, please suggest any ideas that might help.
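One ingredient such a test script needs is to preprocess a downloaded image the same way as the training data. A minimal sketch of the center-crop step, assuming the 227x227 network input used above (resizing to 256 and mean subtraction would still be needed first):

```python
import numpy as np

def center_crop(img, size=227):
    """Center-crop an (h, w, c) image array to size x size; the input
    is assumed to be at least `size` pixels on each side (e.g. a
    256 x 256 preprocessed image)."""
    h, w = img.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return img[top:top + size, left:left + size]
```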

@Magotraa
Copy link
Author

Magotraa commented Jul 1, 2017

@hma02 If possible please suggest something on the above-mentioned issue.

@hma02
Copy link
Contributor

hma02 commented Jul 1, 2017 via email
