error on Windows 10 #33
This problem was also mentioned in #32.
@hma02 Error: the complete output is:
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
Using gpu device 0: GeForce GTX 1080 (CNMeM is enabled with initial size: 80.0% of memory, cuDNN 5110)
... building the model
The "ZMQError: Address in use" error happens when the previous run failed and the socket port opened in the previous run was not closed properly causing port conflict in the next run. You can search the process opening the port by: netstat -ltnp and kill the corresponding process. For the NAN issue, if it happened from the first epoch, this could be caused by input batch not being fed or preprocessed correctly. Or it can be caused by using too large learning rate. See issue #27. |
@hma02 LogicError: cuIpcGetMemHandle failed: OS call failed or operation not supported on this OS
@hma02 Can you share after how many iterations I should expect the accuracy to improve? Also, could you share the optimized hyperparameter file config.yml? Current status:
('training error rate:', array(0.984375))
('training cost:', array(4.295770168304443, dtype=float32))
Your training cost looks okay so far. Are you training on ImageNet data? If you follow the preprocessing steps in this project, you will get 5004 batch files of batch size 256 for single-GPU training, so one epoch takes 5004 iterations. The hyperparameters in config.yaml are already the best values found so far; with those you need to train for 60 epochs, i.e. 60 * 5004 iterations in total.
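As a quick sanity check of those numbers:

```python
# Quick check of the schedule described above (single-GPU setting)
batches_per_epoch = 5004   # 256-image batches over the ImageNet training set
epochs = 60
print(batches_per_epoch * epochs)  # 300240 iterations in total
```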
@hma02 I mean visualizing hidden-layer weight and bias values: reading the values with some tool, or perhaps a text or reference that explains hidden-layer weights and biases in detail.
@hma02 Also, could you share some insight on using "group" in the convolution layers? Thank you in advance.
We benchmarked training speed on GTX 1080 and Tesla K80, but we didn't experiment with visualizing weights. You can simply read the weight files using numpy.load(). To visualize the activations, as in the sketch below, you can construct another theano function that outputs the self.output of each layer and plot it with imshow from matplotlib. The naming pattern of the saved weights is defined in this function: basically "layer_index" + "epoch". Some weights have a number following W or b, like W0/b0 and W1/b1, because they come from AlexNet's grouped convolution layers: inside those layers there are two parallel sub-convolutions, each with its own weight.
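A minimal sketch of both ideas; the file names below are hypothetical (following the "layer_index" + "epoch" pattern), and the filter indexing depends on the actual weight layout:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical file names following the "layer_index" + "epoch" pattern
W0 = np.load('W0_60.npy')   # first sub-convolution of a grouped conv layer
b0 = np.load('b0_60.npy')
print(W0.shape, b0.shape)

# Show one filter slice as a grayscale image (indexing depends on layout)
plt.imshow(W0[:, :, 0, 0], cmap='gray')
plt.colorbar()
plt.show()

# For activations, a theano function can expose a layer's self.output:
#   f = theano.function([x], layer.output)   # x, layer come from the model
#   plt.imshow(f(batch)[0, 0], cmap='gray')
```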
@hma02 Now I am trying to train on ImageNet using my own network, but the training error and validation error do not improve at all. Any suggestions?
('training @ iter = ', 61040)
@AryanBhardwaj Usually you can try small learning rates until you see some training progress on the training data (if you don't see the training loss decrease at all, there is usually a bug, possibly in the data pipeline), and then try a larger learning rate to learn faster.
@AryanBhardwaj You can also monitor the gradient flow during training to see whether the gradients have a reasonable magnitude (e.g., 1e-1 to 1e-3). Try constructing a theano function that outputs the gradients of the cost with respect to each layer's parameters.
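A minimal sketch of such a monitor, using a toy one-layer model rather than this repo's AlexNet (all names here are placeholders):

```python
import numpy as np
import theano
import theano.tensor as T

# Toy one-layer softmax model standing in for the real network
x = T.matrix('x')
y = T.ivector('y')
W = theano.shared(np.zeros((784, 10), dtype='float32'), name='W')
b = theano.shared(np.zeros(10, dtype='float32'), name='b')
p_y = T.nnet.softmax(T.dot(x, W) + b)
cost = -T.mean(T.log(p_y)[T.arange(y.shape[0]), y])

# L2 norm of each parameter's gradient; roughly 1e-1 to 1e-3 is healthy
grads = T.grad(cost, [W, b])
grad_norm_fn = theano.function([x, y], [g.norm(2) for g in grads])
```

Calling grad_norm_fn on a batch alongside the training function gives a cheap per-iteration view of the gradient magnitudes.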
I want to thank you for your suggestions; they were helpful.
@hma02 If possible, please suggest something on the above-mentioned issue.
@AryanBhardwaj
Interesting, I haven't tried that yet. But I imagine that would require the object to be within some ratio range with respect to the image size, similar to how the ImageNet images were gathered.
Then you can do the same preprocessing as in the processing folder, e.g., resizing to 256 by 256 and saving into hkl files in int8.
Finally, load those hkl files and crop 227 by 227 patches to feed the network; a sketch follows.
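A minimal sketch of those steps, assuming PIL and hickle; the file names are illustrative and this is not the repo's exact preprocessing code:

```python
import numpy as np
import hickle as hkl
from PIL import Image

# Resize to 256 x 256 and store as 8-bit, as described above
img = Image.open('example.jpg').convert('RGB').resize((256, 256))
arr = np.asarray(img, dtype=np.uint8)      # 256 x 256 x 3
hkl.dump(arr, 'batch_0.hkl', mode='w')     # illustrative file name

# Later: load and take a random 227 x 227 crop to feed the network
arr = hkl.load('batch_0.hkl')
r, c = np.random.randint(0, 256 - 227 + 1, size=2)
patch = arr[r:r + 227, c:c + 227, :]
```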
Hi,
Thank you for the repository.
I have installed the requirements and started the process as mentioned.
I could prepare the preprocessed data. However, when I execute Train.py,
I get the error: TypeError: Cannot convert Type TensorType(int32, vector) (of Variable <TensorType(int32, vector)>) into Type TensorType(int64, vector). You can try to manually convert <TensorType(int32, vector)> into a TensorType(int64, vector).