0.14.0.dev3 Segmentation fault when reconnecting to pool after connection was lost / disconnected. #848
Comments
CUDA Hardware Test Launch Command:
Please try the latest ethminer version to rule out a hardware problem, and report back.
Nothing to do with CUDA. The ethminer process running on the CUDA / Nvidia GPUs was still running when I opened this issue. It's the ethminer process running on the OpenCL / AMD GPUs that segfaulted when trying to reconnect after the connection to the pool was lost / disconnected.
OK, I am confused, because I AM using the latest ethminer version, 0.14.0.dev3, from the releases page ( https://github.com/ethereum-mining/ethminer/releases ), which shows 0.14.0.dev3. Or do you mean I should do a git clone and build from master?
It is possible it's the same issue @jean-m-cyr saw in my PR #828, and we are in the process of cleaning this up.
@smurfy Possibly unrelated, but I've noticed that sometimes under Linux a console Ctrl-C will result in a similar segfault with a crash dump. It only happens when using SSL.
I've moved the io_service.stop() to after the m_socket.close() locally, so I don't get the failover crashes running regular TCP anymore. It doesn't help with the SSL crashes, so this is not exactly the same as #828.
Well @jmsjr gets:
Which you also get. Not sure where the exception is thrown (probably in PoolClient, either while trying to write or read) and then caught somewhere. So maybe after catching it late, something gets destroyed in an ugly state, which causes the reconnect to segfault :) So I'll try to actually reproduce the problem in my dev environment and then debug it by adding a shitload of debug output :)
@smurfy When I was playing with this trying to figure out the SSL case, I recall finding out that the disconnect method was being called recursively. Maybe not recursively, but at least twice somehow! Yes, the "protocol is shutdown" message applies only to SSL, and typically precedes the crash by a second or two. I don't think catching the write-on-closed-socket exception is going to fix it. The crash doesn't occur immediately after the exception.
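The double call described above is the kind of situation an idempotent disconnect guards against. Here is a minimal sketch, assuming a hypothetical `GuardedClient` class; it is not ethminer's actual code and is not claimed to fix the SSL crash. An atomic flag lets only the first `disconnect()` per connection run the teardown, so a second or re-entrant call becomes a no-op.

```cpp
#include <atomic>
#include <functional>
#include <utility>

// Hypothetical stand-in for a pool client; NOT ethminer's actual class.
class GuardedClient {
public:
    // `teardown` is whatever closes the socket and stops the I/O loop.
    explicit GuardedClient(std::function<void()> teardown)
        : m_teardown(std::move(teardown)) {}

    // Marks the client as connected so a later disconnect() has work to do.
    void connect() { m_connected.store(true); }

    // Runs the teardown at most once per connection, even when called from
    // several code paths (error handler, timer, signal handler).
    // Returns true only for the call that actually performed the teardown.
    bool disconnect() {
        // exchange() atomically clears the flag and returns the old value,
        // so only one caller ever observes `true`.
        if (!m_connected.exchange(false)) return false;
        m_teardown();
        return true;
    }

private:
    std::atomic<bool> m_connected{false};
    std::function<void()> m_teardown;
};
```

Because the flag is cleared before the teardown runs, a `disconnect()` triggered from inside the teardown itself also returns immediately instead of tearing things down a second time.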
FYI, it happened again just a few minutes ago, and again to the same ethminer process running on the OpenCL / AMD GPUs. Note that I do not have them overclocked at all, just stock settings. The ethminer process on CUDA is still running. I don't know whether that process ever got disconnected / lost its TCP connection though, so maybe it is just a coincidence that it is always the ethminer process running on OpenCL that gets disconnected.
May I suggest that, while debugging and trying to reproduce the issue, you put a rule in iptables to drop packets from the pool to get the "No new work from xxxx" message, and then remove the iptables rule after the "Disconnected from xxxx" message, just as ethminer is about to reconnect?
FYI .. I also managed to somehow trigger it ( although randomly I suppose ) just by hitting Ctrl-C on the running ethminer process ( this time on the process running on CUDA GPUs ):
I believe this is fixed with PR #828. Please re-test and close.
Was the fix / latest master for TLS / SSL connections only ? Just an FYI .. I have not tried the latest from master ( still using release/0.14.0dev3 ), I just had this happen on a non-TLS / non-SSL connection as well:
The Windows version seems OK. My internet router lost its connection and reconnected, and ethminer 0.14.0.dev3 was able to re-establish the nanopool connection.
I don't think this is fixed. I just built ethminer from master ( git commit hash de05cc3 )
I got this after about 1 hour running TLS on asia1.ethermine.org
So it seems like ethminer decided to disconnect because it did not receive a response to a share submission within 2 seconds. I recall that previously some of the accepted responses from the pool took longer than 2 seconds. P.S. As I type this, I got another segfault due to those 2 seconds without a response from the pool on submission of a share. Switching back to the 0.14.0.dev3 release proper and without using TLS for now.
Also, I get a segfault as well when simply hitting Ctrl-C:
After switching back to the 0.14.0.dev3 release proper and WITHOUT using TLS, the console output shows a response from the pool to a share submission arriving after more than 2 seconds ( 2.9 seconds ) without forcing ethminer to reconnect:
I presume that reconnecting when the pool does not respond to a share submission is a recent change made after the 0.14.0.dev3 release proper. Can we re-open this issue?
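To make the suspected behaviour concrete, here is a rough sketch of a share-response watchdog. Everything in it is an assumption for illustration: `ResponseWatchdog` and its methods are hypothetical names, not taken from the ethminer source, and the 2-second value only mirrors what was observed above. It shows why a 2.9-second pool response would trip a 2-second limit but pass a more relaxed one.

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical helper; NOT taken from the ethminer source.
class ResponseWatchdog {
public:
    explicit ResponseWatchdog(std::chrono::milliseconds timeout)
        : m_timeout(timeout) {}

    // Record the moment a share was submitted to the pool.
    void onSubmit(Clock::time_point now) {
        m_submitted = now;
        m_pending = true;
    }

    // The pool answered (accepted or rejected): nothing is outstanding.
    void onResponse() { m_pending = false; }

    // True when a submission is still unanswered after the timeout,
    // i.e. the point at which a client might decide to reconnect.
    bool expired(Clock::time_point now) const {
        return m_pending && (now - m_submitted) > m_timeout;
    }

private:
    std::chrono::milliseconds m_timeout;
    Clock::time_point m_submitted{};
    bool m_pending = false;
};
```

With a 2000 ms timeout, a response that arrives 2.9 seconds after submission comes too late, while the same response is still in time with a timeout of, say, 5000 ms. Passing the current time in as a parameter keeps the check easy to exercise without real waiting.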
Ethminer Version:
OS:
Been running 0.14.0.dev3 for 18 hours. I actually have 2 instances / processes of ethminer running on the rig:
The ethminer process for the OpenCL GPUs segfaulted after the connection to the pool was lost and ethminer tried to reconnect.
The ethminer process for the CUDA GPU is still running as I speak.
Observations from the console output below ( note that I replaced my wallet address / userpass argument below with xxxxxxxx ):
Note that I am also using TLS on port 5555
FWIW, I actually am using ethminer because of a similar behaviour I have with Claymore:
Claymore DevFee couldn't connect to stratum, restarts miner, complains not enough GPU memory DAG
... so I am giving ethminer a try. Hopefully it is not the same issue as with Claymore ( e.g. seems to be that GPU memory was not being cleared before restarting the miner process )