This repository has been archived by the owner on Apr 24, 2022. It is now read-only.

Part 2 - 0.14.0.dev4 Segmentation fault when reconnecting to pool after connection was lost / disconnected #890

Closed
jmsjr opened this issue Mar 14, 2018 · 10 comments

Comments

@jmsjr

jmsjr commented Mar 14, 2018

This is a continuation of #848, given that:

  1. 0.14.0.dev4 has been tagged
  2. The issue that I raised was marked as closed, but I built from master yesterday ( just before the 0.14.0.dev4 tag ) and found that the issue was not fixed. I do not know how to re-open it.
  3. The commits made after the git hash of my local master build from yesterday do not contain anything that would change the behaviour ( the last 2 changes were to do with --list-devices and Python3 )

So pardon me as I copy and paste into this new issue the last few comments made on the original issue after it was marked as closed:

----- START COPY FROM ORIGINAL ISSUE ---

I don't think this is fixed. I just built ethminer from master ( git commit hash de05cc3 )

bin/ethminer --version
ethminer version 0.14.0.dev3+git.de05cc3
Build: linux/release/gnu

I got this after about 1 hour running TLS on asia1.ethermine.org

 ℹ  23:28:15|cuda-1  |  Nonce 0xf02f5c9cf53d0272 submitted to asia1.ethermine.org
  m  23:28:16|ethminer|  Speed  56.88 Mh/s    gpu/0 28.44 50C 45% 90W  gpu/1 28.44 56C 75% 100W  [A58+2:R0+0:F0] Time: 01:13
  ✘  23:28:17|stratum |  No no response received in 2 seconds.
  ℹ  23:28:18|stratum |  Disconnected from asia1.ethermine.org
  ℹ  23:28:18|stratum |  Shutting down miners...
  ℹ  23:28:19|stratum |  Retrying in 3 ...
  ℹ  23:28:20|stratum |  Retrying in 2 ...
  m  23:28:21|ethminer|  not-connected 
  ℹ  23:28:21|stratum |  Retrying in 1 ...
./start_ethermine.org-cuda.sh: line 20:  5243 Segmentation fault      (core dumped) bin/ethminer --farm-recheck 2000 -U --stratum asia1.ethermine.org:5555 --stratum-failover us2.ethermine.org:5555 --userpass 0xexxxxxxxx --stratum-protocol 0 --report-hashrate --verbosity 9 -HWMON 1 --stratum-ssl 0 --cuda-parallel-hash 6 --cuda-streams 4 --cuda-grid-size 4096

So it seems that ethminer decided to disconnect because it did not receive a response to a share submission within 2 seconds. I recall that previously some of the accepted responses from the pool took longer than 2 seconds.

Also, I get segfault as well when simply hitting Ctrl-C:

  m  23:47:37|ethminer|  Speed  27.26 Mh/s    gpu/0 27.26 49C 48%   [A34+1:R0+0:F0] Time: 01:32
  m  23:47:42|ethminer|  Speed  27.18 Mh/s    gpu/0 27.18 49C 48%   [A34+1:R0+0:F0] Time: 01:32
  m  23:47:47|ethminer|  Speed  26.94 Mh/s    gpu/0 26.94 49C 48%   [A34+1:R0+0:F0] Time: 01:32
^C  ℹ  23:47:49|ethminer|  Shutting down...
  ✘  23:47:49|stratum |  Read response failed: End of file
./start_ethermine.org-opencl.sh: line 18:  5261 Segmentation fault      (core dumped) bin/ethminer --farm-recheck 2000 -G --stratum asia1.ethermine.org:5555 --stratum-failover us2.ethermine.org:5555 --userpass 0xxxxxx --stratum-protocol 0 --report-hashrate --verbosity 9 -HWMON 1 --stratum-ssl 0 --cl-parallel-hash 8

After switching back to the 0.14.0dev3 release proper and WITHOUT using TLS, the console output shows a pool response to a share submission arriving after more than 2 seconds ( 2.9 seconds ) without forcing ethminer to reconnect:

 ℹ  00:35:55|cuda-0  |  Nonce 0xc50898a1de3502c9 submitted to asia1.ethermine.org
  m  00:35:57|ethminer|  Speed  56.97 Mh/s    gpu/0 28.47 50C 45% 90W  gpu/1 28.50 55C 75% 102W  [A36+0:R0+0:F0] Time: 00:50
  ℹ  00:35:58|stratum |  Received new job #37a30f4f… from asia1.ethermine.org
  ℹ  00:35:58|stratum |  Accepted  in 2938 ms.

I presume the reconnect if no response from the pool when submitting a share is a recent change after 0.14.0dev3 release proper.

@smurfy
Collaborator

smurfy commented Mar 14, 2018

I presume the reconnect if no response from the pool when submitting a share is a recent change after 0.14.0dev3 release proper.

this is different. try changing this line:

https://github.com/ethereum-mining/ethminer/blob/master/libpoolprotocols/stratum/EthStratumClient.cpp#L696

@jmsjr
Author

jmsjr commented Mar 14, 2018

try commenting out this:

https://github.com/ethereum-mining/ethminer/blob/master/libpoolprotocols/stratum/EthStratumClient.cpp#L190

and report back, thanks

I presume the reconnect if no response from the pool when submitting a share is a recent change after 0.14.0dev3 release proper.

this is different. try changing this line: https://github.com/ethereum-mining/ethminer/blob/master/libpoolprotocols/stratum/EthStratumClient.cpp#L696

OK .. I pulled latest master so that I now have:

$ bin/ethminer --version
ethminer version 0.14.0.dev4+git.81ad571
Build: linux/release/gnu

.. but with the following changes as suggested:

$ git diff
diff --git a/libpoolprotocols/stratum/EthStratumClient.cpp b/libpoolprotocols/stratum/EthStratumClient.cpp
index 27a9b5c..f4b1671 100644
--- a/libpoolprotocols/stratum/EthStratumClient.cpp
+++ b/libpoolprotocols/stratum/EthStratumClient.cpp
@@ -187,7 +187,7 @@ void EthStratumClient::disconnect()
                        m_securesocket->shutdown(sec);
                }
 
-               m_socket->close();
+               //m_socket->close();
                m_io_service.stop();
        }
        catch (std::exception const& _e) {
@@ -693,7 +693,7 @@ void EthStratumClient::submitSolution(Solution solution) {
                                boost::asio::placeholders::error));
        }
        m_response_pending = true;
-       m_responsetimer.expires_from_now(boost::posix_time::seconds(2));
+       m_responsetimer.expires_from_now(boost::posix_time::seconds(20));
        m_responsetimer.async_wait(boost::bind(&EthStratumClient::response_timeout_handler, this, boost::asio::placeholders::error));
 }

I put 20 seconds instead of 2 seconds .. not sure if that's good. Also, by increasing the timeout to 20 seconds, I may be preventing the reconnect in the first place, which is where the segfault was happening.

I am running with the above changes for the next few hours, and if okay, I will dial back the 20 seconds to 4 seconds tomorrow.
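
For anyone following along, the trade-off between the 2, 4 and 20 second values can be sketched in isolation. This is only an illustration in plain standard C++, not ethminer's code: `ResponseWatchdog` and `Millis` are hypothetical names, and the real client arms a boost::asio timer as shown in the diff above. The sketch just shows that a response taking 2938 ms (as logged on 0.14.0dev3) exceeds a 2 second limit but not a 4 or 20 second one.

```cpp
#include <cassert>
#include <chrono>

using Millis = std::chrono::milliseconds;

// Hypothetical helper (not part of ethminer): decides whether a pending
// share submission has waited longer than a configurable timeout.
struct ResponseWatchdog {
    Millis timeout;  // e.g. 2 s as in the original line, or 4 s / 20 s as tried here

    explicit ResponseWatchdog(Millis t) : timeout(t) {
        assert(t.count() > 0 && "timeout must be positive");
    }

    // Returns true when the time elapsed since submission exceeds the timeout.
    bool expired(Millis elapsed) const { return elapsed > timeout; }
};
```

With these definitions, `ResponseWatchdog(Millis(2000)).expired(Millis(2938))` is true, while the same check with `Millis(4000)` or `Millis(20000)` is false, so a longer limit would avoid the disconnect for that particular response but would not by itself explain or fix the segfault.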

@jmsjr
Author

jmsjr commented Mar 14, 2018

Also forgot to mention that I am running with the above changes AND using TLS ( port 5555 on ethermine.org )

@jmsjr
Author

jmsjr commented Mar 14, 2018

After running several hours ... I got this:

  m  02:17:18|ethminer|  Speed  27.26 Mh/s    gpu/0 27.26 48C 48%   [A145+5:R0+0:F0] Time: 06:05
  m  02:17:23|ethminer|  Speed  27.26 Mh/s    gpu/0 27.26 48C 48%   [A145+5:R0+0:F0] Time: 06:05
  m  02:17:28|ethminer|  Speed  27.18 Mh/s    gpu/0 27.18 48C 48%   [A145+5:R0+0:F0] Time: 06:05
  m  02:17:33|ethminer|  Speed  27.18 Mh/s    gpu/0 27.18 48C 48%   [A145+5:R0+0:F0] Time: 06:05
  m  02:17:38|ethminer|  Speed  27.18 Mh/s    gpu/0 27.18 48C 48%   [A145+5:R0+0:F0] Time: 06:05
  m  02:17:43|ethminer|  Speed  27.26 Mh/s    gpu/0 27.26 48C 48%   [A145+5:R0+0:F0] Time: 06:05
  m  02:17:48|ethminer|  Speed  27.26 Mh/s    gpu/0 27.26 48C 48%   [A145+5:R0+0:F0] Time: 06:06
  ℹ  02:17:49|stratum |  Disconnected from asia1.ethermine.org
  ℹ  02:17:49|stratum |  Shutting down miners...
  ℹ  02:17:49|stratum |  Retrying in 3 ...
  ℹ  02:17:50|stratum |  Retrying in 2 ...
  ℹ  02:17:51|stratum |  Retrying in 1 ...
  ✘  02:17:52|stratum |  Handle response failed: protocol is shutdown
  ✘  02:17:52|stratum |  Handle response failed: protocol is shutdown
  ✘  02:17:52|stratum |  Handle response failed: protocol is shutdown
  ✘  02:17:52|stratum |  Handle response failed: protocol is shutdown
  ✘  02:17:52|stratum |  Handle response failed: protocol is shutdown
  ✘  02:17:52|stratum |  Handle response failed: protocol is shutdown
  ✘  02:17:52|stratum |  Handle response failed: protocol is shutdown
./start_ethermine.org-opencl.sh: line 18:  9260 Segmentation fault      (core dumped) bin/ethminer --farm-recheck 2000 -G --stratum asia1.ethermine.org:5555 --stratum-failover us2.ethermine.org:5555 --userpass 0xexxxxxxx --stratum-protocol 0 --report-hashrate --verbosity 9 -HWMON 1 --stratum-ssl 0 --cl-parallel-hash 8

The other process of ethminer running WITHOUT using TLS / SSL ( port 4444 on ethermine.org ) is still running, and it seems to have reconnected ... based on the fact that netstat now shows a different TCP source port from when I started it several hours ago.

I also forgot to mention that Ctrl-C also was causing a segfault even with the above code changes.

@rwaters71

I believe I just experienced a very similar problem, if not the same one, under Windows. It got disconnected and kept retrying to connect in a loop, with the following error repeating:

The I/O operation has been aborted because of either a thread exit or an application request

Looks like it is not handling an error when trying to read from a socket that has been closed for whatever reason?
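
One possible shape for such handling, sketched in plain standard C++ rather than taken from ethminer: the read handler checks the error code before doing anything else, and only the first failure on a connection schedules a reconnect. `Session`, `on_read` and the counters are hypothetical names for illustration, and whether this matches the actual cause of the crash is an assumption.

```cpp
#include <cassert>
#include <system_error>

// Hypothetical session state for one connection; names are illustrative.
struct Session {
    bool connected = true;         // cleared by the first failed read
    int reconnects_scheduled = 0;  // how many reconnects this session requested
    int messages_processed = 0;    // how many successful reads were handled

    // Completion handler for a read. It inspects the error code first and
    // never processes data once the connection is known to be down.
    void on_read(const std::error_code& ec) {
        if (!connected) return;    // stale handler for an already-closed socket
        if (ec) {
            connected = false;
            ++reconnects_scheduled;
            // Invariant of this sketch: at most one reconnect per session.
            assert(reconnects_scheduled == 1);
            return;
        }
        ++messages_processed;
    }
};
```

In this sketch, several failed completions arriving back to back (like the repeated "Handle response failed" lines earlier in this thread) would result in a single scheduled reconnect instead of one per handler.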

@joequant
Contributor

I suspect this may be the same problem as #887

@Hakkk2002
Contributor

Same problem under Windows with stratum2+tcp:

ethminer -U -P stratum2+tcp://USER.WORKER@ADD:PORT --cuda-devices 1 3 --cuda-noeval

......
  i  11:14:20|cuda-0  |  Nonce 0x2afbf70003b6fbd9 submitted to ADD
  X  11:14:22|stratum |  No no response received in 2 seconds.
  i  11:14:22|stratum |  Disconnected from ADD
  i  11:14:22|stratum |  Shutting down miners...
  i  11:14:23|stratum |  Retrying in 3 ...

(crash)

joequant added a commit to joequant/ethminer that referenced this issue Mar 27, 2018
This patch changes the socket pointers to shared ptrs, which should
not be released if there is a call to async_write during a delete.
This is a fix for ethereum-mining#929 ethereum-mining#892 ethereum-mining#890 and ethereum-mining#887
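
The idea behind that commit message can be sketched with the standard library alone. The following is not the patch itself: `FakeSocket`, `Client` and `make_pending_write` are hypothetical stand-ins, and no boost::asio calls are used. It only illustrates how a handler that holds its own `std::shared_ptr` copy keeps the socket object alive even after the owner has dropped its reference.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>

// Hypothetical stand-in for a socket; not the real boost::asio type.
struct FakeSocket {
    bool open = true;
    std::string last_written;
    void write(const std::string& data) { if (open) last_written = data; }
    void close() { open = false; }
};

// Hypothetical client that owns the socket through a shared_ptr.
struct Client {
    std::shared_ptr<FakeSocket> socket = std::make_shared<FakeSocket>();

    // Builds a deferred write. The lambda copies the shared_ptr, so the
    // socket object stays alive until the handler itself is destroyed.
    std::function<void()> make_pending_write(std::string data) {
        std::shared_ptr<FakeSocket> keep_alive = socket;
        return [keep_alive, data]() { keep_alive->write(data); };
    }

    // Closes the socket and drops the client's own reference.
    void disconnect() {
        assert(socket && "disconnect called twice");
        socket->close();
        socket.reset();
    }
};
```

If a pending write created before `disconnect()` runs afterwards, it touches a closed but still valid object rather than freed memory; with a raw pointer deleted in `disconnect()`, the same call would be undefined behaviour.
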
@AndreaLanfranchi
Collaborator

To address this problem please try 0.14.0rc9 or 0.15.0dev7 and report.

@AndreaLanfranchi
Collaborator

Addressed in #1135
