[BUG] Farmer faulted 2 times #500
Comments
I want to confirm two things:
|
Yes beta 17. Yes, I have 1 local harvester, 5 remote harvesters and 1 local farmer. |
I hit a similar issue with my farmer crashing unexpectedly. I have two nodes that are each running their own full node, farmer, and harvester. The full node and harvester continued to run but the farmer on both machines died after several days of running fine. The farmer seems to have died a while back and I didn't immediately notice. All messages from the farmer disappeared after the crash:
And here are the last mentions of the string
Same error message on the second machine, but looks like it happened a couple hours earlier on that machine at 03:33:42.197 with the |
Found this in Machine 1:
Machine 2:
|
I've had the same farming failure described in the issue. I had 3 machines on b15. Machine A had full stack, Machine B had full stack, Machine C (pi) had just harvester pointed to Machine B. This setup ran perfectly for a long time. I upgraded to b17 all 3 machines, and wanted to check full stack on the pi, so had the full stack running on the pi. My config for the pi however was still pointing to Machine B. Since then I've had several farming failures. I've shut down the pi farmer, and have 2 machines running full stack, and will monitor for failures. |
Same issue with my setup. dmesg output from the farmer: |
I ran "chia start farmer" on the latest run, instead of using the GUI. Same issue. Not sure if it has any value, but these are the chia processes still running and defunct:
(venv) user1@farmer01:~$ ps -ef |grep chia |
nm -D -n blspy.cpython-38-x86_64-linux-gnu.so with offset 0x52293 says it's in
00000000000520f0 T _ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj
gdb blspy.cpython-38-x86_64-linux-gnu.so
Dump of assembler code for function _ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj:
Guessed possible call stack, based on md_hmac call? DeriveChildSk
Could that N double + ceil be causing mischief? Or maybe a sodium_malloc issue? mov %r13,%rsi means movups %xmm2,0x0(%r13) is crashing on a reference to the second parameter to md_hmac which is ikm (passed in as key) or hmacInput1/hmacInput (which could have problems if N were wrong). Maybe memcpy is having issues with the zero length info from here
All just guesses |
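The nm lookup above (finding which exported function contains a faulting offset) can be sketched in Python. This is only an illustration: the DeriveChildSk line is taken from the comment above, while the two neighbouring symbols and the helper names `parse_nm` and `lookup` are hypothetical placeholders, not real entries from blspy.

```python
import bisect

# Sample in the shape of `nm -D -n` output. Only the DeriveChildSk line comes
# from the thread; the other two defined symbols are hypothetical placeholders.
SAMPLE_NM = """\
                 U memcpy
0000000000051000 T hypothetical_previous_symbol
00000000000520f0 T _ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj
0000000000053000 T hypothetical_next_symbol
"""

def parse_nm(text):
    """Return a sorted list of (address, name) for defined symbols."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # undefined symbols ("U name") carry no address
        address, _kind, name = parts
        symbols.append((int(address, 16), name))
    symbols.sort()
    return symbols

def lookup(symbols, offset):
    """Return (name, offset_into_symbol) for the last symbol at or below offset."""
    addresses = [address for address, _ in symbols]
    index = bisect.bisect_right(addresses, offset) - 1
    if index < 0:
        return None  # offset lies below the first defined symbol
    address, name = symbols[index]
    return name, offset - address

if __name__ == "__main__":
    name, delta = lookup(parse_nm(SAMPLE_NM), 0x52293)
    print(f"{name} +{delta:#x}")
```

Since nm reports only start addresses, this picks the nearest symbol at or below the offset; it cannot tell whether the offset falls past the end of that function.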
0x000000000005226d <+381>: cmp $0x1,%r14 // if (i == 1)
It isn't happy copying T to hmacInput and fails on the first write. Since hmacInput allocates 33 bytes using sodium_malloc, the memory will be unaligned. memcpy shouldn't care about that though (it is using unaligned movups). infoLen = 0 has been optimized out.
hmmm. this must be failing. maybe we are running into this: it might be worth trying setrlimit |
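For anyone wanting to try the setrlimit idea, here is a minimal sketch using Python's standard `resource` module. It assumes the limit in question is RLIMIT_MEMLOCK (locked memory), which the comment above does not actually confirm; the function names are made up for illustration.

```python
import resource

def describe_limit(value):
    """Render an rlimit value in a readable form."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value} bytes"

def memlock_limits():
    """Return the (soft, hard) RLIMIT_MEMLOCK pair for this process."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)

def raise_soft_to_hard():
    """Raise the soft limit to the hard limit, which needs no extra privileges.

    Assumption: RLIMIT_MEMLOCK is the relevant limit; the thread does not say.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    resource.setrlimit(resource.RLIMIT_MEMLOCK, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)

if __name__ == "__main__":
    soft, hard = memlock_limits()
    print("soft:", describe_limit(soft), "hard:", describe_limit(hard))
    soft, hard = raise_soft_to_hard()
    print("soft is now:", describe_limit(soft))
```

The change only applies to the calling process and its children, so it would have to run before the farmer is started from the same process tree.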
OK, a different spot from above. This crashed at 0x528B8.
Machine 2:
[302029.128623] show_signal_msg: 18 callbacks suppressed
above branched to <+1968> if i == 1
This is the i == 1 case. In this case it is trying to write into hmacInput1, somewhat the same. This fails?
|
here are a few observations, just by lightly browsing the code at: https://github.com/Chia-Network/bls-signatures/blob/d1e8f892d1941ff38da08a85cf17fa2e40f4ea2a/src/hkdf.hpp#L46
|
I am now thinking this is caused by a sodium_malloc leak. Stock Ubuntu will only tolerate 15000 of these sodium leaks and it looks like BNWrapper may have a leak. |
@wjblanke has a potential fix for the crashing farmer. Can folks try the farm-crash-test branch (https://github.com/Chia-Network/chia-blockchain/tree/farm-crash-test)? `git fetch; git checkout farm-crash-test; git pull; sh install.sh` etc... I can make a Windows installer available too if anyone has seen this problem on Windows. |
I ran the farm-crash-test branch on the farmer overnight, still crashed. (Ran from 20:50 to 05:49)
[12580.795642] perf: interrupt took too long (2530 > 2500), lowering kernel.perf_event_max_sample_rate to 79000 |
You actually need to update the bls library (blspy is its Python name). The probably fixed version is 0.2.5 - https://pypi.org/project/blspy/#history I modified setup.py in the branch above to pull that in when you run install.sh. If you wait about 20 minutes I'll have one more possible fix in version 0.2.6 of blspy. |
Ok, I've updated farm-crash-test branch with the newest blspy of 0.2.6. Please pull and run that one. |
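To confirm which blspy actually ended up in the venv after pulling, something like the following sketch may help. It uses the standard `importlib.metadata` module (Python 3.8+); the helper names and the plain dotted-number comparison are assumptions for illustration, and the minimum version is whatever you are told to test, since the thread mentions several candidates.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def version_tuple(version):
    """Parse a plain dotted version such as '0.2.6' into a tuple of ints.

    Assumption: purely numeric components; dev or rc suffixes are not handled.
    """
    return tuple(int(part) for part in version.split("."))

def at_least(version, minimum):
    """True if version is greater than or equal to minimum."""
    return version_tuple(version) >= version_tuple(minimum)

if __name__ == "__main__":
    found = installed_version("blspy")
    if found is None:
        print("blspy is not installed in this environment")
    else:
        print("blspy", found, "meets 0.2.6:", at_least(found, "0.2.6"))
```

Run it inside the activated venv, otherwise it reports on whatever Python environment happens to be first on the path.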
I have been running the new version since 11:30 this morning (6.5 hours ago) and no crash yet, I will let it run through the night and report back tomorrow morning |
Unfortunately crashed again. Let me know if I need to post any logs/dmesg output |
Plots winding down now, will try out farm-crash-test today. One farmer did fail on regular b17 again, this time with just two plotters running full stack. |
beta-1.0b18.dev19 (0.2.5?) crashed after 24 hours of running. |
Yes please post the dmesg crash info |
This is the dmesg of my latest crash: [12580.795642] perf: interrupt took too long (2530 > 2500), lowering kernel.perf_event_max_sample_rate to 79000 |
hmm same spot. we've found another leak, more info soon. i wonder why it picks this particular spot (well, and one other we know about) to crash though. |
I also get the same kind of crash with
|
I've just pushed a potentially fixed blspy. Please git pull the farm-crash-test branch and run it as you can. |
Apologies for the delay in reporting back. I ended up reinstalling Linux on three of the harvesters to sort out the Python issues (these were upgrades from Ubuntu 19.x to 20.04). I have had the farm-crash-test branch running on my farmer and all the harvesters for the last 9 hours, and it is still going without any errors. (It typically crashed about 6 hours in.) I will keep an eye on it during our day. |
I've got two machines running strong on dev20 farm-crash-test for 3+ days. I've just installed dev21 on a third machine that had only been plotting. |
Nearly 24 hours on the latest farm-crash-test branch and everything is still working fine. I ran the Farmer initially from the command line and then via the GUI, both working fine. |
1.0b18.dev21 blspy 0.2.7 running for 47 hours without a problem. |
I merged this into the dev branch. If anyone comes new to this issue, that's the branch you now need and not farm-crash-test branch. Feel free to keep running if you have the most recent farm-crash-test however. I'm going to leave this open for another couple of days to make sure we don't see any more of this over longer time periods. |
I have been running farm-crash-test for 5 days straight with no seg fault on the harvester. Peak memory usage is 983MB for me, and current usage is 928MB. What is everyone seeing as peak and current memory usage for their harvester? |
Having heard no further reports of failure, I'm closing this as fixed for the next release. |
Hi, my setup is a fresh install on Ubuntu without the GUI. I run the harvester only, farming to a remote farmer. It crashes every couple of hours. When I check
|
Almost always a bad plot. |
our harvester should tolerate bad plots though.. |
Indeed it should. @sonpython could you figure out which plot is bad and upload it to us? |
Sorry for the late reply. It took me There is only 1 bad plot as the result of
|
Describe the bug
Farmer has been unstable and closes abruptly.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Run indefinitely
Screenshots
Desktop (please complete the following information):