TREX: Can’t Stop Device #557

Open
wmirza89 opened this issue Aug 4, 2021 · 7 comments

@wmirza89

wmirza89 commented Aug 4, 2021

20210804 05:50:27 TREX: Can't stop device [ID=2, GPU #2], cuda exception: CUDA_ERROR_LAUNCH_FAILED, try to reduce overclock to stabilize GPU state

I keep getting the error above. I've reduced the overclock as it suggests, but I still get errors and my rig goes offline at random times. I brought the OC down to 1850 (HiveOS) and it still crashes. When the rig crashes and goes offline, it also takes down the other rig connected to the same Ethernet switch. Please help.

@trexminer
Owner

Please upload the full log file

@klokit03

klokit03 commented Aug 5, 2021

@trexminer so you are a support tech?

@trexminer
Owner

@klokit03 I'm one of the miner developers

@CryptoSuperman123

I'm seeing this exact same error. Also ONLY on device with ID=2, GPU #2.

@trexminer
Owner

trexminer commented Aug 7, 2021

I'm afraid I can't provide more info unless someone uploads the log file

@wmirza89
Author

wmirza89 commented Aug 8, 2021

> Please upload the full log file

@trexminer Thank you for your response. I'm not sure if this is sufficient. If not I can try to download the full log file. Just to clarify, which log file? The miner, sys log file, or hive-agent log?

The error is at the very bottom. What's also weird is that when I get this error, the Hive watchdog stops working, the miner takes my rig offline, and my other, separate rig goes offline too (they are connected to the same Ethernet switch). The only way to bring my rig back is to physically power-cycle it with the PC power switch. When I restart the rig that generated the error, the other rig automatically comes back online.

My OC settings are 1200 core, 1850 mem, and 255 PL. I keep lowering the mem clock by 25 each time and I'm still getting this error. I started with a 2000 memory clock. The GPU is an EVGA RTX 3080 XC3.

Any way you can help?

---------------20210807 21:51:32 ---------------
Mining at usw-eth.hiveon.net:24443, diff: 5.00 G
GPU #0: Zotac RTX 3080 - 98.26 MH/s, [T:56C, P:
GPU #1: EVGA RTX 3080 - 97.67 MH/s, [T:62C,
GPU #2: EVGA RTX 3080 - 97.55 MH/s, [T:64C,
GPU #3: EVGA RTX 3080 - 99.92 MH/s, [T:60C,
GPU #4: EVGA RTX 3080 - 99.57 MH/s, [T:63C,
GPU #5: EVGA RTX 3080 - 99.92 MH/s, [T:61C,
Hashrate: 592.87 MH/s, Shares/min: 6.278 (Avr. 5.512), Avr.P: 1375W, Avr.E: 431kH/W
Uptime: 2 hours 16 mins 49 secs | Algo: ethash | T-Rex v0.21.5

20210807 21:51:34 ethash epoch: 432, diff: 10.00 G
20210807 21:51:39 [ OK ] 755/755 - 592.88 MH/s, 25ms ... GPU #1
20210807 21:51:51 [ OK ] 756/756 - 592.89 MH/s, 22ms ... GPU #4

---------------20210807 21:52:02 ----------------
Mining at usw-eth.hiveon.net:24443, diff: 10.00 G
GPU #0: Zotac RTX 3080 - 98.27 MH/s, [T:56C, P:
GPU #1: EVGA RTX 3080 - 97.66 MH/s, [T:62C,
GPU #2: EVGA RTX 3080 - 97.54 MH/s, [T:64C,
GPU #3: EVGA RTX 3080 - 99.92 MH/s, [T:60C,
GPU #4: EVGA RTX 3080 - 99.56 MH/s, [T:63C,
GPU #5: EVGA RTX 3080 - 99.91 MH/s, [T:61C,
Hashrate: 592.86 MH/s, Shares/min: 6.712 (Avr. 5.513), Avr.P: 1376W, Avr.E: 431kH/W
Uptime: 2 hours 17 mins 19 secs | Algo: ethash | T-Rex v0.21.5

20210807 21:52:26 [ OK ] 757/757 - 592.87 MH/s, 28ms ... GPU #3

---------------20210807 21:52:32 ----------------
Mining at usw-eth.hiveon.net:24443, diff: 10.00 G
GPU #0: Zotac RTX 3080 - 98.27 MH/s, [T:56C, P:
GPU #1: EVGA RTX 3080 - 97.66 MH/s, [T:62C,
GPU #2: EVGA RTX 3080 - 97.54 MH/s, [T:64C,
GPU #3: EVGA RTX 3080 - 99.91 MH/s, [T:60C,
GPU #4: EVGA RTX 3080 - 99.57 MH/s, [T:63C,
GPU #5: EVGA RTX 3080 - 99.91 MH/s, [T:61C,
Hashrate: 592.87 MH/s, Shares/min: 6.411 (Avr. 5.497), Avr.P: 1376W, Avr.E: 431kH/W
Uptime: 2 hours 17 mins 49 secs | Algo: ethash | T-Rex v0.21.5

20210807 21:52:42 [ OK ] 758/758 - 592.86 MH/s, 27ms ... GPU #2
20210807 21:52:54 [ OK ] 759/759 - 592.86 MH/s, 27ms ... GPU #2
20210807 21:53:00 [ OK ] 760/760 - 592.87 MH/s, 21ms ... GPU #4

---------------20210807 21:53:02 ----------------
Mining at usw-eth.hiveon.net:24443, diff: 10.00 G
GPU #0: Zotac RTX 3080 - 98.27 MH/s, [T:56C, P:
GPU #1: EVGA RTX 3080 - 97.67 MH/s, [T:62C,
GPU #2: EVGA RTX 3080 - 97.54 MH/s, [T:64C,
GPU #3: EVGA RTX 3080 - 99.91 MH/s, [T:60C,
GPU #4: EVGA RTX 3080 - 99.55 MH/s, [T:63C,
GPU #5: EVGA RTX 3080 - 99.91 MH/s, [T:61C,
Hashrate: 592.86 MH/s, Shares/min: 6.342 (Avr. 5.496), Avr.P: 1376W, Avr.E: 431kH/W
Uptime: 2 hours 18 mins 19 secs | Algo: ethash | T-Rex v0.21.5

20210807 21:53:22 [ OK ] 761/761 - 592.88 MH/s, 19ms ... GPU #5
20210807 21:53:24 [ OK ] 762/762 - 592.87 MH/s, 19ms ... GPU #3
20210807 21:53:24 [ OK ] 763/763 - 592.87 MH/s, 30ms ... GPU #3
20210807 21:53:35 TREX: Can't stop device [ID=2, GPU #2], cuda exception: CUDA_ERROR_LAUNCH_FAILED, try
20210807 21:53:35 WARN: Miner is going to shutdown...

---------------20210807 21:53:35 ----------------
Mining at usw-eth.hiveon.net:24443, diff: 10.00 G
GPU #0: Zotac RTX 3080 - 96.00 MH/s, [T:56C, P:
GPU #1: EVGA RTX 3080 - 97.67 MH/s, [T:62C,
GPU #2: EVGA RTX 3080 - 97.37 MH/s, [T:64C,
GPU #3: EVGA RTX 3080 - 99.76 MH/s, [T:60C,
GPU #4: EVGA RTX 3080 - 99.36 MH/s, [T:63C,
GPU #5: EVGA RTX 3080 - 97.31 MH/s, [T:61C,
Hashrate: 587.47 MH/s, Shares/min: 6.435 (Avr. 5.502), Avr.P: 1377W, Avr.E: 427kH/W
Uptime: 2 hours 18 mins 52 secs | Algo: ethash | T-Rex v0.21.5

20210807 21:53:35 Main loop finished. Cleaning up resources...
20210807 21:53:35 ApiServer: stopped listening on 127.0.0.1:4059
20210807 21:54:05 Committing suicide...
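
For anyone comparing summaries in a log like the excerpt above, here is a minimal sketch that pulls the per-GPU hashrate out of each summary line and reports how much each card changed between two summaries. It assumes the `GPU #N: <model> - <rate> MH/s` layout seen in this paste and also strips color codes that show up as `.[0m`-style debris. The helper names `gpu_hashrates` and `hashrate_drop` are hypothetical, not part of T-Rex or HiveOS.

```python
import re

# Color codes appear in the paste with the ESC byte rendered as "."; strip both forms.
ANSI = re.compile(r"(?:\x1b|\.)\[[0-9;]*m")
# Per-GPU summary line, e.g. "GPU #0: Zotac RTX 3080 - 98.27 MH/s, [T:56C, P:"
GPU_LINE = re.compile(r"GPU #(\d+): (.+?) - ([\d.]+) MH/s")


def gpu_hashrates(lines):
    """Yield (gpu_number, model, mh_per_s) for each per-GPU summary line."""
    for line in lines:
        clean = ANSI.sub("", line)
        match = GPU_LINE.search(clean)
        if match:
            yield int(match.group(1)), match.group(2), float(match.group(3))


def hashrate_drop(before, after):
    """Map gpu_number -> MH/s lost between two summaries (positive means slower)."""
    previous = {gpu: rate for gpu, _, rate in gpu_hashrates(before)}
    return {
        gpu: round(previous[gpu] - rate, 2)
        for gpu, _, rate in gpu_hashrates(after)
        if gpu in previous
    }
```

Fed the 21:53:02 and 21:53:35 summaries from this log, the sketch would show GPU #0 going from 98.27 to 96.00 MH/s and GPU #5 from 99.91 to 97.31 MH/s. That last summary was printed while the miner was already shutting down, so the drop may reflect the shutdown itself rather than a problem with those cards.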

@trexminer
Owner

Thanks. That seems like an overclock issue. Try running with no OC at all on GPU #2 and see if it crashes again. If it doesn't, then that's your root cause.
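
One way to compare runs with and without the overclock is to count the device errors per GPU in each miner log. The following is a minimal sketch, assuming the log uses the error line format quoted earlier in this thread. `count_device_errors` is a hypothetical helper name, not something T-Rex provides.

```python
import re
from collections import Counter

# Based on the error line quoted in this thread, e.g.
# "20210807 21:53:35 TREX: Can't stop device [ID=2, GPU #2], cuda exception: CUDA_ERROR_LAUNCH_FAILED, ..."
# The pattern is not anchored to the line start because pasted logs may carry leading color-code debris.
DEVICE_ERROR = re.compile(
    r"TREX: Can['’]t stop device "
    r"\[ID=(?P<id>\d+), GPU #(?P<gpu>\d+)\], "
    r"cuda exception: (?P<err>[A-Z_]+)"
)


def count_device_errors(lines):
    """Return a Counter keyed by (gpu_number, cuda_error_name)."""
    counts = Counter()
    for line in lines:
        match = DEVICE_ERROR.search(line)
        if match:
            counts[(int(match.group("gpu")), match.group("err"))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (path is whatever your miner log is called): python count_errors.py miner.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for (gpu, err), n in sorted(count_device_errors(log).items()):
            print(f"GPU #{gpu}: {err} x{n}")
```

If the count for GPU #2 stays at zero over a comparable uptime with the overclock removed, that supports the overclock as the cause. A single short run without errors is weaker evidence, since the crash in this log only appeared after more than two hours.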
