Issue 64 #52

Merged · 20 commits · Aug 21, 2020
Conversation

@AndreaLanfranchi (Contributor) commented Mar 6, 2020

Amend #51

Rationale

The problem highlighted by Kik revealed a vulnerability which allows mining to be carried out while bypassing, almost completely, the memory-hardness built into the algo.
It must be noted that this issue can be exploited only in combination with a custom node implementation, and the current "public" mining infrastructure does not allow the threat vector to go through.
Nevertheless the concern is present, and the code provided by Kik has proven it real.

Kik's discovery relies on two basic assumptions:

  • each block's header_hash can be modified by providing a corresponding value for the "extraData" field
  • due to the above, mining can be carried out while bypassing, almost completely, DAG memory accesses, turning it into a "goal seeking" search on header_hash alone using high Keccak computational power.

As an additional consideration, a "custom node" must be modified to accept a mining result from the work consumer with a modified header_hash (see below).

The two assumptions above hold for both ethash and ProgPoW, even if the latter, due to the smaller width of its "seed", is far more affected.
Both ethash and ProgPoW share the same "macro" steps, with a difference in step 3:

  • the input data (header_hash + nonce) go through a Keccak round which produces a "seed" (256 bits for ethash, 64 bits for ProgPoW)
  • the "seed" drives the access to DAG memory locations which eventually produce a "mix_hash" (256 bits for both algos)
  • in ethash the last Keccak round takes as input 16 words carried over from the previous Keccak round plus the 8 words of mix_hash; in ProgPoW the input is 8 words from header_hash + "seed" + 8 words from mix_hash
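The macro steps can be sketched, for illustration only, in Python with SHA-256 standing in for the Keccak rounds and a placeholder for the DAG pass (all names here are hypothetical, not the actual implementation):

```python
import hashlib

def H(*parts: bytes) -> bytes:
    """Stand-in for a Keccak round (SHA-256 here, for illustration only)."""
    h = hashlib.sha256()
    for p in parts:
        h.update(p)
    return h.digest()

def memory_hard_mix(seed: bytes) -> bytes:
    """Placeholder for the DAG memory accesses driven by the seed."""
    return H(b"dag", seed)

def pow_hash(header_hash: bytes, nonce: int) -> bytes:
    seed = H(header_hash, nonce.to_bytes(8, "little"))  # step 1: initial Keccak
    mix = memory_hard_mix(seed)                         # step 2: memory-hard pass
    return H(seed, mix)                                 # step 3: final Keccak
```

Note that the only memory-hard step is the middle one; everything else is pure hashing.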

This chaining of operations allows an attacker, in ProgPoW (Kik's code), to:
1. skip the first step and directly set an arbitrary value for "seed"
2. obtain mix_hash (doing only one round of memory accesses)
3. linearly iterate through "extraData" increments and build a new_header_hash (Keccak-256) starting from the header_hash sent by the work provider
4. apply new_header_hash + seed + mix as input to the last Keccak round and compare the first two words of the digest to the target; if they match, go ahead, otherwise go to 3
5. we now have a new_header_hash and an "extraData" solving the algo; the one thing left is to find a nonce which, combined with new_header_hash in the initial Keccak round, produces a seed (and by consequence a mix_hash) equal to the one arbitrarily set
6. to do so, simply scan the nonce range (2**64), doing as many Keccak rounds as needed to find the right one; should the range be exhausted, go to 3
7. eventually the miner notifies the "custom node" that hashing has found a solution (nonce), the header_hash having been changed via extraData.
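The procedure above can be modeled as a toy script (hypothetical names; SHA-256 standing in for Keccak, and a 1-byte seed standing in for ProgPoW's 64-bit seed, so the toy search completes instantly):

```python
import hashlib

SEED_BYTES = 1  # toy width; the real ProgPoW 0.9.3 seed is 64 bits

def H(*parts: bytes) -> bytes:
    """Stand-in for a Keccak round (SHA-256, illustration only)."""
    h = hashlib.sha256()
    for p in parts:
        h.update(p)
    return h.digest()

def memory_hard_mix(seed: bytes) -> bytes:
    """Placeholder for the single round of DAG memory accesses."""
    return H(b"dag", seed)

def attack(header_hash: bytes, target: bytes):
    seed = b"\x42"                  # step 1: arbitrary seed, initial Keccak skipped
    mix = memory_hard_mix(seed)     # step 2: one memory pass, reused forever
    for extra in range(1 << 20):    # steps 3-4: Keccak-only goal seeking
        new_header = H(header_hash, extra.to_bytes(4, "little"))
        if H(new_header, seed, mix) < target:
            # steps 5-6: scan nonces until one reproduces the chosen seed
            for nonce in range(1 << 24):
                if H(new_header, nonce.to_bytes(8, "little"))[:SEED_BYTES] == seed:
                    return new_header, nonce
    return None
```

The memory-hard step runs exactly once while both searches are pure hashing, which is the essence of the bypass; in real ProgPoW 0.9.3 the nonce search is a 64-bit goal-seek, approachable with enough Keccak throughput.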

The narrower seed and the different absorb phase of the last Keccak round make the issue orders of magnitude more relevant in ProgPoW than in ethash, but the same pattern applies to both.
It goes without saying that this method is highly impractical and economically unviable on ethash as of today ... but it is there.

Basically, in both algos we have:

- seed  == F(header_hash, nonce)
- mix   == F(seed)              // where memory accesses occur
- final == F(seed, mix)

This also means that by setting an arbitrary seed we get:

- mix   == F(seed)              // where memory accesses occur
- final == F(seed, mix)         // where I find a "new_header"
- nonce == F(new_header, seed)  // which can be iterated very quickly

thus making it possible to brute-force Keccak while bypassing memory accesses.

The purpose of this PR is to extend the dependency on header_hash to the mix as well, so that we have:

- digest_init  == F(header_hash, nonce)
- seed         == F(digest_init)
- mix          == F(seed)              // where memory accesses occur
- digest_final == F(digest_init, mix)

thus making the final hash depend on the 256-bit digest of the initial Keccak round, which in turn depends on the header and the nonce.
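The proposed chaining can be sketched in the same illustrative way (SHA-256 standing in for Keccak, hypothetical names, a placeholder for the DAG pass):

```python
import hashlib

def H(*parts: bytes) -> bytes:
    """Stand-in for a Keccak round (SHA-256, illustration only)."""
    h = hashlib.sha256()
    for p in parts:
        h.update(p)
    return h.digest()

def memory_hard_mix(seed: bytes) -> bytes:
    """Placeholder for the DAG memory accesses driven by the seed."""
    return H(b"dag", seed)

def pow_hash_fixed(header_hash: bytes, nonce: int) -> bytes:
    digest_init = H(header_hash, nonce.to_bytes(8, "little"))  # full 256-bit carry-over
    seed = digest_init[:8]           # the narrow seed still drives the memory pass
    mix = memory_hard_mix(seed)
    # the final round absorbs the full 256-bit carry-over, not the header_hash
    return H(digest_init, mix)
```

With this wiring, fixing seed (and hence mix) no longer decouples the final round from the initial one: digest_init changes with every (header_hash, nonce) trial.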

@OhGodAGirl (Collaborator) commented Mar 6, 2020

@solardiz - Would appreciate your eyes on this.

@AndreaLanfranchi - Will review shortly. Are we able to get resulting profiler results for both AMD (perhaps we'll need assistance from a community member) as well as NVIDIA?

We will also need to ensure this is applied to libprogpow and libethash-cl.

Commit: "bits consumed in fill_mix are only initiators to KISS99"
@AndreaLanfranchi (Contributor, Author)
Benchmarking on Windows, EVGA GTX 1060 3GB.

0.9.3:

 ptxas info    : 0 bytes gmem, 96 bytes cmem[3]
ptxas info    : Compiling entry function '_Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb' for 'sm_61'
ptxas info    : Function properties for _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 56 registers, 16384 bytes smem, 385 bytes cmem[0], 4 bytes cmem[2]
ptxas info    : Function properties for _Z11keccak_f8008hash32_tyS_
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 cu  15:37:30|cuda-0  |  JIT err:

 cu  15:37:30|cuda-0  |  Mangled name: _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
 cu  15:37:30|cuda-0  |  done compiling
Trial 1...
8328502
Trial 2...
8814208
Trial 3...
8924497
Trial 4...
8833632
Trial 5...
8823921
Trial 6...
8818620
Trial 7...
8821270
Trial 8...
8927179
Trial 9...
8836046
Trial 10...
8828343
min/mean/max: 8328502/8795621/8927179 H/s
inner mean: 8837567 H/s

0.9.4 (proposed)

 ptxas info    : 0 bytes gmem, 96 bytes cmem[3]
ptxas info    : Compiling entry function '_Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb' for 'sm_61'
ptxas info    : Function properties for _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 72 registers, 16384 bytes smem, 385 bytes cmem[0], 4 bytes cmem[2]
 cu  15:41:09|cuda-0  |  JIT err:

 cu  15:41:10|cuda-0  |  Mangled name: _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
 cu  15:41:10|cuda-0  |  done compiling
Trial 1...
7634696
Trial 2...
7973961
Trial 3...
7973164
Trial 4...
7971569
Trial 5...
7971569
Trial 6...
7972366
Trial 7...
7972366
Trial 8...
7974759
Trial 9...
7974759
Trial 10...
7975558
min/mean/max: 7634696/7939476/7975558 H/s
inner mean: 7973064 H/s

@solardiz (Contributor)

Hi all. I am impressed by how quickly all of you jumped in to try and fix this. That's great!

I will probably have more comments, but for starters, and most relevant to your current experiments: somehow in my experiments I was able to double the number of the ProgPoW VM's registers, and when done along with doubling the block size (from 256 to 512 bytes) this had very little performance impact - much less than we're seeing above. So maybe this is the opportunity to revisit that and try to fix the new issue on top of those changes, so that the revised ProgPoW would also require more registers in custom hardware?

@lookfirst

Looks like a 9.78% hit on hashrate?

@solardiz (Contributor) commented Mar 10, 2020

@AndreaLanfranchi Can you try settings like those in my 0.9.3m1 proposal?

| Parameter | 0.9.2 | 0.9.3 | 0.9.3m1 |
| --- | --- | --- | --- |
| PROGPOW_PERIOD | 50 | 10 | 10 |
| PROGPOW_LANES | 16 | 16 | 16 |
| PROGPOW_REGS | 32 | 32 | 64 |
| PROGPOW_DAG_LOADS | 4 | 4 | 8 |
| PROGPOW_CACHE_BYTES | 16x1024 | 16x1024 | 16x1024 |
| PROGPOW_CNT_DAG | 64 | 64 | 32 |
| PROGPOW_CNT_CACHE | 12 | 11 | 22 |
| PROGPOW_CNT_MATH | 20 | 18 | 36 |
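The 0.9.3m1 column, transcribed as a plain parameter set (a convenience rendering of the table above, not code from the repository):

```python
# ProgPoW 0.9.3m1 proposed knobs, transcribed from the table above
PROGPOW_093M1 = {
    "PROGPOW_PERIOD": 10,
    "PROGPOW_LANES": 16,
    "PROGPOW_REGS": 64,
    "PROGPOW_DAG_LOADS": 8,
    "PROGPOW_CACHE_BYTES": 16 * 1024,
    "PROGPOW_CNT_DAG": 32,
    "PROGPOW_CNT_CACHE": 22,
    "PROGPOW_CNT_MATH": 36,
}
```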

Due to merging of #35 back then, this should just work.

In 0.9.3m1+, I also had a code tweak to use groups of 4 sequential 512-byte blocks each, thus random reads of 2048-byte blocks, which helped Maxwell a lot. I guess that isn't essential for your testing on Pascal and newer, but is something we can try re-adding as well.

In the above, I set REGS to 64, but with more register pressure from the fix here I guess you can reasonably try e.g. 56 or 48 as well.

@solardiz (Contributor)

The tweak I had proposed in #40 (with two versions of code changes that are ready for use) might also help regain some of the performance.

@AndreaLanfranchi (Contributor, Author) commented Mar 10, 2020

@solardiz
there you go

0.9.3m1

 ptxas info    : 0 bytes gmem, 96 bytes cmem[3]
ptxas info    : Compiling entry function '_Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb' for 'sm_61'
ptxas info    : Function properties for _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 96 registers, 16384 bytes smem, 385 bytes cmem[0], 4 bytes cmem[2]
ptxas info    : Function properties for _Z11keccak_f8008hash32_tyS_
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 cu  17:12:54|cuda-0  |  JIT err:

 cu  17:12:54|cuda-0  |  Mangled name: _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
 cu  17:12:54|cuda-0  |  done compiling
Trial 1...
7633847
Trial 2...
8287894
Trial 3...
8184622
Trial 4...
8287065
Trial 5...
8076458
Trial 6...
8075650
Trial 7...
7970771
Trial 8...
8076458
Trial 9...
8077266
Trial 10...
8080499
min/mean/max: 7633847/8075053/8287894 H/s
inner mean: 8103598 H/s

Worth mentioning that those tweaks have no effect at all in mitigating #51.

@solardiz (Contributor)

@AndreaLanfranchi Is this the 0.9.3m1 settings (but not its code tweaks) + your code tweaks proposed here? In other words, a slight speedup over what you had? If so, can you also try REGS 56, 48, or such? Thanks!

@AndreaLanfranchi (Contributor, Author) commented Mar 10, 2020

No. This is the knob tweaks from the third column applied on top of 0.9.3.
Want me to apply those knobs to the proposed 0.9.4?

@solardiz (Contributor)

@AndreaLanfranchi Hmm, that's a bigger slowdown than I expected. Maybe the move to 2048 byte blocks is needed to reduce the cost of those changes on Pascal as well. Can you try my final code from https://github.com/solardiz/ProgPOW (known as 0.9.3m2) on its own and along with your changes? Thanks!

@AndreaLanfranchi (Contributor, Author) commented Mar 10, 2020

@solardiz I will as soon as I have a chance. You have my word. :)

This PR though is not about chasing hashrate but about voiding the "bypass memory hardness" issue. I would like all of us to stay focused on that.

@solardiz (Contributor)

@AndreaLanfranchi Of course, we're primarily fixing the vulnerability here. We just want to do so with minimum impact on hashrate, and maybe also use the opportunity to merge other previously postponed tweaks that make better use of GPUs vs. specialized ASICs.

Also, I imagine it should be possible to write the code such that we explicitly spill the now-larger intermediate hash value to memory (maybe to shared aka local? or even to global+cache, which is probably better than having the compiler spill something more frequently used), and then load it back from memory when it's needed, so that there's no increase in register pressure during the main loop.

@AndreaLanfranchi (Contributor, Author)

@solardiz this proposal (CUDA only, for now) increases register usage from 56 to 72, with no spills.
There is no "larger" intermediate hash: state[25] was already needed to run any of the 22 Keccak rounds, and the "only" modification is that the last Keccak round absorbs not the header_hash but the carry-over (8 words) from the initial Keccak round.

@solardiz (Contributor)

@AndreaLanfranchi OK, I probably need to actually review/try it before speaking, but since I already started: from the source, it looks like the increase in register pressure inside the per-lane loop is only slight, by maybe 3 registers for hash_seed[1] and state[1]. 56 to 72 is 16 extra registers. Perhaps we can reduce that by manual spilling/reloading of some of state (index 2 and higher) for the duration of this loop?

@AndreaLanfranchi (Contributor, Author) commented Mar 10, 2020

Inside the per-lane loop there is no increase in register usage.
hash_seed in 0.9.3 was a 64-bit uint byteswapped from the first two words of the first Keccak round.
In the proposed 0.9.4 it is simply those two words, not byteswapped, as endianness inside the loop is totally irrelevant.

@solardiz (Contributor)

@AndreaLanfranchi What block number are your benchmark results above for? I'd probably use the same for mine.

@solardiz (Contributor)

For me, the slowdown with these changes (as-is, except for the Hunter/Boost build fixes) on GTX 1080 for block 10M appears to be 3.1% using the Trial 10 speeds below:

@AndreaLanfranchi's master branch (cbd1c16):

$ ethminer/ethminer -U -M 10000000 --cuda-devices 0 --benchmark-trials 10
[...]
 cu  17:09:48|cuda-0  |  Using device: GeForce GTX 1080  (Compute 6.1)
 cu  17:09:48|cuda-0  |  Set Device to current
 cu  17:09:48|cuda-0  |  Resetting device
 cu  17:09:48|cuda-0  |  Allocating light with size: 60423872
 cu  17:09:48|cuda-0  |  Generating mining buffers
 cu  17:09:48|cuda-0  |  Generating DAG for GPU # 0  with dagBytes: 3867147904  gridSize: 1024
Trial 1...
0
Trial 2...
 cu  17:10:03|cuda-0  |  Finished DAG
 cu  17:10:03|cuda-0  |  Compile log:
 cu  17:10:03|cuda-0  |  JIT info:
 ptxas info    : 0 bytes gmem, 96 bytes cmem[3]
ptxas info    : Compiling entry function '_Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb' for 'sm_61'
ptxas info    : Function properties for _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 63 registers, 16384 bytes smem, 385 bytes cmem[0], 4 bytes cmem[2]
ptxas info    : Function properties for _Z11keccak_f8008hash32_tyS_
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 cu  17:10:03|cuda-0  |  JIT err:

 cu  17:10:03|cuda-0  |  Mangled name: _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
 cu  17:10:03|cuda-0  |  done compiling
11010048
Trial 3...
13424457
Trial 4...
14028211
Trial 5...
15104025
Trial 6...
15104025
Trial 7...
15104025
Trial 8...
15104025
Trial 9...
15104025
Trial 10...
15104025
min/mean/max: 0/12908686/15104025 H/s
inner mean: 14247855 H/s

@AndreaLanfranchi's Issue-64 branch (04f4bee):

$ ethminer/ethminer -U -M 10000000 --cuda-devices 0 --benchmark-trials 10
[...]
 cu  17:07:32|cuda-0  |  Using device: GeForce GTX 1080  (Compute 6.1)
 cu  17:07:32|cuda-0  |  Set Device to current
 cu  17:07:32|cuda-0  |  Resetting device
 cu  17:07:32|cuda-0  |  Allocating light with size: 60423872
 cu  17:07:32|cuda-0  |  Generating mining buffers
 cu  17:07:32|cuda-0  |  Generating DAG for GPU # 0  with dagBytes: 3867147904  gridSize: 1024
Trial 1...
0
Trial 2...
 cu  17:07:46|cuda-0  |  Finished DAG
 cu  17:07:47|cuda-0  |  Compile log:
 cu  17:07:47|cuda-0  |  JIT info:
 ptxas info    : 0 bytes gmem, 96 bytes cmem[3]
ptxas info    : Compiling entry function '_Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb' for 'sm_61'
ptxas info    : Function properties for _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 72 registers, 16384 bytes smem, 385 bytes cmem[0], 4 bytes cmem[2]
 cu  17:07:47|cuda-0  |  JIT err:

 cu  17:07:47|cuda-0  |  Mangled name: _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
 cu  17:07:47|cuda-0  |  done compiling
11272192
Trial 3...
13214700
Trial 4...
13766001
Trial 5...
14632024
Trial 6...
14632024
Trial 7...
14632024
Trial 8...
14632024
Trial 9...
14632024
Trial 10...
14632024
min/mean/max: 0/12604503/14632024 H/s
inner mean: 13926626 H/s

@solardiz (Contributor)

For me, the hack below reduces register usage from 72 to 64 (the original was 63 here) and completely removes the performance impact on the GTX 1080:

+++ b/libethash-cuda/CUDAMiner_kernel.cu
@@ -159,12 +159,14 @@ progpow_search(
     // Force threads to sync and ensure shared mem is in sync
     __syncthreads();
 
-    uint32_t state[25];     // Keccak's state
     uint32_t hash_seed[2];  // KISS99 initiator
     hash32_t digest;        // Carry-over from mix output
 
+    uint32_t state8[8];
+{
     // Absorb phase for initial round of keccak
     // 1st fill with header data (8 words)
+    uint32_t state[25];     // Keccak's state
     for (int i = 0; i < 8; i++)
         state[i] = header.uint32s[i];
     // 2nd fill with nonce (2 words)
@@ -177,6 +179,10 @@ progpow_search(
     // Run intial keccak round
     keccak_f800(state);
 
+    for (int i = 0; i < 8; i++)
+        state8[i] = state[i];
+}
+
     // Main loop
     #pragma unroll 1
     for (uint32_t h = 0; h < PROGPOW_LANES; h++)
@@ -184,8 +190,8 @@ progpow_search(
         uint32_t mix[PROGPOW_REGS];
 
         // share the first two words of digest across all lanes
-        hash_seed[0] = __shfl_sync(0xFFFFFFFF, state[0], h, PROGPOW_LANES);
-        hash_seed[1] = __shfl_sync(0xFFFFFFFF, state[1], h, PROGPOW_LANES);
+        hash_seed[0] = __shfl_sync(0xFFFFFFFF, state8[0], h, PROGPOW_LANES);
+        hash_seed[1] = __shfl_sync(0xFFFFFFFF, state8[1], h, PROGPOW_LANES);
 
         // initialize mix for all lanes using first
         // two words from header_hash
@@ -217,6 +223,10 @@ progpow_search(
             digest = digest_temp;
     }
 
+    uint32_t state[25];     // Keccak's state
+
+    for (int i = 0; i < 8; i++)
+        state[i] = state8[i];
 
     // Absorb phase for last round of keccak (256 bits)
     // 1st initial 8 words of state are kept as carry-over from initial keccak
 cu  17:59:12|cuda-0  |  Using device: GeForce GTX 1080  (Compute 6.1)
 cu  17:59:12|cuda-0  |  Set Device to current
 cu  17:59:12|cuda-0  |  Resetting device
 cu  17:59:12|cuda-0  |  Allocating light with size: 60423872
 cu  17:59:12|cuda-0  |  Generating mining buffers
 cu  17:59:12|cuda-0  |  Generating DAG for GPU # 0  with dagBytes: 3867147904  gridSize: 1024
Trial 1... 
0
Trial 2... 
 cu  17:59:27|cuda-0  |  Finished DAG
 cu  17:59:27|cuda-0  |  Compile log: 
 cu  17:59:28|cuda-0  |  JIT info: 
 ptxas info    : 0 bytes gmem, 96 bytes cmem[3]
ptxas info    : Compiling entry function '_Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb' for 'sm_61'
ptxas info    : Function properties for _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 64 registers, 16384 bytes smem, 385 bytes cmem[0], 4 bytes cmem[2]
 cu  17:59:28|cuda-0  |  JIT err: 
 
 cu  17:59:28|cuda-0  |  Mangled name: _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
 cu  17:59:28|cuda-0  |  done compiling
10485760
Trial 3... 
13319579
Trial 4... 
13962658
Trial 5... 
15156470
Trial 6... 
15104025
Trial 7... 
15156470
Trial 8... 
15104025
Trial 9... 
15156470
Trial 10... 
15104025
min/mean/max: 0/12854948/15156470 H/s
inner mean: 14174126 H/s

To have greater confidence that this is right, debugging output needs to be enabled (such as with #46) to make sure the computed hashes stay the same. Also, someone needs to implement the same changes in host-only code such as https://github.com/solardiz/c-progpow and in OpenCL, and make sure all 3 compute the same hashes.

@AndreaLanfranchi (Contributor, Author)

@AndreaLanfranchi What block number are your benchmark results above for? I'd probably use the same for mine.

I'm testing on block 0 (epoch 0)

@AndreaLanfranchi (Contributor, Author) commented Mar 15, 2020

For me, the below hack reduces the register usage from 72 to 64 (original was 63 here) and completely removes the performance impact on the GTX 1080 here: [cut]

Nice catch, and I confirm it voids the performance impact.
0.9.4 (proposed), same environment as in my previous benchmarks:

ethminer -U -M 0 --benchmark-trials 10
  m  23:32:11|main    |  ethminer version 0.15.0.dev0
  m  23:32:11|main    |  Build: windows / release +git. 04f4bee
 cu  23:32:11|main    |  Using grid size 1024 , block size 512
Benchmarking on platform: CUDA
Preparing DAG for block #0
  i  23:32:12|cuda-0  |  No work.
  i  23:32:12|cuda-0  |  No work.
  i  23:32:12|cuda-0  |  No work.
Warming up...
  i  23:32:12|cuda-0  |  Initialising miner 0
 cu  23:32:13|cuda-0  |  Using device: GeForce GTX 1060 3GB  (Compute 6.1)
 cu  23:32:13|cuda-0  |  Set Device to current
 cu  23:32:13|cuda-0  |  Resetting device
 cu  23:32:14|cuda-0  |  Allocating light with size: 16776896
 cu  23:32:14|cuda-0  |  Generating mining buffers
 cu  23:32:14|cuda-0  |  Generating DAG for GPU # 0  with dagBytes: 1073739904  gridSize: 1024
 cu  23:32:20|cuda-0  |  Finished DAG
 cu  23:32:20|cuda-0  |  Compile log:
 cu  23:32:20|cuda-0  |  JIT info:
 ptxas info    : 0 bytes gmem, 96 bytes cmem[3]
ptxas info    : Compiling entry function '_Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb' for 'sm_61'
ptxas info    : Function properties for _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 64 registers, 16384 bytes smem, 385 bytes cmem[0], 4 bytes cmem[2]
 cu  23:32:20|cuda-0  |  JIT err:

 cu  23:32:20|cuda-0  |  Mangled name: _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
 cu  23:32:20|cuda-0  |  done compiling
Trial 1...
8448739
Trial 2...
8915570
Trial 3...
8811563
Trial 4...
8813326
Trial 5...
8816855
Trial 6...
8822153
Trial 7...
8827458
Trial 8...
8831884
Trial 9...
8836314
Trial 10...
8839861
min/mean/max: 8448739/8796372/8915570 H/s
inner mean: 8824926 H/s

Nevertheless, optimizations aside, I'd like a comment on the effectiveness of this change with regard to the issue raised by Kikx. Thank you.

@blondfrogs left a comment:

I don't think the digest selection is correct.

@EtherCoreAdmin

Why don't we fix this also? #49

@AndreaLanfranchi (Contributor, Author)

@ethcoreorg because it's outside the scope of this PR.
Feel free to propose a new PR.

@AndreaLanfranchi (Contributor, Author)

@blondfrogs you're absolutely right.
Amending

Commit: "LLVM is smart enough to figure out which elements are zeroes and which are copied."
@AndreaLanfranchi (Contributor, Author)

@solardiz

    return keccak_f800_progpow(header, seed, digest);

to:

    return keccak_f800_progpow(header, nonce, digest);

This IMHO doesn't solve the issue and keeps us in the 0.9.3 situation: again, we can pick an arbitrary seed (64-bit) value which produces a digest, then brute-force the last Keccak round to find a header (+ extraNonce + nonce) producing a result matching the target and with a seed output identical to the one chosen. We're still brute-forcing Keccak in search of a 64-bit value.

With the carry-over from the initial Keccak round instead of the header (pretty much like in ethash) we force a linear search of header+nonce which must reproduce the same Keccak squeeze on a 256-bit basis.

@AndreaLanfranchi (Contributor, Author)

I'd personally add more ... in ethash the first Keccak absorb is constrained (the 10th word is 0x00000001 and the 17th word is 0x80000000) and so is the second (the 24th word is 0x00000001 and the 37th word is 0x80000000).

Constraining the Keccak absorb (initial and last) further reduces (practically voids) the likelihood of a successful brute-force attempt.

@solardiz (Contributor)

This IMHO doesn't solve the issue and keeps us in the 0.9.3 situation: again, we can pick an arbitrary seed (64-bit) value which produces a digest, then brute-force the last Keccak round to find a header (+ extraNonce + nonce) producing a result matching the target and with a seed output identical to the one chosen. We're still brute-forcing Keccak in search of a 64-bit value.

This is what I described as "variation of the attack that would still work would involve re-doing steps 3 and 4, then 5 again, etc. until both conditions are met at once" and explained why it's "clearly impractical". That said, I agree we'd better defeat even this possibility as well, which we'd do by increasing seed size as part of addressing #54.

With the carry-over from the initial Keccak round instead of the header (pretty much like in ethash) we force a linear search of header+nonce which must reproduce the same Keccak squeeze on a 256-bit basis.

I don't understand this. To me, your changes are equivalent to the trivial change I proposed. The variation of attack described above is possible for both. The only difference appears to be 2 vs. 1 Keccak computations per trial, which is an insignificant difference for our decision-making.

@AndreaLanfranchi (Contributor, Author) commented Mar 17, 2020

The only difference appears to be 2 vs. 1 Keccak computations per trial, which is an insignificant difference for our decision-making.

... in search of an identical 256-bit value instead of one of 64 bits.

@solardiz (Contributor)

... in search of an identical 256-bit value instead of one of 64 bits.

I don't see how your proposal achieves that. Given your own description:

- digest == F(header_hash, nonce)
- seed   == F(digest)
- mix    == F(seed)          // where memory accesses occur
- final  == F(digest, mix)

Here, let's assume seed and thus mix are fixed. To make final meet target, we'd be altering header_hash and nonce, and we'd also require that seed stays the same. This doesn't imply that digest stays the same - it will freely keep changing.
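This point can be demonstrated with a toy model (hypothetical names; SHA-256 standing in for Keccak, a 1-byte seed for illustration): distinct nonces can map to the same seed while producing different digests, so fixing the seed does not pin the final-round input to one value:

```python
import hashlib

SEED_BYTES = 1  # toy width for illustration

def H(*parts: bytes) -> bytes:
    """Stand-in for a Keccak round (SHA-256, illustration only)."""
    h = hashlib.sha256()
    for p in parts:
        h.update(p)
    return h.digest()

def digests_for_seed(seed: bytes, count: int):
    """Collect digests from distinct nonces that all map to the same seed."""
    out = []
    nonce = 0
    while len(out) < count:
        digest = H(b"\x00" * 32, nonce.to_bytes(8, "little"))
        if digest[:SEED_BYTES] == seed:
            out.append(digest)
        nonce += 1
    return out

d1, d2 = digests_for_seed(b"\x42", 2)
assert d1[:SEED_BYTES] == d2[:SEED_BYTES]  # same seed, hence same mix
assert d1 != d2                            # but the digest keeps changing
```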

@AndreaLanfranchi (Contributor, Author)

I don't see how your proposal achieves that. Given your own description:

Yes ... sorry ... please focus on code ... description is messed up.

@solardiz (Contributor)

please focus on code ... description is messed up.

I see no relevant differences in the code, and that part of the description actually looks good enough for the present discussion, despite the simplifications. The part of the description that was relevantly wrong and unusable for the discussion was "extend the dependency from header_hash to mix too".

@AndreaLanfranchi (Contributor, Author) commented Apr 30, 2020

NVIDIA profiling of this PR (0.9.4): [profiler screenshot]

Previous release (0.9.3): [profiler screenshot]

@ifdefelse (Owner)

This looks good to us. It would be good to get profiling and performance results for an HBM2 card and a GDDR5X card.

@lookfirst

0.88% L2 is ok?

@gcolvin commented Aug 19, 2020

It's been a long time. Is there a reason this cannot be merged, or at least referenced as correct? I'm wanting to get a nearly-final EIP-1057 to the other core devs soon.

@AndreaLanfranchi @ifdefelse @OhGodAGirl @lookfirst @solardiz @blondfrogs

@ifdefelse merged commit d035ae5 into ifdefelse:master on Aug 21, 2020
@gcolvin commented Aug 21, 2020

Thanks @ifdefelse !
