Issue 64 #52

Merged · 20 commits · Aug 21, 2020
Conversation

@AndreaLanfranchi (Contributor) commented Mar 6, 2020

Amend #51

Rationale

The problem highlighted by Kik revealed a vulnerability which allows mining to be carried out while bypassing, almost completely, the memory-hardness built into the algo.
It must be noted that this issue can be exploited only in combination with a custom node implementation, and the current "public" mining infrastructure does not allow the threat vector to go through.
Nevertheless the concern is present, and the code provided by Kik has proven it real.

Kik's discovery relies on two basic assumptions:

  • each block's header_hash can be modified by providing a corresponding value for the "extraData" field
  • due to the above, mining can be carried out while bypassing, almost completely, DAG memory accesses, turning it into a "goal seeking" search on header_hash alone using high Keccak computational power.

As an additional consideration, a "custom node" must be modified to accept a mining result from the work consumer with a modified header_hash (see below).

The two assumptions above hold for both ethash and ProgPoW, even if the latter, due to the smaller width of its "seed", is far more affected.
Both ethash and ProgPoW share the same "macro" steps, with a difference in step 3:

  • the input data (header_hash + nonce) go through a Keccak round which produces a "seed" (256 bits for ethash, 64 bits for ProgPoW)
  • the "seed" drives the access to DAG memory locations which eventually produce a "mix_hash" (256 bits for both algos)
  • in ethash the last Keccak round takes as input 16 words carried over from the previous Keccak round plus the 8 words of mix_hash; in ProgPoW the input is 8 words from header_hash + "seed" + 8 words from mix_hash
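The macro steps can be sketched, for illustration only, in Python with SHA-256 standing in for the Keccak rounds and a placeholder for the DAG pass (all names here are hypothetical, not the actual implementation):

```python
import hashlib

def H(*parts: bytes) -> bytes:
    """Stand-in for a Keccak round (SHA-256 here, for illustration only)."""
    h = hashlib.sha256()
    for p in parts:
        h.update(p)
    return h.digest()

def memory_hard_mix(seed: bytes) -> bytes:
    """Placeholder for the DAG memory accesses driven by the seed."""
    return H(b"dag", seed)

def pow_hash(header_hash: bytes, nonce: int) -> bytes:
    seed = H(header_hash, nonce.to_bytes(8, "little"))  # step 1: initial Keccak
    mix = memory_hard_mix(seed)                         # step 2: memory-hard pass
    return H(seed, mix)                                 # step 3: final Keccak
```

Note that the only memory-hard step is the middle one; everything else is pure hashing.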

This chaining of operations allows an attacker, in ProgPoW (Kik's code), to:
1. skip the first step and directly set an arbitrary value for "seed"
2. obtain mix_hash (doing only one round of memory accesses)
3. linearly iterate through "extraData" increments and build a new_header_hash (Keccak-256) starting from the header_hash sent by the work provider
4. apply new_header_hash + seed + mix as input to the last Keccak round and compare the first two words of the digest to the target; if they match, go ahead, otherwise go to 3
5. we now have a new_header_hash and an "extraData" solving the algo; the one thing left is to find a nonce which, combined with new_header_hash in the initial Keccak round, produces a seed (and by consequence a mix_hash) equal to the one arbitrarily set
6. to do so, simply scan the nonce range (2**64), doing as many Keccak rounds as needed to find the right one; should the range be exhausted, go to 3
7. eventually the miner notifies the "custom node" that hashing has found a solution (nonce), the header_hash having been changed via extraData.
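The procedure above can be modeled as a toy script (hypothetical names; SHA-256 standing in for Keccak, and a 1-byte seed standing in for ProgPoW's 64-bit seed, so the toy search completes instantly):

```python
import hashlib

SEED_BYTES = 1  # toy width; the real ProgPoW 0.9.3 seed is 64 bits

def H(*parts: bytes) -> bytes:
    """Stand-in for a Keccak round (SHA-256, illustration only)."""
    h = hashlib.sha256()
    for p in parts:
        h.update(p)
    return h.digest()

def memory_hard_mix(seed: bytes) -> bytes:
    """Placeholder for the single round of DAG memory accesses."""
    return H(b"dag", seed)

def attack(header_hash: bytes, target: bytes):
    seed = b"\x42"                  # step 1: arbitrary seed, initial Keccak skipped
    mix = memory_hard_mix(seed)     # step 2: one memory pass, reused forever
    for extra in range(1 << 20):    # steps 3-4: Keccak-only goal seeking
        new_header = H(header_hash, extra.to_bytes(4, "little"))
        if H(new_header, seed, mix) < target:
            # steps 5-6: scan nonces until one reproduces the chosen seed
            for nonce in range(1 << 24):
                if H(new_header, nonce.to_bytes(8, "little"))[:SEED_BYTES] == seed:
                    return new_header, nonce
    return None
```

The memory-hard step runs exactly once while both searches are pure hashing, which is the essence of the bypass; in real ProgPoW 0.9.3 the nonce search is a 64-bit goal-seek, approachable with enough Keccak throughput.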

The narrower seed and the different absorb phase of the last Keccak round make the issue orders of magnitude more relevant in ProgPoW than in ethash, but the same pattern applies to both.
It goes without saying that this method is highly impractical and economically unviable on ethash as of today ... but it is there.

Basically, in both algos we have:

- seed  == F(header_hash, nonce)
- mix   == F(seed)              // where memory accesses occur
- final == F(seed, mix)

This also means that by setting an arbitrary seed we get:

- mix   == F(seed)              // where memory accesses occur
- final == F(seed, mix)         // where I find a "new_header"
- nonce == F(new_header, seed)  // which can be iterated very quickly

thus making it possible to brute-force Keccak while bypassing memory accesses.

The purpose of this PR is to extend the dependency on header_hash to the mix as well, so that we have:

- digest_init  == F(header_hash, nonce)
- seed         == F(digest_init)
- mix          == F(seed)              // where memory accesses occur
- digest_final == F(digest_init, mix)

thus making the final hash depend on the 256-bit digest of the initial Keccak round, which in turn depends on the header and the nonce.
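The proposed chaining can be sketched in the same illustrative way (SHA-256 standing in for Keccak, hypothetical names, a placeholder for the DAG pass):

```python
import hashlib

def H(*parts: bytes) -> bytes:
    """Stand-in for a Keccak round (SHA-256, illustration only)."""
    h = hashlib.sha256()
    for p in parts:
        h.update(p)
    return h.digest()

def memory_hard_mix(seed: bytes) -> bytes:
    """Placeholder for the DAG memory accesses driven by the seed."""
    return H(b"dag", seed)

def pow_hash_fixed(header_hash: bytes, nonce: int) -> bytes:
    digest_init = H(header_hash, nonce.to_bytes(8, "little"))  # full 256-bit carry-over
    seed = digest_init[:8]           # the narrow seed still drives the memory pass
    mix = memory_hard_mix(seed)
    # the final round absorbs the full 256-bit carry-over, not the header_hash
    return H(digest_init, mix)
```

With this wiring, fixing seed (and hence mix) no longer decouples the final round from the initial one: digest_init changes with every (header_hash, nonce) trial.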

@OhGodAGirl (Collaborator) commented Mar 6, 2020

@solardiz - Would appreciate your eyes on this.

@AndreaLanfranchi - Will review shortly. Are we able to get resulting profiler results for both AMD (perhaps we'll need assistance from a community member) as well as NVIDIA?

We will also need to ensure this is applied to libprogpow and libethash-cl.

Commit: "bits consumed in fill_mix are only initiators to KISS99"
@AndreaLanfranchi (Contributor, Author)
Benchmarking on Windows, EVGA GTX 1060 3GB.

0.9.3:

 ptxas info    : 0 bytes gmem, 96 bytes cmem[3]
ptxas info    : Compiling entry function '_Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb' for 'sm_61'
ptxas info    : Function properties for _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 56 registers, 16384 bytes smem, 385 bytes cmem[0], 4 bytes cmem[2]
ptxas info    : Function properties for _Z11keccak_f8008hash32_tyS_
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 cu  15:37:30|cuda-0  |  JIT err:

 cu  15:37:30|cuda-0  |  Mangled name: _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
 cu  15:37:30|cuda-0  |  done compiling
Trial 1...
8328502
Trial 2...
8814208
Trial 3...
8924497
Trial 4...
8833632
Trial 5...
8823921
Trial 6...
8818620
Trial 7...
8821270
Trial 8...
8927179
Trial 9...
8836046
Trial 10...
8828343
min/mean/max: 8328502/8795621/8927179 H/s
inner mean: 8837567 H/s

0.9.4 (proposed)

 ptxas info    : 0 bytes gmem, 96 bytes cmem[3]
ptxas info    : Compiling entry function '_Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb' for 'sm_61'
ptxas info    : Function properties for _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 72 registers, 16384 bytes smem, 385 bytes cmem[0], 4 bytes cmem[2]
 cu  15:41:09|cuda-0  |  JIT err:

 cu  15:41:10|cuda-0  |  Mangled name: _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
 cu  15:41:10|cuda-0  |  done compiling
Trial 1...
7634696
Trial 2...
7973961
Trial 3...
7973164
Trial 4...
7971569
Trial 5...
7971569
Trial 6...
7972366
Trial 7...
7972366
Trial 8...
7974759
Trial 9...
7974759
Trial 10...
7975558
min/mean/max: 7634696/7939476/7975558 H/s
inner mean: 7973064 H/s

@solardiz (Contributor)

Hi all. I am impressed by how quickly all of you jumped in to try and fix this. That's great!

I will probably have more comments, but for starters, and most relevant to your current experiments: somehow in my experiments I was able to double the number of the ProgPoW VM's registers, and when done along with doubling the block size (from 256 to 512 bytes) this had very little performance impact - much less than we're seeing above. So maybe this is the opportunity to revisit that and try to fix the new issue on top of those changes, so that the revised ProgPoW would also require more registers in custom hardware?

@lookfirst

Looks like a 9.78% hit on hashrate?

@solardiz (Contributor) commented Mar 10, 2020

@AndreaLanfranchi Can you try settings like those in my 0.9.3m1 proposal?

| Parameter | 0.9.2 | 0.9.3 | 0.9.3m1 |
| --- | --- | --- | --- |
| PROGPOW_PERIOD | 50 | 10 | 10 |
| PROGPOW_LANES | 16 | 16 | 16 |
| PROGPOW_REGS | 32 | 32 | 64 |
| PROGPOW_DAG_LOADS | 4 | 4 | 8 |
| PROGPOW_CACHE_BYTES | 16x1024 | 16x1024 | 16x1024 |
| PROGPOW_CNT_DAG | 64 | 64 | 32 |
| PROGPOW_CNT_CACHE | 12 | 11 | 22 |
| PROGPOW_CNT_MATH | 20 | 18 | 36 |
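The 0.9.3m1 column, transcribed as a plain parameter set (a convenience rendering of the table above, not code from the repository):

```python
# ProgPoW 0.9.3m1 proposed knobs, transcribed from the table above
PROGPOW_093M1 = {
    "PROGPOW_PERIOD": 10,
    "PROGPOW_LANES": 16,
    "PROGPOW_REGS": 64,
    "PROGPOW_DAG_LOADS": 8,
    "PROGPOW_CACHE_BYTES": 16 * 1024,
    "PROGPOW_CNT_DAG": 32,
    "PROGPOW_CNT_CACHE": 22,
    "PROGPOW_CNT_MATH": 36,
}
```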

Due to merging of #35 back then, this should just work.

In 0.9.3m1+, I also had a code tweak to use groups of 4 sequential 512-byte blocks each, thus random reads of 2048-byte blocks, which helped Maxwell a lot. I guess that isn't essential for your testing on Pascal and newer, but is something we can try re-adding as well.

In the above, I set REGS to 64, but with more register pressure from the fix here I guess you can reasonably try e.g. 56 or 48 as well.

@solardiz (Contributor)

The tweak I had proposed in #40 (with two versions of code changes that are ready for use) might also help regain some of the performance.

@AndreaLanfranchi (Contributor, Author) commented Mar 10, 2020

@solardiz
there you go

0.9.3m1

 ptxas info    : 0 bytes gmem, 96 bytes cmem[3]
ptxas info    : Compiling entry function '_Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb' for 'sm_61'
ptxas info    : Function properties for _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 96 registers, 16384 bytes smem, 385 bytes cmem[0], 4 bytes cmem[2]
ptxas info    : Function properties for _Z11keccak_f8008hash32_tyS_
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 cu  17:12:54|cuda-0  |  JIT err:

 cu  17:12:54|cuda-0  |  Mangled name: _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
 cu  17:12:54|cuda-0  |  done compiling
Trial 1...
7633847
Trial 2...
8287894
Trial 3...
8184622
Trial 4...
8287065
Trial 5...
8076458
Trial 6...
8075650
Trial 7...
7970771
Trial 8...
8076458
Trial 9...
8077266
Trial 10...
8080499
min/mean/max: 7633847/8075053/8287894 H/s
inner mean: 8103598 H/s

Worth mentioning that those tweaks have no effect at all in mitigating #51.

@solardiz (Contributor)

@AndreaLanfranchi Is this the 0.9.3m1 settings (but not its code tweaks) + your code tweaks proposed here? In other words, a slight speedup over what you had? If so, can you also try REGS 56, 48, or such? Thanks!

@AndreaLanfranchi (Contributor, Author) commented Mar 10, 2020

No. This is the knob tweaks from the third column applied on top of 0.9.3.
Want me to apply those knobs to the proposed 0.9.4?

@solardiz (Contributor)

@AndreaLanfranchi Hmm, that's a bigger slowdown than I expected. Maybe the move to 2048 byte blocks is needed to reduce the cost of those changes on Pascal as well. Can you try my final code from https://github.com/solardiz/ProgPOW (known as 0.9.3m2) on its own and along with your changes? Thanks!

@AndreaLanfranchi (Contributor, Author) commented Mar 10, 2020

@solardiz I will as soon as I have a chance. You have my word. :)

This PR though is not about chasing hashrate but about voiding the "bypass memory hardness" issue. I would like all of us to stay focused on that.

@solardiz (Contributor)

@AndreaLanfranchi Of course, we're primarily fixing the vulnerability here. We just want to do so with minimum impact on hashrate, and maybe also use the opportunity to merge other previously postponed tweaks that make better use of GPUs vs. specialized ASICs.

Also, I imagine it should be possible to write the code such that we explicitly spill the now-larger intermediate hash value to memory (maybe to shared aka local? or even to global+cache, which is probably better than having the compiler spill something more frequently used), and then load it back from memory when it's needed, so that there's no increase in register pressure during the main loop.

@AndreaLanfranchi (Contributor, Author)

@solardiz this proposal (CUDA only, for now) increases register usage from 56 to 72, with no spills.
There is no "larger" intermediate hash: state[25] was already needed to run any of the 22 Keccak rounds, and the "only" modification is that the last Keccak round absorbs not the header_hash but the carry-over (8 words) from the initial Keccak round.

@solardiz (Contributor)

@AndreaLanfranchi OK, I probably need to actually review/try it before speaking, but since I already started: from the source, it looks like the increase in register pressure inside the per-lane loop is only slight, by maybe 3 registers for hash_seed[1] and state[1]. 56 to 72 is 16 extra registers. Perhaps we can reduce that by manual spilling/reloading of some of state (index 2 and higher) for the duration of this loop?

@AndreaLanfranchi (Contributor, Author) commented Mar 10, 2020

Inside the per-lane loop there is no increase in register usage.
hash_seed in 0.9.3 was a 64-bit uint byteswapped from the first two words of the first Keccak round.
In the proposed 0.9.4 it is simply those two words, not byteswapped, as endianness inside the loop is totally irrelevant.

@solardiz (Contributor)

@AndreaLanfranchi What block number are your benchmark results above for? I'd probably use the same for mine.

@solardiz (Contributor)

For me, the slowdown with these changes (as-is, except for the Hunter/Boost build fixes) on GTX 1080 for block 10M appears to be 3.1% using the Trial 10 speeds below:

@AndreaLanfranchi's master branch (cbd1c16):

$ ethminer/ethminer -U -M 10000000 --cuda-devices 0 --benchmark-trials 10
[...]
 cu  17:09:48|cuda-0  |  Using device: GeForce GTX 1080  (Compute 6.1)
 cu  17:09:48|cuda-0  |  Set Device to current
 cu  17:09:48|cuda-0  |  Resetting device
 cu  17:09:48|cuda-0  |  Allocating light with size: 60423872
 cu  17:09:48|cuda-0  |  Generating mining buffers
 cu  17:09:48|cuda-0  |  Generating DAG for GPU # 0  with dagBytes: 3867147904  gridSize: 1024
Trial 1...
0
Trial 2...
 cu  17:10:03|cuda-0  |  Finished DAG
 cu  17:10:03|cuda-0  |  Compile log:
 cu  17:10:03|cuda-0  |  JIT info:
 ptxas info    : 0 bytes gmem, 96 bytes cmem[3]
ptxas info    : Compiling entry function '_Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb' for 'sm_61'
ptxas info    : Function properties for _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 63 registers, 16384 bytes smem, 385 bytes cmem[0], 4 bytes cmem[2]
ptxas info    : Function properties for _Z11keccak_f8008hash32_tyS_
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 cu  17:10:03|cuda-0  |  JIT err:

 cu  17:10:03|cuda-0  |  Mangled name: _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
 cu  17:10:03|cuda-0  |  done compiling
11010048
Trial 3...
13424457
Trial 4...
14028211
Trial 5...
15104025
Trial 6...
15104025
Trial 7...
15104025
Trial 8...
15104025
Trial 9...
15104025
Trial 10...
15104025
min/mean/max: 0/12908686/15104025 H/s
inner mean: 14247855 H/s

@AndreaLanfranchi's Issue-64 branch (04f4bee):

$ ethminer/ethminer -U -M 10000000 --cuda-devices 0 --benchmark-trials 10
[...]
 cu  17:07:32|cuda-0  |  Using device: GeForce GTX 1080  (Compute 6.1)
 cu  17:07:32|cuda-0  |  Set Device to current
 cu  17:07:32|cuda-0  |  Resetting device
 cu  17:07:32|cuda-0  |  Allocating light with size: 60423872
 cu  17:07:32|cuda-0  |  Generating mining buffers
 cu  17:07:32|cuda-0  |  Generating DAG for GPU # 0  with dagBytes: 3867147904  gridSize: 1024
Trial 1...
0
Trial 2...
 cu  17:07:46|cuda-0  |  Finished DAG
 cu  17:07:47|cuda-0  |  Compile log:
 cu  17:07:47|cuda-0  |  JIT info:
 ptxas info    : 0 bytes gmem, 96 bytes cmem[3]
ptxas info    : Compiling entry function '_Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb' for 'sm_61'
ptxas info    : Function properties for _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 72 registers, 16384 bytes smem, 385 bytes cmem[0], 4 bytes cmem[2]
 cu  17:07:47|cuda-0  |  JIT err:

 cu  17:07:47|cuda-0  |  Mangled name: _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
 cu  17:07:47|cuda-0  |  done compiling
11272192
Trial 3...
13214700
Trial 4...
13766001
Trial 5...
14632024
Trial 6...
14632024
Trial 7...
14632024
Trial 8...
14632024
Trial 9...
14632024
Trial 10...
14632024
min/mean/max: 0/12604503/14632024 H/s
inner mean: 13926626 H/s

@solardiz (Contributor)

For me, the hack below reduces register usage from 72 to 64 (the original was 63 here) and completely removes the performance impact on the GTX 1080:

+++ b/libethash-cuda/CUDAMiner_kernel.cu
@@ -159,12 +159,14 @@ progpow_search(
     // Force threads to sync and ensure shared mem is in sync
     __syncthreads();
 
-    uint32_t state[25];     // Keccak's state
     uint32_t hash_seed[2];  // KISS99 initiator
     hash32_t digest;        // Carry-over from mix output
 
+    uint32_t state8[8];
+{
     // Absorb phase for initial round of keccak
     // 1st fill with header data (8 words)
+    uint32_t state[25];     // Keccak's state
     for (int i = 0; i < 8; i++)
         state[i] = header.uint32s[i];
     // 2nd fill with nonce (2 words)
@@ -177,6 +179,10 @@ progpow_search(
     // Run intial keccak round
     keccak_f800(state);
 
+    for (int i = 0; i < 8; i++)
+        state8[i] = state[i];
+}
+
     // Main loop
     #pragma unroll 1
     for (uint32_t h = 0; h < PROGPOW_LANES; h++)
@@ -184,8 +190,8 @@ progpow_search(
         uint32_t mix[PROGPOW_REGS];
 
         // share the first two words of digest across all lanes
-        hash_seed[0] = __shfl_sync(0xFFFFFFFF, state[0], h, PROGPOW_LANES);
-        hash_seed[1] = __shfl_sync(0xFFFFFFFF, state[1], h, PROGPOW_LANES);
+        hash_seed[0] = __shfl_sync(0xFFFFFFFF, state8[0], h, PROGPOW_LANES);
+        hash_seed[1] = __shfl_sync(0xFFFFFFFF, state8[1], h, PROGPOW_LANES);
 
         // initialize mix for all lanes using first
         // two words from header_hash
@@ -217,6 +223,10 @@ progpow_search(
             digest = digest_temp;
     }
 
+    uint32_t state[25];     // Keccak's state
+
+    for (int i = 0; i < 8; i++)
+        state[i] = state8[i];
 
     // Absorb phase for last round of keccak (256 bits)
     // 1st initial 8 words of state are kept as carry-over from initial keccak
 cu  17:59:12|cuda-0  |  Using device: GeForce GTX 1080  (Compute 6.1)
 cu  17:59:12|cuda-0  |  Set Device to current
 cu  17:59:12|cuda-0  |  Resetting device
 cu  17:59:12|cuda-0  |  Allocating light with size: 60423872
 cu  17:59:12|cuda-0  |  Generating mining buffers
 cu  17:59:12|cuda-0  |  Generating DAG for GPU # 0  with dagBytes: 3867147904  gridSize: 1024
Trial 1... 
0
Trial 2... 
 cu  17:59:27|cuda-0  |  Finished DAG
 cu  17:59:27|cuda-0  |  Compile log: 
 cu  17:59:28|cuda-0  |  JIT info: 
 ptxas info    : 0 bytes gmem, 96 bytes cmem[3]
ptxas info    : Compiling entry function '_Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb' for 'sm_61'
ptxas info    : Function properties for _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 64 registers, 16384 bytes smem, 385 bytes cmem[0], 4 bytes cmem[2]
 cu  17:59:28|cuda-0  |  JIT err: 
 
 cu  17:59:28|cuda-0  |  Mangled name: _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
 cu  17:59:28|cuda-0  |  done compiling
10485760
Trial 3... 
13319579
Trial 4... 
13962658
Trial 5... 
15156470
Trial 6... 
15104025
Trial 7... 
15156470
Trial 8... 
15104025
Trial 9... 
15156470
Trial 10... 
15104025
min/mean/max: 0/12854948/15156470 H/s
inner mean: 14174126 H/s

To have greater confidence that this is right, debugging output needs to be enabled (such as with #46) to make sure the computed hashes stay the same. Also, someone needs to implement the same changes in host-only code such as https://github.com/solardiz/c-progpow and in OpenCL, and make sure all 3 compute the same hashes.

@AndreaLanfranchi (Contributor, Author)

@AndreaLanfranchi What block number are your benchmark results above for? I'd probably use the same for mine.

I'm testing on block 0 (epoch 0)

@AndreaLanfranchi (Contributor, Author) commented Mar 15, 2020

For me, the below hack reduces the register usage from 72 to 64 (original was 63 here) and completely removes the performance impact on the GTX 1080 here: [cut]

Nice catch, and I confirm it voids the performance impact.
0.9.4 (proposed), same environment as in my previous benchmarks:

ethminer -U -M 0 --benchmark-trials 10
  m  23:32:11|main    |  ethminer version 0.15.0.dev0
  m  23:32:11|main    |  Build: windows / release +git. 04f4bee
 cu  23:32:11|main    |  Using grid size 1024 , block size 512
Benchmarking on platform: CUDA
Preparing DAG for block #0
  i  23:32:12|cuda-0  |  No work.
  i  23:32:12|cuda-0  |  No work.
  i  23:32:12|cuda-0  |  No work.
Warming up...
  i  23:32:12|cuda-0  |  Initialising miner 0
 cu  23:32:13|cuda-0  |  Using device: GeForce GTX 1060 3GB  (Compute 6.1)
 cu  23:32:13|cuda-0  |  Set Device to current
 cu  23:32:13|cuda-0  |  Resetting device
 cu  23:32:14|cuda-0  |  Allocating light with size: 16776896
 cu  23:32:14|cuda-0  |  Generating mining buffers
 cu  23:32:14|cuda-0  |  Generating DAG for GPU # 0  with dagBytes: 1073739904  gridSize: 1024
 cu  23:32:20|cuda-0  |  Finished DAG
 cu  23:32:20|cuda-0  |  Compile log:
 cu  23:32:20|cuda-0  |  JIT info:
 ptxas info    : 0 bytes gmem, 96 bytes cmem[3]
ptxas info    : Compiling entry function '_Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb' for 'sm_61'
ptxas info    : Function properties for _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 64 registers, 16384 bytes smem, 385 bytes cmem[0], 4 bytes cmem[2]
 cu  23:32:20|cuda-0  |  JIT err:

 cu  23:32:20|cuda-0  |  Mangled name: _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
 cu  23:32:20|cuda-0  |  done compiling
Trial 1...
8448739
Trial 2...
8915570
Trial 3...
8811563
Trial 4...
8813326
Trial 5...
8816855
Trial 6...
8822153
Trial 7...
8827458
Trial 8...
8831884
Trial 9...
8836314
Trial 10...
8839861
min/mean/max: 8448739/8796372/8915570 H/s
inner mean: 8824926 H/s

Nevertheless, optimizations aside, I'd like a comment on the effectiveness of this change with regard to the issue raised by Kikx. Thank you.

@blondfrogs left a comment:

I don't think the digest selection is correct.

@EtherCoreAdmin

Why don't we fix this also? #49

@AndreaLanfranchi (Contributor, Author)

@ethcoreorg because it's outside the scope of this PR.
Feel free to propose a new PR.

@AndreaLanfranchi (Contributor, Author)

@blondfrogs you're absolutely right.
Amending

Commit: "LLVM is smart enough to figure out which elements are zeroes and which are copied."
@AndreaLanfranchi (Contributor, Author)

@solardiz

    return keccak_f800_progpow(header, seed, digest);

to:

    return keccak_f800_progpow(header, nonce, digest);

This IMHO doesn't solve the issue and keeps us in the 0.9.3 situation: again, we can pick an arbitrary seed (64-bit) value which produces a digest, then brute-force the last Keccak round to find a header (+ extraNonce + nonce) producing a result matching the target and with a seed output identical to the one chosen. We're still brute-forcing Keccak in search of a 64-bit value.

With the carry-over from the initial Keccak round instead of the header (pretty much like in ethash) we force a linear search of header+nonce which must reproduce the same Keccak squeeze on a 256-bit basis.

@AndreaLanfranchi (Contributor, Author)

I'd personally add more ... in ethash the first Keccak absorb is constrained (the 10th word is 0x00000001 and the 17th word is 0x80000000) and so is the second (the 24th word is 0x00000001 and the 37th word is 0x80000000).

Constraining the Keccak absorb (initial and last) further reduces (practically voids) the likelihood of a successful brute-force attempt.

@solardiz (Contributor)

This IMHO doesn't solve the issue and keeps us in the 0.9.3 situation: again, we can pick an arbitrary seed (64-bit) value which produces a digest, then brute-force the last Keccak round to find a header (+ extraNonce + nonce) producing a result matching the target and with a seed output identical to the one chosen. We're still brute-forcing Keccak in search of a 64-bit value.

This is what I described as "variation of the attack that would still work would involve re-doing steps 3 and 4, then 5 again, etc. until both conditions are met at once" and explained why it's "clearly impractical". That said, I agree we'd better defeat even this possibility as well, which we'd do by increasing seed size as part of addressing #54.

With the carry-over from the initial Keccak round instead of the header (pretty much like in ethash) we force a linear search of header+nonce which must reproduce the same Keccak squeeze on a 256-bit basis.

I don't understand this. To me, your changes are equivalent to the trivial change I proposed. The variation of attack described above is possible for both. The only difference appears to be 2 vs. 1 Keccak computations per trial, which is an insignificant difference for our decision-making.

@AndreaLanfranchi (Contributor, Author) commented Mar 17, 2020

The only difference appears to be 2 vs. 1 Keccak computations per trial, which is an insignificant difference for our decision-making.

... in search of an identical 256-bit value instead of one of 64 bits.

@solardiz (Contributor)

... in search of an identical 256-bit value instead of one of 64 bits.

I don't see how your proposal achieves that. Given your own description:

- digest == F(header_hash, nonce)
- seed   == F(digest)
- mix    == F(seed)          // where memory accesses occur
- final  == F(digest, mix)

Here, let's assume seed and thus mix are fixed. To make final meet target, we'd be altering header_hash and nonce, and we'd also require that seed stays the same. This doesn't imply that digest stays the same - it will freely keep changing.
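This point can be demonstrated with a toy model (hypothetical names; SHA-256 standing in for Keccak, a 1-byte seed for illustration): distinct nonces can map to the same seed while producing different digests, so fixing the seed does not pin the final-round input to one value:

```python
import hashlib

SEED_BYTES = 1  # toy width for illustration

def H(*parts: bytes) -> bytes:
    """Stand-in for a Keccak round (SHA-256, illustration only)."""
    h = hashlib.sha256()
    for p in parts:
        h.update(p)
    return h.digest()

def digests_for_seed(seed: bytes, count: int):
    """Collect digests from distinct nonces that all map to the same seed."""
    out = []
    nonce = 0
    while len(out) < count:
        digest = H(b"\x00" * 32, nonce.to_bytes(8, "little"))
        if digest[:SEED_BYTES] == seed:
            out.append(digest)
        nonce += 1
    return out

d1, d2 = digests_for_seed(b"\x42", 2)
assert d1[:SEED_BYTES] == d2[:SEED_BYTES]  # same seed, hence same mix
assert d1 != d2                            # but the digest keeps changing
```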

@AndreaLanfranchi (Contributor, Author)

I don't see how your proposal achieves that. Given your own description:

Yes ... sorry ... please focus on code ... description is messed up.

@solardiz (Contributor)

please focus on code ... description is messed up.

I see no relevant differences in the code, and that part of the description actually looks good enough for the present discussion, despite the simplifications. The part of the description that was relevantly wrong and unusable for the discussion was "extend the dependency from header_hash to mix too".

@AndreaLanfranchi (Contributor, Author) commented Apr 30, 2020

NVIDIA profiling of this PR (0.9.4): [profiler screenshot]

Previous release (0.9.3): [profiler screenshot]

@ifdefelse (Owner)

This looks good to us. It would be good to get profiling and performance results for an HBM2 card and a GDDR5X card.

@lookfirst

0.88% L2 is ok?

@gcolvin commented Aug 19, 2020

It's been a long time. Is there a reason this cannot be merged, or at least referenced as correct? I'm wanting to get a nearly-final EIP-1057 to the other core devs soon.

@AndreaLanfranchi @ifdefelse @OhGodAGirl @lookfirst @solardiz @blondfrogs

@ifdefelse merged commit d035ae5 into ifdefelse:master on Aug 21, 2020
@gcolvin commented Aug 21, 2020

Thanks @ifdefelse !
