Single-chip ASIC design #31
I invite anyone with a hardware design background to correct me if any of my assumptions are grossly off-base. @timolson @AirSquirrels Also it would be interesting to have an estimate for the second HBM-based design. |
It seems you are using an instruction-based model for your calculations. All miner manufacturers we know use a pipeline-based model to compute AES or hashes. The latency calculation looks largely correct, but the throughput calculation is totally off. This is only a rough model of a single-die ASIC; an actual ASIC manufacturer would not be limited by it.
Hope this helps. We are a fairly open Shenzhen-based ASIC team, feel free to ask more questions in our Telegram t.me/LinzhiCorp |
How can they pipeline RandomX if it has scratchpad accesses and branches at random places in the code? |
Thank you very much for reviewing and commenting on our comment. Maybe there is a misunderstanding of what a pipeline is. A CPU has pipeline stages during execution. ALU or cache operations only occupy certain stages; we don't need memory access at every stage. Please be aware that we are unable to design a chip through a series of GitHub issue comments. Tevador asked for feedback, and we provided it for part of his question (someone else can do HBM). |
Oh, so you're talking about more CPU-like pipeline. Can we assume that RandomX achieves its target here? Making ASIC look and perform like a CPU? |
@Sonia-Chen Thanks for your comments.
I find that hard to believe as RandomX is a lot more complex than SHA256. How do you estimate the design cost just from reading the specification?
If you mean multi-chip modules, those are not technically single chips.
OK, so it's basically impossible to make a simple estimate like this.
Do you mean an out-of-order design? Because instructions have dependencies, so you cannot reach 2 IPC without reordering.
Not possible if the address is not known in advance. If it's known in advance, you need to reorder the load, which is what out-of-order CPUs do to hide the latency.
True, CPUs also prefetch it, so it's a mistake in the calculation. Cache load is designed to be prefetched.
Even a 64b x 64b -> 128b multiplication? 1 cycle at what frequency?
So why do most existing ASICs run at sub-1 GHz frequencies? I'm assuming it's not because they can't, but because it's more power efficient to run slowly.
You need some load/store coherency control if you have a retire stage. You need a retire stage if you do speculative execution. You need speculative execution if you want a pipelined design because there are branches in the code. If you don't use a pipelined design, you will not run at even 1 IPC. |
I've reviewed the RandomX spec along with 7400, who's a top ASIC designer here in Silicon Valley. As Sonia points out, power requirements are impossible to guess until the chip is close to finished, but from a macro point of view, we agree that IF you can utilize enough of the CPU, then the two big vendors, Intel and AMD, have a lot of IP, layout work, and scale that cannot be matched by other competitors. Philosophically, I think that's a bad thing, but we'll stick to technicals in this thread.

**SquareHash**

There seems to be a misconception that the latency of this hash matters whatsoever. Latencies can be pipelined, and an ASIC can queue up a bunch of SquareHash requests for the next nonces. We really don't care about latencies in an ASIC, as long as we're able to pipeline the computation to gain throughput. However, this doesn't mean that a single-chip ASIC would use the 256MB cache to dynamically generate the data table, but that's because of the power, not the latency: the energy required to recompute SquareHash makes it infeasible on an H/J metric. The latency of SquareHash is only a minor inconvenience. SquareHash still achieves its goal, but accidentally, not for the reason you think.

**Branching**

This is the crux of the matter. RandomX basically represents the compute datapath of a CPU, but in modern CPUs this is the smallest part of the chip! If the goal is to utilize the CPU fully, you must find a way to exercise all the fancy instruction reordering and speculative execution on modern CPUs. Of course the paradox is that bad branches cost energy, but if you don't branch, then you're wasting silicon, a lot of it, and advanced branch prediction IP, a lot of it. The recent settings of "no branching" and "1/128 branching" are insufficient. It's relatively easy for an ASIC to always assume no-branch and then clean up the fault 1/128 of the time. You can handle it with a stall, and you can pipeline another nonce context while you're waiting. Branch prediction is (can be) hard and discourages ASIC development, while optimizing a datapath of MULs and XORs is much easier. You need to emphasize branch prediction. We're not sure there's really a solution to the wasted-energy bad-branch problem, and it might be a critical paradox. CPUs are designed to be as fast as possible first and to save energy second. PoW cares about energy first and speed second. No way around that. Perhaps Sonia was reviewing the version with no branching, which is when threading becomes very useful, like a GPU. Without branching, RandomX becomes a very simple compute datapath, much simpler than any CPU, which is maybe the design she's discussing.

**Init / Final AES**

Of course these can be crushingly faster in an ASIC than a CPU, but at about the same energy. Just make sure the initialization and finalization are short compared to the inner loop. I don't know your current timings.

**Compilation**

There will be compiler tricks. The current generator doesn't do any optimization AFAIK, but for one thing, instruction reordering and latency resolution can be calculated up-front for static accesses. This reduces the usefulness of the CPU's complex runtime instruction reordering, and can make simple non-reordering designs competitive.

**Overall**

RandomX has a much better chance of success compared to RandomJS, because it more closely approximates the target hardware. However, whether you can consistently utilize enough of the CPU to prevent ASICs from being 2x more efficient than current CPUs is an open question. On first impression, 7400 thought RandomX would work, but after we did our review and thought about some possible tricks, now he's more optimistic that a RandomX ASIC could be built. Modern CPUs are huge and complex, and it's difficult to claim that enough can't be removed from them to get 2x efficiency in an ASIC.
There's enough encouragement here that ASIC makers will certainly take RandomX through the design stage, and Sonia's "can-do" attitude has merit. |
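The "latencies can be pipelined" argument above can be sketched numerically. This is a hypothetical model, not from the thread: a fully pipelined unit that accepts one new independent request (e.g. a SquareHash call for a different nonce) per cycle, with an arbitrary example latency.

```python
# Sketch: throughput of a fully pipelined fixed-latency unit.
# The latency value (42 cycles) is an arbitrary example, not a RandomX figure.

def pipelined_cycles(n_requests: int, latency: int) -> int:
    """One new request is issued per cycle; each result appears `latency`
    cycles after issue. Total time = latency + n_requests - 1 cycles."""
    return latency + n_requests - 1

# 10,000 queued requests finish in 10,041 cycles: effectively one result
# per cycle, regardless of the latency of any single request.
print(pipelined_cycles(10_000, 42))  # 10041
```

This is why, as long as there are enough independent nonces to queue, per-request latency barely affects an ASIC's hash rate; only power per request does.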
@tevador is working on a SquareHash replacement that will use even more calculations/power; we have already realized this.
Easier said than done. Predictable branch patterns need to be generated somehow, and an ASIC can simply use this generation algorithm to predict 100% of branches. Fully random branches can't be predicted, and a CPU will just waste power trying, unlike an ASIC. |
The point we'd like to emphasize is that utilizing the complexity of branch predictors is more important than the small amount of energy wasted on bad branches. Choose your poison. |
Good Stuff |
We discussed it before. The best we can do is to have branches at least in some form which is the easiest for the CPU (predicted not taken), with little energy wasted on mispredictions. An ASIC will most likely use a simple in-order pipeline with no need for branch prediction at all. A branch predictor uses a lot of chip area, but not that much power compared to the rest of the CPU. |
It's a negligible amount of time: 262144 AES rounds in total for init + final AES, and there are 4 parallel AES chains, so it can be pipelined on a CPU (1 new AES round each cycle) and fits in 262144 CPU cycles: 0.06-0.07 ms, about 2% of the total hash time.
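A quick arithmetic check of the figures above. This is a sketch: the 4.0 GHz clock and ~300 H/s per-core hash rate are assumed values chosen to match the stated 0.06-0.07 ms and ~2% numbers; the 262144-cycle count comes from the comment (1 new AES round per cycle).

```python
# Checking the AES init/final overhead arithmetic.
AES_CYCLES = 262_144      # init + final AES rounds, fully pipelined
CLOCK_HZ = 4.0e9          # assumed CPU clock frequency
HASH_TIME_S = 1 / 300     # assumed per-core hash time (~3.3 ms)

aes_time_ms = AES_CYCLES / CLOCK_HZ * 1e3         # time spent on AES
share = (AES_CYCLES / CLOCK_HZ) / HASH_TIME_S     # fraction of hash time

print(f"{aes_time_ms:.3f} ms")   # ~0.066 ms
print(f"{share:.1%}")            # ~2.0%
```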
Not sure how to mitigate it. @tevador - what % of hash time does code generation take currently? Maybe we can switch to 16 programs of 1024 iterations each to increase the cost of the compilation step? Or, another approach: add some more optimization logic to the CPU code generator if it can increase CPU performance or save power. I'm sure some single-pass optimizations are possible here, something like moving loads from the scratchpad a few instructions up to hide latency. |
@timolson Thanks a lot for the review!

**SquareHash**
Latencies matter, especially for the single-chip design. Since the number of scratchpads you can store on-chip is limited to roughly 250, the pipelining options are limited.
I'm still not entirely convinced about this, so I'm designing a replacement for SquareHash which has 3x more multiplications and 4-5x more ALU operations overall (with the same latency as SquareHash on a superscalar CPU). This should kill any single-chip design.

**Branching**
Those are already utilized:
The importance of branches was pointed out to us by reviewers on one hardware forum, and that prompted us to introduce the "1/128 branches". The difference between "no branches" and "1/128 branches" is quite significant if you think about it. There are basically 3 ways an ASIC might handle those branches:
Out of the 3 options above, I think the most likely one to be selected for a hypothetical ASIC design is the first option, which doesn't actually need any branch prediction anyway.

**Init / Final AES**

All the fixed overhead, including the AES generator/finalizer, Blake2b and compilation, makes up roughly 5-10% of the hash time. That's the best we can do, because the hash time cannot be increased any further due to verification time concerns.

**Compilation**
I'm aware that some optimizations might be possible, but compilation/optimization must be performed online, so you might not have the luxury of doing expensive optimizations. Also, branches will complicate your options for reordering instructions. There is no easy way to prevent optimizations, except perhaps increasing the cost of compilation as suggested by @SChernykh, but this must be done carefully so as not to hurt CPUs too much.
I don't have the latest numbers, but it's around 1% for the x86 compiler.
Moving loads will not increase the performance of out-of-order CPUs. The only optimizations that might be possible are eliminations of some instructions, such as two XORs with the same source value or a double NEG on the same register. |
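As an illustration, the pair eliminations mentioned here can be done in a single pass over the instruction stream. The `(op, dst, src)` tuple form below is made up for this sketch and is not RandomX's actual instruction encoding.

```python
# Sketch of single-pass peephole elimination: two adjacent XORs of a
# register with the same source cancel out (x ^= a; x ^= a is a no-op),
# as do two adjacent NEGs of the same register (x = -(-x)).

def eliminate_pairs(prog):
    out = []
    for ins in prog:
        if out:
            prev = out[-1]
            op, dst, src = ins
            if op == "XOR" and (op, dst, src) == prev:
                out.pop()          # XOR pair cancels
                continue
            if op == "NEG" and prev == ("NEG", dst, None):
                out.pop()          # NEG pair cancels
                continue
        out.append(ins)
    return out

prog = [("XOR", "r0", "r1"), ("XOR", "r0", "r1"),
        ("NEG", "r2", None), ("NEG", "r2", None),
        ("ADD", "r3", "r4")]
print(eliminate_pairs(prog))  # [('ADD', 'r3', 'r4')]
```

Note that only adjacent redundant pairs are removed here; if another instruction writes the register in between, the pair must be kept, which is why such wins are rare in random code.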
@timolson @tevador
|
Short answer: No. At the moment, RandomX doesn't need any complicated branch predictor (all branches are predicted not taken by default). The problem is that in order to use random code for consensus, the rules must be set out in advance. That includes the rules for branches. So there are basically 2 options:
In addition to this, we think that an ASIC would actually not predict branches at all and would simply stall the pipeline and work on another hash. See my previous comment for why this is the most likely option. TL;DR: Branch prediction cannot be forced. |
The updated design documentation contains new details about RandomX. Some points that are closely related to the above discussion:
|
I want to elaborate on this for readers who may be less tuned in to the Specification and Design documents and/or lack the domain knowledge, and also to check whether I have the correct understanding of the situation. As you noted, and as also noted by @ifdefelse, a general-purpose CPU is designed to optimize worst-case throughput and latency, not to fully optimize the products of throughput with energy efficiency and with wafer cost, because the main cost for a productivity device is time/responsiveness, not energy or sunk capital cost. This includes not exceeding time limits for I/O (so as to not repeat requests) and minimizing the limiting factor in Amdahl's law, i.e. the part of the program which can't be parallelized. A CPU will encounter a wide diversity of algorithms (e.g. various manifestations of nested loops, recursion, and memory access patterns) and nondeterminism driven by I/O. To the extent that mobile productivity and/or server software can be designed to utilize more parallelized computation, CPUs will presumably become optimized for RandomX, because power efficiency is a significant factor for those devices; perhaps more so if low-power e-ink displays become commonplace or ubiquitous. A branch predictor (and speculative execution) can optimize throughput and latency for the numerous facets of varied real-world programs, which do not exist in RandomX. I wrote:
The nondeterminism that exists in RandomX after the randomization and compilation stage is the dynamic interaction of the computation (e.g. the registers, execution ports and memory). Yet to avoid the halting problem (i.e. non-termination due to infinite loops or recursion), the nondeterminism for branching due to those factors is necessarily reduced in the design, and as noted (as linked) probably wouldn't be predictable anyway. Thus wasting energy on speculative branch misprediction would presumably be a reduction in ASIC resistance, especially so if, per my aforementioned point, mobile and/or server CPUs eventually become more oriented toward computational parallelism and power efficiency. Envision a trending computing paradigm in which computation (verifiable with SNARKs or STARKs) is offloaded from the mobile device to servers. Branching was retained to reduce statically compiled optimizations, and presumably also to force an ASIC to have the speculative branching circuitry mentioned by @tevador on Apr 2, 2019. I wrote the above before reading @tevador's Apr 4, 2019 post, which confirms my understanding.
Agreed, the memory-system latency bound does apply for limiting throughput. But RandomX ostensibly doesn't[1] maximize the latency of the off-chip, commodity SDRAM that the ASIC can employ. Potentially worse, an ASIC design might employ more memory channels so as to increase the number of banks (and, for DDR4, also the bank groups) which form the latency bound. However, there's a limitation that 256MiB commodity DDR4 chips are no longer available, and who knows how long 512MiB chips will be available. Increasing the number of chips beyond what is needed (currently four for the 2GiB Dataset) would incur additional power consumption of ~1.5W per 4GiB. At high enough volumes (e.g. if RandomX were driving the world's reserve currency), commodity SDRAM foundries could presumably be incentivized to manufacture the smaller chips? I believe @timolson has mentioned a limitation on the number of pins, but I wonder if even that limit can be stretched by multiplexing the data bus. Again, I'm not a hardware design guru, so I might be missing something? Additionally, by employing sufficient threads at very low energy cost (by moving L3 off-chip with a very-low-latency external memory design, and employing an optimal mix of low-power, slightly-higher-latency L1/L2), I'm thinking it should be possible to optimize compute throughput relative to power consumption and wafer area cost.
[…]
Threading to mask latencies may be a general solution to all stalls, and may also enable the aforementioned optimization of the L1/L2 tradeoffs? This might also obviate some of the claimed benefit of including branching, since masking latency allows a more holistic optimization of the design.
Perhaps this point is invalidated by the aforementioned potential compilation to VLIW? VLIW-encoded static reordering and ILP would require multiple variants of the instructions to handle all branch targets. My interpretation of @Sonia-Chen's post is that there are so many design variables in the ASIC design space that a holistic optimization is possible which isn't available to a general-purpose CPU.
Clearly he/she wasn't carefully studying the design, or didn't convey their intended meaning well. If she had referred to the ease of gaining a significant first-mover advantage, then one might interpret the 'good Bitcoin miner' to mean there's more fertile ground and lower-hanging design fruit for a RandomX ASIC as compared to extant Bitcoin ASICs?
For combating DDoS spam of erroneous proof-of-work solutions (i.e. every full node must verify them all to know they are invalid), I had the thought of requiring a forfeitable bond in order to send proof-of-work solutions on the network. The bond would be burned when a transgression is observed. Btw, this isn't the first time I have attempted to make a useful suggestion for Monero, but surely someone else has already proposed this? One disadvantage, of course, is that one needs to have some XMR before mining. [1] And probably can't, since it varies by personal computer; one must thus choose a reasonable maximum, commonly shared SDRAM memory-system latency per core. |
I came across this discussion on the Grin forum: https://www.grin-forum.org/t/obelisk-grn1-chip-details/4571
The conversation contains a lot of interesting information. The main takeaway is that ASIC manufacturers prefer single-chip designs, even if that may not seem like the best approach.
For RandomX, I see two possible options for what an efficient ASIC might look like:
I did a ballpark estimate of the performance and power consumption of the first option (single-die RandomX ASIC). These are my assumptions, mainly based on the information from the Grin discussion:
I think some of the above assumptions are rather optimistic, but I'm interested in the best possible scenario here.
First, let's estimate the time to calculate 1 hash. This requires:
One program iteration consists of:
This gives 311 cycles for the program execution and 704 cycles to calculate the dataset item. Overall, 1015 cycles per iteration, which is about 1 μs.
Total time per hash is about 16.7 ms, which means about 60 H/s per scratchpad. We can see that 2/3 of the time the core is waiting for the dataset item to be calculated, so we can have 1 core per 3 scratchpads to keep all cores busy 100% of the time.
To max-out our 800 mm2 die, we will have:
Final performance of this chip is 249 × 60 ≈ 15 kH/s.
Power consumption estimate is 754 × 0.1 + 50 × 0.4 ≈ 95 W.
Power efficiency is about 150-160 H/J. While this is about 3x better than a CPU, I would consider it the upper bound of ASIC efficiency, given that the above assumptions may not be feasible in practice.
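The arithmetic of this estimate can be reproduced directly. This is a sketch: the ~1 GHz clock is inferred from "1015 cycles ≈ 1 μs", the 8 × 2048 iterations-per-hash figure is from the RandomX spec, and the unit counts (249, 754, 50) and per-unit watts are taken verbatim from the figures above, whose exact meaning depends on the assumed die breakdown.

```python
# Reproducing the ballpark single-die ASIC estimate.
EXEC_CYCLES = 311          # program execution per iteration (from above)
DATASET_CYCLES = 704       # dataset item calculation per iteration
CLOCK_HZ = 1.0e9           # ~1 us per 1015-cycle iteration implies ~1 GHz
ITERS_PER_HASH = 8 * 2048  # 8 programs x 2048 iterations per hash

cycles_per_iter = EXEC_CYCLES + DATASET_CYCLES             # 1015
hash_time_s = cycles_per_iter * ITERS_PER_HASH / CLOCK_HZ  # ~16.6 ms
hs_per_scratchpad = 1 / hash_time_s                        # ~60 H/s

hashrate = 249 * 60                 # ~15 kH/s total
power_w = 754 * 0.1 + 50 * 0.4      # ~95 W total
print(hashrate / power_w)           # ~157 H/J, i.e. the 150-160 H/J range
```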