Possible ASIC design #11
The 66MiB of SRAM is probably a killer, but I suspect it can be avoided by doing async instructions and thousands or hundreds of thousands of threads. BTW credit for this idea should go to: https://github.com/aggregate/MOG/ |
@cjdelisle 100 000 parallel threads would require ~24 GiB of memory just for scratchpads. You could also compromise by storing only the first 16 KiB of scratchpad in SRAM, which would decrease the required amount of on-chip memory to ~6 MiB, which is probably more realistic. |
The idea is to use threads to replace the latency bottleneck with a memory bandwidth bottleneck. This is basically the ring processor idea, but I think the von Neumann bottleneck eventually degrades to a sorting problem where you need to sort the opcodes and operands to be next to each other and then gather the completed operand/opcode bundles into a queue which is then fed to the processor, creating more opcodes which need more operands sorted... |
66 MiB of SRAM is nothing unusual for an ASIC; the speedup from SRAM will outweigh the bigger chip area by a large margin. Scrypt ASICs had 144 MiB of SRAM per core, IIRC. |
Anyway, HBM memory + memory controller + 256 CPU-like cores with 66 MiB SRAM on chip sounds very similar to a GPU. It'll still be limited by computing power, not memory bandwidth. Power efficiency (H/s/W) will be maybe 2-3 times better than a CPU. |
@SChernykh A GPU would be compute-bound because it cannot run RandomX efficiently. Most 64-bit instructions have to be emulated on GPUs and double precision runs 16 times slower than single precision. An ASIC, on the other hand, can execute 1 instruction per cycle in the ideal case. @cjdelisle RandomX was not designed to be latency-bound. That's why the dataset is accessed sequentially. Only scratchpad access is latency-bound, but it can fit into SRAM. Also, the instructions executed in RandomX are not independent. There will be random dependency chains, typically due to instructions using the same register (very rarely due to using the same scratchpad word). This would slightly complicate the design. I have contacted @timolson to help us assess the viability and possible performance. |
On first glance it looks much improved from RandomJS, being a much closer model of the underlying hardware. However, if I understand correctly, you can run many programs in parallel on chip, assuming enough scratchpad space. This is similar to CryptoNight and favors ASIC development. The limiting factor will probably be this scratchpad area, not the logic, and as you pointed out, 66 MiB for a bunch of cores is no problem, being only about 41 mm2 in a 16nm process. If we conservatively assume logic doubles the area, then an 80 mm2 chip might cost around $10-15 packaged. You're gonna crush CPUs with this.

One way to prevent this parallelism is to make the large DRAM table read/write. Then you need a memory controller and DRAM set for every core, which is closer to the setup of consumer hardware. Being able to isolate the running program to just the chip die makes for a nice, efficient ASIC. Where ASICs fall down is when they have to wait for external IO. Once you hit the logic board, it's like a Ferrari in rush hour traffic.

Another option is to somehow force nonces to be tried in serial instead of parallel. Then an ASIC can't beat CPUs by merely adding cores. An ASIC design could still be cost-efficient by eliminating the excess cores on a CPU and all the gates for cache control. Or maybe there's a way to limit parallelism to exactly 8 or some chosen number of cores. This would make CPUs closer to the optimal configuration for the problem.

I didn't look closely enough, but wanted to point out potential "serialization" attacks. If the program generator can (quickly) do a dependency analysis and sort the instructions such that the scratchpad is read/written in sequential order, then you can replace SRAM with DRAM. Also, parallelism may be discovered and exploited in the programs if there's not enough register overlap. You might consider using fewer registers that are updated more frequently to address this. Again, I'm not sure if it's an actual problem with your design, because I didn't look closely enough, but it should be mentioned.

A couple other miscellaneous comments:
I'm currently writing a GPU miner for Grin, and since the genesis block is January 15th, I don't have much time to look deeper until later in January or February, sorry. I can quickly address specific concerns if you want to point something out, or if I overlooked something critical in my very brief review. |
One more comment: An ASIC can have the "correct ratio" of logic handlers based on the frequency and latency of various instructions used in RandomX, which may be different from a CPU's. As a simple example, let's assume only two instructions, int multiply and int add, randomly selected 50/50. If multiply takes 4 cycles and add takes 1, then an ASIC will have 4 mul units for every 1 add unit, whereas a CPU gets one of each. That may not be strictly true, but you should tune your probabilities such that probability_of_instruction × latency_in_cpu is the same for all instructions. In the above case, you want adds to be 4x more frequent than multiplies (assuming the CPU has one of each). |
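To make the tuning rule concrete, here is a minimal sketch, assuming the hypothetical 1-cycle add and 4-cycle multiply latencies from the example above (not actual RandomX parameters). Picking probabilities inversely proportional to latency keeps a CPU with one unit of each kind equally busy on both units:

```python
# Illustrative latencies only; not RandomX constants.
cpu_latency = {"IADD": 1, "IMUL": 4}  # cycles

total = sum(1.0 / lat for lat in cpu_latency.values())
probability = {op: (1.0 / lat) / total for op, lat in cpu_latency.items()}

for op, p in probability.items():
    # p * latency is the fraction of time the corresponding unit is busy
    print(op, round(p, 2), "busy share:", round(p * cpu_latency[op], 2))
# -> IADD 0.8 busy share: 0.8
# -> IMUL 0.2 busy share: 0.8   (adds end up 4x more frequent than multiplies)
```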
Aaaaand one more thing... Although the 66 MiB chip you proposed would be $10-15, it's only gonna clock around 1 GHz. Intel and AMD definitely do have lots of IP and scale to do efficient layouts and squeeze the maximum speeds out of their designs. If you can fix the multi-core problem of running lots of programs in parallel, then a startup ASIC maker, even Bitmain, will not get close to the CPU performance of the incumbents. But you probably need to get within a factor of 3 or so. |
@timolson Thanks for the review.
Yes and assuming you can read from the dataset quickly enough. The dataset is too big to be stored on-chip, so external memory is unavoidable.
That is difficult to do while also allowing hash verification for light clients who might not have enough RAM. But perhaps a 1 GB read/write buffer would be possible. Even phones have at least 2 GB nowadays.
I'm not aware of any way to achieve this. I don't think it's possible without some central authority handing out nonces. Currently, parallelism is limited only by DRAM bandwidth.
I don't think this is possible. Scratchpad read/write addresses are calculated from register values, which depend on previous results. Dataset is already being read sequentially and the reads cannot be reordered. There are only 8 registers for address generation and register values change every instruction, so the sequence of independent instructions will be very short, perhaps 2-3 instructions. Regarding GPU performance, I already suspected most of what you wrote. |
Per thread? Or one buffer for all threads? How are you going to synchronize them? |
Make sure it's tuned such that typical DDR4 SODIMM speeds align with 8-core parallelism and I'd say you're getting close. However, GDDR5 and HBM crush DDR4 for bandwidth-per-dollar, so if you pin the PoW to memory bandwidth, CPUs will lose out. One way to address that dilemma may be to use random DRAM access instead of sequential. Some of GDDR's improved speed comes from using wider rows and some comes from a wider bus, but DDR4 is competitive for random access patterns. If you're only reading a single word at a time, it doesn't matter that GDDR grabs 32 words while DDR only gets 8, or whatever. They both have random access latencies of 40-45 ns. |
The best way would be to read random 64 bytes at a time. DDR4/CPU cache is optimized for this burst read size. |
Longer reads will favor GDDR & HBM because they have wider busses and also can push more bits-per-pin-per-cycle. I would suggest something smaller than 64 bytes, which would need 512 pin-cycles in DDR4 and only 128 pin-cycles in GDDR5. This is wider than consumer SODIMMs. 16 bytes is probably safe. |
A 64-byte burst is optimal for a typical 64-bit memory channel; DDR4 is designed for 8-bit bursts per pin. While a GPU can grab that in a single cycle with a 512-bit bus, it won't be able to burst any of its accesses. |
I think the ideal case would be 64-byte random accesses that also saturate dual-channel DDR4 bandwidth on an 8-core CPU. |
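A rough sanity check of the burst arithmetic in this exchange, assuming a standard 64-bit DDR4 channel with burst length 8 and dual-channel DDR4-2400 (illustrative figures, not a benchmark):

```python
# One BL8 burst on a 64-bit channel delivers exactly 64 bytes.
channel_width_bits = 64
burst_length = 8                                   # DDR4 BL8
bytes_per_burst = channel_width_bits * burst_length // 8
print(bytes_per_burst, "bytes per burst")          # 64

# Theoretical peak of dual-channel DDR4-2400.
transfers_per_s = 2400e6
channels = 2
peak_bw = channels * (channel_width_bits // 8) * transfers_per_s
print(round(peak_bw / 2**30, 1), "GiB/s peak")     # ~35.8
```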
@timolson If we make random memory accesses (latency-bound), the CPU cores will be basically idle and you can make an ASIC with 1% of power consumption of a CPU. |
One buffer per thread. So 8 GiB of memory for 8 parallel threads. |
It depends on the frequency of reads also, right? If you're reading often, then yes. But what about semi-infrequent random DRAM reads? You can tune the number of computations-per-DRAM read to match what a CPU core can do. In this way, it is not DRAM-bound by either latency or bandwidth. The idea is similar to ProgPoW in this regard, where they tuned the number of random math ops per memory access to match GPU capabilities. |
Fair enough. Currently, it takes ~90 W of compute power (14 nm Ryzen CPU with 16 threads) to achieve ~16 GiB/s of DRAM read speed, which is about half of what dual channel DDR4 can do. If you use GDDR5/HBM, you can easily read 20x faster, but how are you going to match the compute speed? Even if you improve efficiency by a factor of 2 over a CPU, that's 900 W of compute power. RandomX uses primitive operations (add, sub, mul, div, floating point), so I don't think you can make an ASIC much more power efficient than that. At most you can cut out some of the CPU parts like the TLB, L3 cache, memory controller and IO. |
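Restating the arithmetic of the comment above as a sketch (all inputs are the figures quoted there; the 2x ASIC efficiency edge is the stated assumption):

```python
cpu_compute_w = 90        # ~90 W of compute for a 16-thread 14 nm Ryzen
cpu_read_gib_s = 16       # DRAM read rate it sustains while hashing
bandwidth_factor = 20     # GDDR5/HBM read speed relative to those 16 GiB/s
asic_efficiency_gain = 2  # assumed 2x compute-efficiency edge for the ASIC

asic_read_gib_s = cpu_read_gib_s * bandwidth_factor               # 320 GiB/s
asic_compute_w = cpu_compute_w * bandwidth_factor / asic_efficiency_gain
print(asic_read_gib_s, "GiB/s needs about", asic_compute_w, "W of compute")
# -> 320 GiB/s needs about 900.0 W of compute
```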
This could be a significant risk if it is feasible to run hundreds or thousands of threads with DRAM, because one need not do any dependency analysis, just schedule each thread for one instruction only. Creating large datasets is a good solution, but it is not really possible to require the dataset to be used by one thread only, because in order for a large dataset to be worth creating and storing in the first place, it needs to be reusable. Serial re-usability is going to require that your verifier performs the whole series of operations, which is probably a non-starter, so you end up having to allow at least a few hundred parallel executions to use the same buffer... |
BTW, random reads already happen in RandomX. There is (on average) one random read per 2¹³ (8192) sequential reads. |
I made this little drawing of what I think a high-latency, high-parallelism processor could look like: https://pixelfed.social/p/cjd/24845 AFAICT there is no way to de-parallelize the mining beyond requiring memory for the verification process. If you require a different 50MB dataset per nonce, then the verifier needs 50MB and the solver using this architecture can run (AVAILABLE_DRAM / 50MB) parallel threads. The method of requiring a precomputed dataset which is reusable for more than one nonce falls down because either the solver parallelizes all allowable permutations for one precomputed dataset or (if the number of allowed permutations is too low) he doesn't bother to precompute it at all and simply tests the same way as the verifier. |
I improved the ASIC design estimate based on comments from @timolson.

Let's start with memory. We need 4 GiB of GDDR5 for maximum bandwidth. At least 4 memory chips are required since the capacity is 8 Gb per chip. Each chip has a 32-bit interface, so our maximum memory bandwidth will be 4 * 32 * 8 Gb/s = 128 GiB/s, assuming 2000 MHz memory. The memory can support up to 128 GiB / 4 MiB = 32 768 programs per second.

Now let's try to make a chip that has enough compute capability to actually push out 32 thousand programs per second. I started with the AMD Zen die, which has an area of 7 mm2. If we remove all cache, we have around 4 mm2. Let's say we can optimize the design down to 2 mm2 per core. We know that the Ryzen core can do ~500 programs per second at 3350 MHz. Since our ASIC will run only at 1 GHz, our optimized core can do only ~150 programs per second. We need ~218 such cores to saturate the memory bus. This amounts to about ~436 mm2. Additionally, we will need ~40 mm2 of SRAM and a GDDR5 memory controller. The DDR4 memory controller in Ryzen is ~15 mm2, so let's say we can make a minimal controller with just 5 mm2. In total, we have a huge ~480 mm2 die, which is about the same size as a Vega 64 GPU.

Price estimate: Total per mining board: ~$280. This doesn't include any R&D or IP licensing costs. We can safely assume a power consumption of around 300 W per board, same as a Vega 64 at full load.

Hashes per Joule:
So about 2.5 times more efficient. And this is the best case scenario. |
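The memory and die-area arithmetic of the estimate above, restated as a sketch (all inputs are the figures quoted in that comment):

```python
# Memory side: 4 GDDR5 chips, 32-bit interface each, 8 Gb/s per pin.
chips, bus_bits, gbit_per_pin = 4, 32, 8
bandwidth_gb_s = chips * bus_bits * gbit_per_pin / 8       # 128
programs_per_s = bandwidth_gb_s * 1024 / 4                 # 4 MiB/program -> 32768

# Compute side: optimized Zen-like cores at 1 GHz, ~150 programs/s each.
cores = programs_per_s / 150                               # ~218
die_mm2 = cores * 2 + 40 + 5                               # 2 mm2/core + SRAM + controller
print(int(programs_per_s), "programs/s,", round(cores), "cores,", round(die_mm2), "mm2")
# -> 32768 programs/s, 218 cores, 482 mm2 (the comment rounds this to ~480)
```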
Zen+ should be about 10% more efficient than Zen. Has anyone tested a Ryzen 2700 yet? Your math assumes 32768 cores saturating GDDR5 all performing sequential accesses. The throughput will be much less with random accesses. I think your area estimates for the ASIC are overly optimistic, as there'll be quite a complicated interconnect network to attach all those cores to the memory etc. |
Actually, there are only 218 cores. The design above assumes scratchpads are stored in SRAM, so a maximum of 256 programs can be run in parallel. The GDDR5 memory is just for the dataset, which is read mostly sequentially. If you wanted to run thousands of programs in parallel, you'd have to store scratchpads in GDDR5 and use the design by @cjdelisle to hide random access latencies. However, in this case you would need 12 GDDR5 chips per board to get enough capacity (12 GiB) and bandwidth (384 GiB/s). The cost of the memory chips alone would be over $200 per board. Power consumption would probably also increase because GDDR5 is power hungry.
Yes, it's maybe too optimistic. The 2.5x efficiency figure is an upper estimate for an ASIC. I still think a bandwidth-limited design is the way to go. If the design was purely latency-bound, an ASIC could use much cheaper DDR3 memory. This can be seen in Antminer E3. |
This seems like a reasonable design for the tech, consider that you can eliminate caches, registers and even split the components of the ALU into circuits (direct add insns to the adders, mul insns to the multipliers, etc). You need SRAM mostly for router-like buffers because the chip is basically a network. Generally speaking, I think your approach of focusing on power consumption is a good heuristic to fit the problem to the hardware you have (though it might be worth also watching int ops and float ops to make sure there are no shortcuts). I'm hoping to fit the problem to the hardware I want to have so my design will be slightly different, focusing more on branching / prediction and use of lots of instructions with somewhat less power consumption. That said, my whole design falls down if it turns out that the high bandwidth wiring is prohibitively expensive. |
I'm experimenting with doubling the memory bandwidth requirements by increasing dataset read size from 8 to 16 bytes. The performance drop for CPUs depends on the available memory bandwidth. On Ryzen 1700, it seemed to hit a bandwidth bottleneck with dual channel DDR4-2400, so I upgraded to DDR4-2933. Here are the performance numbers:
With 16 threads, it's still slightly bandwidth-limited even with 2933 MHz memory. It seems that 3200 or 3466 MHz memory might be needed. For the ASIC design, this would mean either halving the performance to ~16K programs per second per board (with corresponding halving of die area) or a forced upgrade to 256-bit GDDR5 interface with 8 memory chips, which would double the memory cost to ~$150 per board and put more strain on inter-core bandwidth. One drawback of this change is that the execution units of CPUs would be slightly underutilized, which would make it easier for an ASIC to match the compute requirements. This could be solved by adding more compute per VM instruction. What do you think about this change? |
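A rough way to see why DDR4-2933 is borderline, reusing the ~4000 programs/s per Ryzen 1700 and 4 MiB-per-program figures from earlier in the thread (so this is an illustration, not a measurement):

```python
bytes_per_program = 2 * 4 * 2**20        # dataset reads double from 4 MiB to 8 MiB
cpu_programs_per_s = 4000                # ~500/s per core * 8 cores (earlier figure)
required_bw = bytes_per_program * cpu_programs_per_s

ddr4_2933_dual_peak = 2 * 8 * 2933e6     # bytes/s, theoretical peak
print(round(required_bw / 2**30, 1), "GiB/s needed")                    # 32.0
print(round(ddr4_2933_dual_peak / 2**30, 1), "GiB/s theoretical peak")  # ~43.7
# ~73% of the theoretical peak; real-world DDR4 efficiency leaves little headroom.
```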
3200-3466 MHz memory is becoming more common now. We should aim for maxing out available DDR4 dual channel bandwidth and compensate with more computing to load CPU if needed. |
Nobody needs to be reminded how bad it would be for a PoW algorithm to be broken. Please constrain your comments to the actual topic and refrain from going off on wild tangents, if you wish to be taken seriously. |
[…]
Unfortunately Intel is not selling their new ASICs retail.
Intel has just entered the proof-of-work ASIC business. Intel claims a significant performance-efficiency-product advantage compared to Bitmain. EDIT: I read that Intel is ostensibly venturing into proof-of-work ASICs because — at least in the case of SHA256 — the smaller wafer area provides higher yields (e.g. than their CPUs and other customers' large-area designs), thus giving them more options when ramping up new process yields and possibly incorporating multi-project wafers in the context of their recent strategic shift to compete with TSMC, Samsung and GlobalFoundries in offering fab services.
Incorrect assumption written during a cryptowinter before the fledgling onboarding of institutions that forebodes Bitcoin becoming a world reserve currency with a $100T market cap perhaps within a decade.
Granted the superscalar pipeline in modern non-embedded-market CPUs will exploit that instruction level parallelism (ILP). Yet the dynamic dependencies at which dynamic superscalar excels apply to memory read/write dependencies which you stated ‘very rarely’ occur. Could the static register-independence ILP be statically compiled as out-of-order in a VLIW architecture? Itanium failed but ostensibly this was because real-world programs have too much dynamic ILP going on to gain sufficiently from static analysis — yet RandomX seems to fail to duplicate that facet of real-world programs, as alluded to in this thread last year. 😉

Apparently ²⁵⁵⁄₂₅₆ths of the performance benefit of speculative execution from the §Superscalar execution section in the Design document is obtained by simply not taking every jump instruction, rendering a CPU’s speculative execution an electricity-wasting appendage (as mentioned both in the §2.6.1 Branch prediction documentation and in the audits). I’m trying to boggle at what appears to be the sophistry of “1. Non-speculative - when a branch is encountered, the pipeline is stalled. This typically adds a 3-cycle penalty for each branch.” Why would an ASIC statically set to predict every branch as not taken stall the pipeline for any of the ²⁵⁵⁄₂₅₆ of the occurrences, as has ostensibly been (disingenuously?) modeled? EDIT: I suppose the intent could be as compared to a non-speculative, general purpose CPU that isn’t designed specifically to optimize RandomX — in which case that could be clarified so the unwitting readers aren’t misled, given the presumption that RandomX’s design intent focus is to be ASIC resistant.

Thus I conclude the dynamic out-of-order and speculative appendages of modern CPUs waste some electricity and wafer area as compared to what could be implemented on an ASIC. As quantified it may or may not be that significant of an advantage for the ASIC but it will be one of many advantages.
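A tiny illustration of the branch point above, using the 1-in-256 taken probability and the 3-cycle stall quoted from the design document:

```python
p_taken = 1 / 256     # fraction of branches actually taken
stall_cycles = 3      # quoted penalty when the pipeline stalls on a branch
print(round(p_taken * stall_cycles, 4), "expected stall cycles per branch")
# -> 0.0117, i.e. a static "predict not taken" design almost never stalls
```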
New RISC-V CPU claims record-breaking performance per watt. How about 10 times more power efficient? Such CPUs targeted at embedded applications will not likely be employed in applications that require real-time interaction. Nobody wants to wait on their smartphone to finish a task, and we all know there’s an Amdahl’s law limitation on the parallelization of real-world programs, which contain an inherent serial, contention and synchronization component. A RandomX ASIC may be able to remove the energy- and wafer-area-wasting facets of modern CPUs that exist to minimize latency. Yet we were all told this already:
They make the throughput-efficiency-product point, but didn’t mention throughput-cost-product, for which higher yields on smaller wafer area could possibly be a significant advantage compared to cutting edge modern CPUs.
In the real world they’re inseparable because of inherent serial, contention and synchronization overhead. But afaics RandomX lacks the extent of nondeterminism in the real world of I/O. |
I realize the following contemplated design was considered deprecated in this issues thread, ostensibly because RandomX was changed during the thread discussion to have a random access latency bound on the Dataset. Yet correcting the following is relevant for forthcoming estimates of an ASIC advantage for a new contemplated design I will posit.
[…]
Keeping with the point that we can no longer assume that proof-of-work ASICs will not receive top design effort from the likes of Intel, the comparable Nvidia GPUs of that era consumed ~77% of the power with ~36% faster base clock rate when scaled proportionally by process size and area (c.f. also).
The base clock speed of said Nvidia GPU scaled to 14nm would be only half of the Ryzen’s, thus only ~250 programs per second are required. Thus the area and power required are nearly halved, whilst the power efficiency is ~77%; the Ryzen 7 1700X operating at those frequencies is generating ~6200 programs per second at 95 W TDP. Maybe it generates ~4000 programs per second at 65 W TDP, which doesn’t change the conclusion. Thus for this rough estimate and design the ASIC would have a ~3.7 times power efficiency advantage. The contemplated design has significantly more cache than the GPU, so it would be slightly less compute intensive per unit area; maybe that bumps the power efficiency advantage closer to 4. Also I see no reason why, if volumes are significant enough, Intel or AMD couldn’t be motivated to produce cores that operate at the same frequencies as their CPUs, thus the custom RandomX chip advantage could approach 8 times for this contemplated design example.
Shrinking the wafer area by half presumably lowers the cost non-linearly (due to increased yields and more wafer area utilization), so less than $75. The Ryzen 7 1700 was retailing for 4 – 5 times that cost in 2017 when the 14nm process was prevalent. Also the end user had (and still has for newer CPUs) no way to produce a Ryzen 7 1700 computer system for less than ~$600+ and no option for amortizing system components over multiple CPUs, because there are no dual- or quad-CPU motherboards for non-server CPUs. Server CPUs and motherboards cost several thousand dollars and would be underutilized as a personal computer for most users. If they’re buying a device specifically for mining then they should purchase an ASIC.
I bet it will be less than 10% difference. Hub-and-spoke topology[1] is efficient, the spoke ends have low bandwidth in this example and it’s why we don’t run an independent water main from the pumping station directly to every home. [1] I.e. a scale-free network power-law phenomenon that dominates resource management in nature, even for wealth. Also related to the Pareto principle. |
Some excerpts:
Ftfy.
[…]
Ostensibly other than the rare dynamic (i.e. register runtime values) random memory contention and cache lines (which can perhaps be obviated by a holistic threaded masking of latency), the only illusion of non-determinism (c.f. also) added by RandomX as compared to all previous failed attempts at ASIC resistance is the static-per-program randomization of VM instructions. Being static and thus deterministic, it can perhaps be somewhat obviated as limiting relative advantage for an ASIC with VLIW as I previously posited, presumably with an increasing trade-off of lower throughput-cost-product (i.e. the often idle specialized circuits) for higher throughput-efficiency-product (i.e. more efficient specialized circuits) as the design is pushed to the limits of efficacy. The relatively miserly wafer area of non-multiplicative VM instructions (sans the multiplicative ones, which gobble orders of magnitude more wafer area) would occur 44% of the time paired, 29% as triplets and 19% as quadruplets. For integer VM instructions that’s only 100 combinations paired, 1000 as triplets and 10,000 as quadruples. The combinations are even fewer for floating point. However the number of combinations will be significantly larger if we want to hardwire all possible register combinations for even more efficiency of the n-tuples, although this still may be a worthwhile trade-off of wafer cost for greater efficiency. We must assume that CPU designs prioritize throughput over electrical efficiency for the otherwise too-slow multiplicative instructions, so in that case throughput could be sacrificed for efficiency if latency is masked in another facet of the ASIC’s design.
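The combination counts quoted above follow from assuming roughly ten distinct non-multiplicative integer instruction types (an assumption implied by the comment, not a RandomX constant):

```python
n_types = 10                          # assumed distinct integer instruction types
for k, name in ((2, "pairs"), (3, "triplets"), (4, "quadruples")):
    print(name, n_types ** k)         # 100, 1000, 10000
```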
Incorrect. Transistor speed scales with capacitance which scales proportional to the die shrink factor. I recently replied to your 2017 claim that Moore’s law was dead:
smooth_xmr wrote in your thread:
|
Specialized circuits will idle less often if we share them between threads, but this incurs at least the extra latency of more distant scratchpad cache. If that’s a positive tradeoff[1] then presumably the L1 cache is eliminated (leaving only L2 to serve its function), because the latency will be masked by the additional threads required to mask the slower L2. This holistic masking would also provide some leeway for other latency hiccups that the more complex CPU might handle in stride with its higher cache set associativity, OoOE, etc. Yet if sharing ALU resources is compartmentalized for a group of threads and threads can be moved to a different group by moving only the register file for each new program (with cache distance irrelevant due to latency masking), the binomial distribution of (perhaps separably) integer and floating point multiplicative operations could come into play. At n=256, the standard deviation for the normal distribution approximation indicates for example a ~31.7% occurrence of greater or less than 38 ± ~6 integer multiplicative operations per program (i.e. greater or less than ±15% from the mean; see the sketch after this comment). So some ALU groups could have more or less of these multiplicative resources to optimize the matching of throughput-efficiency-product and/or throughput-cost-product to programs relatively better than the CPU.
Random bits do not represent the entropy of the system when those bits evoke non-fungible resources. Your analogy is vacuous and #NotEvenWrong. In other words this and the prior post are examples that I was correct that the entropy is not the 2⁵¹² seed of the random generator. The entropy is reduced by the complex analysis of the interactions of the said non-fungible resources. There’s now some math in this and my prior post for you to attempt to refute. If the entropy was solely determined by the random bits that comprise the program then said math could find no optimizations involving anything other than the random bits. My posited optimizations leverage information which subsumes some of the information (Shannon entropy) in the random bits.
😛 [1] Low-power SRAM can have static power consumption 1% of dynamic (hot) and static power consumption per 256KB can be less than 10μW — thus 10,000+ static scratchpads for less than ⅒W (not that we’ll need anywhere near that many). I will propose moving L3 off-die in my next post. I do not know if that document is representative of the reality that applies to the intended context. SRAM seems to have many variants and much ongoing research. |
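A quick check of the binomial estimate above, assuming the stated mean of 38 integer-multiplicative instructions per 256-instruction program (i.e. p = 38/256):

```python
import math

n, mean = 256, 38
p = mean / n
sigma = math.sqrt(n * p * (1 - p))
print(round(sigma, 1))                     # ~5.7 instructions per program
print(round(100 * sigma / mean, 1), "%")   # ~15% of the mean; ~31.7% of programs
                                           # fall outside the +/- 1-sigma band
```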
Shelby I completely agree with you. Complexity in a PoW is bad bad bad, increasing the attack surface and providing plenty of opportunity for unobvious optimizations by private parties who hide their trade secrets. My assessment three years ago was that an ASIC could indeed outperform CPU's running RandomX, but I no longer have the time or motivation to really get into it. IMO, there are a few reasons why there is not an obvious RandomX ASIC yet, but none of them have to do with the "ASIC resistance" of the algorithm:
For some reason the Monero team decided to hand the keys of the kingdom to AMD and Intel, a duopoly whose headquarters are across the street from each other, both of whom produce chips with backdoors for the NSA. I don't get it. I thought Monero was against that kind of centralized control under the thumb of a single government. Now that Intel will be explicitly producing mining chips, it seems an even stranger decision. If you are looking for a PoW for a new project, why not just use Keccak? It's perhaps the most extensively attacked and reviewed hash we have. It's simple, and it's hella fast in hardware. It also has uses outside PoW. |
Eh, I've been waiting for you to finish editing all your mistakes.
Nonsense. You continue to ignore ARM, which wins on power efficiency and sheer volume in terms of installed base. And RISC-V is still evolving. There is nothing preferential or advantageous to x86 in RandomX. |
As for Intel entering the SHA256 ASIC market - as I wrote in August 2019: http://highlandsun.com/hyc/monero-pow-12.txt
|
Tri-opoly now with the M1. Cupertino is just a few miles from Santa Clara. We can argue about ARM chips being competitive, but at least they're owned by the non-US entity SoftBank. My point is that when you create a complex design and try to tie it to CPU's, you limit competition to only a few very large corporations, almost all of whom are under the thumb of the US government. SHA3 chips could be produced by just about anyone, including small teams, and I urge you to look at the algorithm. Your claims of new tricks do not seem well-founded to me. It is far simpler than SHA2. And if you think SHA2 has too many tricks, why would you design a PoW as complex as RandomX? Do you think there are not a ton of tricks for optimizing RandomX? I don't understand how you can claim SHA2 has too many tricks and yet think that RandomX is solid, or that it can't be made into an ASIC. But we are replaying an old fight. 🤷 Shelby can read the history, except for the IRC talk. |
Agreed, there's nothing new here. |
For viewpoints critical of RandomX's ASIC resistance, look at comments from myself and also "Linzhi" which is a fabless ASIC company from Shenzhen. |
My Telegram Chinese friend wrote, “the biggest shareholder of sin0vac is softbank from japan.” EDIT: What do you think of RISC-V as a potential way for us to have CPUs which are free from malware? Do you think it will be possible to get these manufactured, and what would be the volumes and capital one would need to be taken seriously by TSMC or another foundry? Could these foundries insert exploits that we’d be unable to find? Tangentially I wrote on my Telegram group: Well, if ever we need to build our own ancient CPU out of discrete transistors: That was the second CPU I learned to program. This discrete version costs $4000 to build and consumes 10W, compared to the actual 6502 IC part which costs $10, runs 300 times faster and consumes 8mW. I don’t know if people comprehend the rate at which Moore’s law (no not my law, haha) has altered the human species. Even an early 1990s era CPU such as the Motorola 68000 (which is the CPU I was programming for most of my early significant accomplishments in my 20s) would occupy 8.8 hectares of land if built with discrete components. The human species has difficulty comprehending exponential growth and scale.
I want to help emphasize that those are relevant points to why some observers do not conclude with utmost confidence that RandomX is highly ASIC resistant.
I did read that other RandomX issues thread. I appreciate everyone who has shared their knowledge, including the authors and contributors to RandomX, because integrated circuit design and manufacturing is new to me. I had studied the boolean logic design of, for example, early microprocessors (at age 13 actually) and had built analog electronic and digital circuits when I interned at Rockwell Science Center in Thousand Oaks, but then I exited the field to launch a software company in the 1980s.
It does come across as disingenuous. That makes it difficult to trust that they have been
I noted Mircea was also raving about Keccak in the past. I think that might be a good one to consider transitioning to after onboarding, perhaps with a hardcoded transition schedule in the protocol. Also I posit that I won’t have to lie to users about the fact that the proof-of-work will always be entirely centralized in the end game. I have contemplated a consensus design which (hopefully) obviates a 50+% attack in any form including transaction censorship. An ISTJ tells me I lack fecundity in my bit-twiddle and I retort that his bit-twiddle pursuits indicate he doesn’t conceptualize why the power-law distribution of resources is inviolable. Banging one’s head against a brick wall is not a very productive activity, but hey, never interrupt the antagonist when they’re busy destroying themselves — Sun Tzu. Also the other reason to consider RandomX is that if there are users willing to mine at a loss (they do not care about the loss of a fraction of a penny if they are onboarding), then the following destruction of all the altcoins that is coming might not apply to mine (but I still think it would be vulnerable if not for the key change I made in the consensus design): http://trilema.com/2014/the-woes-of-altcoin-or-why-there-is-no-such-thing-as-cryptocurrencies/ In short, I expect Monero to be destroyed whenever the powers-that-be are ready to do so. Ditto Litecoin, Dash, etc.
I interpreted his comment differently: that to truly compete with what is posited to be a state-of-the-art RandomX ASIC, it might be necessary to have Intel’s or AMD’s intellectual property. Such an intentional asymmetrical design choice could possibly be a major blunder unless the actual ASIC resistance (if we even have a way to reliably estimate it) is sufficient to meet some aims. I posit that for Monero’s raison d'être (i.e. anonymity) such an asymmetrical unknown deletes the assurance of anonymity. Smooth argued to me that routine privacy (e.g. your neighbor can’t track your spending) doesn’t necessarily require anonymity against powerful adversaries such as three letter agencies. Btw, so far I am pleased with the decision to make RandomX compatible with ARM because it might match my use case if I can nail down the level of ASIC resistance to within an order-of-magnitude. I haven’t yet finished my study and exposition, yet I am already leaning towards the likelihood that a state-of-the-art RandomX ASIC will be at least an order-of-magnitude more efficient and/or less costly. Anyone else want to share any additional thoughts about such an estimate? I suppose I am thinking right now that three orders-of-magnitude is unlikely. I really wanted help on refining such estimates, which I presume is beneficial to any project that wishes to employ RandomX. I am not sure if anyone but “antagonists”[1] to RandomX are going to do their best to I think @tevador and perhaps also @SChernykh were trying to be somewhat unbiased and that was appreciated. I probably dropped the ball by not applying more effort last year to explain myself, but I grew weary of the discussion at that time, had other pressing (health[2]) matters to attend to, needed more focused study/thought but didn’t have the free time to do it properly, and at the time I had no immediate interest in using RandomX.
[1] My impression/intuition is they (especially @timolson) are genuinely trying to help by unselfishly offering their time and expertise, but it seems perhaps some of the RandomX contributors in this thread are skeptical and think there’s some hidden subterfuge agenda, at least w.r.t. myself and "Linzhi". I offer no firm opinion on "Linzhi" other than to cite factual statements such as the relative number of logic gates between different ALU operations. [2] The reason for not trying to explain what I might mean about entropy before was that it would have dragged me into another debate, perhaps to be misunderstood if I hadn’t taken the time to carefully study and contemplate. I was in a rush to travel from that location to escape the lockdowns, closed exercise facilities, masking (and climate) that I got stranded in by the sudden turn of events in 2020, which were so deleterious to my chronic health issues. Obviously these guys here will ridicule anyone who has not extensively contemplated their own ideas. So no one signs up for that unless they can dedicate the time to do it reasonably well and thoroughly. Also I am weary of the MoAnero trolls, as it has always been the same throughout the years when interacting with some of them, so I just decided to find a way to shut down the discussion last year and move on to more pleasant activities. Seems to be an ISTJ personality type issue — they’re incompatible with ENTPs. ENTPs are visionaries and work with ideas. I?TJs are detail freak experts. Technology is supposed to be fun, at least that is why I got into it. I learned a lot from discussions with smooth, ArticMine and some others. |
The effort that was applied to attempt to achieve maximum ASIC resistance is admirable. Whether the tradeoffs of that goal are a net positive is debatable. What is not admirable is the dearth of intensive ASIC designs attempted by the devs to validate their work. When one comes to this project they want to read about all the attempts to break the ASIC resistance and the detailed expositions. One wants to learn from the experts by reading, instead of having to become an expert oneself to do work one isn’t really qualified to do. Instead of being on the defensive in discussions of RandomX, why not try to more aggressively attack your own design? The best programmers want to break their own programs. I always expended an order of magnitude or more effort breaking and fixing my programs than designing and coding them. As for IQ, which you seem to be harping on, I remember 152–160 IQ Eric Raymond being schooled by some Rust devs when he got into a debate about some of the design choices. I may not have the Mensa-level IQ I once did (after head trauma such as being struck with a hammer, liver disease and type 2 diabetes) but I did have instances in my youth where I wrote out verbatim from (photographic?) memory several hundred lines of code after a power outage. I doubt I could do that now, also being blind in one eye. Also you should be proud that so many people are interested in discussing your work. Shouldn’t this be a labor of love, to talk shop with others and educate them about your work? The circle of people who would even have the ability and interest to converse here is presumably quite small. Not everyone who comes around is going to have the time to be as expert as the original authors (at least not initially) and it may take them some time to come up to speed and even confirm their interest in investing the effort to do so. Why do you expect that other people owe you their utmost perfection? Nobody forced you to continue replying on this project if it pisses you off so much to have to entertain the people you think are worthless clowns. Note I did appreciate your help on my recent question. |
When they assume our motives are competing financial interests, why should we assume anything less for their motivation? Why should we not assume some guys mined the heck out of Monero in the early days when mining difficulty was low and are now defending their tokens? They need more and more greater fools to buy so they can cash out at higher profits. If so, then the most important priority is to maintain the illusion of virtue. And thus ASIC resistance, so as to not admit that proof-of-work is always going to become more and more centralized over time. Westerners are living in an illusion that they are not fully enslaved, so the objective could be to sell that hopium to them. OTOH, I observe that some in the Monero community are very idealistic, maybe to such an extreme that they refuse to accept that they can’t defeat the centralization of proof-of-work mining, even if it entails lying to themselves by making overly optimistic assumptions. And then there appears to be another facet of the Monero community: they want to be known as the highest IQ, most technologically advanced project in crypto. Yet they never discarded that Rube Goldberg ring signature anonymity which is conceptually flawed. They have intensive engineering but viewed holistically it’s incoherent. This is I suppose what happens when you bring a lot of huge Then there are several very level-headed members of that community also. So we can’t generalize. Maybe I am entirely wrong but that is my attempt to try to understand the project. Also I have enjoyed learning about technology from the Monero project. And delighted that RandomX has been field tested for flaws unrelated to provable ASIC resistance. So all-in-all I think everything happens for a reason. Not really complaining. Just trying to throw some shade on any expectation of pure virtue in the cryptocosm. P.S. Anyway I discovered a new way to do anonymity which is far superior to anything out there now because it renders any action we do on the Internet untraceable, not just cryptocurrency transactions. Not onion-routing nor even random-latency mixnets, as I was the one who was arguing back in 2014/5 that Tor was a honeypot, which was when some in the Monero community were espousing I2P. |
Done. Check that post again now. 😉 |
Oh boy this again. If you took just half the effort you put in arguing on Bitcointalk and GitHub, and put it into actually building something, you might release it before this decade is over. |
Pay the very large sum of money back to my friend that you ostensibly stole from him. I know things. EDIT: Ad hominem declined. Will not sway me into revealing my intellectual property secrets before I’m prepared to launch something if ever. |
I literally have no idea what you're talking about. You'll have to be extremely clear if you're going to sling around public accusations. |
You know exactly what I am talking about. |
I haven't stolen money from anyone, so I absolutely do not. |
It is all going to catch up to you someday. Just continue the big lie. I see you managed to escape from the EDIT: I’m aware how corrupt S. Africa is from the Youtube channel of an expat South African who grew up and still has family there. So who knows what’s really going on with that. Extortion against you perhaps, would be my leaning if I didn’t have other information indicating you might be unscrupulous. Maybe I should instead assume that someone in S. Africa was bought off. In any case I trust my friend because I know he is virtuous. It’s my prerogative not to trust you. You injected non-factual allusions into a technical discussion — it has been explained many times that I was in no way connected with BCX and was in 2014 merely pontificating about whether his threats could have any technological realism. I was curious and learning about the technology — no malice involved. I was also 8 years younger and probably more aggressive, energetic, excited, naive and bewildered. Yet this was blown out of proportion by those who want to attach some ad hominem to my reputation. I was responding to @timolson, who seemed bewildered by motivations, so I proffered an explanatory hypothesis. The main thrust was to temper his expectations of virtue in the cryptocosm — just like every other facet of life there are vested interests, situations, etc. that can run counter to what would otherwise be irrational. |
Nope, still in the USA. Again - if you're going to make baseless accusations you should back them up, otherwise it's BCX's Monero attack all over again. |
There is no baseless accusation. Go sue me (EDIT: for defamation if the accusation is false), then I will reveal the name of the person I promised not to reveal. |
lol why would I sue you? You're welcome to make as many baseless accusations as you want. |
Promising to hold $millions in XMR for people and then pretending you were hacked. You must have learned that from Bruce Wanker? And never filing a police report, lol. |
The only time I promised to hold XMR for people was when I held MintPal's post-exit scam Monero to refund depositors, but ok. |
@shelby3 need I remind you of this from a mere 5 days ago
Since it appears you're unable to maintain a coherent discussion I'm inclined to block you. |
Before I go I will dump the other technological information I had dug up on being latency-bound. For one, it appears that 40 ns was the assumption about memory latency being referenced upthread, but this excellent document educated me about the meaning of key timing parameters for DDR4. I remember there was a reference upthread to 40 cycles + 90 ns for an L3 cache miss, and this is presumably including all the latency of the various facets of the memory system including the memory controller. The memory controllers for modern CPUs are ostensibly complex and they are interacting with complex caching as well. An ASIC may not need all those features. The optimal case of Of course then we have to consider latency due to the memory controller reordering for optimizing DDR latency, which is not a throughput limitation if we can mask it with threads. We increase the number of memory banks to increase throughput, but if we have to use redundant Datasets then power consumption increases. Yet power consumption for DDR4 is reasonably low at ~1.5W per 4GB. This is a hardware feature I had never studied in depth. EDIT: the number of threads per computational core group is not contemplated to be massive as in the @cjdelisle proposal. Rather just enough to mask the various latencies I’ve mentioned in this and prior recent posts.
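One hedged way to quantify "mask it with threads": by Little's law the number of in-flight reads must cover bandwidth × latency. The bandwidth, latency and read size below are illustrative values drawn from figures mentioned in this thread, not a design spec:

```python
target_bw = 38.4e9       # bytes/s, roughly dual-channel DDR4 peak
latency_s = 100e-9       # ~100 ns effective latency including the controller
read_bytes = 64          # one cache-line-sized burst per read

in_flight = target_bw * latency_s / read_bytes
print(round(in_flight), "outstanding reads needed to hide the latency")   # 60
```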
Is baseless ad hominem the acute |
EDIT: This design is outdated and no longer applies to the current RandomX version.
Similar idea was originally proposed by @cjdelisle for a GPU miner, but I think it's more applicable for an ASIC design.
RandomX ASIC miner:
During dataset expansion, the SRAM is used to store the 64 MiB cache, while dataset blocks are loaded into HBM memory.
When mining, the ASIC runs 256 programs in parallel.
Memory allocation:
Instructions are loaded into the decoder/scheduler cores (each core has its own program buffer, program counter and register file). The scheduler cores handle only CALL and RET instructions and pass the rest to one of the 28 worker cores.
Each of the 28 worker cores implements exactly one RandomX instruction pipelined for throughput which matches the instruction weight (for example MUL_64 worker can handle 21 times more instructions per clock than DIV_64 worker). Each worker has an instruction queue fed by the scheduler cores.
The speed of the decoder/scheduler cores would be designed to keep every worker core 100% utilized.
Some complications of the design:
There could be also some dedicated load/store ports for loading instruction operands.
If limited only by HBM bandwidth, this ASIC could do around 120 000 programs per second (480 GiB/s memory read rate), or roughly the same as 30 Ryzen 1700 CPUs. This assumes that sufficient compute power can fit on a single chip. If we estimate power consumption at 300 W (= Vega 64), this ASIC would be around 9 times more power efficient than a CPU. Improved estimate: #11 (comment)
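Restating the closing estimate as a sketch (the Ryzen programs/s and ~90 W compute figures come from elsewhere in the thread; the 300 W ASIC figure is the stated assumption):

```python
hbm_bw_gib_s = 480
asic_programs_s = hbm_bw_gib_s * 1024 / 4          # 4 MiB per program -> ~122,880
cpu_programs_s = 4000                              # ~500/s per core * 8 cores (Ryzen 1700)
cpu_power_w, asic_power_w = 90, 300

efficiency_ratio = (asic_programs_s / asic_power_w) / (cpu_programs_s / cpu_power_w)
print(int(asic_programs_s), "programs/s,", round(efficiency_ratio, 1), "x more efficient")
# -> 122880 programs/s, ~9.2x (the comment rounds to ~120 000 and ~9x)
```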
Disclaimer: I'm not an ASIC designer.