Add caveats to Scalability section (memory, storage) #26

nfd9001 · 2021-10-22T22:41:01Z

I read the Scalability entry, and it's a good post. I'd add a couple more caveats (discussed briefly in the article). Not all "big data" scalability problems are about scaling out the number of CPU cores; I've worked on "big data" scaling with Spark before and often built out clusters for datasets 10,000-100,000 times the size of the one on McSherry's laptop. The calculus for these sorts of systems starts to tip back towards "the cluster's better" fairly quickly when you're also dealing with bus and memory bounds: Do you have enough memory to hold the data you need in memory, plus room to receive shuffles? Do you have a local network and NICs that are adequate to run those shuffles in reasonable time? Do you have enough striped fast storage?
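To make those memory and network questions concrete, here is a rough back-of-envelope sketch. Everything numeric in it is an assumption chosen for illustration and does not come from the article: the 16 GB laptop, the 50% headroom reserved for incoming shuffles, the 10 Gbit/s NICs, and reading the "1G" dataset as about 1 GB. The function names are hypothetical too.

```python
GB = 10**9  # decimal gigabyte, in bytes


def fits_in_memory(dataset_bytes, ram_bytes, shuffle_headroom=0.5):
    """Return True if the dataset plus room to receive shuffles fits in RAM.

    shuffle_headroom is an assumed fraction of the dataset size that must
    stay free for incoming shuffle data (0.5 means 50% extra).
    """
    return dataset_bytes * (1 + shuffle_headroom) <= ram_bytes


def shuffle_seconds(shuffle_bytes, nic_gbits_per_s, nodes=1):
    """Lower bound on the time to move shuffle_bytes across the network.

    Assumes every node's NIC runs at full line rate in parallel, which
    real shuffles rarely achieve, so treat the result as optimistic.
    """
    bytes_per_second = nic_gbits_per_s * 1e9 / 8 * nodes
    return shuffle_bytes / bytes_per_second


laptop_ram = 16 * GB          # assumed laptop memory
small_dataset = 1 * GB        # roughly the size discussed in the issue
big_dataset = 10_000 * GB     # 10,000 times larger

small_fits = fits_in_memory(small_dataset, laptop_ram)
big_fits = fits_in_memory(big_dataset, laptop_ram)

# Moving the whole large dataset once over assumed 10 Gbit/s links.
one_node_seconds = shuffle_seconds(big_dataset, 10, nodes=1)
hundred_node_seconds = shuffle_seconds(big_dataset, 10, nodes=100)
```

Under these assumptions the small dataset fits on the laptop and the large one does not, and a single full pass of the large dataset over the network takes hours through one NIC but on the order of a minute or two when spread across a hundred. That is the point at which the cluster starts to pay for itself.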

I'd add the 1G (still fairly large, sure) dataset size to the Shower part and explain that the entry is primarily a warning against overengineering and premature optimization.

hwayne · 2023-04-11T20:33:30Z

These are good ideas.
