Add caveats to Scalability section (memory, storage) #26

Open
nfd9001 opened this issue Oct 22, 2021 · 1 comment

nfd9001 commented Oct 22, 2021

I read the Scalability entry, and it's a good post. I'd add a couple more caveats (discussed briefly in the article). Not all "big data" scalability problems are about scaling out the number of CPU cores. I've worked on "big data" scaling with Spark before and often built out clusters for datasets 10,000 to 100,000 times the size of the one on McSherry's laptop. The calculus for these sorts of systems starts to tip back towards "the cluster's better" fairly quickly when you're also dealing with bus and memory bounds. Do you have enough memory to hold the data you need in-memory, plus room to receive shuffles? Do you have a local network and NICs that are adequate to run those shuffles in reasonable time? Do you have enough striped fast storage?
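
As a rough illustration of the memory and network questions above, a back-of-envelope check might look like the sketch below. The function names, the 50% shuffle headroom, and every number are hypothetical and chosen only for illustration. They are not taken from the article or from any Spark API.

```python
def fits_in_memory(dataset_gb: float, ram_gb: float,
                   shuffle_headroom: float = 0.5) -> bool:
    """Return True if the dataset plus room to receive shuffles fits in RAM.

    shuffle_headroom is an assumed fraction of the dataset size reserved
    for incoming shuffle data (0.5 means 50% extra).
    """
    return dataset_gb * (1 + shuffle_headroom) <= ram_gb


def shuffle_seconds(shuffle_gb: float, nic_gbit_per_s: float,
                    nodes: int = 1) -> float:
    """Lower bound on the time to move shuffle_gb over the cluster's NICs.

    Assumes perfectly even traffic and ignores protocol overhead, so real
    shuffles will be slower than this.
    """
    gb_per_s = nic_gbit_per_s / 8 * nodes  # 8 bits per byte
    return shuffle_gb / gb_per_s


if __name__ == "__main__":
    # A 1 GB dataset on a 16 GB laptop: no cluster needed.
    print(fits_in_memory(1, 16))
    # The same laptop with a dataset 10,000 times larger.
    print(fits_in_memory(10_000, 16))
    # Moving 10 GB of shuffle data over a single 10 Gbit/s NIC.
    print(shuffle_seconds(10, 10))
```

The point of the sketch is only that the answer flips once the dataset outgrows one machine's RAM, at which point network and storage throughput start to dominate the decision.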

I'd add the 1G (still fairly large, sure) dataset size to the Shower part and explain that this is largely a warning against overengineering and premature optimization.

hwayne (Owner) commented Apr 11, 2023

These are good ideas
