Add option to control number of nodes & partitions #411
The goal of this PR is to facilitate dataset generation on smaller Spark clusters on EMR.
Our experiments show that a machine with 128GB of memory can generate SF3K reliably with 3 blocks per partition, given ample disk space to allow for spills (tested with 3.8TB); fewer partitions (and, consequently, a larger block-to-partition ratio) would introduce OOM errors for this configuration.
This PR:
- Sets `spark.default.parallelism` to the desired partition number (i.e. the same value as `--num-threads`). This makes sure that after wide operations (e.g. the sorting), we don't fall back to a risky, lower value. This is the same behaviour as `run.py`.
- Renames the `sf_ratio` parameter to `sf_per_executor` with the same semantics (scale factor for each executor node). The default value is changed to 3000, i.e. 3000 SF for each executor.
- Adds an `executors` parameter, mutually exclusive with the former, to allow setting the number of executor nodes absolutely.
- Adds `sf_per_partition` and `partitions` parameters to set the desired partition number. The default is `sf_per_partition = 100` (see the sketch after this list).
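
As a rough illustration of how these options fit together, here is a minimal PySpark-flavoured sketch. The derivation of executor and partition counts from the scale factor, and the helper names, are assumptions based on the description above, not the PR's actual code:

```python
import math

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Assumed defaults, taken from the description above.
SF_PER_EXECUTOR_DEFAULT = 3000
SF_PER_PARTITION_DEFAULT = 100

def derive_cluster_shape(scale_factor: int,
                         sf_per_executor: int = SF_PER_EXECUTOR_DEFAULT,
                         sf_per_partition: int = SF_PER_PARTITION_DEFAULT):
    """Hypothetical helper: turn the per-executor / per-partition ratios into
    absolute executor and partition counts (ceil so small SFs still get 1)."""
    executors = math.ceil(scale_factor / sf_per_executor)
    partitions = math.ceil(scale_factor / sf_per_partition)
    return executors, partitions

executors, partitions = derive_cluster_shape(scale_factor=3000)

# Pin spark.default.parallelism to the partition count so that wide
# operations (e.g. the sort) keep the intended parallelism instead of
# falling back to a lower cluster default.
conf = SparkConf().set("spark.default.parallelism", str(partitions))
spark = SparkSession.builder.config(conf=conf).getOrCreate()
```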
The following combinations were tested and work. Partition numbers are given as absolute values; they were calculated for each scale factor using the formula `ceil(persons_in_sf / block_size / 3)`, to make sure each partition contains no more than 3 blocks (a small worked example follows below).

(Table of tested combinations; the reported runtimes were ~11 hours.)
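
For reference, the partition formula can be computed directly. The `persons_in_sf` and `block_size` values below are placeholders for illustration, not figures from the PR:

```python
import math

def partitions_for(persons_in_sf: int, block_size: int,
                   max_blocks_per_partition: int = 3) -> int:
    """ceil(persons_in_sf / block_size / 3): no partition gets more than 3 blocks."""
    return math.ceil(persons_in_sf / block_size / max_blocks_per_partition)

# Placeholder numbers only: 900 blocks / 3 blocks per partition -> 300 partitions.
print(partitions_for(persons_in_sf=9_000_000, block_size=10_000))  # 300
```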