Thrashing for SF30k with default settings #428
The time required to generate SF30,000 on AWS EMR with 20 i3.4xlarge instances is ~12.25 hours:
Running the factor generator in its current form is very slow; this needs further investigation.
I'm struggling to generate an SF10K dataset using Spark. So far, I have attempted to install Spark locally and run it with "--parallelism 8 --memory 96G". However, after about 2 hours, I receive a 'java.lang.OutOfMemoryError: Java heap space' error. I then reduced the concurrency level to 2, but after running for 12 hours, I received an 'Issue communicating with driver in heartbeater org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10000 milliseconds]. This timeout is controlled by spark.executor.heartbeatInterval' error. Here is the full version of my command now:
Hi Youren, you will need a larger machine: SF10,000 needs 4 i3.4xlarge instances, which have 122 GiB of memory each. Also, the factor generation is currently a very expensive step. This is something we'll fix in the near future - until then, make sure you do not use …
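For a rough sense of the gap, here is a back-of-the-envelope comparison (my own arithmetic, not from the maintainers), using the data point above:

```python
# Back-of-the-envelope sizing (illustrative arithmetic only).
# Data point from this thread: SF10,000 runs on 4 x i3.4xlarge, 122 GiB each.
cluster_mem_gib = 4 * 122  # ~488 GiB recommended for SF10,000
local_mem_gib = 96         # the single machine attempted above

print(f"recommended cluster memory: {cluster_mem_gib} GiB")
print(f"single local machine:       {local_mem_gib} GiB "
      f"(~{local_mem_gib / cluster_mem_gib:.0%} of the recommendation)")
```

So the single 96 GB machine has roughly a fifth of the memory the recommended SF10,000 cluster provides, which is consistent with the OutOfMemoryError above.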
@Yourens you should try to increase the parallelism, not decrease it. It controls the number of partitions generated: more partitions → smaller partitions → each partition is more likely to fit into memory.
We run this with 1000 partitions altogether without memory issues on 122 GiB machines. With 96 GB of memory you might want to increase this somewhat. But if you don't care about small files, run with 2720 partitions - that's likely to succeed.
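A minimal sketch of the arithmetic behind that advice, assuming (my assumption, not stated in the thread) that each task must hold roughly its partition's share of the intermediate data in memory; the byte figures below are invented purely for illustration:

```python
def min_partitions(total_bytes: int, per_task_budget_bytes: int) -> int:
    """Smallest partition count whose per-partition share fits the budget."""
    return -(-total_bytes // per_task_budget_bytes)  # ceiling division

GIB = 1024 ** 3
total = 2000 * GIB  # hypothetical total intermediate data size

print(min_partitions(total, 2 * GIB))  # 1000 partitions with a 2 GiB budget
print(min_partitions(total, 1 * GIB))  # 2000 with a tighter per-task budget
```

This is why lowering the parallelism made things worse: fewer partitions means each task's share of the data grows, and it blows the per-task memory budget sooner.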
As remarked in #321, the default settings can cause some thrashing. This may still be true today (although the Datagen is much better optimized now).
The Python script should be altered so that it uses more machines for large SFs (e.g. ~20 for SF30k); a sketch of such a sizing rule follows below.
The expected duration of the generation job should also be documented.
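A hedged sketch of what that sizing rule could look like in the Python driver script. The two breakpoints are the data points mentioned in this thread (4 i3.4xlarge instances for SF10k, ~20 for SF30k); the function name and interface are hypothetical:

```python
def machines_for_scale_factor(sf: int) -> int:
    """Pick an instance count from known-good (SF, machines) data points."""
    # Data points from this thread; extend as more SFs are benchmarked.
    breakpoints = [(10_000, 4), (30_000, 20)]
    for max_sf, machines in breakpoints:
        if sf <= max_sf:
            return machines
    raise ValueError(f"no sizing data point for SF{sf}")

print(machines_for_scale_factor(10_000))  # 4
print(machines_for_scale_factor(30_000))  # 20
```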