Thrashing for SF30k with default settings #428

Open
szarnyasg opened this issue May 19, 2023 · 5 comments

@szarnyasg
Member

As remarked in #321, the default settings can cause some thrashing. This may still be true today (although the Datagen is much better optimized now).

The Python script should be adjusted so that it uses more machines for large scale factors (e.g. ~20 instances for SF30k).

The expected runtime of the generation job should also be documented.
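
As a rough illustration, such a default could scale with the scale factor. A minimal sketch in Python follows; the function and the smaller thresholds are hypothetical, and only the SF10,000 and SF30,000 instance counts come from this thread.

# Hypothetical helper: pick a default EMR instance count from the scale factor.
# Only the SF10,000 (4 x i3.4xlarge) and SF30,000 (~20 x i3.4xlarge) entries are
# taken from this discussion; the smaller entries are placeholder assumptions.
DEFAULT_INSTANCE_COUNTS = {
    1_000: 2,      # assumption
    3_000: 3,      # assumption
    10_000: 4,     # 4 x i3.4xlarge (from the comment below)
    30_000: 20,    # ~20 x i3.4xlarge, as suggested above
}

def default_num_instances(scale_factor: int) -> int:
    # Use the smallest bucket that covers the requested scale factor.
    for sf, count in sorted(DEFAULT_INSTANCE_COUNTS.items()):
        if scale_factor <= sf:
            return count
    return max(DEFAULT_INSTANCE_COUNTS.values())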

@szarnyasg szarnyasg self-assigned this May 19, 2023
@szarnyasg
Member Author

The time required to generate SF30,000 on AWS EMR with 20 i3.4xlarge instances is ~12 1/4 hours:

  • 9 1/4 hours for the generation (Run LDBC SNB Datagen step)
  • 3 hours for copying the data to S3 (S3 dist cp step)

Running the factor generator in its current form is very slow; this needs further investigation.

@Yourens

Yourens commented Jul 24, 2023

I'm struggling to generate a dataset of SF10K using Spark.

So far, I have attempted to install Spark locally and run it with "--parallelism 8 --memory 96G". However, after about 2 hours, I received a 'java.lang.OutOfMemoryError: Java heap space' error. I then reduced the parallelism to 2, but after running for 12 hours, I received an 'Issue communicating with driver in heartbeater org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10000 milliseconds]. This timeout is controlled by spark.executor.heartbeatInterval' error.
Currently, I am following Spark's suggestion and increasing the heartbeat timeout before running it again. I am unsure whether I have misconfigured something else.
Should we provide a tutorial on how to generate large datasets with Spark if the default configuration is not sufficient?

Here is a full version of my command now:
./tools/run.py --cores 64 --parallelism 2 --memory 96G --conf spark.network.timeout=120000 spark.executor.heartbeatInterval=10000 -- --format csv --scale-factor 10000 --mode bi --explode-edges --output-dir /largedata/sf10000
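
For reference, the two Spark properties in the command above are related: spark.executor.heartbeatInterval should be significantly smaller than spark.network.timeout. A minimal, purely illustrative PySpark sketch (not the Datagen's entry point; the values with explicit units are assumptions, not recommendations):

from pyspark.sql import SparkSession

# Illustrative only: set the two timeout-related properties with explicit units.
spark = (
    SparkSession.builder
    .appName("timeout-config-sketch")
    .config("spark.executor.heartbeatInterval", "60s")  # executor-to-driver heartbeats
    .config("spark.network.timeout", "600s")            # must stay well above the heartbeat interval
    .getOrCreate()
)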

@szarnyasg
Member Author

Hi Youren, you will need a larger machine: SF10,000 needs 4 i3.4xlarge instances, which have 122 GiB of memory each.

Also, the factor generation is currently a very expensive step. This is something we'll fix in the near future; until then, make sure you do not use --generate-factors.

@dszakallas
Member

@Yourens you should try to increase parallelism, not decrease it. It controls the number of partitions generated: more partitions means smaller partitions, which makes it more likely that each partition fits into memory.
The theoretical limit for SF10K seems to be 2720, based on numPersons / blockSize.
I admit the name is unintuitive, but it follows Spark's naming.
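
A back-of-the-envelope check of the 2720 figure in Python; the block size is an assumption, and the person count is back-computed from the quoted limit rather than read from the generator's parameters:

import math

BLOCK_SIZE = 10_000                     # persons per block (assumed)
NUM_PERSONS_SF10K = 2720 * BLOCK_SIZE   # ~27.2 million persons (implied by the quoted limit)

max_useful_parallelism = math.ceil(NUM_PERSONS_SF10K / BLOCK_SIZE)
print(max_useful_parallelism)           # 2720 -- partitions beyond this cannot be filled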

@dszakallas
Member

We run this with 1000 partitions altogether without memory issues on 122 GB machines. With 96 GB of memory, you might want to increase this somewhat. But if you don't mind many small output files, run with 2720; that's likely to succeed.
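
In practice, that would mean, for example, rerunning the command above with --parallelism 2720 (the other options unchanged), accepting a larger number of smaller output files in exchange for partitions that fit into 96 GB of memory.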
