Thrashing for SF30k with default settings #428
The time required to generate SF30,000 on AWS EMR with 20 i3.4xlarge instances is ~12.25 hours:
Running the factor generator in its current form is very slow; this needs further investigation.
I'm struggling to generate an SF10K dataset using Spark. So far, I have attempted to install Spark locally and run it with "--parallelism 8 --memory 96G". However, after about 2 hours, I receive a 'java.lang.OutOfMemoryError: Java heap space' error. I then reduced the concurrency level to 2, but after running for 12 hours, I received an 'Issue communicating with driver in heartbeater org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10000 milliseconds]. This timeout is controlled by spark.executor.heartbeatInterval' error. Here is the full version of my command now:
Hi Youren, you will need a larger machine: SF10,000 needs 4 i3.4xlarge instances, which have 122 GiB of memory each. Also, the factor generation is currently a very expensive step. This is something we'll fix in the near future - until then, make sure you do not use …
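For a rough sense of the gap, here is a back-of-the-envelope comparison (my own arithmetic, not from the maintainers), using the data point above:

```python
# Back-of-the-envelope sizing (illustrative arithmetic only).
# Data point from this thread: SF10,000 runs on 4 x i3.4xlarge, 122 GiB each.
cluster_mem_gib = 4 * 122  # ~488 GiB recommended for SF10,000
local_mem_gib = 96         # the single machine attempted above

print(f"recommended cluster memory: {cluster_mem_gib} GiB")
print(f"single local machine:       {local_mem_gib} GiB "
      f"(~{local_mem_gib / cluster_mem_gib:.0%} of the recommendation)")
```

So the single 96 GB machine has roughly a fifth of the memory the recommended SF10,000 cluster provides, which is consistent with the OutOfMemoryError above.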
@Yourens you should try to increase the parallelism, not decrease it. It controls the number of partitions generated: more partitions → smaller partitions → each partition is more likely to fit into memory.
We run this with 1000 partitions altogether without memory issues on 122 GiB machines. With 96 GB of memory you might want to increase this somewhat. But if you don't care about small files, run with 2720 partitions - that's likely to succeed.
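A minimal sketch of the arithmetic behind that advice, assuming (my assumption, not stated in the thread) that each task must hold roughly its partition's share of the intermediate data in memory; the byte figures below are invented purely for illustration:

```python
def min_partitions(total_bytes: int, per_task_budget_bytes: int) -> int:
    """Smallest partition count whose per-partition share fits the budget."""
    return -(-total_bytes // per_task_budget_bytes)  # ceiling division

GIB = 1024 ** 3
total = 2000 * GIB  # hypothetical total intermediate data size

print(min_partitions(total, 2 * GIB))  # 1000 partitions with a 2 GiB budget
print(min_partitions(total, 1 * GIB))  # 2000 with a tighter per-task budget
```

This is why lowering the parallelism made things worse: fewer partitions means each task's share of the data grows, and it blows the per-task memory budget sooner.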
As remarked in #321, the default settings can cause some thrashing. This may still be true today (although the Datagen is much better optimized now).
The Python script should be altered so that it uses more machines for large SFs (e.g. ~20 for SF30k); a sketch of such a sizing rule follows below.
The expected duration of the generation job should also be documented.
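A hedged sketch of what that sizing rule could look like in the Python driver script. The two breakpoints are the data points mentioned in this thread (4 i3.4xlarge instances for SF10k, ~20 for SF30k); the function name and interface are hypothetical:

```python
def machines_for_scale_factor(sf: int) -> int:
    """Pick an instance count from known-good (SF, machines) data points."""
    # Data points from this thread; extend as more SFs are benchmarked.
    breakpoints = [(10_000, 4), (30_000, 20)]
    for max_sf, machines in breakpoints:
        if sf <= max_sf:
            return machines
    raise ValueError(f"no sizing data point for SF{sf}")

print(machines_for_scale_factor(10_000))  # 4
print(machines_for_scale_factor(30_000))  # 20
```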