SF30k factor generator crashes on 125 instances #438

Closed
szarnyasg opened this issue Aug 28, 2023 · 1 comment

Comments

@szarnyasg
Member

szarnyasg commented Aug 28, 2023

I ran the generator with the following configuration:

export LDBC_SNB_DATAGEN_JAR=$(sbt -batch -error 'print assembly / assemblyOutputPath')
export JAR_NAME=$(basename ${LDBC_SNB_DATAGEN_JAR})
export SCALE_FACTOR=30000
export JOB_NAME=sf${SCALE_FACTOR}-for-surf

./tools/emr/submit_datagen_job.py \
    --use-spot \
    --sf-per-executor 240 \
    --instance-type i3.8xlarge \
    --jar ${JAR_NAME} \
    --bucket ${BUCKET_NAME} \
    --copy-all \
    --az us-east-2b \
    ${JOB_NAME} \
    ${SCALE_FACTOR} \
    csv \
    bi \
    -- \
    --explode-edges \
    --format-options compression=gzip \
    --generate-factors

This started 125 instances. Unfortunately, the job crashed after about 2 hours, but the step kept running for another ~7 hours before exiting.
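The instance count is consistent with the submit parameters. As a quick sanity check, assuming (this is not confirmed anywhere in this issue) that the submit script sizes the cluster as the scale factor divided by `--sf-per-executor`, rounded up:

```shell
#!/usr/bin/env bash
# Values taken from the command above.
SCALE_FACTOR=30000
SF_PER_EXECUTOR=240

# Assumption: executors = ceil(SCALE_FACTOR / SF_PER_EXECUTOR).
# Integer ceiling division using bash arithmetic expansion.
EXECUTORS=$(( (SCALE_FACTOR + SF_PER_EXECUTOR - 1) / SF_PER_EXECUTOR ))

echo "expected executors: ${EXECUTORS}"   # prints "expected executors: 125"
```

A lower `--sf-per-executor` value would request more instances for the same scale factor under this assumption.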

The logs are quite useless:

controller

2023-08-28T01:27:49.231Z INFO Ensure step 1 jar file command-runner.jar
2023-08-28T01:27:49.232Z INFO StepRunner: Created Runner for step 1
INFO startExec 'hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --class ldbc.snb.datagen.LdbcDatagen s3://ldbc-snb-datagen-bi-2021-07/jars/ldbc_snb_datagen_2.12_spark3.2-0.5.1+16-d6bfc51f-jar-with-dependencies.jar --output-dir /ldbc_snb_datagen/build --scale-factor 30000 --num-threads 3000 --mode bi --format csv --explode-edges --format-options compression=gzip --generate-factors'
INFO Environment:
  PATH=/usr/lib64/qt-3.3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/aws/puppet/bin/
  SECURITY_PROPERTIES=/emr/instance-controller/lib/security.properties
  HISTCONTROL=ignoredups
  HISTSIZE=1000
  HADOOP_ROOT_LOGGER=INFO,DRFA
  JAVA_HOME=/etc/alternatives/jre
  AWS_DEFAULT_REGION=us-east-2
  LANG=en_US.UTF-8
  MAIL=/var/spool/mail/hadoop
  LOGNAME=hadoop
  PWD=/
  HADOOP_CLIENT_OPTS=-Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/s-090045419NSP6RN3ALBQ/tmp
  _=/etc/alternatives/jre/bin/java
  LESSOPEN=||/usr/bin/lesspipe.sh %s
  SHELL=/bin/bash
  QTINC=/usr/lib64/qt-3.3/include
  USER=hadoop
  HADOOP_LOGFILE=syslog
  HOSTNAME=ip-172-31-27-148
  QTDIR=/usr/lib64/qt-3.3
  HADOOP_LOG_DIR=/mnt/var/log/hadoop/steps/s-090045419NSP6RN3ALBQ
  EMR_STEP_ID=s-090045419NSP6RN3ALBQ
  QTLIB=/usr/lib64/qt-3.3/lib
  HOME=/home/hadoop
  SHLVL=1
  HADOOP_IDENT_STRING=hadoop
INFO redirectOutput to /mnt/var/log/hadoop/steps/s-090045419NSP6RN3ALBQ/stdout
INFO redirectError to /mnt/var/log/hadoop/steps/s-090045419NSP6RN3ALBQ/stderr
INFO Working dir /mnt/var/lib/hadoop/steps/s-090045419NSP6RN3ALBQ
INFO ProcessRunner started child process 19181
2023-08-28T01:27:49.234Z INFO HadoopJarStepRunner.Runner: startRun() called for s-090045419NSP6RN3ALBQ Child Pid: 19181
INFO Synchronously wait child process to complete : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO Process 19181 still running
INFO Process 19181 still running
INFO Process 19181 still running
INFO Process 19181 still running
INFO Process 19181 still running
INFO Process 19181 still running
INFO Process 19181 still running
INFO Process 19181 still running
INFO Process 19181 still running
INFO Process 19181 still running
INFO Process 19181 still running
INFO Process 19181 still running
INFO Process 19181 still running
INFO Process 19181 still running
INFO Process 19181 still running
INFO Process 19181 still running
INFO Process 19181 still running
INFO Process 19181 still running
INFO Process 19181 still running
INFO Process 19181 still running
INFO Process 19181 still running
INFO Process 19181 still running
INFO Process 19181 still running
INFO Process 19181 still running
INFO Process 19181 still running
INFO Process 19181 still running
INFO Process 19181 still running
INFO Process 19181 still running
INFO Process 19181 still running
INFO Process 19181 still running
INFO Process 19181 still running
INFO Process 19181 still running
INFO Process 19181 still running
INFO Process 19181 still running
INFO waitProcessCompletion ended with exit code 1 : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO total process run time: 31543 seconds
2023-08-28T10:13:33.044Z INFO Step created jobs: 
2023-08-28T10:13:33.044Z WARN Step failed with exitCode 1 and took 31543 seconds
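
For reference, the run time reported by the controller can be converted to hours and minutes. This small sketch uses only the 31543-second figure from the log above; the variable names are illustrative:

```shell
#!/usr/bin/env bash
# Total step run time in seconds, as reported by the controller log.
TOTAL_SECONDS=31543

HOURS=$(( TOTAL_SECONDS / 3600 ))
MINUTES=$(( (TOTAL_SECONDS % 3600) / 60 ))
REMAINDER=$(( TOTAL_SECONDS % 60 ))

printf '%dh %02dm %02ds\n' "${HOURS}" "${MINUTES}" "${REMAINDER}"   # prints "8h 45m 43s"
```

This matches the gap between the step start (01:27:49Z) and the failure message (10:13:33Z), and roughly agrees with the "2 hours plus ~7 hours" estimate given earlier.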

stdout

Reading scale factors..
Available scale factor configuration set 0.003
Available scale factor configuration set 0.1
Available scale factor configuration set 0.3
Available scale factor configuration set 1
Available scale factor configuration set 3
Available scale factor configuration set 10
Available scale factor configuration set 30
Available scale factor configuration set 100
Available scale factor configuration set 300
Available scale factor configuration set 1000
Available scale factor configuration set 3000
Available scale factor configuration set 10000
Available scale factor configuration set 30000
Number of scale factors read 13
Applied configuration of scale factor 30000
 ... Num Persons 77000000
 ... Start Year 2010
 ... Num Years 3
Done ... 49558 surnames were extracted 
Done ... 42970 given names were extracted 

stderr

See https://gist.github.com/szarnyasg/28673ecdda0325b59295a3fc8c70cc14

@szarnyasg
Member Author

szarnyasg commented Sep 5, 2023

This was caused by the huge messageId table. Fixed by 829b7a0.
