Parameter generation is very slow #51

ArnauPrat · 2017-11-03T15:51:54Z

According to reports from several users, for large datasets (e.g. SF1000) parameter generation becomes the most expensive part of the generation process (90 minutes to generate the data, 12 hours to generate the parameters). We should rethink its implementation (maybe porting it to a Hadoop job).

mingxiw · 2019-04-04T17:39:39Z

We at TigerGraph tried this data generation for SF-1000. It took 33+ hours. However, we could not find the parameter files under ldbc_snb_data/substitution_parameters. Has anyone successfully generated the parameters for SF-1000?

ArnauPrat · 2019-04-04T17:55:44Z

Can you look at the log files of parameter generation (parameters_bi.log and parameters_interactive.log)? Any hints there?

CongyanLi01 · 2019-04-04T20:55:57Z

Following up on the comment from TigerGraph above, here is what the two logs show.
For "parameters_bi.log":
loading input for parameter generation
Traceback (most recent call last):
  File "paramgenerator/generateparamsbi.py", line 410, in <module>
    sys.exit(main())
  File "paramgenerator/generateparamsbi.py", line 330, in main
    readfactors.load(personFactorFiles,activityFactorFiles, friendsFiles)
  File "/home/ubuntu/datagen/ldbc_snb_datagen/paramgenerator/readfactors.py", line 72, in load
    for line in f.readlines():
  File "/usr/lib/python2.7/codecs.py", line 696, in readlines
    return self.reader.readlines(sizehint)
  File "/usr/lib/python2.7/codecs.py", line 606, in readlines
    return data.splitlines(keepends)
MemoryError

and for "parameters_interactive.log":
loading input for parameter generation
Traceback (most recent call last):
  File "paramgenerator/generateparams.py", line 258, in <module>
    sys.exit(main())
  File "paramgenerator/generateparams.py", line 133, in main
    (personFactors, countryFactors, tagFactors, tagClassFactors, nameFactors, givenNames, ts, postHisto) = readfactors.load(personFactorFiles, activityFactorFiles, friendsFiles)
  File "/home/ubuntu/datagen/ldbc_snb_datagen/paramgenerator/readfactors.py", line 72, in load
    for line in f.readlines():
  File "/usr/lib/python2.7/codecs.py", line 696, in readlines
    return self.reader.readlines(sizehint)
  File "/usr/lib/python2.7/codecs.py", line 606, in readlines
    return data.splitlines(keepends)
MemoryError

It seems to be a memory problem. I added 'export HADOOP_CLIENT_OPTS="-Xmx200G"' to "run.sh", and my machine has 244GB of memory. Do you have any suggestions?
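Both tracebacks end in f.readlines(), which builds a list of every line of the file before the loop starts. A minimal sketch of the usual alternative, iterating over the file object so that lines are read lazily, is shown below. It assumes the factor files can be processed one line at a time, which the thread does not confirm; iter_factor_lines and summarize are hypothetical helpers, not functions from readfactors.py.

```python
import io


def iter_factor_lines(path, encoding="utf-8"):
    """Yield the lines of a text file one at a time, without the trailing newline.

    Hypothetical helper: iterating the file object reads lazily, whereas
    f.readlines() materializes the whole file as a list first.
    """
    with io.open(path, "r", encoding=encoding) as f:
        for line in f:
            yield line.rstrip("\n")


def summarize(path):
    """Example consumer: return (line_count, total_characters) in one streaming pass."""
    line_count = 0
    total_chars = 0
    for line in iter_factor_lines(path):
        line_count += 1
        total_chars += len(line)
    return line_count, total_chars
```

This only reduces the memory needed to read the input. If the script then keeps a per-line structure for the whole file in memory, that structure may still be too large at SF-1000.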

ArnauPrat · 2019-04-08T08:04:10Z

Parameter generation is implemented as a couple of Python scripts. That is why it is so slow: its execution is not parallelized in any way. Setting HADOOP_CLIENT_OPTS will have no effect on parameter generation.
The parameter generation scripts, under the folder "paramgenerator", take the "factor" files as input. These factor files, namely mXactivityFactors.txt, mXfriendList0.csv and mXpersonFactors.txt (where X is any number between 0 and NumberOfWorkers-1), are produced by Datagen during data generation and can be found under the /hadoop folder (either in the local filesystem if you ran in standalone mode, or in HDFS if you ran in distributed or pseudo-distributed mode).
If you can get these files, you can try to debug just the parameter generation part, without having to rerun the whole generation process.
Here is where the script is launched; the first parameter to the script is the location of the factor files.
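Before rerunning only the parameter generation step, it may help to check that the expected factor files are all present. The sketch below builds the file names from the naming pattern described above and reports any that are missing. It assumes the files sit in a directory on the local filesystem (standalone mode, or copied out of HDFS); factor_file_paths, missing_files, and the num_workers argument are hypothetical names for illustration.

```python
import os


def factor_file_paths(factor_dir, num_workers):
    """Return (person, activity, friends) lists of expected factor file paths.

    File names follow the pattern from the thread: mXpersonFactors.txt,
    mXactivityFactors.txt and mXfriendList0.csv for X in 0..num_workers-1.
    """
    person, activity, friends = [], [], []
    for x in range(num_workers):
        person.append(os.path.join(factor_dir, "m%dpersonFactors.txt" % x))
        activity.append(os.path.join(factor_dir, "m%dactivityFactors.txt" % x))
        friends.append(os.path.join(factor_dir, "m%dfriendList0.csv" % x))
    return person, activity, friends


def missing_files(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [p for p in paths if not os.path.isfile(p)]
```

For example, missing_files(person + activity + friends) returning an empty list would suggest the input set is complete for the given number of workers.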

szarnyasg · 2020-09-18T18:43:55Z

Closing this, but the story continues in ~~#206~~ #83.

ArnauPrat added the enhancement label Nov 3, 2017

szarnyasg mentioned this issue May 18, 2019

Rewrite parameter generator in Julia #83

Closed

10 tasks

szarnyasg self-assigned this Sep 28, 2019

szarnyasg closed this as completed Sep 18, 2020
