Parameter generation is very slow #51

Closed
ArnauPrat opened this issue Nov 3, 2017 · 5 comments

@ArnauPrat
Contributor

According to reports from several users, for large datasets (e.g. SF1000) parameter generation becomes the most expensive part of the generation process (90 minutes to generate the data, 12 hours to generate the parameters). We should rethink its implementation (maybe port it to a Hadoop job).

@mingxiw

mingxiw commented Apr 4, 2019

We at TigerGraph tried this data generation for SF-1000. It took 33+ hours. However, we could not find the parameter files under ldbc_snb_data/substitution_parameters. Has anyone successfully generated the parameters for SF-1000?

@ArnauPrat
Contributor Author

Can you look at the log files of parameter generation (parameters_bi.log and parameters_interactive.log)? Any hint there?

@CongyanLi01

Continuing from the TigerGraph comment above, here is what the two logs show.
For "parameters_bi.log":
loading input for parameter generation
Traceback (most recent call last):
  File "paramgenerator/generateparamsbi.py", line 410, in <module>
    sys.exit(main())
  File "paramgenerator/generateparamsbi.py", line 330, in main
    readfactors.load(personFactorFiles,activityFactorFiles, friendsFiles)
  File "/home/ubuntu/datagen/ldbc_snb_datagen/paramgenerator/readfactors.py", line 72, in load
    for line in f.readlines():
  File "/usr/lib/python2.7/codecs.py", line 696, in readlines
    return self.reader.readlines(sizehint)
  File "/usr/lib/python2.7/codecs.py", line 606, in readlines
    return data.splitlines(keepends)
MemoryError

And for "parameters_interactive.log":
loading input for parameter generation
Traceback (most recent call last):
  File "paramgenerator/generateparams.py", line 258, in <module>
    sys.exit(main())
  File "paramgenerator/generateparams.py", line 133, in main
    (personFactors, countryFactors, tagFactors, tagClassFactors, nameFactors, givenNames, ts, postHisto) = readfactors.load(personFactorFiles, activityFactorFiles, friendsFiles)
  File "/home/ubuntu/datagen/ldbc_snb_datagen/paramgenerator/readfactors.py", line 72, in load
    for line in f.readlines():
  File "/usr/lib/python2.7/codecs.py", line 696, in readlines
    return self.reader.readlines(sizehint)
  File "/usr/lib/python2.7/codecs.py", line 606, in readlines
    return data.splitlines(keepends)
MemoryError

Both scripts appear to run out of memory while loading their input. I added 'export HADOOP_CLIENT_OPTS="-Xmx200G"' to "run.sh", and my machine has 244 GB of memory. Do you have any suggestions?
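
Both tracebacks end in `f.readlines()`, which builds a list of every line in the file before the loop starts. As a rough illustration only (not the project's actual code), the sketch below iterates over the file object instead, so one line is held at a time. The function name `iter_factor_lines` and the UTF-8 encoding are assumptions made for the example.

```python
import io


def iter_factor_lines(path, encoding="utf-8"):
    """Yield the lines of a text file one at a time (hypothetical helper).

    Iterating over the file object reads incrementally, whereas
    f.readlines() materializes the whole file as a list first.
    The encoding default is an assumption, not taken from readfactors.py.
    """
    with io.open(path, "r", encoding=encoding) as f:
        for line in f:
            # Drop only the trailing newline; keep the rest of the line intact.
            yield line.rstrip("\n")
```

This only lowers the peak memory of the read itself. If the caller still stores every parsed record in dictionaries or lists, total memory use can remain high.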

@ArnauPrat
Contributor Author

Parameter generation is implemented as a couple of Python scripts. It is slow because its execution is not parallelized in any way. Setting HADOOP_CLIENT_OPTS has no effect on parameter generation.
The parameter generation scripts, in the "paramgenerator" folder, take the "factor" files as input. These files, namely mXactivityFactors.txt, mXfriendList0.csv and mXpersonFactors.txt (where X is any number between 0 and NumberOfWorkers-1), are produced by Datagen during data generation and can be found under the /hadoop folder (in the local filesystem if you ran in standalone mode, or in HDFS if you ran in distributed or pseudo-distributed mode).
If you can get these files, you can debug just the parameter generation part without rerunning the whole generation process.
Here is where the script is launched; the first parameter to the script is the location of the factor files.
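
To check that a complete set of factor files is present before debugging, the naming scheme described above can be enumerated with a small sketch like the following. The function `factor_file_paths` and its arguments are hypothetical; only the three file name patterns come from this comment.

```python
import os


def factor_file_paths(factor_dir, num_workers):
    """Build the expected factor file paths for workers 0..num_workers-1.

    Hypothetical helper: factor_dir is wherever the factor files were
    copied to, and num_workers stands in for NumberOfWorkers.
    """
    workers = range(num_workers)
    person = [os.path.join(factor_dir, "m%dpersonFactors.txt" % i) for i in workers]
    activity = [os.path.join(factor_dir, "m%dactivityFactors.txt" % i) for i in workers]
    friends = [os.path.join(factor_dir, "m%dfriendList0.csv" % i) for i in workers]
    return person, activity, friends
```

For example, `[p for group in factor_file_paths("/hadoop", 4) for p in group if not os.path.exists(p)]` would list any expected file that is missing from a local copy.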

@szarnyasg szarnyasg self-assigned this Sep 28, 2019
@szarnyasg
Member

szarnyasg commented Sep 18, 2020

Closing this, but the story continues in #206 and #83.
