[SYSTEMDS-3551] Extended Performance TestSuite #1850

Open · Sheypex wants to merge 19 commits into main

Sheypex commented Jun 23, 2023

[SYSTEMDS-3551] Extended performance testsuite

This contains new .dml and .sh scripts to include perftests for some of the components in scripts/nn.
As a start, two perftests were added: a simple SGD-trained regression neural network and a neural network classifier trained with Nesterov momentum, both as described in scripts/nn/README.md.
(Incidentally also resolved a bug with the batching of training samples in both of these examples in the readme.)

Additionally, a semi-broken perftest for staging/NCF.dml is included. As far as I am aware, the perftest scripts are fine (though currently untested), but the implementation of NCF in staging crashes on launch due to an as yet undetermined cause.

The general structure of the new tests follows observed standards of presently implemented perftests:

  • scripts/datagen houses individual .dml scripts for data generation
  • scripts/perftest/datagen contains .sh scripts that wrap these .dml scripts and designate parameters for the generation based on required file sizes
  • scripts/perftest/scripts houses .dml scripts that implement the workload that is to be tested
  • scripts/perftest contains .sh scripts to run individual or all perftests pertaining to a designated type of component

Currently, the parameters in the datagen scripts do not yet produce the expected output sizes of 80MB, 800MB, etc.; the generated files are too small.

The .sh scripts for the neural network components test various input sizes as given by the MAXMEM variable in runAll.sh, like other perftests. Additionally, they run each test both with a base number of epochs and with ten times that many epochs, e.g., 5 and 50 epochs. This seems appropriate since the number of epochs in neural network training is a proxy for the maximum number of individual training iterations.
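
A minimal sketch of the two-epoch-setting idea described above. The function name `epoch_settings` is hypothetical and not taken from the PR, and the loop only logs instead of launching a real training run:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): print the base epoch count
# and ten times that count, separated by a space.
epoch_settings() {
  local base="$1"
  echo "$base" "$((base * 10))"
}

# Run once per epoch setting. A real perftest would launch the
# training script here; this sketch only logs what it would do.
for epochs in $(epoch_settings 5); do
  echo "would train for ${epochs} epochs"
done
```

Keeping the multiplier in one helper makes it easy to change the ratio later without touching each test script.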

Finally, runAll.sh contains a flag to enable/disable the execution of neural network perftests as well as a flag to toggle the use of the GPU in these tests.
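
The two flags could look roughly like the following sketch. The variable names `RUN_NN` and `USE_GPU` and the function `build_opts` are assumptions for illustration; the actual names in runAll.sh may differ. `-stats` and `-gpu` are the SystemDS command-line options mentioned in this thread:

```shell
#!/usr/bin/env bash
# Hypothetical flag names; the actual variables in runAll.sh may differ.
RUN_NN=1     # 1 = run the neural network perftests
USE_GPU=0    # 1 = pass -gpu to SystemDS

# Build the SystemDS options for one run: always collect statistics,
# and add -gpu only when requested.
build_opts() {
  local use_gpu="$1"
  local opts="-stats"
  if [ "$use_gpu" -eq 1 ]; then
    opts="$opts -gpu"
  fi
  echo "$opts"
}

if [ "$RUN_NN" -eq 1 ]; then
  echo "NN perftests enabled, options: $(build_opts "$USE_GPU")"
else
  echo "NN perftests skipped"
fi
```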

@phaniarnab
Contributor

Thanks, @Sheypex, for the commit. I will have a look at the changes in a day or two.

@Baunsgaard Baunsgaard changed the title SYSTEMDS-3551 [SYSTEMDS-3551] Extended Performance TestSuite Jun 26, 2023
@phaniarnab
Contributor

It is a good start @Sheypex.

  • A few of the new files are missing license headers. Please add them.
  • The data generation scripts might produce invalid data, which cannot produce valid weights. Is it possible to reuse the existing scripts, genRandData4LogisticRegression, genRandData4MultiClassSVM etc.?
  • Is the new simple SGD script running? Can you execute the first two dataset sizes with -stats and paste the statistics here?

@Sheypex
Author

Sheypex commented Jun 26, 2023

Just fixed my implementation of the -gpu flag.
Perftests are running/working again.
Incidentally found out that there is some error with SystemDS utilizing my GPU: apparently some shared libraries can't be found? So I'm not sure what exactly the problem is there.
Fixed/added missing licenses.

Stats are as follows:
For training simple sgd on smallest data:

SystemDS Statistics:
Total elapsed time: 0.553 sec.
Total compilation time: 0.236 sec.
Total execution time: 0.317 sec.
Number of compiled Spark inst: 2.
Number of executed Spark inst: 0.
Cache hits (Mem/Li/WB/FS/HDFS): 6082/0/0/0/2.
Cache writes (Li/WB/FS/HDFS): 0/1/0/4.
Cache times (ACQr/m, RLS, EXP): 0.164/0.001/0.003/0.039 sec.
HOP DAGs recompiled (PRED, SB): 0/0.
HOP DAGs recompile time: 0.000 sec.
Spark ctx create time (lazy): 0.000 sec.
Spark trans counts (par,bc,col):0/0/0.
Spark trans times (par,bc,col): 0.000/0.000/0.000 secs.
Spark async. count (pf,bc,op): 0/0/0.
Total JIT compile time: 1.604 sec.
Total JVM GC count: 0.
Total JVM GC time: 0.0 sec.
Heavy hitter instructions:
# Instruction Time(s) Count
1 sp_csvrblk 0.163 2
2 write 0.039 4
3 ba+* 0.038 800
4 -* 0.010 640
5 + 0.009 645
6 * 0.006 647
7 / 0.006 323
8 r' 0.006 480
9 createvar 0.005 3850
10 rand 0.005 4

For training simple sgd on next larger data:

SystemDS Statistics:
Total elapsed time: 0.861 sec.
Total compilation time: 0.252 sec.
Total execution time: 0.610 sec.
Number of compiled Spark inst: 2.
Number of executed Spark inst: 0.
Cache hits (Mem/Li/WB/FS/HDFS): 18242/0/0/0/2.
Cache writes (Li/WB/FS/HDFS): 1/962/0/4.
Cache times (ACQr/m, RLS, EXP): 0.232/0.001/0.012/0.045 sec.
HOP DAGs recompiled (PRED, SB): 0/0.
HOP DAGs recompile time: 0.000 sec.
Spark ctx create time (lazy): 0.000 sec.
Spark trans counts (par,bc,col):0/0/0.
Spark trans times (par,bc,col): 0.000/0.000/0.000 secs.
Spark async. count (pf,bc,op): 0/0/0.
Total JIT compile time: 2.509 sec.
Total JVM GC count: 0.
Total JVM GC time: 0.0 sec.
Heavy hitter instructions:
# Instruction Time(s) Count
1 sp_csvrblk 0.231 2
2 ba+* 0.143 2400
3 -* 0.053 1920
4 write 0.045 4
5 r' 0.018 1440
6 + 0.015 1925
7 rightIndex 0.014 960
8 * 0.012 1927
9 uak+ 0.010 960
10 rmvar 0.010 13450

@Sheypex
Author

Sheypex commented Jun 26, 2023

With respect to the datagen:
Looking it over, genRandData4LogisticRegression and genRandData4MultiClassSVM should work fine as datagen scripts for the SGD test case, where only a vector of target data is required.
I suppose genRandData4Kmeans may work for the classification use case, but I'm less sure about that.

I've been looking into adding a convolution/deep learning perftest along the lines of the MNIST examples and since we're on the topic of datagen: Using MNIST (or a subset of corresponding size) is probably preferable to generating random data, correct?

@phaniarnab
Contributor

Thanks for the stats. Do not worry about the GPU-related issue. It is fine if you cannot manage to make the GPU work.
I agree about using MNIST for the NN scripts. Otherwise, try to stick to the existing datagen scripts.

Sheypex added 4 commits June 26, 2023 18:29
…n and genRandData4Multinomial. now also running tests for sparse and dense data. not yet utilizing generated test data sets
…dataset based on MAXMEM setting, and using whole MNIST only for biggest MAXMEM
Comment on lines 52 to 53
target_num_train=$(python -c "from math import floor; print( ${min_num_examples_train} + floor(${span_num_examples_train} * ${percent_size}))") # todo couldn't work out how to do this using bc so using slower python calls instead
target_num_test=$(python -c "from math import floor; print( ${min_num_examples_test} + floor(${span_num_examples_test} * ${percent_size}))")
Contributor

I recommend not inlining Python calls here. You could find another way, or push some of the logic into the dml script.
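
One possible way to avoid the Python call is awk, sketched below. The helper name `target_count` and the numeric values are illustrative assumptions; the real script derives its values from MAXMEM. awk's `int()` truncates toward zero, which matches `floor()` only for the non-negative values assumed here:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): compute min + floor(span * percent)
# with awk instead of an inline Python call.
# int() truncates toward zero, which equals floor() for non-negative inputs.
target_count() {
  awk -v min="$1" -v span="$2" -v pct="$3" \
    'BEGIN { printf "%d\n", min + int(span * pct) }'
}

# Illustrative values only; the real script derives these from MAXMEM.
min_num_examples_train=1000
span_num_examples_train=59000
percent_size=0.25

target_num_train=$(target_count "$min_num_examples_train" \
  "$span_num_examples_train" "$percent_size")
echo "$target_num_train"
```

The same helper could be reused for the test split by passing the corresponding test variables.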

@Sheypex
Author

Sheypex commented Jul 6, 2023

As far as I can tell, the perftest for conv2d (which in turn uses the MNIST LeNet implementation) is done now. However, I'm getting an error in the LeNet implementation:

An Error Occurred : 
        HopsException -- ERROR: ./../../nn/examples/mnist_lenet.dml line 282, column 4 -- In LeftIndexingOp Hop, error in constructing Lops 
        HopsException -- ERROR: ./nn/layers/softmax.dml line 53, column 2 -- error constructing Lops for UnaryOp Hop -- 
IllegalCallerException -- java.nio is not open to unnamed module @6a7c0ffd

I've been staring at the source of the LeNet implementation in nn/examples for some time now, but I can't pinpoint the actual problem.

I'm guessing the sizes of the softmax output and the probs buffer may be mismatched? (lines 279-282 in nn/examples/mnist_lenet.dml)
But I feel like an error like that would produce a different error message

probs_batch = softmax::forward(outa4)
# Store predictions
probs[beg:end,] = probs_batch

Any idea perhaps on how to fix this?

@Baunsgaard
Contributor

Could you run it again with the '-debug' argument?
Also, this IO error is typically related to operating system or JDK issues. What are you using?
Please run 'java --version' in your terminal and reply with the output.

@Sheypex
Author

Sheypex commented Jul 7, 2023

I'm on JDK 17:

openjdk version "17.0.7" 2023-04-18
OpenJDK Runtime Environment (build 17.0.7+7-Ubuntu-0ubuntu123.04)
OpenJDK 64-Bit Server VM (build 17.0.7+7-Ubuntu-0ubuntu123.04, mixed mode, sharing)

Running with -debug yields this:

An Error Occurred : 
        HopsException -- ERROR: ./../../nn/examples/mnist_lenet.dml line 282, column 4 -- In LeftIndexingOp Hop, error in constructing Lops 
        HopsException -- ERROR: ./nn/layers/softmax.dml line 53, column 2 -- error constructing Lops for UnaryOp Hop -- 

IllegalCallerException -- java.nio is not open to unnamed module @6a7c0ffd

org.apache.sysds.hops.HopsException: ERROR: ./../../nn/examples/mnist_lenet.dml line 282, column 4 -- In LeftIndexingOp Hop, error in constructing Lops 
at org.apache.sysds.hops.LeftIndexingOp.constructLops(LeftIndexingOp.java:155)
at org.apache.sysds.hops.DataOp.constructLops(DataOp.java:311)
at org.apache.sysds.parser.DMLTranslator.constructLops(DMLTranslator.java:435)
at org.apache.sysds.parser.DMLTranslator.constructLops(DMLTranslator.java:400)
at org.apache.sysds.parser.DMLTranslator.constructLops(DMLTranslator.java:424)
at org.apache.sysds.parser.DMLTranslator.constructLops(DMLTranslator.java:339)
at org.apache.sysds.api.DMLScript.execute(DMLScript.java:457)
at org.apache.sysds.api.DMLScript.executeScript(DMLScript.java:320)
at org.apache.sysds.api.DMLScript.main(DMLScript.java:208)
Caused by: org.apache.sysds.hops.HopsException: ERROR: ./nn/layers/softmax.dml line 53, column 2 -- error constructing Lops for UnaryOp Hop -- 

at org.apache.sysds.hops.UnaryOp.constructLops(UnaryOp.java:180)
at org.apache.sysds.hops.BinaryOp.constructLopsBinaryDefault(BinaryOp.java:503)
at org.apache.sysds.hops.BinaryOp.constructLops(BinaryOp.java:237)
at org.apache.sysds.hops.LeftIndexingOp.constructLops(LeftIndexingOp.java:145)
... 8 more
Caused by: java.lang.IllegalCallerException: java.nio is not open to unnamed module @6a7c0ffd
at java.base/java.lang.Module.addOpens(Module.java:836)
at org.apache.sysds.runtime.controlprogram.context.SparkExecutionContext.handleIllegalReflectiveAccessSpark(SparkExecutionContext.java:209)
at org.apache.sysds.runtime.controlprogram.context.SparkExecutionContext$SparkClusterConfig.<init>(SparkExecutionContext.java:1831)
at org.apache.sysds.runtime.controlprogram.context.SparkExecutionContext.getSparkClusterConfig(SparkExecutionContext.java:1753)
at org.apache.sysds.runtime.controlprogram.context.SparkExecutionContext.getBroadcastMemoryBudget(SparkExecutionContext.java:1763)
at org.apache.sysds.hops.AggBinaryOp.optFindMMultMethodSpark(AggBinaryOp.java:1093)
at org.apache.sysds.hops.AggBinaryOp.constructLops(AggBinaryOp.java:217)
at org.apache.sysds.hops.BinaryOp.constructLopsBinaryDefault(BinaryOp.java:514)
at org.apache.sysds.hops.BinaryOp.constructLops(BinaryOp.java:237)
at org.apache.sysds.hops.BinaryOp.constructLopsBinaryDefault(BinaryOp.java:503)
at org.apache.sysds.hops.BinaryOp.constructLops(BinaryOp.java:237)
at org.apache.sysds.hops.UnaryOp.constructLops(UnaryOp.java:171)
... 11 more

@phaniarnab
Contributor

SystemDS is not tested for JDK 17. Our official support version is 11.
Can you please downgrade to JDK 11 and try?
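
A perftest runner could guard against this kind of mismatch up front. The sketch below is an assumption, not part of the PR: `java_major` and `check_java` are hypothetical helpers, and the expected version 11 is taken from the comment above:

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the major version from a Java version string.
# Handles the legacy "1.8.0_292" scheme and the modern "11.0.19" / "17.0.7" scheme.
java_major() {
  local v="$1"
  case "$v" in
    1.*) v="${v#1.}" ;;   # legacy scheme: drop the leading "1."
  esac
  echo "${v%%[._-]*}"     # keep everything before the first '.', '_' or '-'
}

# Hypothetical check: succeed only for Java 11, the version named in this thread.
check_java() {
  local major
  major=$(java_major "$1")
  if [ "$major" != "11" ]; then
    echo "warning: Java '$major' detected; this thread reports Java 11 as the supported version"
    return 1
  fi
  echo "Java 11 detected"
}

# Only query the JVM if one is installed; 'java -version' prints to stderr.
if command -v java >/dev/null 2>&1; then
  ver=$(java -version 2>&1 | awk -F '"' '/version/ { print $2; exit }')
  check_java "$ver" || true
fi
```

Separating the string parsing from the JVM call keeps the parsing logic usable on machines without Java installed.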

@Sheypex
Author

Sheypex commented Jul 10, 2023

Ok, tested on Java 11.
Can confirm, it was apparently just the Java version.
Also just adjusted the number of epochs in the MNIST test because it was just taking too long.
Might want to consider reducing the number of epochs from 5 and 50 down to 5 and 25.

@phaniarnab
Contributor

Glad that worked.
Are both the NN tests working? I understand NCF is untested and may have bugs. If so, comment out the calls to the NCF files for now, so that the perf tests don't fail in the middle.
Also, please summarize the changes and additions.

@Sheypex
Author

Sheypex commented Jul 10, 2023

  • NN classifier and regression tests are working
  • conv2d/MNIST test has been found to work on Java 11 but not on Java 17
  • NCF has the same error on Java 11 as on Java 17
  • reduced the number of epochs for the MNIST test again, since the runs were still taking way too long; with 5 and 25 epochs they now take about 15 minutes in total for MAXMEM=800

Overall summary:

Datagen

  • added datagen scripts for NN regression and classification,
    • only new shell script, backend uses existing .dml implementations for regression and classification datagen
  • for NCF,
    • data is generated as demonstrated in scripts/nn/examples/ncf-dummy-data.dml
  • and for MNIST (conv2d)
    • added a separate script that downloads the MNIST dataset in .csv format from a GitHub repo, and additional scripts to trim this data down into separate datasets given the MAXMEM flag in runAll.sh
  • Generally:
    • scripts/datagen has respective .dml implementations
    • scripts/perftest/datagen has the .sh parts of this datagen

Perftests

  • added perftest scripts for NN regression and classification,
    • NN tests run on a sparse and a dense input dataset
  • for NCF,
    • NCF perftest is structurally complete and should work but is untested, as the .dml NCF implementation fails
  • and for MNIST (conv2d)
  • Generally:
    • These perftests adhere to the common structure of other tests: scripts/perftest/scripts houses .dml implementations that are to be tested, scripts/perftest/run[xyz].sh implement staging data and collecting timing data of test runs
    • each test runs two rounds per dataset, each with a different number of training epochs
    • tests are split into a training test and a simple prediction test: the prediction test only runs a single prediction to check the accuracy/loss of the trained model

Miscellaneous

  • runAll.sh now has a flag to enable the use of the GPU for NN, NCF, and MNIST (conv2d) tests and a flag to enable all of these
    • NN tests are currently on by default, i.e., the flag to run them is set in runAll.sh, while the use of the GPU is disabled
  • NCF test and datagen have been disabled in runAll.sh, since the .dml NCF implementation fails
  • example algorithms in scripts/nn/examples/README.md have been altered to correctly pick and batch training data

@phaniarnab
Contributor

Great. Thanks @Sheypex, for your contribution. 👍🏽

@j143 j143 added this to the systemds-3.2.0 milestone Dec 4, 2023