
significant performance regression in SpMV #13449

Closed
zheng-da opened this issue Nov 29, 2018 · 4 comments · Fixed by #13501

Comments

zheng-da (Contributor) commented Nov 29, 2018

It seems #12380 causes a significant performance regression in SpMV: about a 3x slowdown on a p3.16xlarge. The main reason is that the PR leaves only a small number of OMP threads to perform the computation.

Here is the minimal code for reproducing the bug. The problem seems to occur only when a model is initialized on multiple GPUs.

Use the command below to run the script:

python3 sse_batch.py --graph-file ../../data/5_5_csr.nd --gpu 8

The CSR file can be downloaded with:

aws s3 cp s3://haibin-dgl/5_5_csr.nd .

"""
Learning Steady-States of Iterative Algorithms over Graphs
Paper: http://proceedings.mlr.press/v80/dai18a.html

"""
import argparse
import random
import numpy as np
import time
import math
import mxnet as mx
from mxnet import gluon

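# gcn_msg and gcn_reduce below are carried over from the full sse_batch.py
# script; only NodeUpdate is exercised in this minimal repro.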
def gcn_msg(edges):
    # TODO should we use concat?
    return {'m': mx.nd.concat(edges.src['in'], edges.src['h'], dim=1)}

def gcn_reduce(nodes):
    return {'accum': mx.nd.sum(nodes.mailbox['m'], 1) / nodes.mailbox['m'].shape[1]}

class NodeUpdate(gluon.Block):
    def __init__(self, out_feats, activation=None, alpha=0.1, **kwargs):
        super(NodeUpdate, self).__init__(**kwargs)
        self.linear1 = gluon.nn.Dense(out_feats, activation=activation)
        # TODO what is the dimension here?
        self.linear2 = gluon.nn.Dense(out_feats)
        self.alpha = alpha

    def forward(self, in_data, hidden_data, accum):
        tmp = mx.nd.concat(in_data, accum, dim=1)
        hidden = self.linear2(self.linear1(tmp))
        return hidden_data * (1 - self.alpha) + self.alpha * hidden

def main(args, data):
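    # Initializing the Gluon block on multiple GPU contexts is what appears to
    # trigger the issue; the SpMV timing below then runs with too few OMP threads.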
    update_hidden_train = NodeUpdate(16, 'relu')
    train_ctxs = []
    for i in range(args.gpu):
        train_ctxs.append(mx.gpu(i))
    update_hidden_train.initialize(ctx=train_ctxs)

    csr = data.astype('float32')
    dns = mx.nd.ones((csr.shape[1], 200))
    mx.nd.waitall()
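    # Time three sparse (CSR) x dense dot products on the CPU.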
    t0 = time.time()
    for i in range(3):
        out = mx.nd.dot(csr, dns)
        out.wait_to_read()
        print(i, time.time() - t0)
        mx.nd.waitall()

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='GCN')
    parser.add_argument("--graph-file", type=str, default="",
            help="graph file")
    parser.add_argument("--gpu", type=int, default=-1,
            help="gpu")
    args = parser.parse_args()

    # load and preprocess dataset
    csr = mx.nd.load(args.graph_file)[0]
    rets1 = main(args, csr)
    #rets2 = main(args, data)
    #for hidden1, hidden2 in zip(rets1, rets2):
    #    print("hidden: " + str(mx.nd.sum(mx.nd.abs(hidden1 - hidden2)).asnumpy()))
vrakesh (Contributor) commented Nov 29, 2018

@zheng-da Thank you for reporting the regression in sparse matrix vector multiplication.

vrakesh (Contributor) commented Nov 29, 2018

@mxnet-label-bot add [Performance]

anirudh2290 (Member) commented

Looking at this. I tried out your example and noticed a more than 3x speed drop when comparing runs with and without the change.

anirudh2290 (Member) commented

Found the root cause of the issue: after PR #12380, omp_thread_max_ is mutated in set_reserve_cores. This means that for each GPU worker, omp_thread_max_ keeps dropping; with 8 GPU workers it drops until it reaches 1. After that, the dot operator execution internally calls GetRecommendedOMPThreadCount, which returns omp_thread_max_, now 1, so the dot operator executes on a single thread. For now, reverting the PR to the old behavior is a good option. We should also try to better understand the cause of the segfault that motivated PR #12380 and come up with a different fix.
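For reference, a minimal Python sketch modeling the behavior described above (this is not the MXNet C++ engine code; the SharedOMPState class, the 32-core count, and the 4-cores-per-worker reservation are made up purely for illustration). Because each reservation is subtracted from the shared maximum itself, the calls made by the 8 GPU workers compound and the recommended thread count seen by the dot operator bottoms out at 1.

# Illustrative model only -- not MXNet source. Names and numbers are hypothetical.
class SharedOMPState(object):
    def __init__(self, omp_thread_max):
        self.omp_thread_max = omp_thread_max          # shared, mutable maximum

    def set_reserve_cores(self, reserved):
        # The reservation is subtracted from the maximum itself, so each
        # GPU worker's call compounds with the previous ones.
        self.omp_thread_max = max(1, self.omp_thread_max - reserved)

    def get_recommended_omp_thread_count(self):
        return self.omp_thread_max

state = SharedOMPState(omp_thread_max=32)             # hypothetical host core count
for _ in range(8):                                    # one call per GPU worker
    state.set_reserve_cores(4)                        # hypothetical per-worker reservation
print(state.get_recommended_omp_thread_count())       # -> 1: dot runs single-threaded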
