
suggestions for performance #18

Open

peterukk opened this issue Sep 30, 2019 · 9 comments

Comments
@peterukk

I'm using this package in a very performance-critical setting. I managed to make the inference quite a bit faster by making some changes (some of them specific to my model). Below is the inference for a "flat" feed-forward model which takes a 1D array of inputs. I also experimented with feeding in multiple samples at a time (2D array), which replaces matrix-vector multiplications with matrix-matrix ones. This may be faster on some platforms/models (for me it was a wash), but in that case make sure to use SGEMM/DGEMM to replace the matmul call.

subroutine output_flatmodel_opt_sig(self, x, neurons, output)
    ! Use forward propagation to compute the output of the network.
    ! For computational efficiency, the following changes are implemented:
    ! 1) outputs are allocated outside of the function,
    ! 2) an explicit-shape intermediate array is used, which assumes the number of neurons is the same for all hidden layers,
    ! 3) activation functions are replaced by a subroutine that modifies its argument in place (sigmoid); the activation of the final layer is removed (linear activation = redundant 1:1 copy),
    ! 4) matmul is replaced by a custom function which is often faster than matmul for matrix-vector multiplication,
    ! 5) weights have been pre-transposed in the load routine (class variable w_transposed).
    ! This procedure was much faster than the original when using gfortran -O3 -march=native or ifort -O3.
    ! For lower optimization levels the custom function (4) may be SLOWER.
    class(network_type),    intent(in)  :: self
    integer,                intent(in)  :: neurons
    real(wp), dimension(:), intent(in)  :: x
    real(wp), dimension(:), intent(out) :: output
    ! Local variables
    real(wp), dimension(neurons)        :: a
    integer :: n

    associate(layers => self % layers)
      a = matvecmul(layers(1) % w_transposed,x,neurons,size(x)) + layers(2) % b
      ! sigmoid activation: using an "inout" subroutine to avoid array copy 
      call sigmoid_subroutine(a)
      ! INTERMEDIATE LAYERS
      do n = 3, size(layers)-1
        a = matvecmul(layers(n-1) % w_transposed, a, neurons, neurons) + layers(n) % b
        call sigmoid_subroutine(a)
      end do
      ! LAST LAYER (LINEAR ACTIVATION = do nothing, just add biases)
      ! after the loop, n = size(layers)
      output = matvecmul(layers(n-1) % w_transposed, a, size(output), neurons) + layers(n) % b
    end associate
  end subroutine
  function matvecmul(matA, vecB, nrow, ncol)
    ! Explicit-shape matrix-vector product: accumulate columns of matA
    ! scaled by the elements of vecB.
    implicit none
    integer,  intent(in) :: nrow, ncol
    real(wp), intent(in) :: matA(nrow,ncol)
    real(wp), intent(in) :: vecB(ncol)
    real(wp) :: matvecmul(nrow)
    integer  :: j

    matvecmul = 0.0_wp
    do j = 1, ncol
      matvecmul = matvecmul + matA(:,j) * vecB(j)
    end do
  end function matvecmul
  subroutine sigmoid_subroutine(x)
    ! Apply the logistic sigmoid in place
    real(wp), dimension(:), intent(inout) :: x
    x = 1.0_wp / (1.0_wp + exp(-x))
  end subroutine
@milancurcic
Member

Thanks Peter, it all makes sense considering that equal-size hidden layers allow for fewer allocations and copies. What kind of performance improvement are you getting? I'm interested in both GNU and Intel compilers.

@peterukk
Author

peterukk commented Oct 1, 2019

gfortran -ffree-line-length-none -std=f2008 -pg -march=native -O3 --fast-math
Elapsed time on output: 0.939000010
Elapsed time on output_opt_sig: 0.680999994
Elapsed time on output_flatmodel_opt_sig: 0.669000030

gfortran -ffree-line-length-none -m64 -std=f2008 -march=native -O3 --fast-math -funroll-loops -ftree-loop-linear -fprefetch-loop-arrays

Elapsed time on output: 0.732999980
Elapsed time on output_opt_sig: 0.486000001
Elapsed time on output_flatmodel_opt_sig: 0.485000014

This is on gfortran-7, an Intel processor, and a model with (50,50) hidden neurons, 21 inputs and 256 outputs. I don't have the Intel compilers on this computer, but I do at home; I can get back to you.

  subroutine output_opt_sig(self, x, output)
    class(network_type),    intent(in)  :: self
    real(wp), dimension(:), intent(in)  :: x
    real(wp), dimension(:), intent(out) :: output
    ! Local variables
    real(wp), allocatable   :: a(:)
    integer,  dimension(2)  :: matsize
    integer                 :: n

    associate(layers => self % layers)
      matsize = shape(layers(1) % w_transposed)
      a = matvecmul(layers(1) % w_transposed, x, matsize(1), matsize(2)) + layers(2) % b
      ! sigmoid activation: using an "inout" subroutine to avoid array copy 
      call sigmoid_subroutine(a)
      ! INTERMEDIATE LAYERS
      do n = 3, size(layers)-1
        matsize = shape(layers(n-1) % w_transposed)
        a = matvecmul(layers(n-1) % w_transposed, a, matsize(1), matsize(2)) + layers(n) % b
        call sigmoid_subroutine(a)
      end do
      ! LAST LAYER (LINEAR ACTIVATION = do nothing, just add biases)
      matsize = shape(layers(n-1) % w_transposed)
      output = (matvecmul(layers(n-1) % w_transposed, a, matsize(1), matsize(2)) + layers(n) % b)
    end associate
  end subroutine

@peterukk
Author

peterukk commented Oct 1, 2019

On ifort the differences are much bigger:

ifort -O3 -mkl

Elapsed time on output: 0.8096000
Elapsed time on output_opt_sig: 0.1968000
Elapsed time on output_flatmodel_opt_sig: 0.1787000
Elapsed time on output_sgemm_sig: 0.1423000

Here the last routine processes all the input data at once, as mentioned; for the other ones I called them in a loop over samples, roughly as in the sketch below.
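
How the two approaches might be driven (illustrative only; net, x_all, y_all and nsamples are hypothetical names, assuming x_all has shape (features, nsamples) and y_all has shape (outputs, nsamples)):

  ! Per-sample routine: one matrix-vector product per layer and sample
  do isample = 1, nsamples
    call output_opt_sig(net, x_all(:,isample), y_all(:,isample))
  end do
  ! Batched routine: one matrix-matrix product (sgemm) per layer for all samples
  call output_sgemm_sig(net, x_all, nsamples, y_all)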

subroutine output_sgemm_sig(self, x, num_sample, output)
    ! Use this routine with a 2D input data array to process all the samples simultaneously in a feed-forward network.
    ! Replace the sigmoid activation subroutine with whatever activation you're using.
    ! Author: Peter Ukkonen
    ! sgemm = single precision (wp = sp); use dgemm for double precision
    class(network_type),      intent(in)    :: self
    real(wp), dimension(:,:), intent(in)    :: x      ! (features, num_sample)
    real(wp), dimension(:,:), intent(out)   :: output ! (outputs, num_sample)
    integer,                  intent(in)    :: num_sample
    ! Local variables
    real(wp), allocatable   :: a(:,:), a_next(:,:)
    real(wp)                :: alpha, beta
    integer,  dimension(2)  :: matsize
    integer                 :: n, isample

    alpha = 1.0_wp
    beta = 0.0_wp
    output = 0.0_wp

    associate(layers => self % layers)
      matsize = shape(layers(1) % w_transposed)
      allocate(a(matsize(1),num_sample))
      call sgemm("N","N",matsize(1), num_sample, matsize(2), alpha, layers(1) % w_transposed, matsize(1), x, matsize(2), beta, a, matsize(1))
      ! equivalent: a = a + spread(layers(2) % b, dim=2, ncopies=num_sample)
      do isample = 1, num_sample
        a(:,isample) = a(:,isample) + layers(2) % b
      end do
      call sigmoid_subroutine_mat(a)

      ! INTERMEDIATE LAYERS
      do n = 3, size(layers)-1
        matsize = shape(layers(n-1) % w_transposed)
        allocate(a_next(matsize(1),num_sample))

        call sgemm("N","N",matsize(1),num_sample,matsize(2),alpha,layers(n-1) % w_transposed,matsize(1),a,matsize(2),beta,a_next,matsize(1))
        deallocate(a)
        ! equivalent: a_next = a_next + spread(layers(n) % b, dim=2, ncopies=num_sample)
        do isample = 1, num_sample
          a_next(:,isample) = a_next(:,isample) + layers(n) % b
        end do
        call sigmoid_subroutine_mat(a_next)
        a = a_next
        deallocate(a_next)
      end do

      matsize = shape(layers(n-1) % w_transposed)
      call sgemm("N","N",matsize(1), num_sample, matsize(2), alpha, layers(n-1) % w_transposed, matsize(1), a, matsize(2), beta, output, matsize(1))
      ! equivalent: output = output + spread(layers(n) % b, dim=2, ncopies=num_sample)
      do isample = 1, num_sample
        output(:,isample) = output(:,isample) + layers(n) % b
      end do
    end associate
  end subroutine
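
For completeness, sigmoid_subroutine_mat is not shown above; a minimal 2D in-place counterpart of sigmoid_subroutine could look like this (a sketch, not the actual routine used in the benchmark):

  subroutine sigmoid_subroutine_mat(x)
    ! Sketch of the 2D companion to sigmoid_subroutine: apply the logistic
    ! sigmoid elementwise and in place on a (neurons, num_sample) array
    real(wp), dimension(:,:), intent(inout) :: x
    x = 1.0_wp / (1.0_wp + exp(-x))
  end subroutine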

@milancurcic
Member

Thanks. Do you know if any of these could be implemented in the existing class that doesn't assume equal sizes of hidden layers?

The only opportunity I see is to allocate the activation arrays once, rather than re-allocating on assignment in every iteration, in which case the subroutine to calculate activations in place may be effective (roughly sketched below).
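
One way that idea could look, sketched against the routines above (hypothetical and untested; it sizes two work arrays to the widest hidden layer once and then updates sections of them in place):

  subroutine output_prealloc_sketch(self, x, output)
    ! Sketch only: avoid re-allocation on assignment by sizing the work
    ! arrays once and operating on array sections
    class(network_type),    intent(in)  :: self
    real(wp), dimension(:), intent(in)  :: x
    real(wp), dimension(:), intent(out) :: output
    real(wp), allocatable :: a(:), a_next(:)
    integer :: n, nmax, sz, sz_next

    associate(layers => self % layers)
      ! Widest layer determines the work-array size
      nmax = 0
      do n = 2, size(layers)
        nmax = max(nmax, size(layers(n) % b))
      end do
      allocate(a(nmax), a_next(nmax))

      sz = size(layers(2) % b)
      a(1:sz) = matvecmul(layers(1) % w_transposed, x, sz, size(x)) + layers(2) % b
      call sigmoid_subroutine(a(1:sz))

      do n = 3, size(layers) - 1
        sz_next = size(layers(n) % b)
        a_next(1:sz_next) = matvecmul(layers(n-1) % w_transposed, a(1:sz), sz_next, sz) + layers(n) % b
        call sigmoid_subroutine(a_next(1:sz_next))
        a(1:sz_next) = a_next(1:sz_next)
        sz = sz_next
      end do

      n = size(layers)
      output = matvecmul(layers(n-1) % w_transposed, a(1:sz), size(output), sz) + layers(n) % b
    end associate
  end subroutine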

Btw, the 10-output-batch branch introduces a function to compute outputs per batch (contributed by @jvdp1), although it's just a wrapper around the output for a single set of inputs, so it doesn't have any of the performance improvements that you propose. The output_batch method could be a place to introduce these, if any are feasible. I will take some time soon to clean up and merge the new methods into master.

@peterukk
Author

peterukk commented Oct 2, 2019

output_opt_sig and output_sgemm_sig do not assume equal-sized hidden layers (I tested that they work). As you can see, the former is almost as fast as the flat-model one. Processing the inputs as matrices is fastest on ifort. I believe the most significant optimizations for me were the use of matvecmul instead of matmul, removal of the output activation, and the use of subroutines instead of functions to avoid array copies. I think you could quite easily implement these changes. I believe the weights need to be pre-transposed for matvecmul to be faster than matmul; a sketch of doing this at load time follows.
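
The transposition itself is a one-off cost at load time. A minimal sketch, assuming each layer stores its loaded weights in an allocatable w component and has an allocatable w_transposed component (the routine name is hypothetical):

  subroutine pretranspose_weights(self)
    ! Sketch only: build w_transposed once after loading, so matvecmul can
    ! accumulate contiguous columns of the transposed weight matrix
    class(network_type), intent(inout) :: self
    integer :: n
    do n = 1, size(self % layers)
      if (allocated(self % layers(n) % w)) then
        self % layers(n) % w_transposed = transpose(self % layers(n) % w)
      end if
    end do
  end subroutine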

@peterukk
Author

peterukk commented Oct 2, 2019

The only opportunity I see is to allocate the activation arrays once, rather than re-allocating on assignment in every iteration, in which case the subroutine to calculate activations in place may be effective.

That might be a good idea, although based on the previous results the allocation overhead isn't that big. I'm by no means an expert on Fortran, by the way; this has been an interesting learning experience!

@milancurcic
Member

output_opt_sig and output_sgemm_sig do not assume equal-sized hidden layers (I tested that they work)

OK, great, thanks, I overlooked that. In general, I don't want to break the high-level functional API. However, I'm all for tweaking the internal implementation if the performance benefits are clear. You give clear ideas and examples of how to do this. I will play with it and let you know.

@peterukk
Author

peterukk commented Nov 5, 2019

You're welcome! I have now implemented these changes in a more general manner here: https://github.com/peterukk/rte-rrtmgp/blob/master/neural/

Most notably, activation functions have been replaced with subroutines, and these procedures have a 2D array variant which is used in, e.g., output_sgemm.

@peterukk
Author

peterukk commented Jan 7, 2020

FYI, I have now tested using elemental subroutines as activation functions. This is considerably faster (and works with both 1D and 2D arrays), but unfortunately, pointers do not work with elemental procedures :/ I'm disappointed that object-oriented programming leads to a performance loss in this instance.
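
A minimal sketch of what such an elemental activation might look like (illustrative only, not copied from the linked code):

  elemental subroutine sigmoid_elemental(x)
    ! One definition covers scalars, 1D and 2D arrays, applied elementwise,
    ! but an elemental procedure cannot be the target of a procedure pointer
    real(wp), intent(inout) :: x
    x = 1.0_wp / (1.0_wp + exp(-x))
  end subroutine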
