
suggestions for performance #18

Open

peterukk opened this issue Sep 30, 2019 · 9 comments

Comments
@peterukk

I'm using this package in a very performance-critical setting. I managed to make the inference quite a bit faster by making some changes (some of them specific to my model). Below is the inference for a "flat" feed-forward model which takes a 1D array of inputs. I also experimented with feeding in multiple samples at a time (2D array), which replaces matrix-vector multiplications with matrix-matrix ones. This may be faster on some platforms/models (for me it was a wash), but in that case make sure to use SGEMM/DGEMM to replace the matmul call.

subroutine output_flatmodel_opt_sig(self, x, neurons, output)
    ! Use forward propagation to compute the output of the network.
    ! For computational efficiency, the following changes are implemented:
    ! 1) outputs are allocated outside of the function,
    ! 2) an explicit-shape intermediate array is used, which assumes the number of neurons is the same for all hidden layers,
    ! 3) activation functions are replaced by a subroutine that modifies its argument in place (sigmoid); the activation of the final layer is removed (linear activation = redundant 1:1 copy),
    ! 4) matmul is replaced by a custom function which is often faster than matmul for matrix-vector multiplication,
    ! 5) weights have been pre-transposed in the load routine (class variable w_transposed).
    ! This procedure was much faster than the original when using gfortran -O3 -march=native or ifort -O3.
    ! For lower optimization levels the custom function (4) may be SLOWER.
    class(network_type),    intent(in)  :: self
    integer,                intent(in)  :: neurons
    real(wp), dimension(:), intent(in)  :: x
    real(wp), dimension(:), intent(out) :: output
    ! Local variables
    real(wp), dimension(neurons)        :: a
    integer :: n

    associate(layers => self % layers)
      a = matvecmul(layers(1) % w_transposed,x,neurons,size(x)) + layers(2) % b
      ! sigmoid activation: using an "inout" subroutine to avoid array copy 
      call sigmoid_subroutine(a)
      ! INTERMEDIATE LAYERS
      do n = 3, size(layers)-1
        a = matvecmul(layers(n-1) % w_transposed, a, neurons, neurons) + layers(n) % b
        call sigmoid_subroutine(a)
      end do
      ! LAST LAYER (LINEAR ACTIVATION = do nothing, just add biases)
      ! after the loop, n = size(layers)
      output = matvecmul(layers(n-1) % w_transposed, a, size(output), neurons) + layers(n) % b
    end associate
  end subroutine
  function matvecmul(matA, vecB, nrow, ncol)
    ! Explicit-shape matrix-vector product: accumulate columns of matA
    ! scaled by the elements of vecB.
    implicit none
    integer,  intent(in) :: nrow, ncol
    real(wp), intent(in) :: matA(nrow,ncol)
    real(wp), intent(in) :: vecB(ncol)
    real(wp) :: matvecmul(nrow)
    integer  :: j

    matvecmul = 0.0_wp
    do j = 1, ncol
      matvecmul = matvecmul + matA(:,j) * vecB(j)
    end do
  end function matvecmul
  subroutine sigmoid_subroutine(x)
    ! Apply the logistic sigmoid in place
    real(wp), dimension(:), intent(inout) :: x
    x = 1.0_wp / (1.0_wp + exp(-x))
  end subroutine
@milancurcic
Member

Thanks Peter, it all makes sense considering that equal-size hidden layers allow for fewer allocations and copies. What kind of performance improvement are you getting? I'm interested in both GNU and Intel compilers.

@peterukk
Author

peterukk commented Oct 1, 2019

gfortran -ffree-line-length-none -std=f2008 -pg -march=native -O3 --fast-math
Elapsed time on output: 0.939000010
Elapsed time on output_opt_sig: 0.680999994
Elapsed time on output_flatmodel_opt_sig: 0.669000030

gfortran -ffree-line-length-none -m64 -std=f2008 -march=native -O3 --fast-math -funroll-loops -ftree-loop-linear -fprefetch-loop-arrays

Elapsed time on output: 0.732999980
Elapsed time on output_opt_sig: 0.486000001
Elapsed time on output_flatmodel_opt_sig: 0.485000014

This is on gfortran-7, an Intel processor, and a model with (50,50) hidden neurons, 21 inputs and 256 outputs. I don't have the Intel compilers on this computer, but I do at home; I can get back to you.

  subroutine output_opt_sig(self, x, output)
    class(network_type),    intent(in)  :: self
    real(wp), dimension(:), intent(in)  :: x
    real(wp), dimension(:), intent(out) :: output
    ! Local variables
    real(wp), allocatable   :: a(:)
    integer,  dimension(2)  :: matsize
    integer                 :: n

    associate(layers => self % layers)
      matsize = shape(layers(1) % w_transposed)
      a = matvecmul(layers(1) % w_transposed, x, matsize(1), matsize(2)) + layers(2) % b
      ! sigmoid activation: using an "inout" subroutine to avoid array copy 
      call sigmoid_subroutine(a)
      ! INTERMEDIATE LAYERS
      do n = 3, size(layers)-1
        matsize = shape(layers(n-1) % w_transposed)
        a = matvecmul(layers(n-1) % w_transposed, a, matsize(1), matsize(2)) + layers(n) % b
        call sigmoid_subroutine(a)
      end do
      ! LAST LAYER (LINEAR ACTIVATION = do nothing, just add biases)
      matsize = shape(layers(n-1) % w_transposed)
      output = (matvecmul(layers(n-1) % w_transposed, a, matsize(1), matsize(2)) + layers(n) % b)
    end associate
  end subroutine

@peterukk
Author

peterukk commented Oct 1, 2019

On ifort the differences are much bigger:

ifort -O3 -mkl

Elapsed time on output: 0.8096000
Elapsed time on output_opt_sig: 0.1968000
Elapsed time on output_flatmodel_opt_sig: 0.1787000
Elapsed time on output_sgemm_sig: 0.1423000

Here the last routine processes all the input data at once, as mentioned; for the other ones I called them in a loop over samples, roughly as in the sketch below.
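
How the two approaches might be driven (illustrative only; net, x_all, y_all and nsamples are hypothetical names, assuming x_all has shape (features, nsamples) and y_all has shape (outputs, nsamples)):

  ! Per-sample routine: one matrix-vector product per layer and sample
  do isample = 1, nsamples
    call output_opt_sig(net, x_all(:,isample), y_all(:,isample))
  end do
  ! Batched routine: one matrix-matrix product (sgemm) per layer for all samples
  call output_sgemm_sig(net, x_all, nsamples, y_all)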

subroutine output_sgemm_sig(self, x, num_sample, output)
    ! Use this routine with a 2D input data array to process all the samples simultaneously in a feed-forward network.
    ! Replace the sigmoid activation subroutine with whatever activation you're using.
    ! Author: Peter Ukkonen
    ! sgemm = single precision (wp = sp); use dgemm for double precision
    class(network_type),      intent(in)    :: self
    real(wp), dimension(:,:), intent(in)    :: x      ! (features, num_sample)
    real(wp), dimension(:,:), intent(out)   :: output ! (outputs, num_sample)
    integer,                  intent(in)    :: num_sample
    ! Local variables
    real(wp), allocatable   :: a(:,:), a_next(:,:)
    real(wp)                :: alpha, beta
    integer,  dimension(2)  :: matsize
    integer                 :: n, isample

    alpha = 1.0_wp
    beta = 0.0_wp
    output = 0.0_wp

    associate(layers => self % layers)
      matsize = shape(layers(1) % w_transposed)
      allocate(a(matsize(1),num_sample))
      call sgemm("N","N",matsize(1), num_sample, matsize(2), alpha, layers(1) % w_transposed, matsize(1), x, matsize(2), beta, a, matsize(1))
      ! equivalent: a = a + spread(layers(2) % b, dim=2, ncopies=num_sample)
      do isample = 1, num_sample
        a(:,isample) = a(:,isample) + layers(2) % b
      end do
      call sigmoid_subroutine_mat(a)

      ! INTERMEDIATE LAYERS
      do n = 3, size(layers)-1
        matsize = shape(layers(n-1) % w_transposed)
        allocate(a_next(matsize(1),num_sample))

        call sgemm("N","N",matsize(1),num_sample,matsize(2),alpha,layers(n-1) % w_transposed,matsize(1),a,matsize(2),beta,a_next,matsize(1))
        deallocate(a)
        ! equivalent: a_next = a_next + spread(layers(n) % b, dim=2, ncopies=num_sample)
        do isample = 1, num_sample
          a_next(:,isample) = a_next(:,isample) + layers(n) % b
        end do
        call sigmoid_subroutine_mat(a_next)
        a = a_next
        deallocate(a_next)
      end do

      matsize = shape(layers(n-1) % w_transposed)
      call sgemm("N","N",matsize(1), num_sample, matsize(2), alpha, layers(n-1) % w_transposed, matsize(1), a, matsize(2), beta, output, matsize(1))
      ! equivalent: output = output + spread(layers(n) % b, dim=2, ncopies=num_sample)
      do isample = 1, num_sample
        output(:,isample) = output(:,isample) + layers(n) % b
      end do
    end associate
  end subroutine
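
For completeness, sigmoid_subroutine_mat is not shown above; a minimal 2D in-place counterpart of sigmoid_subroutine could look like this (a sketch, not the actual routine used in the benchmark):

  subroutine sigmoid_subroutine_mat(x)
    ! Sketch of the 2D companion to sigmoid_subroutine: apply the logistic
    ! sigmoid elementwise and in place on a (neurons, num_sample) array
    real(wp), dimension(:,:), intent(inout) :: x
    x = 1.0_wp / (1.0_wp + exp(-x))
  end subroutine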

@milancurcic
Member

Thanks. Do you know if any of these could be implemented in the existing class that doesn't assume equal sizes of hidden layers?

The only opportunity I see is to allocate the activation arrays once, rather than re-allocating on assignment in every iteration, in which case the subroutine to calculate activations in place may be effective (roughly sketched below).
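
One way that idea could look, sketched against the routines above (hypothetical and untested; it sizes two work arrays to the widest hidden layer once and then updates sections of them in place):

  subroutine output_prealloc_sketch(self, x, output)
    ! Sketch only: avoid re-allocation on assignment by sizing the work
    ! arrays once and operating on array sections
    class(network_type),    intent(in)  :: self
    real(wp), dimension(:), intent(in)  :: x
    real(wp), dimension(:), intent(out) :: output
    real(wp), allocatable :: a(:), a_next(:)
    integer :: n, nmax, sz, sz_next

    associate(layers => self % layers)
      ! Widest layer determines the work-array size
      nmax = 0
      do n = 2, size(layers)
        nmax = max(nmax, size(layers(n) % b))
      end do
      allocate(a(nmax), a_next(nmax))

      sz = size(layers(2) % b)
      a(1:sz) = matvecmul(layers(1) % w_transposed, x, sz, size(x)) + layers(2) % b
      call sigmoid_subroutine(a(1:sz))

      do n = 3, size(layers) - 1
        sz_next = size(layers(n) % b)
        a_next(1:sz_next) = matvecmul(layers(n-1) % w_transposed, a(1:sz), sz_next, sz) + layers(n) % b
        call sigmoid_subroutine(a_next(1:sz_next))
        a(1:sz_next) = a_next(1:sz_next)
        sz = sz_next
      end do

      n = size(layers)
      output = matvecmul(layers(n-1) % w_transposed, a(1:sz), size(output), sz) + layers(n) % b
    end associate
  end subroutine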

Btw, the 10-output-batch branch introduces a function to compute outputs per batch (contributed by @jvdp1), although it's just a wrapper around the output for a single set of inputs, so it doesn't have any of the performance improvements that you propose. The output_batch method could be a place to introduce these, if any are feasible. I will take some time soon to clean up and merge the new methods into master.

@peterukk
Author

peterukk commented Oct 2, 2019

output_opt_sig and output_sgemm_sig do not assume equal-sized hidden layers (I tested that they work). As you can see, the former is almost as fast as the flat-model one. Processing the inputs as matrices is fastest on ifort. I believe the most significant optimizations for me were the use of matvecmul instead of matmul, removal of the output activation, and the use of subroutines instead of functions to avoid array copies. I think you could quite easily implement these changes. I believe the weights need to be pre-transposed for matvecmul to be faster than matmul; a sketch of doing this at load time follows.
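
The transposition itself is a one-off cost at load time. A minimal sketch, assuming each layer stores its loaded weights in an allocatable w component and has an allocatable w_transposed component (the routine name is hypothetical):

  subroutine pretranspose_weights(self)
    ! Sketch only: build w_transposed once after loading, so matvecmul can
    ! accumulate contiguous columns of the transposed weight matrix
    class(network_type), intent(inout) :: self
    integer :: n
    do n = 1, size(self % layers)
      if (allocated(self % layers(n) % w)) then
        self % layers(n) % w_transposed = transpose(self % layers(n) % w)
      end if
    end do
  end subroutine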

@peterukk
Author

peterukk commented Oct 2, 2019

The only opportunity I see is to allocate the activation arrays once, rather than re-allocating on assignment in every iteration, in which case the subroutine to calculate activations in place may be effective.

That might be a good idea, although based on the previous results the allocation overhead isn't that big. I'm by no means an expert on Fortran, by the way; this has been an interesting learning experience!

@milancurcic
Member

output_opt_sig and output_sgemm_sig do not assume equal-sized hidden layers (I tested that they work)

OK, great, thanks, I overlooked that. In general, I don't want to break the high-level functional API. However, I'm all for tweaking the internal implementation if the performance benefits are clear. You give clear ideas and examples of how to do this. I will play with it and let you know.

@peterukk
Author

peterukk commented Nov 5, 2019

You're welcome! I have now implemented these changes in a more general manner here: https://github.com/peterukk/rte-rrtmgp/blob/master/neural/

Most notably, activation functions have been replaced with subroutines, and these procedures have a 2D array variant which is used in, e.g., output_sgemm.

@peterukk
Author

peterukk commented Jan 7, 2020

FYI, I have now tested using elemental subroutines as activation functions. This is considerably faster (and works with both 1D and 2D arrays), but unfortunately, pointers do not work with elemental procedures :/ I'm disappointed that object-oriented programming leads to a performance loss in this instance.
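
A minimal sketch of what such an elemental activation might look like (illustrative only, not copied from the linked code):

  elemental subroutine sigmoid_elemental(x)
    ! One definition covers scalars, 1D and 2D arrays, applied elementwise,
    ! but an elemental procedure cannot be the target of a procedure pointer
    real(wp), intent(inout) :: x
    x = 1.0_wp / (1.0_wp + exp(-x))
  end subroutine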
