Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Poor performance when accessing multidimensional arrays #1466

Open
Leporacanthicus opened this issue Feb 10, 2022 · 0 comments
Open

Poor performance when accessing multidimensional arrays #1466

Leporacanthicus opened this issue Feb 10, 2022 · 0 comments
Labels
Optimizer optimization pass(es) related

Comments

@Leporacanthicus
Copy link
Collaborator

Leporacanthicus commented Feb 10, 2022

This ticket is a result of performance investigation done at Arm, comparing SNAP application compiled with flang-new and gfortran.

The code in SNAP's mms.f90 contains this code:

  !$OMP PARALLEL DO SCHEDULE(DYNAMIC,1) DEFAULT(SHARED)                &
  !$OMP& PRIVATE(g,kd,ks,jd,js,id,is,n,k,j,i,m,gp,lm,l,ll)
    DO g = 1, ng
      DO kd = 1, MAX( ndimen-1, 1 )
        ks = -one
        IF ( kd == 2 ) ks = one
        DO jd = 1, MIN( ndimen, 2 )
          js = -one
          IF ( jd == 2 ) js = one
          DO id = 1, 2
            is = -one
            IF ( id == 2 ) is = one
            n = 4*(kd - 1) + 2*(jd - 1) + id
            DO k = 1, nz
            DO j = 1, ny
            DO i = 1, nx
              DO m = 1, nang
                qim(m,i,j,k,n,g) = qim(m,i,j,k,n,g) +                  &
                  REAL( g, r_knd ) * is*mu(m)*sx(i)*cy(j)*cz(k) +      &
                    sigt(mat(i,j,k),g)*ref_flux(i,j,k,g)
                IF ( ndimen > 1 ) THEN
                  qim(m,i,j,k,n,g) = qim(m,i,j,k,n,g) +                &
                    REAL( g, r_knd ) * js*eta(m)*cx(i)*sy(j)*cz(k)
                END IF
                IF ( ndimen > 2 ) THEN
                  qim(m,i,j,k,n,g) = qim(m,i,j,k,n,g) +                &
                    REAL( g, r_knd ) * ks*xi(m)*cx(i)*cy(j)*sz(k)
                END IF
                DO gp = 1, ng
                  qim(m,i,j,k,n,g) = qim(m,i,j,k,n,g) -                &
                    slgg(mat(i,j,k),1,gp,g) * ref_flux(i,j,k,gp)
                  lm = 2
                  DO l = 2, nmom
                    DO ll = 1, lma(l)
                      qim(m,i,j,k,n,g) = qim(m,i,j,k,n,g) - ec(m,lm,n)*&
                        slgg(mat(i,j,k),l,gp,g)*ref_fluxm(lm-1,i,j,k,g)
                      lm = lm + 1
                    END DO
                  END DO
                END DO
              END DO
            END DO
            END DO
            END DO
          END DO
        END DO
      END DO
    END DO
  !$OMP END PARALLEL DO

The code in the innermost loop contains two references to a 6-dimension array, and a total of 6 different arrays, all of which are fully computed for every loop iteration. Hand-modification of the index computations gives a 33% improvmenet in overall performance for the entire SNAP application.

MLIR and LLVM-IR for the mms.f90 file can be found here:
https://gist.github.com/Leporacanthicus/693028a27f40523fc41d08eb87d9f5be

A small example of the same thing, with standalone and "self-timing":

program p
  implicit none
  REAL(8), ALLOCATABLE, DIMENSION(:,:,:,:) :: qim

  integer(4) :: nang, noct,nx, ny, ng
  integer(4) :: i,    j,   k,  l,  m,  n 
  integer :: istat
  integer(4) :: ll

  real :: time_start
  real :: time_end

  nx = 40
  ny = 40
  noct = 6
  ng = 18

  print *, "Total size = ", nx * ny * noct * ng * 8
  ALLOCATE( qim(nx,ny,noct,ng), STAT=istat )

  qim = 0.0

  call CPU_TIME(time_start)
  do ll = 1, 5000
     do j = 1, nx
        do k = 1, ny
           do m = 1, noct
              do n = 1, ng
                 qim(j, k, m, n) = qim(j, k, m, n) + 1
              end do
           end do
        end do
     end do
  end do
  call CPU_TIME(time_end)
  
  do j = 1, nx
     do k = 1, ny
        do m = 1, noct
           do n = 1, ng
              if (qim(j, k, m, n) .ne. 5000.0) then
                 STOP ("Failed");
              end if
           end do
        end do
     end do
  end do

  print *, "Time taken (s): ", time_end - time_start
  ! Make sure the whole calculation isn't optimised away by printing some element of qim.
  print *, qim(1,1,1,1)
end program p

When compiled with flang-new multi.f90 (on my x86-64 machine) it takes about 2.95s, Using gfortran -O3 multi.f90, it takes 1.96s.

The LLVM-IR for the above code is in the above gist as multi.ll

@banach-space banach-space added the Optimizer optimization pass(es) related label Feb 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Optimizer optimization pass(es) related
Projects
None yet
Development

No branches or pull requests

2 participants