You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This ticket is a result of performance investigation done at Arm, comparing SNAP application compiled with flang-new and gfortran.
The code in SNAP's mms.f90 contains this code:
!$OMP PARALLEL DO SCHEDULE(DYNAMIC,1) DEFAULT(SHARED) &
!$OMP& PRIVATE(g,kd,ks,jd,js,id,is,n,k,j,i,m,gp,lm,l,ll)
DO g = 1, ng
DO kd = 1, MAX( ndimen-1, 1 )
ks = -one
IF ( kd == 2 ) ks = one
DO jd = 1, MIN( ndimen, 2 )
js = -one
IF ( jd == 2 ) js = one
DO id = 1, 2
is = -one
IF ( id == 2 ) is = one
n = 4*(kd - 1) + 2*(jd - 1) + id
DO k = 1, nz
DO j = 1, ny
DO i = 1, nx
DO m = 1, nang
qim(m,i,j,k,n,g) = qim(m,i,j,k,n,g) + &
REAL( g, r_knd ) * is*mu(m)*sx(i)*cy(j)*cz(k) + &
sigt(mat(i,j,k),g)*ref_flux(i,j,k,g)
IF ( ndimen > 1 ) THEN
qim(m,i,j,k,n,g) = qim(m,i,j,k,n,g) + &
REAL( g, r_knd ) * js*eta(m)*cx(i)*sy(j)*cz(k)
END IF
IF ( ndimen > 2 ) THEN
qim(m,i,j,k,n,g) = qim(m,i,j,k,n,g) + &
REAL( g, r_knd ) * ks*xi(m)*cx(i)*cy(j)*sz(k)
END IF
DO gp = 1, ng
qim(m,i,j,k,n,g) = qim(m,i,j,k,n,g) - &
slgg(mat(i,j,k),1,gp,g) * ref_flux(i,j,k,gp)
lm = 2
DO l = 2, nmom
DO ll = 1, lma(l)
qim(m,i,j,k,n,g) = qim(m,i,j,k,n,g) - ec(m,lm,n)*&
slgg(mat(i,j,k),l,gp,g)*ref_fluxm(lm-1,i,j,k,g)
lm = lm + 1
END DO
END DO
END DO
END DO
END DO
END DO
END DO
END DO
END DO
END DO
END DO
!$OMP END PARALLEL DO
The code in the innermost loop contains two references to a 6-dimension array, and a total of 6 different arrays, all of which are fully computed for every loop iteration. Hand-modification of the index computations gives a 33% improvmenet in overall performance for the entire SNAP application.
A small example of the same thing, with standalone and "self-timing":
program p
implicit none
REAL(8), ALLOCATABLE, DIMENSION(:,:,:,:) :: qim
integer(4) :: nang, noct,nx, ny, ng
integer(4) :: i, j, k, l, m, n
integer :: istat
integer(4) :: ll
real :: time_start
real :: time_end
nx = 40
ny = 40
noct = 6
ng = 18
print *, "Total size = ", nx * ny * noct * ng * 8
ALLOCATE( qim(nx,ny,noct,ng), STAT=istat )
qim = 0.0
call CPU_TIME(time_start)
do ll = 1, 5000
do j = 1, nx
do k = 1, ny
do m = 1, noct
do n = 1, ng
qim(j, k, m, n) = qim(j, k, m, n) + 1
end do
end do
end do
end do
end do
call CPU_TIME(time_end)
do j = 1, nx
do k = 1, ny
do m = 1, noct
do n = 1, ng
if (qim(j, k, m, n) .ne. 5000.0) then
STOP ("Failed");
end if
end do
end do
end do
end do
print *, "Time taken (s): ", time_end - time_start
! Make sure the whole calculation isn't optimised away by printing some element of qim.
print *, qim(1,1,1,1)
end program p
When compiled with flang-new multi.f90 (on my x86-64 machine) it takes about 2.95s, Using gfortran -O3 multi.f90, it takes 1.96s.
The LLVM-IR for the above code is in the above gist as multi.ll
The text was updated successfully, but these errors were encountered:
This ticket is a result of performance investigation done at Arm, comparing SNAP application compiled with flang-new and gfortran.
The code in SNAP's mms.f90 contains this code:
The code in the innermost loop contains two references to a 6-dimension array, and a total of 6 different arrays, all of which are fully computed for every loop iteration. Hand-modification of the index computations gives a 33% improvmenet in overall performance for the entire SNAP application.
MLIR and LLVM-IR for the mms.f90 file can be found here:
https://gist.github.com/Leporacanthicus/693028a27f40523fc41d08eb87d9f5be
A small example of the same thing, with standalone and "self-timing":
When compiled with
flang-new multi.f90
(on my x86-64 machine) it takes about 2.95s, Usinggfortran -O3 multi.f90
, it takes 1.96s.The LLVM-IR for the above code is in the above gist as multi.ll
The text was updated successfully, but these errors were encountered: