I would like to continue the discussion of my performance problem with array operations in a new topic. Some history can be found in https://software.intel.com/en-us/forums/intel-visual-fortran-compiler-for-windows/topic/605372#comment-1854000 and http://openmp.org/forum/viewtopic.php?f=3&t=1682 . Some of the participants in those earlier discussions got better performance on their hardware and compilers than I did. I was also advised to try MKL, and that is part of this topic. The disappointing message: neither OpenMP nor MKL is faster than simple DO loops on my laptop.
My questions are: is my hardware simply not suited for parallel calculations, did I forget to use some special compiler options, or does Hyper-Threading (HT) on my Win7 x64 Home Premium SP1 system impede the performance (I don't know how to suppress it)? I am attaching my processor details (bandwidth issue etc.).
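As a first check, the small diagnostic below (just a sketch using the standard OpenMP runtime routines, not part of the benchmark) prints how many logical processors the runtime sees and how many threads it will use by default; with HT the first number includes the logical cores:

program check_omp_env
use omp_lib
IMPLICIT NONE
! logical processors visible to the OpenMP runtime (with HT this counts logical cores)
print *, 'num_procs   =', omp_get_num_procs()
! default number of threads for the next parallel region
print *, 'max_threads =', omp_get_max_threads()
end program check_omp_env

If pinning threads to physical cores (e.g. via the Intel runtime's KMP_AFFINITY environment variable) is the right way to take HT out of the picture, I would be glad to hear how.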
This is the test code; it compares DO loops, array (vector) notation, OpenMP and MKL.
! TESTS 26.12.2015
! Test speed for array operation y(i)=a*x(i)+y(i) in 4 different ways
!
! "C:\Program Files (x86)\Intel\Composer XE 2013 SP1\mkl\bin\mklvars" intel64 mod
! "C:\Program Files (x86)\Intel\Composer XE 2013 SP1\bin\compilervars.bat" intel64
! ifort testMKLvsOpenMP.f90 /QopenMP /Qmkl
! Intel(R) Visual Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.2.176 Build 20140130
! Microsoft (R) Incremental Linker Version 9.00.21022.08
! -out:testMKLvsOpenMP.exe
! -subsystem:console
! -defaultlib:libiomp5md.lib
! -nodefaultlib:vcomp.lib
! -nodefaultlib:vcompd.lib
! "-libpath:C:\Program Files (x86)\Intel\Composer XE 2013 SP1\mkl\lib\intel64"
! testMKLvsOpenMP.obj
!
program TestMKLvsOpenMP
use omp_lib
IMPLICIT NONE
integer :: N
real*8,Allocatable :: x(:),y(:)
real*8 :: alpha
real*8 :: endtime,starttime,DSECND
real :: cpu1,cpu2
integer :: NTHREADS,irepeat,nrepeat,i
! initialize
alpha=.0001
print *,'N=?'
read *,N
nrepeat=1000000000/N ! nrepeat*N = 1000 Mio
print *,'nrepeat=',nrepeat
Allocate (x(N),y(N))
x(:)=0. ; y(:)=0.
pause 'Press Return'
! 1. standard do loops
forall (i=1:N) ; x(i)=i ; y(i)=-i ;end forall
Nthreads=0
starttime = OMP_get_wtime()
Call cpu_time(cpu1)
do irepeat=1,nrepeat
do i=1,N
y(i)=alpha*x(i)+y(i)
enddo
enddo
endtime = OMP_get_wtime() ; Call cpu_time(cpu2)
print *, 'Threads=',NTHREADS,' DO time=',SNGL(endtime - starttime),cpu2-cpu1
pause 'Press Return'
! 2. vector
forall (i=1:N) ; x(i)=i ; y(i)=-i ; end forall
Nthreads=0
Call cpu_time(cpu1)
do irepeat=1,nrepeat
y(1:N)=alpha*x(1:N)+y(1:N)
enddo
Call cpu_time(cpu2)
print *, 'Threads=',NTHREADS,' Vector time=',cpu2-cpu1
pause 'Press Return'
! 3. OMP
Nthreads=2
forall (i=1:N) ;x(i)=i ; y(i)=-i ; end forall
CALL OMP_SET_NUM_THREADS(NTHREADS)
starttime = OMP_get_wtime() ; Call cpu_time(cpu1)
!$OMP PARALLEL Shared(N,x,y,alpha)
do irepeat=1,nrepeat
!$OMP DO PRIVATE(i) SCHEDULE(static)
do i=1,N
y(i)=alpha*x(i)+y(i)
enddo
!$OMP END DO nowait
enddo
!$OMP END PARALLEL
endtime = OMP_get_wtime() ; Call cpu_time(cpu2)
print *, 'OMP Threads=',NTHREADS,' OMPtime=',SNGL(endtime - starttime),(cpu2-cpu1)/Nthreads
pause 'Press Return'
! 4. MKL
forall (i=1:N) ; x(i)=i ; y(i)=-i ; end forall
starttime =DSECND() ; Call cpu_time(cpu1)
CALL MKL_SET_NUM_THREADS(NTHREADS)
do irepeat=1,nrepeat
CALL daxpy(N,alpha,x,1, y ,1)
end do
endtime = DSECND(); Call cpu_time(cpu2)
print *, 'MKL Threads=',NTHREADS,' time=',SNGL(endtime - starttime),(cpu2-cpu1)/Nthreads
end
The results for N=1000000 (1 million) and N=10000 are:
N=?
1000000
nrepeat= 1000
Press Return
Threads= 0 DO time= 0.1767103 0.1716011
Press Return
Threads= 0 Vector time= 0.1716011
Press Return
OMP Threads= 2 OMPtime= 1.397965 1.388409
Press Return
MKL Threads= 2 time= 1.406852 1.404009
N=?
10000
nrepeat= 100000
Press Return
Threads= 0 DO time= 0.1744737 0.1560010
Press Return
Threads= 0 Vector time= 0.1716011
Press Return
OMP Threads= 2 OMPtime= 0.2589355 0.2574016
Press Return
MKL Threads= 2 time= 0.3295782 0.3120020
When I go down to N=100, OMP and MKL run slower still:
OMP Threads= 2 OMPtime= 0.8096205 0.8112052
MKL Threads= 2 time= 0.4926147 0.2418015
I am aware that /QxAVX would speed up the vector time by about 80% (and the DO loops as well); however, as I am distributing my codes to a variety of users, I want to use the most common flags. All comments are welcome.
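A possible compromise, if I read the compiler documentation correctly, would be automatic CPU dispatch via /QaxAVX, which should put both a baseline code path and an AVX path into the same executable. I have not benchmarked it yet, so the command line below is only an assumption on my part:

ifort testMKLvsOpenMP.f90 /QopenMP /Qmkl /QaxAVX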