From Documentation
Jump to: navigation, search

OpenMP Reduction for User Defined Data Types

Nick Chepurniy, Ph.D.
e-mail: nickc@sharcnet.ca
HPC Consultant - SHARCNET http://www.sharcnet.ca
Compute Canada http://www.computecanada.org



Overview

Presently, the OpenMP reduction clause supports only the basic data types (e.g. integer, real*8). This document explores reduction techniques that can be used for User Defined Data Types. Examples of User Defined Data Types defined in the user's program and Data Types used in the QD and MPFUN90 packages are presented in full detail.


For those users interested in the programs a tar file is available
by contacting the author.

Introduction

In this document sample programs of the Multi Precision packages MPFUN90 and QD are presented which can be compiled and run SERIALLY or with OpenMP. Both of these packages use "user defined data types". Since the OpenMP reduction clause cannot be used for "user defined data types" the reduction is done manually in a PARALLEL OpenMP region. To further explore this reduction process a third program is used with a simple "user defined data type" called "my_dt". The third program defines addition and multiplication functions for the "my_dt" user defined data type and the operations + and * are overloaded with these functions. For completness and in order to do some comparisons between the different data types a program using the data types INTEGER*8 (I8) and REAL*16 (Quad) are used in a fourth program, where the OpenMP reduction clause can be used for the data type INTEGER*8, but a PARALLEL OpenMP region had to be used for the REAL*16 basic data type, which is more flexible.

The Computational Model

To illustrate the reduction on "user defined data types" following series summation was used in all four programs:

 Sum_N =  \sum_{i=1}^N i^2 = 1^2 + 2^2 + 3^2 + ... + N^2  

To ensure that we are getting the correct results in these programs a proposition is described for the sum of squared integers from 1 to N=10^m (m positive integer). (In Mathematics, a proposition is a useful proved result, similar to a theorem but less important). Thus, knowing what the sum is for a given N value allows us to compare it to the results of the program and we can rapidly determine if the program is working properly.

Each program can be run with one or two different data types either in SERIES or PARALLEL using OpenMP. The HIST files associated with each program contain the instructions for compiling the different cases and submitting the jobs. Often there will be a makefile which has the different parameters required to compile and submit the jobs.

Sample outputs for each of the four programs are also presented, along with the program listings, makefiles and HIST files. Finally, walltimes are compared for SERIAL and OpenMP jobs for the same value of N=10^m for the four programs.

For those users interested in the programs a tar file is available by contacting the author.

Proposition for the sum of squared integers from 1 to 10^m (m positive integer)

Let's use the symbol Sum_N to denote the sum of the integers squared from 1 to N. There is a closed form for this sum, i.e.

 Sum_N =  \sum_{i=1}^N i^2 = 1^2 + 2^2 + 3^2 + ... + N^2   = N*(N+1)*(2*N+1)/6

In each of the programs this closed form is calculated so that results from the summation can be compared to the closed form, but there could be overflow while the closed form is calculated. This is why this proposition is so valuable.

For N=10^m (where m is an integer > 0) following results were observed:

m     N=10^m           Sum_N                  long_format of Sum_N
1         10       10^2  x 3.85                       385
2        100       10^5  x 3.3835                    338350
3       1000       10^8  x 3.338335                 333833500
4      10000       10^11 x 3.33383335              333383335000
5     100000       10^14 x 3.3333833335           333338333350000
6    1000000       10^17 x 3.333338333335        333333833333500000
7   10000000       10^20 x 3.33333383333335     333333383333335000000

Proposition Rule:

The value of Sum_N (with N=10^m) for a given value of m > 0 is built as follows:

  (1) Repeat the digit '3' m times and append digit '8' to it.
  (2) Next, append the digit '3' (m-1) times and then digit '5'.
  (3) Finally, append (m-1) digits '0', so that Sum_N will have 3m digits.

Here are some more examples for m=7 through 13 using above Proposition Rule:

 m     Sum_N                                        long_format of Sum_N
 7   10^20 x 3.33333383333335                       333333383333335000000
 8   10^23 x 3.3333333833333335                    333333338333333350000000
 9   10^26 x 3.333333338333333335                 333333333833333333500000000
10   10^29 x 3.33333333383333333335              333333333383333333335000000000
11   10^32 x 3.3333333333833333333335           333333333338333333333350000000000
12   10^35 x 3.333333333338333333333335        333333333333833333333333500000000000
13   10^38 x 3.33333333333383333333333335     333333333333383333333333335000000000000

A proof of the proposition is available from the author in case anyone is doubtful.

The four programs

In the Introduction (above) the programs used different data types and were listed in this order:

     MPFUN90,  QD, my_dt, I8/Quad

which starts with the most complex (and time consuming) and ends with the basic data types supported by higher level programming languages such as Fortran and C++. Note that the OpenMP reduction clause works properly with these basic data types. It is for the "user defined data types" where the reduction clause fails.

Instead of describing the programs in the above order, we will present them in the reverse order, i.e. starting with the simplest and moving towards the more complex.

PROBLEM_1_I8: Fortran90 INTEGER*8 (I8) and REAL*16 (Quad) precisions

In Fortran90 the data types INTEGER*4 and INTEGER*8 have about 9 and 18 precision digits. This would allow us to safely compute our Sum_N up to N=10^m for m=3 and m=6 terms repectively.

On the other hand using REAL*16 (Quad) precision we could go up to 33 digits but since round off errors will be introduced the precision might be less than 33 digits.

The program in file sum_reduction_I8.f90 (in subdirectory PROBLEM_1_I8 and also displayed below) can use either INTEGER*8 (I8) or REAL*16 (Quad) data types, and with either of these data types we can be run SERIALLY or in PARALLEL (using OpenMP). This is a total of four cases.

The HIST file lists the commands for compiling and submitting jobs for the four cases.

/* HIST file for PROBLEM_1_I8 */
! PROBLEM_1_I8: I8/Quad with/without OpenMP

! P1_I8_SER:

       vi sum_reduction_I8.f90
       $FC -o a.out -fpp -DUSE_I8 sum_reduction_I8.f90
       ./a.out >& ${CLU}_OUT_P1_I8_SER_  &
       sqsub -r 1h --mpp=2G -o ${CLU}_OUT_P1_I8_SER_%J -n 1 ./a.out

! P1_I8_OMP

       vi sum_reduction_I8.f90
       $FC -o b.out -fpp -DUSE_I8 -openmp sum_reduction_I8.f90
       sqsub -r 1h --mpp=2G -o ${CLU}_OUT_P1_I8_OMP_%J -f threaded -N 1 -n 16 ./b.out

! P1_Quad_SER:

       vi sum_reduction_I8.f90
       $FC -o c.out -fpp  sum_reduction_I8.f90
       ./a.out >& ${CLU}_OUT_P1_Quad_SER_  &
       sqsub -r 1d --mpp=2G -o ${CLU}_OUT_P1_Quad_SER_%J -n 1 ./c.out
 
! P1_Quad_OMP

       vi sum_reduction_I8.f90
       $FC -o d.out -fpp -openmp sum_reduction_I8.f90
       sqsub -r 1h --mpp=2G -o ${CLU}_OUT_P1_Quad_OMP_%J -f threaded -N 1 -n 16 ./d.out

============================================================================================

       $FC -o a.out -fpp -DUSE_I8 sum_reduction_I8.f90
       $FC -o b.out -fpp -DUSE_I8 -openmp sum_reduction_I8.f90
       $FC -o c.out -fpp  sum_reduction_I8.f90
       $FC -o d.out -fpp -openmp sum_reduction_I8.f90

============================================================================================

       sqsub -r 1h --mpp=2G -o ${CLU}_OUT_P1_I8_SER_%J -n 1 ./a.out
       sqsub -r 1h --mpp=2G -o ${CLU}_OUT_P1_I8_OMP_%J -f threaded -N 1 -n 16 ./b.out
       sqsub -r 1d --mpp=2G -o ${CLU}_OUT_P1_Quad_SER_%J -n 1 ./c.out
       sqsub -r 1h --mpp=2G -o ${CLU}_OUT_P1_Quad_OMP_%J -f threaded -N 1 -n 16 ./d.out

============================================================================================

       ./cleanup

/* END HIST file for PROBLEM_1_I8 */ 

The option -fpp (for the fortran preprocessor) is required in the compile command, because the source file to be compiled has some fpp directives. Specifying the macro USE_I8 (i.e. -DUSE_I8) will use the INTEGER*8 data type, while by omitting the macro the dafault (REAL*16) data type will be used. If OpenMp is required then the flag -openmp must be included, otherwise a serial executable will be generated.

One file, sum_reduction_I8.f90, is used for all four cases without a makefile:

/* SOURCE file sum_reduction_I8.f90 */
     PROGRAM SUMREDUCTION
!
! PROBLEM_1_I8: I8/Quad with/without OpenMP
!

      implicit none

      INTEGER JJ, JJMAX, KLINE, m
      INTEGER*8 :: N, I8

#ifdef USE_I8
      INTEGER*8 :: SSUM, AN, SUM_EXACT
#else
      REAL*16   :: SSUM, AN, SUM_EXACT
      REAL*16   :: T1,T2,T3,PRIVATE_PARTIAL_SUM
#endif

!$    INTEGER OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM
      INTEGER NTHREADS, my_thread
      INTEGER*8 :: my_width, i_start, i_end

      LOGICAL DEBUG
      DEBUG = .true.
      DEBUG = .false.

!     Some initializations

!
! Write title:
!
      write(6,1004)
 1004 format("PROBLEM_1_I8: I8/Quad with/without OpenMP"/)

      N           =  1
      m           =  0
!
      N           =  10000000000
      m           =           10
!
      my_thread   =  0
      NTHREADS    =  1

#ifdef USE_I8
      JJMAX = 8
#else
      JJMAX =  1     ! use this for timing "walltime"
      JJMAX = 12
#endif

      DO JJ=1,JJMAX
!
! Parallel region displaying threads to ensure OpenMP is active.
! Do this only once i.e. only for JJ=1
!
!     IF ( JJ .EQ. 1) THEN
      IF ( JJ .EQ. 2) THEN
!$omp parallel private(my_thread) SHARED(nthreads)
!$    my_thread = omp_get_thread_num()
!$    nthreads = omp_get_num_threads()
!$OMP CRITICAL
      write (6,1005) my_thread,nthreads
!$OMP END CRITICAL
!$omp barrier
      if ( my_thread .eq. 0 ) then
        write (6,1006) nthreads
      end if
!$omp end parallel
      ENDIF

!
! Print N
!

#ifdef USE_I8
        if (nthreads .gt. 1) then
          write(6,1022) N,m
        else
          write(6,1012) N,m
        endif
#else
        if (nthreads .gt. 1) then
          write(6,2022) N,m
        else
          write(6,2022) N,m
        endif
#endif

! Anaylitic result

#ifdef USE_I8
        AN = N
        if (mod(AN,2) .eq. 0 .and. mod(AN+1,3) .eq. 0) then
          SUM_EXACT = (AN/2)*((AN+1)/3)*(2*AN+1)
          KLINE=1
        else
          if (mod(AN,3) .eq. 0 .and. mod(AN+1,2) .eq. 0) then
            SUM_EXACT = (AN/3)*((AN+1)/2)*(2*AN+1)
            KLINE=2
          else
            if (mod(AN,6) .eq. 0 ) then
              SUM_EXACT = (AN/6)*((AN+1)  )*(2*AN+1)
              KLINE=3
            else
              if (mod(AN+1,6) .eq. 0) then
                SUM_EXACT = (AN  )*((AN+1)/6)*(2*AN+1)
                KLINE=4
              else
                SUM_EXACT = (AN  )*((AN+1)  )*(2*AN+1)/6
                KLINE=5
              endif
            endif
          endif
        endif
#else
        AN = N
        SUM_EXACT = (AN  )*((AN+1)  )*(2*AN+1)/6
        KLINE=5
#endif

        if (DEBUG) write(6,1007) KLINE
 1007   format("Used KLINE = ",i1)

#ifdef USE_I8
        if (nthreads .gt. 1) then
          write(6,1023) SUM_EXACT
        else
          write(6,1013) SUM_EXACT
        endif
#else
        if (nthreads .gt. 1) then
          write(6,2023) SUM_EXACT
        else
          write(6,2013) SUM_EXACT
        endif
#endif

        SSUM = 0.0

#ifdef USE_I8
!$OMP PARALLEL DO REDUCTION(+:SSUM)
      DO I8 = 1, N
        SSUM = SSUM + I8*I8
      ENDDO

#else
!$OMP PARALLEL DEFAULT(NONE) PRIVATE(my_width,my_thread,i_start,i_end,  &
!$OMP I8,T1,T2,T3,PRIVATE_PARTIAL_SUM) SHARED(nthreads,N,SSUM)
!$    my_thread = omp_get_thread_num()
!$    nthreads  = omp_get_num_threads()
      my_width  = (N + nthreads -1) / nthreads

      i_start = 1 + my_thread*my_width
      i_end   = min((my_thread + 1)*my_width,N)

      PRIVATE_PARTIAL_SUM = 0.0D0

      DO I8 = i_start, i_end
        T1 = I8
        T2 = T1
        T3 = T1 * T2
        PRIVATE_PARTIAL_SUM = PRIVATE_PARTIAL_SUM + T3
      ENDDO
!
! Combine partial sums - must synchronize because SSUM is SHARED
!

!$OMP CRITICAL
      SSUM = SSUM + PRIVATE_PARTIAL_SUM
!$OMP END CRITICAL
!$OMP END PARALLEL
#endif

!
! Print SSUM
!

#ifdef USE_I8
        if (nthreads .gt. 1) then
          write(6,1021) SSUM
        else
          write(6,1011) SSUM
        endif
#else
        if (nthreads .gt. 1) then
          write(6,2021) SSUM
        else
          write(6,2011) SSUM
        endif
#endif

        N = N*10
        m = m + 1

      ENDDO

 1013 format("SUM_EXACT (I8/SER)  = ",I28)
 1023 format("SUM_EXACT (I8/OMP)  = ",I28)
 1012 format(/"N = ",I18,"  m=",I2.2," without OpenMP")
 1022 format(/"N = ",I18,"  m=",I2.2," with OpenMP")
 1011 format("SSum_Serial         = ",I28/)
 1021 format("SSum_OMP            = ",I28/)

 1005 format("Thread ",i2," out of ",i3)
 1006   format(/"There are", I3," threads"/)

 2013 format("SUM_EXACT (Quad/SER)  = ",F48.1)
 2023 format("SUM_EXACT (Quad/OMP)  = ",F48.1)
 2012 format(/"N = ",I18,"  m=",I2.2," without OpenMP")
 2022 format(/"N = ",I18,"  m=",I2.2," with OpenMP")
 2011 format("SSum_Serial           = ",F48.1)
 2021 format("SSum_OMP              = ",F48.1)
      END


/* END SOURCE file sum_reduction_I8.f90 */

The output files, that were produced, display the Sum_N for different values of N. Since we have tabulated these values above, we will not display the output files, unless there is something unexpected in the output.

If you run P1_I8_SER or P1_I8_OMP the output for N=1 through N=10^6 agrees with the tabulated Sum_N values. Furthermore, from the Proposition Rule, we know that Sum_N should have 3m digits for m>0.

Here is the output for m=6 and 7:

N =            1000000  m=06 with OpenMP
SUM_EXACT (I8/OMP)  =           333333833333500000
SSum_OMP            =           333333833333500000


N =           10000000  m=07 with OpenMP
SUM_EXACT (I8/OMP)  =          1291990006563070912
SSum_OMP            =          1291990006563070912


For m=06 the number of digits is 18 and the answer agrees with that in the tabulated values for Sum_N according to the Proposition Rule.

For m=07 we should have 3m=21 digits but we have only 19 and the answer does not agree with the tabulated values for Sum_N.

Recall that the data type INTEGER*8 has 18 precision digits ! We have experienced an overflow on Sum_N and that is why our answer is WRONG.

Running P1_Quad_SER and P1_Quad_OMP will use REAL*16 (Quad) precision, which is around 33 precision digits, which means that we should be able to calculate Sum_N up to m=11. But note that the reduction clause was replaced by a PARALLEL REGION where Quad variables are used. The last few lines of the output for P1_Quad_OMP show:

N =       100000000000  m=11 with OpenMP
SUM_EXACT (Quad/OMP)  =             333333333338333333333350000000000.0
SSum_OMP              =             333333333338333333333350000000000.0

which agree with the tabulated values for Sum_N.

PROBLEM_2_DT: Quad Serial/OpenMP

In this subdirectory the same problem is solved as in PROBLEM 1_I8 for the Quad option, but here a "user defined data type" is used to show how the OpenMP reduction is handled when "user defined data type" need to be reduced.

A module file is used to define the data type and some subroutines. The HIST file has instructions for compiling the module file and main program either SERIALLY or in PARALLEL using OpenMP:

/* HIST file for PROBLEM_2_DT */

! PROBLEM_2_DT: Quad  Serial/OpenMP

! Compile module my_dt_mod.f90 required for main_my_data_type.f90:

        vi my_dt_mod.f90
        $FC -c my_dt_mod.f90

! Serial

        vi main_my_data_type.f90
        $FC main_my_data_type.f90 my_dt_mod.o
       ./a.out >& SERIAL_OUT &
        sqsub -r  8h --mpp=2.0G  -o OUT_P2_MYDT_SER_%J -n 1  ./a.out

! OpenMP

        vi main_my_data_type.f90
        $FC -o b.out -openmp  main_my_data_type.f90 my_dt_mod.o
        sqsub -r  8h --mpp=2.0G  -o OUT_P2_MYDT_OMP_%J -f threaded -N 1 -n 16  ./b.out

/* END HIST file for PROBLEM_2_DT */

Following module and source files are used to generate a SERIAL or PARALLEL executable:

/* SOURCE file (module) my_dt_mod.f90 */
      module my_dt_mod

       type my_dt
         sequence
         character*4 :: number_type  ! REAL or CPLX (Real or Complex)
!                                       This version uses only REAL but
!                                       assume CPLX will be used in the
!                                       future. We just want a data type
!                                       with char, integer and quad.

          integer     :: int_exponent
          real*16     :: coefficient  ! Quad precision
        end type my_dt

      CONTAINS

      subroutine my_neg_normalize(v1,v2)
        type (my_dt), intent(in) :: v1
        type (my_dt)             :: v2
        integer                  ::  i, m

        m = 0

        v2 = v1

        do while (v2%coefficient .lt. 0.1 )
          v2%coefficient = v2%coefficient*10.0D0
          m = m + 1
        enddo

        v2%int_exponent = v2%int_exponent - m

      end subroutine my_neg_normalize

      subroutine my_normalize(v1,v2)
        type (my_dt), intent(in) :: v1
        type (my_dt)             :: v2
        integer                  ::  i, m

        m = 0

        v2 = v1

        do while (v2%coefficient .ge. 10.0 )
          v2%coefficient = v2%coefficient/10.0D0
          m = m + 1
        enddo

        v2%int_exponent = v2%int_exponent + m

      end subroutine my_normalize


      subroutine my_dt_init(dt_zero,dt_one)
        type (my_dt) dt_zero, dt_one
        dt_zero = my_dt("REAL",0,0.00)
        dt_one  = my_dt("REAL",0,1.00)
        return
      end subroutine my_dt_init


      subroutine my_dt_write(dt_var,NPREC)
        INTEGER, INTENT (IN), OPTIONAL :: NPREC
        Character(2) :: XS1, XS2
        Character(24) :: FMT
        LOGICAL :: DEBUG
        type (my_dt) dt_var,v2

        DEBUG = .false.

        call my_normalize(dt_var,v2)

        IF (PRESENT(NPREC)) THEN
          if ( NPREC .lt. 10 .or. NPREC .gt. 80 ) then
            write(6,1003)
 1003       format("N is out of range ! - using default format.")
            FMT="('10 ^',i3,' x',f37.33)"
          else
            write(XS1,'(I2)' )  NPREC+4
            write(XS2,'(I2)' )  NPREC
            FMT="('10 ^',i3,' x'" // ",F" // XS1 // "." // XS2 // ")"
            if (DEBUG) print *,"Using fmt: ",FMT
          endif

          write(6,FMT) v2%int_exponent,v2%coefficient

        ELSE
          FMT="('10 ^',i3,' x',f37.33)"
           write(6,FMT) v2%int_exponent,v2%coefficient
        END IF

        return

      end subroutine my_dt_write

      end module my_dt_mod
 
/* END SOURCE file (module) my_dt_mod.f90 */
/* SOURCE file main_my_data_type.f90 */
        program my_dt_prog
!
! PROBLEM_2_DT:   Quad  Serial/OpenMP version with "user defined data type"
!                 -  See HIST file for compile instructions
!                 Same as PROBLEM 1_I8 for Quad option but here a "user defined
!                 data type" is used to show how the OpenMP reduction is
!                 handled when "user defined data type" need to be reduced.
!
      use my_dt_mod
      implicit none

      INTERFACE OPERATOR ( + )
        function my_add(v1,v2) result (v3)
          use my_dt_mod
          type (my_dt), intent(in) :: v1,v2
          type (my_dt) :: v3
        end function my_add
      END INTERFACE

      INTERFACE OPERATOR ( * )
        function my_mult(v1,v2) result (v3)
          use my_dt_mod
          type (my_dt), intent(in) :: v1,v2
          type (my_dt) :: v3
        end function my_mult
      END INTERFACE

      type(my_dt) SSUM,T1,T2,T3,PRIVATE_PARTIAL_SUM
      type(my_dt) my_dt_zero, my_dt_one, temp_dt, my_dt_eps

      integer*8  N, I8
      real*16 :: R16

      INTEGER NTHREADS, my_thread
!$    INTEGER OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM

      INTEGER*8 :: my_width, i_start, i_end
      integer   :: JJ, m, NN
      LOGICAL DEBUG

! ---------------------------------------------------------------------

      call my_dt_init(my_dt_zero,my_dt_one)

      NN = 33
      temp_dt = my_dt("REAL",0,1.0Q-33)
      write(6,1003)
 1003 format("eps = ",$)
      call my_dt_write(temp_dt,NN)

       NN = 32
      call my_neg_normalize(temp_dt,my_dt_eps)
      write(6,1004)
 1004 format("normalized eps = ",$)
      call my_dt_write(my_dt_eps,NN)

      NN = 50
      call my_dt_write(my_dt_one,NN)

      DEBUG = .false.

      NTHREADS  = 1
      my_thread = 0

!     N = 1
!     m = 0
!     DO JJ=1,12

      N = 10000000000
      m =          10

      DO JJ=1,1

!
! Check valid range for N, recall integer*8 has max of 18 precision
! digits, so for N > 10^18 index I8 will overflow
!
      if (N .lt. 1) then
        print *,"Must have N > 0"
        stop
      endif

      if (N .gt. 1000000000000000000 ) then
        print *,"Need N < 10^18 - otherwise index I8 will overflow"
        stop
      endif

!
! Parallel region displaying threads to ensure OpenMP is active.
! Do this only once i.e. only for JJ=1
!
!     IF ( JJ .EQ. 1) THEN
      IF ( JJ .EQ. 2) THEN

!$omp parallel private(my_thread) SHARED(nthreads)
!$    my_thread = omp_get_thread_num()
!$    nthreads  = omp_get_num_threads()
!$OMP CRITICAL
      write (6,1005) my_thread,nthreads
 1005 format("Thread ",i2," out of ",i3)
!$OMP END CRITICAL
!$omp barrier
      if ( my_thread .eq. 0 ) then
        write (6,1006) nthreads
 1006   format(/"There are", I3," threads"/)
      end if
!$omp end parallel

      ENDIF

!
! Display N, m and execution mode: OpenMP or SERIAL
!
      if (nthreads .gt. 1 ) then
        write(6,1001) N, m
 1001   format("N = ",I14,"  m=",I2.2," OpenMP for my_dt data type " &
     &         "(quad)")
      else
        write(6,2001) N, m
 2001   format("N = ",I14,"  m=",I2.2," SERIAL for my_dt data type " &
     &         "(quad)")
      endif

!
! Start the reduction process by defining a PARALLEL REGION
!
      SSUM =  my_dt_zero

!$OMP PARALLEL DEFAULT(NONE) PRIVATE(my_width,my_thread,i_start,i_end,  &
!$OMP I8,R16,T1,T2,T3,PRIVATE_PARTIAL_SUM)                              &
!$OMP SHARED(nthreads,my_dt_zero,N,SSUM)

!$    my_thread = omp_get_thread_num()
!$    nthreads  = omp_get_num_threads()
      my_width  = (N + nthreads -1) / nthreads

      i_start = 1 + my_thread*my_width
      i_end   = min((my_thread + 1)*my_width,N)

      PRIVATE_PARTIAL_SUM = my_dt_zero

      DO I8 = i_start, i_end
        R16 = I8
        T1 = my_dt("REAL",0,R16)
        T2 = T1
        T3 = T1 * T2
        PRIVATE_PARTIAL_SUM = PRIVATE_PARTIAL_SUM + T3
      ENDDO
!
! Combine partial sums - must synchronize because SSUM is SHARED
!

!$OMP CRITICAL
      SSUM = SSUM + PRIVATE_PARTIAL_SUM
!$OMP END CRITICAL
!$OMP END PARALLEL

      write(6,1002)
 1002 format("SSUM - ",$)
      call my_dt_write(SSUM)

      print *,""
      N = N*10
      m = m + 1

      ENDDO

      end program my_dt_prog




      function my_add(v1,v2) result (v3)
        use my_dt_mod
        type (my_dt), intent(in) :: v1,v2
        type (my_dt) :: v3
        real*16 :: res

        res = v1%coefficient*10**v1%int_exponent +                      &
     &        v2%coefficient*10**v2%int_exponent

        v3 = my_dt("REAL",0,res)

      end function my_add

      function my_mult(v1,v2) result (v3)
        use my_dt_mod
        type (my_dt), intent(in) :: v1,v2
        type (my_dt) :: v3

        v3%coefficient  = v1%coefficient  * v2%coefficient
        v3%int_exponent = v1%int_exponent + v2%int_exponent
      end function my_mult


/* END SOURCE file main_my_data_type.f90 */

PROBLEM_3_QD: QD/Quad with/without OpenMP

This program (file sum_QD_reduction.f90) handles either QD data type or REAL*16 (Quad), and both can be run SERIALLY or in PARALLEL (using OpenMP), i.e. four cases.

QD includes routines to perform "quad-double" (approx. 62 digits) arithmetic while with Quad precision we could go up to 33 digits. Recall that for Quad precision round off errors will be introduced so the precision might be less than 33 digits.

According to the HIST file, the compilation and job submission for the four cases is done through a makefile. Here are the commands for the four cases:

/* HIST file for PROBLEM_3_QD */
  ! PROBLEM_3_QD: QD/Quad with/without OpenMP

! Defaults in makefile are:
!
!       D_QD  := DO_NOT_USE_QD   (default)
!       D_OMP :=        USE_OMP  (default)

        vi sum_QD_reduction.f90

! There are 4 cases:

! DO_NOT_USE_QD DO_NOT_USE_OMP

       make clean
       make D_OMP=DO_NOT_USE_OMP
       make D_OMP=DO_NOT_USE_OMP brun

! DO_NOT_USE_QD USE_OMP

       make clean
       make
       make brun


! USE_QD DO_NOT_USE_OMP

       make clean
       make D_QD=USE_QD D_OMP=DO_NOT_USE_OMP
       make D_QD=USE_QD D_OMP=DO_NOT_USE_OMP brun

! USE_QD USE_OMP

       make clean
       make D_QD=USE_QD
       make D_QD=USE_QD brun


/* END HIST file for PROBLEM_3_QD */

Following makefile is used:

/* Makefile for PROBLEM_3_QD */
#!/bin/bash
SHELL := /bin/bash

# Makefile for compiling an MPI/QD program

ERR_MSG_1:= - You must use the icc compiler for this make command - See ENVIRONMENT file for details !

# Find where we are

TOPDIR := $(shell pwd)

# Check environment - we want icc (INTEL) compilers
ifneq ($(CC), icc)
      $(error error $(ERR_MSG_1))
endif

ifeq ($(CLU),orc)
  XNCPUS = 16
else
  XNCPUS = 8
endif

# D_MACROS:
D_QD  :=        USE_QD
D_OMP := DO_NOT_USE_OMP

# Set defaults:

D_QD  := DO_NOT_USE_QD
D_OMP :=        USE_OMP

OBJDIR  := OBJECTS

# Compilers
USEF90  := $(FC)
USECXX  := $(CXX)

# Libraries
QDLIB=/opt/sharcnet/qd/2.3.8/intel/lib
INTELLIB=/opt/sharcnet/intel/11.0.083/ifc/lib/intel64
LIBS= -L$(QDLIB) -lqdmod -lqd -L${INTELLIB} -lifcore -limf -lifport

# Flags
IFLAGS  := -I$(QDLIB)/qd
FFLAGS  := -O2 -g  -fpp $(IFLAGS)

ifeq ($(D_QD),USE_QD)
  FFLAGS  += -DUSE_QD
endif

ifeq ($(D_OMP),USE_OMP)
  FFLAGS  += -openmp -DUSE_OMP
  LIBS    += -openmp
  NCPUS    = $(XNCPUS)
else
  NCPUS    = 1
endif

SRCS:= sum_QD_reduction.f90

OBJS:= $(addprefix $(OBJDIR)/,$(SRCS:.f90=.o))

TARGET := x.im

 all: $(TARGET)
       @echo " "
       @echo "This command executed from subdirectory: "
       @echo "     $(TOPDIR) "
       @echo " "
       @echo "Running on cluster: ${CLUSTER}"
       @echo "Compiling with USEF90 = $(USEF90) and FFLAGS = ${FFLAGS}"
       @echo " "

$(TARGET): $(OBJS) makefile
       @echo " "
       @echo "Linking ... "
       @echo " "
       $(USECXX) -o $@ cmain1.C  $(OBJS) $(LIBS)

$(OBJDIR)/%.o : %.f90 makefile
       @echo " "
       @echo "Compiling ... "
       @echo " "
       $(USEF90) ${FFLAGS} -c -o $@ $<

.PHONY: clean

clean:
       rm -rf *.o $(OBJDIR) $(TARGET)

run: $(TARGET) brun
       @echo " "
       @echo "Job was submitted to the queue ... "
       @echo " "

brun:
ifeq ($(D_OMP),USE_OMP)
 ifeq ($(D_QD),USE_QD)

       sqsub -r 5h --mpp=2G  -f threaded -N 1 -n $(NCPUS) -o ${CLU}_OUT_P3_QD_$(NCPUS)_%J  ./$(TARGET)
 else
       sqsub -r 5h --mpp=2G  -f threaded -N 1 -n $(NCPUS) -o ${CLU}_OUT_P3_Quad_$(NCPUS)_%J  ./$(TARGET)
 endif
else
 ifeq ($(D_QD),USE_QD)
       sqsub -r 5h --mpp=2G  -o ${CLU}_OUT_P3_QD_$(NCPUS)_%J  ./$(TARGET)
 else
       sqsub -r 5h --mpp=2G  -o ${CLU}_OUT_P3_Quad_$(NCPUS)_%J  ./$(TARGET)
 endif
endif


list:
       @echo " "
       @echo "List makefile macros: "
       @echo "This command executed from subdirectory: "
       @echo "     $(TOPDIR) "
       @echo "Running on cluster: ${CLUSTER}"
       @echo "Compiling with USEF90 = $(USEF90) and FFLAGS = ${FFLAGS}"
       @echo "               USECXX = $(USECXX) and FFLAGS = ${FFLAGS}"
       @echo "Directories:"
       @echo "  FFLAGS  = $(FFLAGS) "
       @echo "  OBJS    = $(OBJS) "
       @echo "  OBJ    = $(OBJ) "
       @echo "  LIBS   = $(LIBS) "

$(OBJS): | $(OBJDIR)

$(OBJDIR):
       mkdir $(OBJDIR)

/* END Makefile for PROBLEM_3_QD */

One program for the four cases is used:

/* SOURCE file sum_QD_reduction.f90 */
      subroutine SUMQDREDUCTION
!
! PROBLEM_3_QD: QD/Quad with/without OpenMP
!

#ifdef USE_QD
      use qdmodule
#endif

      implicit none

      INTEGER JJ, JJMAX, m, old_cw
      INTEGER*8 I8, I, N

#ifdef USE_QD
      type(qd_real)  :: A, B, SSUM, AN, SUM_EXACT
      REAL*8 :: Di
      type(qd_real)  :: PRIVATE_PARTIAL_SUM, T1, T2, T3
#else
      REAL*16 :: A, B, SSUM, AN, SUM_EXACT, T1, T2, T3
#endif

      INTEGER NTHREADS, TID
!$    INTEGER OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM

!***********************************************************************
!
#ifdef USE_QD
      call f_fpu_fix_start(old_cw)
#endif

!     Some initializations

!
! Write title:
!
      write(6,2004)
 2004 format("Problem 3: QD/Quad with/without OpenMP"/)

#ifdef USE_OMP
      write(6,1010)
#else
      write(6,2010)
#endif

#ifdef USE_QD
      JJMAX = 1
#else
      JJMAX = 1
#endif

      N    = 10000000000
      m =             10

      DO JJ=1,JJMAX
!
! Parallel region displaying threads to ensure OpenMP is active.
! Do this only once i.e. only for JJ=1
!
!     IF ( JJ .EQ. 1) THEN
      IF ( JJ .EQ. 2) THEN
#ifdef USE_OMP
!$omp parallel DEFAULT(NONE) private(TID) SHARED(nthreads)
      TID = omp_get_thread_num()
      nthreads = omp_get_num_threads()
!$OMP CRITICAL
      write (6,1005) TID,nthreads
!$OMP END CRITICAL
!$omp barrier
      if ( TID .eq. 0 ) then
        write (6,1006) nthreads
      end if
!$omp end parallel
#else
      TID=0
      nthreads=1
      write (6,1006) nthreads
#endif
      ENDIF

#ifdef USE_QD
      write(6,1000) N, m
      Di = N
      AN = qd_real((/Di, 0.0d0, 0.0d0, 0.0d0/))
#else
      write(6,2000) N, m
      AN = N
#endif

! Analytic result

        SUM_EXACT = AN*(AN+1)*(2*AN+1)/6

#ifdef USE_QD
        if (nthreads .gt. 1) then
          write(6,1021)
 1021     format("SUM_EXACT (QD/OMP) = ",$)
          call qdwrite(6,SUM_EXACT)
        else
          write(6,1011)
 1011     format("SUM_EXACT (QD/SER) = ",$)
          call qdwrite(6,SUM_EXACT)
        endif

        SSUM = QD_ZERO
#else
        if (nthreads .gt. 1) then
          write(6,2021) SUM_EXACT
        else
          write(6,2011) SUM_EXACT
        endif

        SSUM = 0.0
#endif

#ifdef USE_QD
!$OMP PARALLEL PRIVATE(PRIVATE_PARTIAL_SUM,Di,AN,T1,T2,T3) SHARED(SSUM)
      PRIVATE_PARTIAL_SUM = QD_ZERO
!$OMP DO
      DO I = 1, N
        Di = I
        AN  = qd_real((/Di, 0.0d0, 0.0d0, 0.0d0/))
        T1 = AN
        T2 = AN
        T3 = T1*T2
        PRIVATE_PARTIAL_SUM = PRIVATE_PARTIAL_SUM + T3
      ENDDO
!$OMP END DO

!
! Combine partial sums - must synchronize because SUM is SHARED
!

!$OMP CRITICAL
      SSUM = SSUM + PRIVATE_PARTIAL_SUM
!$OMP END CRITICAL
!$OMP END PARALLEL

#else

        SSUM = 0.0
!$OMP PARALLEL DO REDUCTION(+:SSUM)
        DO I8 = 1, N
          T1 = I8
          T2 = T1
          T3 = T1 * T2
          SSUM = SSUM + T3
        ENDDO
#endif

!
! Print SSUM
!

#ifdef USE_QD
        if (nthreads .gt. 1) then
          write(6,1022)
 1022     format("SSUM (QD/OMP)      = ",$)
          call qdwrite(6,SSUM)
        else
          write(6,1012)
 1012     format("SSUM (QD/SER)      = ",$)
          call qdwrite(6,SSUM)
        endif
#else
        if (nthreads .gt. 1) then
          write(6,2022) SSUM
        else
          write(6,2012) SSUM
        endif
#endif

        N = N*10
        m = m + 1

      ENDDO

#ifdef USE_QD
      call f_fpu_fix_end(old_cw)
#endif

! QD
 1000 format(/"N = ",I18,"  m = ",I2,"  Using QD precision")
 1010 format("  OpenMP JOB")

 1005 format("Thread ",i2," out of ",i3)
 1006 format(/"There are", I3," threads"/)

! Quad
 2000 format(/"N = ",I18,"  m = ",I2," Using Quad precision")
 2010 format("  SERIAL JOB")
 2011 format("SUM_EXACT (Quad/SER) = ",F48.1)
 2021 format("SUM_EXACT (Quad/OMP) = ",F48.1)
 2012 format("SSum (Quad/SER)      = ",F48.1/)
 2022 format("SSum (Quad/OMP)      = ",F48.1/)

      stop
      end

/* END SOURCE file sum_QD_reduction.f90 */

PROBLEM_4_MPFUN: I8/MPFUN with/without OpenMP

File omp_reduction.f90 can be used to compile and run the four cases:

 I8/MPFUN with/without OpenMP

The precision for an MPFUN program is specified at the beginning of the program by defining an integer, say num_digits, as the required precision and then invoking subroutine mpinit(num_digits). The value of num_digits in this program was set to 62 to correspond to the QD precision in PROBLEM_3_QD. However, the value can be 1000 or even larger.

/* HIST file for PROBLEM_4_MPFUN */
 
ifort -fpp omp_reduction.f90 -I/opt/sharcnet/mpfun/6.1.23/f90/include -L/opt/sharcnet/mpfun/6.1.23/f90/lib -lmp

! MPFUN/Openmp

# Default D_MACROS in Makefile:
#  D_MPFUN  :=        USE_MPFUN
#  D_OMP    :=          USE_OMP

! MPFUN/OMP

       make clean
       make
       make brun

! MPFUN/SER

       make clean
       make D_OMP=DO_NOT_USE_OMP
       make D_OMP=DO_NOT_USE_OMP  brun

! I8/SER

       make clean
       make D_MPFUN=DO_NOT_USE_MPFUN D_OMP=DO_NOT_USE_OMP
       make D_MPFUN=DO_NOT_USE_MPFUN D_OMP=DO_NOT_USE_OMP brun

! I8/OMP

       make clean
       make D_MPFUN=DO_NOT_USE_MPFUN
       make D_MPFUN=DO_NOT_USE_MPFUN  brun

/* END of HIST file for PROBLEM_4_MPFUN */

The makefile:

/*  makefile for PROBLEM_4_MPFUN */
ERR_MSG_1:= - MPFUN uses icc compiler - reset env to intel'

ROOTDIR  = $(shell pwd)
OPSYS    = $(shell uname -s )

SRCS_F   = omp_reduction.f90

OBJS_F   = $(SRCS_F:.f90=.o)
TARGET   = my_OpenMP_exec

MPFLIB   = -L/opt/sharcnet/mpfun/6.1.23/f90/lib -lmp

# Check environment - we want intel compilers

ifneq ($(CC), icc)
       $(error error $(ERR_MSG_1))
endif

ifeq ($(CLU),orc)
  XNCPUS = 16
else
  XNCPUS = 8
endif

# D_MACROS:
D_MPFUN  := DO_NOT_USE_MPFUN
D_OMP    :=   DO_NOT_USE_OMP

# Default for D_MACROS:
D_MPFUN  :=        USE_MPFUN
D_OMP    :=          USE_OMP

ifeq ($(CC), icc)
FFLAGS = -fpp
OUTPUTFILE=OUTPUT_
endif

ifeq ($(D_MPFUN),USE_MPFUN)
  FFLAGS  += -DUSE_MPFUN
  LIBS    += ${MPFLIB}
  IFLAGS   = -I/opt/sharcnet/mpfun/6.1.23/f90/include
endif

ifeq ($(D_OMP),USE_OMP)
  FFLAGS  += -openmp -DUSE_OMP
  LIBS    += -openmp
  NCPUS    = $(XNCPUS)
else
  NCPUS    = 1
endif


all: $(TARGET)
       @echo " "
       @echo "Compiling with FFLAGS = $(FFLAGS)"
       @echo " "

$(TARGET): $(OBJS_F) $(SRCS_F) makefile
       @echo " "
       @echo "Linking ..."
       $(FC) -o $@ $(FFLAGS)  $(OBJS_F) ${LIBS}

$(OBJS_F): $(SRCS_F) makefile
       @echo " "
       @echo "Compiling ... "
       $(FC) $(FFLAGS) $(IFLAGS) -c $(SRCS_F)


help:;  @echo " "
       @echo "Operating System Detected: $(OPSYS) "
       @echo " "
       @echo "USAGE: "
       @echo "make help       To get this listing"
       @echo "make            To compile the OpenMP  program in current environment"
       @echo "make clean      Remove *.o and executable files"
       @echo "make list       List the compilers in current environment"
       @echo "make brun       Submit a batch job"
       @echo " "

list:
       @echo
       @echo "OPSYS:       $(OPSYS)"
       @echo "ROOTDIR:     $(ROOTDIR)"
       @echo "FC Compiler: $(FC)"
       @echo "C  Compiler: $(CC)"
       @echo "FFLAGS:      $(FFLAGS)"
       @echo "OUTPUT FILE: $(OUTPUTFILE)"
       @echo "ARCH:        $(ARCH)"
       @echo " "

clean:
       @echo "Removing *.o and executable files"
       rm -rf *.o $(TARGET) OUTPUT_my_OpenMP

brun:  $(OBJS_F) $(SRCS_F) makefile
ifeq ($(D_OMP),USE_OMP)
 ifeq ($(D_MPFUN),USE_MPFUN)

       sqsub -r 1d --mpp=2G  -f threaded -N 1 -n $(NCPUS) -o ${CLU}_OUT_P4_MPFUN_$(NCPUS)_%J  ./$(TARGET)
 else
       sqsub -r 1h --mpp=2G  -f threaded -N 1 -n $(NCPUS) -o ${CLU}_OUT_P4_I8_$(NCPUS)_%J  ./$(TARGET)
 endif
else
 ifeq ($(D_MPFUN),USE_MPFUN)
       sqsub -r 1d --mpp=2G  -o ${CLU}_OUT_P4_MPFUN_$(NCPUS)_%J  ./$(TARGET)
 else
       sqsub -r 1h --mpp=2G  -o ${CLU}_OUT_P4_I8_$(NCPUS)_%J  ./$(TARGET)
 endif
endif
 

/* END of makefile for PROBLEM_4_MPFUN */

OUTPUT naming convention

For each job submitted from the four subdirectories the base name for the different cases will be as shown below. These base names will be prefixed by the environment variable $(CLU) to indicate on which cluster the job run and postfixed by the JobID (%J).

PROBLEM_1_I8: Fortran90 INTEGER*8 (I8) and REAL*16 (Quad) precisions.

 OUT_P1_I8_SER
 OUT_P1_I8_OMP
 OUT_P1_Quad_SER
 OUT_P1_Quad_OMP

PROBLEM_2_DT: Quad Serial/OpenMP

 OUT_P2_MYDT_SER
 OUT_P2_MYDT_OMP

PROBLEM_3_QD: QD/Quad with/without OpenMP

 OUT_P3_QD_1      (SERIAL)
 OUT_P3_QD_16     (OMP)
 OUT_P3_Quad_1    (SERIAL)
 OUT_P3_Quad_16   (OMP)

PROBLEM_4_MPFUN: I8/MPFUN with/without OpenMP

 OUT_P4_I8_1       (SERIAL)
 OUT_P4_I8_16      (OMP)
 OUT_P4_MPFUN_1    (SERIAL)
 OUT_P4_MPFUN_16   (OMP)

For example, the output file:

       orc_OUT_P3_QD_16_75374.orc-admin2.orca.sharcnet

was run on cluster orca with JobID 75374 and has base name P3_QD_16

OpenMP SpeedUps

If you know the JobID (e.g. 75134) of a particular job issuing the command:

   qstat -f 75134  | head -21

once the job has completed will display the statistics (e.g. cput and walltime) for that job.

Following table shows the walltime for the SER and OMP jobs and the SpeedUp defined as walltime_SER/walltime_OMP (for m=10):

PROBLEM_1_I8:

 OUT_P1_Quad_SER  = 00:11:48
 OUT_P1_Quad_OMP  = 00:01:10
 walltime_SpeedUp = 10.11


PROBLEM_2_DT:

 OUT_P2_MYDT_SER  = 00:13:02
 OUT_P2_MYDT_OMP  = 00:01:29
 walltime_SpeedUp = 8.78

PROBLEM_3_QD:

 (uses OpenMP reduction clause)
 OUT_P3_Quad_SER  = 00:06:51
 OUT_P3_Quad_OMP  = 00:00:50
 walltime_SpeedUp = 8.22
 OUT_P3_QD_SER    = 01:13:09
 OUT_P3_QD_OMP    = 00:11:27
 walltime_SpeedUp =  6.39

PROBLEM_4_MPFUN:

 OUT_P4_MPFUN_SER = 02:48:45
 OUT_P4_MPFUN_OMP = 00:52:13
 walltime_SpeedUp =  3.23