
MPI Reduction for User Defined Data Types

Nick Chepurniy, Ph.D.
e-mail: nickc@sharcnet.ca
HPC Consultant - SHARCNET http://www.sharcnet.ca
Compute Canada http://www.computecanada.org



Overview

MPI reduction can be performed directly only on the basic MPI data types (e.g. INTEGER, REAL*8). This document explores reduction techniques that can be used for User Defined Data Types. Note that MPI has no quad-precision data type. An example of a User Defined Data Type, defined for the QD package, is presented in full detail.

For those users interested in the programs, a tar file is available by contacting the author.

Introduction

Two problems that use reductions to sum a series are presented. Each problem offers a choice between two data types and between running serially or with MPI, giving four cases in total. In the first problem, in subdirectory PROBLEM_1_I8, the user selects either INTEGER*8 or DOUBLE PRECISION for the data type, and then chooses SERIAL or MPI. In the second problem, in subdirectory PROBLEM_3_QD, the data types are INTEGER*8 and QD ("quad-double", approx. 62 digits of arithmetic), again with the choice of SERIAL or MPI.

In PROBLEM_1_I8 we can use the MPI_REDUCE subroutine to do the reduction since both INTEGER*8 and DOUBLE PRECISION are basic data types.

For the second problem MPI_REDUCE can be used for INTEGER*8, but it should not be used for the QD data type, which MPI does not support. Since a QD value is stored as an array of four double-precision components, one is tempted to use MPI_REDUCE with count=4 on those components. This scheme does work for some simpler reductions, but not for all of them. The program in subdirectory PROBLEM_3_QD carries out this scheme and also does the reduction the proper way.
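To see why summing the components independently can fail, consider a two-component ("double-double") analogue, where each value is an unevaluated sum of a high and a low double. The sketch below is illustrative Python, not the QD package itself: proper addition captures the rounding error of the high-word sum in the low word, while componentwise addition (which is effectively what MPI_SUM over the raw components does) silently drops it.

```python
def two_sum(a, b):
    # Knuth's error-free transformation: a + b == s + err exactly
    s = a + b
    bv = s - a
    err = (a - (s - bv)) + (b - bv)
    return s, err

def dd_add(x, y):
    # Proper double-double addition: the rounding error of the
    # high-word sum is recovered and kept in the low word.
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    return two_sum(s, e)

def componentwise_add(x, y):
    # What MPI_REDUCE with count=2 and MPI_SUM effectively does:
    # each slot is summed independently in plain double arithmetic.
    return (x[0] + y[0], x[1] + y[1])

p1 = (1.0, 0.0)      # partial sum on rank 0
p2 = (1e-17, 0.0)    # partial sum on rank 1

naive  = componentwise_add(p1, p2)   # (1.0, 0.0): the 1e-17 is rounded away
proper = dd_add(p1, p2)              # (1.0, 1e-17): error kept in the low word
```

The same loss occurs in every pairwise addition inside the reduction tree, so the componentwise result can drift arbitrarily far from full quad-double accuracy.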

The Computational Model

To illustrate the reduction on "user defined data types", the following series summation was used in both problems:

 Sum_N =  \sum_{i=1}^N i^2 = 1^2 + 2^2 + 3^2 + ... + N^2  

To ensure that these programs produce correct results, a proposition is given for the sum of squared integers from 1 to N=10^m (m a positive integer). (In mathematics, a proposition is a useful proved result, similar to a theorem but less important.) Knowing the sum for a given N lets us compare it against the program output and rapidly determine whether the program is working properly.

Each program can be run with one of two data types, either SERIALLY or in PARALLEL using MPI. The HIST file associated with each program contains the instructions for compiling the different cases and submitting the jobs. Often there is a makefile that holds the parameters required to compile and submit the jobs.

Sample outputs for each of the two programs are also presented, along with the program listings, makefiles and HIST files. Finally, walltimes are compared for SERIAL and MPI jobs for the same value of N=10^m for the two programs.



Proposition for the sum of squared integers from 1 to 10^m (m positive integer)

Let's use the symbol Sum_N to denote the sum of the integers squared from 1 to N. There is a closed form for this sum, i.e.

 Sum_N =  \sum_{i=1}^N i^2 = 1^2 + 2^2 + 3^2 + ... + N^2   = N*(N+1)*(2*N+1)/6

Each program evaluates this closed form so that the summation results can be compared against it. However, the closed form itself can overflow while it is being calculated, since the intermediate product N*(N+1)*(2*N+1) is six times larger than Sum_N. This is why the proposition below is so valuable as an independent check.
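The overflow risk is easy to demonstrate: in 64-bit integer arithmetic (Fortran's INTEGER*8) there is a band of N values where Sum_N itself still fits but the intermediate product does not. A quick check with Python's unbounded integers (a sketch, not part of the original programs):

```python
INT64_MAX = 2 ** 63 - 1   # largest value an INTEGER*8 can hold

n = 2_500_000
product = n * (n + 1) * (2 * n + 1)   # intermediate value of the closed form
sum_n = product // 6                  # Sum_N itself

# Sum_N fits in 64 bits, but the product computed on the way there does not,
# so evaluating the closed form naively in INTEGER*8 would overflow here.
assert sum_n <= INT64_MAX
assert product > INT64_MAX
```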

For N=10^m (where m is an integer > 0) the following results were observed:

m     N=10^m           Sum_N                  long_format of Sum_N
1         10       10^2  x 3.85                       385
2        100       10^5  x 3.3835                    338350
3       1000       10^8  x 3.338335                 333833500
4      10000       10^11 x 3.33383335              333383335000
5     100000       10^14 x 3.3333833335           333338333350000
6    1000000       10^17 x 3.333338333335        333333833333500000
7   10000000       10^20 x 3.33333383333335     333333383333335000000

Proposition Rule:

The value of Sum_N (with N=10^m) for a given value of m > 0 is built as follows:

  (1) Repeat the digit '3' m times and append digit '8' to it.
  (2) Next, append the digit '3' (m-1) times and then digit '5'.
  (3) Finally, append (m-1) digits '0', so that Sum_N will have 3m digits.

Here are some more examples, for m=7 through 13, using the above Proposition Rule:

 m     Sum_N                                        long_format of Sum_N
 7   10^20 x 3.33333383333335                       333333383333335000000
 8   10^23 x 3.3333333833333335                    333333338333333350000000
 9   10^26 x 3.333333338333333335                 333333333833333333500000000
10   10^29 x 3.33333333383333333335              333333333383333333335000000000
11   10^32 x 3.3333333333833333333335           333333333338333333333350000000000
12   10^35 x 3.333333333338333333333335        333333333333833333333333500000000000
13   10^38 x 3.33333333333383333333333335     333333333333383333333333335000000000000
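The rule can be checked mechanically against the closed form. A short Python sketch (exact, thanks to Python's arbitrary-precision integers):

```python
def closed_form(n):
    # Sum_N = N*(N+1)*(2*N+1)/6, computed exactly
    return n * (n + 1) * (2 * n + 1) // 6

def rule(m):
    # Proposition Rule: '3' m times, '8', then '3' (m-1) times,
    # then '5', then (m-1) zeros -- 3m digits in total.
    return int("3" * m + "8" + "3" * (m - 1) + "5" + "0" * (m - 1))

for m in range(1, 14):
    assert rule(m) == closed_form(10 ** m)
    assert len(str(rule(m))) == 3 * m   # Sum_N has 3m digits
```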

A proof of the proposition is available from the author in case anyone is doubtful.


The two programs

PROBLEM_1_I8: Fortran90 INTEGER*8 (I8) and DOUBLE PRECISION (REAL*8).

In Fortran90 the INTEGER*8 data type holds about 18 decimal digits (its maximum value is 2^63-1, approximately 9.22x10^18). This allows us to safely compute Sum_N up to N=10^m for m=6. The program in file sum_reduction_I8.f90 (in subdirectory PROBLEM_1_I8) can use either INTEGER*8 (I8) or REAL*8 (DP), and with either data type it can run SERIALLY or in PARALLEL (using MPI), for a total of four cases.

The HIST file lists the commands for compiling and submitting jobs for the four cases.

/* HIST file for PROBLEM_1_I8 */

! PROBLEM_1_I8: I8/DP  MPI/Serial

! P1_I8_SER:

       vi sum_reduction_I8.f90
       mpif90 -o a.out -fpp -DUSE_I8 sum_reduction_I8.f90
       ./a.out >& ${CLU}_OUT_P1_I8_SER_  &
       sqsub -r 1h --mpp=2G -o ${CLU}_OUT_P1_I8_SER_%J -n 1  ./a.out

! P1_I8_MPI:

       vi sum_reduction_I8.f90
       mpif90 -o b.out -fpp -DUSE_I8 -DUSE_MPI sum_reduction_I8.f90
       sqsub -r 1h --mpp=2G -o ${CLU}_OUT_P1_I8_MPI_%J -q mpi -n 16 ./b.out


! P1_DP_SER:

       vi sum_reduction_I8.f90
       mpif90 -o c.out -fpp  sum_reduction_I8.f90
       ./c.out >& ${CLU}_OUT_P1_DP_SER_  &
       sqsub -r 1d --mpp=2G -o ${CLU}_OUT_P1_DP_SER_%J -n 1 ./c.out

! P1_DP_MPI

       vi sum_reduction_I8.f90
       mpif90 -o d.out -fpp -DUSE_MPI sum_reduction_I8.f90
       sqsub -r 1h --mpp=2G -o ${CLU}_OUT_P1_DP_MPI_%J -q mpi -n 16 ./d.out


/* END HIST file for PROBLEM_1_I8 */

The option -fpp (the Fortran preprocessor) is required in the compile command because the source file contains fpp directives. Specifying the macro USE_I8 (i.e. -DUSE_I8) selects the INTEGER*8 data type; omitting it selects the default DP (REAL*8) data type. To use MPI the flag -DUSE_MPI must also be specified, otherwise a serial executable is generated.

One file, sum_reduction_I8.f90, is used for all four cases without a makefile:

/* SOURCE file sum_reduction_I8.f90 */
      PROGRAM SUMREDUCTION
!
! PROBLEM_1_I8: I8/DP   MPI/Serial
!                  ^^      ^^^^^^
!                  ||      ||||||
!               default    default

      implicit none

#ifdef USE_MPI
      include 'mpif.h'
#endif

      INTEGER JJ, JJMAX, m
      INTEGER*8 :: N, I8

#ifdef USE_I8
      INTEGER*8 :: SSUM, T1, T2, T3, P_SUM

#else
      REAL*8    :: SSUM, T1, T2, T3, P_SUM
#endif

      INTEGER   :: myid, numprocs, ierr
      INTEGER*8 :: i_start, i_end
      LOGICAL DEBUG

      DEBUG = .true.
      DEBUG = .false.

!     Some initializations

#ifdef USE_MPI
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
#else
      myid     = 0
      numprocs = 1
#endif

!
! Write title:
!

#ifdef USE_MPI
      if (myid .eq. 0) write(6,2004)
 2004 format("PROBLEM_1_I8: I8/DP MPI/Serial"/)
#else
      write(6,1004)
 1004 format("PROBLEM_1_I8: I8/DP MPI/Serial"/)
#endif

      N           =  1000000
      m           =        6

#ifdef USE_I8
      JJMAX = 8
      JJMAX = 2
#else
      JJMAX = 10
      JJMAX =  5
#endif

      DO JJ=1,JJMAX
!
! Print N
!

#ifdef USE_I8
        if (m .gt. 6) then
          if (numprocs .gt. 1) then
            if (myid .eq. 0) write(6,3001) m
            go to 9999
          else
            write(6,3001) m
 3001       format("For m=",i2," I8 data type too small ==> overflow"/)
            go to 9999
          endif
        endif

        if (numprocs .gt. 1) then
          if (myid .eq. 0) write(6,1022) N,m
        else
          write(6,1012) N,m
        endif
#else
        if (numprocs .gt. 1) then
          if (myid .eq. 0) write(6,2022) N,m
        else
          write(6,2012) N,m
        endif
#endif

      SSUM = 0.0

      i_start = 1 + myid
      i_end   = N

      P_SUM = 0.0D0

      DO I8 = i_start, i_end, numprocs
        T1 = I8
        T2 = T1
        T3 = T1 * T2
        P_SUM = P_SUM + T3
      ENDDO
!
! Collect all the partial sums
!

#ifdef USE_MPI
#ifdef USE_I8
      call MPI_REDUCE(P_SUM,SSUM,1,MPI_INTEGER8,MPI_SUM,0,              &
     &                  MPI_COMM_WORLD,ierr)
#else
!
!  MPI does not support MPI_Quad which means we cannot use the
!  subroutine call MPI_REDUCE(...) for Quad data types. Downgrading
!  to data type DOUBLE_PRECISION.

      call MPI_REDUCE(P_SUM,SSUM,1,MPI_DOUBLE_PRECISION,MPI_SUM,0,      &
     &                  MPI_COMM_WORLD,ierr)
#endif
#else
      SSUM = P_SUM
#endif

!
! Print SSUM
!

#ifdef USE_I8
        if (numprocs .gt. 1) then
          if (myid .eq. 0) write(6,1021) SSUM
        else
          write(6,1011) SSUM
        endif
#else
        if (numprocs .gt. 1) then
          if (myid .eq. 0) write(6,2021) SSUM
        else
          write(6,2011) SSUM
        endif
#endif

 9999   N = N*10
        m = m + 1

      ENDDO

#ifdef USE_MPI
      call MPI_FINALIZE(ierr)
#endif
 1012 format(/"N =",I12,"  m=",I2.2," I8 Serial")
 1022 format(/"N =",I12,"  m=",I2.2," I8 MPI")
 1011 format("SSum_I8_Serial      = ",I20/)
 1021 format("SSum_I8_MPI         = ",I20/)

 2012 format(/"N = ",I12,"  m=",I2.2," DP Serial")
 2022 format(/"N = ",I12,"  m=",I2.2," DP MPI")
 2011 format("SSum_DP_Serial    = ",F48.1)
 2021 format("SSum_DP_MPI       = ",F48.1)
      END

/* END SOURCE file sum_reduction_I8.f90 */

The output files display Sum_N for different values of N. Since these values are tabulated above, we will not show the output files unless there is something unexpected in them.

If you run P1_I8_MPI the output for N=1 through N=10^6 agrees with the tabulated Sum_N values. Furthermore, from the Proposition Rule, we know that Sum_N should have 3m digits for m>0.

Here is the output for m=6 and 7:

PROBLEM_1_I8: I8/DP MPI/Serial

N =     1000000  m=06 I8 MPI
SSum_I8_MPI         =   333333833333500000

For m= 7 I8 data type too small ==> overflow


For m=06 the result has 18 digits and agrees with the tabulated Sum_N values produced by the Proposition Rule.

For m=7 we would need 3m=21 digits, but INTEGER*8 holds only about 18, so the answer does not fit: Sum_N overflows and a wrong answer would be produced. An if statement has been added to the program to print an error message for m>6 instead of displaying the wrong answers.
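The overflow boundary can be confirmed directly (a Python sketch; 2**63 - 1 is the INTEGER*8 maximum):

```python
INT64_MAX = 2 ** 63 - 1   # largest value an INTEGER*8 can hold

def sum_n(n):
    # Exact closed form with Python's unbounded integers
    return n * (n + 1) * (2 * n + 1) // 6

assert sum_n(10 ** 6) <= INT64_MAX   # m=6: 18 digits, fits in INTEGER*8
assert sum_n(10 ** 7) > INT64_MAX    # m=7: 21 digits, would overflow
```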

PROBLEM_3_QD: QD/I8 MPI/Serial

This program (file sum_QD_reduction.f90) handles either QD data type or INTEGER*8, and both can be run SERIALLY or in PARALLEL (using MPI), i.e. four cases.

QD provides routines for "quad-double" (approx. 62 digit) arithmetic, whereas INTEGER*8 gives us only about 18 digits.
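To see why that extra precision matters here: Sum_N for N=10^m has 3m digits, so already at m=10 the result has 30 significant digits, far beyond REAL*8's ~16 but well within QD's range. A quick Python check (a sketch; Python floats are the same IEEE doubles as REAL*8):

```python
# Exact 30-digit value of Sum_N for m = 10
exact = (10 ** 10) * (10 ** 10 + 1) * (2 * 10 ** 10 + 1) // 6

assert len(str(exact)) == 30          # 3m digits, as the Proposition Rule says
assert int(float(exact)) != exact     # a 53-bit double cannot hold all 30 digits
```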

According to the HIST file, the compilation and job submission for the four cases is done through a makefile:

/* HIST file for PROBLEM_3_QD */
 
! PROBLEM_3_QD: QD/I8   MPI/Serial

! Defaults in makefile are:
!
!       D_QD  := DO_NOT_USE_QD   (default) i.e. use I8
!       D_MPI := DO_NOT_USE_MPI  (default) i.e. use Serial

       vi sum_QD_reduction.f90

! There are 4 cases:

  ! DO_NOT_USE_QD DO_NOT_USE_MPI

       make clean
       make
       make brun

  ! DO_NOT_USE_QD USE_MPI

       make clean
       make D_MPI=USE_MPI
       make D_MPI=USE_MPI  brun


  ! USE_QD DO_NOT_USE_MPI

       make clean
       make D_QD=USE_QD D_MPI=DO_NOT_USE_MPI
       make D_QD=USE_QD D_MPI=DO_NOT_USE_MPI brun

  ! USE_QD USE_MPI

       make clean
       make D_QD=USE_QD D_MPI=USE_MPI
       make D_QD=USE_QD D_MPI=USE_MPI brun

/* END HIST file for PROBLEM_3_QD */

The following makefile is used:

/* Makefile for PROBLEM_3_QD */
  #!/bin/bash
SHELL := /bin/bash

# Makefile for compiling an MPI/QD program

ERR_MSG_1:= - You must use the icc compiler for this make command - See ENVIRONMENT file for details !

# Find where we are

TOPDIR := $(shell pwd)

# Check environment - we want icc (INTEL) compilers
ifneq ($(CC), icc)
      $(error error $(ERR_MSG_1))
endif

# Set default D_MACROS:

D_MPI := DO_NOT_USE_MPI
D_QD  := DO_NOT_USE_QD

OBJDIR  := OBJECTS

# Compilers
USEF90  := mpif90
USECXX  := mpiCC

# Libraries
QDLIB=/opt/sharcnet/qd/2.3.8/intel/lib
INTELLIB=/opt/sharcnet/intel/11.0.083/ifc/lib/intel64
LIBS=-L${INTELLIB} -lifcore -limf -lifport

# Flags
IFLAGS   := -I$(QDLIB)/qd
FFLAGS   := -O2 -g  -fpp

ifeq ($(D_QD),USE_QD)
  FFLAGS  += -DUSE_QD $(IFLAGS)
  LIBS    += -L$(QDLIB) -lqdmod -lqd
endif

ifeq ($(D_MPI),USE_MPI)
  FFLAGS  += -DUSE_MPI
  LIBS    += -lmpi_f90
endif

SRCS:= sum_QD_reduction.f90

OBJS:= $(addprefix $(OBJDIR)/,$(SRCS:.f90=.o))

TARGET := x.im

all: $(TARGET)
       @echo " "
       @echo "This command executed from subdirectory: "
       @echo "     $(TOPDIR) "
       @echo " "
       @echo "Running on cluster: ${CLUSTER}"
       @echo "Compiling with USEF90 = $(USEF90) and FFLAGS = ${FFLAGS}"
       @echo " "

$(TARGET): $(OBJS) makefile
       @echo " "
       @echo "Linking ... "
       @echo " "
       $(USECXX) -o $@ cmain1.C $(OBJS) $(LIBS)

$(OBJDIR)/%.o : %.f90 makefile
       @echo " "
       @echo "Compiling ... "
       @echo " "
       $(USEF90) ${FFLAGS} -c -o $@ $<

.PHONY: clean

clean:
       rm -rf *.o $(OBJDIR) $(TARGET)

run: $(TARGET) brun
       @echo " "
       @echo "Job was submitted to the queue ... "
       @echo " "

brun:
ifeq ($(D_MPI),USE_MPI)
  ifeq ($(D_QD),USE_QD)
       @echo "IARG0= $(IARG0)"
       sqsub -r $(IARG3)h --mpp=2G -q mpi -n $(IARG0) -o ${CLU}_OUT_P3_QD_$(IARG0)_%J  ./$(TARGET) $(IARGK)  $(IARG0)  $(IARG1)  $(IARG2)  $(IARG3) $(IARG4)
  else
       sqsub -r $(IARG3)h --mpp=2G -q mpi -n $(IARG0) -o ${CLU}_OUT_P3_I8_$(IARG0)_%J  ./$(TARGET) $(IARGK)  $(IARG0)  $(IARG1)  $(IARG2)  $(IARG3) $(IARG4)
  endif

else
  ifeq ($(D_QD),USE_QD)
       sqsub -r $(IARG3)h --mpp=2G -n $(IARG0) -o ${CLU}_OUT_P3_QD_$(IARG0)_%J  ./$(TARGET) $(IARGK)  $(IARG0)  $(IARG1)  $(IARG2)  $(IARG3) $(IARG4)
  else
       sqsub -r $(IARG3)h --mpp=2G -n $(IARG0) -o ${CLU}_OUT_P3_I8_$(IARG0)_%J  ./$(TARGET) $(IARGK)  $(IARG0)  $(IARG1)  $(IARG2)  $(IARG3) $(IARG4)
  endif

endif


list:
       @echo " "
       @echo "List makefile macros: "
       @echo "This command executed from subdirectory: "
       @echo "     $(TOPDIR) "
       @echo "Running on cluster: ${CLUSTER}"
       @echo "Compiling with USEF90 = $(USEF90) and FFLAGS = ${FFLAGS}"
       @echo "               USECXX = $(USECXX) and FFLAGS = ${FFLAGS}"
       @echo "Directories:"
       @echo "  FFLAGS  = $(FFLAGS) "
       @echo "  OBJS    = $(OBJS) "
       @echo "  OBJ    = $(OBJ) "
       @echo "  LIBS   = $(LIBS) "

$(OBJS): | $(OBJDIR)

$(OBJDIR):
       mkdir $(OBJDIR)


/* END Makefile for PROBLEM_3_QD */

One program for the four cases is used:

/* SOURCE file sum_QD_reduction.f90 */
      subroutine SUMQDREDUCTION(iargK,iarg0,iarg1,iarg2,iarg3,iarg4)
!
! PROBLEM_3_QD: QD/I8   MPI/Serial
!                  ^^       ^^^^^^
!                  ||       ||||||
!               default     default
!

!     QD   no  yes
! MPI\
!  no      00   01
! yes      10   11


#ifdef USE_QD
      use qdmodule
#endif

      implicit none

#ifdef USE_MPI
      include 'mpif.h'
      INTEGER :: ierr
#endif

      INTEGER        :: iargK, iarg0, NPROCS, status
      INTEGER        :: K_MPI, K_QD, KASE
      INTEGER        :: iarg1,iarg2,iarg3
      INTEGER*8      :: Larg1,Larg2
      CHARACTER*1    :: iarg4

      INTEGER        :: numprocs, myid, root, ii, JJ, JJMAX, m, old_cw
      INTEGER*8      :: I8, N, i_start, i_end
      LOGICAL        :: DEBUG

#ifdef USE_QD
      type(qd_real), DIMENSION(:), ALLOCATABLE  :: Q_SUM
      type(qd_real)  :: SSUM, AN, RSUM, P_SUM, T1, T2, T3
      REAL*8 :: Di
#else
      INTEGER*8      :: SSUM, AN, P_SUM, T1, T2, T3
#endif

      DEBUG = .true.
      DEBUG = .false.

      root  = 0
      NPROCS = iarg0

      if (DEBUG) then
        print *,"root = ",root
        call flush(6)
      endif

!***********************************************************************
!
#ifdef USE_QD
      call f_fpu_fix_start(old_cw)
#endif

!     Some initializations

#ifdef USE_MPI
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)

#ifdef USE_QD
      ALLOCATE(Q_SUM(NPROCS), STAT=status)

      if (status .ne. 0) then
        if (myid .eq. 0) then
          write(6,3003)
 3003     format("ALLOCATE(Q_SUM(NPROCS)) FAILED"/)
        endif
        call MPI_FINALIZE(ierr)
        stop
      endif
#endif

#else
      myid     = 0
      numprocs = 1
#endif

#ifdef USE_QD
      JJMAX =  1
#else
      JJMAX =  1
#endif

      Larg1 = iarg1
      Larg2 = iarg2
      N = Larg1*10**Larg2
      m = iarg2

#ifdef USE_MPI
      K_MPI = 1
#else
      K_MPI = 0
#endif

#ifdef USE_QD
      K_QD  = 1
#else
      K_QD  = 0
#endif

      KASE = 10*K_MPI + K_QD

      if (myid .eq. 0) write(6,1000)
 1000 format("     QD   no  yes"/ &
     &       " MPI\            "/ &
     &       "  no      00   01"/ &
     &       " yes      10   11"/)

 1002 format("iargK=",I2.2," N = ",i13," = ",i1,"x10^",i2.2,"  KASE = ",&
     &       I2.2,"    cpu time requested = ",i3," hours  NPROCS=",i2/)
 1003 format("iargK=",I2.2," N = ",i13,"  KASE = ",                     &
     &       I2.2,"    cpu time requested = ",i3," hours  NPROCS=",i2 /)

      if ( m .eq. 0) then
        if (myid .eq. 0) write(6,1003) iargK,N,KASE,iarg3,numprocs
      else
        if (myid .eq. 0) write(6,1002) iargK,N,iarg1,m,KASE,iarg3,      &
     &                                 numprocs
      endif


      if (iarg4 == 'a' .and. myid .eq. 0)                               &
     &  print *,"APPENDING to previous run"

!
! Write title:
!

#ifdef USE_MPI
      if (myid .eq. 0) write(6,3001)
#else
      write(6,3001)
#endif

      DO JJ=1,JJMAX

#ifdef USE_MPI
#ifdef USE_QD
      if (myid .eq. 0) write(6,2010) N,m
#else
      if (myid .eq. 0) write(6,2020) N,m
#endif

#else

#ifdef USE_QD
      write(6,1010) N,m
#else
      write(6,1020) N,m
#endif
#endif

      i_start = 1 + myid
      i_end   = N

#ifdef USE_QD
      SSUM  = QD_ZERO
      P_SUM = QD_ZERO

      DO I8 = i_start, i_end, numprocs
        Di = I8
        AN  = qd_real((/Di, 0.0d0, 0.0d0, 0.0d0/))
        T1 = AN
        T2 = AN
        T3 = T1*T2
        P_SUM = P_SUM + T3
      ENDDO

      if (DEBUG) then
        write(6,3002) myid
 3002   format(i2.2," P_SUM (QD/MPI)      = ",$)
        call qdwrite(6,P_SUM)
      endif
!
! Collect all the partial sums
!

#ifdef USE_MPI
!     call MPI_BARRIER(MPI_COMM_WORLD, ierr)
      call MPI_REDUCE(P_SUM,RSUM,4,MPI_DOUBLE_PRECISION,MPI_SUM,0,      &
     &                MPI_COMM_WORLD,ierr)
!
! Send the partial sum P_SUM from process myid to root
!
        if (DEBUG) then
          write(6,3005) myid
 3005   format(i2.2," ====> P_SUM (QD/MPI)      = ",$)
          call qdwrite(6,P_SUM)
        endif

        call MPI_Gather(P_SUM,4,MPI_DOUBLE_PRECISION,                   &
     &                  Q_SUM(myid+1),4,MPI_DOUBLE_PRECISION,root,      &
     &                  MPI_COMM_WORLD, ierr)

      call MPI_BARRIER(MPI_COMM_WORLD, ierr)

      if (DEBUG .and. myid .eq. 0) then
        do ii=1,numprocs
          write(6,3004) myid*100 + ii
 3004     format(i4.4," Q_SUM(ii) (QD/MPI)      = ",$)
          call qdwrite(6,Q_SUM(ii))
        enddo
      endif

!
!  Let root add the partial sums
!
      if (myid .eq. 0) then
        do ii=1,numprocs
          SSUM = SSUM + Q_SUM(ii)
        enddo
      endif


#else
      SSUM = P_SUM
#endif

#else
        SSUM  = 0
        P_SUM = 0

        DO I8 = i_start, i_end, numprocs
          T1 = I8
          T2 = T1
          T3 = T1 * T2
          P_SUM = P_SUM + T3
        ENDDO

#ifdef USE_MPI
      call MPI_REDUCE(P_SUM,SSUM,1,MPI_INTEGER8,MPI_SUM,0,              &
     &                  MPI_COMM_WORLD,ierr)
#else
      SSUM = P_SUM
#endif

#endif

!
! Print SSUM
!

#ifdef USE_QD
        if (numprocs .gt. 1) then
          if (myid .eq. 0) then
            write(6,1042)
 1042       format("RSUM (QD/MPI)      = ",$)
            call qdwrite(6,RSUM)
          endif

          if (myid .eq. 0) then
            write(6,1043)
 1043       format("SSUM (QD/MPI)      = ",$) 
            call qdwrite(6,SSUM)
          endif

        else
          write(6,1032)
 1032     format("SSUM (QD/SER)      = ",$)
          call qdwrite(6,SSUM)
        endif
#else
        if (numprocs .gt. 1) then
          if (myid .eq. 0) then
            write(6,1022) SSUM
            if (m .gt. 6) then
              write(6,1013)
            else
              write(6,1014)
            endif
          endif
        else
          write(6,1012) SSUM
          if (m .gt. 6) then
            write(6,1013)
          else
            write(6,1014)
          endif
        endif
#endif

        N = N*10
        m = m + 1

      ENDDO

#ifdef USE_QD
      call f_fpu_fix_end(old_cw)
#endif

#ifdef USE_MPI
      call MPI_FINALIZE(ierr)
#endif

! QD
 2010 format("QD MPI JOB with N = ",I14,"  m=",I2.2)
 2020 format("I8 MPI JOB with N = ",I14,"  m=",I2.2)

 3001 format("PROBLEM_3_QD: QD/I8 MPI/Serial"/)

! I8
 1010 format("QD SERIAL JOB with N = ",I14,"  m=",I2.2)
 1020 format("I8 SERIAL JOB with N = ",I14,"  m=",I2.2)
 1012 format("SSum (I8/SER)      = ",I20,$)
 1022 format("SSum (I8/MPI)      = ",I20,$)
 1013 format("     <== OVERFLOW "/)
 1014 format(" "/)

      stop
      end

/* END SOURCE file sum_QD_reduction.f90 */

To run some test cases we need to decide on the following options:

(1) Which KASE ?  There are four possibilities:

!       QD   no  yes
!   MPI\
!    no      00   01
!   yes      10   11

For the MPI column we can use MPI (yes) or SERIAL (no), and for row QD
we can use QD (yes) or INTEGER*8 (no). Thus specifying KASE=11 means
we selected MPI/QD while for KASE=01 we would select SERIAL/QD.

(2) Number of processors required, NPROCS. For SERIAL jobs this must be
    set to 1, for MPI jobs it should be greater than 1.

(3) To specify N we specify 2 integers: <lead_digit> and <npow> such that
    N = <lead_digit>*10^<npow>  (i.e. <npow> is the integer m).  If we use
    <lead_digit>=1 and m > 1 we can check our results against the tabulated
    results in our proposition table. Note that any integral value of N can be
    used by specifying m=0, say we want N=32575, that translates into:
    <lead_digit> = 32575 and m=0 i.e. N=32575*10^0 = 32575.

(4) When a job is submitted we like to be able to specify the time. We can
    do this by specifying <cpu_time_in_hours>.

Rather than constantly adjusting the source file sum_QD_reduction.f90 and the makefile, the following script has been designed to handle these input parameters:

/* START of mm script */
 #!/bin/bash

# file name = mm  (script)

# This script requires at least five arguments - sixth argument is optional:
# sixth argument = r ==> remove previous output file (default)
#                = a ==> append output from this run to previous output file
#   Usage:
#
#      ./mm   <KASE>  <NPROCS>  <lead_digit>  <npow>  <cpu_time_in_hours> [ <append_rm_switch> ]

function usage () {
   cat <<EOF

Usage: $scriptname <KASE>  <NPROCS> <lead_digit>  <npow>  <cpu_time_in_hours> [ <append_rm_switch> ]
       The default for <append_rm_switch> is r

EOF
   exit 0
}

if (( $# != 5  && $# != 6 )) ; then
  usage
  exit 1
fi

if (( $# == 5 )) ; then
  MY_IARG6=r
else
  MY_IARG6=$6
fi

echo
echo "<KASE>: $1"
echo "<NPROCS>: $2"
echo "<lead_digit>: $3"
echo "<npow>: $4"
echo "<cpu_time_in_hours>: $5"
echo "<append_rm_switch>: $MY_IARG6"
echo

MY_NPROCS=$2

if (( $MY_NPROCS < 1 )); then
    MY_NPROCS=1
fi

if (( $1 == 0 )) ; then
  MPI_VAL=DO_NOT_USE_MPI
  QD_VAL=DO_NOT_USE_QD
  if (( $MY_NPROCS > 1 )); then
    MY_NPROCS=1
  fi
fi

if (( $1 == 1 )) ; then
  MPI_VAL=DO_NOT_USE_MPI
  QD_VAL=USE_QD
  if (( $MY_NPROCS > 1 )); then
    MY_NPROCS=1
  fi
fi


if (( $1 == 10 )) ; then
  MPI_VAL=USE_MPI
  QD_VAL=DO_NOT_USE_QD
fi

if (( $1 == 11 )) ; then
  MPI_VAL=USE_MPI
  QD_VAL=USE_QD
fi

echo "MY_NPROCS=$MY_NPROCS"
echo "MPI_VAL=$MPI_VAL"
echo "QD_VAL=$QD_VAL"

# Remove old objects and executable

make clean

# Compile with given macros, sleep and submit job with $IARGK $IARG0 $IARG1 $IARG2 $IARG3 $IARG4

make D_MPI=$MPI_VAL D_QD=$QD_VAL
sleep 10
IARGK=$1 IARG0=$MY_NPROCS  IARG1=$3  IARG2=$4  IARG3=$5  IARG4=$MY_IARG6  make D_MPI=$MPI_VAL D_QD=$QD_VAL  brun


/* END of mm script */

As an example, suppose we want to run the program with data type QD using MPI on 64 processors for N=1000000 (i.e. N=1x10^6, which translates to <lead_digit>=1 and m=<npow>=6), with a time limit of 2 hours. Then all we need to do is type:

  ./mm 11 64 1 6 2

i.e. we specified <KASE>=11 (MPI with the QD data type) and requested 64 processors for MPI (<NPROCS>=64). The value of N is given by <lead_digit>=1 and <npow>=6, and finally <cpu_time_in_hours>=2.

Note that to use the optional parameter <append_rm_switch> we need to remove the JobID parameter %J from the makefile; then we can specify <append_rm_switch>=a and the output will be appended to the appropriate file if it exists. With %J in the output file name (in the makefile) the name is different each time a job is submitted, since the JobID %J changes.

If you would like to run several cases you can specify them in another script file. Consider the following example of such a script:

/* START of script file sub_mm3 to run several cases */


#!/bin/bash

# file name = sub_mm3

./mm 00 0 24250  0  1
./mm 00   4  1   6  1
./mm 01   4  1   8  1
./mm 10   4  1   6  1
./mm 11   4  1   8  1
./mm 11  16  1  10  1
./mm 11  32  1  10  1
./mm 11  64  1  10  1
./mm 11  96  1  10  1

/* END of script file sub_mm3 to run several cases */

The above script file sub_mm3 can be run by typing:

./sub_mm3

and it would submit nine jobs. The first one, with <KASE>=00 (i.e. SERIAL and INTEGER*8), runs N=24250 with <cpu_time_in_hours>=1. To check the validity of the first job we can use the analytical expression Sum_N=N*(N+1)*(2*N+1)/6. Here is the OCTAVE code to calculate Sum_N:

 octave:1> format long
 octave:2> N=24250
 N =  24250
 octave:3> Sum_N=N*(N+1)*(2*N+1)/6
 Sum_N =  4753799243625

and the output from our first job (./mm 00 0 24250 0 1) produced:

     QD   no  yes
 MPI\
  no      00   01
 yes      10   11

iargK=00 N =         24250  KASE = 00    cpu time requested =   1 hours

PROBLEM_3_QD: QD/I8 MPI/Serial

I8 SERIAL JOB with N =          24250  m=00
SSum (I8/SER)      =        4753799243625

How does it work

At the highest level, the user passes the parameters described above to the mm script. After some verification of the parameters, the mm script invokes the makefile twice via the make utility: first to compile the code and produce an executable, and then to submit a job with the same parameters. The makefile takes all the parameters as arguments and passes them on to the executable.

Timing the jobs

If you know the JobID of a particular job (e.g. 75134), issuing the command:

   qstat -f 75134  | head -21

once the job has completed will display its statistics (e.g. cput and walltime). For the last entry in the script sub_mm3 we had:

./mm 11  96  1  10  1

Using the qstat -f <JobID> command and manually appending its first 20 or so lines to the output file produced:

     QD   no  yes
 MPI\
  no      00   01
 yes      10   11

iargK=11 N =   10000000000 = 1x10^10  KASE = 11    cpu time requested =   1 hours  NPROCS=96

PROBLEM_3_QD: QD/I8 MPI/Serial

QD MPI JOB with N =    10000000000  m=10
RSUM (QD/MPI)      =    3.33333333383333335534023255552000000000000000000000000000000000E+29
SSUM (QD/MPI)      =    3.33333333383333333335000000000000000000000000000000000000000000E+29

Job Id: 89859.orc-admin2.orca.sharcnet
    Job_Name = ./x.im#11#96#1#10#1#r
    Job_Owner = nickc@orc-login2.orca.sharcnet
    resources_used.cput = 03:07:10
    resources_used.walltime = 00:02:10
    job_state = C
    queue = mpi
    ctime = Sun Nov  4 11:53:14 2012

This case for N=10^10 with NPROCS=96 using MPI had a walltime of 2 min 10 sec. Note also in the output above that RSUM (the componentwise MPI_REDUCE over the four QD components) deviates from the true value after about 17 digits, while SSUM (partial sums gathered to the root and added there in full QD arithmetic) matches the Proposition Rule exactly. This is precisely the failure of the componentwise scheme anticipated in the Introduction.

Exactly the same case run using OpenMP with NPROCS=16 had a walltime of 11 min 27 sec. See:

 https://www.sharcnet.ca/help/index.php/OpenMP_Reduction_for_User_Defined_Data_Types#PROBLEM_3_QD:_QD.2FQuad_with.2Fwithout_OpenMP

and go towards the end of that document where you will find:

OUT_P3_QD_OMP    = 00:11:27

Also, note the entry:

OUT_P3_QD_SER    = 01:13:09

which means that the SERIAL job took 1 hour, 13 min and 9 sec.