From Documentation
Revision as of 10:38, 30 November 2011 by Ppomorsk (Talk | contribs)

Jump to: navigation, search

LAPACK and ScaLAPACK Examples

Nick Chepurniy, Ph.D.
e-mail: nickc@sharcnet.ca
HPC Consultant - SHARCNET http://www.sharcnet.ca
Compute Canada http://www.computecanada.org


We start by defining a special matrix called STRIDWAD (Sparse TRIDiagonal With Anti-Diagonal). This matrix will be stored using the dense format (i.e. all elements are defined and are stored in memory). The matrix has the following properties:

It can be of any order N (where N is an even integer)
When STRIDWAD is multiplied by the vector X whose components are X(i)=i it produces a vector C 
whose elements are C(i)=N+1 except for the last element which is C(N)=2N+2.

These characteristics make it very easy to work with STRIDWAD while developing computer programs for testing purposes.

Defining the STRIDWAD matrix

In practice we will use large values of N but for illustration purposes the STRIDWAD system for N=6 would be:

[  3. -1.  0.  0.  0.  1. ]  [ 1. ]      [  7. ]
[ -1.  3. -1.  0.  1.  0. ]  [ 2. ]      [  7. ]
[  0. -1.  3.  0.  0.  0. ]  [ 3. ]      [  7. ]
[  0.  0.  0.  3. -1.  0. ]  [ 4. ]   =  [  7. ]
[  0.  1.  0. -1.  3. -1. ]  [ 5. ]      [  7. ]    
[  1.  0.  0.  0. -1.  3. ]  [ 6. ]      [ 14. ]



\begin{bmatrix}
    \quad 3 &        -1 & \quad   0 & \quad   0 & \quad  0 & \quad  1  \\
       -1   & \quad   3 &        -1 & \quad   0 & \quad  1 & \quad  0  \\
    \quad 0 &        -1 & \quad   3 & \quad   0 & \quad  0 & \quad  0  \\
    \quad 0 & \quad   0 & \quad   0 & \quad   3 &  -1      & \quad  0  \\
    \quad 0 & \quad   1 & \quad   0 &        -1 & \quad  3 &       -1  \\
    \quad 1 & \quad   0 & \quad   0 & \quad   0 &  -1      &  \quad 3
\end{bmatrix}



For N=18 it takes this form:

[  3. -1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1. ]  [  1. ]      [ 19. ]
[ -1.  3. -1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0. ]  [  2. ]      [ 19. ]
[  0. -1.  3. -1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0. ]  [  3. ]      [ 19. ]
[  0.  0. -1.  3. -1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0. ]  [  4. ]      [ 19. ]
[  0.  1.  0. -1.  3. -1.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0. ]  [  5. ]      [ 19. ]    
[  1.  0.  0.  0. -1.  3. -1.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0. ]  [  6. ]      [ 19. ]
[  0.  0.  0.  0.  0. -1.  3. -1.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0. ]  [  7. ]      [ 19. ]
[  0.  0.  0.  0.  0.  0. -1.  3. -1.  0.  1.  0.  0.  0.  0.  0.  0.  0. ]  [  8. ]      [ 19. ]
[  0.  0.  0.  0.  0.  0.  0. -1.  3.  0.  0.  0.  0.  0.  0.  0.  0.  0. ]  [  9. ]   =  [ 19. ]
[  0.  0.  0.  0.  0.  0.  0.  0.  0.  3. -1.  0.  0.  0.  0.  0.  0.  0. ]  [ 10. ]      [ 19. ]
[  0.  0.  0.  0.  0.  0.  0.  1.  0. -1.  3. -1.  0.  0.  0.  0.  0.  0. ]  [ 11. ]      [ 19. ]
[  0.  0.  0.  0.  0.  0.  1.  0.  0.  0. -1.  3. -1.  0.  0.  0.  0.  0. ]  [ 12. ]      [ 19. ]
[  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0. -1.  3. -1.  0.  0.  0.  0. ]  [ 13. ]      [ 19. ]
[  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0. -1.  3. -1.  0.  0.  0. ]  [ 14. ]      [ 19. ]
[  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0. -1.  3. -1.  0.  0. ]  [ 15. ]      [ 19. ]
[  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0. -1.  3. -1.  0. ]  [ 16. ]      [ 19. ]
[  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0. -1.  3. -1. ]  [ 17. ]      [ 19. ]
[  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0. -1.  3. ]  [ 18. ]      [ 38. ]

Observations on the structure of the STRIDWAD systems:

All elements along the diagonal are equal to 3. 
The elements in the two subdiagonals (lower and upper diagonals) are -1. 
The diagonal going from the lower left corner to the upper right corner is called anti-diagonal.
All elements on the anti-diagonal are 1. 
When the anti-diagonal crosses the subdiagonals the 1's cancel out with the -1's.
Order N is an EVEN number.
The components of the X vector are x(i)=i and all the components of C are N+1 except for the last one.
Last component of C is 2*(N+1) where N is the order of the STRIDWAD system.

So in fact what we have established here is an IDENTITY which for computational purposes can be used in the following two ways:

If we assume X is known and defined by x(i)=i we can use a matrix multiply program 
to calculate the product of the STRIDWAD matrix and X and compare the results to C.
If we assume X is unknown and we use the C as defined above we can
use a matrix inversion program and calculate the unknown vector X.

Program to calculate the RHS vector of the STRIDWAD system

Here is the listing of the program used to calculate the Right Hand Side (RHS) vector assuming that X is known and defined by x(i)=i:

      program TMA
      implicit none
 
      integer   :: istat,i,N_PRT
      real(kind=8),dimension(:,:),allocatable  :: A
      real(kind=8),dimension(:,:),allocatable  :: C
      real(kind=8),dimension(:,:),allocatable  :: D
      real(kind=8),dimension(:,:),allocatable  :: X
 
      integer   ::  NH, NH1, nprocs, KOUNT
 
      integer*8 :: N, mem_allocated
      real(kind=8) :: mem_gb, EPS
 
      parameter (N =  55388)
!     parameter (N =     18)
 
      parameter (N_PRT = 40)
 
! --------------------------------------------------------------------------
 
      interface
 
        subroutine init_my_matrix (n,a)
        implicit none
        integer*8 :: N
        real(kind=8)    :: a(:,:)
        end subroutine init_my_matrix
 
        subroutine init_my_vector (N,X)
        implicit none
        integer*8 :: N
        real(kind=8)    :: X(:,:)
        end subroutine init_my_vector
 
        subroutine init_my_rhs (n,c)
        implicit none
        integer*8 :: N
        real(kind=8)    :: c(:,:)
        end subroutine init_my_rhs
 
      end interface
 
! ---------------------------------------------------------------------------
 
!
!     Trace intermediate steps only for small values of N
 
      EPS = 0.00000001
 
      NPROCS = 1
 
!     N must be EVEN
 
      NH    = N/2
      NH1   = NH + 1
 
      IF (2*NH .NE. N) THEN
        WRITE(6,2001) N
 2001   FORMAT("N must be EVEN, got N = ",i9)
        STOP
      ENDIF
 
!
! -----    Print out nprocs   -----
 
      WRITE(6,101) nprocs,N
  101 FORMAT(" nprocs = ",i3,"      N = ",i12)
      call flush(6_4)
!
!
! -----   Allocate LHS, RHS, and pivot   -----
!
      allocate (a(N, N ), stat=istat)
 
      if (istat/=0) then
        print *,"ERROR: ALLOCATE FAILS for A"
        call flush(6_4)
        stop "ERR:ALLOCATE FAILS for A"
      else
        mem_allocated = 8*N*N
        mem_gb = mem_allocated / 1000000000.0
        WRITE(6,103) mem_allocated,mem_gb
  103   FORMAT(" memory alloc (A)    = ",i12," bytes i.e. ", f8.5," Gb")
      endif
 
      allocate (c(N,1), stat=istat)
 
      if (istat/=0) then
        print *,"ERROR: ALLOCATE FAILS for C"
        call flush(6_4)
        stop "ERR:ALLOCATE FAILS for C"
      else
        mem_allocated = 8*N
        mem_gb = mem_allocated / 1000000000.0
        WRITE(6,104) mem_allocated,mem_gb
  104   FORMAT(" memory alloc (C)    = ",i12," bytes i.e. ", f8.5," Gb")
      endif
 
      allocate (D(N,1), stat=istat)
 
      if (istat/=0) then
        print *,"ERROR: ALLOCATE FAILS for D"
        call flush(6_4)
        stop "ERR:ALLOCATE FAILS for D"
      else
        mem_allocated = 8*N
        mem_gb = mem_allocated / 1000000000.0
        WRITE(6,105) mem_allocated,mem_gb
  105   FORMAT(" memory alloc (D)    = ",i12," bytes i.e. ", f8.5," Gb")
      endif
 
      allocate (X (n,1), stat=istat)
 
      if (istat/=0)  then
        print *,"ERROR: ALLOCATE FAILS for X"
        call flush(6_4)
        stop "ERR:ALLOCATE FAILS for X"
      else
        mem_allocated = 8*N
        mem_gb = mem_allocated / 1000000000.0
        WRITE(6,106) mem_allocated,mem_gb
  106   FORMAT(" memory alloc (X)    = ",i12," bytes i.e. ", f8.5," Gb")
      endif
 
      mem_allocated = 8*N*N + 8*N + 8*N
      mem_gb = mem_allocated / 1000000000.0
 
      WRITE(6,102) mem_allocated,mem_gb
  102 FORMAT(" memory allocated    = ",i12," bytes i.e. ", f8.5," Gb")
      call flush(6_4)
 
!
! -----    Initialize:  A X = C
!
      call init_my_matrix (N,A)
      call init_my_vector (N,X)
      call init_my_rhs    (N,C)
 
      D = MATMUL(A,X)
!
      write (6,500)
  500 FORMAT(/"          C(I): ")
      IF (N .le. N_PRT) THEN
        DO I=1,N
          WRITE (6,501) I,C(I,1)
  501     FORMAT(" I = ",I3," ",f6.2)
        ENDDO
 
      ELSE
 
!     For large N print only last N_PRT elements of vector
 
        DO I=N-N_PRT,N
          WRITE (6,502) I,A(I,I),D(I,1)
  502     FORMAT(" I=",I10,"  A(I,I) = ",f13.2,"  D(I,1) = ",f13.2)
        ENDDO
 
      ENDIF
 
      KOUNT = 0
 
      DO I=1,N
        if (abs(C(I,1)-D(I,1)) .gt. EPS) then
          KOUNT = KOUNT + 1
          write(6,107) I
  107     format("For I = ",i12," C .neq. D")
        endif
      ENDDO
 
      write(6,108) KOUNT
  108 format(/"KOUNT = ",i10)
!
! -----    Cleanup arrays -----
!
      deallocate (A,C,D,X)
!
      end program TMA
!
!-----------------------------------------------------------------------
!
      SUBROUTINE INIT_MY_MATRIX (N,A)
      IMPLICIT NONE
      INTEGER      :: NH,NH1
      INTEGER*8    :: N
      REAL(kind=8) :: A(:,:)
 
      INTEGER      :: I
 
      NH  = N/2
      NH1 = NH + 1
 
!     Diagonal elements
 
      DO I = 1,N
        A(I,I) = 3.0
      ENDDO
 
!     Upper diagonal
 
      DO 180 I = 1,N-1
        if (I .EQ. NH) go to 180
        A(I,I+1) = -1.0d0
  180 CONTINUE
 
!     Lower diagonal
 
      DO 160 I = 1,N-1
      if (I .EQ. NH) go to 160
      A(I+1,I) = -1.0d0
  160 CONTINUE
 
!
!     ANTI-DIAGONAL
!
      DO 190 I = 1,N
        if (I .EQ. NH .OR. I .EQ. NH1) go to 190
 
        A(I,N-I+1) = 1.0d0
  190 CONTINUE
 
      return
      end
 
!
!-----------------------------------------------------------------------
!
      SUBROUTINE INIT_MY_RHS (N,C)
      IMPLICIT NONE
      INTEGER*8    :: N
      REAL(kind=8) :: C(:,:)
 
      INTEGER :: I
 
      DO I= 1, N-1
        C(I,1) = N + 1.0
      ENDDO
 
      C(N,1) = 2*N + 2
 
      return
      end
!
!-----------------------------------------------------------------------
!
      SUBROUTINE INIT_MY_VECTOR (N,X)
      IMPLICIT NONE
      INTEGER*8    :: N
      REAL(kind=8) :: X(:,:)
 
      INTEGER :: I
 
      DO I= 1, N
        X(I,1) = I
      ENDDO
 
      return
      end

Important Notes on the program and Output Results

There are several things which should be pointed out:

 (1) The order of the STRIDWAD system (i.e. parameter N) has been declared as INTEGER*8
     in order that there be no overflow when memory is calculated using mem_allocated.
 (2) To submit the job the command: 
        sqsub -t  -r 60m -q threaded -n 4 -o ${CLU}_TEST_MEM_8_%J  ./test_mem_alloc8
     was used so that all the memory of the node would be available to this program.

Output for N=18 was as follows:

 nprocs =   1      N =           18
 memory alloc (A)    =         2592 bytes i.e.  0.00000 Gb
 memory alloc (C)    =          144 bytes i.e.  0.00000 Gb
 memory alloc (D)    =          144 bytes i.e.  0.00000 Gb
 memory alloc (X)    =          144 bytes i.e.  0.00000 Gb
 memory allocated    =         2880 bytes i.e.  0.00000 Gb
 
           C(I):
 I =   1  19.00
 I =   2  19.00
 I =   3  19.00
 I =   4  19.00
 I =   5  19.00
 I =   6  19.00
 I =   7  19.00
 I =   8  19.00
 I =   9  19.00
 I =  10  19.00
 I =  11  19.00
 I =  12  19.00
 I =  13  19.00
 I =  14  19.00
 I =  15  19.00
 I =  16  19.00
 I =  17  19.00
 I =  18  38.00
 
KOUNT =          0

Output for N=55388 (the highest possible value that fits in memory) was:

 nprocs =   1      N =        55388
 memory alloc (A)    =  24542644352 bytes i.e. 24.54264 Gb
 memory alloc (C)    =       443104 bytes i.e.  0.00044 Gb
 memory alloc (D)    =       443104 bytes i.e.  0.00044 Gb
 memory alloc (X)    =       443104 bytes i.e.  0.00044 Gb
 memory allocated    =  24543530560 bytes i.e. 24.54353 Gb
 
          C(I):
 I=     55348  A(I,I) =          3.00  D(I,1) =      55389.00
 I=     55349  A(I,I) =          3.00  D(I,1) =      55389.00
 I=     55350  A(I,I) =          3.00  D(I,1) =      55389.00
...
 I=     55386  A(I,I) =          3.00  D(I,1) =      55389.00
 I=     55387  A(I,I) =          3.00  D(I,1) =      55389.00
 I=     55388  A(I,I) =          3.00  D(I,1) =     110778.00
 
KOUNT =          0

LAPACK EXAMPLE

In the previous section the STRIDWAD matrices of oder N were defined. We now want to use those matrices as input to the LAPACK's routines dgetrf/dgetrs to find the inverses of different orders N. The programs should also report the memory used, cpu time and elapsed time.

Compiling and Submitting Jobs

Following makefile is used to compile and submit the jobs to a compute node:

# Determine the platform that is being used for this compilation
 
OS:= $(shell uname)
PLATFORM:= $(shell uname)_$(shell uname -m)
 
# Set the dafault parameters N (order of matrix times 1000)
# and P (number of threads)
 
N:= 1
P:= 4
 
# Specify executable, source and required libraries :
 
EXE = DP_LAPACK_SOLVE
EXT:= f90
SRC:= ${EXE}.$(EXT)
LIB:= -L/opt/sharcnet/acml/current/pathscale64/lib -lacml
 
 
$(EXE): $(SRC)
        @echo Check if N is in the range [1,40]
        ./check_N $N
#       @echo COMPILING LAPACK DP program with N = $(N),000
#       @echo PLATFORM used = $(PLATFORM)
#       @echo Edit the parameter_N.inc file
        cp parameter_N.inc_basic  parameter_N.inc
        ./ed_parm_N parameter_N.inc $N
#       @echo "Using following parameter_N.inc file:"
#       cat parameter_N.inc
#       @echo "LIB = " $(LIB)
        pathf90 -o $(EXE) $(SRC) $(LIB)
        cp sub_batch_DP_LAPACK_SOLVE_basic sub_batch_DP_LAPACK_SOLVE
        ./ed_procs_P sub_batch_DP_LAPACK_SOLVE $P
        ./sub_batch_DP_LAPACK_SOLVE
 
 
superclean: clean
        @echo ' '
        @echo  Removing Output files generated by $(EXE)
        @echo '--------------------------------------------------'
        @echo ' '
        rm -rf ${CLU}_SERIAL_DP_SOLVE_*
        @echo ' '
        @echo  Output files have been removed.
        @echo '-------------------------------'
        @echo ' '
 
clean:
        @echo ' '
        @echo  Cleaning executable $(EXE)
        @echo '-----------------------------------'
        @echo ' '
        rm -rf $(EXE)
        rm -rf ED_ERROR
        @echo ' '
        @echo  Executable $(EXE) has been removed.
        @echo '--------------------------------------------'
        @echo ' '
 
 
help:
        @echo "+-----------------------------------------------------------------------------------+"
        @echo "|                                                                                   |"
        @echo "|              Makefile for running LAPACK example using dgetrf/dgetrs              |"
        @echo "|                                                                                   |"
        @echo "| Usage: make                      compile program for N=1 (i.e. order=1,000)       |"
        @echo "|                                                                                   |"
        @echo "|        make N=n                  compile program for N=n (i.e. order=n*1000)      |"
        @echo "|                                  where  1 <= n <= 40                              |"
        @echo "|                                                                                   |"
        @echo "|        make P=p                  submit job with -q threaded -n p                 |"
        @echo "|                                  where  1 <= p <= 4 or 6 (depending on cluster)   |"
        @echo "|                                                                                   |"
        @echo "|        make superclean           remove output files and executable               |"
        @echo "|                                                                                   |"
        @echo "|        make clean                remove executable                                |"
        @echo "|                                                                                   |"
        @echo "|        make help                 display makefile usage information               |"
        @echo "|                                                                                   |"
        @echo "+-----------------------------------------------------------------------------------+"
        @echo " "
        @echo PLATFORM $(PLATFORM)

The make utility will use above makefile and we can override any parameters appearing in the the makefile. Thus, if we type the command:

make N=2 P=6

make will use parameter N=2 and P=6 instead of those appearing in the makefile.

Then make proceeds to specify the executable, source and required library and executes the commands required to generate the target $(EXE). The first command for generating this target is to invoke the script check_N which takes as argument the parameter $N.


#!/bin/bash
 
# script file = check_N
 
if [ $# -ne 1 ]; then
  echo "Must have 1 argument: value of N"
  echo "Usage: $0  <value_of_N> "
  exit -1
fi
 
N=$1
 
if [ "$N"  -lt 1 ] || [ "$N"  -gt 40 ]
  then
    echo "Value of N is out of range !"
    echo " 1 <= N <= 40 "
    exit -1
fi
 
exit


If the value of $N is in the specified range the check_N script returns 0 and make proceeds to the next command, but if $N is out of the specified range, then the check_N script returns -1 and the make command terminates with an error message.

If the check_N script returned 0 then make proceeds to the next command which copies the parameter_N.inc_basic file into the parameter_N.inc file. The next script ed_parm_N takes two arguments: parameter_N.inc and $N to modify the file parameter_N.inc which is used by the fortran program that we will compile in the next command. Here is a listing of the second script:


#!/bin/bash
 
# script file = ed_parm_N
 
if [ $# -ne 2 ]; then
  echo "Must have 2 argument: Filename to change and value to substitute x for"
  echo "Usage: $0  <filename>  <value_for_x> "
  exit
fi
 
Filename=$1
 
ed $Filename <<EOF 2> ED_ERROR
1,2s/x/$2/
w
q
EOF


Now, the make command is ready to compile the source program using the updated version of the include file parameter_N.inc.

If errors are detected in the compilation then make will terminate, otherwise the next command is executed in the makefile script, which is the script:

#!/bin/bash
 
# script file = ed_procs_P
 
if [ $# -ne 2 ]; then
  echo "Must have 2 argument: Filename to change and number of procs to use"
  echo "Usage: $0  <filename>  <num_of_processors> "
  exit -1
fi
 
if [ `printenv CLU` = "nar" ] ; then
  let MXP=4
fi
 
if [ `printenv CLU` = "saw" ] ; then
  let MXP=6
fi
 
if [ $2 -lt 1 ] || [ $2 -gt $MXP ] ; then
  echo "Parameter p is out of range for cluster ${CLUSTER}"
  exit -1
fi
 
Filename=$1
 
ed $Filename <<EOF 2> ED_ERROR
1,6s/nprocs/$2/
w
q
EOF

If p is out of range the make command exits, otherwise the next command is executed which is a script submitting the job to a queue (in which the parameter nprocs will be replaced by what is specified in the makefile):

#!/bin/bash
 
# script file = sub_batch_DP_LAPACK_SOLVE_basic
 
sqsub -t -r 60m -q threaded -n nprocs -o ${CLU}_SERIAL_DP_SOLVE_%J  ./DP_LAPACK_SOLVE

The Fortran and Other Files

For completnes we list next the whole fortran source file and working files parameter_N.inc_basic and sub_batch_DP_LAPACK_SOLVE_basic:

      program DP_LAPACK_SOLVE
!
!   file name = DP_LAPACK_SOLVE.f90
!
      implicit none
 
      integer   :: istat,info,i,j,N_PRT,milestone
      real(kind=8),dimension(:,:),allocatable  :: a
      real(kind=8),dimension(:,:),allocatable  :: c
      integer,dimension(:),allocatable :: ipiv
 
      integer   :: root, ierr
      integer*8 :: N, mem_allocated
      real(kind=8) :: mem_gb, EPS
 
!     parameter (N = 64000)
!     parameter (N = 100000)
!     parameter (N =  35000)
 
      include "parameter_N.inc"
 
      parameter (N_PRT = 40)
 
      integer   ::       context, iam, pnum
      integer   ::       mycol, myrow, nb
      integer   ::       npcol, nprocs, nprow
 
      integer   ::       ITEM, IAorC, NH, NH1, DEBUG
! --------------------------------------------------------------------------
 
      integer   :: day(10),hour(10),minute(10),second(10),millisec(10)
      character*30 :: descr_milestone(10)
 
      integer    :: day_beg,hour_beg,minute_beg,second_beg,millisec_beg
      integer    :: day_end,hour_end,minute_end,second_end,millisec_end
      character*23 :: date_time
      real(kind=8) :: T1, T2
      real(kind=8) :: TOT_TIME
      real(kind=8) :: tm
 
! --------------------------------------------------------------------------
 
      interface
 
        subroutine init_my_matrix (n,a,nprow,npcol,myrow,mycol)
        implicit none
        integer*8 :: N
        integer :: nprow,npcol,myrow,mycol
        real(kind=8)    :: a(:,:)
        end subroutine init_my_matrix
 
        subroutine init_my_rhs (n,c,nprow,npcol,myrow,mycol)
        implicit none
        integer*8 :: N
        integer :: nprow,npcol,myrow,mycol
        real(kind=8)    :: c(:,:)
        end subroutine init_my_rhs
 
 
        subroutine elapsed_time(day_beg,hour_beg, minute_beg,second_beg,&
     &  millisec_beg, day_end, hour_end, minute_end, second_end,        &
     &  millisec_end,tid,ITEM )
        implicit none
        integer :: day_beg,hour_beg,minute_beg,second_beg,millisec_beg
        integer :: day_end,hour_end,minute_end,second_end,millisec_end
        integer :: day,hour, minute, second, millisec, tid, ITEM
        end subroutine elapsed_time
 
        subroutine timestamp(date_time,day,hour,minute,second,millisec)
        implicit none
        character*23 :: date_time
        integer      :: day, hour, minute, second, millisec
        integer      :: elements(8)
        character*3  :: months(12)
        end subroutine timestamp
 
      end interface
 
! ---------------------------------------------------------------------------
 
!
!     Trace intermediate steps only for small values of N
 
      DEBUG = 1
      root = 0
      IAM  = 0
      NPROCS = 1
 
!     N must be EVEN
 
      NH    = N/2
      NH1   = NH + 1
 
      IF (2*NH .NE. N) THEN
        WRITE(6,2001) N
 2001   FORMAT("N must be EVEN, got N = ",i9)
        STOP
      ENDIF
 
      IF ( N .gt. N_PRT ) DEBUG = 0
!
! ---------------------------------------------------------------------------
 
     ITEM = 1000 + iam
 
      call timestamp(date_time,day_beg,hour_beg,minute_beg,second_beg,  &
     &               millisec_beg )
 
      milestone = 1
      IF (IAM .eq. 0) write(6,1006) ITEM,iam,date_time,milestone
 1006 format(I4," iam ",i3," BEGIN ",a23," milestone = ",i2)
 1008 format(I4," iam ",i3,"       ",a23," milestone = ",i2)
      call flush(6_4)
 
      call cpu_time(T1)
 
      call timestamp( date_time,day(milestone), hour(milestone),        &
     &     minute(milestone),second(milestone),millisec(milestone) )
 
      descr_milestone(milestone) = "Init blacs_gridinit, allocate"
!
! -----    Print out nprocs   -----
 
        IF (IAM .EQ. 0) THEN
          WRITE(6,101) nprocs,N
  101     FORMAT(" nprocs = ",i3,"      N = ",i12)
          call flush(6_4)
        ENDIF
!
!
! -----   Allocate LHS, RHS, and pivot   -----
!
      allocate (a(N, N ), stat=istat)
      if (istat/=0) then
        print *,"ERROR: ALLOCATE FAILS for A"
        call flush(6_4)
        stop "ERR:ALLOCATE FAILS for A"
      else
        mem_allocated = 8*N*N
        mem_gb = mem_allocated / 1000000000.0
        WRITE(6,103) mem_allocated,mem_gb
  103   FORMAT(" memory alloc (A)    = ",i12," bytes i.e. ", f9.6," Gb")
      endif
 
      allocate (c(N,1), stat=istat)
      if (istat/=0) then
        print *,"ERROR: ALLOCATE FAILS for C"
        call flush(6_4)
        stop "ERR:ALLOCATE FAILS for C"
      else
        mem_allocated = 8*N
        mem_gb = mem_allocated / 1000000000.0
        WRITE(6,104) mem_allocated,mem_gb
  104   FORMAT(" memory alloc (C)    = ",i12," bytes i.e. ", f9.6," Gb")
      endif
 
      allocate (ipiv (n), stat=istat)
      if (istat/=0)  then
        print *,"ERROR: ALLOCATE FAILS for IPIV"
        call flush(6_4)
        stop "ERR:ALLOCATE FAILS for IPIV"
      else
        mem_allocated = 4*N
        mem_gb = mem_allocated / 1000000000.0
        WRITE(6,105) mem_allocated,mem_gb
  105   FORMAT(" memory alloc (ipiv) = ",i12," bytes i.e. ", f9.6," Gb")
      endif
 
      mem_allocated = 8*N*N + 8*N + 4*N
      mem_gb = mem_allocated / 1000000000.0
 
        IF (IAM .EQ. 0) THEN
          WRITE(6,102) mem_allocated,mem_gb
  102   FORMAT(" memory allocated    = ",i12," bytes i.e. ", f9.6," Gb")
          call flush(6_4)
        ENDIF
 
!
! -----    Initialize LHS and RHS
!
      call init_my_matrix (n,a,nprow,npcol,myrow,mycol)
      call init_my_rhs    (n,c,nprow,npcol,myrow,mycol)
!
      milestone = milestone + 1
      IF (IAM .eq. 0) write(6,1008) ITEM,iam,date_time,milestone
      call flush(6_4)
      call timestamp( date_time,day(milestone), hour(milestone),        &
     &     minute(milestone),second(milestone),millisec(milestone) )
 
      descr_milestone(milestone) = "call dgetrf"
!
! -----    Do the work -----
!
      call dgetrf( N, N, A, N, ipiv, info )
 
      if (iam.eq.0 .AND. DEBUG .eq. 1)                                  &
     &  write (6,*) "info (from dgetrf) = ", info
 
      milestone = milestone + 1
      IF (IAM .eq. 0) write(6,1008) ITEM,iam,date_time,milestone
      call flush(6_4)
      call timestamp( date_time,day(milestone), hour(milestone),        &
     &     minute(milestone),second(milestone),millisec(milestone) )
 
      descr_milestone(milestone) = "call dgetrs"
 
      call dgetrs( 'N', N, 1, A, N, ipiv, c, N, info )
 
      if (iam.eq.0 .AND. DEBUG .eq. 1)                                  &
     &  write (6,*) "info (from dgetrs) = ", info
!
 
      milestone = milestone + 1
      IF (IAM .eq. 0) write(6,1008) ITEM,iam,date_time,milestone
      call flush(6_4)
      call timestamp( date_time,day(milestone), hour(milestone),        &
     &     minute(milestone),second(milestone),millisec(milestone) )
 
      descr_milestone(milestone) = "get_solution --> root"
 
 
 
      IF (IAM.EQ.0) THEN
        IF (N .le. N_PRT) THEN
          ITEM = 6000
          write (6,500) ITEM
  500     FORMAT(I4," SOLUTION: ")
          DO I=1,N
            ITEM = 6000 + I
            WRITE (6,501) ITEM,I,C(I,1)
  501       FORMAT(I4," I = ",I3," ",f6.2)
          ENDDO
 
        ELSE
 
!       For large N print only last N_PRT elements of vector
 
          DO I=N-N_PRT,N
            WRITE (6,502) I,C(I,1)
  502       FORMAT(" I=",I10," ",f13.2)
          ENDDO
 
        ENDIF
 
      ENDIF
 
 
      milestone = milestone + 1
      IF (IAM .eq. 0) write(6,1008) ITEM,iam,date_time,milestone
      call flush(6_4)
      call timestamp( date_time,day(milestone), hour(milestone),        &
     &     minute(milestone),second(milestone), millisec(milestone) )
 
      descr_milestone(milestone) = "print and clean up"
 
!     Now report the milestone time differences
 
      ITEM = 7000
 
      IF (IAM .eq. 0) WRITE(6,1007) ITEM,iam,milestone
 1007 FORMAT(I4," iam = ",i3," FINAL value of milestone = ",i2)
 
      do i=1,milestone-1
        ITEM = 7000 + I
        tm = 86400.0*(day(i+1)-day(i)) +  3600.0*(hour(i+1)-hour(i)) +  &
     &       60.0*(minute(i+1)-minute(i)) + (second(i+1)-second(i))  +  &
     &       0.001 * (millisec(i+1)-millisec(i))
        IF (IAM .EQ. 0) write(6,1005) ITEM,iam, i, i+1, tm,             &
     &                  descr_milestone(i)
 1005   format(I4," iam ",i3," Time from milestone ",i2," to milestone "&
     &        ,I2," equals ",f10.2,3x,a30)
      enddo
 
      call timestamp( date_time,day_end,hour_end,minute_end,second_end, &
     &                millisec_end  )
 
      ITEM = ITEM + 1
      IF (IAM .EQ. 0) write(6,1011) ITEM,iam,date_time
 1011 format(I4," iam ",i3," END   ",a23)
 
      call cpu_time(T2)
      TOT_TIME = T2-T1
 
      call elapsed_time(day_beg,hour_beg, minute_beg, second_beg,       &
     &  millisec_beg, day_end, hour_end, minute_end, second_end,        &
     &  millisec_end,iam,ITEM )
 
      ITEM = ITEM + 1
      IF (IAM .EQ. 0) write(6,1012) ITEM,iam,TOT_TIME
 1012 format(I4," IAM ",i3," Total CPU Time/processor = ",f11.4,        &
     &       " seconds")
 
! ---------------------------------------------------------------------------
 
!
! -----    Cleanup arrays -----
!
      deallocate (a,c,ipiv)
!
      end program DP_LAPACK_SOLVE
!
!-----------------------------------------------------------------------
!
      SUBROUTINE INIT_MY_MATRIX (N,A,NPROW,NPCOL,MYROW,MYCOL)
      IMPLICIT NONE
      integer*8 :: N
      INTEGER      :: NH,NH1,NPROW,NPCOL,MYROW,MYCOL
      REAL(kind=8) :: A(:,:)
 
      INTEGER      :: I, J
 
      NH  = N/2
      NH1 = NH + 1
 
!     Diagonal elements
 
      DO I = 1,N
        A(I,I) = 3.0
      ENDDO
 
!     Upper diagonal
 
      DO 180 I = 1,N-1
        if (I .EQ. NH) go to 180
        A(I,I+1) = -1.0d0
  180 CONTINUE
 
!     Lower diagonal
 
      DO 160 I = 1,N-1
      if (I .EQ. NH) go to 160
      A(I+1,I) = -1.0d0
  160 CONTINUE
 
!
!     ANTI-DIAGONAL
!
      DO 190 I = 1,N
        if (I .EQ. NH .OR. I .EQ. NH1) go to 190
 
        A(I,N-I+1) = 1.0d0
  190 CONTINUE
 
      return
      end
 
!
!-----------------------------------------------------------------------
!
      SUBROUTINE INIT_MY_RHS (N,C,NPROW,NPCOL,MYROW,MYCOL)
      IMPLICIT NONE
      integer*8    :: N
      INTEGER      :: NPROW,NPCOL,MYROW,MYCOL
      REAL(kind=8) :: C(:,:)
 
      INTEGER :: I
 
      DO I= 1, N-1
        C(I,1) = N + 1.0
      ENDDO
 
      C(N,1) = 2*N + 2
 
      return
      end
!
!-----------------------------------------------------------------------
!
      subroutine elapsed_time(day_beg, hour_beg, minute_beg, second_beg,&
     &          millisec_beg, day_end, hour_end, minute_end, second_end,&
     &          millisec_end,tid, ITEM )
 
!date_and_time
 
! VALUES
!   must be of type default integer and of rank one. It is an INTENT(OUT)
!   argument. Its size must be at least eight. The values returned in VALUES
!   are as follows:
 
!   VALUES(1)
!   is the year (for example, 1998), or -HUGE (0) if no date is available.
 
!   VALUES(2)
!   is the month of the year, or -HUGE (0) if no date is available.
 
!   VALUES(3)
!   is the day of the month, or -HUGE (0) if no date is available.
 
!   VALUES(4)
!   is the time difference with respect to Coordinated Universal Time (UTC)
!   in minutes, or -HUGE (0) if this information is not available.
!
!   VALUES(5)
!   is the hour of the day, in the range 0 to 23, or -HUGE (0) if there is
!   no clock.
!
!   VALUES(6)
!   is the minutes of the hour, in the range 0 to 59, or -HUGE (0) if there
!   is no clock.
!   !
!   VALUES(7)
!   is the seconds of the minute, in the range 0 to 60, or -HUGE (0) if
!   there is no clock.
!
!   VALUES (8)
!   is the milliseconds of the second, in the range 0 to 999, or -HUGE (0)
!   if there is no clock.
 
      implicit none
      integer :: day_beg, hour_beg, minute_beg, second_beg, millisec_beg
      integer :: day_end, hour_end, minute_end, second_end, millisec_end
      integer :: day, hour, minute, second, millisec, tid, ITEM
 
      if ( millisec_end .lt. millisec_beg ) then
        millisec_end = millisec_end + 1000
        second_end = second_end - 1
      endif
 
      if ( second_end .lt. second_beg ) then
        second_end = second_end + 60
        minute_end = minute_end - 1
      endif
 
      if ( minute_end .lt. minute_beg ) then
        minute_end = minute_end + 60
        hour_end   = hour_end - 1
      endif
 
      if ( hour_end .lt. hour_beg ) then
        hour_end = hour_end + 24
        day_end  = day_end - 1
      endif
 
      millisec = millisec_end - millisec_beg
      second   = second_end   - second_beg
      minute   = minute_end   - minute_beg
      hour     = hour_end     - hour_beg
      day      = day_end      - day_beg
 
      if ( .not. ( day .eq. 0 .and. hour .eq. 0 .and. minute .eq. 0     &
     &        .and. second .eq. 0 .and. millisec .eq. 0 ) ) then
        ITEM = ITEM + 1
        IF (tid .eq. 0 ) THEN
          write(6,8000) ITEM,tid,day
 8000     format(I4," IAM = ",I3," ELAPSED0 days          = ",i3)
          write(6,8001) ITEM,tid,hour
 8001     format(I4," IAM = ",I3," ELAPSED1 hours         = ",i3)
          write(6,8002) ITEM,tid,minute
 8002     format(I4," IAM = ",I3," ELAPSED2 minute        = ",i3)
          write(6,8003) ITEM,tid,second
 8003     format(I4," IAM = ",I3," ELAPSED3 second        = ",i3)
          write(6,8004) ITEM,tid,millisec
 8004     format(I4," IAM = ",I3," ELAPSED4 millisec      = ",i3)
        ENDIF
      endif
 
      return
 
      end subroutine elapsed_time
 
!
!-----------------------------------------------------------------------
!
 
      subroutine timestamp( date_time, day,hour,minute,second,millisec)
      implicit none
      character*23 :: date_time
      integer      :: day, hour, minute, second, millisec
      integer      :: elements(8)
      character*3  :: months(12)
 
      data months /'Jan','Feb','Mar','Apr','May','Jun',                 &
     &             'Juk','Aug','Sep','Oct','Nov','Dec' /
      call date_and_time ( VALUES=elements )
 
      if ( elements(1) .ne. -HUGE(0) ) then
      century: if ( elements(1) .lt. 2000 ) then
                 elements(1) = elements(1) - 1900
               else
                 elements(1) = elements(1) - 2000
               endif century
               write(date_time,1010) elements(3),                       &
     &                       months ( elements(2) ),                    &
     &                                elements(1)  , elements(5),       &
     &                 elements(6)  , elements(7)  , elements(8)
 
 1010          format(i2.2,1x,a3,1x,i2.2,1x,                            &
     &                i2.2,':',i2.2,':',i2.2,'.',i4.4 )
 
               day      = elements(3)
               hour     = elements(5)
               minute   = elements(6)
               second   = elements(7)
               millisec = elements(8)
 
      else
               date_time=' '
 
               day      = 0
               hour     = 0
               minute   = 0
               second   = 0
               millisec = 0
 
      endif
 
      end subroutine timestamp
 
 
!
!-----------------------------------------------------------------------
!

This file is used by script ed_parm_N which substitutes the x by $N:

! filename = parameter_N.inc_basic
      parameter (N =   x000)

This file is used by script ed_procs_P which substitutes nprocs by $P:

#!/bin/bash
 
# script file = sub_batch_DP_LAPACK_SOLVE_basic
sqsub -t -r 60m -q threaded -n nprocs -o ${CLU}_SERIAL_DP_SOLVE_%J  ./DP_LAPACK_SOLVE

ScaLAPACK EXAMPLE

ScaLAPACK is a distributed memory version of LAPACK. It uses the Parallel Basic Linear Algebra Subprograms (PBLAS) as computational building blocks. ScaLAPACK uses the block-cyclic decomposition scheme to distribute block-partitioned matrices which should make the computation balanced and scalable.

Example of a block cyclic data distribution

According to the two-dimensional block cyclic data distribution scheme, an M_ by N_ dense matrix is first decomposed into MB_ by NB_ blocks starting at its upper left corner.

These blocks are then uniformly distributed in each dimension of the Process Grid.

Thus, every process owns a collection of blocks, which are locally and contiguously stored in a two-dimensional ``column major array. The partitioning of a matrix into blocks and the mapping of these blocks onto a Process Grid is illustrated with a global 9x9 matrix A. The first step in this process is to partition the matrix A into block. Let us use 2x2 blocks and assume that the 2-D Process Grid is 2x3.


a11 a12 | a13 a14 | a15 a16 | a17 a18 | a19
a21 a22 | a23 a24 | a25 a26 | a27 a28 | a29           B11  B12  B13  B14  B15
--------.---------.---------.---------.----
a31 a32 | a33 a34 | a35 a36 | a37 a38 | a39
a41 a42 | a43 a44 | a45 a46 | a47 a48 | a49           B21  B22  B23  B24  B25
--------.---------.---------.---------.----
a51 a52 | a53 a54 | a55 a56 | a57 a58 | a59      ==
a61 a62 | a63 a64 | a65 a66 | a67 a68 | a69           B31  B32  B33  B34  B35
--------.---------.---------.---------.----
a71 a72 | a73 a74 | a75 a76 | a77 a78 | a79
a81 a82 | a83 a84 | a85 a86 | a87 a88 | a89           B41  B42  B43  B44  B45
--------.---------.---------.---------.----
a91 a92 | a93 a94 | a95 a96 | a97 a98 | a99           B51  B52  B53  B54  B55


        matrix A is M_ x N_ (9 x 9)
     A is partitioned into 2x2 blocks

In the above diagram the Bij are the 2x2 blocks, e.g.


 B11  ==  a11 a12      B12  ==  a13 a14     ...    B15 == a19
          a21 a22               a23 a24                   a29


Initially, the 2x3 Process Grid is empty and looks like this:

          0                 1              2

          .                 .            .
 0        .                 .            .
          .                 .            .

 1        .                 .            .
          .                 .            .

We identify each process in the Process Grid by two coordinates (row,col). Thus, for the 2x3 Process Grid the processes would be:

         0                 1              2

 0     (0,0)             (0,1)          (0,2)
     
 1     (1,0)             (1,1)          (1,2)  
      

The distribution process starts by taking the Global Bij in first row and distribute them to the first row of the Processor Grid:

          0                 1              2

       B11 B14           B12 B15        B13
 0        .                 .            .
          .                 .            .

 1        .                 .            .
          .                 .            .

Take Global Bij in next row and distribute them to the next row of the Process Grid: (if previous distribution was on last row of Process Grid then restart with row 0 of Process Grid ).

          0                 1              2

       B11 B14           B12 B15        B13
 0        .                 .            .
          .                 .            .

 1     B21 B24           B22 B25        B23
          .                 .            .


Take Global Bij in next row and distribute them to the first row of the Process Grid: (restart with row 0 of Process Grid).

          0                 1              2

       B11 B14           B12 B15        B13
 0     B31 B34           B32 B35        B33
          .                 .            .

 1     B21 B24           B22 B25        B23
          .                 .            .


Take Global Bij in next row and distribute them to the next row of the Process Grid:

          0                 1              2

       B11 B14           B12 B15        B13
 0     B31 B34           B32 B35        B33
          .                 .            .

 1     B21 B24           B22 B25        B23
       B41 B44           B42 B45        B43


Take Global Bij in next row and distribute them to the first row of the Process Grid: (restart with row 0 of Process Grid).

          0                 1              2

       B11 B14           B12 B15        B13
 0     B31 B34           B32 B35        B33
       B51 B54           B52 B55        B53

 1     B21 B24           B22 B25        B23
       B41 B44           B42 B45        B43

Set Up prior to invoking the Parallel Solvers

Before calling solvers and other functions from the ScaLAPACK library the user needs to configure the processes into a BLACS 2D Process Grid. ScaLAPACK requires that the data arrays be block-cyclically distributed across the 2D Processor Grid.

Here is the procedure to accomplish this:

(1) Start by invoking the following routine:

       call blacs_pinfo( iam, nprocs )
                         OUT  OUT

which identifies the process in the process grid and how many processes there are in total.

(2) Next, you should set up a rectangular grid (2D) as close as possible to a square. Following algorithm was developed by Carlo Cavazzoni of CINECA (this is not part of BLACS), and you must include it in your source:

       call gridsetup (nprocs,nprow,npcol)
                       INPUT  OUT   OUT

This computes the number of rows and columns we'll have in the process grid, such that NPROW x NPCOL == NPROCS.

(3) Inform BLACS of this grid extent by calling following routines:

       call blacs_get( -1, 0, context )
                        IN IN  OUT
       call blacs_gridinit( context, 'r', nprow, npcol )
                             INPUT    IN   INPUT  INPUT
       Note: 'r' ==> ROW-MAJOR ordering i.e. fill in first row left to right then next row.
             'c' ==> COLUMN-MAJOR ordering.

MPI has communicators, BLACS has contexts. For this example, the default context, which includes all processors, is all we need. The BLACS_GET call simply returns a handle to the default context for use in subsequent BLACS calls. BLACS_GRIDINIT informs BLACS of the grid extent which we computed in step 2. Behind the scenes, BLACS then assigns processors to points in this virtual grid.

This now completes the definition of the BLACS 2D process grid.

We can now use subroutine BLACS_GRIDINFO to query BLACS on the information we supplied plus additional information BLACS has computed such as its coordinate, e.g., COL and ROW, in the grid by specifying "context":

       call blacs_gridinfo( context, nprow, npcol, myrow, mycol )
                            INPUT    OUT    OUT    OUT    OUT

Subroutine report_back prints the information we got from the call to blacs_gridinfo.


(4) Distribute the data arrays across all processors:

For each array, we need to take four basic steps:

(a) determine a block size for the global array
(b) call DESCINIT to create a standard ScaLAPACK array descriptor
(c) allocate local memory for each process' portion of the array
(d) distribute the actual data values into the allocated memory

(a) determine a block size for the global arrays and block-cyclically distribute these blocks across the 2D processor grid defined in steps (1), (2) and (3):

Here "Block size" refers to the dimensions of the subdivisions into which the global array is decomposed.

The ScaLAPACK User Guide recommends a block size of 64x64 for large arrays.

Carlo Cavazzoni of CINECA has written a subroutine to compute the block size for a given array: "Blockset" chooses a good block size nb based on the size of the global array (i.e. dimension of matrix A), n, and number of rows and columns in the processor grid (nprow,npcol). It also honors a maximum block size value, which, following the User Guide recommendation, can simply be set at 64. We will then simply call blockset to determine the block size, nb:

        call blockset( nb,  64,    n,     nprow,    npcol)
                       OUT  INPUT  INPUT  INPUT     INPUT


(b) call DESCINIT to create a standard ScaLAPACK array descriptor

The ScaLAPACK function NUMROC is used to determine the number of rows of matrix A:

           l_nrowsa = numroc(n,nb,myrow,0,nprow)

and again to determine the number of columns of matrix A:

           l_ncolsa = numroc(n,nb,mycol,0,npcol)

Now we are ready to call DESCINIT:

           call descinit( desca, n, n, nb, nb, 0, 0, context, l_nrowsa, info )

Similar calls are required for matrix C (the right hand side).


(c) allocate local memory for each process' portion of the array

Following fortran statement will allocate the required local memory for the segment of the matrix A used by this processor:

     allocate (a(l_nrowsa, l_ncolsa ), stat=istat)

Similar statements are required for the arrays C and IPIV, etc ...

(d) distribute the actual data values into the allocated memory

The ScaLAPACK subroutine PDELSET is used to store the global values of matrices A and C to the appropriate locations in each processor.


When we are done with a BLACS context, it should be released by calling subroutine BLACS_GRIDEXITR, and when we are done with BLACS altogether, we should call BLACS_EXIT. (These calls are similar to the MPI functions, MPI_Comm_free and MPI_Finalize).

Timing Routines and Scripts used in the makefile

In the next section a full listing of the program designed to solve a system of STRIDWAD equations of an even order N will appear. This program uses ScaLAPACK routines by distributing the coefficient matrix A and right hand side vector on the processors of the 2D Process Grid.

The timing routines are used to measure the cpu and elapsed times in different segments of the program. Several milestones are set up throughout the program and the timings between this milestones is reported at the end of the program.

Following scripts and basic files are used in the makefile:

./check_N $N
parameter_N.inc_basic 
./ed_parm_N parameter_N.inc $N
sub_batch_DP_LAPACK_SOLVE_basic 
./ed_procs_P sub_batch_DP_LAPACK_SOLVE $P
./sub_batch_DP_LAPACK_SOLVE

listings follow:

Using make/makefile to compile and submit the program

The make command is used to execute the commands in the makefile. The makefile has several macros with default values but the user can override these values. Several scripts are invoked to verify some parameters and if these parameters are out of range then the script returns a non-zero return code which stop make from further executing the commands on the makefile.

For example if the default parameters N, P and FPP_OPT were set to:

N:= 1
P:= 64
FPP_OPT:= -DPRINT_TIMES -DNO_PRINT_MEMORY_DISTR

in the makefile and we type the command:

make

the utility program make will proceed to read the makefile and carry out the following tasks:

Check if N is in the range [1,40]  (N times 1000 is the order of the matrix to invert)
Create the include file parameter_N.inc by making a copy of it from  
   file parameter_N.inc_basic and changing the parameter N to 4
Invoking fpp with option -DPRINT_TIMES and generating a new source 
   file ScaLAPACK_SOLVE2.f90
Compiling the latter with command compile using the libraries defined in the makefile by  
   macro $(LIB)
Generating a script to submit the job based on file sub_batch_ScaLAPACK_SOLVE_basic 
   and setting the parameter for number or processors to use
Submitting the job to the test queue

Since we are submitting to the test queue we must wait until the job finishes running, also we need to type the command:

make clean

to remove the last executable.

If we want to run the case for a system of order 4000 using 128 processors and do not want to time the routines but want to see how the matrix A is distributed on the processors we could type:

make N=4 P=128 "FPP_OPT=-DNO_PRINT_TIMES -DPRINT_MEMORY_DISTR"

or one could also type:

make N=4 P=128 FPP_OPT=-DNO_PRINT_TIMES FPP_OPT+=-DPRINT_MEMORY_DISTR

The macro USE_QUEUE is used to decide if the job to be submitted should be run in the test queue or not. This macro must be set up manually by editing the makefile.

Thus, for jobs to be submitted to the test queue the macro should look like this:

USE_QUEUE:= \-t
# USE_QUEUE:=\-v

also note that the script sub_batch_ScaLAPACK_SOLVE_basic uses -r 60m.

For longer production jobs (not to be submitted to the test queue) the macro should look like this:

# USE_QUEUE:= \-t
USE_QUEUE:=\-v

and in the script sub_batch_ScaLAPACK_SOLVE_basic could use larger than 60m values for parameter -r instead of -r 60m.

As you become familiar with the macros in the makefile, you can adjust them to fit your needs. Here is the makefile used in this online tutorial:

# Determine the platform that is being used for this compilation
 
OS:= $(shell uname)
PLATFORM:= $(shell uname)_$(shell uname -m)
 
# Set the dafault parameters N (order of matrx times 1000)
# and P (number of processors
 
N:= 1
P:= 64
FPP_OPT:= -DPRINT_TIMES -DNO_PRINT_MEMORY_DISTR
 
 
USE_QUEUE:= \-t
# USE_QUEUE:=\-v
 
# Specify executable, source and required libraries :
 
EXE = ScaLAPACK_SOLVE
EXT:= f90
SRC:= ${EXE}.$(EXT)
 
EXE2:=$(EXE)2
SRC2:=${EXE2}.$(EXT)
 
LIB:= -psc -lscalapack -lblacs -lblacsF77init -lblacs -lmpi -L$(BLACSPATH)/pathscale/lib \
      -L$(SCALAPACKPATH)/pathscale/lib -L$(ACMLPATH) -lacml
 
 
$(EXE): $(SRC)
        @echo Check if N is in the range [1,40]
        ./check_N $N
#       @echo COMPILING LAPACK DP program with N = $(N),000
#       @echo PLATFORM used = $(PLATFORM)
#       @echo Edit the parameter_N.inc file
        cp parameter_N.inc_basic  parameter_N.inc
        ./ed_parm_N parameter_N.inc $N
#       @echo "Using following parameter_N.inc file:"
#       cat parameter_N.inc
#       @echo "LIB = " $(LIB)
 
        @echo "Invoking fpp with option $(FPP_OPT)"
        fpp $(FPP_OPT) $(SRC) $(SRC2)
 
        compile -o $(EXE) $(SRC2) $(LIB)
 
        cp sub_batch_ScaLAPACK_SOLVE_basic sub_batch_ScaLAPACK_SOLVE
        ./ed_procs_P sub_batch_ScaLAPACK_SOLVE $(USE_QUEUE) $P
        ./sub_batch_ScaLAPACK_SOLVE
 
superclean: clean
        @echo ' '
        @echo  Removing Output files generated by $(EXE)
        @echo '--------------------------------------------------'
        @echo ' '
        rm -rf ${CLU}_SERIAL_DP_SOLVE_*
        @echo ' '
        @echo  Output files have been removed.
        @echo '-------------------------------'
        @echo ' '
 
clean:
        @echo ' '
        @echo  Cleaning executable $(EXE)
        @echo '-----------------------------------'
        @echo ' '
        rm -rf $(EXE)
        rm -rf ED_ERROR
        @echo ' '
        @echo  Executable $(EXE) has been removed.
        @echo '--------------------------------------------'
        @echo ' '
 
 
help:
        @echo "+-----------------------------------------------------------------------------------+"
        @echo "|                                                                                   |"
        @echo "|         Makefile for running ScaLAPACK example using pdgetrf/pdgetrs              |"
        @echo "|                                                                                   |"
        @echo "| Usage: make                      compile program for N=1 (i.e. order=1,000)       |"
        @echo "|                                                                                   |"
        @echo "|        make N=n                  compile program for N=n (i.e. order=n*1000)      |"
        @echo "|                                  where  1 <= n <= 40                              |"
        @echo "|                                                                                   |"
        @echo "|        make P=p                  submit job with -q threaded -n p                 |"
        @echo "|                                  where  1 <= p <= 4 or 6 (depending on cluster)   |"
        @echo "|                                                                                   |"
        @echo "|        make superclean           remove output files and executable               |"
        @echo "|                                                                                   |"
        @echo "|        make clean                remove executable                                |"
        @echo "|                                                                                   |"
        @echo "|        make help                 display makefile usage information               |"
        @echo "|                                                                                   |"
        @echo "+-----------------------------------------------------------------------------------+"
        @echo " "
        @echo PLATFORM $(PLATFORM)

Full fortran program listing using ScaLAPACK routines to solve a STRIDWAD system

The full fortran 90 program (which includes fpp directives) is listed next:

      program ScaLAPACK_SOLVE
!
!   file name = ScaLAPACK_SOLVE.f90
!
      implicit none
 
      include 'mpif.h'
 
      integer   :: istat,info,i,j,N_PRT,milestone
      real(kind=8),dimension(:,:),allocatable  :: a
      real(kind=8),dimension(:,:),allocatable  :: c
      real(kind=8),dimension(:),allocatable  :: solution
      integer,dimension(:),allocatable :: ipiv
 
      integer*8 :: N
      real(kind=8) :: memory_local, memory_sum
      integer   :: root, ierr
      real(kind=8),dimension(:),allocatable    :: memory_a
      logical :: PRT_MEM
 
      include "parameter_N.inc"
 
      parameter (N_PRT = 40)
 
      integer   ::       context, iam, pnum
      integer   ::       mycol, myrow, nb
      integer   ::       npcol, nprocs, nprow
      integer   ::       l_nrowsa,l_ncolsa
      integer   ::       l_nrowsc,l_ncolsc
 
      integer,parameter :: descriptor_len=9
      integer   ::       desca( descriptor_len )
      integer   ::       descc( descriptor_len )
 
      integer   ::       numroc, ITEM, IAorC, NH, NH1, DEBUG
! --------------------------------------------------------------------------
#ifdef PRINT_TIMES
      integer   :: day(10),hour(10),minute(10),second(10),millisec(10)
      character*30 :: descr_milestone(10)
 
      integer    :: day_beg,hour_beg,minute_beg,second_beg,millisec_beg
      integer    :: day_end,hour_end,minute_end,second_end,millisec_end
      character*23 :: date_time
      real(kind=8) :: T1, T2
      real(kind=8) :: TOT_TIME
      real(kind=8) :: tm
#endif
 
! --------------------------------------------------------------------------
 
      interface
        subroutine report_back ( context,                          &
        iam,nprocs,myrow,nprow,mycol,npcol)
        implicit none
        integer :: context
        integer :: iam,nprocs,myrow,nprow,mycol,npcol
        integer :: i,j,pnum
        end subroutine report_back
 
        subroutine printlocals( context,                          &
        a,iam,nb,nprocs,myrow,mycol,l_nrows,l_ncols,IAorC)
        implicit none
        real(kind=8)    :: a(:,:)
        integer :: context
        integer :: iam,nb,nprocs,myrow,mycol,l_nrows,l_ncols,IAorC
        integer :: i,j,pnum
        end subroutine printlocals
 
        subroutine init_my_matrix (n,a,nprow,npcol,myrow,mycol,desca)
        implicit none
        integer*8 :: N
        integer :: nprow,npcol,myrow,mycol
        integer :: desca(:)
        real(kind=8)    :: a(:,:)
        end subroutine init_my_matrix
 
        subroutine init_my_rhs (n,c,nprow,npcol,myrow,mycol,descc)
        implicit none
        integer*8 :: N
        integer :: nprow,npcol,myrow,mycol
        integer :: descc(:)
        real(kind=8)    :: c(:,:)
        end subroutine init_my_rhs
 
        subroutine get_solution (n,c,descc,solution)
        implicit none
        integer*8 :: N
        integer :: descc(:)
        real(kind=8) ::  c(:,:),solution(:)
        end subroutine get_solution
 
#ifdef PRINT_TIMES
        subroutine elapsed_time(day_beg,hour_beg, minute_beg,second_beg,&
     &  millisec_beg, day_end, hour_end, minute_end, second_end,        &
     &  millisec_end,tid,ITEM )
        implicit none
        integer :: day_beg,hour_beg,minute_beg,second_beg,millisec_beg
        integer :: day_end,hour_end,minute_end,second_end,millisec_end
        integer :: day,hour, minute, second, millisec, tid, ITEM
        end subroutine elapsed_time
        subroutine timestamp(date_time,day,hour,minute,second,millisec)
        implicit none
        character*23 :: date_time
        integer      :: day, hour, minute, second, millisec
        integer      :: elements(8)
        character*3  :: months(12)
        end subroutine timestamp
#endif
 
      end interface
 
! ---------------------------------------------------------------------------
 
#ifdef PRINT_MEMORY_DISTR
      PRT_MEM = .TRUE.
#else
      PRT_MEM = .FALSE.
#endif
 
!
!     Trace intermediate steps only for small values of N
 
      DEBUG = 1
      root = 0
 
!     N must be EVEN
 
      NH    = N/2
      NH1   = NH + 1
 
      IF (2*NH .NE. N) THEN
        WRITE(6,2001) N
 2001   FORMAT("N must be EVEN, got N = ",i9)
        STOP
      ENDIF
 
      IF ( N .gt. N_PRT ) DEBUG = 0
!
! -----    Initialize the blacs.  Note: processors are counted starting at 0.
!
      call blacs_pinfo( iam, nprocs )
 
! ---------------------------------------------------------------------------
 
     ITEM = 1000 + iam
 
#ifdef PRINT_TIMES
      call timestamp(date_time,day_beg,hour_beg,minute_beg,second_beg,  &
     &               millisec_beg )
 
      IF (IAM .eq. 0) write(6,1006) ITEM,iam,date_time
 1006 format(I4," iam ",i3," BEGIN ",a23)
 
      call cpu_time(T1)
 
      milestone = 1
      call timestamp( date_time,day(milestone), hour(milestone),        &
     &     minute(milestone),second(milestone),millisec(milestone) )
 
      descr_milestone(milestone) = "Init blacs_gridinit, allocate"
#endif
!
! -----    Print out nprocs   -----
 
        IF (IAM .EQ. 0) THEN
          WRITE(6,101) nprocs
  101     FORMAT(" nprocs = ",i3)
          call flush(6_4)
        ENDIF
! -----    Set the dimension of the 2d processors grid.
!
      call gridsetup(nprocs,nprow,npcol)
!
! -----    Initialize a single blacs context.  Determine which processor I
!          am in the 2D process or grid.
!
      call blacs_get( -1, 0, context )
      call blacs_gridinit( context, 'r', nprow, npcol )
      call blacs_gridinfo( context, nprow, npcol, myrow, mycol )
 
      IF (DEBUG .eq. 1) call report_back (context,                      &
     &  iam,nprocs,myrow,nprow,mycol,npcol)
!
! -----    Calculate the blocking factor for the matrix.
!
      call blockset( nb, 64, n, nprow, npcol)
!
! -----    Distributed matrices: get num. local rows/cols. Create description.
!
      l_nrowsa = numroc(n,nb,myrow,0,nprow)
      l_ncolsa = numroc(n,nb,mycol,0,npcol)
      call descinit( desca, n, n, nb, nb, 0, 0, context, l_nrowsa, info )
 
      l_nrowsc = numroc(n,nb,myrow,0,nprow)
      l_ncolsc = numroc(1,nb,mycol,0,npcol)
      call descinit( descc, n, 1, nb, nb, 0, 0, context, l_nrowsc, info )
!
! -----   Allocate LHS, RHS, pivot, and solution -----
!
      allocate (a(l_nrowsa, l_ncolsa ), stat=istat)
      if (istat/=0) stop "ERR:ALLOCATE FAILS for A"
 
      allocate (c(l_nrowsc,l_ncolsc), stat=istat)
      if (istat/=0) stop "ERR:ALLOCATE FAILS for C"
 
      allocate (ipiv (n), stat=istat)
      if (istat/=0) stop "ERR:ALLOCATE FAILS for IPIV"
 
      allocate (solution (n), stat=istat)
      if (istat/=0) stop "ERR:ALLOCATE FAILS for solution"
 
      allocate (memory_a (nprocs), stat=istat)
      if (istat/=0) stop "ERR:ALLOCATE FAILS for memory_a"
!
! -----    Reassemble memory_a on all processors. Add and Print on 0 -----
!
 
       memory_local = l_nrowsa*l_ncolsa*8
 
       call MPI_Gather(memory_local, 1, MPI_REAL8,                      &
     &                memory_a, 1, MPI_REAL8,   root,                   &
     &                MPI_COMM_WORLD, ierr)
 
 
      IF (IAM.EQ.0) THEN
        IF (N .le. N_PRT .or. PRT_MEM ) THEN
          ITEM = 6500
          memory_sum = 0
          write (6,550) ITEM,N
  550     FORMAT(I4," N = ",i12 /" Distributed memory for A: ")
          DO I=0,nprocs-1
            ITEM = 6500 + I
            WRITE (6,551) ITEM,I,memory_a(I+1)
            memory_sum = memory_sum + memory_a(I+1)
  551       FORMAT(I4,"       processor = ",I3," ",f14.1)
          ENDDO
 
          ITEM = ITEM + 1
          WRITE (6,552) ITEM,memory_sum,float(N)*float(N)*8
  552     FORMAT(I4," TOTAL MEMORY for A = ",f14.1," ==  N*N*8 = ",f14.1)
        ENDIF
        call flush(6_4)
 
      ENDIF
 
!
! -----    Initialize LHS and RHS
!
      call init_my_matrix (n,a,nprow,npcol,myrow,mycol,desca)
      call init_my_rhs    (n,c,nprow,npcol,myrow,mycol,descc)
!
! -----    Show how arrays distributed
!
      IF (DEBUG .eq. 1) THEN
 
 
        ITEM = 2000 + 100*MYROW + 10*MYCOL
 
        IF (IAM.EQ.0) THEN
          write(6,300) ITEM,n,n
  300     format(i4," DISTRIBUTION OF ARRAY: A  -  Global dimension:",  &
     &         i3,":",i3)
        ENDIF
 
        IAorC = 0
        call printlocals ( context,                                     &
     &       a,iam,nb,nprocs,myrow,mycol,l_nrowsa,l_ncolsa,IAorC)
 
        IF (IAM.EQ.0) THEN
          write(6,400) ITEM,N,1
  400     FORMAT(i4," DISTRIBUTION OF ARRAY: C  -  Global dimension:",  &
     &           i3,":",i3)
        ENDIF
 
        IAorC = 1000
        call printlocals ( context,                                     &
     &  c,iam,nb,nprocs,myrow,mycol,l_nrowsc,l_ncolsc,IAorC)
 
      ENDIF
 
#ifdef PRINT_TIMES
      milestone = milestone + 1
      call timestamp( date_time,day(milestone), hour(milestone),        &
     &     minute(milestone),second(milestone),millisec(milestone) )
 
      descr_milestone(milestone) = "call pdgetrf"
#endif
!
! -----    Do the work -----
!
      call pdgetrf( n, n, a, 1, 1, desca, ipiv, info )
 
      if (iam.eq.0 .AND. DEBUG .eq. 1)                                  &
     &  write (6,*) "info (from pdgetrf) = ", info
 
#ifdef PRINT_TIMES
      milestone = milestone + 1
      call timestamp( date_time,day(milestone), hour(milestone),        &
     &     minute(milestone),second(milestone),millisec(milestone) )
 
      descr_milestone(milestone) = "call pdgetrs"
#endif
 
      call pdgetrs( 'n', n, 1, a, 1, 1, desca, ipiv,                    &
     &               c, 1, 1, descc, info )
 
      if (iam.eq.0 .AND. DEBUG .eq. 1)                                  &
     &  write (6,*) "info (from pdgetrs) = ", info
!
! -----    Reassemble solution on all processors. Print on 0 -----
!
 
#ifdef PRINT_TIMES
      milestone = milestone + 1
      call timestamp( date_time,day(milestone), hour(milestone),        &
     &     minute(milestone),second(milestone),millisec(milestone) )
 
      descr_milestone(milestone) = "get_solution --> root"
#endif
 
      call get_solution (n,c,descc,solution)
 
 
      IF (IAM.EQ.0) THEN
        IF (N .le. N_PRT) THEN
          ITEM = 6000
          write (6,500) ITEM
  500     FORMAT(I4," SOLUTION: ")
          DO I=1,N
            ITEM = 6000 + I
            WRITE (6,501) ITEM,I,SOLUTION(I)
  501       FORMAT(I4," I = ",I3," ",f6.2)
          ENDDO
 
        ELSE
 
!       For large N print only last N_PRT elements of vector
 
          DO I=N-N_PRT,N
            WRITE (6,502) I,SOLUTION(I)
  502       FORMAT(" I=",I10," ",f13.2)
          ENDDO
 
        ENDIF
 
      ENDIF
 
 
#ifdef PRINT_TIMES
      milestone = milestone + 1
      call timestamp( date_time,day(milestone), hour(milestone),        &
     &     minute(milestone),second(milestone), millisec(milestone) )
 
      descr_milestone(milestone) = "print and clean up"
#endif
 
!     Now report the milestone time differences
 
#ifdef PRINT_TIMES
      ITEM = 7000
      IF (IAM .eq. 0) WRITE(6,1007) ITEM,iam,milestone
 1007 FORMAT(I4," iam = ",i3," FINAL value of milestone = ",i2)
 
      do i=1,milestone-1
        ITEM = 7000 + I
        tm = 86400.0*(day(i+1)-day(i)) +  3600.0*(hour(i+1)-hour(i)) +  &
     &       60.0*(minute(i+1)-minute(i)) + (second(i+1)-second(i))  +  &
     &       0.001 * (millisec(i+1)-millisec(i))
        IF (IAM .EQ. 0) write(6,1005) ITEM,iam, i, i+1, tm,             &
     &                  descr_milestone(i)
 1005   format(I4," iam ",i3," Time from milestone ",i2," to milestone "&
     &        ,I2," equals ",f10.2,3x,a30)
      enddo
 
      call timestamp( date_time,day_end,hour_end,minute_end,second_end, &
     &                millisec_end  )
 
      ITEM = ITEM + 1
      IF (IAM .EQ. 0) write(6,1011) ITEM,iam,date_time
 1011 format(I4," iam ",i3," END   ",a23)
 
      call cpu_time(T2)
      TOT_TIME = T2-T1
 
      call elapsed_time(day_beg,hour_beg, minute_beg, second_beg,       &
     &  millisec_beg, day_end, hour_end, minute_end, second_end,        &
     &  millisec_end,iam,ITEM )
 
      ITEM = ITEM + 1
      IF (IAM .EQ. 0) write(6,1012) ITEM,iam,TOT_TIME
 1012 format(I4," IAM ",i3," Total CPU Time/processor = ",f11.4,        &
     &       " seconds")
#endif
 
! ---------------------------------------------------------------------------
 
!
! -----    Cleanup arrays -----
!
      deallocate (a,c,ipiv,solution,memory_a)
!
! -----    Exit BLACS cleanly -----
!
      call blacs_gridexit( context )
      call blacs_exit( 0 )
 
      end program ScaLAPACK_SOLVE
 
      subroutine gridsetup(nproc,nprow,npcol)
!
! This subroutine factorizes the number of processors (nproc)
! into nprow and npcol,  that are the sizes of the 2d processors mesh.
!
! Written by Carlo Cavazzoni
!
      implicit none
      integer nproc,nprow,npcol
      integer sqrtnp,i
 
      sqrtnp = int( sqrt( dble(nproc) ) + 1 )
      do i=1,sqrtnp
        if(mod(nproc,i).eq.0) nprow = i
      end do
      npcol = nproc/nprow
 
      return
      end
!
!-----------------------------------------------------------------------
!
      subroutine blockset( nb, nbuser, n, nprow, npcol)
!
!     This subroutine try to choose an optimal block size
!     for the distributd matrix.
!
!     Written by Carlo Cavazzoni, CINECA
!
      implicit none
      integer*8 :: N
      integer nb, nprow, npcol, nbuser
 
      nb = min ( n/nprow, n/npcol )
      if(nbuser.gt.0) then
        nb = min ( nb, nbuser )
      endif
      nb = max(nb,1)
 
      return
      end subroutine blockset
!
!-----------------------------------------------------------------------
!
      subroutine report_back ( context,                                 &
        iam,nprocs,myrow,nprow,mycol,npcol)
!
! Each processor identifies itself and its place in the processor grid
!
        implicit none
        integer :: context
        integer :: iam,nprocs,myrow,nprow,mycol,npcol
        integer :: i,j,pnum,ITEM
 
        do i=0,nprocs-1
          call blacs_barrier (context, 'a')
 
          if (iam.eq.i) then
            ITEM = 3000 + 100*MYROW + 10*MYCOL
            write(6,100) ITEM,iam,nprocs,myrow,mycol,myrow,nprow,mycol, &
     &                   npcol
 
  100       format(i4," PE=",i2,":",i2," Grid-coord(",i2,",",i2,        &
     &               ")   PROW=",i2,":",i2,"   PCOL=",i2,":",i2)
            call flush(6_4)
          endif
        enddo
 
        call blacs_barrier (context, 'a')
      end subroutine report_back
!
!-----------------------------------------------------------------------
!
      subroutine printlocals( context,                          &
        a,iam,nb,nprocs,myrow,mycol,l_nrows,l_ncols,IAorC)
!
! Prints the local array "a", processor by processor. Only use this
! for very small arrays or sub-arrays...
!
      implicit none
      real(kind=8)    :: a(:,:)
      integer :: context
      integer :: iam,nb,nprocs,myrow,mycol,l_nrows,l_ncols,IAorC
      integer :: i,j,pnum,ITEM
 
      call blacs_barrier (context, 'a')
 
      do pnum=0,nprocs-1
        if (iam .eq. pnum) then
          ITEM = IAorC + 4000 + 100*MYROW + 10*MYCOL
          write (6,100) ITEM,iam,myrow,mycol,nb,l_nrows,l_ncols,        &
     &          l_nrows*l_ncols*8
100       format (i4," proc:",i3," grid position:",i3,",",i3,           &
     &            " blksz:",i3," numroc:",i3,":",i3," Memory Allocated "&
     &           ,i6," bytes")
          do i=1,l_nrows
            ITEM = IAorC + 4000 + 100*MYROW + 10*MYCOL + I
            write (6,200) ITEM,(a(i,j),j=1,l_ncols)
200         format (I4," ",40(" ",f5.1))
            call flush(6_4)
          enddo
        endif
        call blacs_barrier (context, 'a')
      enddo
 
      call flush(6_4)
      end subroutine printlocals
!
!-----------------------------------------------------------------------
!
      SUBROUTINE INIT_MY_MATRIX (N,A,NPROW,NPCOL,MYROW,MYCOL,DESCA)
      IMPLICIT NONE
      integer*8 :: N
      INTEGER      :: NH,NH1,NPROW,NPCOL,MYROW,MYCOL
      INTEGER      :: DESCA(:)
      REAL(kind=8) :: A(:,:)
      REAL(kind=8) :: AII,AII1,AI1I,AINI1
 
      INTEGER      :: I, J
 
!     Compute values for all elements of the global array, but using
!     pdelset, only set those elements which occur in local portion.
 
      NH  = N/2
      NH1 = NH + 1
 
!     Diagonal elements
 
      DO I = 1,N
!       A(I,I) = 3.0
        AII = 3.0
        J = I
        CALL PDELSET(A,I,J,DESCA,AII)
      ENDDO
 
!     Upper diagonal
 
      DO 180 I = 1,N-1
        if (I .EQ. NH) go to 180
!       A(I,I+1) = -1.0d0
        AII1 = -1.0d0
        J = I+1
        CALL PDELSET(A,I,J,DESCA,AII1)
  180 CONTINUE
 
!     Lower diagonal
 
      DO 160 I = 1,N-1
      if (I .EQ. NH) go to 160
!     A(I+1,I) = -1.0d0
      AI1I = -1.0d0
        J = I
        CALL PDELSET(A,I+1,J,DESCA,AI1I)
  160 CONTINUE
 
!
!     ANTI-DIAGONAL
!
      DO 190 I = 1,N
        if (I .EQ. NH .OR. I .EQ. NH1) go to 190
 
!       A(I,N-I+1) = 1.0d0
        AINI1 = 1.0d0
        J = N-I+1
        CALL PDELSET(A,I,J,DESCA,AINI1)
  190 CONTINUE
 
      return
      end
 
!
!-----------------------------------------------------------------------
!
      SUBROUTINE INIT_MY_RHS (N,C,NPROW,NPCOL,MYROW,MYCOL,DESCC)
      IMPLICIT NONE
      integer*8 :: N
      INTEGER      :: NPROW,NPCOL,MYROW,MYCOL
      INTEGER      :: DESCC(:)
      REAL(kind=8) :: C(:,:)
      REAL(kind=8) :: CI, CN
 
      INTEGER :: I
 
      DO I= 1, N-1
!       C(I) = N + 1.0
        CI = N + 1.0
        CALL PDELSET(C,I,1,DESCC,CI)
      ENDDO
 
!     C(N) = 2*N + 2
      CN = 2*N + 2
      CALL PDELSET(C,I,1,DESCC,CN)
 
 
      return
      end
!
!-----------------------------------------------------------------------
!
      subroutine get_solution (n,c,descc,solution)
      implicit none
      integer*8 :: n
      integer :: descc(:)
      real(kind=8) ::  c(:,:),solution(:)
 
      integer :: i
 
      do i= 1, n
        call pdelget('A',' ',solution(i),c,i,1,descc)
      enddo
 
      return
      end
 
!
!-----------------------------------------------------------------------
!
#ifdef PRINT_TIMES
      subroutine elapsed_time(day_beg, hour_beg, minute_beg, second_beg,&
     &          millisec_beg, day_end, hour_end, minute_end, second_end,&
     &          millisec_end,tid, ITEM )
 
!date_and_time
 
! VALUES
!   must be of type default integer and of rank one. It is an INTENT(OUT)
!   argument. Its size must be at least eight. The values returned in VALUES
!   are as follows:
 
!   VALUES(1)
!   is the year (for example, 1998), or -HUGE (0) if no date is available.
 
!   VALUES(2)
!   is the month of the year, or -HUGE (0) if no date is available.
 
!   VALUES(3)
!   is the day of the month, or -HUGE (0) if no date is available.
 
!   VALUES(4)
!   is the time difference with respect to Coordinated Universal Time (UTC)
!   in minutes, or -HUGE (0) if this information is not available.
!
!   VALUES(5)
!   is the hour of the day, in the range 0 to 23, or -HUGE (0) if there is
!   no clock.
!
!   VALUES(6)
!   is the minutes of the hour, in the range 0 to 59, or -HUGE (0) if there
!   is no clock.
!   !
!   VALUES(7)
!   is the seconds of the minute, in the range 0 to 60, or -HUGE (0) if
!   there is no clock.
!
!   VALUES (8)
!   is the milliseconds of the second, in the range 0 to 999, or -HUGE (0)
!   if there is no clock.
 
      implicit none
      integer :: day_beg, hour_beg, minute_beg, second_beg, millisec_beg
      integer :: day_end, hour_end, minute_end, second_end, millisec_end
      integer :: day, hour, minute, second, millisec, tid, ITEM
 
      if ( millisec_end .lt. millisec_beg ) then
        millisec_end = millisec_end + 1000
        second_end = second_end - 1
      endif
 
      if ( second_end .lt. second_beg ) then
        second_end = second_end + 60
        minute_end = minute_end - 1
      endif
 
      if ( minute_end .lt. minute_beg ) then
        minute_end = minute_end + 60
        hour_end   = hour_end - 1
      endif
 
      if ( hour_end .lt. hour_beg ) then
        hour_end = hour_end + 24
        day_end  = day_end - 1
      endif
 
      millisec = millisec_end - millisec_beg
      second   = second_end   - second_beg
      minute   = minute_end   - minute_beg
      hour     = hour_end     - hour_beg
      day      = day_end      - day_beg
 
      if ( .not. ( day .eq. 0 .and. hour .eq. 0 .and. minute .eq. 0     &
     &        .and. second .eq. 0 .and. millisec .eq. 0 ) ) then
        ITEM = ITEM + 1
        IF (tid .eq. 0 ) THEN
          write(6,8000) ITEM,tid,day
 8000     format(I4," IAM = ",I3," ELAPSED0 days          = ",i3)
          write(6,8001) ITEM,tid,hour
 8001     format(I4," IAM = ",I3," ELAPSED1 hours         = ",i3)
          write(6,8002) ITEM,tid,minute
 8002     format(I4," IAM = ",I3," ELAPSED2 minute        = ",i3)
          write(6,8003) ITEM,tid,second
 8003     format(I4," IAM = ",I3," ELAPSED3 second        = ",i3)
          write(6,8004) ITEM,tid,millisec
 8004     format(I4," IAM = ",I3," ELAPSED4 millisec      = ",i3)
        ENDIF
      endif
 
      return
 
      end subroutine elapsed_time
#endif
 
!
!-----------------------------------------------------------------------
!
#ifdef PRINT_TIMES
      subroutine timestamp( date_time, day,hour,minute,second,millisec)
      implicit none
      character*23 :: date_time
      integer      :: day, hour, minute, second, millisec
      integer      :: elements(8)
      character*3  :: months(12)
 
      data months /'Jan','Feb','Mar','Apr','May','Jun',                 &
     &             'Juk','Aug','Sep','Oct','Nov','Dec' /
      call date_and_time ( VALUES=elements )
 
      if ( elements(1) .ne. -HUGE(0) ) then
      century: if ( elements(1) .lt. 2000 ) then
                 elements(1) = elements(1) - 1900
               else
                 elements(1) = elements(1) - 2000
               endif century
               write(date_time,1010) elements(3),                       &
     &                       months ( elements(2) ),                    &
     &                                elements(1)  , elements(5),       &
     &                 elements(6)  , elements(7)  , elements(8)
 
 1010          format(i2.2,1x,a3,1x,i2.2,1x,                            &
     &                i2.2,':',i2.2,':',i2.2,'.',i4.4 )
 
               day      = elements(3)
               hour     = elements(5)
               minute   = elements(6)
               second   = elements(7)
               millisec = elements(8)
 
      else
               date_time=' '
 
               day      = 0
               hour     = 0
               minute   = 0
               second   = 0
               millisec = 0
 
      endif
 
      end subroutine timestamp
#endif
 
 
!
!-----------------------------------------------------------------------
!

Results for solutions of the STRIDWAD system for different orders of matrix A

LAPACK RESULTS

These are the results for serial LAPACK runs (nprocs=1). Note that on one processor the highest matrix size is 30,000.

   N         CPU Time
order of       per
 matrix      processor
              [ sec ]

  10000      312.33
  20000     1808.32
  30000     5487.23
  32000        N/A

ScaLAPACK RESULTS

    N                CPU Time                CPU Time                CPU Time                CPU Time
order of  nprocs       per         nprocs      per         nprocs      per                     per
 matrix             processor                processor               processor               processor
                     [ sec ]                  [ sec ]                 [ sec ]                 [ sec ]

 10000      64        8.17          128         5.15         256        6.31         512         .
 20000                40.40                    24.37                   24.44
 30000               118.24                    64.40                   59.70
 40000               254.73                   135.43                  110.18
 50000               462.04                   244.30                  219.82
 60000               754.16                   398.18                  288.39
 70000              1164.53                   608.61                  562.66
 80000              1666.45                   881.23                  587.94
 90000              2330.41                  1227.83                 1146.27
100000              3134.40                  1645.89                 1028.17                 3134.39
110000                                       2131.83                 1328.90
120000                                       2726.03                 1672.00
130000                                       3412.37                 2071.03
140000                                          N/A                  2558.06
150000                                                               3097.58
160000                                                               3661.49
170000                                                               4318.20
180000                                                               5092.44
190000                                                               5873.94
200000                                                               6794.57
210000                                                               7843.55
220000                                                               8798.14
230000                                                              10190.61
240000                                                              11230.16
250000                                                              12874.02
                                                                        N/A

CONCLUSIONS

With ScaLAPACK on 256 processors matrices up to order 250,000 can be inverted, compared to order of 30,000 for LAPACK. When the matrix is distributed over many processors it takes less time and larger order matrices can be inverted.

For a list of the available ScaLAPACK routines see:

http://www.netlib.org/scalapack/double/

All ScaLAPACK routines assume that the data has been distributed on the process grid prior to the invocation of the routine. Detailed descriptions of the appropriate calling sequences for each of the ScaLAPACK routines can be found in the leading comments of the source code or the ScaLAPACK Users' Guide:

http://netlib2.cs.utk.edu/scalapack/slug/scalapack_slug.html