## LAPACK and ScaLAPACK Examples

Nick Chepurniy, Ph.D.
e-mail: nickc@sharcnet.ca
HPC Consultant - SHARCNET http://www.sharcnet.ca

We start by defining a special matrix called STRIDWAD (Sparse TRIDiagonal With Anti-Diagonal). This matrix will be stored using the dense format (i.e. all elements are defined and are stored in memory). The matrix has the following properties:

It can be of any order N (where N is an even integer)

When STRIDWAD is multiplied by the vector X whose components are X(i)=i it produces a vector C
whose elements are C(i)=N+1 except for the last element which is C(N)=2N+2.


These characteristics make it very easy to work with STRIDWAD while developing computer programs for testing purposes.

In practice we will use large values of N but for illustration purposes the STRIDWAD system for N=6 would be:

[  3. -1.  0.  0.  0.  1. ]  [ 1. ]      [  7. ]
[ -1.  3. -1.  0.  1.  0. ]  [ 2. ]      [  7. ]
[  0. -1.  3.  0.  0.  0. ]  [ 3. ]      [  7. ]
[  0.  0.  0.  3. -1.  0. ]  [ 4. ]   =  [  7. ]
[  0.  1.  0. -1.  3. -1. ]  [ 5. ]      [  7. ]
[  1.  0.  0.  0. -1.  3. ]  [ 6. ]      [ 14. ]


$\begin{bmatrix} \quad 3 & -1 & \quad 0 & \quad 0 & \quad 0 & \quad 1 \\ -1 & \quad 3 & -1 & \quad 0 & \quad 1 & \quad 0 \\ \quad 0 & -1 & \quad 3 & \quad 0 & \quad 0 & \quad 0 \\ \quad 0 & \quad 0 & \quad 0 & \quad 3 & -1 & \quad 0 \\ \quad 0 & \quad 1 & \quad 0 & -1 & \quad 3 & -1 \\ \quad 1 & \quad 0 & \quad 0 & \quad 0 & -1 & \quad 3 \end{bmatrix}$

For N=18 it takes this form:

[  3. -1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1. ]  [  1. ]      [ 19. ]
[ -1.  3. -1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0. ]  [  2. ]      [ 19. ]
[  0. -1.  3. -1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0. ]  [  3. ]      [ 19. ]
[  0.  0. -1.  3. -1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0. ]  [  4. ]      [ 19. ]
[  0.  1.  0. -1.  3. -1.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0. ]  [  5. ]      [ 19. ]
[  1.  0.  0.  0. -1.  3. -1.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0. ]  [  6. ]      [ 19. ]
[  0.  0.  0.  0.  0. -1.  3. -1.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0. ]  [  7. ]      [ 19. ]
[  0.  0.  0.  0.  0.  0. -1.  3. -1.  0.  1.  0.  0.  0.  0.  0.  0.  0. ]  [  8. ]      [ 19. ]
[  0.  0.  0.  0.  0.  0.  0. -1.  3.  0.  0.  0.  0.  0.  0.  0.  0.  0. ]  [  9. ]   =  [ 19. ]
[  0.  0.  0.  0.  0.  0.  0.  0.  0.  3. -1.  0.  0.  0.  0.  0.  0.  0. ]  [ 10. ]      [ 19. ]
[  0.  0.  0.  0.  0.  0.  0.  1.  0. -1.  3. -1.  0.  0.  0.  0.  0.  0. ]  [ 11. ]      [ 19. ]
[  0.  0.  0.  0.  0.  0.  1.  0.  0.  0. -1.  3. -1.  0.  0.  0.  0.  0. ]  [ 12. ]      [ 19. ]
[  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0. -1.  3. -1.  0.  0.  0.  0. ]  [ 13. ]      [ 19. ]
[  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0. -1.  3. -1.  0.  0.  0. ]  [ 14. ]      [ 19. ]
[  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0. -1.  3. -1.  0.  0. ]  [ 15. ]      [ 19. ]
[  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0. -1.  3. -1.  0. ]  [ 16. ]      [ 19. ]
[  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0. -1.  3. -1. ]  [ 17. ]      [ 19. ]
[  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0. -1.  3. ]  [ 18. ]      [ 38. ]


Observations on the structure of the STRIDWAD systems:

All elements along the diagonal are equal to 3.

The elements in the two subdiagonals (lower and upper diagonals) are -1.

The diagonal going from the lower left corner to the upper right corner is called anti-diagonal.

All elements on the anti-diagonal are 1.

When the anti-diagonal crosses the subdiagonals the 1's cancel out with the -1's.

Order N is an EVEN number.

The components of the X vector are x(i)=i and all the components of C are N+1 except for the last one.
Last component of C is 2*(N+1) where N is the order of the STRIDWAD system.


So in fact what we have established here is an IDENTITY which for computational purposes can be used in the following two ways:

If we assume X is known and defined by x(i)=i we can use a matrix multiply program
to calculate the product of the STRIDWAD matrix and X and compare the results to C.

If we assume X is unknown and we use the C as defined above we can
use a matrix inversion program and calculate the unknown vector X.


### Program to calculate the RHS vector of the STRIDWAD system

Here is the listing of the program used to calculate the Right Hand Side (RHS) vector assuming that X is known and defined by x(i)=i:

      program TMA
implicit none

integer   :: istat,i,N_PRT
real(kind=8),dimension(:,:),allocatable  :: A
real(kind=8),dimension(:,:),allocatable  :: C
real(kind=8),dimension(:,:),allocatable  :: D
real(kind=8),dimension(:,:),allocatable  :: X

integer   ::  NH, NH1, nprocs, KOUNT

integer*8 :: N, mem_allocated
real(kind=8) :: mem_gb, EPS

parameter (N =  55388)
!     parameter (N =     18)

parameter (N_PRT = 40)

! --------------------------------------------------------------------------

interface

subroutine init_my_matrix (n,a)
implicit none
integer*8 :: N
real(kind=8)    :: a(:,:)
end subroutine init_my_matrix

subroutine init_my_vector (N,X)
implicit none
integer*8 :: N
real(kind=8)    :: X(:,:)
end subroutine init_my_vector

subroutine init_my_rhs (n,c)
implicit none
integer*8 :: N
real(kind=8)    :: c(:,:)
end subroutine init_my_rhs

end interface

! ---------------------------------------------------------------------------

!
!     Trace intermediate steps only for small values of N

EPS = 0.00000001

NPROCS = 1

!     N must be EVEN

NH    = N/2
NH1   = NH + 1

IF (2*NH .NE. N) THEN
WRITE(6,2001) N
2001   FORMAT("N must be EVEN, got N = ",i9)
STOP
ENDIF

!
! -----    Print out nprocs   -----

WRITE(6,101) nprocs,N
101 FORMAT(" nprocs = ",i3,"      N = ",i12)
call flush(6_4)
!
!
! -----   Allocate LHS, RHS, and pivot   -----
!
allocate (a(N, N ), stat=istat)

if (istat/=0) then
print *,"ERROR: ALLOCATE FAILS for A"
call flush(6_4)
stop "ERR:ALLOCATE FAILS for A"
else
mem_allocated = 8*N*N
mem_gb = mem_allocated / 1000000000.0
WRITE(6,103) mem_allocated,mem_gb
103   FORMAT(" memory alloc (A)    = ",i12," bytes i.e. ", f8.5," Gb")
endif

allocate (c(N,1), stat=istat)

if (istat/=0) then
print *,"ERROR: ALLOCATE FAILS for C"
call flush(6_4)
stop "ERR:ALLOCATE FAILS for C"
else
mem_allocated = 8*N
mem_gb = mem_allocated / 1000000000.0
WRITE(6,104) mem_allocated,mem_gb
104   FORMAT(" memory alloc (C)    = ",i12," bytes i.e. ", f8.5," Gb")
endif

allocate (D(N,1), stat=istat)

if (istat/=0) then
print *,"ERROR: ALLOCATE FAILS for D"
call flush(6_4)
stop "ERR:ALLOCATE FAILS for D"
else
mem_allocated = 8*N
mem_gb = mem_allocated / 1000000000.0
WRITE(6,105) mem_allocated,mem_gb
105   FORMAT(" memory alloc (D)    = ",i12," bytes i.e. ", f8.5," Gb")
endif

allocate (X (n,1), stat=istat)

if (istat/=0)  then
print *,"ERROR: ALLOCATE FAILS for X"
call flush(6_4)
stop "ERR:ALLOCATE FAILS for X"
else
mem_allocated = 8*N
mem_gb = mem_allocated / 1000000000.0
WRITE(6,106) mem_allocated,mem_gb
106   FORMAT(" memory alloc (X)    = ",i12," bytes i.e. ", f8.5," Gb")
endif

mem_allocated = 8*N*N + 8*N + 8*N
mem_gb = mem_allocated / 1000000000.0

WRITE(6,102) mem_allocated,mem_gb
102 FORMAT(" memory allocated    = ",i12," bytes i.e. ", f8.5," Gb")
call flush(6_4)

!
! -----    Initialize:  A X = C
!
call init_my_matrix (N,A)
call init_my_vector (N,X)
call init_my_rhs    (N,C)

D = MATMUL(A,X)
!
write (6,500)
500 FORMAT(/"          C(I): ")
IF (N .le. N_PRT) THEN
DO I=1,N
WRITE (6,501) I,C(I,1)
501     FORMAT(" I = ",I3," ",f6.2)
ENDDO

ELSE

!     For large N print only last N_PRT elements of vector

DO I=N-N_PRT,N
WRITE (6,502) I,A(I,I),D(I,1)
502     FORMAT(" I=",I10,"  A(I,I) = ",f13.2,"  D(I,1) = ",f13.2)
ENDDO

ENDIF

KOUNT = 0

DO I=1,N
if (abs(C(I,1)-D(I,1)) .gt. EPS) then
KOUNT = KOUNT + 1
write(6,107) I
107     format("For I = ",i12," C .neq. D")
endif
ENDDO

write(6,108) KOUNT
108 format(/"KOUNT = ",i10)
!
! -----    Cleanup arrays -----
!
deallocate (A,C,D,X)
!
end program TMA
!
!-----------------------------------------------------------------------
!
SUBROUTINE INIT_MY_MATRIX (N,A)
IMPLICIT NONE
INTEGER      :: NH,NH1
INTEGER*8    :: N
REAL(kind=8) :: A(:,:)

INTEGER      :: I

NH  = N/2
NH1 = NH + 1

!     Diagonal elements

DO I = 1,N
A(I,I) = 3.0
ENDDO

!     Upper diagonal

DO 180 I = 1,N-1
if (I .EQ. NH) go to 180
A(I,I+1) = -1.0d0
180 CONTINUE

!     Lower diagonal

DO 160 I = 1,N-1
if (I .EQ. NH) go to 160
A(I+1,I) = -1.0d0
160 CONTINUE

!
!     ANTI-DIAGONAL
!
DO 190 I = 1,N
if (I .EQ. NH .OR. I .EQ. NH1) go to 190

A(I,N-I+1) = 1.0d0
190 CONTINUE

return
end

!
!-----------------------------------------------------------------------
!
SUBROUTINE INIT_MY_RHS (N,C)
IMPLICIT NONE
INTEGER*8    :: N
REAL(kind=8) :: C(:,:)

INTEGER :: I

DO I= 1, N-1
C(I,1) = N + 1.0
ENDDO

C(N,1) = 2*N + 2

return
end
!
!-----------------------------------------------------------------------
!
SUBROUTINE INIT_MY_VECTOR (N,X)
IMPLICIT NONE
INTEGER*8    :: N
REAL(kind=8) :: X(:,:)

INTEGER :: I

DO I= 1, N
X(I,1) = I
ENDDO

return
end

### Important Notes on the program and Output Results

There are several things which should be pointed out:

 (1) The order of the STRIDWAD system (i.e. parameter N) has been declared as INTEGER*8
in order that there be no overflow when memory is calculated using mem_allocated.

 (2) To submit the job the command:
sqsub -t  -r 60m -q threaded -n 4 -o ${CLU}_TEST_MEM_8_%J ./test_mem_alloc8 was used so that all the memory of the node would be available to this program.  Output for N=18 was as follows:  nprocs = 1 N = 18 memory alloc (A) = 2592 bytes i.e. 0.00000 Gb memory alloc (C) = 144 bytes i.e. 0.00000 Gb memory alloc (D) = 144 bytes i.e. 0.00000 Gb memory alloc (X) = 144 bytes i.e. 0.00000 Gb memory allocated = 2880 bytes i.e. 0.00000 Gb C(I): I = 1 19.00 I = 2 19.00 I = 3 19.00 I = 4 19.00 I = 5 19.00 I = 6 19.00 I = 7 19.00 I = 8 19.00 I = 9 19.00 I = 10 19.00 I = 11 19.00 I = 12 19.00 I = 13 19.00 I = 14 19.00 I = 15 19.00 I = 16 19.00 I = 17 19.00 I = 18 38.00 KOUNT = 0 Output for N=55388 (the highest possible value that fits in memory) was:  nprocs = 1 N = 55388 memory alloc (A) = 24542644352 bytes i.e. 24.54264 Gb memory alloc (C) = 443104 bytes i.e. 0.00044 Gb memory alloc (D) = 443104 bytes i.e. 0.00044 Gb memory alloc (X) = 443104 bytes i.e. 0.00044 Gb memory allocated = 24543530560 bytes i.e. 24.54353 Gb C(I): I= 55348 A(I,I) = 3.00 D(I,1) = 55389.00 I= 55349 A(I,I) = 3.00 D(I,1) = 55389.00 I= 55350 A(I,I) = 3.00 D(I,1) = 55389.00 ... I= 55386 A(I,I) = 3.00 D(I,1) = 55389.00 I= 55387 A(I,I) = 3.00 D(I,1) = 55389.00 I= 55388 A(I,I) = 3.00 D(I,1) = 110778.00 KOUNT = 0 ## LAPACK EXAMPLE In the previous section the STRIDWAD matrices of oder N were defined. We now want to use those matrices as input to the LAPACK's routines dgetrf/dgetrs to find the inverses of different orders N. The programs should also report the memory used, cpu time and elapsed time. ### Compiling and Submitting Jobs Following makefile is used to compile and submit the jobs to a compute node: # Determine the platform that is being used for this compilation OS:=$(shell uname)
PLATFORM:= $(shell uname)_$(shell uname -m)

# Set the dafault parameters N (order of matrix times 1000)
# and P (number of threads)

N:= 1
P:= 4

# Specify executable, source and required libraries :

EXE = DP_LAPACK_SOLVE
EXT:= f90
SRC:= ${EXE}.$(EXT)
LIB:= -L/opt/sharcnet/acml/current/pathscale64/lib -lacml

$(EXE):$(SRC)
@echo Check if N is in the range [1,40]
./check_N $N # @echo COMPILING LAPACK DP program with N =$(N),000
#       @echo PLATFORM used = $(PLATFORM) # @echo Edit the parameter_N.inc file cp parameter_N.inc_basic parameter_N.inc ./ed_parm_N parameter_N.inc$N
#       @echo "Using following parameter_N.inc file:"
#       cat parameter_N.inc
#       @echo "LIB = " $(LIB) pathf90 -o$(EXE) $(SRC)$(LIB)
cp sub_batch_DP_LAPACK_SOLVE_basic sub_batch_DP_LAPACK_SOLVE
./ed_procs_P sub_batch_DP_LAPACK_SOLVE $P ./sub_batch_DP_LAPACK_SOLVE superclean: clean @echo ' ' @echo Removing Output files generated by$(EXE)
@echo '--------------------------------------------------'
@echo ' '
rm -rf ${CLU}_SERIAL_DP_SOLVE_* @echo ' ' @echo Output files have been removed. @echo '-------------------------------' @echo ' ' clean: @echo ' ' @echo Cleaning executable$(EXE)
@echo '-----------------------------------'
@echo ' '
rm -rf $(EXE) rm -rf ED_ERROR @echo ' ' @echo Executable$(EXE) has been removed.
@echo '--------------------------------------------'
@echo ' '

help:
@echo "+-----------------------------------------------------------------------------------+"
@echo "|                                                                                   |"
@echo "|              Makefile for running LAPACK example using dgetrf/dgetrs              |"
@echo "|                                                                                   |"
@echo "| Usage: make                      compile program for N=1 (i.e. order=1,000)       |"
@echo "|                                                                                   |"
@echo "|        make N=n                  compile program for N=n (i.e. order=n*1000)      |"
@echo "|                                  where  1 <= n <= 40                              |"
@echo "|                                                                                   |"
@echo "|        make P=p                  submit job with -q threaded -n p                 |"
@echo "|                                  where  1 <= p <= 4 or 6 (depending on cluster)   |"
@echo "|                                                                                   |"
@echo "|        make superclean           remove output files and executable               |"
@echo "|                                                                                   |"
@echo "|        make clean                remove executable                                |"
@echo "|                                                                                   |"
@echo "|        make help                 display makefile usage information               |"
@echo "|                                                                                   |"
@echo "+-----------------------------------------------------------------------------------+"
@echo " "
@echo PLATFORM $(PLATFORM) The make utility will use above makefile and we can override any parameters appearing in the the makefile. Thus, if we type the command: make N=2 P=6  make will use parameter N=2 and P=6 instead of those appearing in the makefile. Then make proceeds to specify the executable, source and required library and executes the commands required to generate the target$(EXE). The first command for generating this target is to invoke the script check_N which takes as argument the parameter $N. #!/bin/bash # script file = check_N if [$# -ne 1 ]; then
echo "Must have 1 argument: value of N"
echo "Usage: $0 <value_of_N> " exit -1 fi N=$1

if [ "$N" -lt 1 ] || [ "$N"  -gt 40 ]
then
echo "Value of N is out of range !"
echo " 1 <= N <= 40 "
exit -1
fi

exit

If the value of $N is in the specified range the check_N script returns 0 and make proceeds to the next command, but if$N is out of the specified range, then the check_N script returns -1 and the make command terminates with an error message.

If the check_N script returned 0 then make proceeds to the next command which copies the parameter_N.inc_basic file into the parameter_N.inc file. The next script ed_parm_N takes two arguments: parameter_N.inc and $N to modify the file parameter_N.inc which is used by the fortran program that we will compile in the next command. Here is a listing of the second script: #!/bin/bash # script file = ed_parm_N if [$# -ne 2 ]; then
echo "Must have 2 argument: Filename to change and value to substitute x for"
echo "Usage: $0 <filename> <value_for_x> " exit fi Filename=$1

ed $Filename <<EOF 2> ED_ERROR 1,2s/x/$2/
w
q
EOF

Now, the make command is ready to compile the source program using the updated version of the include file parameter_N.inc.

If errors are detected in the compilation then make will terminate, otherwise the next command is executed in the makefile script, which is the script:

#!/bin/bash

# script file = ed_procs_P

if [ $# -ne 2 ]; then echo "Must have 2 argument: Filename to change and number of procs to use" echo "Usage:$0  <filename>  <num_of_processors> "
exit -1
fi

if [ printenv CLU = "nar" ] ; then
let MXP=4
fi

if [ printenv CLU = "saw" ] ; then
let MXP=6
fi

if [ $2 -lt 1 ] || [$2 -gt $MXP ] ; then echo "Parameter p is out of range for cluster${CLUSTER}"
exit -1
fi

Filename=$1 ed$Filename <<EOF 2> ED_ERROR
1,6s/nprocs/$2/ w q EOF If p is out of range the make command exits, otherwise the next command is executed which is a script submitting the job to a queue (in which the parameter nprocs will be replaced by what is specified in the makefile): #!/bin/bash # script file = sub_batch_DP_LAPACK_SOLVE_basic sqsub -t -r 60m -q threaded -n nprocs -o${CLU}_SERIAL_DP_SOLVE_%J  ./DP_LAPACK_SOLVE

### The Fortran and Other Files

For completnes we list next the whole fortran source file and working files parameter_N.inc_basic and sub_batch_DP_LAPACK_SOLVE_basic:

      program DP_LAPACK_SOLVE
!
!   file name = DP_LAPACK_SOLVE.f90
!
implicit none

integer   :: istat,info,i,j,N_PRT,milestone
real(kind=8),dimension(:,:),allocatable  :: a
real(kind=8),dimension(:,:),allocatable  :: c
integer,dimension(:),allocatable :: ipiv

integer   :: root, ierr
integer*8 :: N, mem_allocated
real(kind=8) :: mem_gb, EPS

!     parameter (N = 64000)
!     parameter (N = 100000)
!     parameter (N =  35000)

include "parameter_N.inc"

parameter (N_PRT = 40)

integer   ::       context, iam, pnum
integer   ::       mycol, myrow, nb
integer   ::       npcol, nprocs, nprow

integer   ::       ITEM, IAorC, NH, NH1, DEBUG
! --------------------------------------------------------------------------

integer   :: day(10),hour(10),minute(10),second(10),millisec(10)
character*30 :: descr_milestone(10)

integer    :: day_beg,hour_beg,minute_beg,second_beg,millisec_beg
integer    :: day_end,hour_end,minute_end,second_end,millisec_end
character*23 :: date_time
real(kind=8) :: T1, T2
real(kind=8) :: TOT_TIME
real(kind=8) :: tm

! --------------------------------------------------------------------------

interface

subroutine init_my_matrix (n,a,nprow,npcol,myrow,mycol)
implicit none
integer*8 :: N
integer :: nprow,npcol,myrow,mycol
real(kind=8)    :: a(:,:)
end subroutine init_my_matrix

subroutine init_my_rhs (n,c,nprow,npcol,myrow,mycol)
implicit none
integer*8 :: N
integer :: nprow,npcol,myrow,mycol
real(kind=8)    :: c(:,:)
end subroutine init_my_rhs

subroutine elapsed_time(day_beg,hour_beg, minute_beg,second_beg,&
&  millisec_beg, day_end, hour_end, minute_end, second_end,        &
&  millisec_end,tid,ITEM )
implicit none
integer :: day_beg,hour_beg,minute_beg,second_beg,millisec_beg
integer :: day_end,hour_end,minute_end,second_end,millisec_end
integer :: day,hour, minute, second, millisec, tid, ITEM
end subroutine elapsed_time

subroutine timestamp(date_time,day,hour,minute,second,millisec)
implicit none
character*23 :: date_time
integer      :: day, hour, minute, second, millisec
integer      :: elements(8)
character*3  :: months(12)
end subroutine timestamp

end interface

! ---------------------------------------------------------------------------

!
!     Trace intermediate steps only for small values of N

DEBUG = 1
root = 0
IAM  = 0
NPROCS = 1

!     N must be EVEN

NH    = N/2
NH1   = NH + 1

IF (2*NH .NE. N) THEN
WRITE(6,2001) N
2001   FORMAT("N must be EVEN, got N = ",i9)
STOP
ENDIF

IF ( N .gt. N_PRT ) DEBUG = 0
!
! ---------------------------------------------------------------------------

ITEM = 1000 + iam

call timestamp(date_time,day_beg,hour_beg,minute_beg,second_beg,  &
&               millisec_beg )

milestone = 1
IF (IAM .eq. 0) write(6,1006) ITEM,iam,date_time,milestone
1006 format(I4," iam ",i3," BEGIN ",a23," milestone = ",i2)
1008 format(I4," iam ",i3,"       ",a23," milestone = ",i2)
call flush(6_4)

call cpu_time(T1)

call timestamp( date_time,day(milestone), hour(milestone),        &
&     minute(milestone),second(milestone),millisec(milestone) )

descr_milestone(milestone) = "Init blacs_gridinit, allocate"
!
! -----    Print out nprocs   -----

IF (IAM .EQ. 0) THEN
WRITE(6,101) nprocs,N
101     FORMAT(" nprocs = ",i3,"      N = ",i12)
call flush(6_4)
ENDIF
!
!
! -----   Allocate LHS, RHS, and pivot   -----
!
allocate (a(N, N ), stat=istat)
if (istat/=0) then
print *,"ERROR: ALLOCATE FAILS for A"
call flush(6_4)
stop "ERR:ALLOCATE FAILS for A"
else
mem_allocated = 8*N*N
mem_gb = mem_allocated / 1000000000.0
WRITE(6,103) mem_allocated,mem_gb
103   FORMAT(" memory alloc (A)    = ",i12," bytes i.e. ", f9.6," Gb")
endif

allocate (c(N,1), stat=istat)
if (istat/=0) then
print *,"ERROR: ALLOCATE FAILS for C"
call flush(6_4)
stop "ERR:ALLOCATE FAILS for C"
else
mem_allocated = 8*N
mem_gb = mem_allocated / 1000000000.0
WRITE(6,104) mem_allocated,mem_gb
104   FORMAT(" memory alloc (C)    = ",i12," bytes i.e. ", f9.6," Gb")
endif

allocate (ipiv (n), stat=istat)
if (istat/=0)  then
print *,"ERROR: ALLOCATE FAILS for IPIV"
call flush(6_4)
stop "ERR:ALLOCATE FAILS for IPIV"
else
mem_allocated = 4*N
mem_gb = mem_allocated / 1000000000.0
WRITE(6,105) mem_allocated,mem_gb
105   FORMAT(" memory alloc (ipiv) = ",i12," bytes i.e. ", f9.6," Gb")
endif

mem_allocated = 8*N*N + 8*N + 4*N
mem_gb = mem_allocated / 1000000000.0

IF (IAM .EQ. 0) THEN
WRITE(6,102) mem_allocated,mem_gb
102   FORMAT(" memory allocated    = ",i12," bytes i.e. ", f9.6," Gb")
call flush(6_4)
ENDIF

!
! -----    Initialize LHS and RHS
!
call init_my_matrix (n,a,nprow,npcol,myrow,mycol)
call init_my_rhs    (n,c,nprow,npcol,myrow,mycol)
!
milestone = milestone + 1
IF (IAM .eq. 0) write(6,1008) ITEM,iam,date_time,milestone
call flush(6_4)
call timestamp( date_time,day(milestone), hour(milestone),        &
&     minute(milestone),second(milestone),millisec(milestone) )

descr_milestone(milestone) = "call dgetrf"
!
! -----    Do the work -----
!
call dgetrf( N, N, A, N, ipiv, info )

if (iam.eq.0 .AND. DEBUG .eq. 1)                                  &
&  write (6,*) "info (from dgetrf) = ", info

milestone = milestone + 1
IF (IAM .eq. 0) write(6,1008) ITEM,iam,date_time,milestone
call flush(6_4)
call timestamp( date_time,day(milestone), hour(milestone),        &
&     minute(milestone),second(milestone),millisec(milestone) )

descr_milestone(milestone) = "call dgetrs"

call dgetrs( 'N', N, 1, A, N, ipiv, c, N, info )

if (iam.eq.0 .AND. DEBUG .eq. 1)                                  &
&  write (6,*) "info (from dgetrs) = ", info
!

milestone = milestone + 1
IF (IAM .eq. 0) write(6,1008) ITEM,iam,date_time,milestone
call flush(6_4)
call timestamp( date_time,day(milestone), hour(milestone),        &
&     minute(milestone),second(milestone),millisec(milestone) )

descr_milestone(milestone) = "get_solution --> root"

IF (IAM.EQ.0) THEN
IF (N .le. N_PRT) THEN
ITEM = 6000
write (6,500) ITEM
500     FORMAT(I4," SOLUTION: ")
DO I=1,N
ITEM = 6000 + I
WRITE (6,501) ITEM,I,C(I,1)
501       FORMAT(I4," I = ",I3," ",f6.2)
ENDDO

ELSE

!       For large N print only last N_PRT elements of vector

DO I=N-N_PRT,N
WRITE (6,502) I,C(I,1)
502       FORMAT(" I=",I10," ",f13.2)
ENDDO

ENDIF

ENDIF

milestone = milestone + 1
IF (IAM .eq. 0) write(6,1008) ITEM,iam,date_time,milestone
call flush(6_4)
call timestamp( date_time,day(milestone), hour(milestone),        &
&     minute(milestone),second(milestone), millisec(milestone) )

descr_milestone(milestone) = "print and clean up"

!     Now report the milestone time differences

ITEM = 7000

IF (IAM .eq. 0) WRITE(6,1007) ITEM,iam,milestone
1007 FORMAT(I4," iam = ",i3," FINAL value of milestone = ",i2)

do i=1,milestone-1
ITEM = 7000 + I
tm = 86400.0*(day(i+1)-day(i)) +  3600.0*(hour(i+1)-hour(i)) +  &
&       60.0*(minute(i+1)-minute(i)) + (second(i+1)-second(i))  +  &
&       0.001 * (millisec(i+1)-millisec(i))
IF (IAM .EQ. 0) write(6,1005) ITEM,iam, i, i+1, tm,             &
&                  descr_milestone(i)
1005   format(I4," iam ",i3," Time from milestone ",i2," to milestone "&
&        ,I2," equals ",f10.2,3x,a30)
enddo

call timestamp( date_time,day_end,hour_end,minute_end,second_end, &
&                millisec_end  )

ITEM = ITEM + 1
IF (IAM .EQ. 0) write(6,1011) ITEM,iam,date_time
1011 format(I4," iam ",i3," END   ",a23)

call cpu_time(T2)
TOT_TIME = T2-T1

call elapsed_time(day_beg,hour_beg, minute_beg, second_beg,       &
&  millisec_beg, day_end, hour_end, minute_end, second_end,        &
&  millisec_end,iam,ITEM )

ITEM = ITEM + 1
IF (IAM .EQ. 0) write(6,1012) ITEM,iam,TOT_TIME
1012 format(I4," IAM ",i3," Total CPU Time/processor = ",f11.4,        &
&       " seconds")

! ---------------------------------------------------------------------------

!
! -----    Cleanup arrays -----
!
deallocate (a,c,ipiv)
!
end program DP_LAPACK_SOLVE
!
!-----------------------------------------------------------------------
!
SUBROUTINE INIT_MY_MATRIX (N,A,NPROW,NPCOL,MYROW,MYCOL)
IMPLICIT NONE
integer*8 :: N
INTEGER      :: NH,NH1,NPROW,NPCOL,MYROW,MYCOL
REAL(kind=8) :: A(:,:)

INTEGER      :: I, J

NH  = N/2
NH1 = NH + 1

!     Diagonal elements

DO I = 1,N
A(I,I) = 3.0
ENDDO

!     Upper diagonal

DO 180 I = 1,N-1
if (I .EQ. NH) go to 180
A(I,I+1) = -1.0d0
180 CONTINUE

!     Lower diagonal

DO 160 I = 1,N-1
if (I .EQ. NH) go to 160
A(I+1,I) = -1.0d0
160 CONTINUE

!
!     ANTI-DIAGONAL
!
DO 190 I = 1,N
if (I .EQ. NH .OR. I .EQ. NH1) go to 190

A(I,N-I+1) = 1.0d0
190 CONTINUE

return
end

!
!-----------------------------------------------------------------------
!
SUBROUTINE INIT_MY_RHS (N,C,NPROW,NPCOL,MYROW,MYCOL)
IMPLICIT NONE
integer*8    :: N
INTEGER      :: NPROW,NPCOL,MYROW,MYCOL
REAL(kind=8) :: C(:,:)

INTEGER :: I

DO I= 1, N-1
C(I,1) = N + 1.0
ENDDO

C(N,1) = 2*N + 2

return
end
!
!-----------------------------------------------------------------------
!
subroutine elapsed_time(day_beg, hour_beg, minute_beg, second_beg,&
&          millisec_beg, day_end, hour_end, minute_end, second_end,&
&          millisec_end,tid, ITEM )

!date_and_time

! VALUES
!   must be of type default integer and of rank one. It is an INTENT(OUT)
!   argument. Its size must be at least eight. The values returned in VALUES
!   are as follows:

!   VALUES(1)
!   is the year (for example, 1998), or -HUGE (0) if no date is available.

!   VALUES(2)
!   is the month of the year, or -HUGE (0) if no date is available.

!   VALUES(3)
!   is the day of the month, or -HUGE (0) if no date is available.

!   VALUES(4)
!   is the time difference with respect to Coordinated Universal Time (UTC)
!   in minutes, or -HUGE (0) if this information is not available.
!
!   VALUES(5)
!   is the hour of the day, in the range 0 to 23, or -HUGE (0) if there is
!   no clock.
!
!   VALUES(6)
!   is the minutes of the hour, in the range 0 to 59, or -HUGE (0) if there
!   is no clock.
!   !
!   VALUES(7)
!   is the seconds of the minute, in the range 0 to 60, or -HUGE (0) if
!   there is no clock.
!
!   VALUES (8)
!   is the milliseconds of the second, in the range 0 to 999, or -HUGE (0)
!   if there is no clock.

implicit none
integer :: day_beg, hour_beg, minute_beg, second_beg, millisec_beg
integer :: day_end, hour_end, minute_end, second_end, millisec_end
integer :: day, hour, minute, second, millisec, tid, ITEM

if ( millisec_end .lt. millisec_beg ) then
millisec_end = millisec_end + 1000
second_end = second_end - 1
endif

if ( second_end .lt. second_beg ) then
second_end = second_end + 60
minute_end = minute_end - 1
endif

if ( minute_end .lt. minute_beg ) then
minute_end = minute_end + 60
hour_end   = hour_end - 1
endif

if ( hour_end .lt. hour_beg ) then
hour_end = hour_end + 24
day_end  = day_end - 1
endif

millisec = millisec_end - millisec_beg
second   = second_end   - second_beg
minute   = minute_end   - minute_beg
hour     = hour_end     - hour_beg
day      = day_end      - day_beg

if ( .not. ( day .eq. 0 .and. hour .eq. 0 .and. minute .eq. 0     &
&        .and. second .eq. 0 .and. millisec .eq. 0 ) ) then
ITEM = ITEM + 1
IF (tid .eq. 0 ) THEN
write(6,8000) ITEM,tid,day
8000     format(I4," IAM = ",I3," ELAPSED0 days          = ",i3)
write(6,8001) ITEM,tid,hour
8001     format(I4," IAM = ",I3," ELAPSED1 hours         = ",i3)
write(6,8002) ITEM,tid,minute
8002     format(I4," IAM = ",I3," ELAPSED2 minute        = ",i3)
write(6,8003) ITEM,tid,second
8003     format(I4," IAM = ",I3," ELAPSED3 second        = ",i3)
write(6,8004) ITEM,tid,millisec
8004     format(I4," IAM = ",I3," ELAPSED4 millisec      = ",i3)
ENDIF
endif

return

end subroutine elapsed_time

!
!-----------------------------------------------------------------------
!

subroutine timestamp( date_time, day,hour,minute,second,millisec)
implicit none
character*23 :: date_time
integer      :: day, hour, minute, second, millisec
integer      :: elements(8)
character*3  :: months(12)

data months /'Jan','Feb','Mar','Apr','May','Jun',                 &
&             'Juk','Aug','Sep','Oct','Nov','Dec' /
call date_and_time ( VALUES=elements )

if ( elements(1) .ne. -HUGE(0) ) then
century: if ( elements(1) .lt. 2000 ) then
elements(1) = elements(1) - 1900
else
elements(1) = elements(1) - 2000
endif century
write(date_time,1010) elements(3),                       &
&                       months ( elements(2) ),                    &
&                                elements(1)  , elements(5),       &
&                 elements(6)  , elements(7)  , elements(8)

1010          format(i2.2,1x,a3,1x,i2.2,1x,                            &
&                i2.2,':',i2.2,':',i2.2,'.',i4.4 )

day      = elements(3)
hour     = elements(5)
minute   = elements(6)
second   = elements(7)
millisec = elements(8)

else
date_time=' '

day      = 0
hour     = 0
minute   = 0
second   = 0
millisec = 0

endif

end subroutine timestamp

!
!-----------------------------------------------------------------------
!

This file is used by script ed_parm_N which substitutes the x by $N: ! filename = parameter_N.inc_basic parameter (N = x000) This file is used by script ed_procs_P which substitutes nprocs by$P:

#!/bin/bash

# script file = sub_batch_DP_LAPACK_SOLVE_basic
sqsub -t -r 60m -q threaded -n nprocs -o ${CLU}_SERIAL_DP_SOLVE_%J ./DP_LAPACK_SOLVE ## ScaLAPACK EXAMPLE ScaLAPACK is a distributed memory version of LAPACK. It uses the Parallel Basic Linear Algebra Subprograms (PBLAS) as computational building blocks. ScaLAPACK uses the block-cyclic decomposition scheme to distribute block-partitioned matrices which should make the computation balanced and scalable. ### Example of a block cyclic data distribution According to the two-dimensional block cyclic data distribution scheme, an M_ by N_ dense matrix is first decomposed into MB_ by NB_ blocks starting at its upper left corner. These blocks are then uniformly distributed in each dimension of the Process Grid. Thus, every process owns a collection of blocks, which are locally and contiguously stored in a two-dimensional column major array. The partitioning of a matrix into blocks and the mapping of these blocks onto a Process Grid is illustrated with a global 9x9 matrix A. The first step in this process is to partition the matrix A into block. Let us use 2x2 blocks and assume that the 2-D Process Grid is 2x3. a11 a12 | a13 a14 | a15 a16 | a17 a18 | a19 a21 a22 | a23 a24 | a25 a26 | a27 a28 | a29 B11 B12 B13 B14 B15 --------.---------.---------.---------.---- a31 a32 | a33 a34 | a35 a36 | a37 a38 | a39 a41 a42 | a43 a44 | a45 a46 | a47 a48 | a49 B21 B22 B23 B24 B25 --------.---------.---------.---------.---- a51 a52 | a53 a54 | a55 a56 | a57 a58 | a59 == a61 a62 | a63 a64 | a65 a66 | a67 a68 | a69 B31 B32 B33 B34 B35 --------.---------.---------.---------.---- a71 a72 | a73 a74 | a75 a76 | a77 a78 | a79 a81 a82 | a83 a84 | a85 a86 | a87 a88 | a89 B41 B42 B43 B44 B45 --------.---------.---------.---------.---- a91 a92 | a93 a94 | a95 a96 | a97 a98 | a99 B51 B52 B53 B54 B55   matrix A is M_ x N_ (9 x 9) A is partitioned into 2x2 blocks  In the above diagram the Bij are the 2x2 blocks, e.g.  B11 == a11 a12 B12 == a13 a14 ... B15 == a19 a21 a22 a23 a24 a29  Initially, the 2x3 Process Grid is empty and looks like this:  0 1 2 . . . 0 . . . . . . 1 . . . . . .  We identify each process in the Process Grid by two coordinates (row,col). Thus, for the 2x3 Process Grid the processes would be:  0 1 2 0 (0,0) (0,1) (0,2) 1 (1,0) (1,1) (1,2)  The distribution process starts by taking the Global Bij in first row and distribute them to the first row of the Processor Grid:  0 1 2 B11 B14 B12 B15 B13 0 . . . . . . 1 . . . . . .  Take Global Bij in next row and distribute them to the next row of the Process Grid: (if previous distribution was on last row of Process Grid then restart with row 0 of Process Grid ).  0 1 2 B11 B14 B12 B15 B13 0 . . . . . . 1 B21 B24 B22 B25 B23 . . .  Take Global Bij in next row and distribute them to the first row of the Process Grid: (restart with row 0 of Process Grid).  0 1 2 B11 B14 B12 B15 B13 0 B31 B34 B32 B35 B33 . . . 1 B21 B24 B22 B25 B23 . . .  Take Global Bij in next row and distribute them to the next row of the Process Grid:  0 1 2 B11 B14 B12 B15 B13 0 B31 B34 B32 B35 B33 . . . 1 B21 B24 B22 B25 B23 B41 B44 B42 B45 B43  Take Global Bij in next row and distribute them to the first row of the Process Grid: (restart with row 0 of Process Grid).  0 1 2 B11 B14 B12 B15 B13 0 B31 B34 B32 B35 B33 B51 B54 B52 B55 B53 1 B21 B24 B22 B25 B23 B41 B44 B42 B45 B43  ### Set Up prior to invoking the Parallel Solvers Before calling solvers and other functions from the ScaLAPACK library the user needs to configure the processes into a BLACS 2D Process Grid. ScaLAPACK requires that the data arrays be block-cyclically distributed across the 2D Processor Grid. Here is the procedure to accomplish this: (1) Start by invoking the following routine:  call blacs_pinfo( iam, nprocs ) OUT OUT  which identifies the process in the process grid and how many processes there are in total. (2) Next, you should set up a rectangular grid (2D) as close as possible to a square. Following algorithm was developed by Carlo Cavazzoni of CINECA (this is not part of BLACS), and you must include it in your source:  call gridsetup (nprocs,nprow,npcol) INPUT OUT OUT  This computes the number of rows and columns we'll have in the process grid, such that NPROW x NPCOL == NPROCS. (3) Inform BLACS of this grid extent by calling following routines:  call blacs_get( -1, 0, context ) IN IN OUT   call blacs_gridinit( context, 'r', nprow, npcol ) INPUT IN INPUT INPUT   Note: 'r' ==> ROW-MAJOR ordering i.e. fill in first row left to right then next row. 'c' ==> COLUMN-MAJOR ordering.  MPI has communicators, BLACS has contexts. For this example, the default context, which includes all processors, is all we need. The BLACS_GET call simply returns a handle to the default context for use in subsequent BLACS calls. BLACS_GRIDINIT informs BLACS of the grid extent which we computed in step 2. Behind the scenes, BLACS then assigns processors to points in this virtual grid. This now completes the definition of the BLACS 2D process grid. We can now use subroutine BLACS_GRIDINFO to query BLACS on the information we supplied plus additional information BLACS has computed such as its coordinate, e.g., COL and ROW, in the grid by specifying "context":  call blacs_gridinfo( context, nprow, npcol, myrow, mycol ) INPUT OUT OUT OUT OUT  Subroutine report_back prints the information we got from the call to blacs_gridinfo. (4) Distribute the data arrays across all processors: For each array, we need to take four basic steps: (a) determine a block size for the global array (b) call DESCINIT to create a standard ScaLAPACK array descriptor (c) allocate local memory for each process' portion of the array (d) distribute the actual data values into the allocated memory  (a) determine a block size for the global arrays and block-cyclically distribute these blocks across the 2D processor grid defined in steps (1), (2) and (3): Here "Block size" refers to the dimensions of the subdivisions into which the global array is decomposed. The ScaLAPACK User Guide recommends a block size of 64x64 for large arrays. Carlo Cavazzoni of CINECA has written a subroutine to compute the block size for a given array: "Blockset" chooses a good block size nb based on the size of the global array (i.e. dimension of matrix A), n, and number of rows and columns in the processor grid (nprow,npcol). It also honors a maximum block size value, which, following the User Guide recommendation, can simply be set at 64. We will then simply call blockset to determine the block size, nb:  call blockset( nb, 64, n, nprow, npcol) OUT INPUT INPUT INPUT INPUT  (b) call DESCINIT to create a standard ScaLAPACK array descriptor The ScaLAPACK function NUMROC is used to determine the number of rows of matrix A:  l_nrowsa = numroc(n,nb,myrow,0,nprow)  and again to determine the number of columns of matrix A:  l_ncolsa = numroc(n,nb,mycol,0,npcol)  Now we are ready to call DESCINIT:  call descinit( desca, n, n, nb, nb, 0, 0, context, l_nrowsa, info )  Similar calls are required for matrix C (the right hand side). (c) allocate local memory for each process' portion of the array Following fortran statement will allocate the required local memory for the segment of the matrix A used by this processor:  allocate (a(l_nrowsa, l_ncolsa ), stat=istat)  Similar statements are required for the arrays C and IPIV, etc ... (d) distribute the actual data values into the allocated memory The ScaLAPACK subroutine PDELSET is used to store the global values of matrices A and C to the appropriate locations in each processor. When we are done with a BLACS context, it should be released by calling subroutine BLACS_GRIDEXITR, and when we are done with BLACS altogether, we should call BLACS_EXIT. (These calls are similar to the MPI functions, MPI_Comm_free and MPI_Finalize). ### Timing Routines and Scripts used in the makefile In the next section a full listing of the program designed to solve a system of STRIDWAD equations of an even order N will appear. This program uses ScaLAPACK routines by distributing the coefficient matrix A and right hand side vector on the processors of the 2D Process Grid. The timing routines are used to measure the cpu and elapsed times in different segments of the program. Several milestones are set up throughout the program and the timings between this milestones is reported at the end of the program. Following scripts and basic files are used in the makefile: ./check_N$N
parameter_N.inc_basic
./ed_parm_N parameter_N.inc $N sub_batch_DP_LAPACK_SOLVE_basic ./ed_procs_P sub_batch_DP_LAPACK_SOLVE$P
./sub_batch_DP_LAPACK_SOLVE


listings follow:

### Using make/makefile to compile and submit the program

The make command is used to execute the commands in the makefile. The makefile has several macros with default values but the user can override these values. Several scripts are invoked to verify some parameters and if these parameters are out of range then the script returns a non-zero return code which stop make from further executing the commands on the makefile.

For example if the default parameters N, P and FPP_OPT were set to:

N:= 1
P:= 64
FPP_OPT:= -DPRINT_TIMES -DNO_PRINT_MEMORY_DISTR


in the makefile and we type the command:

make


the utility program make will proceed to read the makefile and carry out the following tasks:

Check if N is in the range [1,40]  (N times 1000 is the order of the matrix to invert)
Create the include file parameter_N.inc by making a copy of it from
file parameter_N.inc_basic and changing the parameter N to 4
Invoking fpp with option -DPRINT_TIMES and generating a new source
file ScaLAPACK_SOLVE2.f90
Compiling the latter with command compile using the libraries defined in the makefile by
macro $(LIB) Generating a script to submit the job based on file sub_batch_ScaLAPACK_SOLVE_basic and setting the parameter for number or processors to use Submitting the job to the test queue  Since we are submitting to the test queue we must wait until the job finishes running, also we need to type the command: make clean  to remove the last executable. If we want to run the case for a system of order 4000 using 128 processors and do not want to time the routines but want to see how the matrix A is distributed on the processors we could type: make N=4 P=128 "FPP_OPT=-DNO_PRINT_TIMES -DPRINT_MEMORY_DISTR"  or one could also type: make N=4 P=128 FPP_OPT=-DNO_PRINT_TIMES FPP_OPT+=-DPRINT_MEMORY_DISTR  The macro USE_QUEUE is used to decide if the job to be submitted should be run in the test queue or not. This macro must be set up manually by editing the makefile. Thus, for jobs to be submitted to the test queue the macro should look like this: USE_QUEUE:= \-t # USE_QUEUE:=\-v  also note that the script sub_batch_ScaLAPACK_SOLVE_basic uses -r 60m. For longer production jobs (not to be submitted to the test queue) the macro should look like this: # USE_QUEUE:= \-t USE_QUEUE:=\-v  and in the script sub_batch_ScaLAPACK_SOLVE_basic could use larger than 60m values for parameter -r instead of -r 60m. As you become familiar with the macros in the makefile, you can adjust them to fit your needs. Here is the makefile used in this online tutorial: # Determine the platform that is being used for this compilation OS:=$(shell uname)
PLATFORM:= $(shell uname)_$(shell uname -m)

# Set the dafault parameters N (order of matrx times 1000)
# and P (number of processors

N:= 1
P:= 64
FPP_OPT:= -DPRINT_TIMES -DNO_PRINT_MEMORY_DISTR

USE_QUEUE:= \-t
# USE_QUEUE:=\-v

# Specify executable, source and required libraries :

EXE = ScaLAPACK_SOLVE
EXT:= f90
SRC:= ${EXE}.$(EXT)

EXE2:=$(EXE)2 SRC2:=${EXE2}.$(EXT) LIB:= -psc -lscalapack -lblacs -lblacsF77init -lblacs -lmpi -L$(BLACSPATH)/pathscale/lib \
-L$(SCALAPACKPATH)/pathscale/lib -L$(ACMLPATH) -lacml

$(EXE):$(SRC)
@echo Check if N is in the range [1,40]
./check_N $N # @echo COMPILING LAPACK DP program with N =$(N),000
#       @echo PLATFORM used = $(PLATFORM) # @echo Edit the parameter_N.inc file cp parameter_N.inc_basic parameter_N.inc ./ed_parm_N parameter_N.inc$N
#       @echo "Using following parameter_N.inc file:"
#       cat parameter_N.inc
#       @echo "LIB = " $(LIB) @echo "Invoking fpp with option$(FPP_OPT)"
fpp $(FPP_OPT)$(SRC) $(SRC2) compile -o$(EXE) $(SRC2)$(LIB)

cp sub_batch_ScaLAPACK_SOLVE_basic sub_batch_ScaLAPACK_SOLVE
./ed_procs_P sub_batch_ScaLAPACK_SOLVE $(USE_QUEUE)$P
./sub_batch_ScaLAPACK_SOLVE

superclean: clean
@echo ' '
@echo  Removing Output files generated by $(EXE) @echo '--------------------------------------------------' @echo ' ' rm -rf${CLU}_SERIAL_DP_SOLVE_*
@echo ' '
@echo  Output files have been removed.
@echo '-------------------------------'
@echo ' '

clean:
@echo ' '
@echo  Cleaning executable $(EXE) @echo '-----------------------------------' @echo ' ' rm -rf$(EXE)
rm -rf ED_ERROR
@echo ' '
@echo  Executable $(EXE) has been removed. @echo '--------------------------------------------' @echo ' ' help: @echo "+-----------------------------------------------------------------------------------+" @echo "| |" @echo "| Makefile for running ScaLAPACK example using pdgetrf/pdgetrs |" @echo "| |" @echo "| Usage: make compile program for N=1 (i.e. order=1,000) |" @echo "| |" @echo "| make N=n compile program for N=n (i.e. order=n*1000) |" @echo "| where 1 <= n <= 40 |" @echo "| |" @echo "| make P=p submit job with -q threaded -n p |" @echo "| where 1 <= p <= 4 or 6 (depending on cluster) |" @echo "| |" @echo "| make superclean remove output files and executable |" @echo "| |" @echo "| make clean remove executable |" @echo "| |" @echo "| make help display makefile usage information |" @echo "| |" @echo "+-----------------------------------------------------------------------------------+" @echo " " @echo PLATFORM$(PLATFORM)

### Full fortran program listing using ScaLAPACK routines to solve a STRIDWAD system

The full fortran 90 program (which includes fpp directives) is listed next:

      program ScaLAPACK_SOLVE
!
!   file name = ScaLAPACK_SOLVE.f90
!
implicit none

include 'mpif.h'

integer   :: istat,info,i,j,N_PRT,milestone
real(kind=8),dimension(:,:),allocatable  :: a
real(kind=8),dimension(:,:),allocatable  :: c
real(kind=8),dimension(:),allocatable  :: solution
integer,dimension(:),allocatable :: ipiv

integer*8 :: N
real(kind=8) :: memory_local, memory_sum
integer   :: root, ierr
real(kind=8),dimension(:),allocatable    :: memory_a
logical :: PRT_MEM

include "parameter_N.inc"

parameter (N_PRT = 40)

integer   ::       context, iam, pnum
integer   ::       mycol, myrow, nb
integer   ::       npcol, nprocs, nprow
integer   ::       l_nrowsa,l_ncolsa
integer   ::       l_nrowsc,l_ncolsc

integer,parameter :: descriptor_len=9
integer   ::       desca( descriptor_len )
integer   ::       descc( descriptor_len )

integer   ::       numroc, ITEM, IAorC, NH, NH1, DEBUG
! --------------------------------------------------------------------------
#ifdef PRINT_TIMES
integer   :: day(10),hour(10),minute(10),second(10),millisec(10)
character*30 :: descr_milestone(10)

integer    :: day_beg,hour_beg,minute_beg,second_beg,millisec_beg
integer    :: day_end,hour_end,minute_end,second_end,millisec_end
character*23 :: date_time
real(kind=8) :: T1, T2
real(kind=8) :: TOT_TIME
real(kind=8) :: tm
#endif

! --------------------------------------------------------------------------

interface
subroutine report_back ( context,                          &
iam,nprocs,myrow,nprow,mycol,npcol)
implicit none
integer :: context
integer :: iam,nprocs,myrow,nprow,mycol,npcol
integer :: i,j,pnum
end subroutine report_back

subroutine printlocals( context,                          &
a,iam,nb,nprocs,myrow,mycol,l_nrows,l_ncols,IAorC)
implicit none
real(kind=8)    :: a(:,:)
integer :: context
integer :: iam,nb,nprocs,myrow,mycol,l_nrows,l_ncols,IAorC
integer :: i,j,pnum
end subroutine printlocals

subroutine init_my_matrix (n,a,nprow,npcol,myrow,mycol,desca)
implicit none
integer*8 :: N
integer :: nprow,npcol,myrow,mycol
integer :: desca(:)
real(kind=8)    :: a(:,:)
end subroutine init_my_matrix

subroutine init_my_rhs (n,c,nprow,npcol,myrow,mycol,descc)
implicit none
integer*8 :: N
integer :: nprow,npcol,myrow,mycol
integer :: descc(:)
real(kind=8)    :: c(:,:)
end subroutine init_my_rhs

subroutine get_solution (n,c,descc,solution)
implicit none
integer*8 :: N
integer :: descc(:)
real(kind=8) ::  c(:,:),solution(:)
end subroutine get_solution

#ifdef PRINT_TIMES
subroutine elapsed_time(day_beg,hour_beg, minute_beg,second_beg,&
&  millisec_beg, day_end, hour_end, minute_end, second_end,        &
&  millisec_end,tid,ITEM )
implicit none
integer :: day_beg,hour_beg,minute_beg,second_beg,millisec_beg
integer :: day_end,hour_end,minute_end,second_end,millisec_end
integer :: day,hour, minute, second, millisec, tid, ITEM
end subroutine elapsed_time
subroutine timestamp(date_time,day,hour,minute,second,millisec)
implicit none
character*23 :: date_time
integer      :: day, hour, minute, second, millisec
integer      :: elements(8)
character*3  :: months(12)
end subroutine timestamp
#endif

end interface

! ---------------------------------------------------------------------------

#ifdef PRINT_MEMORY_DISTR
PRT_MEM = .TRUE.
#else
PRT_MEM = .FALSE.
#endif

!
!     Trace intermediate steps only for small values of N

DEBUG = 1
root = 0

!     N must be EVEN

NH    = N/2
NH1   = NH + 1

IF (2*NH .NE. N) THEN
WRITE(6,2001) N
2001   FORMAT("N must be EVEN, got N = ",i9)
STOP
ENDIF

IF ( N .gt. N_PRT ) DEBUG = 0
!
! -----    Initialize the blacs.  Note: processors are counted starting at 0.
!
call blacs_pinfo( iam, nprocs )

! ---------------------------------------------------------------------------

ITEM = 1000 + iam

#ifdef PRINT_TIMES
call timestamp(date_time,day_beg,hour_beg,minute_beg,second_beg,  &
&               millisec_beg )

IF (IAM .eq. 0) write(6,1006) ITEM,iam,date_time
1006 format(I4," iam ",i3," BEGIN ",a23)

call cpu_time(T1)

milestone = 1
call timestamp( date_time,day(milestone), hour(milestone),        &
&     minute(milestone),second(milestone),millisec(milestone) )

descr_milestone(milestone) = "Init blacs_gridinit, allocate"
#endif
!
! -----    Print out nprocs   -----

IF (IAM .EQ. 0) THEN
WRITE(6,101) nprocs
101     FORMAT(" nprocs = ",i3)
call flush(6_4)
ENDIF
! -----    Set the dimension of the 2d processors grid.
!
call gridsetup(nprocs,nprow,npcol)
!
! -----    Initialize a single blacs context.  Determine which processor I
!          am in the 2D process or grid.
!
call blacs_get( -1, 0, context )
call blacs_gridinit( context, 'r', nprow, npcol )
call blacs_gridinfo( context, nprow, npcol, myrow, mycol )

IF (DEBUG .eq. 1) call report_back (context,                      &
&  iam,nprocs,myrow,nprow,mycol,npcol)
!
! -----    Calculate the blocking factor for the matrix.
!
call blockset( nb, 64, n, nprow, npcol)
!
! -----    Distributed matrices: get num. local rows/cols. Create description.
!
l_nrowsa = numroc(n,nb,myrow,0,nprow)
l_ncolsa = numroc(n,nb,mycol,0,npcol)
call descinit( desca, n, n, nb, nb, 0, 0, context, l_nrowsa, info )

l_nrowsc = numroc(n,nb,myrow,0,nprow)
l_ncolsc = numroc(1,nb,mycol,0,npcol)
call descinit( descc, n, 1, nb, nb, 0, 0, context, l_nrowsc, info )
!
! -----   Allocate LHS, RHS, pivot, and solution -----
!
allocate (a(l_nrowsa, l_ncolsa ), stat=istat)
if (istat/=0) stop "ERR:ALLOCATE FAILS for A"

allocate (c(l_nrowsc,l_ncolsc), stat=istat)
if (istat/=0) stop "ERR:ALLOCATE FAILS for C"

allocate (ipiv (n), stat=istat)
if (istat/=0) stop "ERR:ALLOCATE FAILS for IPIV"

allocate (solution (n), stat=istat)
if (istat/=0) stop "ERR:ALLOCATE FAILS for solution"

allocate (memory_a (nprocs), stat=istat)
if (istat/=0) stop "ERR:ALLOCATE FAILS for memory_a"
!
! -----    Reassemble memory_a on all processors. Add and Print on 0 -----
!

memory_local = l_nrowsa*l_ncolsa*8

call MPI_Gather(memory_local, 1, MPI_REAL8,                      &
&                memory_a, 1, MPI_REAL8,   root,                   &
&                MPI_COMM_WORLD, ierr)

IF (IAM.EQ.0) THEN
IF (N .le. N_PRT .or. PRT_MEM ) THEN
ITEM = 6500
memory_sum = 0
write (6,550) ITEM,N
550     FORMAT(I4," N = ",i12 /" Distributed memory for A: ")
DO I=0,nprocs-1
ITEM = 6500 + I
WRITE (6,551) ITEM,I,memory_a(I+1)
memory_sum = memory_sum + memory_a(I+1)
551       FORMAT(I4,"       processor = ",I3," ",f14.1)
ENDDO

ITEM = ITEM + 1
WRITE (6,552) ITEM,memory_sum,float(N)*float(N)*8
552     FORMAT(I4," TOTAL MEMORY for A = ",f14.1," ==  N*N*8 = ",f14.1)
ENDIF
call flush(6_4)

ENDIF

!
! -----    Initialize LHS and RHS
!
call init_my_matrix (n,a,nprow,npcol,myrow,mycol,desca)
call init_my_rhs    (n,c,nprow,npcol,myrow,mycol,descc)
!
! -----    Show how arrays distributed
!
IF (DEBUG .eq. 1) THEN

ITEM = 2000 + 100*MYROW + 10*MYCOL

IF (IAM.EQ.0) THEN
write(6,300) ITEM,n,n
300     format(i4," DISTRIBUTION OF ARRAY: A  -  Global dimension:",  &
&         i3,":",i3)
ENDIF

IAorC = 0
call printlocals ( context,                                     &
&       a,iam,nb,nprocs,myrow,mycol,l_nrowsa,l_ncolsa,IAorC)

IF (IAM.EQ.0) THEN
write(6,400) ITEM,N,1
400     FORMAT(i4," DISTRIBUTION OF ARRAY: C  -  Global dimension:",  &
&           i3,":",i3)
ENDIF

IAorC = 1000
call printlocals ( context,                                     &
&  c,iam,nb,nprocs,myrow,mycol,l_nrowsc,l_ncolsc,IAorC)

ENDIF

#ifdef PRINT_TIMES
milestone = milestone + 1
call timestamp( date_time,day(milestone), hour(milestone),        &
&     minute(milestone),second(milestone),millisec(milestone) )

descr_milestone(milestone) = "call pdgetrf"
#endif
!
! -----    Do the work -----
!
call pdgetrf( n, n, a, 1, 1, desca, ipiv, info )

if (iam.eq.0 .AND. DEBUG .eq. 1)                                  &
&  write (6,*) "info (from pdgetrf) = ", info

#ifdef PRINT_TIMES
milestone = milestone + 1
call timestamp( date_time,day(milestone), hour(milestone),        &
&     minute(milestone),second(milestone),millisec(milestone) )

descr_milestone(milestone) = "call pdgetrs"
#endif

call pdgetrs( 'n', n, 1, a, 1, 1, desca, ipiv,                    &
&               c, 1, 1, descc, info )

if (iam.eq.0 .AND. DEBUG .eq. 1)                                  &
&  write (6,*) "info (from pdgetrs) = ", info
!
! -----    Reassemble solution on all processors. Print on 0 -----
!

#ifdef PRINT_TIMES
milestone = milestone + 1
call timestamp( date_time,day(milestone), hour(milestone),        &
&     minute(milestone),second(milestone),millisec(milestone) )

descr_milestone(milestone) = "get_solution --> root"
#endif

call get_solution (n,c,descc,solution)

IF (IAM.EQ.0) THEN
IF (N .le. N_PRT) THEN
ITEM = 6000
write (6,500) ITEM
500     FORMAT(I4," SOLUTION: ")
DO I=1,N
ITEM = 6000 + I
WRITE (6,501) ITEM,I,SOLUTION(I)
501       FORMAT(I4," I = ",I3," ",f6.2)
ENDDO

ELSE

!       For large N print only last N_PRT elements of vector

DO I=N-N_PRT,N
WRITE (6,502) I,SOLUTION(I)
502       FORMAT(" I=",I10," ",f13.2)
ENDDO

ENDIF

ENDIF

#ifdef PRINT_TIMES
milestone = milestone + 1
call timestamp( date_time,day(milestone), hour(milestone),        &
&     minute(milestone),second(milestone), millisec(milestone) )

descr_milestone(milestone) = "print and clean up"
#endif

!     Now report the milestone time differences

#ifdef PRINT_TIMES
ITEM = 7000
IF (IAM .eq. 0) WRITE(6,1007) ITEM,iam,milestone
1007 FORMAT(I4," iam = ",i3," FINAL value of milestone = ",i2)

do i=1,milestone-1
ITEM = 7000 + I
tm = 86400.0*(day(i+1)-day(i)) +  3600.0*(hour(i+1)-hour(i)) +  &
&       60.0*(minute(i+1)-minute(i)) + (second(i+1)-second(i))  +  &
&       0.001 * (millisec(i+1)-millisec(i))
IF (IAM .EQ. 0) write(6,1005) ITEM,iam, i, i+1, tm,             &
&                  descr_milestone(i)
1005   format(I4," iam ",i3," Time from milestone ",i2," to milestone "&
&        ,I2," equals ",f10.2,3x,a30)
enddo

call timestamp( date_time,day_end,hour_end,minute_end,second_end, &
&                millisec_end  )

ITEM = ITEM + 1
IF (IAM .EQ. 0) write(6,1011) ITEM,iam,date_time
1011 format(I4," iam ",i3," END   ",a23)

call cpu_time(T2)
TOT_TIME = T2-T1

call elapsed_time(day_beg,hour_beg, minute_beg, second_beg,       &
&  millisec_beg, day_end, hour_end, minute_end, second_end,        &
&  millisec_end,iam,ITEM )

ITEM = ITEM + 1
IF (IAM .EQ. 0) write(6,1012) ITEM,iam,TOT_TIME
1012 format(I4," IAM ",i3," Total CPU Time/processor = ",f11.4,        &
&       " seconds")
#endif

! ---------------------------------------------------------------------------

!
! -----    Cleanup arrays -----
!
deallocate (a,c,ipiv,solution,memory_a)
!
! -----    Exit BLACS cleanly -----
!
call blacs_gridexit( context )
call blacs_exit( 0 )

end program ScaLAPACK_SOLVE

subroutine gridsetup(nproc,nprow,npcol)
!
! This subroutine factorizes the number of processors (nproc)
! into nprow and npcol,  that are the sizes of the 2d processors mesh.
!
! Written by Carlo Cavazzoni
!
implicit none
integer nproc,nprow,npcol
integer sqrtnp,i

sqrtnp = int( sqrt( dble(nproc) ) + 1 )
do i=1,sqrtnp
if(mod(nproc,i).eq.0) nprow = i
end do
npcol = nproc/nprow

return
end
!
!-----------------------------------------------------------------------
!
subroutine blockset( nb, nbuser, n, nprow, npcol)
!
!     This subroutine try to choose an optimal block size
!     for the distributd matrix.
!
!     Written by Carlo Cavazzoni, CINECA
!
implicit none
integer*8 :: N
integer nb, nprow, npcol, nbuser

nb = min ( n/nprow, n/npcol )
if(nbuser.gt.0) then
nb = min ( nb, nbuser )
endif
nb = max(nb,1)

return
end subroutine blockset
!
!-----------------------------------------------------------------------
!
subroutine report_back ( context,                                 &
iam,nprocs,myrow,nprow,mycol,npcol)
!
! Each processor identifies itself and its place in the processor grid
!
implicit none
integer :: context
integer :: iam,nprocs,myrow,nprow,mycol,npcol
integer :: i,j,pnum,ITEM

do i=0,nprocs-1
call blacs_barrier (context, 'a')

if (iam.eq.i) then
ITEM = 3000 + 100*MYROW + 10*MYCOL
write(6,100) ITEM,iam,nprocs,myrow,mycol,myrow,nprow,mycol, &
&                   npcol

100       format(i4," PE=",i2,":",i2," Grid-coord(",i2,",",i2,        &
&               ")   PROW=",i2,":",i2,"   PCOL=",i2,":",i2)
call flush(6_4)
endif
enddo

call blacs_barrier (context, 'a')
end subroutine report_back
!
!-----------------------------------------------------------------------
!
subroutine printlocals( context,                          &
a,iam,nb,nprocs,myrow,mycol,l_nrows,l_ncols,IAorC)
!
! Prints the local array "a", processor by processor. Only use this
! for very small arrays or sub-arrays...
!
implicit none
real(kind=8)    :: a(:,:)
integer :: context
integer :: iam,nb,nprocs,myrow,mycol,l_nrows,l_ncols,IAorC
integer :: i,j,pnum,ITEM

call blacs_barrier (context, 'a')

do pnum=0,nprocs-1
if (iam .eq. pnum) then
ITEM = IAorC + 4000 + 100*MYROW + 10*MYCOL
write (6,100) ITEM,iam,myrow,mycol,nb,l_nrows,l_ncols,        &
&          l_nrows*l_ncols*8
100       format (i4," proc:",i3," grid position:",i3,",",i3,           &
&            " blksz:",i3," numroc:",i3,":",i3," Memory Allocated "&
&           ,i6," bytes")
do i=1,l_nrows
ITEM = IAorC + 4000 + 100*MYROW + 10*MYCOL + I
write (6,200) ITEM,(a(i,j),j=1,l_ncols)
200         format (I4," ",40(" ",f5.1))
call flush(6_4)
enddo
endif
call blacs_barrier (context, 'a')
enddo

call flush(6_4)
end subroutine printlocals
!
!-----------------------------------------------------------------------
!
SUBROUTINE INIT_MY_MATRIX (N,A,NPROW,NPCOL,MYROW,MYCOL,DESCA)
IMPLICIT NONE
integer*8 :: N
INTEGER      :: NH,NH1,NPROW,NPCOL,MYROW,MYCOL
INTEGER      :: DESCA(:)
REAL(kind=8) :: A(:,:)
REAL(kind=8) :: AII,AII1,AI1I,AINI1

INTEGER      :: I, J

!     Compute values for all elements of the global array, but using
!     pdelset, only set those elements which occur in local portion.

NH  = N/2
NH1 = NH + 1

!     Diagonal elements

DO I = 1,N
!       A(I,I) = 3.0
AII = 3.0
J = I
CALL PDELSET(A,I,J,DESCA,AII)
ENDDO

!     Upper diagonal

DO 180 I = 1,N-1
if (I .EQ. NH) go to 180
!       A(I,I+1) = -1.0d0
AII1 = -1.0d0
J = I+1
CALL PDELSET(A,I,J,DESCA,AII1)
180 CONTINUE

!     Lower diagonal

DO 160 I = 1,N-1
if (I .EQ. NH) go to 160
!     A(I+1,I) = -1.0d0
AI1I = -1.0d0
J = I
CALL PDELSET(A,I+1,J,DESCA,AI1I)
160 CONTINUE

!
!     ANTI-DIAGONAL
!
DO 190 I = 1,N
if (I .EQ. NH .OR. I .EQ. NH1) go to 190

!       A(I,N-I+1) = 1.0d0
AINI1 = 1.0d0
J = N-I+1
CALL PDELSET(A,I,J,DESCA,AINI1)
190 CONTINUE

return
end

!
!-----------------------------------------------------------------------
!
SUBROUTINE INIT_MY_RHS (N,C,NPROW,NPCOL,MYROW,MYCOL,DESCC)
IMPLICIT NONE
integer*8 :: N
INTEGER      :: NPROW,NPCOL,MYROW,MYCOL
INTEGER      :: DESCC(:)
REAL(kind=8) :: C(:,:)
REAL(kind=8) :: CI, CN

INTEGER :: I

DO I= 1, N-1
!       C(I) = N + 1.0
CI = N + 1.0
CALL PDELSET(C,I,1,DESCC,CI)
ENDDO

!     C(N) = 2*N + 2
CN = 2*N + 2
CALL PDELSET(C,I,1,DESCC,CN)

return
end
!
!-----------------------------------------------------------------------
!
subroutine get_solution (n,c,descc,solution)
implicit none
integer*8 :: n
integer :: descc(:)
real(kind=8) ::  c(:,:),solution(:)

integer :: i

do i= 1, n
call pdelget('A',' ',solution(i),c,i,1,descc)
enddo

return
end

!
!-----------------------------------------------------------------------
!
#ifdef PRINT_TIMES
subroutine elapsed_time(day_beg, hour_beg, minute_beg, second_beg,&
&          millisec_beg, day_end, hour_end, minute_end, second_end,&
&          millisec_end,tid, ITEM )

!date_and_time

! VALUES
!   must be of type default integer and of rank one. It is an INTENT(OUT)
!   argument. Its size must be at least eight. The values returned in VALUES
!   are as follows:

!   VALUES(1)
!   is the year (for example, 1998), or -HUGE (0) if no date is available.

!   VALUES(2)
!   is the month of the year, or -HUGE (0) if no date is available.

!   VALUES(3)
!   is the day of the month, or -HUGE (0) if no date is available.

!   VALUES(4)
!   is the time difference with respect to Coordinated Universal Time (UTC)
!   in minutes, or -HUGE (0) if this information is not available.
!
!   VALUES(5)
!   is the hour of the day, in the range 0 to 23, or -HUGE (0) if there is
!   no clock.
!
!   VALUES(6)
!   is the minutes of the hour, in the range 0 to 59, or -HUGE (0) if there
!   is no clock.
!   !
!   VALUES(7)
!   is the seconds of the minute, in the range 0 to 60, or -HUGE (0) if
!   there is no clock.
!
!   VALUES (8)
!   is the milliseconds of the second, in the range 0 to 999, or -HUGE (0)
!   if there is no clock.

implicit none
integer :: day_beg, hour_beg, minute_beg, second_beg, millisec_beg
integer :: day_end, hour_end, minute_end, second_end, millisec_end
integer :: day, hour, minute, second, millisec, tid, ITEM

if ( millisec_end .lt. millisec_beg ) then
millisec_end = millisec_end + 1000
second_end = second_end - 1
endif

if ( second_end .lt. second_beg ) then
second_end = second_end + 60
minute_end = minute_end - 1
endif

if ( minute_end .lt. minute_beg ) then
minute_end = minute_end + 60
hour_end   = hour_end - 1
endif

if ( hour_end .lt. hour_beg ) then
hour_end = hour_end + 24
day_end  = day_end - 1
endif

millisec = millisec_end - millisec_beg
second   = second_end   - second_beg
minute   = minute_end   - minute_beg
hour     = hour_end     - hour_beg
day      = day_end      - day_beg

if ( .not. ( day .eq. 0 .and. hour .eq. 0 .and. minute .eq. 0     &
&        .and. second .eq. 0 .and. millisec .eq. 0 ) ) then
ITEM = ITEM + 1
IF (tid .eq. 0 ) THEN
write(6,8000) ITEM,tid,day
8000     format(I4," IAM = ",I3," ELAPSED0 days          = ",i3)
write(6,8001) ITEM,tid,hour
8001     format(I4," IAM = ",I3," ELAPSED1 hours         = ",i3)
write(6,8002) ITEM,tid,minute
8002     format(I4," IAM = ",I3," ELAPSED2 minute        = ",i3)
write(6,8003) ITEM,tid,second
8003     format(I4," IAM = ",I3," ELAPSED3 second        = ",i3)
write(6,8004) ITEM,tid,millisec
8004     format(I4," IAM = ",I3," ELAPSED4 millisec      = ",i3)
ENDIF
endif

return

end subroutine elapsed_time
#endif

!
!-----------------------------------------------------------------------
!
#ifdef PRINT_TIMES
subroutine timestamp( date_time, day,hour,minute,second,millisec)
implicit none
character*23 :: date_time
integer      :: day, hour, minute, second, millisec
integer      :: elements(8)
character*3  :: months(12)

data months /'Jan','Feb','Mar','Apr','May','Jun',                 &
&             'Juk','Aug','Sep','Oct','Nov','Dec' /
call date_and_time ( VALUES=elements )

if ( elements(1) .ne. -HUGE(0) ) then
century: if ( elements(1) .lt. 2000 ) then
elements(1) = elements(1) - 1900
else
elements(1) = elements(1) - 2000
endif century
write(date_time,1010) elements(3),                       &
&                       months ( elements(2) ),                    &
&                                elements(1)  , elements(5),       &
&                 elements(6)  , elements(7)  , elements(8)

1010          format(i2.2,1x,a3,1x,i2.2,1x,                            &
&                i2.2,':',i2.2,':',i2.2,'.',i4.4 )

day      = elements(3)
hour     = elements(5)
minute   = elements(6)
second   = elements(7)
millisec = elements(8)

else
date_time=' '

day      = 0
hour     = 0
minute   = 0
second   = 0
millisec = 0

endif

end subroutine timestamp
#endif

!
!-----------------------------------------------------------------------
!

### Results for solutions of the STRIDWAD system for different orders of matrix A

#### LAPACK RESULTS

These are the results for serial LAPACK runs (nprocs=1). Note that on one processor the highest matrix size is 30,000.

   N         CPU Time
order of       per
matrix      processor
[ sec ]

10000      312.33
20000     1808.32
30000     5487.23
32000        N/A


#### ScaLAPACK RESULTS

    N                CPU Time                CPU Time                CPU Time                CPU Time
order of  nprocs       per         nprocs      per         nprocs      per                     per
matrix             processor                processor               processor               processor
[ sec ]                  [ sec ]                 [ sec ]                 [ sec ]

10000      64        8.17          128         5.15         256        6.31         512         .
20000                40.40                    24.37                   24.44
30000               118.24                    64.40                   59.70
40000               254.73                   135.43                  110.18
50000               462.04                   244.30                  219.82
60000               754.16                   398.18                  288.39
70000              1164.53                   608.61                  562.66
80000              1666.45                   881.23                  587.94
90000              2330.41                  1227.83                 1146.27
100000              3134.40                  1645.89                 1028.17                 3134.39
110000                                       2131.83                 1328.90
120000                                       2726.03                 1672.00
130000                                       3412.37                 2071.03
140000                                          N/A                  2558.06
150000                                                               3097.58
160000                                                               3661.49
170000                                                               4318.20
180000                                                               5092.44
190000                                                               5873.94
200000                                                               6794.57
210000                                                               7843.55
220000                                                               8798.14
230000                                                              10190.61
240000                                                              11230.16
250000                                                              12874.02
N/A


#### CONCLUSIONS

With ScaLAPACK on 256 processors matrices up to order 250,000 can be inverted, compared to order of 30,000 for LAPACK. When the matrix is distributed over many processors it takes less time and larger order matrices can be inverted.

For a list of the available ScaLAPACK routines see:

http://www.netlib.org/scalapack/double/


All ScaLAPACK routines assume that the data has been distributed on the process grid prior to the invocation of the routine. Detailed descriptions of the appropriate calling sequences for each of the ScaLAPACK routines can be found in the leading comments of the source code or the ScaLAPACK Users' Guide:

http://netlib2.cs.utk.edu/scalapack/slug/scalapack_slug.html