The performance meter reports the wall-clock time elapsed during a computation, as well as message-passing statistics. Because the performance meter is always active, you can print the accumulated statistics at any time after the computation completes. To view the current statistics, use the Parallel/Timer/Usage menu item.
Parallel → Timer → Usage
Performance statistics will be printed in the text window (console).
To reset the performance meter so that past statistics are excluded from future reports, use the Parallel/Timer/Reset menu item.
Parallel → Timer → Reset
The following example demonstrates how the current parallel statistics are displayed in the console window:
Performance Timer for 1 iterations on 4 compute nodes
  Average wall-clock time per iteration:     4.901 sec
  Global reductions per iteration:             408 ops
  Global reductions time per iteration:      0.000 sec (0.0%)
  Message count per iteration:                 801 messages
  Data transfer per iteration:               9.585 MB
  LE solves per iteration:                      12 solves
  LE wall-clock time per iteration:          2.445 sec (49.9%)
  LE global solves per iteration:               27 solves
  LE global wall-clock time per iteration:   0.246 sec (5.0%)
  AMG cycles per iteration:                     64 cycles
  Relaxation sweeps per iteration:            4160 sweeps
  Relaxation exchanges per iteration:          920 exchanges
  Total wall-clock time:                     4.901 sec
  Total CPU time:                           17.030 sec
A description of the parallel statistics is as follows:
A global reduction is a collective operation over all processes in a job that reduces a vector quantity (whose length equals the number of processes or nodes) to a scalar quantity (for example, by taking the sum or the maximum of a particular quantity). The number of global reductions cannot be derived from other readily known quantities; it generally depends on the algorithm being used and the problem being solved.
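To illustrate the idea, here is a minimal sketch of a global reduction in plain Python. It is not how the solver implements the operation (a real code would use an MPI collective such as MPI_Allreduce); the list of per-node values and the residual numbers are assumptions for the example.

```python
# Sketch of a global reduction: each compute node contributes one partial
# value, and the collective operation reduces the vector of partials
# (length = number of nodes) to a single scalar.

def global_reduce(partials, op):
    """Reduce one value per node to a single scalar (e.g., sum or max)."""
    result = partials[0]
    for value in partials[1:]:
        result = op(result, value)
    return result

# Hypothetical example: 4 nodes each report a local residual, and the
# solver needs the global maximum to test convergence.
local_residuals = [1.2e-4, 3.4e-4, 0.9e-4, 2.1e-4]
print(global_reduce(local_residuals, max))                  # global max
print(global_reduce(local_residuals, lambda a, b: a + b))   # global sum
```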
A message is defined as a single point-to-point, send-and-receive operation between any two processes. (This excludes global, collective operations such as global reductions.) In terms of domain decomposition, a message is passed from the process governing one subdomain to a process governing another (usually adjacent) subdomain.
The message count per iteration generally depends on the algorithm being used and the problem being solved. The reported message count is the total across all processes.
The message count provides some insight into the impact of communication latency on parallel performance: the higher the message count, the more parallel performance will suffer on a high-latency interconnect. Because Ethernet has higher latency than Myrinet or InfiniBand, a high message count degrades performance more severely on Ethernet than on InfiniBand.
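A back-of-the-envelope estimate makes the latency effect concrete. The sketch below uses the message count from the sample report above; the latency figures are assumed ballpark values (roughly 50 microseconds for Gigabit Ethernet, roughly 2 microseconds for InfiniBand), not measurements for any particular hardware.

```python
# Rough lower-bound estimate of per-iteration latency overhead, assuming
# the messages are sent one after another (real solvers overlap much of
# this communication with computation).

MESSAGES_PER_ITERATION = 801    # from the sample report above

def latency_overhead(messages, latency_s):
    """Total latency cost in seconds for the given per-message latency."""
    return messages * latency_s

eth = latency_overhead(MESSAGES_PER_ITERATION, 50e-6)   # assumed Ethernet
ib = latency_overhead(MESSAGES_PER_ITERATION, 2e-6)     # assumed InfiniBand
print(f"Ethernet:   {eth * 1e3:.2f} ms per iteration")
print(f"InfiniBand: {ib * 1e3:.3f} ms per iteration")
```

Even under these crude assumptions, the same message count costs roughly 25 times more on the higher-latency interconnect.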
Data transfer per iteration is usually dependent on the algorithm being used and the problem being solved. This number generally increases with increases in problem size, number of partitions, and physics complexity.
The data transfer per iteration can provide some insight into the impact of communication bandwidth on parallel performance. The precise impact is often difficult to quantify because it depends on many factors, including the ratio of data transfer to computation and the ratio of communication bandwidth to CPU speed. Data transfer is measured in bytes.
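As with latency, a simple division gives a feel for the bandwidth cost. The sketch below uses the data volume from the sample report above; the sustained-bandwidth figures (roughly 100 MB/s for Gigabit Ethernet, roughly 3000 MB/s for InfiniBand) are assumptions for illustration, and the result is a lower bound that ignores overlap with computation, message sizes, and contention.

```python
# Illustrative estimate of time spent moving data per iteration for an
# assumed sustained interconnect bandwidth.

DATA_PER_ITERATION_MB = 9.585   # from the sample report above

def transfer_time_s(data_mb, bandwidth_mb_per_s):
    """Time in seconds to move data_mb at the given sustained bandwidth."""
    return data_mb / bandwidth_mb_per_s

print(f"Ethernet:   {transfer_time_s(DATA_PER_ITERATION_MB, 100):.4f} s")
print(f"InfiniBand: {transfer_time_s(DATA_PER_ITERATION_MB, 3000):.5f} s")
```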
The most relevant quantity is the Total wall-clock time. You can gauge parallel performance (speedup and efficiency) by comparing this quantity to the corresponding value from a serial analysis (include -t1 on the command line to obtain statistics from a serial analysis). In lieu of a serial analysis, the ratio of Total CPU time to Total wall-clock time gives an approximation of the parallel speedup.
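Applying that approximation to the sample report above can be sketched as follows; the node count and times are taken from the report, and this is the approximate speedup, not the true speedup measured against a serial run.

```python
# Approximate parallel speedup and efficiency from the sample report:
# speedup ~ Total CPU time / Total wall-clock time, and efficiency is
# speedup divided by the number of compute nodes.

NODES = 4
TOTAL_WALL_CLOCK_S = 4.901   # Total wall-clock time from the report
TOTAL_CPU_S = 17.030         # Total CPU time from the report

approx_speedup = TOTAL_CPU_S / TOTAL_WALL_CLOCK_S
approx_efficiency = approx_speedup / NODES

print(f"Approximate speedup:    {approx_speedup:.2f}x on {NODES} nodes")
print(f"Approximate efficiency: {approx_efficiency:.1%}")
```

For this run the approximation works out to a speedup of about 3.5 on 4 nodes, i.e., a parallel efficiency of roughly 87%.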