
The Architecture

Xeon Phi is the first product based on the Intel Many Integrated Core (MIC) architecture, which combines many Intel CPU cores on a single chip. The Xeon Phi coprocessor can be viewed as an on-chip compute node: the 5110P has 60 small, relatively weak cores attached to GDDR5 memory through a 512-bit interface. Each core is a dual-issue, in-order x86 core with 64KB of L1 cache (32KB data + 32KB instruction) and 512KB of L2 cache. There are four hardware thread contexts per core, so the card manages four times as many threads as it has cores. Each core has an x86/x87 unit, two ALUs and a 512-bit SIMD unit (16 single-precision or 8 double-precision lanes) connected to two separate pipelines. When developing programs for Phi, always keep in mind the weak scalar x86 ALUs and the much more capable vector processing unit. Refer here for more details of the Phi hardware architecture[1].

[Figure: Mic-core-arc.jpg (architecture of a single MIC core)]

Working Modes

The most common execution models can be broadly categorized as follows:

Offload Execution Mode

Also known as heterogeneous programming mode: the host system offloads part or all of the computation from one or more processes or threads running on the host. The application starts executing on the host; as the computation proceeds, it can send data to the coprocessor and let the coprocessor work on it, and the host and the coprocessor may or may not work in parallel.
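
As a minimal illustration of the offload model (a sketch, not this page's convolution example; the buffer name and size are arbitrary), the host below allocates a buffer, ships it to the coprocessor for a simple computation, and copies the result back:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 1000;
    int i;
    float *a = (float *)malloc(n * sizeof(float));

    /* Host-side initialization. */
    for (i = 0; i < n; i++)
        a[i] = (float)i;

    /* Copy 'a' to the first coprocessor, square it there, copy it back.
       The host blocks here until the offloaded region completes. */
    #pragma offload target(mic:0) inout(a : length(n))
    {
        for (i = 0; i < n; i++)
            a[i] *= a[i];
    }

    printf("a[%d] = %f\n", n - 1, a[n - 1]);
    free(a);
    return 0;
}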

Coprocessor Native Execution Mode

The Phi hosts a Linux micro OS and appears to the host as another machine, much like another node in a cluster. This execution environment allows users to treat the coprocessor as a separate compute node. In order to run natively, an application has to be cross-compiled for the Phi operating environment.
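
For illustration, even a trivial program like the one below can be cross-compiled with -mmic (plus -openmp here, since it uses the OpenMP runtime) and executed directly on the card; the actual compile and run commands used on this page's system appear in the sections that follow. This is only a sketch to show the idea:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* On a 60-core card with 4 hardware threads per core this typically
       reports around 240 logical processors when run natively on the Phi. */
    printf("OpenMP sees %d logical processors\n", omp_get_num_procs());
    return 0;
}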

Symmetric Execution

In this case application processes run on both the host and the Phi coprocessor, usually communicating through a message-passing interface such as MPI. This execution environment treats the Phi card as just another node in a heterogeneous cluster.

Preparation work on Goblin node 49

Please refer here first for login issues[2]. The gb49 node does not have the SHARCNET module system, so you should set up the environment manually by simply running:

[feimao@gb49 feimao]$ source /opt/sharcnet/intel/15.0.1/bin/compilervars.sh intel64
[feimao@gb49 feimao]$ source /opt/sharcnet/intel/15.0.1/mkl/bin/mklvars.sh intel64
[feimao@gb49 feimao]$ export INTEL_LICENSE_FILE=/opt/sharcnet/intel/15.0.1/license/intel.lic

OpenMP on Phi

Here is a simple 2D convolution program written in C with OpenMP. The outer loop is parallelized with OpenMP, so each thread processes one row of the image. The two innermost loops are unrolled automatically by the -O3 flag. The filter is 7x7, and the rim of the image is ignored. The source files can be found in /global/c/work/feimao/openmp2dcov/

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include "bmpfuncs.h"   /* declares readImage()/storeImage() from bmpfuncs.c (assumed header name) */
 
int main(int argc, const char * argv[])
{
    int imageHeight;
    int imageWidth;
 
    float filter[49] =
    {0,      0,      0,      0,      0, 0.0145,      1,
        0,      0,      0,      0, 0.0376, 0.1283, 0.0145,
        0,      0,      0, 0.0376, 0.1283, 0.0376,      0,
        0,      0, 0.0376, 0.1283, 0.0376,      0,      0,
        0, 0.0376, 0.1283, 0.0376,      0,      0,      0,
        0.0145, 0.1283, 0.0376,      0,      0,      0,      0,
        0, 0.0145,      0,      0,      0,      0,      0};
 
    int filterWidth = 7;
 
    const char* inputFile = "input.bmp";
    const char* outputFile = "output.bmp";
 
    float* inputImage;
    inputImage= readImage(inputFile, &imageWidth,&imageHeight);
    int dataSize = imageHeight*imageWidth*sizeof(float);
    float* outputImage = NULL;
    outputImage = (float*)malloc(dataSize);
    printf("Load image done...\n");
 
    double start,end;
 
    int yOut,xOut,r,c;
    int halffilter = filterWidth/2;
    int imgh = imageHeight - halffilter; 
    int imgw = imageWidth - halffilter;
 
    start = omp_get_wtime();
 
 
/* xOut, r and c are declared outside the loop, so they must be made private */
#pragma omp parallel for private(xOut,r,c)
    for (yOut = halffilter; yOut < imgh; yOut++)
    {
        for (xOut = halffilter; xOut < imgw; xOut++)
        {
 
            float sum = 0;
            for (r = -3; r < 4; r++)
            {
                for (c = -3; c < 4; c++)
                {
                    sum += inputImage[(yOut+r)*imageWidth+xOut+c] * filter[(r+halffilter)*filterWidth+c+halffilter];
                }
            }
 
            outputImage[yOut * imageWidth + xOut] = sum;
        } 
    }
    printf("Computation done...\n");
    end = omp_get_wtime();
    printf("TIME = %.0f ms\n",(end-start)*1000);
 
    storeImage(outputImage, outputFile, imageHeight,imageWidth, inputFile);
    printf("Done...\n");
    free(inputImage);
    free(outputImage);
    return 0;
}

Native Mode

The code can be compiled and run in Phi's native mode without changing any code; only the "-mmic" flag has to be added when compiling. The flags "-openmp_report" and "-vec_report" are used to check whether the program is fully parallelized and vectorized by the compiler.

[feimao@gb49 openmp2dcov]$ icc -mmic -O3 -o bmpfuncs.mico -c bmpfuncs.c
[feimao@gb49 openmp2dcov]$ icc -mmic -O3 -openmp -openmp_report=2 -vec_report=2 -o omp2dcov.mico -c omp2dcov.c 
icc: command line remark #10010: option '-vec_report=2' is deprecated and will be removed in a future release. See '-help deprecated'
icc: remark #10397: optimization reports are generated in *.optrpt files in the output location
[feimao@gb49 openmp2dcov]$ icc -mmic -openmp -o omp2dcov.micx omp2dcov.mico bmpfuncs.mico 

In the omp2dcov.optrpt, we should see that the inner loop is not auto-vectorized by the compiler:

OpenMP Construct at omp2dcov.c(58,1)
   remark #16200: OpenMP DEFINED LOOP WAS PARALLELIZED
LOOP BEGIN at omp2dcov.c(63,9)
   <Distributed chunk1>
      remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
      remark #15346: vector dependence: assumed OUTPUT dependence between  line 70 and  line 70

Then you have to log in to the Phi's subsystem to execute the natively compiled program.

[feimao@gb49 openmp2dcov]$ ssh mic0

The most important caveat is that our image reading and storing functions are serial code. They run painfully slowly on Phi because of its weak single-core performance. It is better to do the I/O on the CPU and offload the compute work to Phi.

Offload Mode

To use offload mode, we add offload directives to the code and then compile it as host code (without -mmic). You can refer here for more information:[3]

#pragma offload target(mic:0) in(inputImage:length(imageHeight*imageWidth)) out(outputImage:length(imageHeight*imageWidth))
{
#pragma omp parallel for private(xOut,r,c)
    for (yOut = halffilter; yOut < imgh; yOut++)
    {
        for (xOut = halffilter; xOut < imgw; xOut++)
        {
            float sum = 0;
            for (r = -3; r < 4; r++)
            {
                for (c = -3; c < 4; c++)
                {
                    sum += inputImage[(yOut+r)*imageWidth + xOut+c] * filter[(r+halffilter)*filterWidth +c+halffilter];
                }
            }
 
            outputImage[yOut * imageWidth + xOut] = sum;
        }
    }
}

Processor affinity may have a significant impact on the performance of your algorithm, so understanding the available affinity types and the behaviour of your algorithm can help you adapt the affinity to the target architecture and maximize performance. You can refer here for more information:[4] OpenMP loop scheduling is another factor that affects performance; please refer here:[5]

In our case the affinity is set to balanced and the loop scheduling is static, so that image rows with adjacent indices are placed on the same core for better cache reuse. It is worth mentioning that "Under Intel MPSS many of the kernel services and daemons are affinitized to the “Bootstrap Processor” (BSP), which is the last physical core. This is also where the offload daemon runs the services required to support data transfer for offload. It is therefore generally sensible to avoid using this core for user code. (Indeed, as already discussed, the offload system does that automatically by removing the logical CPUs on the last core from the default affinity of offloaded processes). Offloaded programs inherit an affinity map that hides the last core, which is dedicated to offload system functions. Native programs can use all the cores, making the calculations required for balancing the threads slightly different." So we avoid the last core and use only 59 cores by setting "KMP_PLACE_THREADS=59c".
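
For reference, the static schedule can also be stated explicitly on the loop. The function below is an illustrative variant of the convolution loop from above (the function name and signature are ours, not part of the original source); its behaviour is unchanged when static is already the default, but spelling it out documents the row-to-thread mapping described above:

void convolve_rows(const float *inputImage, float *outputImage,
                   const float *filter, int imageWidth, int filterWidth,
                   int halffilter, int imgh, int imgw)
{
    int yOut, xOut, r, c;
    /* schedule(static): each thread gets a contiguous block of rows, so
       rows with adjacent indices tend to stay on the same core (together
       with KMP_AFFINITY=balanced) for better cache reuse. */
    #pragma omp parallel for schedule(static) private(xOut, r, c)
    for (yOut = halffilter; yOut < imgh; yOut++)
    {
        for (xOut = halffilter; xOut < imgw; xOut++)
        {
            float sum = 0;
            for (r = -halffilter; r <= halffilter; r++)
                for (c = -halffilter; c <= halffilter; c++)
                    sum += inputImage[(yOut + r) * imageWidth + xOut + c]
                         * filter[(r + halffilter) * filterWidth + c + halffilter];
            outputImage[yOut * imageWidth + xOut] = sum;
        }
    }
}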

Tuning for SIMD

We can force the compiler to vectorize the inner loop explicitly by using "#pragma simd":

#pragma omp parallel for private(xOut,r,c)
    for (yOut = halffilter; yOut < imgh; yOut++)
    {
        #pragma simd
        for (xOut = halffilter; xOut < imgw; xOut++)
        {
 
            float sum = 0;
            for (r = -3; r < 4; r++)
            {
                for (c = -3; c < 4; c++)
                {
                    sum += inputImage[(yOut+r)*imageWidth + xOut+c] * filter[(r+halffilter)*filterWidth +c+halffilter];
                }
            }
 
            outputImage[yOut * imageWidth + xOut] = sum;
        }
    }

Performance

The test image has 15552 × 7776 pixels in greyscale. The Phi performance is measured in native mode. The CPU version is compiled with -xAVX instead of -mmic.

Hardware                                  Time (ms)   SIMD (ms)
2 x E5-2630 (12 cores, AVX), 12 threads   433         109
Phi, 59 threads                           615         135
Phi, 118 threads                          466         172
Phi, 236 threads                          507         264

MKL on Phi

There are three MKL usage modes available for the Xeon Phi:

Automatic Offload

The MKL runtime may automatically offload some or all of the work to Phi. In this mode, the user does not have to change the code at all; the data transfer is managed completely by the MKL runtime. In MKL 11.1, the following Level-3 BLAS functions and LAPACK functions are AO-enabled:

  •  ?GEMM, ?SYMM, ?TRMM, and ?TRSM
  • LU, QR, Cholesky factorizations

Refer here for AO-enabled functions and restrictions:[6] Automatic Offload can be enabled with a single call to a support function, or by setting an environment variable. Compiler pragmas are not needed, and users can compile and link the code the usual way. The following shows how to enable AO in FORTRAN or C code:

rc = mkl_mic_enable( )

Alternatively, it can be enabled using an environment variable.

MKL_MIC_ENABLE=1

Refer here for more information[7].
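
For example, a large SGEMM like the one sketched below becomes a candidate for Automatic Offload once mkl_mic_enable() has been called (illustrative code, compiled the usual way with icc -O3 -mkl; AO only engages when the problem is large enough to be worth the transfer, and otherwise the call simply runs on the host):

#include <stdio.h>
#include <stdlib.h>
#include <mkl.h>

int main(void)
{
    /* AO typically only triggers for large matrices; a 4096x4096 SGEMM
       is comfortably in that range. */
    const MKL_INT n = 4096;
    float *A = (float *)malloc((size_t)n * n * sizeof(float));
    float *B = (float *)malloc((size_t)n * n * sizeof(float));
    float *C = (float *)malloc((size_t)n * n * sizeof(float));
    MKL_INT i;

    for (i = 0; i < n * n; i++) { A[i] = 1.0f; B[i] = 2.0f; C[i] = 0.0f; }

    /* Enable Automatic Offload from code; equivalent to exporting
       MKL_MIC_ENABLE=1 before running. */
    if (mkl_mic_enable() != 0)
        printf("Automatic Offload not available, running on the host only\n");

    /* C = 1.0*A*B + 0.0*C; the MKL runtime decides how much of the
       work (if any) is sent to the coprocessor. */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0f, A, n, B, n, 0.0f, C, n);

    printf("C[0] = %f\n", C[0]);
    free(A); free(B); free(C);
    return 0;
}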

Compiler Assisted Offload

MKL functions are offloaded in the same way as OpenMP offloaded functions. An example of offloading MKL's sgemm routine looks as follows:

#pragma offload target(mic) in(transa, transb, N, alpha, beta) in(A:length(N*N)) \
                                            in(B:length(N*N)) in(C:length(N*N)) \
                                            out(C:length(N*N) alloc_if(0)) 
     sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);

To build a program for compiler assisted offload, the following command is recommended by Intel:

icc -O3 -openmp -mkl \
-offload-option,mic,ld, "-L$MKLROOT/lib/mic -Wl,\
-start-group -lmkl_intel_lp64 -lmkl_intel_thread \
-lmkl_core -Wl,-end-group" file.c -o file

Native Mode

To build a program for native mode, the following compiler settings should be used:

icc -O3 -mkl -mmic file.c -o file

The binary must then be copied to the coprocessor manually (e.g. with scp) and started directly on the coprocessor.

For the executable just created to work correctly, it must find the right libraries at runtime. To ensure that, please execute on the coprocessor:

export LD_LIBRARY_PATH=/global/c/work/feimao/intel15/mkl/lib/mic/:$LD_LIBRARY_PATH

Without modifying the LD_LIBRARY_PATH in this way, the program will find incompatible libraries and produce an error.

For more compiler options, please refer to the MKL Link Line Advisor[8].

OpenCL on Phi

In the OpenMP case above, each thread processes one row of the image. Since each OpenCL work-group is assigned to a single thread, we can easily rewrite the OpenMP code in OpenCL. The OpenCL compiler automatically vectorizes across the work-items in a work-group. Here we use a one-dimensional work-group with hundreds of work-items (e.g. 1024), so that each thread processes one row of the image.

The OpenCL compiler may not completely unroll the loop, so here we unroll it manually:

//The NDRange setting is:
size_t localSize[2] = {1024, 1};
size_t globalSize[2] = {1024, imageHeight-6}; //ignore the rim of 3 pixels on each side
 
__kernel
void convolution(__global float* inputImage, __global float* outputImage, __constant float* filter, int  imageHeight, int  imageWidth, int  filterWidth)
{
    int localsize = get_local_size(0);
    int halffilter = filterWidth/2;
    int yOut = get_group_id(1)+3;
    int xOut = get_local_id(0)+3;
 
    while(xOut<imageWidth-3)
    {
        float sum = 0;
        int filterIdx = 0;
        int offset =(yOut-3)*imageWidth + xOut-3;
        sum += inputImage[offset+0] * filter[filterIdx++];
        sum += inputImage[offset+1] * filter[filterIdx++];
        sum += inputImage[offset+2] * filter[filterIdx++];
        sum += inputImage[offset+3] * filter[filterIdx++];
        sum += inputImage[offset+4] * filter[filterIdx++];
        sum += inputImage[offset+5] * filter[filterIdx++];
        sum += inputImage[offset+6] * filter[filterIdx++];
 
        offset =(yOut-2)*imageWidth + xOut-3;
        sum += inputImage[offset+0] * filter[filterIdx++];
        sum += inputImage[offset+1] * filter[filterIdx++];
        sum += inputImage[offset+2] * filter[filterIdx++];
        sum += inputImage[offset+3] * filter[filterIdx++];
        sum += inputImage[offset+4] * filter[filterIdx++];
        sum += inputImage[offset+5] * filter[filterIdx++];
        sum += inputImage[offset+6] * filter[filterIdx++];
 
        offset =(yOut-1)*imageWidth + xOut-3;
        sum += inputImage[offset+0] * filter[filterIdx++];
        sum += inputImage[offset+1] * filter[filterIdx++];
        sum += inputImage[offset+2] * filter[filterIdx++];
        sum += inputImage[offset+3] * filter[filterIdx++];
        sum += inputImage[offset+4] * filter[filterIdx++];
        sum += inputImage[offset+5] * filter[filterIdx++];
        sum += inputImage[offset+6] * filter[filterIdx++];
 
        offset =(yOut-0)*imageWidth + xOut-3;
        sum += inputImage[offset+0] * filter[filterIdx++];
        sum += inputImage[offset+1] * filter[filterIdx++];
        sum += inputImage[offset+2] * filter[filterIdx++];
        sum += inputImage[offset+3] * filter[filterIdx++];
        sum += inputImage[offset+4] * filter[filterIdx++];
        sum += inputImage[offset+5] * filter[filterIdx++];
        sum += inputImage[offset+6] * filter[filterIdx++];
 
        offset =(yOut+1)*imageWidth + xOut-3;
        sum += inputImage[offset+0] * filter[filterIdx++];
        sum += inputImage[offset+1] * filter[filterIdx++];
        sum += inputImage[offset+2] * filter[filterIdx++];
        sum += inputImage[offset+3] * filter[filterIdx++];
        sum += inputImage[offset+4] * filter[filterIdx++];
        sum += inputImage[offset+5] * filter[filterIdx++];
        sum += inputImage[offset+6] * filter[filterIdx++];
 
        offset =(yOut+2)*imageWidth + xOut-3;
        sum += inputImage[offset+0] * filter[filterIdx++];
        sum += inputImage[offset+1] * filter[filterIdx++];
        sum += inputImage[offset+2] * filter[filterIdx++];
        sum += inputImage[offset+3] * filter[filterIdx++];
        sum += inputImage[offset+4] * filter[filterIdx++];
        sum += inputImage[offset+5] * filter[filterIdx++];
        sum += inputImage[offset+6] * filter[filterIdx++];
 
        offset =(yOut+3)*imageWidth + xOut-3;
        sum += inputImage[offset+0] * filter[filterIdx++];
        sum += inputImage[offset+1] * filter[filterIdx++];
        sum += inputImage[offset+2] * filter[filterIdx++];
        sum += inputImage[offset+3] * filter[filterIdx++];
        sum += inputImage[offset+4] * filter[filterIdx++];
        sum += inputImage[offset+5] * filter[filterIdx++];
        sum += inputImage[offset+6] * filter[filterIdx++];
 
        outputImage[yOut * imageWidth + xOut] = sum;
 
        xOut+=localsize;
    }
}

Work-items in the same work-group share the memory cache. For better memory reuse, a 2D work-group is better suited to a 2D image. Here we set up a 32x32 work-group and have each work-item process one pixel. Because the global size is larger than the image, we have to handle the boundary in the kernel.

size_t localSize[2] = {32, 32};
totalWorkItemsX = roundUp(imageWidth-3, localSize[0]);
totalWorkItemsY = roundUp(imageHeight-3, localSize[1]);
size_t globalSize[2] = {totalWorkItemsX, totalWorkItemsY};
kernel
void convolution(const __global float* inputImage, __global float* outputImage, __constant float* filter, int  imageHeight, int  imageWidth, int  filterWidth)
{
 
    int yOut = get_global_id(1)+3;
    int xOut = get_global_id(0)+3;
 
    if(xOut<imageWidth-3 && yOut<imageHeight-3)
    {
        float sum = 0;
        int filterIdx = 0;
        int offset =(yOut-3)*imageWidth + xOut-3;
        sum += inputImage[offset+0] * filter[filterIdx++];
 
        ... SAME AS ABOVE
 
        outputImage[yOut * imageWidth + xOut] = sum;
    }
}

Performance

For GPU development, local memory (shared memory in CUDA) is used for better memory reuse. On Phi, however, there is no need to use local memory and barriers, because local memory is allocated in regular memory and all memory objects are cached.

The test image is the same as in the OpenMP code. Only kernel execution time is measured:

Kernel                                  Time (ms)
Phi, 1D, 1024 work-items                202
Phi, 2D, 32x32 work-items               70
Phi, 2D, 32x32 work-items (local mem)   82
K20, 32x32 work-items                   101
K20, 32x32 work-items (local mem)       38

Pthread on Phi

Pthreads can be executed on Phi in native mode. There is no need to change any code; only the "-mmic" flag has to be added when compiling.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <pthread.h>
 
#define MAX_THREAD 10000
 
typedef struct {
	int id;
} parm;
 
void *hello(void *arg)
{
	parm *p=(parm *)arg;
	printf("Hello from thread %d\n", p->id);
	return (NULL);
}
 
int main(int argc, char* argv[]) {
	int n,i;
	pthread_t *threads;
	pthread_attr_t pthread_custom_attr;
	parm *p;
 
	if (argc != 2)
	{
		printf ("Usage: %s n\n  where n is no. of threads\n",argv[0]);
		exit(1);
	}
 
	n=atoi(argv[1]);
 
	if ((n < 1) || (n > MAX_THREAD))
	{
		printf ("The no of thread should between 1 and %d.\n",MAX_THREAD);
		exit(1);
	}
 
	threads=(pthread_t *)malloc(n*sizeof(*threads));
	pthread_attr_init(&pthread_custom_attr);
 
	p=(parm *)malloc(sizeof(parm)*n);
	/* Start up thread */
 
	for (i=0; i<n; i++)
	{
		p[i].id=i;
		pthread_create(&threads[i], &pthread_custom_attr, hello, (void *)(p+i));
	}
 
	/* Synchronize the completion of each thread. */
 
	for (i=0; i<n; i++)
	{
		pthread_join(threads[i],NULL);
	}
	free(threads);
	free(p);
	return 0;
}

The code should be compiled on gb49 and then run on mic0/mic1.

[feimao@gb49 pthread_phi]$ icc -mmic -o pthread.mico -c pthread.c 
[feimao@gb49 pthread_phi]$ icc -mmic -o pthread.micx pthread.mico -pthread
[feimao@gb49 pthread_phi]$ ssh mic1
feimao@gb49-mic1:~/pthread_phi$ ./pthread.micx 60

MPI on Phi

Here is a simple MPI C program from Intel[9]. The time measurement has been modified to report the wall time across all processes (barriers are placed before and after the compute loop).

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <time.h>
#include <math.h>
 
#include "mpi.h"
 
#define MASTER 0
#define TAG_HELLO 4
#define TAG_TEST 5
#define TAG_TIME 6
 
int main(int argc, char *argv[])
{
  int i, id, remote_id, num_procs;
 
  MPI_Status stat;
  int namelen;
  char name[MPI_MAX_PROCESSOR_NAME];
 
  // Start MPI.
  if (MPI_Init (&argc, &argv) != MPI_SUCCESS)
    {
      printf ("Failed to initialize MPIn");
      return (-1);
    }
  // Create the communicator, and retrieve the number of processes.
  MPI_Comm_size (MPI_COMM_WORLD, &num_procs);
 
  // Determine the rank of the process.
  MPI_Comm_rank (MPI_COMM_WORLD, &id);
 
  // Get machine name
  MPI_Get_processor_name (name, &namelen);
 
  if (id == MASTER)
    {
      printf ("Hello world: rank %d of %d running on %s\n", id, num_procs, name);
 
      for (i = 1; i<num_procs; i++) 
	{	
	  MPI_Recv (&remote_id, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);	
	  MPI_Recv (&num_procs, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);  		
	  MPI_Recv (&namelen, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);			
	  MPI_Recv (name, namelen+1, MPI_CHAR, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
 
	  printf ("Hello world: rank %d of %d running on %s\n", remote_id, num_procs, name);
	}
}
  else   
    {	    
      MPI_Send (&id, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (&num_procs, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (&namelen, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (name, namelen+1, MPI_CHAR, MASTER, TAG_HELLO, MPI_COMM_WORLD);
}
 
   // Rank 0 distributes seek randomly to all processes.
  double startprocess, endprocess;
 
  int distributed_seed = 0;
  int *buff;
 
  buff = (int *)malloc(num_procs * sizeof(int));
 
  unsigned int MAX_NUM_POINTS = pow (2,32) - 1;
  unsigned int num_local_points = MAX_NUM_POINTS / num_procs;
 
  if (id == MASTER)
    {		  
      srand (time(NULL));
 
      for (i=0; i<num_procs; i++)    
	{           
	  distributed_seed = rand();
	  buff[i] = distributed_seed;
	}
    }
  // Broadcast the seed to all processes
  MPI_Bcast(buff, num_procs, MPI_INT, MASTER, MPI_COMM_WORLD);
 
  // At this point, every process (including rank 0) has a different seed. Using their seed,
  // each process generates N points randomly in the interval [1/n, 1, 1]
  MPI_Barrier(MPI_COMM_WORLD);//barrier before all process
  startprocess = MPI_Wtime();
 
  srand (buff[id]);
 
  unsigned int point = 0;
  unsigned int rand_MAX = 128000;
  float p_x, p_y, p_z;
  float temp, temp2, pi;
  double result;
  unsigned int inside = 0, total_inside = 0;
 
  for (point=0; point<num_local_points; point++)
    {
      temp = (rand() % (rand_MAX+1));
      p_x = temp / rand_MAX;
      p_x = p_x / num_procs;
 
      temp2 = (float)id / num_procs;	// id belongs to 0, num_procs-1
      p_x += temp2;
 
      temp = (rand() % (rand_MAX+1));
      p_y = temp / rand_MAX;
 
      temp = (rand() % (rand_MAX+1));
      p_z = temp / rand_MAX;
 
      // Compute the number of points residing inside of the 1/8 of the sphere
      result = p_x * p_x + p_y * p_y + p_z * p_z;
 
      if (result <= 1)
	  {
		inside++;
	  }
    }
  MPI_Barrier(MPI_COMM_WORLD);//barrier after all process
  double elapsed = MPI_Wtime() - startprocess;
 
  MPI_Reduce (&inside, &total_inside, 1, MPI_UNSIGNED, MPI_SUM, MASTER, MPI_COMM_WORLD);
 
 
#if DEBUG 
  printf ("rank %d counts %u points inside the spheren", id, inside);
#endif
  if (id == MASTER)
    {
      double timeprocess[num_procs];
 
      timeprocess[MASTER] = elapsed;
      printf("Elapsed time from rank %d: %10.2f (sec) \n", MASTER, timeprocess[MASTER]);
 
      temp = 6 * (float)total_inside;
      pi = temp / MAX_NUM_POINTS;   
      printf ( "Out of %u points, there are %u points inside the sphere => pi=%16.12f\n", MAX_NUM_POINTS, total_inside, pi);
    }
 
  free(buff);
 
  // Terminate MPI.
  MPI_Finalize();
 
  return 0;
}

Phis on SHARCNET

SHARCNET does not provide the Intel MPI compiler, so host and mic programs have to be compiled separately with the corresponding MPICH compilers. To run the MPI program, the Intel MPI runtime included with Intel Compiler 15 should be used. Before compiling, export the ICC 15 paths as explained above. To build the executable for the host CPU, use the mpicc in /global/c/work/feimao/mpich-3.1.3-host/bin/ with the flag -I/global/c/work/feimao/mpich-3.1.3-host/include/ and link with the flag -L/global/c/work/feimao/mpich-3.1.3-host/lib

[feimao@gb49 MC_PI_MPI]$ /global/c/work/feimao/mpich-3.1.3-host/bin/mpicc -O3 -xHost montecarlo.c -o montecarlo -I/global/c/work/feimao/mpich-3.1.3-host/include -L/global/c/work/feimao/mpich-3.1.3-host/lib

The mic program should be compiled with the MIC build of MPICH in /global/c/work/feimao/mpich-3.1.3-phi/

[feimao@gb49 MC_PI_MPI]$ /global/c/work/feimao/mpich-3.1.3-phi/bin/mpicc -O3 -mmic montecarlo.c -o montecarlo.MIC -I/global/c/work/feimao/mpich-3.1.3-phi/include -L/global/c/work/feimao/mpich-3.1.3-phi/lib

To execute the MPI program, you should create a machinefile indicating how many processes you want to run on each device. For example, to run 120 processes on the CPU and on each Phi:

gb49:120
mic0:120
mic1:120

Before launching the program, we should set:

export I_MPI_MIC=1
export I_MPI_MIC_PROXY_PATH=/global/c/work/feimao/intel15/mpirt/bin/mic
export I_MPI_FABRICS=shm:tcp
export I_MPI_MIC_POSTFIX=.MIC
export I_MPI_PIN_PROCESSOR_LIST=all:map=spread

then use the mpirun shipped with ICC 15 by adding it to the PATH:

export PATH=/global/c/work/feimao/intel15/mpirt/bin/intel64/:$PATH

Then run

mpirun -n 360 -machine machinefile ./montecarlo

Tuning for SIMD

The code below is executed by every MPI process. It will not be auto-vectorized, because the compiler cannot prove the loop free of dependences.

for (point=0; point<num_local_points; point++)
{
      temp = (rand() % (rand_MAX+1));
      p_x = temp / rand_MAX;
      p_x = p_x / num_procs;
 
      temp2 = (float)id / num_procs;	// id belongs to 0, num_procs-1
      p_x += temp2;
 
      temp = (rand() % (rand_MAX+1));
      p_y = temp / rand_MAX;
 
      temp = (rand() % (rand_MAX+1));
      p_z = temp / rand_MAX;
 
      // Compute the number of points residing inside of the 1/8 of the sphere
      result = p_x * p_x + p_y * p_y + p_z * p_z;
 
      if (result <= 1)
	  {
		inside++;
	  }
 }

We can simply use "#pragma simd" to force the compiler to vectorize the loop, specifying inside as a reduction variable. The calls to rand() will not be vectorized, since they have to execute serially to generate different random numbers, so the speedup is limited.

#pragma simd vectorlength(16) private(temp,temp2,p_x,p_y,p_z,result) reduction(+:inside)
  for (point=0; point<num_local_points; point++)
    {
      temp = (rand() % (rand_MAX+1));
      p_x = temp / rand_MAX;
      p_x = p_x / num_procs;
 
      temp2 = (float)id / num_procs;    // id belongs to 0, num_procs-1
      p_x += temp2;
 
      temp = (rand() % (rand_MAX+1));
      p_y = temp / rand_MAX;
 
      temp = (rand() % (rand_MAX+1));
      p_z = temp / rand_MAX;
 
      // Compute the number of points residing inside of the 1/8 of the sphere
      result = p_x * p_x + p_y * p_y + p_z * p_z;
      if (result <= 1)
        {
 
                inside++;
        }
    }

In the tests, we manually set the number of processes running on each device. In the non-SIMD tests the CPU performance is roughly 1.8x that of a Phi, so while each Phi runs 120 processes, the CPU runs 216 processes to balance the load. In the SIMD tests, both the CPU and each Phi run 120 processes, since they then have nearly the same performance.

Hardware                      Time (s)   SIMD time (s)
2 x E5-2630 (12 cores, AVX)   17.06      19.21
1 Phi                         30.67      18.14
1 CPU + 2 x Phi               8.53       6.72
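
For the combined non-SIMD run in the last row, the machinefile would look something like the following (a sketch in the same format as before, using the 216/120/120 split described above, for 456 processes in total):

gb49:216
mic0:120
mic1:120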

The vectorization report for Phi is listed below:

LOOP BEGIN at montecarlo_simd.c(122,3)
   remark #15417: vectorization support: number of FP up converts: single precision to double precision 1   [ montecarlo_simd.c(139,21) ]
   remark #15525: call to function 'rand' is serialized   [ montecarlo_simd.c(124,15) ]
   remark #15525: call to function 'rand' is serialized   [ montecarlo_simd.c(131,15) ]
   remark #15525: call to function 'rand' is serialized   [ montecarlo_simd.c(134,15) ]
   remark #15301: SIMD LOOP WAS VECTORIZED
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 521 
   remark #15477: vector loop cost: 336.680 
   remark #15478: estimated potential speedup: 1.450 
   remark #15479: lightweight vector operations: 58 
   remark #15481: heavy-overhead vector operations: 1 
   remark #15482: vectorized math library calls: 3 
   remark #15485: serialized function calls: 3
   remark #15487: type converts: 4 
   remark #15488: --- end vector loop cost summary ---
   remark #15489: --- begin vector function matching report ---
   remark #15490: Function call: rand(void) with simdlen=16, actual parameter types: (void)   [ montecarlo_simd.c(124,15) ]
   remark #15545: SIMD annotation was not seen, consider adding 'declare simd' directives at function declaration 
   remark #15490: Function call: rand(void) with simdlen=16, actual parameter types: (void)   [ montecarlo_simd.c(131,15) ]
   remark #15545: SIMD annotation was not seen, consider adding 'declare simd' directives at function declaration 
   remark #15490: Function call: rand(void) with simdlen=16, actual parameter types: (void)   [ montecarlo_simd.c(134,15) ]
   remark #15545: SIMD annotation was not seen, consider adding 'declare simd' directives at function declaration 
   remark #15493: --- end vector function matching report ---
LOOP END

The vectorization report for CPU:

LOOP BEGIN at montecarlo_simd.c(122,3)
   remark #15399: vectorization support: unroll factor set to 2
   remark #15417: vectorization support: number of FP up converts: single precision to double precision 1   [ montecarlo_simd.c(139,21) ]
   remark #15525: call to function 'rand' is serialized   [ montecarlo_simd.c(124,15) ]
   remark #15525: call to function 'rand' is serialized   [ montecarlo_simd.c(131,15) ]
   remark #15525: call to function 'rand' is serialized   [ montecarlo_simd.c(134,15) ]
   remark #15301: SIMD LOOP WAS VECTORIZED
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 521 
   remark #15477: vector loop cost: 416.250 
   remark #15478: estimated potential speedup: 1.230 
   remark #15479: lightweight vector operations: 57 
   remark #15481: heavy-overhead vector operations: 1 
   remark #15482: vectorized math library calls: 3 
   remark #15485: serialized function calls: 3
   remark #15487: type converts: 4 
   remark #15488: --- end vector loop cost summary ---
   remark #15489: --- begin vector function matching report ---
   remark #15490: Function call: rand(void) with simdlen=16, actual parameter types: (void)   [ montecarlo_simd.c(124,15) ]
   remark #15545: SIMD annotation was not seen, consider adding 'declare simd' directives at function declaration 
   remark #15490: Function call: rand(void) with simdlen=16, actual parameter types: (void)   [ montecarlo_simd.c(131,15) ]
   remark #15545: SIMD annotation was not seen, consider adding 'declare simd' directives at function declaration 
   remark #15490: Function call: rand(void) with simdlen=16, actual parameter types: (void)   [ montecarlo_simd.c(134,15) ]
   remark #15545: SIMD annotation was not seen, consider adding 'declare simd' directives at function declaration 
   remark #15493: --- end vector function matching report ---
LOOP END

Phis on Calcul Quebec

There are 100 Intel Xeon Phi coprocessors, along with Intel MPI, available on the Guillimin cluster. Submitting a Xeon Phi job is similar to submitting a regular job. The main difference is that the submission script and/or the qsub command should specify the "mics=2" resource in the resource list, and the job must be submitted to the phi queue:

$ qsub -q phi -l nodes=1:ppn=16:mics=2 ./submit_script.sh

Each Xeon Phi node has 2 coprocessors and 16 cores. You may request mics=1 or mics=2, but your access will be limited to 1 or 2 accelerators (respectively) on each node reserved for your job. You must request at least one core per node to gain access to the Xeon Phis. Please refer here for details:[10]

TBB on Phi

Running TBB natively

Code that runs natively on the Phi can use the TBB parallel programming model just as it would on the host, with no unusual complications beyond the larger number of threads. When compiling programs that employ TBB constructs, be sure to link in the Intel TBB shared library with -tbb and -pthread; if you don't, you will get undefined references.

icpc -mmic -tbb -pthread yours.cpp

Afterwards, export the TBB library path in the Phi OS:

export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/global/c/work/feimao/intel/icc/tbb/lib/mic/

Offloading TBB

The Intel TBB header files are not available in the Intel MIC target environment by default (the same is true for Intel Cilk Plus). To make them available on the coprocessor, the header files have to be wrapped with #pragma offload_attribute directives, as demonstrated in the example below:

#pragma offload_attribute (push,target(mic))
#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#pragma offload_attribute (pop)

Functions called from within the offloaded construct, and global data required on the Intel Xeon Phi coprocessor, should be annotated with the special attribute:

__attribute__((target(mic)))

Please refer here for more details about using TBB on Phi.[11]

References

  • Intel® Xeon Phi™ Coprocessor - the Architecture[12]
  • Best Known Methods for Using OpenMP* on Intel® Many Integrated Core (Intel® MIC) Architecture[13]
  • Best Practice Guide - Intel Xeon Phi[14]
  • Intel® Math Kernel Library (Intel® MKL) on Intel® Many Integrated Core Architecture (Intel® MIC Architecture)[15]
  • Using the Intel® SDK for OpenCL™ Applications XE 2013 with the Intel® Xeon Phi™ Coprocessor[16]
  • Using the Intel® MPI Library on Intel® Xeon Phi™ Coprocessor Systems[17]
  • Xeon Phi - Wiki de Calcul Quebec[18]