IBM Books

Hitchhiker's Guide


How much communication is enough?

A significant factor that affects the performance of a parallel application is the balance between communication and workload. In some cases, the workload is unevenly distributed or is duplicated across multiple tasks. Ideally, you'd like perfect balance among the tasks, but doing so may require additional communication that actually makes the performance worse. We discussed this briefly in Duplication versus redundancy when we said that sometimes it's better to have all the tasks do the same thing rather than have one do it and try to send the results to the rest.

An example of where the decision is not so clear cut is the matrix inversion program in Chapter 2, The answer is 42. We showed you how to start making the sequential program into a parallel one by distributing the element calculation once the determinant was found. What we didn't show you was how poor a start this actually is. Part of the program is shown below. You can access the complete program from the IBM RS/6000 World Wide Web site. See Accessing PE documentation online for more information.

/*************************************************************************
*
* Matrix Inversion Program - First parallel implementation
*
* To compile:
* mpcc -g -o inverse_parallel inverse_parallel.c
*
*************************************************************************/
     {
/* There are only 2 unused rows/columns left */
 
/* Find the second unused row */
for(row2=row1+1;row2<size;row2++)
  {
    for(k=0;k<depth;k++)
      {
        if(row2==used_rows[k]) break;
      }
    if(k>=depth)  /* this row is not used */
      break;
  }
assert(row2<size);
 
/* Find the first unused column */
for(col1=0;col1<size;col1++)
  {
    for(k=0;k<depth;k++)
      {
         if(col1==used_cols[k]) break;
      }
    if(k>=depth)  /* this column is not used */
      break;
  }
assert(col1<size);
 
/* Find the second unused column */
for(col2=col1+1;col2<size;col2++)
  {
    for(k=0;k<depth;k++)
      {
        if(col2==used_cols[k]) break;
      }
    if(k>=depth)  /* this column is not used */
      break;
  }
assert(col2<size);
 
/* Determinant = m11*m22-m12*m21 */
return matrix[row1][col1]*matrix[row2][col2]-matrix
[row1][col2]*matrix[row2][col1];
        }
 
        /* There are more than 2 rows/columns in the matrix being processed  */
        /* Compute the determinant as the sum of the product of each element */
        /* in the first row and the determinant of the matrix with its row   */
        /* and column removed                                                */
        total = 0;
 
        used_rows[depth] = row1;
        for(col1=0;col1<size;col1++)
          {
            for(k=0;k<depth;k++)
              {
                if(col1==used_cols[k]) break;
              }
             if(k<depth)  /* This column is used -- skip it*/
               continue;
            used_cols[depth] = col1;
            total += sign*matrix[row1][col1]*determinant(matrix,size,used_rows,
            used_cols,depth+1);
            sign=(sign==1)?-1:1;
          }
        return total;
 
  }
 
void print_matrix(FILE * fptr,float ** mat,int rows, int cols)
{
  int i,j;
  for(i=0;i<rows;i++)
    {
      for(j=0;j<cols;j++)
        {
          fprintf(fptr,"%10.4f ",mat[i][j]);
        }
      fprintf(fptr,"\n");
    }
}
 
float coefficient(float **matrix,int size, int row, int col)
{
  float coef;
  int * ur, *uc;
 
  ur = malloc(size*sizeof(matrix));
  uc = malloc(size*sizeof(matrix));
  ur[0]=row;
  uc[0]=col;
  coef = (((row+col)%2)?-1:1)*determinant(matrix,size,ur,uc,1);
  return coef;
}

We suspect that we have a problem, and that it is not a communication bottleneck, but rather a computation problem. To illustrate this, compile the parallel matrix inversion program, inverse_parallel.c, with the -pg flag. Next, run gprof on the monitor files for tasks 0-7 (we know task 8 just collects the results so we aren't that concerned with its performance).

$ mpcc -g -pg -o inverse_parallel inverse_parallel.c
$ inverse_parallel -procs 9
$ gprof inverse_parallel gmon.out.[0-7]
 
%   cumulative   self              self     total
time   seconds  seconds     calls  ms/call  ms/call  name
80.3       9.81     9.81       72   136.25   136.25  .determinant [1]
8.8      10.89     1.08                             .__mcount [5]
3.6      11.33     0.44                             .kickpipes [6]
0.9      11.44     0.11                             .readsocket [7]
0.0      12.21     0.00       64     0.00   136.25  .coefficient [4]
0.0      12.21     0.00        8     0.00  1226.25  .main [2]

We see that we spend a lot of time in determinant, first to compute the determinant for the entire matrix and then in computing the determinant as part of computing the element values. That seems like a good place to start optimizing.

This algorithm computes the determinant of a matrix by using the determinants of the submatrices formed by eliminating the first row and a column from the matrix. The result of this recursion is that, eventually, the algorithm computes the determinants of all the 2 by 2 matrices formed from the last two rows and each combination of columns. This isn't so bad, but the same 2 by 2 matrix formed in this manner is computed n-2 times (once for each column except the 2 from which it is formed) each time a determinant is computed and there are n*(n-1)/2 such matrices. If the 2 by 2 matrix determinants can be captured and re-used, it would provide some improvements.

Not only is this a good approach for optimizing a sequential program, but parallelism capitalizes on this approach as well. Because the 2 by 2 determinants are independent, they can be computed in parallel and distributed among the tasks. Each task can take one of the columns and compute the determinants for all the matrices formed by that column and subsequent columns. Then the determinants can be distributed among all the tasks and used to compute the inverse elements.

The following example shows only the important parts of the program. You can access the complete program from the IBM RS/6000 World Wide Web site. See Accessing PE documentation online for more information.

Here's the call to partial determinant:

/************************************************************************
*
* Matrix Inversion Program - First optimized parallel version
*
* To compile:
* mpcc -g -o inverse_parallel_fast inverse_parallel_fast.c
*
************************************************************************/
 
  /* Compute determinant of last two rows */
  pd = partial_determinant(matrix,rows);
  /* Everyone computes the determinant (to avoid message transmission) */
  determ=determinant(matrix,rows,used_rows,used_cols,0,pd);

And here's the partial determinant call:

/* Compute the determinants of all 2x2 matrices created by combinations */
/* of columns of the bottom 2 rows                                      */
/* partial_determinant[i] points to the first determinant of all the 2x2*/
/* matricies formed by combinations with column i.  There are n-i-1     */
/* such matricies (duplicates are eliminated)                           */
float **partial_determinant(float **matrix,int size)
{
  int col1, col2, row1=(size-2), row2=(size-1);
  int i,j,k;
  int terms=0;
  float **partial_det,  /* pointers into the 2x2 determinants*/
                        /* by column                         */
        *buffer,        /* the 2x2 determinants              */
        *my_row;        /* the determinants computed by this */
                        /* task                              */
  int * recv_counts, * recv_displacements; /* the size and offsets for the */
                                           /* determinants to be received from  */
                                           /* the other tasks        */
 
  terms = (size-1)*(size)/2;  /* number of combinations of columns */
 
  /* Allocate work areas for paritial determinants and message passing, */
  partial_det = (float **) malloc((size-1)*sizeof(*partial_det));
  buffer      = (float *)  malloc(terms*sizeof(buffer));
  my_row      = (float *)  malloc((size-me-1)*sizeof(my_row));
  recv_counts = (int *)    malloc(tasks*sizeof(*recv_counts));
  recv_displacements = (int *) malloc(tasks*sizeof(*recv_displacements));
 
  /* the tasks after the column size - 2 don't have to do anything */
  for(i=tasks-1;i>size-2;i--)
    {
        recv_counts[i]=0;
        recv_displacements[i]=terms;
    }
  /* all the other tasks compute the determinants for combinations */
  /* with its column                                               */
  terms--;
  for(i=size-2;i>=0;i--)
    {
        partial_det[i]=&(buffer[terms]);
        recv_displacements[i]=terms;
        recv_counts[i]=size-i-1;
        terms-=(size-i);
    }
  for(j=0;j<(size-me-1);j++)
    {
      my_row[j]=matrix[row1][me]*matrix[row2][me+j+1]
      -matrix[row1][me+j+1]*matrix[row2][me];
    }
 
  /* Now everybody sends their columns determinants to everybody else */
  /* Even the tasks that did not compute determinants will get the    */
  /* results from everyone else (doesn't sound fair, does it?)        */
  MPI_Allgatherv(my_row,
                 ((size-me-1)>0)?(size-me-1):0,
                 MPI_REAL,
                 buffernts,
                 recv_displacements,
                 MPI_REAL,MPI_COMM_WORLD);
 
  /* Free up the work area and return the array of pointers into the */
  /* determinants                                                    */
  free(my_row);
  return partial_det;
}

The question is whether the cost of the additional communication offsets the advantage of computing the 2 by 2 determinants in parallel. In this example, it may not be because the small message sizes (the largest is three times the size of a float). As the matrix size increases, the cost of computing the 2 by 2 determinants will increase with the square of n (the size of the matrix) but the cost of computing the determinants in parallel will increase with n (each additional dimension increases the work of each parallel task by only one additional 2 by 2 matrix) so, eventually, the parallel benefit will offset the communication cost.


[ Top of Page | Previous Page | Next Page | Table of Contents | Index ]