As mentioned previously, PETSc offers a number of options for the
multigrid preconditioner. Figure 1 shows the difference in
scalability between two options for the coarse grid solve when using multigrid
as a preconditioner to GMRES(30) on a structured grid on the IA32 cluster.
Keeping the number of unknowns per processor fixed at
million, we
would see a horizontal line for ideal scaling. The orange plot shows a direct
method, LU factorization, for the coarse grid solve. The blue plot shows an
iterative method that applies only the preconditioner, which is set to block Jacobi.
LU does better than the iterative method until 128 processors, where the time
for the coarse grid solve with LU blows up to around 700 seconds. This is most
likely the result of the poor scaling of the direct solver. Multigrid methods
can tolerate an approximate (and cheaper) coarse grid solver and so we will use
the block Jacobi preconditioner on the coarse grid for all other PETSc results
presented here.
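
The "horizontal line for ideal scaling" criterion used above can be made concrete with a small calculation. The following is a minimal sketch, assuming the wall clock time at each processor count is available as a plain mapping; the function name and the timings are illustrative placeholders, not measurements from this study.

```python
def weak_scaling_efficiency(times_by_procs):
    """Return {p: T(p_min) / T(p)} for a weak-scaling run.

    With a fixed number of unknowns per processor, ideal scaling keeps
    the wall clock time constant, so every efficiency equals 1.0
    (the horizontal line in a time-versus-processors plot).
    """
    base_procs = min(times_by_procs)
    base_time = times_by_procs[base_procs]
    return {p: base_time / t for p, t in sorted(times_by_procs.items())}


# Placeholder timings in seconds (hypothetical, NOT results from the paper).
example_times = {1: 10.0, 2: 10.0, 4: 12.5, 8: 20.0}

efficiency = weak_scaling_efficiency(example_times)
for procs, eff in efficiency.items():
    print(f"{procs:4d} processors: efficiency {eff:.2f}")
```

An efficiency that stays near 1.0 corresponds to a flat curve; a sharp drop at one processor count would correspond to the kind of blow-up seen for the LU coarse grid solve.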
Repeating the same scaling study on IA32 with the more standard preconditioning methods implemented in PETSc, this time on an unstructured grid, makes it more evident how important a good preconditioner is. Figure 2 shows that Jacobi does not work very well as a preconditioner on the model problem, whereas block Jacobi does. Of these options, CG preconditioned with block Jacobi is the best choice.
Figure 3 compares GMRES preconditioned with multigrid, using block Jacobi on the coarse grid, against CG preconditioned with block Jacobi. Both are implemented in PETSc and run on IA32. The multigrid-preconditioned method scales much better, which demonstrates the better algorithmic scalability of multigrid.
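
For readers who want to reproduce a comparison of this kind, the two solver configurations can be selected through PETSc runtime options. The sketch below assembles plausible option lists in Python; the helper `petsc_options` is hypothetical, the restart length of 30 is carried over from the GMRES(30) runs described earlier as an assumption, and the exact settings used for the figures (number of multigrid levels, smoothers, tolerances) are not stated in the text and are therefore left out.

```python
def petsc_options(ksp, pc, coarse_pc=None, gmres_restart=None):
    """Build a list of PETSc command-line options (hypothetical helper).

    ksp           -- Krylov method, e.g. "gmres" or "cg"
    pc            -- preconditioner, e.g. "mg", "bjacobi" or "jacobi"
    coarse_pc     -- preconditioner for the multigrid coarse grid solve
    gmres_restart -- restart length for GMRES, if any
    """
    opts = ["-ksp_type", ksp, "-pc_type", pc]
    if gmres_restart is not None:
        opts += ["-ksp_gmres_restart", str(gmres_restart)]
    if coarse_pc is not None:
        # "preonly" applies the coarse preconditioner once, with no
        # Krylov iterations on the coarse grid.
        opts += ["-mg_coarse_ksp_type", "preonly",
                 "-mg_coarse_pc_type", coarse_pc]
    return opts


# Assumed settings, not confirmed by the text.
gmres_mg = petsc_options("gmres", "mg", coarse_pc="bjacobi", gmres_restart=30)
cg_bjacobi = petsc_options("cg", "bjacobi")

print("GMRES + multigrid:", " ".join(gmres_mg))
print("CG + block Jacobi:", " ".join(cg_bjacobi))
```

Passing `coarse_pc="lu"` instead would select the direct coarse grid solve; whether a parallel LU is available depends on how PETSc was built.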
Figure 4 shows hypre solving a structured grid problem on two
different architectures. As before, scalability is tested using
million unknowns per processor up to 256 processors. The architectures show
similar single-processor performance, although the theoretical peak flop rate
of the Linux cluster is higher than that of the SGI Origin2000. The slope of
the curves is small for the Linux cluster, meaning that it scales well.
However, the wall clock time rises more steeply on the SGI Origin2000 as the number of
processors increases. This is probably due to the better network connection in
the Linux cluster. Comparing the two numerical solvers, GMRES with a multigrid
preconditioner performs better than multigrid alone. The results from
the IA64 cluster have been omitted from the plots because the IA64 cluster
does not presently show improved times compared to the IA32 cluster. We are
investigating the cause.