CRPC-TR95557-S: Compiler Blockability of Dense Matrix Factorizations

Steve Carr (Michigan Technological University)
Richard Lehoucq (Rice University)
August, 1995

Recent architectural advances have made memory access a significant bottleneck for computational problems. Although cache memory helps in some cases, it fails to alleviate the bottleneck for problems with large working sets. As a result, scientists are forced to restructure their codes by hand to reduce the working-set size to fit a particular machine. Unfortunately, these hand optimizations produce machine-specific code that cannot be ported across architectures without a significant loss in performance or a significant effort to re-optimize the code.

It is the thesis of this paper that most of the hand optimizations performed on matrix factorization codes are unnecessary, because they can and should be performed by the compiler. It is better for the programmer to express algorithms in a machine-independent form and allow the compiler to handle the machine-dependent details. This makes the algorithms portable across architectures and removes the error-prone, expensive, and tedious process of hand optimization.

In this paper, we show that the Cholesky and LU factorizations may be optimized automatically by the compiler to be as efficient as the hand-optimized versions found in LAPACK. We also show that the QR factorization may be optimized by the compiler to perform comparably with the hand-optimized LAPACK version on matrix sizes that are typically run on the nodes of massively parallel systems. Our approach allows us to conclude that matrix factorizations can be expressed in a machine-independent form with the expectation of good memory performance across a variety of architectures.

NOTE: Also available as TR95-08 from the Department of Computer Science, Michigan Technological University.