CRPC-TR95557-S: Compiler Blockability of Dense Matrix Factorizations

Steve Carr (Michigan Technological University)
Richard Lehoucq (Rice University)
August, 1995

Recent architectural advances have made memory access a significant bottleneck for computational problems. Although cache memory helps in some cases, it fails to alleviate the bottleneck for problems with large working sets. As a result, scientists are forced to restructure their codes by hand to reduce the working-set size to fit a particular machine. Unfortunately, these hand optimizations produce machine-specific code that cannot be ported across architectures without a significant loss in performance or a significant effort to re-optimize the code.

It is the thesis of this paper that most of the hand optimizations performed on matrix factorization codes are unnecessary, because they can and should be performed by the compiler. It is better for the programmer to express algorithms in a machine-independent form and allow the compiler to handle the machine-dependent details. This makes the algorithms portable across architectures and removes the error-prone, expensive, and tedious process of hand optimization.

In this paper, we show that the Cholesky and LU factorizations may be optimized automatically by the compiler to be as efficient as the hand-optimized versions found in LAPACK. We also show that the QR factorization may be optimized by the compiler to perform comparably with the hand-optimized LAPACK version on matrix sizes that are typically run on the nodes of massively parallel systems. Our approach allows us to conclude that matrix factorizations can be expressed in a machine-independent form with the expectation of good memory performance across a variety of architectures.

NOTE: Also available as TR95-08 from the Department of Computer Science, Michigan Technological University.