* Program re-ordering for improved L2 cache hit rate.
* Automatic performance tuning.

# Motivations #

Matrix multiplications are a key building block of most modern high-performance computing systems.
Construct the relevant 4x4 transform matrices: $$\mathbf{A}_{S}=\left(\begin{array}{cccc} s_{x} & 0 & 0 & 0\\ 0 & s_{y} & 0 & 0\\ 0 & 0 ...