D.7.10. FMS Matrix Factoring Parameter IALGOR

Data Type

Integer

Default Value

Machine dependent

Description

This parameter directs FMS to use alternative algorithms when performing matrix multiplies during factoring, solving and matrix-vector multiply. These algorithms are designed to reduce the number of floating point operations performed, thereby providing a faster solution. The options may be selected independently. To determine the value for IALGOR, add the values of the options selected. (Note that the option values are a power of 2 and have the effect of setting bits in IALGOR).

0, Perform matrix multiplies using traditional techniques.
+1, Use Strassen's algorithm for performing matrix multiplies.
This algorithm modifies the traditional matrix multiply
```
   [C] = [A][B]
```
by first bisecting each of the matrices [A] and [B] into 4 equal parts. By appropriately adding, subtracting and multiplying the resulting quadrants of [A] and [B], it is possible to compute the matrix [C] with 7/8 of the multiplies required by the traditional technique. Because this algorithm is recursive, the matrices may be bisected several times, with each bisection reducing the remaining work by another factor of 7/8.

Ultimately, however, the matrices become too small and the overhead of adding and subtracting the quadrants exceeds the 7/8 benefit. The FMS parameter MINDIM is used to terminate this process by specifying the minimum dimension of a matrix quadrant at or below which traditional techniques are used.

There are, however, some downsides to this algorithm. First, because additions and subtractions are performed ahead of multiplications, some precision will be lost. The amount is problem-dependent. You should always check your answer with the traditional technique, IALGOR=0.

Second, the adding and subtracting of the quadrants results in a more random memory access pattern than the traditional technique. For this reason, some of the benefit may be lost in cache thrashing.

Third, current floating point processors fuse the multiplier and adder so the output of the multiplier flows directly into the adder. This hardware development occurred after these software algorithms were originally developed. When additions are performed, multiply cycles go unused.
+2, 3M Complex multiply (Complex data type only).
Perform complex matrix multiply using 3 instead of the traditional 4 multiply and add operations. In the traditional case, when two complex numbers A and B are multiplied and added to C, 4 multiplies and 4 adds are required as follows:
```
   (Re_C + Im_C i) = (Re_A + Im_A i) * (Re_B + Im_B i)

   P1 = Re_A*Re_B
   P2 = Re_A*Im_B
   P3 = Im_A*Re_B
   P4 = Im_A*Im_B

   Re_C = Re_C + (P1 - P4)
   Im_C = Im_C + (P2 + P3)
```
In this reduced operation case, the same calculation is performed with 3 multiplies as follows:
```
   P1 = Re_A*(Re_B - Im_B)
   P2 = (Re_A + Im_A)*Re_B
   P3 = (Re_A - Im_A)*Im_B

   Re_C = Re_C + (P1 + P3)
   Im_C = Im_C + (P2 - P1)
```
At first it would appear that we have not gained anything because, while we have reduced the number of multiplies from 4 to 3, we have increased the number of adds from 4 to 7. Since most computers perform multiplies and adds at the same speed (and many concurrently), we have actually increased the number of cycles to perform the operation.
However when A, B and C become matrices [A], [B] and [C], the balance shifts. For matrices of size N by N, N² operations are required for addition while 2N³ operations are required for multiplication. For sufficiently large N, the extra additions become insignificant.

Again there is a down side to this algorithm. The additions and subtractions performed prior to multiplies will reduce accuracy. The amount is problem-dependent. Multiply cycles will be wasted on floating point units with a fused multiply-adder.
+4, Complex data type only.
This is the same as option 2, except the 3 real matrices are formed first, before the data is divided among the processors or Strassen's algorithm is applied. Since they are larger, the work performed for additions is reduced.

For complex matrices both options may be selected with IALGOR=3 or 5.

In addition to observing run time, you may obtain the number of floating point operations saved from the parameter SAVOPS.