A11X1 + A12X2 + . . . + A1nXn = B1
A21X1 + A22X2 + . . . + A2nXn = B2
An1X1 + An2X2 + . . . + AnnXn = Bn
The calculation required is to determine the values of N unknowns X(N) from the array of known coefficients A(N,N) and vector B(N). Because the time required to solve this system increases as the cube of the number of equations N, this calculation often dominates analysis programs, resulting in limitations being placed on the analysis resolution, N.
In discussing techniques used to solve simultaneous equations, it is convenient to use the following matrix notation:
[A]{X} = {B}
where
| [A] = |
A11 A12 . . . A1n A21 A22 . . . A2n ... ... An1 An2 . . . Ann |
{X} = |
X1 X2 ... ... Xn |
{B} = |
B1 B2 ... ... Bn |
The symbol [A] deontes the N-by-N matrix of known coefficients. A single term in the matrix [A] is referred to as Aij, where the first subscript i is the row and the second subscript j is the column. The set of N known values in {B} is called the right-hand side vector while the set of N unknown values represented by {X} is called the solution vector.
Systems of simultaneous equations are often used to solve boundary value problems in the fields of structural mechanics, thermal analysis, fluid mechanics, diffusion, and electromagnetics. For these applications, the vector {X} usually represents unknown values of the field variable at sample points distributed throughout the domain of the problem. The vector {B} contains the known boundary condition values at the sample points. The matrix [A] serves as a mathematical model of the domain, relating the known boundary conditions {B} to the unknown field variable values {X}. The number of sample points, N, generally determines the resolution of the analysis.
As an example, suppose a structural analysis is performed. The matrix [A], known as the stiffness matrix, serves as a mathematical model of the object being analyzed. The vector {B} contains loads that are applied to the object [A], resulting in the displacements {X}.
Several techniques are currently used for generating simultaneous equation models of boundary value problems. Among the most popular are finite elements, finite difference, boundary integrals and the moment method. Simultaneous equations can also be generated directly from a model of discrete elements, as occurs in power network and electrical circuit analysis.
FMS includes special features for parallel processing of multiple right-hand side and solution vectors that provide extremely fast processing of related systems of euations. Therefore, the symbols {X} and {B} are used throughout this manual to mean one or more vectors.
Many boundary value problems produce symmetric matrices because the influence of one point i on another point j is the same as the influence of point j on point i. The property of symmetry is used in the equation solving algorithm to eliminate almost half of the matrix data storage and computation. The more general case of nonsymmetrical matrices, which provides for arbitrary coefficients Aij, is also included.
The data type can be real or complex. Real data requires only a single floating-point value to represent each term, while complex data requires both the real and imaginary parts. In addition to requiring twice as much storage, complex data requires four times as much processing because cross-products between the real and imaginary parts must be evaluated. For example, it requires one multiply(*) and one add (+) to compute the following for real data:
A = B + C * D
However, for complex data, the calculation becomes the following:
(A+ai) = (B+bi) + (C+Ci)*(D+di)
Uppercase letters denote the real part of the data, lowercase letters denote the imaginary part of the data, and i is the square root of (-1). The real and imaginary parts of the data are computed as follows:
A = B + C*D - c*d a = b + C*d + c*D
This computation requires four multiplies(*) and four additions (+ or -).
To optimize the use of memory pipelines, complex data is stored as the real part followed by the imaginary part. This storage is the default for variables typed COMPLEX in FORTRAN programs.
[A] = [AL] + [D] + [AU]
[A] is stored by rows, starting at the first nonzero term and proceeding across the row up to, but not including, the diagonal. Similarly, [AU] is stored by columns, starting at the first nonzero term and proceeding down the column to, but not including, the diagonal. The diagonal is stored as a separate vector.
A profile vector {LOWEQ} is used, which completely describes the profile sparsity of the matrix. This vector contains the lowest coupled equation for each equation. For the lower triangle matrix [AL], this is the column number of the first nonzero term in each row. For the upper traingle matrix [AU], this is the row number of the first nonzero term in each column. The number of off-diagonal terms in row I of [AL] or column I of [AU] is I - LOWEQ(I). For a full matrix, all terms in the vector {LOWEQ} would be 1.
FMS profile storage includes full matrices, which always start at row and column 1, and banded matrices, which start at a constant distance from the diagonal. As discussed during the factoring process, the matrix factors may always be overlaid on the profile storage described above.
In FORTRAN, the vector inner product is implemented by a function as follows:
DOUBLE PRECISION FUNCTION DOT (A, B, N)
REAL*8 A(N), B(N)
DOT = 0.0D0
DO I=1,N
DOT = DOT + A(I) * B(I)
END DO
RETURN
END
This calculation, which uses unit address increments for the vectors {A} and {B}, achieves maximum memory performance. Several modern processors contain a multiply-accumulate instruction which implements the above loop in hardware.
[A] = [B][C]
A typical term of the matrix [A], Aij is computed from the inner-product of the i-th row of matrix [B] and the j-th column of matrix [C],
| Aij = |
N k=1 |
Bik Ckj |
If the first matrix [B] is stored by rows and the second matrix [C] is stored by columns, the resulting inner-product uses incremental memory addressing that may be pipelined.
[A] = [L][U]
To obtain an expression for computing a typical term Uij in [U], consider its corresponding term Aij in [A]. Based on the matrix multiplication process described above, the term Aij is computed from the inner-product of the i-th row of [L] and the j-th column of [U],
| Aij = |
N k=1 |
Lik Ukj |
This inner-product can be divided into the following four parts:
| Aij = |
(IJLOW-1) k=1 |
Lik Ukj |   +  |
(i-1) k=IJLOW |
Lik Ukj |    +  LiiUij  +  |
N k=(i+1) |
Lik Ukj |
In this expression, the term IJLOW is equal to the maximum of ILOW and JLOW, where ILOW = LOWEQ(I) is the column number of the first nonzero term in i-th row of [L], and JLOW = LOWEQ(J) is the row number of the first nonzero term in column j-th column of [U].
The value of the first part of the inner-product is zero, because either or both of Lik and Ukj will be outside the matrix profile. For this reason profile sparsity is used on large matrices that are to be factored. The terms Lik and Ukj for k=1,(IJLOW-1) do not contribute to the matrix factors and remain zero during the factoring process.
The last part of the inner product involving the sum from (i+1) to N will also be zero because Lik is zero for terms in the upper triangle.
In computing the factors [L] and [U], there are N(N+1) terms to be determined from the N2 terms in [A]. Therefore, N terms of [L] or [U] may be chosen arbitrarily. The choice that results in the most efficient algorithm is to select the diagonal terms of [L] to be 1. Using Lii = 1, the above equation maybe solved for Uij as follows:
| Uij = Aij - |
(i-1) k=IJLOW |
Lik Ukj |
[A] = [L][U]
Expressions for computing terms in the lower triangular factor [L] can be derived in an identical way. Consider a term Aij that is below the diagonal,
| Aij = |
N k=1 |
Lik Ukj |
This inner-product can be divided into the following four parts:
| Aij = |
(IJLOW-1) k=1 |
Lik Ukj |   +  |
(j-1) k=IJLOW |
Lik Ukj |    + LijUjj  +  |
N k=(j+1) |
Lik Ukj |
As before, the term IJLOW is equal to the maximum of ILOW and JLOW where ILOW = LOQEQ(I) is the column number of the first nonzero term in row I and JLOW = LOWEQ(J) is the row number of the first nonzero term in column J.
The values of the first and fourth parts of the inner-product are zero as before because they involve terms outside the matrix profile or terms in the lower triangle of [U]. Solving the equation for the term Lij gives the following:
| Lij = (Aij - |
(j-1) k=IJLOW |
Lik Ukj)/Ujj |
Therefore, terms in the lower triangular factor [L] are also computed by subtracting an inner-product from the corresponding term in the original matrix [A]. However, the result is then multiplied by the diagonal factor reciprocal 1/Ujj.
During the factoring and vector solution processes, the diagonals of [U] are only used as denominators. Because multiplication is faster than division, the terms Uii are inverted immediately after being computed and the resiprocal values 1/Uii are overlaid on the original matrix diagonals Aii in the diagonal vector.
[A] = [L][D][L]T
| Aij = |
N k=1 |
Lik Dkk Ljk |
In this expression, the intermediate result Lik Dkk is computed, which is represented by the symbol (LD)ik, as follows:
(LD)ik = Lik Dkk
| Aij = |
N k=1 |
(LD)ik Ljk |
This inner-product can be divided into the following four parts:
| Aij = |
(IJLOW-1) k=1 |
(LD)ik Ljk |   +  |
(j-1) k=IJLOW |
(LD)ik Ljk |    + (LD)ijLjj  +  |
N k=(j+1) |
(LD)ik Ljk |
The term IJLOW is equal to the maximum of ILOW and JLOW, where ILOW = LOWEQ(I) and JLOW = LOWEQ(J) are the profile equation numbers for rows I and J of [L].
The value of the first part of the inner-product is zero because (LD)ik or Ljk is outside the matrix profile. The last part of the inner-product is zero because terms in the upper triangle of [L] are zero. Noting that Ljj = 1, the above equation can be solved to give the following:
| (LD)ij = Aij - |
(j-1) k=IJLOW |
(LD)ik Ljk |
Therefore, the first step in computing the matrix factor term Lij is to subtract an inner-product from the corresponding term in the original matrix, Aij. The inner-product is between all (LD) terms on row i to the left of Lij and the corresponding terms in row j of [L]. As the calculation proceeds across row i, the original matrix terms are transformed from Aij to (LD)ij.
The final value of Lij for row i are obtained by multiplying the terms (LD)ik by the diagonal factor reciprocals 1/Dkk.
Lik = (LD)ik (1/Dkk)
Because the diagonal factors [D] are only used as denominators, they are inverted and stored in the diagonal vector as reciprocals.
Before completing the calculation for Lij the value in (LD)ij must be used to compute the diagonal factor term Dii. Letting j = i in the equation for (LD) is as follows:
| (LD)ii = Aii - |
(i-1) k=IJLOW |
(LD)ik Lik |
Noting that:
(LD)ii = Lii Dii
and that Lii = 1, the equation for Dii becomes the following:
| Dii = Aii - |
(i-1) k=IJLOW |
(LD)ik Lik |
The computation of the diagonal for row i involves an inner-product between the intermediate result for row i, (LD)ik, and the final result for row i, Lik. Therefore, the conversion of (LD)ik and the calculation of Dii proceed together as follows:
DOT = 0.0D0
DO K = ILOW,(I-1)
LDSAVE = LD(I,K)
L(I,K) = DINV(K) * LD(I,K)
DOT = DOT + LDSAVE * L(I,K)
END DO
D(I,I) = A(I,I) - DOT
DINV(I) = 1.0D0/D(I,I)
When computing [L] and [D], the order of computation must begin at the upper left corner of the matrix A11 and end at the lower right Ann. The only restriction is that terms required to compute each Lij and Dii be known from previous calculations. Unless additional storage is to be used, the calculation of the diagonal factor Dii and the multiplication of the i-th row by the diagonal factor reciprocals must occur together, as discussed above. The actual FMS order is based on maximizing the reuse of high-speed registers once they have been loaded with data.
[L][D][L]T{X} = {B} (symmetric)
or
[L][U]{X} = {B} (nonsymmetric)
The first step is to solve the intermediate problem for the vector {Y} as follows:
[L]{Y} = {B}
For the data structure and algorithms in FMS, this is exactly the same for symmetric and nonsymmetric problems. In fact, if a symmetric matrix is processed with the nonsymmetric module, the matrix factor [L] is identical to the matrix factor [L] that would have been obtainied by using the symmetric module.
An expression for computing a typical term in vector {Y}, Yi, may be obtained from the vector calculation for Bi
| Bi = |
N k=1 |
Lik Yk |
This inner-product can be divided into the following four parts:
| Bi = |
(ILOW-1) k=1 |
Lik Yk |   +  |
(i-1) k=ILOW |
Lik Yk |    + LiiYi  +  |
N k=(i+1) |
Lik Yk |
ILOW = LOWEQ(I) is the column number of the first nonzero term in row i of [L]. The value of the first part of the inner product is zero because Lik is outside the matrix profile. The last part is also zero because Lik is zero for terms in the upper triangle. Recalling that Lii = 1, the expression for Yi is obtained as follows:
| Yi = Bi - |
(i-1) k=ILOW |
Lik Yk |
FMS stores the matrix [L] by rows and the terms Yk as a vector. Therefore the inner-product used for forward reduction produces incremental addressing that can be pipelined.
[D]{Z} = {Y}
Recalling that the diatonal factor [D] was stored in rediprocal form, this calculation becomes the following:
Zi = (1/Dii) Yi
As each multiplication is performed, the values of the scaled vector Zi are overlaid on the reduced vector Yi.
[U]{X} = {Y}
To obtain an expression for computing {X}, consider the vector calculation to compute Yi, as follows:
| Yi = |
N k=1 |
Uik Xk |
This inner-product can be divided into the following three parts:
| Yi = |
(i-1) k=1 |
Uik Xk |    + UiiXi  +  |
N k=(i+1) |
Uik Xk |
The first part is zero because it involves terms in the lower triangle of [U]. Letting
(XU)i = Xi Uii
the above equation can be solved for the following:
| (XU)i = Yi  - |
N k=(i+1) |
Uik Xk |
and
Xi = (XU)i (1/Uii)
As the calculations proceed in a backwards direction of decreasing i, the terms in Xk are known when they are required to compute Xi. For this reason, this calculation is usually called back substitution.
The inner-product
|
N k=(i+1) |
Uik Xk |
is not evaluated directly because it involves nonunity address increments in [U]. Instead partial sums are computed and stored after each Xi is obtained. Using the same storage locations for {Y}, {XU}, and {X}, the actual back substitution algorithm becomes the following:
Xi = (XU)i (1/Uii)
| (XU)k = (XU)k  - |
(i-1) k=ILOW |
Uki Xi |
In the previous equation, ILOW = LOWEQ(I). This calculational order provices incremental addressing of terms in [U] and processes only those terms that are within the matrix profile. After each term Xi is computed, it scales the i-th column of [U], Uki. The resulting vector terms are subtracted from the terms in {XU} below Xi.
| Xi = Zi  - |
N k=(i+1) |
Lki Xk |
This inner-product also involves nonunity addressing in [L]. Therefore partial sums are stored after each term Xi is computed. The actual algorithm uses the same storage for {X} and {Z} and computes the following:
| Xk = Xk  - |
(i-1) k=ILOW |
Lik Xi |
In this equation ILOW = LOWEQ(I). As each term Xi is computed, it scales the i-th row of [L]. The resulting vector terms are subtracted from the values of {X} below Xi.
{C} = {D} - [A]{B}
{C} = [A]{B}
{C} = {D} + [A]{B}
In these operations, [A] is a matrix stored in FMS matrix format and {B}, {C} and {D} are one or more vectors stored in FMS vector format.
One application of these multiply operations is to compute the residual error after solving a system of equations,
{E} = {B} - [A]{X}
The resulting error can then be used in interative refinement to compute a correction to the solution {X}.
A similar set of multiply routines are provided which use submatrix files instead of the assmebled matrix:
{C} = {D} - [Si]{B}
{C} = [Si]{B}
{C} = {D} + [Si]{B}
This calculation may be more efficient if the assembled matrix [A] is sparse.
FMS also contains the following multiplication routines:
{C} = {D} - {A}{B}
{C} = {A}{B}
{C} = {D} + {A}{B}
{C} = {D} - {A}T{B}
{C} = {A}T{B}
{C} = {D} + {A}T{B}
where {A}, {B}, {C} and {D} are FMS vector files. The above products may also be stored in FMS matrix files using
[C] = [D] - {A}T{B}
[C] = [D] {A}T{B}
[C] = [D] + {A}T{B}
FMS can also store the vector products in a memory based array to compute
[F] = {X}T{Y}
where [F] is a full matrix and {X} and {Y} are FMS vector files containing one or more vectors. For symmetric problems, only the lower triangle of [F] is computed.
These multiply operations have several pratical applications in computing Eigenvalues of the matrix [A]. For example, if {X} is a known Eigenvector, the correcponding Eigenvalue can be computed from the following:
e = ({X}T[A]{X})/({X}T{X})
This calculation uses the matrix-vectors multiply operation once and the vectors-vectors multiply operation twice.
These multiply operations are also used to project the matrix [A] into a subspace [A*] where its Eigenvalues can be computed more efficiently. This projection involves the following computation:
[A*] = {X}T[A]{X}
where {X} are the vectors that define the subspace.
In the special case where the matrix is diagonal [D], FMS directly computes the following quadratic form:
[F] = {X}T[D]{X} (symmetric)
[F] = {X}T[D]{Y} (nonsymmetric)
FMS also includes the vectors matrix multiply calculation
{Y} = {X}[F]
This calculation may be used to project subspace eigenvectors back into the problem space or to compute new vectors {Y} from a linear combination of old vectors {X}.