A. FMS Theory

This appendix contains a mathematical description of the matrix algebra functions performed by FMS.

Introduction

Simultaneous Equations

The solution of the following system of simultaneous equations is one of the most important calculations in scientific analysis:

A₁₁X₁ + A₁₂X₂ + . . . + A_1nX_n = B₁
A₂₁X₁ + A₂₂X₂ + . . . + A_2nX_n = B₂
...
...
A_n1X₁ + A_n2X₂ + . . . + A_nnX_n = B_n

The calculation required is to determine the values of N unknowns X(N) from the array of known coefficients A(N,N) and vector B(N). Because the time required to solve this system increases as the cube of the number of equations N, this calculation often dominates analysis programs, resulting in limitations being placed on the analysis resolution, N.

In discussing techniques used to solve simultaneous equations, it is convenient to use the following matrix notation:

[A]{X} = {B}

where

[A] =

A₁₁ A₁₂ . . . A_1n
A₂₁ A₂₂ . . . A_2n
...
...
A_n1 A_n2 . . . A_nn

{X} =

X₁
X₂
...
...
X_n

{B} =

B₁
B₂
...
...
B_n

The symbol [A] denotes the N-by-N matrix of known coefficients. A single term in the matrix [A] is referred to as A_ij, where the first subscript i is the row and the second subscript j is the column. The set of N known values in {B} is called the right-hand side vector while the set of N unknown values represented by {X} is called the solution vector.

Systems of simultaneous equations are often used to solve boundary value problems in the fields of structural mechanics, thermal analysis, fluid mechanics, diffusion, and electromagnetics. For these applications, the vector {X} usually represents unknown values of the field variable at sample points distributed throughout the domain of the problem. The vector {B} contains the known boundary condition values at the sample points. The matrix [A] serves as a mathematical model of the domain, relating the known boundary conditions {B} to the unknown field variable values {X}. The number of sample points, N, generally determines the resolution of the analysis.

As an example, suppose a structural analysis is performed. The matrix [A], known as the stiffness matrix, serves as a mathematical model of the object being analyzed. The vector {B} contains loads that are applied to the object [A], resulting in the displacements {X}.

Several techniques are currently used for generating simultaneous equation models of boundary value problems. Among the most popular are finite elements, finite difference, boundary integrals and the moment method. Simultaneous equations can also be generated directly from a model of discrete elements, as occurs in power network and electrical circuit analysis.

Related Systems of Equations

Often several related systems of equations that all have the same coefficient matrix [A] must be solved. For example, each vector {B}may represent a different set of boundary conditions that are applied to the same object [A]. If the systems are independent, where the values of each {B} do not depend on the previous solutions {X}, then all systems may be solved at the same time.

FMS includes special features for parallel processing of multiple right-hand side and solution vectors that provide extremely fast processing of related systems of equations. Therefore, the symbols {X} and {B} are used throughout this manual to mean one or more vectors.

Linear and Nonlinear Systems of Equations

Systems of simultaneous equations can be linear or nonlinear. For linear problems, the matrix coefficients A_ij are constant, while for nonlinear problems they depend on the solution values {X}. Nonlinear problems are usually solved as a sequence of linear problems, with the solution values {X} used to update the matrix coefficients A_ij at each stage of the analysis. FMS contains several features that minimize the amount of computation performed when only some of the matrix coefficients A_ij are changed.

Matrix Symmetry and Data Type

The matrix [A] can be symmetric, Hermitian or nonsymmetric. The corresponding values in the upper and lower triangle of a symmetric matrix are equal, A_ij = A_ji. For a Hermitian matrix, the upper triangle terms are the complex conjugate of the lower triangle terms and the diagonal terms are real.

Many boundary value problems produce symmetric matrices because the influence of one point i on another point j is the same as the influence of point j on point i. The property of symmetry is used in the equation solving algorithm to eliminate almost half of the matrix data storage and computation. The more general case of nonsymmetrical matrices, which provides for arbitrary coefficients A_ij, is also included.

The data type can be real or complex. Real data requires only a single floating-point value to represent each term, while complex data requires both the real and imaginary parts. In addition to requiring twice as much storage, complex data requires four times as much processing because cross-products between the real and imaginary parts must be evaluated. For example, it requires one multiply(*) and one add (+) to compute the following for real data:

A = B + C * D

However, for complex data, the calculation becomes the following:

(A+ai) = (B+bi) + (C+Ci)*(D+di)

Uppercase letters denote the real part of the data, lowercase letters denote the imaginary part of the data, and i is the square root of (-1). The real and imaginary parts of the data are computed as follows:

A = B + C*D - c*d
a = b + C*d + c*D

This computation requires four multiplies(*) and four additions (+ or -).

To optimize the use of memory pipelines, complex data is stored as the real part followed by the imaginary part. This storage is the default for variables typed COMPLEX in FORTRAN programs.

Profile Matrix Storage

FMS uses profile sparsity storage to account for large areas of the matrix [A] that may be zero. The original matrix is stored by lower triangular part [AL], diagonal [D], and upper triangular part [AU] as follows:

[A] = [AL] + [D] + [AU]

[A] is stored by rows, starting at the first nonzero term and proceeding across the row up to, but not including, the diagonal. Similarly, [AU] is stored by columns, starting at the first nonzero term and proceeding down the column to, but not including, the diagonal. The diagonal is stored as a separate vector.

A profile vector {LOWEQ} is used, which completely describes the profile sparsity of the matrix. This vector contains the lowest coupled equation for each equation. For the lower triangle matrix [AL], this is the column number of the first nonzero term in each row. For the upper triangle matrix [AU], this is the row number of the first nonzero term in each column. The number of off-diagonal terms in row I of [AL] or column I of [AU] is I - LOWEQ(I). For a full matrix, all terms in the vector {LOWEQ} would be 1.

Nonsymmetric Matrices

For nonsymmetric matrices, FMS uses a single Profile Vector {LOWEQ} to describe the profiles of both [AL] and [AU]. This assumption of symmetric sparsity results in a considerably more efficient data handling algorithm than if separate profile vectors were used for [AL] and [AU]. Physically, symmetric sparsity implies that if point i couples to point j, then point j also couples to point i. The values of the compling coefficients A_ij and A_ji may be equal or unequal to produce a symmetric or nonsymmetric matrix. Naturally, zero matrix coefficients may be used if the profile is not symmetric.

FMS profile storage includes full matrices, which always start at row and column 1, and banded matrices, which start at a constant distance from the diagonal. As discussed during the factoring process, the matrix factors may always be overlaid on the profile storage described above.

Symmetric Matrices

For symmetric matrices [A], the upper triangular matrix [AU] is the transpose of the lower triangular matrix [AL], [AU] = [AL]^T. Therefore, [AU] is not required or used for symmetric matrix problems. Note that the FMS symmetric storage, which is by rows in [AL], is equivalent to storage by columns in [AU]. The FMS storage can be viewed as a generalization of the skyline format storage frequently used for symmetric matrices.

Advantages of Profile Storage

Profile storage offers both mathematical and computational advantages over other methods of storing the matrix [A].

Mathematical Advantages

First, the matrix factors may be overlaid on the original matrix during the factoring process, resulting in efficient use of memory and disk space. Second, each matrix factor term is computed from the corresponding term in the original matrix by subtracting an inner-product sum. Therefore, the matrix terms are transformed one-by-one from initial matrix values to final factored values without storing intermediate results. In addition to being computationally efficient, this transformation provides inherent restart capability. The ability to modify part of the matrix and restart the factoring process is a valuable property that eliminates a significant amount of processing for nonlinear problems.

Computational Advantages

In addition to mathematical advantages of profile storage, there are also computational advantages for modern computers. The inner-products required to compute the matrix factor terms are computed using incremental addressing when the matrix is stored in FMS profile format. High-speed scientific computers operate the arithmetic units at a faster speed than the memory access time. To provide data at the faster rate, memory is divided into banks (interleaved). As long as memory addresses are generated sequentially, data references rotate among the memory banks to provide a data stream equal to the arithmetic processing speed. Random access of data, which results from indirect addressing or a step size other than unity, results in the arithmetic units waiting for data from memory. The profile storage described above results in the best possible performance for modern scientific computers.

Vector Inner-Products

The FMS package is based on the efficient implementation of the vector inner-product. This operation requires two vectors as input and produces a scalar result. The inner-product is also called a dot product or scalar product.

In FORTRAN, the vector inner-product is implemented by a function as follows:

       DOUBLE PRECISION FUNCTION DOT (A, B, N)
       REAL*8 A(N), B(N)
       DOT = 0.0D0
       DO I=1,N
          DOT = DOT + A(I) * B(I)
       END DO
       RETURN
       END

This calculation, which uses unit address increments for the vectors {A} and {B}, achieves maximum memory performance. Several modern processors contain a multiply-accumulate instruction which implements the above loop in hardware.

Factoring

Before deriving expressions for factoring the matrix [A], it is useful to examine the matrix multiplication process. Let [A] be the product of two full matrices [B] and [C], as follows:

matrix_multiply

[A] = [B][C]

A typical term of the matrix [A], A_ij is computed from the inner-product of the i-th row of matrix [B] and the j-th column of matrix [C],

A_ij =

N
Σ
k=1

B_ik C_kj

If the first matrix [B] is stored by rows and the second matrix [C] is stored by columns, the resulting inner-product uses incremental memory addressing that may be pipelined.

Nonsymmetric Factoring

The nonsymmetric factoring process consists of computing the lower and upper triangular matrices whose product equals the original matrix [A], as follows:

Factoring U
[A] = [L][U]

To obtain an expression for computing a typical term U_ij in [U], consider its corresponding term A_ij in [A]. Based on the matrix multiplication process described above, the term A_ij is computed from the inner-product of the i-th row of [L] and the j-th column of [U],

A_ij =

N
Σ
k=1

L_ik U_kj

This inner-product can be divided into the following four parts:

A_ij =

(IJLOW-1)
Σ
k=1

L_ik U_kj

(i-1)
Σ
k=IJLOW

L_ik U_kj

+ L_iiU_ij +

N
Σ
k=(i+1)

L_ik U_kj

In this expression, the term IJLOW is equal to the maximum of ILOW and JLOW, where ILOW = LOWEQ(I is the column number of the first nonzero term in i-th row of [L], and JLOW = LOWEQ(J) is the row number of the first nonzero term in column j-th column of [U].

The value of the first part of the inner-product is zero, because either or both of L_ik and U_kj will be outside the matrix profile. For this reason profile sparsity is used on large matrices that are to be factored. The terms L_ik and U_kj for k=1,(IJLOW-1) do not contribute to the matrix factors and remain zero during the factoring process.

The last part of the inner-product involving the sum from (i+1) to N will also be zero because L_ik is zero for terms in the upper triangle.

In computing the factors [L] and [U], there are N(N+1) terms to be determined from the N² terms in [A]. Therefore, N terms of [L] or [U] may be chosen arbitrarily. The choice that results in the most efficient algorithm is to select the diagonal terms of [L] to be 1. Using L_ii = 1, the above equation may be solved for U_ij as follows:

U_ij = A_ij -

(i-1)
Σ
k=IJLOW

L_ik U_kj

Terms in the Lower Triangular Matrix

Factoring L [A] = [L][U]

Expressions for computing terms in the lower triangular factor [L] can be derived in an identical way. Consider a term A_ij that is below the diagonal,

A_ij =

N
Σ
k=1

L_ik U_kj

This inner-product can be divided into the following four parts:

A_ij =

(IJLOW-1)
Σ
k=1

L_ik U_kj

(j-1)
Σ
k=IJLOW

L_ik U_kj

+ L_ijU_jj +

N
Σ
k=(j+1)

L_ik U_kj

As before, the term IJLOW is equal to the maximum of ILOW and JLOW where ILOW = LOQEQ(I) is the column number of the first nonzero term in row I and JLOW = LOWEQ(J) is the row number of the first nonzero term in column J.

The values of the first and fourth parts of the inner-product are zero as before because they involve terms outside the matrix profile or terms in the lower triangle of [U]. Solving the equation for the term L_ij gives the following:

L_ij = (A_ij -

(j-1)
Σ
k=IJLOW

L_ik U_kj)/U_jj

Therefore, terms in the lower triangular factor [L] are also computed by subtracting an inner-product from the corresponding term in the original matrix [A]. However, the result is then multiplied by the diagonal factor reciprocal 1/U_jj.

During the factoring and vector solution processes, the diagonals of [U] are only used as denominators. Because multiplication is faster than division, the terms U_ii are inverted immediately after being computed and the reciprocal values 1/U_ii are overlaid on the original matrix diagonals A_ii in the diagonal vector.

Term-by-term Computation

The expressions for U_ij and L_ij compute the matrix factors [U] and [L] term-by-term, replacing each corresponding term of A_ij as the calculation proceeds. In general, the calculations must progress from the upper left corner of the matrix A₁₁ and end at the lower right corner A_nn. The only restriction is that the terms required to compute each U_ij or L_ij be known from previous calculations. The actual FMS order is based on maximizing the reuse of high-speed registers once they have been loaded with data.

Symmetric Factoring

The symmetric factoring process consists of computing lower triangular and diagonal matrices whose product [L][D][L]^T equals the original symmetric matrix [A], as follows:

Factoring LDL [A] = [L][D][L]^T

A_ij =

N
Σ
k=1

L_ik D_kk L_jk

In this expression, the intermediate result L_ik D_kk is computed, which is represented by the symbol (LD)_ik, as follows:

(LD)_ik = L_ik D_kk

A_ij =

N
Σ
k=1

(LD)_ik L_jk

This inner-product can be divided into the following four parts:

A_ij =

(IJLOW-1)
Σ
k=1

(LD)_ik L_jk

(j-1)
Σ
k=IJLOW

(LD)_ik L_jk

+ (LD)_ijL_jj +

N
Σ
k=(j+1)

(LD)_ik L_jk

The term IJLOW is equal to the maximum of ILOW and JLOW, where ILOW = LOWEQ(I) and JLOW = LOWEQ(J) are the profile equation numbers for rows I and J of [L].

The value of the first part of the inner-product is zero because (LD)_ik or L_jk is outside the matrix profile. The last part of the inner-product is zero because terms in the upper triangle of [L] are zero. Noting that L_jj = 1, the above equation can be solved to give the following:

(LD)_ij = A_ij -

(j-1)
Σ
k=IJLOW

(LD)_ik L_jk

Therefore, the first step in computing the matrix factor term L_ij is to subtract an inner-product from the corresponding term in the original matrix, A_ij. The inner-product is between all (LD) terms on row i to the left of L_ij and the corresponding terms in row j of [L]. As the calculation proceeds across row i, the original matrix terms are transformed from A_ij to (LD)_ij.

The final values of L_ij for row i are obtained by multiplying the terms (LD)_ik by the diagonal factor reciprocals 1/D_kk.

L_ik = (LD)_ik (1/D_kk)

Because the diagonal factors [D] are only used as denominators, they are inverted and stored in the diagonal vector as reciprocals.

Before completing the calculation for L_ij the value in (LD)_ij must be used to compute the diagonal factor term D_ii. Letting j = i in the equation for (LD) is as follows:

(LD)_ii = A_ii -

(i-1)
Σ
k=IJLOW

(LD)_ik L_ik

Noting that:

(LD)_ii = L_ii D_ii

and that L_ii = 1, the equation for D_ii becomes the following:

D_ii = A_ii -

(i-1)
Σ
k=IJLOW

(LD)_ik L_ik

The computation of the diagonal for row i involves an inner-product between the intermediate result for row i, (LD)_ik, and the final result for row i, L_ik. Therefore, the conversion of (LD)_ik and the calculation of D_ii proceed together as follows:

       DOT = 0.0D0
       DO K = ILOW,(I-1)
          LDSAVE = LD(I,K)
          L(I,K) = DINV(K) * LD(I,K)
          DOT    = DOT + LDSAVE * L(I,K)
       END DO
       D(I,I) = A(I,I) - DOT
       DINV(I) = 1.0D0/D(I,I)

When computing [L] and [D], the order of computation must begin at the upper left corner of the matrix A₁₁ and end at the lower right A_nn. The only restriction is that terms required to compute each L_ij and D_ii be known from previous calculations. Unless additional storage is to be used, the calculation of the diagonal factor D_ii and the multiplication of the i-th row by the diagonal factor reciprocals must occur together, as discussed above. The actual FMS order is based on maximizing the reuse of high-speed registers once they have been loaded with data.

Vector Solution

Once the matrix [A] is factored, the following system of equations can be solved for the unknowns {X}:

[L][D][L]^T{X} = {B} (symmetric)

[L][U]{X} = {B} (nonsymmetric)

The first step is to solve the intermediate problem for the vector {Y} as follows:

[L]{Y} = {B}

For the data structure and algorithms in FMS, this is exactly the same for symmetric and nonsymmetric problems. In fact, if a symmetric matrix is processed with the nonsymmetric module, the matrix factor [L] is identical to the matrix factor [L] that would have been obtained by using the symmetric module.

An expression for computing a typical term in vector {Y}, Y_i, may be obtained from the vector calculation for B_i

B_i =

N
Σ
k=1

L_ik Y_k

This inner-product can be divided into the following four parts:

B_i =

(ILOW-1)
Σ
k=1

L_ik Y_k

(i-1)
Σ
k=ILOW

L_ik Y_k

+ L_iiY_i +

N
Σ
k=(i+1)

L_ik Y_k

ILOW = LOWEQ(I) is the column number of the first nonzero term in row i of [L]. The value of the first part of the inner-product is zero because L_ik is outside the matrix profile. The last part is also zero because L_ik is zero for terms in the upper triangle. Recalling that L_ii = 1, the expression for Y_i is obtained as follows:

Y_i = B_i -

(i-1)
Σ
k=ILOW

L_ik Y_k

Forward Reduction

As the calculation proceeds in the forward direction of increasing i, the terms in Y_k are known when they are required to compute Y_i. For this reason, this calculation is usually called forward reduction and the vector {Y} is called the reduced vector. As each term Y_i is computed, it can be overlaid on the right-hand side vector term B_i.

FMS stores the matrix [L] by rows and the terms Y_k as a vector. Therefore the inner-product used for forward reduction produces incremental addressing that can be pipelined.

Diagonal Scaling

For symmetric matrices, the second step of vector solution is diagonal scaling. This operation computes the intermediate vector {Z} from the reduced vector {Y} as follows:

[D]{Z} = {Y}

Recalling that the diagonal factor [D] was stored in reciprocal form, this calculation becomes the following:

Z_i = (1/D_ii) Y_i

As each multiplication is performed, the values of the scaled vector Z_i are overlaid on the reduced vector Y_i.

Back Substitution in Nonsymmetric Problems

The final step in vector solution is back substitution. For nonsymmetric problems, this step involves solving the following equation for the solution {X}:

[U]{X} = {Y}

To obtain an expression for computing {X}, consider the vector calculation to compute Y_i, as follows:

Y_i =

N
Σ
k=1

U_ik X_k

This inner-product can be divided into the following three parts:

Y_i =

(i-1)
Σ
k=1

U_ik X_k

+ U_iiX_i +

N
Σ
k=(i+1)

U_ik X_k

The first part is zero because it involves terms in the lower triangle of [U]. Letting

(XU)_i = X_i U_ii

the above equation can be solved for the following:

(XU)_i = Y_i -

N
Σ
k=(i+1)

U_ik X_k

and

X_i = (XU)_i (1/U_ii)

As the calculations proceed in a backwards direction of decreasing i, the terms in X_k are known when they are required to compute X_i. For this reason, this calculation is usually called back substitution.

The inner-product

N
Σ
k=(i+1)

U_ik X_k

is not evaluated directly because it involves nonunity address increments in [U]. Instead partial sums are computed and stored after each X_i is obtained. Using the same storage locations for {Y}, {XU}, and {X}, the actual back substitution algorithm becomes the following:

X_i = (XU)_i (1/U_ii)

(XU)_k = (XU)_k -

(i-1)
Σ
k=ILOW

U_ki X_i

In the previous equation, ILOW = LOWEQ(I). This calculational order provides incremental addressing of terms in [U] and processes only those terms that are within the matrix profile. After each term X_i is computed, it scales the i-th column of [U], U_ki. The resulting vector terms are subtracted from the terms in {XU} below X_i.

Back Substitution in Symmetric Problems

The symmetric back substitution process is almost identical, except [L]^T is used for [U] with L_ii = 1. A typical term X_i is obtained from the scaled vector Z_i by subtracting the inner-product as follows:

X_i = Z_i -

N
Σ
k=(i+1)

L_ki X_k

This inner-product also involves nonunity addressing in [L]. Therefore partial sums are stored after each term X_i is computed. The actual algorithm uses the same storage for {X} and {Z} and computes the following:

X_k = X_k -

(i-1)
Σ
k=ILOW

L_ik X_i

In this equation ILOW = LOWEQ(I). As each term X_i is computed, it scales the i-th row of [L]. The resulting vector terms are subtracted from the values of {X} below X_i.

Matrix and Vector Multiplication

FMS includes multiply operations that occur frequently in matrix algebra. The first operation computes the following:

{C} = {D} - [A]{B}
{C} =       [A]{B}
{C} = {D} + [A]{B}

In these operations, [A] is a matrix stored in FMS matrix format and {B}, {C} and {D} are one or more vectors stored in FMS vector format. One application of these multiply operations is to compute the residual error after solving a system of equations,

{E} = {B} - [A]{X}

The resulting error can then be used in interative refinement to compute a correction to the solution {X}.

A similar set of multiply routines are provided which use submatrix files instead of the assembled matrix:

{C} = {D} - [Si]{B}
{C} =       [Si]{B}
{C} = {D} + [Si]{B}

This calculation may be more efficient if the assembled matrix [A] is sparse.

FMS also contains the following multiplication routines:

{C} = {D} - {A}{B}
{C} =       {A}{B}
{C} = {D} + {A}{B}

{C} = {D} - {A}^T{B}
{C} =       {A}^T{B}
{C} = {D} + {A}^T{B}

where {A}, {B}, {C} and {D} are FMS vector files. The above products may also be stored in FMS matrix files using

[C] = [D] - {A}^T{B}
[C] = [D]   {A}^T{B}
[C] = [D] + {A}^T{B}

FMS can also store the vector products in a memory-based array to compute

[F] = {X}^T{Y}

where [F] is a full matrix and {X} and {Y} are FMS vector files containing one or more vectors. For symmetric problems, only the lower triangle of [F] is computed.

These multiply operations have several practical applications in computing Eigenvalues of the matrix [A]. For example, if {X} is a known Eigenvector, the corresponding Eigenvalue can be computed from the following:

e = ({X}^T[A]{X})/({X}^T{X})

This calculation uses the matrix-vectors multiply operation once and the vectors-vectors multiply operation twice.

These multiply operations are also used to project the matrix [A] into a subspace [A^*] where its Eigenvalues can be computed more efficiently. This projection involves the following computation:

[A^*] = {X}^T[A]{X}

where {X} are the vectors that define the subspace.

In the special case where the matrix is diagonal [D], FMS directly computes the following quadratic form:

[F] = {X}^T[D]{X} (symmetric)

[F] = {X}^T[D]{Y} (nonsymmetric)

FMS also includes the vectors matrix multiply calculation

{Y} = {X}[F]

This calculation may be used to project subspace eigenvectors back into the problem space or to compute new vectors {Y} from a linear combination of old vectors {X}.