Integer
0
This is the master flag for activating all the NonUniform Memory Access (NUMA) directives. You must set this parameter in the FMS license file in order to activate any of the NUMA technology.
The NUMA directives discussed below are a powerful set of programming tools for achieving maximum performance. These are the directives which FMS uses internally to achieve peak performance. However, you may also use these directives in your own application.
Many of the NUMA features used by FMS work best when the machine is dedicated to FMS applications. This may be a single application or multiple applications where each one is explicitly placed. Running more than one application on a group of processors, or letting the operating system schedule other tasks on top of an FMS NUMA job, may significantly degrade performance.
The fundamental building block of a NUMA machine is the "node". On Compaq systems, this is referred to as a Resource Affinity Domain (RAD) and on SGI it is referred to as a Memory Locality Domain (MLD). Within the node, all processors share a common memory with high-speed uniform access. Programming at this level is similar to Symmetric MultiProcessor (SMP) programming techniques. Currently the number of processors per node is 2 or 4, but this number may increase in the future. The FMS Parameter NPNODE defines the number of processors per node. It defaults to a value appropriate for your machine (except for the SGI Origin 2000, where it must be explicitly set to 2 in the license file).
The number of nodes, NUMNOD, is obtained by dividing the number of processors being used, MAXCPU, by the number of processors per node, NPNODE. FMS computes this parameter automatically.
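As a quick check of the arithmetic, the relationship between MAXCPU, NPNODE and NUMNOD can be sketched in Python. The function name is hypothetical and not part of FMS, and rounding up when MAXCPU is not a multiple of NPNODE is an assumption; FMS computes the actual value itself.

```python
def number_of_nodes(maxcpu: int, npnode: int) -> int:
    """Return NUMNOD for MAXCPU processors with NPNODE processors per node.

    Assumption: a partly used node still counts as one node, so the
    division rounds up. FMS may treat this case differently.
    """
    if maxcpu < 1 or npnode < 1:
        raise ValueError("MAXCPU and NPNODE must be positive")
    return -(-maxcpu // npnode)  # ceiling division on integers


if __name__ == "__main__":
    # 8 processors on a machine with 4 processors per node
    print(number_of_nodes(8, 4))
    # 8 processors on a machine with 2 processors per node
    print(number_of_nodes(8, 2))
```

With 8 processors, a machine with 4 processors per node uses 2 nodes, and a machine with 2 processors per node uses 4 nodes.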
In order to achieve peak performance on a NUMA machine, the software must maximize references to local memory and minimize references to remote memory. This requires close coordination between which processor runs the thread and where the memory is placed. In addition, the software must be designed to distribute the data in a fashion similar to programming techniques used on distributed memory computers.
To achieve peak performance, FMS implements the following features:
When FMS places memory using NUMA directives, the requested memory is dealt out round robin among the nodes using a stride that is specified by the FMS Parameter MAXLMD. For example, the records used to hold matrix data use a stride that evenly distributes the record among the nodes being used for the problem. If you allocate memory for your application using one of the FMS memory management routines FMSIMG, FMSRMG or FMSCMG, and make the call from the parent, your memory will be distributed according to the specified value of MAXLMD. If you call the FMS memory management routines to allocate memory from a child thread, all the memory requested will be allocated on the node where that thread is running. This provides a simple, yet effective, way for you to control where your data resides on the NUMA machine.
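The round-robin placement described above can be pictured with a small Python sketch. It is a model only and makes no FMS calls. The function names are hypothetical, nodes are numbered from 0 for convenience, and the unit of the stride (words here) is an assumption; the source does not state the unit in which MAXLMD is expressed.

```python
def node_for_offset(offset: int, stride: int, numnod: int) -> int:
    """Node (0-based, an assumption) that holds the word at `offset`.

    Memory is dealt out in chunks of `stride` words, one chunk per node,
    wrapping back to the first node after the last one.
    """
    return (offset // stride) % numnod


def even_stride(record_length: int, numnod: int) -> int:
    """Stride that spreads one record evenly over `numnod` nodes.

    Rounds up so the whole record fits in a single pass over the nodes.
    """
    return -(-record_length // numnod)


if __name__ == "__main__":
    # A 12-word record on 3 nodes with an evenly distributing stride of 4:
    stride = even_stride(12, 3)
    print([node_for_offset(i, stride, 3) for i in range(12)])
    # The same record with a smaller stride of 2 wraps around the nodes twice:
    print([node_for_offset(i, 2, 3) for i in range(12)])
```

With the even stride, each node holds one contiguous third of the record. With the smaller stride, each node holds two separate pieces, which is why the choice of stride should follow the way the threads will access the data.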
For some applications, a single distribution of data will be optimal for the entire job. Most FMS applications, however, go through different phases, with each phase requiring a different data distribution. For example, an application may form matrix data, factor, solve and process results as distinct phases. FMS includes the Parameter MDWHEN that controls when the memory is placed. When the NUMA flag is set, memory is placed as it is required. When the NUMA flag is not set, all the memory is allocated once at the beginning.
FMS includes options for attaching each thread to an individual processor or node. The NUMAFX Parameter controls these options. When the NUMA flag is set, this parameter automatically defaults to the optimum value for each machine. The threads are placed in increasing order, starting on processor MYCPU1 (1 is the first CPU in the machine), and continuing for MAXCPU processors. You may obtain the thread number MYCPU of a subroutine you have running in parallel by calling FMSIGT('MYNODE', MYCPU). Knowing the thread number and how the memory was distributed, you can determine what calculations should be performed on the local data.
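The placement order can be tabulated with a short Python sketch. The function name is hypothetical and nothing here calls FMS. Two assumptions are made that the source does not confirm: thread numbers start at 1, and nodes are consecutive groups of NPNODE processors beginning with processor 1.

```python
def placement(mycpu1: int, maxcpu: int, npnode: int) -> list[tuple[int, int, int]]:
    """Return (thread, processor, node) for each thread, all 1-based.

    Threads are placed in increasing order starting on processor MYCPU1
    and continuing for MAXCPU processors.
    Assumption: node k owns processors (k-1)*NPNODE+1 .. k*NPNODE.
    """
    table = []
    for thread in range(1, maxcpu + 1):
        processor = mycpu1 + thread - 1
        node = (processor - 1) // npnode + 1
        table.append((thread, processor, node))
    return table


if __name__ == "__main__":
    # 4 threads starting on the first processor, 2 processors per node
    for thread, processor, node in placement(1, 4, 2):
        print(f"thread {thread} -> processor {processor}, node {node}")
```

Under these assumptions, threads 1 and 2 share node 1 and threads 3 and 4 share node 2, so data intended for threads 3 and 4 should be placed on node 2.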
One simple case is filling a matrix. First, you call the routine that fills the matrix in parallel using FMSPAR. At the beginning of the subroutine, you find out which thread you are by calling FMSIGT('MYNODE', MYCPU). Next, you call one or more of the FMS memory management routines FMSIMG, FMSRMG or FMSCMG to allocate memory for your data. Because these routines are being called from a thread, they will automatically allocate the memory on the local node. After performing your part of the computation, you return the memory and return from your subroutine.
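The order of these steps can be sketched in Python as an analogy. This is not FMS code: Python's thread pool stands in for FMSPAR, the worker argument stands in for the thread number returned by FMSIGT, and an ordinary list allocation stands in for FMSRMG. Python gives no control over NUMA placement, so only the structure carries over (identify the thread, allocate inside the worker, compute on the local block, release the memory). All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def fill_rows(thread: int, nthreads: int, n: int) -> tuple[int, int, float]:
    """Fill this thread's block of rows of an n-by-n matrix.

    `thread` is 0-based here (an assumption made for simple arithmetic).
    """
    # Step 1: work out which rows belong to this thread.
    first = (thread * n) // nthreads
    last = ((thread + 1) * n) // nthreads

    # Step 2: allocate inside the worker. In FMS, allocating from a child
    # thread is what places the memory on the local node.
    block = [[0.0] * n for _ in range(first, last)]

    # Step 3: perform this thread's part of the computation.
    for local, i in enumerate(range(first, last)):
        for j in range(n):
            block[local][j] = float(i * n + j)
    checksum = sum(map(sum, block))

    # Step 4: return the memory before leaving the subroutine.
    del block
    return first, last, checksum


def fill_in_parallel(n: int, nthreads: int) -> list[tuple[int, int, float]]:
    """Run fill_rows once per thread and collect the results in thread order."""
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        return list(pool.map(lambda t: fill_rows(t, nthreads, n), range(nthreads)))


if __name__ == "__main__":
    for first, last, checksum in fill_in_parallel(4, 2):
        print(f"rows {first}..{last - 1}: checksum {checksum}")
```

Each worker touches only the rows it allocated, which is the property that keeps memory references local when the same pattern is used with the FMS routines.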