"Why GPUs?"

NVIDIA Tesla V100 PCIe GPU

Graphics Processing Units (GPUs) were originally designed to accelerate the large number of multiply and add computations performed in graphics rendering. Packaged as a video card attached to the PCI bus, they offloaded this numerically intensive computation from the central processing unit (CPU). As the demand for high performance graphics grew, so did the capability of the GPU, which eventually became far more powerful than the CPU for this type of computation.

The simulation of engineering and scientific problems is closely related to the type of computation performed for graphics rendering. Both perform a large number of floating point multiply-add computations. However, there are two significant differences:

- In general purpose science and engineering, each data point is typically stored and processed with 64 bits (8 bytes) of precision. For graphics applications, 32 bits (4 bytes) are usually sufficient. The extra precision is needed because solving differential equations depends on the difference between neighboring data points, which can be small relative to the values themselves. For graphics applications, only the value at a data point is required.
- For graphics, the data is often recomputed several times a second, so an error in a data point is usually not noticeable. For scientific and engineering applications, the results of one computation are used for the next computation. As a result, an error in the value of one data point will usually render the analysis useless.
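
The effect of precision on the difference between neighboring data points can be seen in a small sketch. The two values below are illustrative assumptions, not data from any simulation, and the helper `to_float32` is a hypothetical name; it uses Python's `struct` module to round a 64-bit value to 32-bit storage.

```python
import struct

def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Two neighboring data points (illustrative values only).
a, b = 1000.0001, 1000.0002

diff64 = b - a                          # difference with 64-bit storage
diff32 = to_float32(b) - to_float32(a)  # difference after 32-bit storage

exact = 1.0e-4
rel_err64 = abs(diff64 - exact) / exact
rel_err32 = abs(diff32 - exact) / exact

print(f"64-bit difference: {diff64:.10e}  relative error: {rel_err64:.1e}")
print(f"32-bit difference: {diff32:.10e}  relative error: {rel_err32:.1e}")
```

Each value on its own is stored accurately to about 7 significant digits in 32 bits, yet the difference between them loses a large fraction of its accuracy. The 64-bit difference remains accurate to many digits.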

The NVIDIA Tesla family of GPUs is designed specifically for the engineering and scientific marketplace. These GPUs include native 64-bit precision in data storage, data paths and arithmetic units. In addition, they have error-correcting memory, which provides the reliability required for long simulations.

Multiple GPU Solutions

NVIDIA HGX-2 with 16 V100 GPUs

Many servers accept multiple GPU PCI cards. Most allow up to 4, with a few allowing 8. For larger configurations, NVIDIA offers a newer packaging in the form of cards that each include 8 GPU processors. This configuration, termed the HGX-1, provides a more integrated solution for a new line of servers. For even larger GPU configurations, the HGX-2 (shown above) includes two of these 8-GPU cards. Several vendors are making, or will be making, servers based on this technology.

NVIDIA also makes complete servers using the HGX-1 and HGX-2 components, termed the DGX-1 and DGX-2 respectively. They contain two CPUs, memory, internal storage and network adapters.

NVIDIA DGX-1 with 8 GPUs

NVIDIA DGX-2 with 16 GPUs

Comparing CPUs and GPUs

The time required for FMS to solve a problem depends on the speed of the processors in the computer. This speed is expressed in units of flops (floating point operations per second). However, with the speed of today's computers, units of Gflops (billions of flops) or Tflops (trillions of flops) are typically used.

The floating point performance of a processor may be computed as follows:

Performance = (Number of cores) X (Clock speed) X (Operations per core per clock)

Most processors today execute fused multiply-add (FMA) instructions, each of which performs two floating point operations (one multiply followed by one add). For a core that completes one FMA instruction per clock:

(Operations per core per clock) = 2
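
The formula can be written as a small function. This is a minimal sketch; the function name `peak_gflops` and the 4-core, 3.0 GHz processor in the usage line are hypothetical and chosen only for illustration.

```python
def peak_gflops(cores, clock_ghz, ops_per_core_per_clock=2):
    """Peak performance in Gflops.

    With the clock speed given in GHz (billions of cycles per second),
    the product is already in billions of floating point operations
    per second. The default of 2 corresponds to one FMA per clock.
    """
    return cores * clock_ghz * ops_per_core_per_clock

# Hypothetical processor: 4 cores at 3.0 GHz, one FMA per core per clock.
example = peak_gflops(cores=4, clock_ghz=3.0)
print(f"{example:.1f} Gflops")  # 4 x 3.0 x 2 = 24.0 Gflops
```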

The following table compares the performance of a general purpose CPU processor with those found in GPUs.

	CPU (Skylake)	GPU (V100 SXM2)
Number of processors	1	84
64-bit Cores per Processor	24	32
Total Number of cores	24	2688
Clock Speed (GHz)	2.7	1.38
Operations per core per clock	8x2	2
Performance (Gflops)	1037	7419

CPUs: CPU processors contain multiple cores, with each core capable of performing multiple fused multiply-add (FMA) operations every cycle. For 64-bit data, the number of FMA operations per core per clock is determined by the width of the vector (SIMD) registers:
- 128-bit, 2
- 256-bit, 4
- 512-bit, 8
In this example, a 24-core processor running at 2.7 GHz that supports AVX-512 instructions performs 8 FMA operations (16 floating point operations) per core per clock, for a peak performance of 1037 Gflops. The actual clock may be lower than 2.7 GHz if thermal limits are reached, which depends on the load across all 24 cores of the processor.
GPUs: Each GPU contains multiple processors, with each processor containing multiple cores. In this example, an 84-processor GPU with 32 64-bit cores per processor contains 2688 cores. Running at 1.38 GHz and performing fused multiply-add instructions, this GPU has a peak performance of 7419 Gflops.
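
The two peak figures in the table can be reproduced from its other rows. The sketch below assumes 64-bit operands, so a 512-bit AVX-512 instruction operates on 512 / 64 = 8 values at once; the names `fma_per_clock`, `cpu_gflops` and `gpu_gflops` are hypothetical.

```python
def fma_per_clock(width_bits, operand_bits=64):
    """FMA operations per instruction: how many operands fit in the given width."""
    return width_bits // operand_bits

FLOPS_PER_FMA = 2  # one multiply followed by one add

# CPU column: 1 processor x 24 cores, 2.7 GHz, AVX-512.
cpu_cores = 1 * 24
cpu_gflops = cpu_cores * 2.7 * fma_per_clock(512) * FLOPS_PER_FMA

# GPU column: 84 processors x 32 64-bit cores, 1.38 GHz,
# one FMA per core per clock.
gpu_cores = 84 * 32
gpu_gflops = gpu_cores * 1.38 * 1 * FLOPS_PER_FMA

print(f"CPU: {cpu_cores} cores, {cpu_gflops:.0f} Gflops")
print(f"GPU: {gpu_cores} cores, {gpu_gflops:.0f} Gflops")
print(f"GPU / CPU peak ratio: {gpu_gflops / cpu_gflops:.1f}")
```

Rounded to whole Gflops, the computed values agree with the table. These are peak rates derived from the listed clock speeds, not measured application performance.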

The superior floating point performance provided by GPUs is due to the large number of cores. Matrix algebra applications, including FMS, can divide their work evenly among a large number of cores, making GPUs an ideal processor for these applications.

Computers may have multiple CPU and GPU processors. Popular configurations include 2 CPUs and 1 to 16 GPUs.
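
For a rough sense of scale, the peak figures from the table can be summed over a configuration. This is a sketch only: it assumes every processor runs at its peak rate simultaneously, which applications generally do not sustain, and the function name `system_peak_gflops` is hypothetical.

```python
CPU_GFLOPS = 1037  # peak per CPU, from the table above
GPU_GFLOPS = 7419  # peak per GPU, from the table above

def system_peak_gflops(num_cpus, num_gpus):
    """Sum of per-processor peak rates; an upper bound, not a measured speed."""
    return num_cpus * CPU_GFLOPS + num_gpus * GPU_GFLOPS

for gpus in (1, 4, 8, 16):
    total = system_peak_gflops(2, gpus)
    print(f"2 CPUs + {gpus:2d} GPUs: {total / 1000:.1f} Tflops peak")
```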

The architecture of GPUs offers the following benefits:

Faster Processing
Each GPU provides several times the peak floating point performance of a general purpose CPU processor (about 7 times in the comparison above). The result is faster solution times and the ability to solve larger problems.
Lower capital cost
GPUs provide an order of magnitude or more in processing power for the same capital cost.
Reduced power consumption
The efficient architecture of GPUs delivers more floating point operations per watt of power consumed.