NVIDIA Tesla Pascal P100 GPU
Graphics Processing Units (GPUs) were originally designed to accelerate the large number of multiply and add computations performed in graphics rendering. Packaged as a video card attached to the PCI bus, they offloaded this numerically intensive computation from the central processing unit (CPU). As the demand for high performance graphics grew, so did the GPU, eventually becoming far more powerful than the CPU.
The simulation of engineering and scientific problems is closely related to the type of computation performed for graphics rendering. Both perform a large number of floating point multiply-add computations. However, there are two significant differences:
- In general purpose science and engineering, the information stored and processed at each data point requires 64 bits (8 bytes) of precision. For graphics applications, 32 bits (4 bytes) are usually sufficient. The extra precision is needed when solving differential equations, where the difference between neighboring data points is significant. For graphics applications only the value at a data point is required.
- For graphics the data is often recomputed several times a second. An error in a data point is usually not noticeable. For scientific and engineering applications, the results of one computation are used for the next computation. As a result, an error in computing the value of one data point usually will render the analysis useless.
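The precision point above can be illustrated with a small sketch. It is not taken from FMS; the helper name `to_float32` and the sample values are assumptions chosen for illustration. Two data points that differ by one part in a billion are distinguishable in 64-bit arithmetic, but become identical once stored in 32 bits, so their difference is lost.

```python
import struct


def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float (hypothetical helper)."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Two neighboring data points that differ only slightly (illustrative values).
a = 1.0
b = 1.0 + 1e-9

diff64 = b - a                          # 64-bit arithmetic keeps the difference
diff32 = to_float32(b) - to_float32(a)  # 32-bit storage rounds both values to 1.0

print(f"64-bit difference: {diff64:.3e}")
print(f"32-bit difference: {diff32:.3e}")
```

In a differential equation solver such differences feed the next step of the computation, which is why the lost digits matter there and not in a rendered image.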
Comparing CPUs and GPUs
The time required for FMS to solve a problem depends on the speed of the processors in the computer. This speed is expressed in units of flops (floating point operations per second). Given the speed of today's computers, however, units of Gflops (billions of flops) or Tflops (trillions of flops) are typically used.
The floating point performance of a processor may be computed as follows:
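Performance = (Number of cores) × (Clock speed) × (Operations per core per clock)
where (Operations per core per clock) = 2, because a fused multiply-add instruction counts as two floating point operations (one multiply and one add).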
The following table compares the performance of a general purpose CPU processor with those found in GPUs.
| |CPU|GPU|
|---|---|---|
|Number of processors|1|56|
|Cores per processor|16|32|
|Total number of cores|16|1792|
|Clock speed (GHz)|3.0|1.48|
|Flops per core per clock|2|2|
- CPUs: CPU processors contain multiple cores, with each core capable of performing a fused multiply-add instruction every cycle. In this example a 16-core processor running at 3.0 GHz performing fused multiply-add instructions has a peak performance of 96 Gflops.
- GPUs: Each GPU contains multiple processors, with each processor containing multiple cores. In this example a 56-processor GPU with 32 cores per processor contains 1792 cores. This GPU running at 1.48 GHz performing fused multiply-add instructions has a peak performance of about 5,300 Gflops (5.3 Tflops).
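The peak performance formula can be checked against the two examples with a short sketch. The function name `peak_gflops` and its parameter names are illustrative assumptions; the input values are the ones quoted above.

```python
def peak_gflops(total_cores: int, clock_ghz: float,
                ops_per_core_per_clock: int = 2) -> float:
    """Peak performance in Gflops: cores x clock (GHz) x operations per core per clock."""
    return total_cores * clock_ghz * ops_per_core_per_clock


# CPU example: 1 processor x 16 cores at 3.0 GHz.
cpu = peak_gflops(total_cores=1 * 16, clock_ghz=3.0)

# GPU example: 56 processors x 32 cores at 1.48 GHz.
gpu = peak_gflops(total_cores=56 * 32, clock_ghz=1.48)

print(f"CPU peak: {cpu:.0f} Gflops")
print(f"GPU peak: {gpu:.0f} Gflops")
print(f"GPU/CPU ratio: {gpu / cpu:.1f}x")
```

The GPU figure works out to roughly 5,304 Gflops, which the text rounds to 5,300. These are peak numbers; sustained performance depends on the application.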
The superior floating point performance provided by GPUs is due to the large number of cores. Matrix algebra applications, including FMS, can divide their work evenly among a large number of cores, making GPUs an ideal processor for these applications.
Computers may have multiple CPU and GPU processors. Popular configurations include 2 CPUs and 1 to 8 GPUs.
The architecture of GPUs offers the following benefits:
- Faster Processing
Each GPU provides an order of magnitude or more in performance over general purpose CPU processors. The result is faster solution times and the ability to solve large problems.
- Lower capital cost
GPUs provide an order of magnitude or more in processing power for the same capital cost.
- Reduced power consumption
The efficient architecture of GPUs performs more floating point operations per watt of power consumed.
A Workstation Example
Two NVIDIA Fermi GPUs were benchmarked in a workstation. Based on actual performance and costs, the following chart shows the performance and cost/performance of adding GPUs to a system.
The chart above illustrates two key points:
- GPUs lower computational cost.
A typical workstation with CPUs only configured for FMS computation will cost about $200 per Gflop of performance. GPUs, however, cost less than $9 per Gflop of performance. The difference is due to the large number of multiply-adder units on the GPU processor. Adding 2 GPUs to the workstation lowered the cost of a Gflop of performance from $200 to $25. For FMS applications this can lower machine cost by a factor of 8 or provide 8 times the performance for the same cost. GPUs provide a similar reduction in power consumption, cooling and space requirements.
- GPUs increase performance.
The performance of the workstation without GPUs was 80 Gflops. The performance with 2 GPUs was 660 Gflops, an increase of more than 8 times. The GPUs extended the performance beyond what is possible with CPUs alone at any cost. Note that FMS operates the CPUs and GPUs in parallel, so the total performance includes the contribution from both types of processors.
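The two factors of 8 quoted for the workstation follow directly from the figures in the text. The sketch below only repeats that arithmetic; the variable names are illustrative assumptions.

```python
# Workstation figures as quoted in the text.
cpu_only_gflops = 80.0
with_gpus_gflops = 660.0
cpu_only_cost_per_gflop = 200.0   # dollars per Gflop, CPUs only
with_gpus_cost_per_gflop = 25.0   # dollars per Gflop, CPUs plus 2 GPUs

# Performance gain from adding the two GPUs.
speedup = with_gpus_gflops / cpu_only_gflops

# Reduction in the cost of each Gflop of performance.
cost_reduction = cpu_only_cost_per_gflop / with_gpus_cost_per_gflop

print(f"Performance increase: {speedup:.2f}x")
print(f"Cost per Gflop reduced by a factor of {cost_reduction:.0f}")
```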
A Server Example
GPUs can extend server performance far beyond that which can be obtained with CPUs alone. The following example is a server having 8 CPUs. While several CPU options are available, the numbers shown are an average. The server achieved 435 Gflops of performance at a cost of $211 per Gflop.
First, 2 NVIDIA Fermi GPUs were installed in the PCI slots inside the server. The performance increased to over 1,000 Gflops (1 Tflop) while the cost/performance improved to $90 per Gflop.
Next, the number of GPUs was increased to 8. The resulting performance increased to 2,800 Gflops (2.8 Tflops) and the price/performance improved to $42/Gflop.
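The three server configurations can be compared side by side with a short sketch. The figures are the ones quoted in the text, with "over 1,000 Gflops" approximated as exactly 1000; the configuration labels and variable names are assumptions made for illustration.

```python
# (configuration, Gflops, dollars per Gflop) as quoted in the text.
# "over 1,000 Gflops" is approximated here as exactly 1000.
configs = [
    ("8 CPUs", 435, 211),
    ("8 CPUs + 2 GPUs", 1000, 90),
    ("8 CPUs + 8 GPUs", 2800, 42),
]

base_name, base_gflops, base_cost = configs[0]

rows = []
for name, gflops, cost_per_gflop in configs:
    speedup = gflops / base_gflops          # performance relative to CPUs only
    cost_gain = base_cost / cost_per_gflop  # how much cheaper each Gflop became
    rows.append((name, speedup, cost_gain))
    print(f"{name:<16} {speedup:5.2f}x faster  {cost_gain:5.2f}x cheaper per Gflop")
```

With 8 GPUs the server delivers roughly 6.4 times the CPU-only performance at about one fifth of the cost per Gflop.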
This example shows the power of GPUs in increasing performance and reducing costs.