A war is shaping up over the lowly PCI slot. The slot that holds everything from network to audio cards, and can trace its origins back to the beginning of the PC age, is emerging as the latest battleground for HPC. GPUs and Intel's Phi are contending for the privilege of delivering FLOPS to your applications, and on the not-too-distant horizon is competition from FPGAs. Chip vendors for all three have large markets outside of HPC to leverage and stabilize their business. This is the principal reason they are the viable options today: they don't depend on HPC to survive. As heterogeneous computing has become mainstream computing, businesses and academic groups that count on HPC need to assess their development strategies and balance the risk of making a technology commitment against the risk of inaction and falling behind competitors.
GPUs, Phi and FPGAs
CUDA, Nvidia's language for GPU computing, turned six this year, defying early critics of the technology. It has bridged three different hardware generations and by most objective measures has been a notable success. Birthing a new HPC platform is no easy trick. Spurred by Nvidia's investment and rapid user adoption, CUDA itself and the underlying hardware have shown the signs of a maturing technology: continual evolution and improvement. Intel's Phi is now unveiled. While generation one may not be the GPU killer some hoped for or feared, experience cautions never to count Intel out. Some of the smartest people I know work there, and the company has money for the long haul. Finally, keep an eye on ACL, Altera's OpenCL compiler targeting their line of high-performance FPGAs. I've been following the FPGA world for over a decade and have seen many attempts to bridge the so-called programmability gap with C-to-gates compilers. Most have failed to make any impact. All have failed to make FPGAs common HPC components. ACL has an excellent chance to do both. While it's difficult to predict which HPC architectures will dominate 10 years down the road, I think it's possible to discern the broad common features, those driven by the physics.
Here come the cores
First, it's clear that HPC platforms in the near term will deliver many relatively low-power cores. "Many" here means thousands to tens of thousands. My opinion is that the programming models that capture this massive parallelism and allow its natural expression are the ones that will stay relevant; CUDA and OpenCL are the best examples we have today. The traditional approach to parallelism in scientific computing is domain decomposition, i.e., divide the physical problem into separate domains and assign each to a different core. This works well for two cores, or maybe even eight, but not for thousands of cores on the same chip. With thousands of cores the domains are small and the ratio of calculation to communication becomes unfavorable, reducing efficiency. In addition, the cores interact asynchronously and work is not distributed in precisely equal amounts: when core A completes its work it notifies its neighbors, and likewise for every other core. This asynchronicity is inefficient and becomes more significant with scale. I believe the fine-grained parallelism enforced by CUDA and OpenCL is a better programming model for the imminent many-core world. It doesn't eliminate domain decomposition, but it pushes decomposition up to a more manageable level: thousands of cores work together in gangs that read, write and compute on domain data in concert.
Cache me if you can
Second, cache hierarchies have become more and more complex, and that trend will continue. Stacked DRAM, for example, offers yet another way station for bits traveling between compute cores and main memory. Traditional x86 architectures tend to hide the complexity of data movement and placement through the cache hierarchy from the developer to increase programmability. However, optimizing performance requires reverse-engineering this data movement, and experienced developers often prefer to manage it explicitly. I believe the programming models that thrive will be the ones that allow this close control over data placement. A related concept is the memory wall: the disparity between the time it takes to move data into a core and the time it takes to calculate with it. The memory wall will remain, and the layers of fast intermediate caches only help when data reuse is possible. To achieve peak performance and alleviate the memory wall, successful future hardware architectures will be those that effectively hide the latency of memory retrievals by juggling threads, executing those that have all required data on hand and parking those that are waiting on transfers.
Go for scaleup
Finally, I think more attention will be paid to multi-X solutions, where X can be CPU nodes, GPUs, Phis or FPGAs. The largest industry and academic problems will always require the resources of more than one compute entity, and coordinating the efforts of multiple entities is still a difficult and painful process. Typically it involves a tiered approach, with MPI layered on top of a threading solution (e.g., OpenMP or pthreads). Moving data across compute domains is a complicated, application-specific effort that requires very careful bookkeeping and close attention to compute and communication patterns. With MPI soon turning 20, I have to believe that more sophisticated solutions will appear in the coming years. Ideally, developers would not need to know how many compute entities are dedicated to their problem; they would view the platform as one large virtual machine without regard to its discrete components. I know this is a difficult problem, and perhaps this section is more of a plaintive plea to our colleagues in the CS departments, but it's hard for me to imagine that in another 20 years we will still be issuing explicit MPI_SEND() and MPI_RECV() calls, hand-orchestrating the flow of data around huge collections of machines.
He who hesitates
More cores, more layers of memory with more developer control over data placement, and better support for multi-node solutions: I think these are the broad common features of next-generation HPC platforms. What does this mean for companies and academic groups that depend on HPC for advances in their field? Since every future compute platform will involve many, many cores, developers should start thinking about their problems in terms of fine-grained parallelism if they have not already done so. Whether Phi, GPU, FPGA or some new platform, the architectures share common traits dictated by the constraints of semiconductor physics, device speeds and capacities. These new architectures look more like each other than they do the large-core architectures of the present. We are midstream in the transition from scalar processing to many-core processing, and there is no turning back. Uncertainty about which technology will win the slot wars is no longer a good excuse for doing nothing. There are enough success stories, across a wide enough variety of solved problems, to find one that matches yours, and the only certain way to be unprepared for tomorrow's architecture is to wait indefinitely until all the dust has settled.