Program:
8:45 AM - 9 AM Welcome Address and Opening Remarks
9 AM - 10 AM "Scaling Resource Compositions in a Flatter World" Sudhakar Yalamanchili, Georgia Institute of Technology, USA
Following the end of Dennard scaling and the transition to multicore, we are now seeing an evolution to communication-centric architectures and systems. Data movement is more expensive in time and energy than compute, and as a consequence systems are undergoing another fundamental transformation to optimize data movement rather than compute. This transformation will percolate up through the software stacks and to clusters and data centers. This trend has been amplified by the emergence of big data as a major challenge for future systems. This talk will make some observations about the impact of technology trends on cluster architectures and offer some opinions on anticipated research problems. It will conclude with a description of our new project, Oncilla, an experimental platform where we explore data movement optimizations for data warehousing applications in the context of clusters that are architected to offer flexible compositions of heterogeneous compute and memory resources.
10 AM - 10:30 AM Coffee break
10:30 AM - 12 PM Technical Session 1: GPU-based computing
- "Direct GPU/FPGA Communication Via PCI Express", Ray Bittner, Erik Ruf
Parallel processing has hit mainstream computing in the form of CPUs, GPUs and FPGAs. While explorations proceed with all three platforms individually and with the CPU-GPU pair, little exploration has been performed with the synergy of GPU-FPGA. This is due in part to the cumbersome nature of communication between the two. This paper presents a mechanism for direct GPU-FPGA communication and characterizes its performance in a full hardware implementation.
- "Adapting Sparse Triangular Solution to GPUs", Brad Suchoski, Caleb Severn, Manu Shantharam and Padma Raghavan
High performance computing systems are increasingly incorporating hybrid CPU/GPU nodes to accelerate the rate at which floating point calculations can be performed for scientific applications. Currently, a key challenge is adapting scientific applications to such systems when the underlying computations are sparse, such as sparse linear solvers for the simulation of partial differential equation models using semi-implicit methods. Now, a key bottleneck is sparse triangular solution for solvers such as preconditioned conjugate gradients (PCG). We show that sparse triangular solution can be effectively mapped to GPUs by extracting very large degrees of fine-grained parallelism using graph coloring. We develop simple performance models to predict these effects at the intersection of the data and hardware attributes, and we evaluate our scheme on an NVIDIA Tesla M2090 GPU relative to the level set scheme developed at NVIDIA. Our results indicate that our approach significantly increases the available fine-grained parallelism, speeding up execution relative to the NVIDIA scheme by a factor with a geometric mean of 5.41 on a single GPU, with speedups as high as 63 in some cases.
- "MRF satellite image classification on GPU", Pedro Valero-Lara
One stage in the analysis of satellite images is classification based on the Markov Random Fields (MRF) method. Several packages in the literature carry out this analysis, including the classification tasks. One of them is the Orfeo ToolBox (OTB). The analysis of satellite images is a computationally expensive task that requires real-time execution or automation. Parallelism techniques can be used to reduce the execution time spent on this analysis. Graphics Processing Units (GPUs) are becoming a good choice for reducing the execution time of many applications at a low cost. In this paper, the author presents a GPU-based MRF classification derived from the sequential algorithm in the OTB package. The experimental results show a spectacular reduction of the execution time for the GPU-based algorithm, up to 225 times faster than the sequential algorithm included in the OTB package. Moreover, this result is also observed in the total power consumption, which is reduced by a significant amount.
12 PM - 1:30 PM Lunch
1:30 PM - 2:30 PM "More Speed Counts in Data-Centric Computing" Rainer Brendle, SAP, Palo Alto, USA
Are we on the eve of a new datacenter setup for the enterprise? Many indications point in this direction. The need for multi-core- and NUMA-aware architectures requires rethinking the classical scaling models of databases and application servers that are common in enterprise data-centric computing today. The growing gap between DRAM access speed and CPU compute power offers huge opportunities, which we can exploit with improved data and code locality. Locality becomes more and more important. There is a need to scale with an increasing amount of data and to serve more users at the same time in parallel. On the other side, a growing number of mobile devices requires consistently minimal response times for many concurrent users. Latencies and response times, not only bandwidth and throughput, become important. Techniques from high-performance computing (HPC) may come at the right time. But data-centric computing, as it dominates in the enterprise, and HPC are not the same. Data locality requires that databases be the central element of a future architecture. For this, today's database technology has to move to a distributed data platform. Fabric network infrastructures and high-performance storage using non-volatile memory concepts make it possible to scale with memory of various kinds in a memory hierarchy across computer clusters. Both shared-nothing and shared-everything database architectures have their drawbacks and limitations as a basis here. What are the options to move forward?
2:30 PM - 3 PM Technical Session 2: FPGA-based computing
- "Architecture and Applications for an All-FPGA Parallel Computer", Yamuna Rajasekhar and Ron Sass
3 PM - 3:30 PM Coffee break
3:30 PM - 4 PM Technical Session 3: Low-power based computing
- "Evaluating Performance and Energy of ARM-based Clusters for High Performance Computing", Edson L. Padoin, Daniel A. G. de Oliveira, Pedro Velho, Philippe O. A. Navaux
For many years, the High-Performance Computing (HPC) community aimed at increasing performance regardless of energy consumption. However, energy is limiting the scalability of next-generation supercomputers. Current HPC systems already consume huge amounts of power, on the order of a few megawatts (MW). Future HPC systems are intended to achieve 10 to 100 times more performance, but the accepted power budget for those machines must remain below 20 MW. Therefore, the scientific community is investigating ways to improve energy efficiency. This paper presents a study of the execution time, power consumption, maximum power and energy efficiency of ARM architectures. Our objective is to verify the feasibility of clusters using processors that target low power consumption. As a by-product of our research, we built an unconventional cluster of PandaBoards, each featuring two ARM Cortex A9 cores. We believe that these unconventional solutions offer an alternative basis for building HPC clusters that respect the limits of electric energy.
4 PM - 5 PM "Scaling the Von-Neumann Wall with Reconfigurable Computing" Joel Emer, Intel, USA
5 PM - 5:30 PM Closing Panel