Dissertation Defense Schedule
Academic Excellence
Sharing original dissertation research is a principle to which the University of Delaware is deeply committed. It is the single most important assignment our graduate students undertake and upon completion is met with great pride.
We invite you to celebrate this milestone by attending their dissertation defense. Please review the upcoming dissertation defense schedule below and join us!
PROGRAM | Electrical and Computer Engineering
Toward High Performance and Energy Efficiency on Many-Core Architectures
By: Elkin Garcia Chair: Guang Gao
ABSTRACT
Recent attempts to build peta-scale and exa-scale systems have precipitated the development of new processors with hundreds, or even thousands, of independent processing units. This many-core era has brought new challenges in several fields, including computer architecture, algorithm design, and operating systems. Addressing these challenges requires new paradigms that depart from well-established methodologies for traditional serial architectures.
These new many-core architectures are characterized not only by the large number of processing elements but also by the large number and heterogeneity of resources. This new environment has prompted the development of new techniques that seek finer granularity and a greater interplay in the sharing of resources. As a result, several elements of computer systems and algorithm design need to be re-evaluated under these new scenarios, including runtime systems, scheduling schemes, and compiler transformations.
The number of transistors on a chip continues to grow following Moore’s law, but single-processor architectures manufactured by the main vendors in the late 1990s had trouble taking advantage of the increasing number of transistors. As a consequence, computer architecture has become extremely parallel at all levels. Designs with many simpler processing elements have been preferred over those with fewer, more complex and powerful ones. Two main challenges have arisen for the algorithms implemented on these modern many-core architectures: (1) Shared resources have become the norm, ranging from the memory hierarchy and the interconnections between processing elements and memory to arithmetic blocks such as double-precision floating-point units. Different mechanisms at the software and hardware levels are used to arbitrate these shared resources, and they need to be considered in the scheduling and orchestration of tasks. (2) In order to take advantage of the increasing amount of parallelism available, the number of tasks has increased and tasks have become finer, imposing new challenges for light and balanced scheduling subject to resource and energy constraints.
The research proposed in this thesis will provide an analysis of these new scenarios, proposing new methodologies and solutions that address these new challenges in order to increase the performance and energy efficiency of modern many-core architectures. In pursuit of these objectives, this research intends to answer the following questions:
1. What is the impact of low-level compiler transformations such as tiling and percolation on effectively producing high performance code for many-core architectures?
2. What are the tradeoffs of static and dynamic scheduling techniques to efficiently schedule fine grain tasks with hundreds of threads sharing multiple resources under different conditions in a single chip?
3. Which hardware architecture features can contribute to better scalability and higher performance of scheduling techniques on single-chip many-core architectures?
4. How can high performance programs on many-core architectures be modeled effectively under resource coordination conditions?
5. How can energy consumption on many-cores be modeled efficiently while managing tradeoffs between scalability and accuracy?
6. What are feasible methodologies for designing power-aware tiling transformations on many-core architectures?
So far, this thesis establishes a clear methodology for answering these questions. It addresses the research questions raised and supports the claims and observations made throughout this document with several experiments.
We have shown the importance of tiling using dense matrix multiplication on the Cyclops-64 many-core architecture as an example. This technique alone is able to increase the performance from 3.16 GFLOPS to 30.42 GFLOPS. This performance was further improved using Instruction Scheduling and other Architecture specific optimizations reaching 44.12 GFLOPS. Later, with the use of Percolation, the new performance was 58.23 GFLOPS.
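As a rough illustration of the tiling transformation mentioned above, the following minimal sketch compares a naive triple loop with a blocked (tiled) loop nest. It is written in plain Python for readability and is not the Cyclops-64 implementation from the thesis; the function names and the `tile` parameter are hypothetical. The point is only the loop structure: each tile is small enough to be reused from fast memory before the computation moves on.

```python
def matmul_naive(a, b):
    """Reference triple-loop matrix product of two lists of lists."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same product, computed tile by tile (tile size is an assumed parameter)."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    # The three outer loops walk over tiles; the three inner loops stay
    # inside one tile, so its operands can be reused while they are "hot".
    for ii in range(0, n, tile):
        for jj in range(0, p, tile):
            for kk in range(0, m, tile):
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, p)):
                        acc = c[i][j]
                        for k in range(kk, min(kk + tile, m)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c
```

Both functions return the same result; only the order in which the operands are visited changes, which is what affects memory traffic on a real machine.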
We have also shown how Dynamic Scheduling can outperform a highly balanced Static Scheduling on a Matrix Multiplication. For this case, we were able to increase the performance from 58.23 GFLOPS to 70.87 GFLOPS on SRAM and from 38.73 GFLOPS to 56.26 GFLOPS on DRAM using Dynamic Percolation. These results far exceed any previously published result for this architecture and approach the theoretical peak performance of 80 GFLOPS.
We demonstrated how Dynamic Scheduling can outperform Static Scheduling with regard to performance in two additional applications. First, the tradeoffs of Static Scheduling (SS) vs. Dynamic Scheduling (DS) are exposed using a Memory Copy microbenchmark. In scenarios with a small number of hardware threads (e.g., fewer than 48), SS outperforms DS because SS is able to produce a balanced workload with minimum overhead. However, increasing the number of Thread Units makes the SS schedule highly unbalanced, which degrades performance. DS is a feasible solution for managing these complex scenarios: it produces balanced workloads on more than a hundred Thread Units with light overhead, which allows doubling the performance in some cases. Second, Sparse Vector Matrix Multiplication (SpVMM) was used to show the tradeoffs of SS vs. DS under task heterogeneity, which was controlled through the variance of the sparsity distribution of the matrix. In addition, we explained how the advantages of DS are further improved by a low-overhead implementation that uses mechanisms provided by the architecture, particularly in-memory atomic operations, to diminish the overall overhead of DS. As a result, DS can remain efficient for finer task granularities.
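The SS vs. DS tradeoff described above can be sketched with a small, deterministic model. This is an illustrative assumption, not the benchmark code from the thesis: `static_makespan` gives each worker one contiguous block of tasks, while `dynamic_makespan` lets whichever worker becomes free first take the next task, as a shared fetch-and-add counter would, and charges a hypothetical per-task `overhead`.

```python
import heapq


def static_makespan(costs, workers):
    """Finish time when tasks are split into equal-sized contiguous blocks."""
    chunk = -(-len(costs) // workers)  # ceiling division
    return max(sum(costs[w * chunk:(w + 1) * chunk]) for w in range(workers))


def dynamic_makespan(costs, workers, overhead=0.0):
    """Finish time when the next free worker always grabs the next task.

    `overhead` is an assumed fixed cost per task for claiming work from
    the shared queue (for example, one atomic increment).
    """
    free_at = [0.0] * workers          # time at which each worker is idle
    heapq.heapify(free_at)
    for cost in costs:
        start = heapq.heappop(free_at)  # earliest available worker
        heapq.heappush(free_at, start + overhead + cost)
    return max(free_at)


uniform = [1.0] * 8                     # homogeneous tasks
skewed = [4.0, 4.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]  # heterogeneous tasks

ss_uniform = static_makespan(uniform, 4)
ds_uniform = dynamic_makespan(uniform, 4, overhead=0.1)
ss_skewed = static_makespan(skewed, 4)
ds_skewed = dynamic_makespan(skewed, 4, overhead=0.1)
```

In this toy model the static schedule wins on the uniform workload because it pays no overhead, while the dynamic schedule wins on the skewed workload because no worker is left holding both expensive tasks. Lowering `overhead` shrinks the penalty on the uniform case, which mirrors the argument for a low-overhead DS implementation.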
We have demonstrated a technique to model the performance of parallel applications on many-core architectures with resource coordination conditions. Our approach, based on timed Petri nets, results in algorithm-specific models that allow us to account for the resource constraints of the system and the needs of the algorithm itself. With our approach, we were able to model the performance of a dense matrix multiplication algorithm and a finite difference time-domain (FDTD) solution in 1D and 2D with a high degree of accuracy: an average error of 4.4% with respect to the actual performance of the algorithms.
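To give a sense of how a timed Petri net can capture resource contention, here is a minimal sketch of a simulator. It is not the model from the thesis: the firing rule (input tokens are consumed when a transition starts and output tokens appear after its delay), the place names, and the example net are all assumptions chosen for illustration.

```python
import heapq


def simulate(marking, transitions):
    """Run a timed Petri net until no transition can fire.

    marking:     dict mapping place name -> token count
    transitions: list of (name, inputs, outputs, delay), where inputs and
                 outputs are dicts mapping place name -> token count
    Every transition must have at least one input place, otherwise the
    net would fire forever. Returns (finish_time, final_marking).
    """
    marking = dict(marking)
    now = 0.0
    pending = []   # heap of (finish_time, sequence_number, outputs)
    seq = 0
    while True:
        fired = True
        while fired:               # start every firing that is enabled now
            fired = False
            for _name, inputs, outputs, delay in transitions:
                if all(marking.get(p, 0) >= n for p, n in inputs.items()):
                    for p, n in inputs.items():
                        marking[p] -= n
                    heapq.heappush(pending, (now + delay, seq, outputs))
                    seq += 1
                    fired = True
        if not pending:
            return now, marking
        now, _, outputs = heapq.heappop(pending)   # advance to next completion
        for p, n in outputs.items():
            marking[p] = marking.get(p, 0) + n


# Hypothetical net: 6 ready tasks compete for 2 memory ports; each access
# holds a port for 3 time units and then releases it.
access = ("access", {"ready": 1, "port": 1}, {"done": 1, "port": 1}, 3.0)
finish, final = simulate({"ready": 6, "port": 2, "done": 0}, [access])
```

With two ports the six accesses complete in three rounds of 3 time units each. Changing the number of `port` tokens shows how the predicted time responds to the amount of shared resource, which is the kind of question such a model is meant to answer.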
Finally, we demonstrated how to use our approach to performance modeling to investigate, develop, and tune algorithms for modern many-core architectures: we compared two different tiling strategies for the FDTD kernel and tested two different algorithms for LU Factorization.
We also proposed a general methodology for designing tiling techniques for energy efficient applications. The methodology is based on an optimization problem that produces an optimal tiling and tile traversal sequence, minimizing the energy consumed; the problem is parametrized by the sizes of each level in the memory hierarchy. We also showed two different techniques for solving the optimization problem for two different applications: Matrix Multiplication (MM) and Finite Difference Time Domain (FDTD). Our experimental evaluation shows that the proposed techniques reduce the total energy consumption effectively, decreasing both the static and dynamic components. Compared with the naive tiling, the average energy saving is 61.21% for MM and 81.26% for FDTD.
We studied and implemented several optimizations to target energy efficiency on many-core architectures with software managed memory hierarchies, using LU factorization as a case study. Starting with an implementation optimized for High Performance, we analyzed the impact of these optimizations on the Static Energy Es, Dynamic Energy Ed, Total Energy Et and Power Efficiency using the energy model previously developed. We designed and applied further optimization strategies at the instruction level and task level to directly target the reduction of Static and Dynamic Energy and indirectly increase the Power Efficiency. We designed and implemented an energy aware tiling to decrease the Dynamic Energy. The proposed tiling minimizes the energy contribution of the most power hungry instructions. The proposed optimizations for energy efficiency increase the power efficiency of the LU factorization benchmark by 1.68X to 4.87X, depending on the problem size, with respect to a highly optimized version designed for performance. In addition, we point out examples of optimizations that scale in performance but not necessarily in power efficiency.
Finally, we showed tradeoffs between performance and energy optimizations for many-core architectures. We explained the partial relation between performance and energy consumption through the Static Energy and execution time. We detailed some reasons why energy optimizations are more challenging than performance optimizations: a) Performance optimizations directly target only the Static Energy component, with diminishing benefits for the total energy consumption; b) Some performance optimizations can negatively affect the Dynamic Energy component, further diminishing the benefits for total energy; and c) Latency can be hidden but energy cannot: while multiple performance optimizations target a better use of resources by reordering instructions, computations or tasks in order to hide latency, the amount of work performed and the associated energy remain the same. All these reasons motivate a deeper look at strategies that optimize Dynamic Energy, such as the Power Aware Tiling. Last, we showed how to exploit tradeoffs between performance and energy using a parametric power aware tiling on a parallel matrix multiplication. We achieved a 42% energy saving at the cost of a 10% decrease in performance by using a rectangular tiling instead of a square tiling.
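The argument that hiding latency reduces only part of the energy can be made concrete with a toy calculation. The sketch below assumes a simple model that is not taken from the thesis: static energy is static power multiplied by execution time, and dynamic energy is the sum of per-operation energies. All numbers, operation names, and the function `total_energy` are hypothetical.

```python
def total_energy(static_power_w, time_s, op_counts, op_energy_j):
    """Return (E_s, E_d, E_t) under the assumed model.

    E_s = static power * execution time
    E_d = sum over operation types of count * energy per operation
    """
    e_static = static_power_w * time_s
    e_dynamic = sum(n * op_energy_j[op] for op, n in op_counts.items())
    return e_static, e_dynamic, e_static + e_dynamic


# Made-up per-operation energies (joules) and operation counts.
op_energy = {"flop": 0.01, "dram_load": 0.5}
ops = {"flop": 1000, "dram_load": 100}

# Baseline run: 2 W static power for 10 s.
base = total_energy(2.0, 10.0, ops, op_energy)

# Performance optimization: same work, latency hidden, half the time.
faster = total_energy(2.0, 5.0, ops, op_energy)

# Energy-aware tiling: same time as baseline, fewer expensive loads.
fewer_loads = total_energy(2.0, 10.0, {"flop": 1000, "dram_load": 40}, op_energy)
```

In this made-up setting, halving the execution time lowers total energy from 80 J to 70 J because only the static part shrinks, whereas cutting the expensive loads lowers it to 50 J without any speedup. The actual proportions depend entirely on the machine and workload.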
The Process
Step-by-Step
Visit our “Step-by-Step Graduation Guide” to take you through the graduation process, from formatting your dissertation to doctoral hooding procedures.
Dissertation Manual
Wondering how to set up the format for your paper? Refer to the “UD Thesis/Dissertation Manual” for formatting requirements and more.
Defense Submission Form
This form must be completed two weeks in advance of a dissertation defense to meet the University of Delaware Graduate and Professional Education’s requirements.