Energy aware scheduling model and online heuristics for stencil codes on heterogeneous computing architectures

Cluster Computing, Nov 2016

The performance of high-end supercomputers will reach the exascale through core counts in the billions. However, in the upcoming exascale computing era it is important to focus not only on performance, but also on the scalability of fine-grained parallel applications, data locality and energy-aware scheduling within the parallel code. In fact, parallel applications need to change even now: algorithms and data structures must be redesigned to take advantage of the recent improvements in energy efficiency of heterogeneous computing hardware, including multicore processors and GPU accelerators. Over the next few years, one of the biggest challenges for exascale will be the ability of parallel applications to fully exploit locality, which will, in turn, be required to achieve the expected performance and energy efficiency. Future highly parallel applications will have to deal with deep memory hierarchies, taking into account the energy cost of moving data off-chip. Therefore, they will have to apply new coordinated scheduling approaches to balance energy-aware resource utilization and minimize work starvation at runtime. As new constraints and limits on memory bandwidth and energy will play a key role in high performance computing (HPC) in the future, more sophisticated and dynamic scheduling techniques will need to be applied within the parallel code. In this paper we focus on an energy-aware distribution of the stencil workload on heterogeneous processors. Our analysis of energy and performance models focuses on the relevant class of stencil computations to explore the relationship between task scheduling algorithms and energy constraints. More precisely, we search for a schedule that minimizes the energy usage of the stencil workload on heterogeneous architectures within a specified computation deadline. Since the problem is computationally intractable, we present an integer linear programming formulation for finding optimal schedules. As finding optimal schedules is time consuming, we have developed four heuristics and tested them experimentally against optimal solutions. In our work we focus on single-node configurations with heterogeneous processors. These configurations represent state-of-the-art multi- and many-core architectures.
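The scheduling problem described in the abstract, minimizing energy subject to a deadline, can be illustrated on a toy instance. The Python sketch below is not the integer linear programming formulation from the paper: it swaps in plain exhaustive search over a deliberately simplified model. The names `best_split` and `PROCS` and all numbers are hypothetical, and the energy model (a processor draws its active power while computing and its idle power until the last processor finishes) is an assumption made purely for illustration.

```python
from itertools import product

# Hypothetical processors: (seconds per block, active watts, idle watts).
PROCS = [
    (2.0, 10.0, 1.0),  # slower but frugal CPU-like device (made-up figures)
    (1.0, 30.0, 5.0),  # faster but power-hungry GPU-like device (made-up figures)
]


def best_split(n_blocks, procs, deadline):
    """Exhaustively assign n_blocks equal-sized blocks to processors.

    Returns (energy_in_joules, blocks_per_processor) for the cheapest
    assignment whose makespan does not exceed the deadline, or None if
    no assignment meets the deadline.
    """
    best = None
    # Enumerate every tuple of per-processor block counts.
    for counts in product(range(n_blocks + 1), repeat=len(procs)):
        if sum(counts) != n_blocks:
            continue  # not a valid distribution of the whole workload
        busy = [c * t for c, (t, _, _) in zip(counts, procs)]
        makespan = max(busy)
        if makespan > deadline:
            continue  # violates the deadline constraint
        # Assumed model: active power while busy, idle power until makespan.
        energy = sum(
            b * active + (makespan - b) * idle
            for b, (_, active, idle) in zip(busy, procs)
        )
        if best is None or energy < best[0]:
            best = (energy, counts)
    return best
```

With these made-up figures, `best_split(4, PROCS, 8.0)` selects two blocks per processor at 110 J, tightening the deadline to 3 s forces the split (1, 3) at 111 J, and a 2 s deadline is infeasible. The number of candidate assignments grows exponentially with the number of processors, so exhaustive search of this kind is usable only on tiny instances.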



Milosz Ciznicki (1, 2), Krzysztof Kurowski (1), Jan Weglarz (1, 2)

(1) Poznań Supercomputing and Networking Center, Jana Pawla II 10, 61-139 Poznan, Poland
(2) Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, Poland

Keywords: Power and energy modelling; Performance analysis; Scheduling; Resource management; Stencil computations; GPUs; Many-core systems

1 Introduction

Stencil computations are a relevant class of applications that occur in many HPC codes on block-structured grids for modelling various physical phenomena, e.g. in computational fluid dynamics, geometric modelling, solving partial differential equations, or image and video processing [1–5]. As computing time and memory usage grow linearly with the number of array elements in stencil computations, our research targets highly parallel implementations of stencil codes together with task scheduling and optimization techniques that take energy cost and data locality into consideration [6–10]. Our experimental studies have shown that recent changes introduced in heterogeneous computing hardware have resulted in different performance and energy characteristics that are critical for highly efficient and scalable stencil computations [11]. As shown in [12,13], the overall performance of stencil computations is memory bound. One should note that many existing HPC architectures mainly focus on floating point performance [14].
However, only partial and limited usage of the floating point units in a given computing architecture is possible today, and this may make it possible to reduce energy cost without performance degradation. Moreover, many recent improvements in dynamic power management policies at the hardware level, e.g. dynamic voltage and frequency scaling (DVFS) or even switching off an entire unit block of a chip (clock gating), can lead to a significant reduction in the energy required for memory-bound workloads. Advanced dynamic power management policies give new opportunities for scheduling tasks within fine-grained parallel code, as users are able to control the utilization of various functional units in heterogeneous computing hardware, e.g. dynamically turn individual cores on and off, change the frequency of small processing and communication units on demand, or even put portions of cache memory into specific sleep states during runtime. In our previous work [15] we performed an exhaustive evaluation of the key characteristics that have a relevant impact on the performance and energy usage of a stencil computation running on a given processing unit. Based o (...truncated)
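To make the notion of a stencil computation from the introduction concrete, the sketch below applies one sweep of a 5-point Jacobi stencil to a 2D grid in plain Python. It is an illustrative example only, not code from the paper. The function names are hypothetical, and the traffic estimate of 16 bytes per point (one 8-byte read and one 8-byte write, assuming neighbouring values are served from cache) is an assumption.

```python
def jacobi_sweep(grid):
    """Apply one 5-point Jacobi sweep; boundary values are kept fixed."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # copy so reads use the old values only
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # Each interior point becomes the average of its four neighbours:
            # 3 additions + 1 multiplication = 4 floating point operations.
            new[i][j] = 0.25 * (
                grid[i - 1][j] + grid[i + 1][j] + grid[i][j - 1] + grid[i][j + 1]
            )
    return new


def arithmetic_intensity(flops_per_point=4, bytes_per_point=16):
    """Flops per byte of memory traffic under the assumed traffic model."""
    return flops_per_point / bytes_per_point
```

Under these assumptions each grid point costs 4 floating point operations for 16 bytes of memory traffic, i.e. 0.25 flop per byte. A ratio this low is typical of kernels whose speed is limited by memory bandwidth rather than by the floating point units.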


This is a preview of a remote PDF: https://link.springer.com/content/pdf/10.1007%2Fs10586-016-0686-2.pdf

Milosz Ciznicki, Krzysztof Kurowski, Jan Weglarz. Energy aware scheduling model and online heuristics for stencil codes on heterogeneous computing architectures, Cluster Computing, 2016, pp. 1-15, DOI: 10.1007/s10586-016-0686-2