An optimizing pipeline stall reduction algorithm for power and performance on multi-core CPUs

Human-centric Computing and Information Sciences, Jan 2015

The power-performance trade-off is one of the major considerations in micro-architecture design. Pipelined architecture, which is widely used in all modern processors, has brought a radical change in design by capitalizing on the parallel operation of the various functional blocks involved in instruction execution. Pipelining introduces instruction-level parallelism (ILP) through the potential overlap of instructions, but it has drawbacks in the form of hazards, which result from data dependencies and resource conflicts. To overcome these hazards, stalls were introduced; these delay the execution of instructions to defuse the problematic situation. Out-of-order (OOO) execution is a ramification of the stall approach, since it executes instructions in an order governed by the availability of their input data rather than by their original order in the program. This paper presents a new algorithm called Left-Right (LR) for reducing stalls in pipelined processors. The algorithm is built by combining traditional in-order and out-of-order instruction execution, resulting in the best of both approaches. As input, we take Tomasulo's algorithm for out-of-order scheduling and in-order instruction execution, and we compare the proposed algorithm's efficiency against both in terms of power-performance gain. Experimental simulations conducted using Sim-Panalyzer, an instruction-level simulator, show that the proposed algorithm optimizes power-performance, with an energy-consumption benefit of 30% compared to Tomasulo's algorithm and 3% compared to the in-order algorithm.

PDF: https://link.springer.com/content/pdf/10.1186%2Fs13673-014-0016-8.pdf

Vijayalakshmi Saravanan, Kothari Dwarkadas Pralhaddas, Dwarkadas Pralhaddas Kothari, Isaac Woungang
WINCORE Lab, Ryerson University, Toronto, Canada

Instruction pipelining is extensively used in modern processors to achieve instruction-level parallelism [1]. A conventional pipelined processor has five pipe stages: FETCH (FE), DECODE (DE), EXECUTE (EXE), MEMORY (MEM) and WRITE-BACK (WB). In the first stage, the instruction is read from memory and loaded into a register; it is decoded in the second stage. In the third stage, the instruction is executed; in the fourth stage, the desired value is written into memory; and finally, the computed value is written into the register file. In a pipelined processor, if there is a dependency between two consecutive instructions, the instruction in the decode stage will not be valid. The Tomasulo hardware algorithm is used to overcome this situation. It is a dynamic hardware scheduling algorithm in which a separate hardware unit (so-called forwarding) is added to manage the sequential instructions that would normally stall (due to certain dependencies) and to execute them non-sequentially (this is also referred to as out-of-order execution). Due to data forwarding, there is a delay of at least one clock cycle, and a stall is inserted into the pipeline. These stalls, or no-operation (NOP) instructions, are used to eliminate hazards in the pipeline. The NOP instructions contribute to the overall dynamic power consumption of a pipelined processor by generating a number of unnecessary transitions. Our main goal is to minimize such stalls, which in turn increases CPU throughput and thus reduces power consumption. Generally, the time taken by computing devices is determined by a number of factors, and system performance can be enhanced by reducing one or more of them.
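
To make the link between dependencies and stalls concrete, the following is a minimal sketch of stall counting in an in-order five-stage pipeline. It is not the LR algorithm or the authors' simulator. The `Instr` type, the function names and the delay values are illustrative assumptions: without forwarding, a consumer is assumed to decode no earlier than the cycle in which its producer is in WB (register file written in the first half of the cycle and read in the second); with forwarding, ALU results are assumed usable one cycle after EXE and load results one cycle after MEM.

```python
from typing import NamedTuple, Optional

STAGES = ("FE", "DE", "EXE", "MEM", "WB")


class Instr(NamedTuple):
    """Hypothetical instruction record used only for this sketch."""
    text: str              # assembly-like text, for display only
    dest: Optional[str]    # register written, or None
    srcs: tuple            # registers read
    is_load: bool = False  # loads produce their value in MEM, not EXE


def result_delay(producer, forwarding):
    """Cycles between the producer's DE and the earliest DE of a consumer."""
    if not forwarding:
        return 3  # assumption: consumer decodes in the producer's WB cycle
    return 2 if producer.is_load else 1


def count_stalls(program, forwarding=False):
    """Count stall (NOP) cycles caused by read-after-write dependencies."""
    ready = {}   # register -> earliest DE cycle for an instruction reading it
    slot = 0     # DE cycle the next instruction gets if it does not stall
    stalls = 0
    for ins in program:
        start = max([slot] + [ready.get(r, 0) for r in ins.srcs])
        stalls += start - slot
        if ins.dest is not None:
            ready[ins.dest] = start + result_delay(ins, forwarding)
        slot = start + 1
    return stalls


def total_cycles(program, forwarding=False):
    """Pipeline fill (stages - 1) plus one cycle per instruction plus stalls."""
    if not program:
        return 0
    return len(program) + len(STAGES) - 1 + count_stalls(program, forwarding)


program = [
    Instr("lw  r1, 0(r2)", "r1", ("r2",), is_load=True),
    Instr("add r3, r1, r4", "r3", ("r1", "r4")),
    Instr("sub r5, r3, r6", "r5", ("r3", "r6")),
    Instr("or  r7, r8, r9", "r7", ("r8", "r9")),
]

print("stalls without forwarding:", count_stalls(program))
print("stalls with forwarding:   ", count_stalls(program, forwarding=True))
```

Under these assumptions, the four-instruction sequence incurs 4 stall cycles without forwarding and 1 with forwarding, so each stall removed shortens the run by one cycle.
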
Pipelining improves performance by dividing the workload into sub-units and assigning a processing time to each unit, thereby reducing the waiting time that would occur under sequential execution. One approach is to increase the number of pipeline stages, and various strategies can be used to reduce the stalls caused by pipeline hazards. To resolve such hazards, one can use a larger and faster buffer to fetch the instructions and perform out-of-order execution, although this method increases hardware complexity and cost. It also reduces the branch penalty by re-arranging the instructions to fill the stalls caused by branch instructions, but this requires a suitable instruction scheduling algorithm [2]. There is ongoing research on variable pipeline stages, which advocates that a processor's pipeline stages can be varied within a certain range. In this type of processor, one can vary the workload and power consumption as required. Our proposed work on the analysis of stall reduction in pipelined processors is motivated by the following questions: (1) How can the power consumption of instruction execution in a pipelined processor be identified, i.e. is the power consumption of an instruction execution caused by the number of instructions or t (...truncated)
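
The idea of re-arranging instructions to fill stall slots can be illustrated with a generic greedy list scheduler. This is a sketch under stated assumptions, not the LR algorithm and not Tomasulo's algorithm. The `Instr` type, `RESULT_DELAY` and the function names are hypothetical; only register dependencies (RAW, WAR, WAW) are modelled, memory ordering and branches are ignored, and no forwarding is assumed.

```python
from typing import NamedTuple, Optional

RESULT_DELAY = 3  # assumption: a consumer may decode 3 cycles after its producer


class Instr(NamedTuple):
    """Hypothetical instruction record used only for this sketch."""
    text: str
    dest: Optional[str]
    srcs: tuple


def start_cycle(ins, slot, ready):
    """Earliest decode cycle for `ins`, given the next free slot and operand readiness."""
    return max([slot] + [ready.get(r, 0) for r in ins.srcs])


def count_stalls(program):
    """Stall cycles when `program` is executed strictly in the given order."""
    ready, slot, stalls = {}, 0, 0
    for ins in program:
        start = start_cycle(ins, slot, ready)
        stalls += start - slot
        if ins.dest is not None:
            ready[ins.dest] = start + RESULT_DELAY
        slot = start + 1
    return stalls


def depends_on(later, earlier):
    """True if `later` must stay after `earlier` (RAW, WAR or WAW on a register)."""
    raw = earlier.dest is not None and earlier.dest in later.srcs
    war = later.dest is not None and later.dest in earlier.srcs
    waw = later.dest is not None and later.dest == earlier.dest
    return raw or war or waw


def greedy_reorder(program):
    """Repeatedly pick the dependence-free instruction that can decode earliest."""
    remaining = list(program)
    ready, slot, scheduled = {}, 0, []
    while remaining:
        # An instruction is a candidate if it depends on no earlier unscheduled one.
        candidates = [
            ins for i, ins in enumerate(remaining)
            if not any(depends_on(ins, prev) for prev in remaining[:i])
        ]
        # min() returns the first minimum, so ties keep program order.
        best = min(candidates, key=lambda ins: start_cycle(ins, slot, ready))
        start = start_cycle(best, slot, ready)
        if best.dest is not None:
            ready[best.dest] = start + RESULT_DELAY
        slot = start + 1
        remaining.remove(best)
        scheduled.append(best)
    return scheduled


program = [
    Instr("add r1, r2, r3", "r1", ("r2", "r3")),
    Instr("sub r4, r1, r5", "r4", ("r1", "r5")),   # RAW on r1
    Instr("and r6, r7, r8", "r6", ("r7", "r8")),   # independent
    Instr("or  r9, r10, r11", "r9", ("r10", "r11")),  # independent
]

reordered = greedy_reorder(program)
print([ins.text for ins in reordered])
print("stalls before:", count_stalls(program), "after:", count_stalls(reordered))
```

In this example the two independent instructions are moved between the producer and its consumer, which removes both stall cycles of the original order while preserving every register dependence.
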


Vijayalakshmi Saravanan, Kothari Dwarkadas Pralhaddas, Dwarkadas Pralhaddas Kothari, Isaac Woungang. An optimizing pipeline stall reduction algorithm for power and performance on multi-core CPUs, Human-centric Computing and Information Sciences, 2015, pp. 2, Volume 5, Issue 1, DOI: 10.1186/s13673-014-0016-8