

## Introduction

- Pipelining become universal technique in 1985
  - Overlaps execution of instructions
  - Exploits "Instruction Level Parallelism"
- Beyond this, there are two main approaches:
  - Hardware-based dynamic approaches
    - Used in server and desktop processors
    - Not used as extensively in PMP processors
  - Compiler-based static approaches
    - Not as successful outside of scientific applications

Copyright © 2012, Elsevier Inc. All rights reserved.

#### **Instruction-Level Parallelism**

- When exploiting instruction-level parallelism, goal is to minimize CPI
  - Pipeline CPI =
    - Ideal pipeline CPI +
    - Structural stalls +
    - Data hazard stalls +
    - Control stalls
- Parallelism with basic block is limited
  - Typical size of basic block = 3-6 instructions
  - Must optimize across branches

M<

Copyright © 2012, Elsevier Inc. All rights reserved.

3

## **Data Dependence**

- Loop-Level Parallelism
  - Unroll loop statically or dynamically
  - Use SIMD (vector processors and GPUs)
- Challenges:
  - Data dependency
    - Instruction j is data dependent on instruction i if
      - Instruction *i* produces a result that may be used by instruction *j*
      - Instruction *j* is data dependent on instruction *k* and instruction *k* is data dependent on instruction *i*
- Dependent instructions cannot be executed simultaneously

M<

Copyright © 2012, Elsevier Inc. All rights reserved.

## **Data Dependence**

- Dependencies are a property of programs
- Pipeline organization determines if dependence is detected and if it causes a stall
- Data dependence conveys:
  - Possibility of a hazard
  - Order in which results must be calculated
  - Upper bound on exploitable instruction level parallelism
- Dependencies that flow through memory locations are difficult to detect

M<

Copyright © 2012, Elsevier Inc. All rights reserved.

5

## **Data Dependence**

■ Loop: L.D **F0**,0(R1)

ADD.D F4,F0,F2

■ S.D F4,0(R1)

DADDUI R1,R1,#-8

■ BNE R1,R2,Loop

## **Name Dependence**

- Two instructions use the same name but no flow of information
  - Not a true data dependence, but is a problem when reordering instructions
  - Antidependence: instruction j writes a register or memory location that instruction i reads
    - Initial ordering (i before j) must be preserved
  - Output dependence: instruction i and instruction j write the same register or memory location
    - Ordering must be preserved
- To resolve, use renaming techniques

M<

Copyright © 2012, Elsevier Inc. All rights reserved.

7

#### **Other Factors**

- Data Hazards
  - Read after write (RAW)
  - Write after write (WAW)
  - Write after read (WAR)
- Control Dependence
  - Ordering of instruction i with respect to a branch instruction
    - Instruction control dependent on a branch cannot be moved before the branch so that its execution is no longer controller by the branch
    - An instruction not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch

N // /

Copyright © 2012, Elsevier Inc. All rights reserved.

## **Control Dependence**

- Must preserve exception behavior.
- We should not change the exception behavior of the program.
- We often relax this to "reordering of instruction must not raise new exceptions"
- DADDU R2,R3,R4
- BEQZ R2,L1
- LW R1,0(R2)
- L1: ......
- No data dependence prevents us from exchanging BEQZ and LW, but might result in memory protection exception

M<

Copyright © 2012, Elsevier Inc. All rights reserved.

9

## **Examples**

- Example 1: DADDU R1,R2,R3
  - BEQZ R4,L DSUBU R1,R1,R6

L: ... OR

R7,R1,R8

- OR instruction dependent on DADDU and DSUBU
- Preserving the order alone is not sufficient (must have the correct value in R1)
- Example 2:

DADDU R1,R2,R3 BEQZ R12,skip DSUBU R4,R5,R6

DSUBU R4,R5,R6 DADDU R5,R4,R9

skip:

OR R7,R8,R9

- Assume R4 isn't used after skip
  - Possible to move DSUBU before the branch

M<

Copyright © 2012, Elsevier Inc. All rights reserved.

#### **Compiler Techniques for Exposing ILP**

- Pipeline scheduling
  - Separate dependent instruction from the source instruction by the pipeline latency of the source instruction

Example: for (i=999; i>=0; i=i-1) x[i] = x[i] + s;

No dependence between iterations MIPS code?

| Instruction producing result | Instruction using result | Latency in clock cycles |
|------------------------------|--------------------------|-------------------------|
| FP ALU op                    | Another FP ALU op        | 3                       |
| FP ALU op                    | Store double             | 2                       |
| Load double                  | FP ALU op                | 1                       |
| Load double                  | Store double             | 0                       |

M<

Copyright © 2012, Elsevier Inc. All rights reserved.

11

# **Pipeline Stalls**

Loop: L.D F0,0(R1) 1 stall ADD.D F4,F0,F2 3 stall 5 stall F4,0(R1) S.D **DADDUI R1,R1,#-8** 7 8 stall (assume integer load latency is 1) BNE R1,R2,Loop 9

| Instruction producing result | Instruction using result | Latency in clock cycles |
|------------------------------|--------------------------|-------------------------|
| FP ALU op                    | Another FP ALU op        | 3                       |
| FP ALU op                    | Store double             | 2                       |
| Load double                  | FP ALU op                | 1                       |
| Load double                  | Store double             | 0                       |

M<

Copyright © 2012, Elsevier Inc. All rights reserved.

# **Pipeline Scheduling**

#### Scheduled code:

Loop: L.D F0,0(R1) 1
DADDUI R1,R1,#-8 2
ADD.D F4,F0,F2 3
stall 4
stall 5
S.D F4,8(R1) 6
BNE R1,R2,Loop 7

| Instruction producing result | Instruction using result | Latency in clock cycles |
|------------------------------|--------------------------|-------------------------|
| FP ALU op                    | Another FP ALU op        | 3                       |
| FP ALU op                    | Store double             | 2                       |
| Load double                  | FP ALU op                | 1                       |
| Load double                  | Store double             | 0                       |

M<

Copyright © 2012, Elsevier Inc. All rights reserved.

13

# **Loop Unrolling**

Loop unrolling

L.D

Loop:

- Unroll by a factor of 4 (assume # elements is divisible by 4)
- Eliminate unnecessary instructions

F0,0(R1)

ADD.D F4,F0,F2 S.D F4,0(R1); drop DADDUI & BNE 2 stalls L.D F6,-8(R1) ADD.D F8,F6,F2 1 stall F8,-8(R1); drop DADDUI & BNE S.D L.D F10,-16(R1) ADD.D F12,F10,F2 F12,-16(R1); drop DADDUI & BNE S.D F14,-24(R1) L.D ADD.D F16,F14,F2 note: number F16,-24(R1) of live registers DADDUI R1,R1,#-32 vs. original loop BNE R1,R2,Loop

M<

Copyright © 2012, Elsevier Inc. All rights reserved.

#### **Loop Unrolling/Pipeline Scheduling**

• Pipeline schedule the unrolled loop:

```
L.D
                F0,0(R1)
Loop:
        L.D
                F6,-8(R1)
        L.D
                 F10,-16(R1)
        L.D
                F14,-24(R1)
        ADD.D F4,F0,F2
        ADD.D F8,F6,F2
        ADD.D F12,F10,F2
        ADD.D F16,F14,F2
        S.D
                F4,0(R1)
        S.D
                F8,-8(R1)
        DADDUI R1,R1,#-32
        S.D
                F12,16(R1)
        S.D
                F16,8(R1)
        BNE
                 R1,R2,Loop
```

Loop iterations are independent

**M<** 

Copyright © 2012, Elsevier Inc. All rights reserved.

15

# **Strip Mining**

- Unknown number of loop iterations?
  - Number of iterations = n
  - Goal: make *k* copies of the loop body
  - Generate pair of loops:
    - First executes *n* mod *k* times
    - Second executes n / k times
    - "Strip mining"

Compiler Technique

#### **Loop Level Parallelsim**

Loop-Level Parallelism (LLP) analysis focuses on whether data accesses in later iterations of a loop are data dependent on data values produced in earlier iterations and possibly making loop iterations independent.

For(
$$i=0;i<100;i++$$
)  
x[ $i]=x[i]+A;$ 

- the computation in each iteration is independent of the previous iterations and the loop is thus parallel. The use of X[i] twice is within a single iteration.
  - Thus loop iterations are <u>parallel</u> (or independent from each other).



Copyright © 2012, Elsevier Inc. All rights reserved.

4-

## **Loop Level Parallelsim**

- Loop-carried Dependence: A data dependence between different loop iterations (data produced in earlier iteration used in a later one).
- LLP analysis is important in software optimizations such as loop unrolling since it usually requires loop iterations to be independent.
- LLP analysis is normally done at the source code level or close to it since assembly language and target machine code generation introduces loop-carried name dependence in the registers used for addressing and incrementing.
- Instruction level parallelism (ILP) analysis, on the other hand, is usually done when instructions are generated by the compiler

M<

Copyright © 2012, Elsevier Inc. All rights reserved.

#### **Loop Level Parallism**

A[i+1] = A[i] + C[i]; /\* S1 \*/

B[i+1] = B[i] + A[i+1]; /\* S2 \*/

for (i=1; i<=100; i=i+1) {

```
Iteration # \rightarrow i i+1

Not Loop
Carried
Dependence

\begin{array}{c|c}
S1 \\
A_{i+1} \\
\hline
S2 \\
B_{i+1}
\end{array}

S2
```

**Dependency Graph** 

Loop-carried Dependence

- S2 uses the value A[i+1], computed by S1 in the same iteration. This
  data dependence is within the same iteration (not a loop-carried
  dependence).
  - ⇒ does not prevent loop iteration parallelism.
- S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1] read in iteration i+1 (loop-carried dependence, prevents parallelism). The same applies for S2 for B[i] and B[i+1]
  - These two dependencies are loop-carried spanning more than one iteration preventing loop parallelism.

M<

Copyright © 2012, Elsevier Inc. All rights reserved.

19

## **Loop Level parallelism**

```
for(i=0;i<=100;i++)</pre>
```

- S1 uses the value calculated by S2 in the previous iteration (loop carried dependence)
- The dependence is not circular, S2 does not depend on S1 in the previous iteration

M<

Copyright © 2012, Elsevier Inc. All rights reserved.

```
for (i=1; i<=100; i=i+1) {
                                      A[i] = A[i] + B[i];
                                                           /* S1 */
                                       B[i+1] = C[i] + D[i]; /* S2 */
A[1] = A[1] + B[1];
                       A[2] = A[2] + B[2];
                                                                   A[99] = A[99] + B[99];
                                                                                             A[100] = A[100] + B[100];
 B[2] = C[1] + D[1];
                                                                   B[100] = C[99] + D[99];
                        B[3] = C[2] + D[2];
                                                                                             B[101] = C[100] + D[100];
                                           A[1] = A[1] + B[1];
                                           for (i=1; i<=99; i=i+1) {
                                               B[i+1] = C[i] + D[i];
                                                A[i+1] = A[i+1] + B[i+1];
                                            B[101] = C[100] + D[100];
                                    Iteration 1
Loop Start-up code
A[1] = A[1] + B[1];
                        A[2] = A[2] + B[2];
                                                                   A[99] = A[99] + B[99];
                                                                                              A[100] = A[100] + B[100];
B[2] = C[1] + D[1]; B[3] = C[2] + D[2];
                                                                                             B[101] = C[100] + D[100];
                                               Carried
                                                                                                Loop Completion code
                                               Dependence
                                 Copyright © 2012, Elsevier Inc. All rights reserved.
```

# **Finding Dependence**

- Finding dependences in the program is very important for renaming and executing instructions in parallel.
- Arrays and pointers makes finding dependences very difficult.
- Assume array indices are *affine*, which means on the form ai+b where a and b are constant.
- GCD test can be used to detect dependences.

# **Finding Dependence**

- Assume we stored an array with index value of ai+b and loaded an array with an index value of cj+d
- Are they pointing to the same location?
- Assume the loop limit is *m*,*n*
- Are there

 $j,k \quad m \le j,k \le n \text{ such that } a \times j + b = c \times k + d$ 

M<

Copyright © 2012, Elsevier Inc. All rights reserved.

23

## **GCD** test

- A simple and sufficient test for absence can be found.
- If a loop dependence exists, then

GCD(c,a) divides (d-b)

M<

Copyright © 2012, Elsevier Inc. All rights reserved.

## **GCD Test -- Example**

```
for(i=1; i<=100; i=i+1) { x[2*i+3] = x[2*i] * 5.0; } a = 2 \quad b = 3 \quad c = 2 \quad d = 0 GCD(a, c) = 2  d - b = -3 2 does not divide -3 \Rightarrow No dependence is not possible.
```

5,7,9,11,13,15,17,19,21,23,.... 4,6,8,10,12,14,16,18,20,22,.....

M<

Copyright © 2012, Elsevier Inc. All rights reserved.

25

## **Dependence Analysis**

- Dependence analysis is a very important tool for exploiting LLP, it can not be used in these situations
- Objects are referenced using pointers
- Array indexing using another array a[b[i]]
- Dependence may exist for some values of input, but in reality the input never takes these values.
- When we want to know more than the possibility of dependence (which write causes it?)
- Dependence analysis across procedure boundaries

M<

Copyright © 2012, Elsevier Inc. All rights reserved.

## **Dependence Analysis**

- Sometimes, points-to analysis might help.
- We might be able to answer simpler questions, or get some hints.
- Do 2 pointers point to the same list?
- Type information
- Information derived when the object was allocated
- Pointer assignments

M<

Copyright © 2012, Elsevier Inc. All rights reserved.

27

# **Software Pipelines**

 Software pipelined loop chooses instructions from different loop iterations, thus separating the dependent instructions within one iteration of the original loop



M<

Copyright © 2012, Elsevier Inc. All rights reserved.

# **Software Piplines**

```
Loop:
                 F0,0(R1)
        L.D
        ADD.D
                 F4,F0,F2
        S.D
                 F4,0(R1)
                 R1,R1,#-8
       DADDUI
       BNE
```

#### Before: Unrolled 3 times F0,0(R1)

```
L.D
           F4,F0,F2
   ADD.D
3
   S.D
           F4,0(R1)
   L.D
           F0,-8(R1)
   ADD.D
           F4,F0,F2
   S.D
           F4,-8(R1)
           F0,-16(R1)
   L.D
           F4,F0,F2
   ADD.D
   S.D
           F4,-16(R1)
10 DADDUI R1,R1,#-24
11 BNE
           R1,R2,LOOP
```

#### After: Software Pipelined Version

```
L.D
        F0,0(R1)
        F4,F0,F2
ADD.D
L.D
        F0,-8(R1)
S.D
        F4,0(R1)
                  ;Stores M[i]
ADD.D
        F4,F0,F2
                  ;Adds to M[i-1]
        F0,-16(R1);Loads M[i-2]
L.D
DADDUI
       R1,R1,#-8
        R1,R2,LOOP
BNE
S.D
        F4, 0(R1)
ADDD
        F4,F0,F2
        F4,-8(R1)
S.D
```

Copyright © 2012, Elsevier Inc. All rights reserved.

# **Software Pipelines**



Loop Body of software Pipelined Version

Copyright © 2012, Elsevier Inc. All rights reserved.

#### **Branch Prediction**

- Dynamic scheduling deals with data dependence improving, the limiting factor is the control dependence.
- Branch prediction is important for processors that maintains a CPI of 1, but it is crucial for processors who tries to issue more than one instruction per cycle (CPI < 1).</li>
- We have already studied some techniques (delayed branch, predict not taken), but these do not depend on the dynamic behavior of the code.



Copyright © 2012, Elsevier Inc. All rights reserved.

31

## **Branch History Table**

- A small memory indexed by the lower portion of the address of the branch instruction.
- The memory contains only 1-bit, to predict taken or untaken
- If the prediction is incorrect, the prediction bit is inverted.
- In a loop, it mispredicts twice
  - End of loop case, when it exits instead of looping as before
  - First time through loop on next time through code, when it predicts exit instead of looping

M<

Copyright © 2012, Elsevier Inc. All rights reserved.



# **2-Bit Predictor**

- Uses 2 bits to add some hysteresis to the prediction – Compare with 1 bit?
- 2 bits are as good as N bits (approx.)



M<

Copyright © 2012, Elsevier Inc. All rights reserved.

## **2-bit Predictor**

• 4096 entries 2-bit predictor miss rate



M<

Copyright © 2012, Elsevier Inc. All rights reserved.

3!

# **Correlating Branch Predictors**

```
DSUBUI R3, R1, #2
B1
       if (aa==2)
                            BNEZ
                                      R3, L1
                                                 ; b1 (aa!=2)
         aa=0;
                                      R1, R0, R0 ; aa==0
                             DADD
                        L1: DSUBUI R3, R1, #2
       if (bb==2)
B2
                             BNEZ
                                      R3, L2
                                                ; b2 (bb!=2)
        bb=0;
                             DADD
                                      R2, R0, R0 ; bb==0
B3
       if (aa!==bb){
                       L2: DSUBUI R3, R1, R2; R3=aa-bb
                             BEQZ
                                      R3, L3
                                                 ; b3 (aa==bb)
```

If the condition is true → B1,B2 branch NOT TAKEN

If the condition is true → B3 NOT taken

If B1 and B2 both NOT TAKEN B3 → TAKEN

There is a correlation between B3 and both B1 and B2

M<

Copyright © 2012, Elsevier Inc. All rights reserved.

# **Correlating Branch Predictors**

- Correlating predictors (tw0-level predictors) use the behavior of other branches to make prediction.
- Simplest (1-bit) has 2 predictions, one if the last branch is take, the second is when the last branch is not taken
- The prediction is on the form NT/T

M<

Copyright © 2012, Elsevier Inc. All rights reserved.

37

# **Example**

```
B1 if (d==0)
d=1;
B2 if (d==1)
{
```

```
BNEZ R1, L1 ; d == 0 ?
DADD R1, R0, #1 ; YES d==1
L1: DADD R3, R1, #-1
BNEZ R3, L2 ; b2 (bb!=2)
L2:
```

If b1 not taken, b2 is taken for sure

| Initial d | d==0? | B1    | d befoe b2 | d==1 | b2    |
|-----------|-------|-------|------------|------|-------|
| 0         | Υ     | NO    | 1          | Υ    | NO    |
| 1         | N     | Taken | 1          | Υ    | NO    |
| 2         | N     | Taken | 2          | N    | Taken |

M<

Copyright © 2012, Elsevier Inc. All rights reserved.

| Example                                              |         |        |       |        |        |         |
|------------------------------------------------------|---------|--------|-------|--------|--------|---------|
| In                                                   | itial d | d==0?  | B1    | d befo | e b2 d | ==1 B2  |
|                                                      | 0       | Υ      | /NO   | 1      | •      | Y NO    |
|                                                      | 1       | N      | Taken | 1      | `      | Y NO    |
|                                                      | 2       | N      | Taken | 2      | 1      | N Taken |
| d                                                    | B1      | B1 /   | newB1 | B2     | B2     | new B2  |
|                                                      | Pred    | action | pred  | pred   | action | pred    |
| 2                                                    | NT      | T      | _T    | NT     | T      | T       |
| 0                                                    | T       | NT     | NT    | T      | NT     | NT      |
| 2                                                    | NT      | Т      | Т     | NT     | Т      | Т       |
| 0                                                    | Т       | NT     | NT    | Т      | NT     | NT      |
| Miss on every prediction                             |         |        |       |        |        |         |
| Copyright © 2012, Elsevier Inc. All rights reserved. |         |        |       |        | 39     |         |

| Example                                              |                   |               |                |             |                |
|------------------------------------------------------|-------------------|---------------|----------------|-------------|----------------|
| Initial d                                            | d==0?             | B1            | d befoe b2     | d==         | 1 b2           |
| 0                                                    | Υ                 | NO            | 1              | Υ           | NO             |
| 1                                                    | N                 | Taken         | 1              | Υ           | NO             |
| 2                                                    | N                 | Taken         | 2              | N           | Taken          |
| d b<br>Pr                                            | 1 b1<br>ed action | newb1<br>pred |                | 2<br>action | new b2<br>pred |
| 2 <b>N</b> 7                                         | T/NT % T          | T/NT          | NT/NT %        | Т           | NT/T           |
| 0 T/                                                 | NT √ NT           | T/NT          | NT/T √         | NT          | NT/T           |
| 2 <b>T</b> /i                                        | NT √ T            | T/NT          | NT/ <b>T</b> √ | Т           | NT/T           |
| 0 T/                                                 | NT √ NT           | T/NT          | NT/T √         | NT          | NT/T           |
| Misprediction on first try only                      |                   |               |                |             |                |
| Copyright © 2012, Elsevier Inc. All rights reserved. |                   |               |                |             | 40             |

## **Correlating Predictors**

- The previous predictor is called (1,1) predictor.
- It uses one bit for history (last branch), to choose among two (2¹) 1-bit branch predictors.
- In general a predictor could me (m,n) predictor.
- It uses the last m branch to choose among  $2^m$  branch predictors each is n-bit predictor.

M<

Copyright © 2012, Elsevier Inc. All rights reserved.

41

# (2,2) Correlating Predictors

#### (2,2) predictor

 Behavior of recent branches selects between four predictions of next branch, updating just that prediction



M<

Copyright © 2012, Elsevier Inc. All rights reserved.



# **Branch Prediction**

- Basic 2-bit predictor:
  - For each branch:
    - Predict taken or not taken
    - If the prediction is wrong two consecutive times, change prediction
- Correlating predictor:
  - Multiple 2-bit predictors for each branch
  - One for each possible combination of outcomes of preceding n branches
- Local predictor:
  - Multiple 2-bit predictors for each branch
  - One for each possible combination of outcomes for the last n occurrences of this branch

Dianch Plediction

## **Tournament Predictor**

- Tournament predictor:
  - Combine correlating predictor with local predictor
  - A selector is sued to decide which one of these to use
- The selector could be similar to a 2-bit predictor
  - A saturating 2-bit binary counter with 2 outcomes P1/P2



M<

Copyright © 2012, Elsevier Inc. All rights reserved.



# Alpha 21264 Branch Predictor

- Tournament predictor using, 4K 2-bit counters indexed by local branch address.
- Global predictor
  - 4K entries index by history of last 12 branches (2<sup>12</sup> = 4K)
  - Each entry is a standard 2-bit predictor
- Local predictor
  - Local history table: 1024 10-bit entries recording last 10 branches, index by branch address
  - The pattern of the last 10 occurrences of that **particular** branch used to index table of 1K entries with 3-bit saturating counters

M<

Copyright © 2012, Elsevier Inc. All rights reserved.