## **CSE 4201 Computer Architecture**

Prof. Mokhtar Aboelaze

Parts of these slides are taken from Notes by Prof. David Patterson at UCB

Fall 09

CSE4201

### **Outline**

- MIPS and instruction set
- Simple pipeline in MIPS
- Structural and data hazards
- Forwarding
- Branching
- Exception and interrupts
- Multicycle operations

Fall 09

CSE4201

#### **MIPS Instruction set**

- 32-bit fixed format instruction (3 formats)
- 32 32-bit GPR (R0 contains zero, DP take pair)
- 3-address, reg-reg arithmetic instruction
- Single address mode for load/store: base + displacement
  - no indirection
- Simple branch conditions
- Delayed branch

Fall 09

CSE4201

#### **Instruction Set**

- Instruction Set Architecture
  - Defines set of operations, instruction format, hardware supported data types, named storage, addressing modes, sequencing
- Meaning of each instruction is described by RTL on architected registers and memory
- · Given technology constraints assemble adequate datapath
  - Architected storage mapped to actual storage
  - Function units to do all the required operations
  - Possible additional storage (eg. MAR, MDR, ...)
  - Interconnect to move information among regs and FUs
- Map each instruction to sequence of RTLs
- Collate sequences into symbolic controller state transition diagram (STD)
- Lower symbolic STD to control points
- Implement controller

Fall 09

CSE4201

### **MIPS Instruction Set**



FIGURE 3.1 The implementation of the DLX datapath allows every instruction to be executed in four or five clock cycles.

Fall 09 CSE42

### **MIPS Instruction Set**

#### Register-Register



#### Register-Immediate

| 31 | 26 | 25 2 | 120 | 16 | 15 |           | 0 |
|----|----|------|-----|----|----|-----------|---|
| Op |    | Rs1  | Rd  |    |    | immediate |   |

#### Branch



#### Jump / Call













## Pipelining is not quite that easy!

- Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle
  - Structural hazards: HW cannot support this combination of instructions
  - <u>Data hazards</u>: Instruction depends on result of prior instruction still in the pipeline
  - Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).





### Speed Up Equation for **Pipelining**

CPI<sub>pipelined</sub> = Ideal CPI + Average Stall cycles per Inst

$$Speedup = \frac{Ideal \ \textit{CPI} \times Pipeline \ depth}{Ideal \ \textit{CPI} + Pipeline \ stall \ \textit{CPI}} \times \frac{\textit{Cycle Time}_{unpipelined}}{\textit{Cycle Time}_{pipelined}}$$

For simple RISC pipeline, CPI = 1:

$$Speedup = \frac{Pipeline \ depth}{1 + Pipeline \ stall \ CPI} \times \frac{Cycle \ Time_{unpipelined}}{Cycle \ Time_{pipelined}}$$

Fall 09 CSE4201

### **Example: Dual-port vs.** Single-port Machine A: Dual ported memory ("Harvard Architecture")

- Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate
- Ideal CPI = 1 for both
- Loads are 40% of instructions executed

```
SpeedUp<sub>A</sub> = Pipeline Depth/(1 + 0) \times (clock_{unpipe}/clock_{pipe})
              = Pipeline Depth
SpeedUp<sub>B</sub> = Pipeline Depth/(1 + 0.4 x 1) x (clock_{unpipe}/(clock_{unpipe}/ 1.05)
              = (Pipeline Depth/1.4) \times 1.05
              = 0.75 x Pipeline Depth
```

 $SpeedUp_A / SpeedUp_B = Pipeline Depth/(0.75 x Pipeline Depth) = 1.33$ 

Machine A is 1.33 times faster



## Three Generic Data Hazards

Read After Write (RAW)
 Instr<sub>.</sub> tries to read operand before Instr<sub>.</sub> writes it

I: add r1,r2,r3 J: sub r4,r1,r3

• Caused by a "Dependence" (in compiler nomenclature). This hazard results from an actual need for communication.

### Three Generic Data Hazards

Write After Read (WAR)
 Instr<sub>J</sub> writes operand <u>before</u> Instr<sub>I</sub> reads it

```
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
```

- Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1".
- Can't happen in MIPS 5 stage pipeline because:
  - All instructions take 5 stages, and
  - Reads are always in stage 2, and
  - Writes are always in stage 5

Fall 09

CSE4201

## Three Generic Data Hazards

Write After Write (WAW)
 Instr, writes operand <u>before</u> Instr, writes it.

```
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
```

- Called an "output dependence" by compiler writers This also results from the reuse of name "r1".
- Can't happen in MIPS 5 stage pipeline because:
  - All instructions take 5 stages, and
  - Writes are always in stage 5
- FalWill see WAR and WAW in the complicated pipes















### **Branch Stall Impact**

- If CPI = 1, 30% branch,
   Stall 3 cycles => new CPI = 1.9!
- Two part solution:
  - Determine branch taken or not sooner, AND
  - Compute taken branch address earlier
- MIPS branch tests if register = 0 or ≠ 0
- MIPS Solution:
  - Move Zero test to ID/RF stage
  - Adder to calculate new PC in ID/RF stage
  - 1 clock cycle penalty for branch versus 3



## Four Branch Hazard Alternatives

#1: Stall until branch direction is clear

#### #2: Predict Branch Not Taken

- Execute successor instructions in sequence
- "Squash" instructions in pipeline if branch actually taken
- Advantage of late pipeline state update
- 47% MIPS branches not taken on average
- PC+4 already calculated, so use it to get next instruction

#### #3: Predict Branch Taken

Fall 09

- 53% MIPS branches taken on average
- But haven't calculated branch target address in MIPS
  - MIPS still incurs 1 cycle branch penalty
  - Other machines: branch target known before outcome
     CSE4201

15

### Four Branch Hazard Alternatives

#### #4: Delayed Branch

Define branch to take place AFTER a following instruction

```
branch instruction
sequential successor<sub>1</sub>
sequential successor<sub>2</sub>
.....
sequential successor<sub>n</sub>
branch target if taken
```

- 1 slot delay allows proper decision and branch target address in 5 stage pipeline
- MIPS uses this



### **Delayed Branch**

- Compiler effectiveness for single branch delay slot:
  - Fills about 60% of branch delay slots
  - About 80% of instructions executed in branch delay slots useful in computation
  - About 50% (60% x 80%) of slots usefully filled
- Delayed Branch downside: As processor go to deeper pipelines and multiple issue, the branch delay grows and need more than one delay slot
  - Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches
  - Growth in available transistors has made dynamic approaches relatively cheaper

Fall 09 CSE4201

### Evaluating Branch Alternatives

Pipeline speedup =  $\frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$ 

Assume 4% unconditional branch, 6% conditional branch- untaken, 10% conditional branch-taken

| Scheduling       | Branch  | CPI  | speedup v.  | speedup v. |
|------------------|---------|------|-------------|------------|
| scheme           | penalty |      | unpipelined | stall      |
| Stall pipeline   | 3       | 1.60 | 3.1         | 1.0        |
| Predict taken    | 1       | 1.20 | 4.2         | 1.33       |
| Predict not take | n 1     | 1.14 | 4.4         | 1.40       |
| Delayed branch   | 0.5     | 1.10 | 4.5         | 1.45       |

### **Problems with Pipelining**

- Exception: An unusual event happens to an instruction during its execution
  - Examples: divide by zero, undefined opcode
- Interrupt: Hardware signal to switch the processor to a new instruction stream
  - Example: a sound card interrupts when it needs more audio output samples (an audio "click" happens if it is left waiting)
- Problem: It must appear that the exception or interrupt must appear between 2 instructions (I<sub>i</sub> and I<sub>i+1</sub>)
  - The effect of all instructions up to and including  $\mathbf{I}_{\mathrm{i}}$  is totally complete
  - No effect of any instruction after I<sub>i</sub> can take place
- The interrupt (exception) handler either aborts program or restarts at instruction I<sub>i+1</sub>

Fall 09 CSE4201

## And In Conclusion: Control and Pipelining

- Quantify and summarize performance
  - Ratios, Geometric Mean, Multiplicative Standard Deviation
- F&P: Benchmarks age, disks fail,1 point fail danger
- Control VIA State Machines and Microprogramming
- Just overlap tasks; easy if tasks are independent
- Speed Up ≤ Pipeline Depth; if ideal CPI is 1, then:

Speedup = 
$$\frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}$$

- Hazards limit performance on computers:
  - Structural: need more HW resources
  - Data (RAW,WAR,WAW): need forwarding, compiler scheduling
  - Control: delayed branch, prediction
- Exceptions, Interrupts add complexity



### **Multi cycles operations**

Assuming the following

| J           |              | 9                 |  |
|-------------|--------------|-------------------|--|
| Function    | Unit latency | initiation period |  |
| Integer ALU | 0            | 1                 |  |
| Data Memory | 1            | 1                 |  |
| FP add      | 3            | 1                 |  |
| FP Multiply | 6            | 1                 |  |
| FP Divide   | 24           | 24                |  |

Notice that FP add and multiply are pipelined (4 and 7 stages pipeline respectively).

Latency is the number of cycles between an instruction that produces a result and another one that uses the result.



### **Multicycle operations**

ID MULTD М4 м5 М6 м7 MEM WB ADDD IF ID MEM LD MEM WB ID SD MEM ID EΧ WB

Stages in red indicates when data are needed, in blue indicates when data are produced

Need to introduce more pipeline registers A1/A2, ..

### **Hazards and Forwarding**

- Because the divide unit is not pipelined, structural hazards may arise
- Because of different running times. We may need to do more than one register write in a single cycle
- WAW hazard is now possible, WAR is not since they all read in one stage
- Instructions can complete in different order, more complicated exception handling
- Because of the longer latency, more RAW

Fall 09 hazard

CSE420

### **Hazards and Forwarding**

Substantially longer stall and forwarding

### **Hazard and Forwarding**

```
Instruction 1 2 3 4 5 6 7 8 9 10 11

MULTD F0,F4,F6 IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB
...

ADDD F2,F4,F6 IF ID A1 A2 A3 A4 MEM WB
...

LD F8,0(R2)
```

Three different instruction writing in the same cycle, IF LD is issued one cycle earlier, with destination of F2, that will lead to WAW hazard

Fall 09 CSE4201

### **Hazards and Forwarding**

- One way to deal with multiple writes is to have multiple write ports, but it may be rarely used.
- Another way is to detect the structural hazard by using an interlock
  - We can track the use of the write port before it is issued (ID stage) and stall
  - Or, we can detect this hazard at entering the MEM stage, it is easier to detect, and we can choose which instruction to proceed (the one with the longest latency?)

### **Maintaining Precise Exception**

DIVF F0,F2,F4

ADDF F10,F10,F8

SUBF F12,F12,F14

- This is known as out of order completion
- What if SUBF causes an Exception after ADDF is completed but before DIVF is, or if DIVF caused an exception after both ADDF and SUBF completed, there is no way to maintain a precise exception since ADDF destroys one of its operands

Fall 09 CSE4201

## Maintaining Precise Exception (sol 1)

- Early solution is to ignore the problem
- More recent ones, are to introduce two modes of operations, fast but with imprecise exception, and slow with precise exception.
- DEC Alpha 2104, Power1 and Power-2, MIPS R8000

## Maintaining Precise Exception (sol 2)

- Buffer the results of an operation until all the operations before it are completed.
- Costly, especially with long pipes.
- One variation is called history file, old values are stored in the history file and can be restored in case of exception
- Or, we an use future file, where new values are stored until all proceeding instructions are completed.

Fall 09

CSE4201

## Maintaining Precise Exception (sol 3)

- Allow the exception to become imprecise, but we have to keep enough information so that the exception handling routine can recover.
- These information are usually the PC addresses of the instructions that were in the pipe during the exception, who finished, and who didn't.

Fall 09

CSE4201

# Maintaining Precise Exception (sol 4) • A hybrid technique, Allow the

- A hybrid technique, Allow the instructions to be issued only if we are certain that all the instructions before the issuing instruction will complete without causing an exception
- That guarantees that no instruction after the interrupting one will be completed, and all instructions before it will complete.
- Must check for exception early in the EX

Fall 09 stage.

CSE4201

#### **MIPS4000**

- 8 Stage Pipeline:
  - IF-first half of fetching of instruction; PC selection happens here as well as initiation of instruction cache access.
  - IS-second half of access to instruction cache.
  - RF-instruction decode and register fetch, hazard checking and also instruction cache hit detection.
  - EX-execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation.
  - DF-data fetch, first half of access to data cache.
  - DS-second half of access to data cache.
  - TC-tag check, determine whether the data cache access hit.
  - WB-write back for loads and register-register operations.
- 8 Stages: What is impact on Load delay? Branch delay? Why?

Fall 09

CSE4201

#### **MIPS4000** RF (DS) WB ΕX DF TC IS 2 cycle load delay IS IF RF EX DF DS TC (data is ready after DS) IF ΕX IS RF DF DS RF (EX) IS DF IS RF EX IF IS RF IF IS IF IF RF DS TC IS DF WB IS RF EX DF DS TC 3 cycles branch delay, IF IS RF ΕX DF DS MIPS4000 has a singly IS RF $\mathsf{EX}$ DF cycle branch delay scheduling with a predict IF IS RF ΕX taken for the remaining 2 IF IS RF IS ΙF Fall 09 CSE4201

### **MIPS4000 FP Pipeline**

- FP Adder, FP Multiplier, FP Divider
- Last step of FP Multiplier/Divider uses FP Adder HW
- 8 kinds of stages in FP units:

| Stage | Functional unit | Description                |
|-------|-----------------|----------------------------|
| Α     | FP adder        | Mantissa ADD stage         |
| D     | FP divider      | Divide pipeline stage      |
| Е     | FP multiplier   | Exception test stage       |
| M     | FP multiplier   | First stage of multiplier  |
| N     | FP multiplier   | Second stage of multiplier |
| R     | FP adder        | Rounding stage             |
| S     | FP adder        | Operand shift stage        |
| U     |                 | Unpack FP numbers          |
|       |                 |                            |

### MIPS4000 F P Pipeline

FP Instr Pipeline stages
Add, Subtract U,S+A,A+R,R+S
Multiply U,E+M,M,M,N,N,N+A,R

Divide U,A,R,D<sup>28</sup>,D+A,D+R, D+R, D+A, D+R, A, R

Square root  $U,E,(A+R)^{108},A,R$ 

Negate U,S Absolute value U,S FP compare U,A,R

| OP      | Latency | Initiation interval |
|---------|---------|---------------------|
| ADD,SUB | 4       | 3                   |
| MUL     | 8       | 4                   |
| DIV     | 36      | 35                  |
| SQRT    | 112     | 111                 |
| NEG,ABS | 2       | 1                   |
| COMP    | 3       | 2                   |

Fall 09 CSE4201

#### **EXAMPLE**

MUL ISSUE MUL Issue U M M M N,A R ADD Issue U S,A A,R R,S U S,A A,R R,S ADD Issue ADD Stall S,A A,R R,S ADD Stall S,A A,R R,S ADD Issue S,A A,R R,S U S,AA,R R,S ADD Issue ADD Issue U S,A A,R R,S

The interaction between a multiply issued at time 0, and add issued between 1 and 7, in all except 2 we can proceed without stalls, in these two we have to stall