









| Specula                                           | ation                                                                                                                  |
|---------------------------------------------------|------------------------------------------------------------------------------------------------------------------------|
| its result                                        | sulo's algorithm, once an instruction writes<br>, any subsequently issued instructions will<br>It in the register file |
| ° With spe<br>until the                           | culation, the register file is not updated instruction commits                                                         |
| • (we kn                                          | ow definitively that the instruction should execute)                                                                   |
| ° Thus, the<br>between<br>instruction             | e ROB supplies operands in interval<br>completion of instruction execution and<br>on commit                            |
| <ul> <li>ROB is<br/>reserv<br/>algorit</li> </ul> | a source of operands for instructions, just as ation stations (RS) provide operands in Tomasulo's hm                   |
| • ROB e                                           | xtends architecture registers like RS                                                                                  |
| ° ROB hole<br>associate<br>commits                | ds the results between the operation<br>ed with the instruction completes, and                                         |
| Fall 09                                           | CSE4201                                                                                                                |







| Exa     | mple |            |  |
|---------|------|------------|--|
| Loop    | LD   | F0,10(R2)  |  |
|         | ADDD | F10,F4,F0  |  |
|         | DIVD | F2,F10,F6  |  |
|         | DADD | R1,R1,-8   |  |
|         | BNE  | R1,R2,Loop |  |
|         |      |            |  |
| Fall 09 |      | CSE4201    |  |











| Source instruction | Instruction using res | ult Latency          |
|--------------------|-----------------------|----------------------|
| FP ALU OP          | FP ALU OP             | 3                    |
| FP ALU OP          | Store double          | 2                    |
| Load double        | FP ALU OP             | 1                    |
| Load Double        | Store double          | 0                    |
| Loop: L.D          | F0,0(R1)              |                      |
| ADD.D              |                       | For (I=1000;I>0;I++) |
| S.D<br>DADDUI      | 0(R1),F4<br>R1,R1,#-8 | x[I]=x[I]+s;         |
| BADDOI<br>BNE R    | 1,R2,Loop             |                      |
|                    |                       |                      |
|                    |                       |                      |
| Fall 09            | CSE4201               |                      |

| ° Assu<br>opera       |                | v can sche<br>FP operati |              |                      |          |
|-----------------------|----------------|--------------------------|--------------|----------------------|----------|
| Memory<br>reference 1 |                | FP<br>operation 1        | FP<br>op. 2  | Int. op/ (<br>branch | Clock    |
| _D F0,0(R1)           | LD F6,-8(R1)   |                          |              |                      | 1        |
| _D F10,-16(R1)        | LD F14, 24(R1) |                          |              |                      | 2        |
| _D F18,-32(R1)        | LD F22,-40(R1) | ADDD F4,F0,F2            | ADDD F8,F6,F | 23                   |          |
| _D F26,-48(R1)        |                | ADDD F12,F10,F2          | ADDD F16,F14 | I,F2                 | 4        |
|                       |                | ADDD F20,F18,F2          | ADDD F24,F22 | 2,F2                 | 5        |
| SD 0(R1),F4           | SD -8(R1),F8   | ADDD F28,F26,F2          |              |                      | 6        |
| SD -16(R1),F12        | SD -24(R1),F16 |                          |              | DADD R1,R1,#-5       | 6 7      |
| SD 24(R1),F20         | SD 16(R1),F24  |                          |              |                      | 8        |
| SD 8(R1),F28          |                | 7 iterations             | n 9          | BNEZ R1,LOOF         | <b>9</b> |
| Fall 09               |                | cvcles = 1.2             | 9 C/I        |                      |          |

| Outline                  |                                       |
|--------------------------|---------------------------------------|
| ° Data depe              | ndence and hazards                    |
| ° Exposing<br>scheduling | parallelism (loop unrolling and<br>g) |
| ° Reducing               | branch costs (prediction)             |
| ° Dynamic s              | scheduling                            |
| ° Speculatio             | on                                    |
| ° Multiple is            | sue and static scheduling             |
| ° Advanced               | techniques                            |
| ° Example                |                                       |
|                          |                                       |
| Fall 09                  | CSE4201                               |



| Adva    | ince | d Dynamic Scheduling |
|---------|------|----------------------|
|         |      |                      |
| Loop:   | LD   | R2,0(R1)             |
|         | ADD  | R2,R2,#1             |
|         | SD   | R2,0(R1)             |
|         | ADD  | R1,R1,#8             |
|         | BNE  | R2,R3,Loop           |
|         |      |                      |
|         |      |                      |
|         |      |                      |
|         |      |                      |
|         |      |                      |
|         |      |                      |
| Fall 09 |      | CSE4201              |

|                     | vvitti | out opo    | culation                           |                                      | Memory                             |                                       |                  |
|---------------------|--------|------------|------------------------------------|--------------------------------------|------------------------------------|---------------------------------------|------------------|
| Iteration<br>number |        |            | lssues at<br>clock cycle<br>number | Executes at<br>clock cycle<br>number | access at<br>clock cycle<br>number | Write CDB at<br>clock cycle<br>number | Comment          |
| 1                   | LD     | R2,0(R1)   | 1                                  | 2                                    | 3                                  | 4                                     | First issue      |
| . 1                 | DADDIU | R2,R2,#1   | 1                                  | 5 🔶                                  |                                    | 6                                     | Wait for LW      |
| 1                   | SD     | R2,0(R1)   | 2                                  | 3                                    | 7                                  |                                       | Wait for DADDI   |
| 1                   | DADDIU | R1,R1,#4   | 2                                  | 3                                    |                                    | 4                                     | Execute directly |
| 1                   | BNE    | R2,R3,L00P | 3                                  | 7                                    |                                    |                                       | Wait for DADDI   |
| 2                   | LD     | R2,0(R1)   | 4                                  | 8                                    | 9                                  | 10                                    | Wait for BNE     |
| 2                   | DADDIU | R2,R2,#1   | 4                                  | 11                                   |                                    | 12                                    | Wait for LW      |
| 2                   | SD     | R2,0(R1)   | 5                                  | 9                                    | 13                                 |                                       | Wait for DADDI   |
| 2                   | DADDIU | R1,R1,#4   | 5                                  | 8                                    |                                    | 9                                     | Wait for BNE     |
| 2                   | BNE    | R2,R3,L00P | 6                                  | 13                                   |                                    |                                       | Wait for DADDI   |
| 3                   | LD     | R2,0(R1)   | 7                                  | 14                                   | 15                                 | 16                                    | Wait for BNE     |
| 3                   | DADDIU | R2,R2,#1   | 7                                  | 17 🔶                                 |                                    | 18                                    | Wait for LW      |
| 3                   | SD     | R2,0(R1)   | 8                                  | 15                                   | 19                                 |                                       | Wait for DADDI   |
| 3                   | DADDIU | R1,R1,#4   | 8                                  | 14                                   |                                    | . 15                                  | Wait for BNE     |
| 3                   | BNZ    | R2,R3,L00P | 9                                  | 19                                   |                                    |                                       | Wait for DADDI   |

Figure 3.33 The time of issue, execution, and writing result for a dual-issue version of our pipeline without speculation. Note that the L. D following the BNE cannot start execution earlier, because it must wait until the branch outcome is determined. This type of program, with data-dependent branches that cannot be resolved earlier, shows the strength of speculation. Separate functional units for address calculation, ALU operations, and branch condition evaluation allow multiple instructions to execute in the same cycle.

Fall 09

CSE4201

| Iteration<br>number | Instruct | tions            | lssues<br>at clock<br>number | Executes<br>at clock<br>number | Read access<br>at clock<br>number | Write<br>CDB at<br>clock<br>number | Commits<br>at clock<br>number | Comment            |
|---------------------|----------|------------------|------------------------------|--------------------------------|-----------------------------------|------------------------------------|-------------------------------|--------------------|
| 1                   | LD       | R2,0(R1)         | 1                            | 2                              | 3                                 | 4                                  | 5                             | First issue        |
| 1                   | DADDIU   | R2,R2,#1         | 1                            | 5                              |                                   | 6                                  | 7                             | Wait for LW        |
| 1                   | SD       | R2,0(R1)         | 2                            | 3                              |                                   |                                    | 7                             | Wait for DADDIU    |
| 1                   | DADDIU   | R1,R1,#4         | 2                            | 3                              |                                   | 4                                  | 8                             | Commit in order    |
| 1                   | BNE      | R2,R3,L00P       | 3                            | 7 🖌                            |                                   |                                    | 8                             | Wait for DADDIU    |
| 2                   | LD       | R2,0(R1)         | 4                            | 5                              | 6                                 | 7                                  | 9                             | No execute delay   |
| 2                   | DADDIU   | R2,R2,#1         | 4                            | 8                              |                                   | 9                                  | 10                            | Wait for LW        |
| 2                   | SD       | R2,0(R1)         | 5                            | 6                              |                                   |                                    | 10                            | Wait for DADDIU    |
| 2                   | DADDIU   | R1,R1,#4         | 5                            | 6                              |                                   | 7                                  | 11                            | Commit in order    |
| 2                   | BNE      | R2,R3,L00P       | 6                            | 10                             |                                   |                                    | 11                            | Wait for DADDIU    |
| 3                   | LD       | R2,0(R1)         | 7                            | 8                              | 9                                 | 10                                 | 12                            | Earliest possible  |
| 3                   | DADDIU   | R2,R2,#1         | 7                            | 11                             |                                   | 12                                 | 13                            | Wait for LW        |
| 3                   | SD       | R2,0(R1)         | 8                            | 9                              |                                   |                                    | 13                            | Wait for DADDIU    |
| 3                   | DADDIU   | R1,R1,#4         | 8                            | 9                              |                                   | 10                                 | 14                            | Executes earlier   |
| 3                   | BNE      | R2,R3,L00P       | 9                            | 13                             |                                   |                                    | 14                            | Wait for DADDIU    |
| Figure 3.           | 34 The t | ime of issue, ex | ecution, and                 | d writing res                  | sult for a dual-i                 |                                    | n of our pipe                 | eline with specula |































| Before:<br>1 L.D<br>2 ADD<br>3 S.D<br>4 L.D<br>5 ADD<br>6 S.D<br>7 L.D<br>8 ADD<br>9 S.D | F0<br>.D F4<br>F4<br>F0<br>.D F4<br>F4<br>F0<br>.D F4<br>F4<br>DUI R1 | <pre>,F0,F2 ,0(R1) ,-8(R1) ,F0,F2 ,-8(R1) ,-16(R1)</pre> | Afte<br>1<br>2<br>3<br>4<br>5 | r: Softwar<br>L.D<br>ADD.D<br>L.D<br>S.D<br>ADD.D<br>L.D<br>DADDUI<br>BNE<br>S.D<br>ADDD<br>S.D | F0,-8(R1)<br>F4,0(R1) ;Stores M<br>F4,F0,F2 ;Adds to M<br>F0,-16(R1);Loads M[ | ([i-1] |
|------------------------------------------------------------------------------------------|-----------------------------------------------------------------------|----------------------------------------------------------|-------------------------------|-------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------|--------|
|------------------------------------------------------------------------------------------|-----------------------------------------------------------------------|----------------------------------------------------------|-------------------------------|-------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------|--------|

