





| Basic Concepts (cont'd) |
|---|
| Execution cycle stages: IF (instruction fetch), ID (instruction decode), OF (operand fetch), IE (instruction execution), WB (result write back) |
| (a) Instruction execution phase stages |
| Floating-point add stages: Unpack, Align, Add, Normalize |
| (b) Floating-point add pipeline stages |
| © S. Dandamudi, Chapter 8: Page 4. To be used with S. Dandamudi, "Fundamentals of Computer Organization and Design," Springer, 2003. |
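
To illustrate how the five instruction stages overlap, here is a minimal sketch of an ideal pipeline. It assumes one clock cycle per stage and no stalls. The name `pipeline_schedule` is hypothetical, and `WB` is assumed as the label for the result write-back stage.

```python
# Stage labels taken from the slide; "WB" (result write back) is an assumed label.
STAGES = ["IF", "ID", "OF", "IE", "WB"]


def pipeline_schedule(num_instructions, stages=STAGES):
    """Return one row per instruction showing its stage in each clock cycle.

    Assumes an ideal pipeline: one cycle per stage, a new instruction
    enters every cycle, and there are no stalls. An empty string means the
    instruction is not in the pipeline during that cycle.
    """
    total_cycles = len(stages) + num_instructions - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(stages):
            row[i + s] = name  # instruction i enters stage s in cycle i + s
        rows.append(row)
    return rows


# Print a small space-time diagram for four instructions.
for i, row in enumerate(pipeline_schedule(4), start=1):
    print(f"I{i}: " + " ".join(f"{cell:>2}" for cell in row))
```

Under these assumptions, four instructions finish in 5 + 4 - 1 = 8 cycles rather than the 20 cycles a non-pipelined design would need.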







































|      | Branch Prediction                                                                                                                      |
|------|----------------------------------------------------------------------------------------------------------------------------------------|
|      | • Three prediction strategies                                                                                                          |
|      | * Fixed                                                                                                                                |
|      | » Prediction is fixed                                                                                                                  |
|      | – Example: branch-never-taken                                                                                                          |
|      | → Not suitable for loop structures                                                                                                     |
|      | * Static                                                                                                                               |
|      | » Strategy depends on the branch type                                                                                                  |
|      | – Conditional branch: always not taken                                                                                                 |
|      | – Loop: always taken                                                                                                                   |
|      | * Dynamic                                                                                                                              |
|      | » Uses run-time history to make more accurate predictions                                                                              |

| Static prediction |
|---|
| * Improves prediction accuracy over Fixed |

| Instruction type     | Instruction distribution (%) | Prediction: branch taken? | Correct prediction (%) |
|----------------------|------------------------------|---------------------------|------------------------|
| Unconditional branch | 70*0.4 = 28                  | Yes                       | 28                     |
| Conditional branch   | 70*0.6 = 42                  | No                        | 42*0.6 = 25.2          |
| Loop                 | 10                           | Yes                       | 10*0.9 = 9             |
| Call/return          | 20                           | Yes                       | 20                     |
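
The arithmetic in the table can be checked with a short sketch. The per-type hit rates (1.0, 0.6, 0.9, 1.0) are read off the "Correct prediction" column. The names `STATIC_TABLE` and `overall_accuracy` are hypothetical.

```python
# (instruction type, share of all branch-type instructions in %,
#  fraction of those predicted correctly by the static strategy)
STATIC_TABLE = [
    ("Unconditional branch", 70 * 0.4, 1.0),  # predicted taken
    ("Conditional branch", 70 * 0.6, 0.6),    # predicted not taken
    ("Loop", 10, 0.9),                        # predicted taken
    ("Call/return", 20, 1.0),                 # predicted taken
]


def overall_accuracy(table):
    """Sum the correctly predicted share (in %) over all instruction types."""
    return sum(share * hit_rate for _, share, hit_rate in table)


print(f"Overall static prediction accuracy: {overall_accuracy(STATIC_TABLE):.1f}%")
```

Summing the last column gives 28 + 25.2 + 9 + 20 = 82.2%.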

| Branch Prediction (cont'd) |
|---|
| • Dynamic branch prediction |
| * Uses run-time history |
| » Takes the past <i>n</i> branch executions of the branch type and makes the prediction |
| * Simple strategy |
| » Prediction of the next branch is the <b>majority</b> of the previous <i>n</i> branch executions |
| » Example: <i>n</i> = 3 |
| – If two or more of the last three branches were taken, the prediction is "branch taken" |
| » Depending on the type of mix, we get more than 90% prediction accuracy |
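
The majority strategy can be sketched as follows. This is an illustration under stated assumptions, not a model of any particular processor. The history starts as "not taken", a tie (possible only for even n) predicts "not taken", and the names `MajorityPredictor` and `accuracy` are hypothetical.

```python
from collections import deque


class MajorityPredictor:
    """Predict 'taken' when more than half of the last n outcomes were taken."""

    def __init__(self, n=3, initial_taken=False):
        # Assumption: the history is pre-filled with a fixed initial outcome.
        self.history = deque([initial_taken] * n, maxlen=n)

    def predict(self):
        # Strict majority; a tie (even n) falls back to "not taken".
        return sum(self.history) * 2 > len(self.history)

    def update(self, taken):
        # Record the actual outcome; the oldest entry drops out automatically.
        self.history.append(taken)


def accuracy(outcomes, n=3):
    """Fraction of branch outcomes that the majority predictor gets right."""
    predictor = MajorityPredictor(n)
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct / len(outcomes)


# A loop branch taken nine times and then not taken, repeated ten times.
loop_outcomes = ([True] * 9 + [False]) * 10
print(accuracy(loop_outcomes, n=3))
```

For this synthetic pattern the predictor misses the first two iterations while the history warms up, and afterwards misses only each loop exit, giving 88 correct predictions out of 100.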

| Branch Prediction (cont'd) |
|---|
| • Impact of past <i>n</i> branches on prediction accuracy |

| <i>n</i> | Type of mix: Compiler | Type of mix: Business | Type of mix: Scientific |
|----------|-----------------------|-----------------------|-------------------------|
| 0        | 64.1                  | 64.4                  | 70.4                    |
| 1        | 91.9                  | 95.2                  | 86.6                    |
| 2        | 93.3                  | 96.5                  | 90.8                    |
| 3        | 93.7                  | 96.6                  | 91.0                    |
| 4        | 94.5                  | 96.8                  | 91.8                    |
| 5        | 94.7                  | 97.0                  | 92.0                    |
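
A small sketch makes the diminishing returns in the table explicit by computing the gain in accuracy for each additional branch of history. The data is copied from the table. The names `ACCURACY` and `gains` are hypothetical.

```python
# Prediction accuracy from the table, indexed by n = 0..5.
ACCURACY = {
    "Compiler": [64.1, 91.9, 93.3, 93.7, 94.5, 94.7],
    "Business": [64.4, 95.2, 96.5, 96.6, 96.8, 97.0],
    "Scientific": [70.4, 86.6, 90.8, 91.0, 91.8, 92.0],
}


def gains(values):
    """Accuracy gained by going from n to n + 1 branches of history."""
    return [round(after - before, 1) for before, after in zip(values, values[1:])]


for mix, values in ACCURACY.items():
    print(f"{mix:<10} {gains(values)}")
```

Most of the benefit comes from remembering a single past branch; each further step adds a few points at most.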













| Performance Enhancements (cont'd) | |
|---|---|
| • Superpipelined processors | |
| * Increases pipeline depth | |
| » Ex: Divide each processor cycle into two or more subcycles | |
| * Example: MIPS R4000 | |
| » Eight-stage instruction pipeline | |
| » Each stage takes half the master clock cycle | |
| IF1 & IF2 | Instruction fetch, first half and second half |
| RF | Decode/fetch operands |
| EX | Execute |
| DF1 & DF2 | Data fetch (load/store), first half and second half |
| TC | Load/store check |
| WB | Write back |
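
The effect of half-cycle stages on timing can be sketched as below. The sketch assumes an ideal pipeline with no stalls and one instruction entering per half cycle. The function name `ideal_time_in_master_cycles` is hypothetical.

```python
# Stage names from the slide; each occupies half a master clock cycle.
R4000_STAGES = ["IF1", "IF2", "RF", "EX", "DF1", "DF2", "TC", "WB"]


def ideal_time_in_master_cycles(num_instructions, stages=R4000_STAGES,
                                stage_fraction=0.5):
    """Master clock cycles needed to complete num_instructions.

    Assumes no stalls; stage_fraction is the stage time expressed as a
    fraction of the master clock cycle (0.5 for the pipeline above).
    """
    if num_instructions <= 0:
        return 0.0
    # The first instruction needs len(stages) slots; each later one adds a slot.
    stage_slots = len(stages) + num_instructions - 1
    return stage_slots * stage_fraction


print(ideal_time_in_master_cycles(1))    # latency of a single instruction
print(ideal_time_in_master_cycles(100))  # time for 100 back-to-back instructions
```

Under these assumptions a single instruction takes 4 master cycles, and a long run of instructions approaches two completed instructions per master cycle.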











| Pentium Pipeline |
|---|
| • Pentium |
| * Uses dual pipeline design to achieve superscalar execution |
| » U-pipe |
| – Main pipeline |
| – Can execute any Pentium instruction |
| » V-pipe |
| – Can execute only simple instructions |
| * Floating-point pipeline |
| * Uses the dynamic branch prediction strategy |







| Integer pipeline |
|---|
| * Prefetch (PF) |
| » Prefetches instructions and stores them in the instruction buffer |
| * First decode (D1) |
| » Decodes instructions and generates |
| – Single control word (for simple operations) |
| → Can be executed directly |
| – Sequence of control words (for complex operations) |
| → Generated by a microprogrammed control unit |
| * Second decode (D2) |
| » Control words generated in D1 are decoded |
| » Generates necessary operand addresses |



































|      | MIPS Processor                                           |
|------|----------------------------------------------------------|
|      | • MIPS R4000 processor                                   |
|      | * Superpipelined design                                  |
|      | » Instruction pipeline runs at twice the processor clock |
|      | – Details discussed before                               |
|      | * Like SPARC, uses 8-stage instruction pipeline for both integer and FP instructions |
|      | * FP unit has three functional units                     |
|      | » Adder, multiplier, and divider                         |
|      | » Divider unit is not pipelined                          |
|      | – Allows only one operation at a time                    |
|      | » Multiplier unit is pipelined                           |
|      | – Allows up to two instructions                          |















| Cray X-MP (cont'd) | |
|---|---|
| • Address registers | |
| * Eight 24-bit address registers (A0 – A7) | |
| » Hold memory address for load and store operations | |
| * Two functional units to perform address arithmetic operations | |
| 24-bit integer ADD | 2 stages |
| 24-bit integer MULTIPLY | 4 stages |
| * Cray assembly language format | |
| Ai Aj+Ak | (Ai = Aj+Ak) |
| Ai Aj*Ak | (Ai = Aj*Ak) |

| Cray X-MP (cont'd) | |
|---|---|
| • Scalar registers | |
| * Eight 64-bit scalar registers (S0 – S7) | |
| * Four types of functional units | |
| Scalar functional unit | # of stages |
| Integer add (64-bit) | 3 |
| 64-bit shift | 2 |
| 128-bit shift | 3 |
| 64-bit logical | 1 |
| POP/Parity (population/parity) | 4 |
| POP/Parity (leading zero count) | 3 |



| Cray X-MP (cont'd) |
|---|
| Vector functional units |

| Vector functional unit   | # of stages | Avail. to chain | Results |
|--------------------------|-------------|-----------------|---------|
| 64-bit integer ADD       | 3           | 8               | VL + 8  |
| 64-bit SHIFT             | 3           | 8               | VL + 8  |
| 128-bit SHIFT            | 4           | 9               | VL + 9  |
| Full vector LOGICAL      | 2           | 7               | VL + 7  |
| Second vector LOGICAL    | 4           | 9               | VL + 9  |
| POP/Parity               | 5           | 10              | VL + 10 |
| Floating ADD             | 6           | 11              | VL + 11 |
| Floating MULTIPLY        | 7           | 12              | VL + 12 |
| Reciprocal approximation | 14          | 19              | VL + 19 |
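
In every row of the vector functional unit table, the "Avail. to chain" value equals the number of stages plus 5, and the "Results" value equals the vector length VL plus that figure. The sketch below encodes that pattern. The constant 5 is an observation about the table's numbers rather than a documented hardware parameter, the values are assumed to be clock cycles, and the names `chain_slot` and `result_time` are hypothetical.

```python
# Number of pipeline stages per vector functional unit, copied from the table.
VECTOR_UNIT_STAGES = {
    "64-bit integer ADD": 3,
    "64-bit SHIFT": 3,
    "128-bit SHIFT": 4,
    "Full vector LOGICAL": 2,
    "Second vector LOGICAL": 4,
    "POP/Parity": 5,
    "Floating ADD": 6,
    "Floating MULTIPLY": 7,
    "Reciprocal approximation": 14,
}

# Assumption: fixed overhead inferred from the table (avail. to chain = stages + 5).
CHAIN_OVERHEAD = 5


def chain_slot(unit):
    """Time at which the first result is available for chaining."""
    return VECTOR_UNIT_STAGES[unit] + CHAIN_OVERHEAD


def result_time(unit, vector_length):
    """Time at which all VL results are available (the 'Results' column)."""
    return vector_length + chain_slot(unit)


# Example with an assumed vector length of 64 elements.
print(chain_slot("Floating ADD"), result_time("Floating ADD", 64))
```

With VL = 64 as an example value, a floating-point add has its first result available for chaining at 11 and all results available at 64 + 11 = 75.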

| Sample instructions | |
|---|---|
| 1. Vi Vj+Vk | ; Vi = Vj+Vk integer add |
| 2. Vi Sj+Vk | ; Vi = Sj+Vk integer add |
| 3. Vi Vj+FVk | ; Vi = Vj+Vk FP add |
| 4. Vi Sj+FVk | ; Vi = Sj+Vk FP add |
| 5. Vi ,A0,Ak | ; Vi = M(A0;Ak), vector load with stride Ak |
| 6. ,A0,Ak Vi | ; M(A0;Ak) = Vi, vector store with stride Ak |
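
The semantics of the sample instructions can be sketched in Python. This is an illustration only: memory is modeled as a word-addressed list, the vector length is passed explicitly, and all function names are hypothetical.

```python
def vector_load(memory, a0, ak, vector_length):
    """Vi = M(A0;Ak): load elements starting at address a0 with stride ak."""
    return [memory[a0 + i * ak] for i in range(vector_length)]


def vector_store(memory, a0, ak, vi):
    """M(A0;Ak) = Vi: store elements starting at address a0 with stride ak."""
    for i, value in enumerate(vi):
        memory[a0 + i * ak] = value


def vector_add(vj, vk):
    """Vi = Vj+Vk: element-wise add of two vector registers."""
    return [x + y for x, y in zip(vj, vk)]


def scalar_vector_add(sj, vk):
    """Vi = Sj+Vk: add a scalar register to every element of a vector register."""
    return [sj + y for y in vk]


# Load four words with stride 3, add a scalar, and store back with stride 5.
memory = list(range(20))
v1 = vector_load(memory, a0=2, ak=3, vector_length=4)
v2 = scalar_vector_add(10, v1)
vector_store(memory, a0=0, ak=5, vi=v2)
print(v1, v2)
```

The stride lets a single instruction walk through memory at regular intervals, for example down a column of a matrix stored row by row.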


























