The Processor

The Processor (Part 1)

We're ready to look at an implementation of the MIPS processor.

• Simplified to contain only:
   – memory-reference instructions: lw, sw
   – arithmetic-logical instructions: add, sub, and, or, slt
   – control flow instructions: branch-on-equal (beq), jump (j)
• Generic Implementation:
   – use the program counter (PC) to supply instruction address
   – get the instruction from memory
   – read registers
   – use the instruction to decide exactly what to do
• All instructions use the ALU after reading the registers
     Why? memory-reference? arithmetic? control flow?


For every instruction, the first two steps are identical:
  1. Send the PC to the memory that contains the code and fetch the instruction from that memory.
  2. Read 1 or 2 registers, using fields of the instruction to select the registers to read. For the load word instruction we need to read only 1 register, but most other instructions require that we read 2 registers.
    • After these two steps, the actions required to complete the instruction depend on the instruction class (memory-reference, arithmetic-logical, or branch).
    • All instruction classes use the ALU after reading the registers:
                – Memory-reference instructions use the ALU for an address calculation.
                – Arithmetic-logical instructions use the ALU for operation execution.
                – Branch instructions use the ALU for comparison.
    3. After using the ALU, the actions required to complete the different instruction classes differ:
            – A memory-reference instruction will need to access the memory either to write data for a  store or read data for a load.
            – An arithmetic-logical instruction must write the data from the ALU back into a register.
            – A branch instruction may need to change the next instruction address based on the comparison.                              
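
To make these steps concrete, here is a minimal Python sketch (purely illustrative; the tuple-based instruction format and the memory contents are invented for this example, not real MIPS encodings) of one instruction going through fetch, register read, ALU use, and the class-dependent completion step.

    # Toy instruction format: (op, rs, rt, x) where x is rd for R-type,
    # the offset for lw/sw, and the branch target for beq.
    regs = [0] * 32
    instr_mem = {0: ("add", 1, 2, 3), 4: ("lw", 1, 2, 8), 8: ("beq", 1, 2, 0)}
    data_mem = {}

    def step(pc):
        op, rs, rt, x = instr_mem[pc]          # 1. fetch the instruction using the PC
        a, b = regs[rs], regs[rt]              # 2. read one or two registers
        if op in ("lw", "sw"):                 # memory-reference: ALU computes the address
            addr = a + x
            if op == "lw":
                regs[rt] = data_mem.get(addr, 0)
            else:
                data_mem[addr] = b
        elif op == "beq":                      # branch: ALU compares the two registers
            return x if a == b else pc + 4     # may change the next instruction address
        else:                                  # arithmetic-logical: ALU result -> register
            regs[x] = a + b if op == "add" else a - b
        return pc + 4                          # default: next sequential instruction

    pc = 0
    pc = step(pc)                              # executes the add; pc becomes 4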




Implementation Details of MIPS

The functional units in the MIPS implementation consist of two different types of logic elements:
      1. Elements that operate on data values:
  • These elements are all combinational, which means that their outputs depend only on the current inputs.
  • Given the same input, a combinational element always produces the same output, because it has no internal storage.
  • The ALU is a combinational element.
      2. Elements that contain state:
  • An element contains state if it has some internal storage.
  • We call these elements state elements because, if we pulled the plug on the machine, we could restart it by loading the state elements with the values they contained before we pulled the plug.
  • Logic components that contain state are called sequential because their outputs depend on both their inputs and the contents of the internal state.
  • The instruction and data memories as well as the registers are all examples of state elements.
  • A state element has at least two inputs and one output.
  • The required inputs are the data value to be written into the element, and the clock, which determines when the data value is written.
  • The output from a state element provides the value that was written in an earlier clock cycle.
State Elements

Unclocked vs. Clocked

Clocks used in synchronous logic
– when should an element that contains state be updated?

An unclocked state element

The set-reset latch
– output depends on present inputs and also on past inputs

Example of State elements: D Flip-flops
  • A D-type flip-flop has exactly two inputs (a value and a clock) and one output.
  • The clock is used to determine when the state element should be written; a state element can be read at any time.
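
For intuition only, here is a small behavioural model of an edge-triggered D flip-flop in Python: the stored bit changes only on the rising clock edge, while the output can be read at any time.

    class DFlipFlop:
        def __init__(self):
            self.q = 0             # internal storage (the state)
            self.prev_clk = 0      # previous clock level, used to detect the edge

        def tick(self, d, clk):
            if clk == 1 and self.prev_clk == 0:   # rising clock edge
                self.q = d                        # capture the data input
            self.prev_clk = clk
            return self.q                         # Q can be read at any time

    ff = DFlipFlop()
    ff.tick(d=1, clk=0)   # no edge: Q stays 0
    ff.tick(d=1, clk=1)   # rising edge: Q becomes 1
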
Clocking Methodology

A clocking methodology defines when signals can be read and when they can be written.
  • If a signal is written at the same time it is read, the value of the read could correspond to the old value, the newly written value, or even some mix of the two!
  • Assume an edge-triggered clocking methodology, which means that any values stored in the machine are updated only on a clock edge.
  • The state elements all update their internal storage on the clock edge.
  • Because only state elements can store a data value, any collection of combinational logic must have its inputs coming from a set of state elements and its outputs written into a set of state elements.
  • The inputs are values that were written in a previous clock cycle, while the outputs are values that can be used in the following clock cycle.
The MIPS Subset Implementation

We start with a simple implementation that uses a single long clock cycle for every instruction and
follows the general form (figure) in slide 3.
  • In this design, every instruction begins execution on one clock edge and completes execution on the next clock edge.
  • This approach is not practical, since it would be slower than an implementation that allows different instruction classes to take a different number of clock cycles, each of which could be much shorter.
  • An implementation that uses multiple clock cycles for each instruction is more realistic and requires more complex control.

BUILDING A DATAPATH (PART 2)

Instruction Memory, Program Counter, and Adder

The first element needed is a place to store the program instructions. Memory is used to hold and supply the instructions given an address. The address must be kept in the Program Counter (PC), and in order to increment the PC to the address of the next instruction, we also need an Adder. All these elements are shown in the figure. After fetching one instruction from the instruction memory, the program counter has to be incremented so that it points to the address of the next instruction, 4 bytes later.

Include the functional units we need for each instruction
A portion of the datapath used for fetching instructions and incrementing the PC


  • The PC is a 32-bit register that will be written at the end of every clock cycle.
  • The adder is an ALU wired to always perform an add of its two 32-bit inputs and place the result on its output.
Register File

R-format instructions (add, sub, slt, and, or) all read two registers, perform an ALU operation on the contents of the registers, and write the result.
  • The processor’s 32 registers are stored in a structure called a register file.
  • A register file is a collection of registers in which any register can be read or written by specifying the number of the register in the file.
  • R-format instructions have 3 register operands, so we will need to read two data words from the register file and write one data word into the register file for each instruction.
  • For each data word to be read from the registers:
            – We need an input to the register file that specifies the register number to be read
            – We need an output from the register file that will carry the value that has been read from the registers.
  • To write a data word, we need two inputs:
           – One to specify the register number to be written
           – One to supply the data to be written into the register.
  • We need a total of 4 inputs (3 for register numbers and 1 for data) and 2 outputs (both for data)
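
A functional sketch of that interface in Python (not the gate-level design): two register-number inputs and two data outputs for reading, plus a register-number input and a data input for writing.

    class RegisterFile:
        def __init__(self):
            self.regs = [0] * 32                  # 32 registers of 32 bits each

        def read(self, read_reg1, read_reg2):
            # two 5-bit register-number inputs -> two 32-bit data outputs
            return self.regs[read_reg1], self.regs[read_reg2]

        def write(self, write_reg, write_data, reg_write):
            # write happens only when the RegWrite control signal is asserted;
            # register 0 is kept at zero, as in MIPS
            if reg_write and write_reg != 0:
                self.regs[write_reg] = write_data & 0xFFFFFFFF

    rf = RegisterFile()
    rf.write(8, 108, reg_write=True)
    a, b = rf.read(8, 9)                          # a == 108, b == 0
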
Register File (Built using D flip-flops)


we still use the real clock to determine when to write


Register File and ALU
  • The register number inputs are 5 bits wide to specify one of the 32 registers (32 = 2^5), whereas the data input and the two output buses are each 32 bits wide.
  • The ALU is controlled by a 3-bit signal; it takes two 32-bit inputs and produces a 32-bit result, as shown in the figure.


The datapath for R-type instructions

The two elements needed to implement R-format ALU operations are the register file and the ALU, as sketched below.
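
A rough Python sketch of how an R-type instruction uses these two elements (the 3-bit ALU control encodings here are assumed for illustration, not the exact encodings used in the book):

    regs = [0] * 32                                # toy register file contents
    regs[8], regs[9] = 108, 109                    # sample source values

    def alu(ctrl, a, b):
        # illustrative 3-bit control encodings for and/or/add/sub/slt
        ops = {0b000: a & b, 0b001: a | b, 0b010: a + b,
               0b110: a - b, 0b111: int(a < b)}
        return ops[ctrl]

    def execute_rtype(alu_ctrl, rs, rt, rd):
        a, b = regs[rs], regs[rt]                  # read the two source registers
        regs[rd] = alu(alu_ctrl, a, b)             # ALU result written back to rd

    execute_rtype(0b010, rs=8, rt=9, rd=10)        # add $10, $8, $9 -> regs[10] == 217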
  

MIPS load word and store word instructions
  • Load word instruction: lw $t1, offset_value($t2)
  • Store word instruction: sw $t1, offset_value($t2)
  • These instructions compute a memory address by adding the base register, which is $t2, to the 16-bit signed offset field contained in the instruction.
  • If the instruction is a store, the value to be stored must be read from the register file where it resides in $t1.
  • If the instruction is a load, the value read from the memory must be written into register file in the specified register, which is $t1.
  • Therefore we will need the following:
             – Both the register file and ALU.
             – A unit to sign-extend the 16-bit offset field in the instruction to a 32-bit signed value.
             – A data memory unit to read from or write to.
  • The data memory must be written on store instructions and read on load instructions, so it has both read and write control signals, an address input, and an input for the data to be written into memory.
Data Memory and Sign-extension Units


Datapath for MIPS load and store instructions
  • Assume the instruction has already been fetched.
  • The register number inputs for the register file come from fields of the instruction, as does the offset value, which after sign extension becomes the second ALU input.
  • The figure in the next slide shows the datapath for a load or a store: the instruction does a register access, followed by a memory address calculation, then a read from or a write to memory, and a write into the register file if the instruction is a load.
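
A matching sketch for lw/sw in Python (a dictionary stands in for the data memory; everything else is illustrative): the 16-bit offset is sign-extended, the ALU forms the address, and the data memory is then read or written.

    def sign_extend_16(value):
        # extend a 16-bit field to a 32-bit signed value
        return value - 0x10000 if value & 0x8000 else value

    regs = [0] * 32
    regs[10] = 0x100                                # base register ($t2)
    regs[9] = 42                                    # value to store ($t1)
    data_mem = {}

    def execute_mem(op, rt, base, offset16):
        addr = regs[base] + sign_extend_16(offset16)    # ALU computes the address
        if op == "lw":
            regs[rt] = data_mem.get(addr, 0)            # MemRead: load into rt
        elif op == "sw":
            data_mem[addr] = regs[rt]                   # MemWrite: store rt

    execute_mem("sw", rt=9, base=10, offset16=0xFFFC)   # sw $t1, -4($t2)
    execute_mem("lw", rt=8, base=10, offset16=0xFFFC)   # lw $t0, -4($t2) -> regs[8] == 42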

PIPELINING (PART 3)

MIPS Pipeline
1.     Five stages, one step per stage
  • IF - Instruction fetch from memory
  • ID - Instruction decode and register read
  • EX - Execute operation or calculate address
  • MEM - Access memory operand
  • WB - Write result back to register


Basic Pipelining

Basic := single, in-order issue
• single issue := one instruction at a time (per stage)
• in-order issue := instructions (start to) execute in order
• next unit: multiple issue
• unit after that: out-of-order issue

Pipelining principles
• tradeoff: clock rate vs. IPC
• hazards: structural, data, control

Vanilla pipeline: single-cycle operations
• structural hazards, RAW hazards, control hazards

Dealing with multi-cycle operations
• more structural hazards, WAW hazards, precise state


Pipelining

observe: instruction processing consists of N sequential stages

idea: overlap different instructions at different stages


The Laundry Analogy


If we do laundry sequentially?



To pipeline, we overlap tasks




Pipelining a digital system
  • 1 nanosecond = 10^-9 second
  • 1 picosecond = 10^-12 second
Key idea: break big computation up into pieces


Separate each piece with a pipeline register


Pipelining a digital system

Why do this?  Because it's faster for repeated computations.


      •Pipelining increases throughput, but not latency
        –Answer available every 200ps, BUT
        –A single computation still takes 1ns

      •Limitations:
        –Computations must be divisible into stages of similar size
        –Pipeline registers add overhead 
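
A quick check of these numbers (assuming, as in the five-stage MIPS pipeline, that the 1 ns computation splits into five 200 ps stages; the stage count is an assumption here):

    STAGE_DELAY_PS = 200                            # delay of one pipeline stage
    NUM_STAGES = 5                                  # assumed number of stages
    latency_ps = STAGE_DELAY_PS * NUM_STAGES        # 1000 ps = 1 ns per computation
    results_per_ns = 1000 / STAGE_DELAY_PS          # one result every 200 ps -> 5 per ns
    print(latency_ps, results_per_ns)               # 1000 5.0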

Pipeline Example - lw






Single-Cycle vs. Pipelined Execution


Pipeline Speedup
       Consider the unpipelined processor introduced previously. Assume that it has a 1 ns clock cycle and that it uses 4 cycles for ALU operations and branches and 5 cycles for memory operations. Assume that the relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that, due to clock skew and setup, pipelining the processor adds 0.2 ns of overhead to the clock. Ignoring any latency impact, how much speedup in the instruction execution rate will we gain from a pipeline?

Average instruction execution time (unpipelined)
    = 1 ns * ((40% + 20%)*4 + 40%*5)
    = 4.4 ns

Average instruction execution time (pipelined)
    = 1 ns + 0.2 ns overhead = 1.2 ns

Speedup from pipelining
    = Average instruction time unpipelined / Average instruction time pipelined
    = 4.4 ns / 1.2 ns ≈ 3.7
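
The same arithmetic as a short script (values taken from the example above):

    # Unpipelined: 1 ns clock; ALU ops and branches take 4 cycles, memory ops take 5.
    freq = {"alu": 0.40, "branch": 0.20, "mem": 0.40}
    cycles = {"alu": 4, "branch": 4, "mem": 5}
    avg_unpipelined_ns = 1.0 * sum(freq[k] * cycles[k] for k in freq)   # 4.4 ns

    # Pipelined: one instruction completes per cycle; clock stretched by 0.2 ns overhead.
    avg_pipelined_ns = 1.0 + 0.2                                        # 1.2 ns

    speedup = avg_unpipelined_ns / avg_pipelined_ns
    print(round(avg_unpipelined_ns, 2), round(speedup, 2))              # 4.4 3.67 (about 3.7)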

Pipeline Hazard

        Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle
        – Structural hazards: two different instructions use same h/w in same cycle
        – Data hazards: Instruction depends on result of prior instruction still in the pipeline
        – Control hazards: Pipelining of branches & other instructions that change the PC



PIPELINED DATAPATH (PART 4)

    - The goal of pipelining is to allow multiple instructions to execute at the same time
    - We may need to perform several operations in a cycle
        * Increment the PC and add registers at the same time.
        * Fetch one instruction while another one reads or writes data.




  • A pipeline diagram shows the execution of a series of instructions.
          - The instruction sequence is shown vertically, from top to bottom.
          - Clock cycles are shown horizontally, from left to right.
          - Each instruction is divided into its component stages. (We show five stages for every instruction, which will make the control unit easier.)
  • This clearly indicates the overlapping of instructions. For example, there are three instructions active in the third cycle above.
          - The “lw” instruction is in its Execute stage.
          - Simultaneously, the “sub” is in its Instruction Decode stage.
          - Also, the “and” instruction is just being fetched.
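
This overlap can also be generated programmatically; the following sketch prints which stage each of the lw, sub, and and instructions occupies in each cycle (an idealised diagram with no stalls).

    STAGES = ["IF", "ID", "EX", "MEM", "WB"]
    instructions = ["lw", "sub", "and"]

    # Instruction i enters the pipeline in cycle i+1, so in cycle c it is in stage c-i-1.
    for cycle in range(1, len(instructions) + len(STAGES)):
        active = []
        for i, name in enumerate(instructions):
            stage = cycle - i - 1
            if 0 <= stage < len(STAGES):
                active.append(f"{name}:{STAGES[stage]}")
        print(f"cycle {cycle}: " + "  ".join(active))
    # cycle 3 prints "lw:EX  sub:ID  and:IF", matching the description above.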


          Pipeline Terminology


      - The pipeline depth is the number of stages—in this case, five.
      - In the first four cycles here, the pipeline is filling, since there are unused functional units.
     - In cycle 5, the pipeline is full. Five instructions are being executed simultaneously, so all hardware units are in use.
     - In cycles 6-9, the pipeline is emptying.
     - Now we’ll see a basic implementation of a pipelined processor.
     - The datapath and control unit share similarities with both the single-cycle and multicycle implementations that we already saw.
     - An example execution highlights important pipelining concepts.
     - In future lectures, we’ll discuss several complications of pipelining that we’re hiding from you for now.

          One register file is enough

- We need only one register file to support both the ID and WB stages.


- Reads and writes go to separate ports on the register file
- Writes occur in the first half of the cycle, reads occur in the second half 

    Pipelining Concept

      -  A pipelined processor allows multiple instructions to execute at once, and each instruction  uses a different functional unit in the datapath.
      - This increases throughput, so programs can run faster.
            * One instruction can finish executing on every clock cycle, and simpler stages also lead to shorter cycle times.


    Pipelined Datapath

-   The whole point of pipelining is to allow multiple instructions to execute at the same time.
-   We may need to perform several operations in the same cycle.
        * Increment the PC and add registers at the same time. 
        * Fetch one instruction while another one reads or writes data.


       - Thus, like the single-cycle datapath, a pipelined processor will need to duplicate hardware elements that are needed several times in the same clock cycle.
       
        Single-cycle datapath, slightly rearranged


         What’s been changed?

                1. Almost nothing! This is equivalent to the original single-cycle datapath.

                        - There are separate memories for instructions and data.
                        - There are two adders for PC-based computations and one ALU.
                        - The control signals are the same.

                  2. Only some cosmetic changes were made to make the diagram smaller.
                         - A few labels are missing, and the muxes are smaller.
                        - The data memory has only one Address input. The actual memory operation can be determined from the MemRead and MemWrite control signals.

                  3. The datapath components have also been moved around in preparation for adding pipeline registers.
      
            Multiple Cycles
        
                   1.   In pipelining, we also divide instruction execution into multiple cycles.
                   2.   Information computed during one cycle may be needed in a later cycle.
                      - The instruction read in the IF stage determines which registers are fetched in the ID stage, what constant is used for the EX stage, and what the destination register is for WB.
                      - The registers read in ID are used in the EX and/or MEM stages.
                      - The ALU output produced in the EX stage is an effective address for the MEM stage or a result for the WB stage.
                  3. We added several intermediate registers to the multicycle datapath to preserve information between stages, as highlighted on the next slide.
                  
               Registers added to the multi-cycle datapath


            Pipeline Registers
    
          1. We’ll add intermediate registers to our pipelined datapath too.
          2. There’s a lot of information to save, however. We’ll simplify our diagrams by drawing just one big pipeline register between each stage.
          3. The registers are named for the stages they connect.
  IF/ID  ID/EX  EX/MEM  MEM/WB
            4. No register is needed after the WB stage, because after WB the instruction is done.
       
             Pipelined Datapath
      
             - The control signals are grouped together in the pipeline registers, just to make the diagram a little clearer.
             - Not all of the registers have a write enable signal.
                   * Because the datapath fetches one instruction per cycle, the PC must also be updated on each clock cycle. Including a write enable for the PC would be redundant.
                   * Similarly, the pipeline registers are also written on every cycle, so no explicit write signals are needed.
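
To make the role of the IF/ID, ID/EX, EX/MEM, and MEM/WB registers concrete, here is a very reduced Python sketch in which each pipeline register is a dictionary latched on every clock edge (the field names are invented for illustration).

    # Each pipeline register is modelled as a dict rewritten on every clock edge.
    IF_ID  = {"instr": None, "pc_plus4": None}
    ID_EX  = {"reg_a": None, "reg_b": None, "imm": None, "dest": None}
    EX_MEM = {"alu_result": None, "store_val": None, "dest": None}
    MEM_WB = {"mem_or_alu": None, "dest": None}

    def clock_edge(new_if_id, new_id_ex, new_ex_mem, new_mem_wb):
        # All four registers update together, with no explicit write enable,
        # so each stage only ever sees values produced in the previous cycle.
        IF_ID.update(new_if_id)
        ID_EX.update(new_id_ex)
        EX_MEM.update(new_ex_mem)
        MEM_WB.update(new_mem_wb)

    clock_edge({"instr": "lw $8, 4($29)", "pc_plus4": 4}, {}, {}, {})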
                 
          An Example Execution Sequence

- Here’s a sample sequence of instructions to execute.


- We’ll make some assumptions, just so we can show actual data values.
     * Each register contains its number plus 100. For instance, register $8 contains 108, register $29 contains 129, and so forth.
     * Every data memory location contains 99.

- Our pipeline diagrams will follow some conventions.
     * An X indicates values that aren’t important, like the constant field of an R-type instruction.
     * Question marks ??? indicate values we don’t know, usually resulting from instructions coming before and after the ones in our example.

HAZARD (PART 5)

           In the domain of central processing unit (CPU) design, hazards are problems with the instruction pipeline in CPU microarchitectures when the next instruction cannot execute in the following clock cycle, and can potentially lead to incorrect computation results. There are typically three types of hazards:
  • data hazards
  • structural hazards
  • control hazards (branching hazards)

There are several methods used to deal with hazards, including pipeline stalls (pipeline bubbling), operand forwarding, and, in the case of out-of-order execution, scoreboarding.

Control Hazards

1. Branch determines flow of control
  • Fetching next instruction depends on branch outcome
  • Pipeline can’t always fetch correct instruction
            - Still working on ID stage of branch

2. In the MIPS Pipeline
  • Need to compare registers and compute target early in the pipeline
  • Assume we add hardware to do it in the ID stage: test the registers, calculate the branch address, and update the PC during the 2nd stage of the pipeline.
Control hazards can cause a greater performance loss for the DLX pipeline than data hazards. When a branch is executed, it may or may not change the PC (program counter) to something other than its current value plus 4. If a branch changes the PC to its target address, it is a taken branch; if it falls through, it is not taken.


MIPS with Predict Not Taken




Stall On Branch

Wait until branch outcome determined before fetching next instruction

Figure: a pipeline showing stalling on every conditional branch as a solution to control hazards.

This example assumes the conditional branch is taken, and the instruction at the destination of the branch is the OR instruction.

Branch Prediction

- Longer pipelines can’t readily determine branch outcome early
  • Stall penalty becomes unacceptable
- Predict outcome of branch
  • Only stall if prediction is wrong
- In MIPS pipeline
  • Can predict branches not taken
  • When you are right, fetch instruction after branch, with no delay
* When the prediction is wrong, the branch takes one extra clock cycle for the stall, which slows down throughput.

More Realistic Branch Prediction

1. Static Branch Prediction
  • Based on typical branch behavior
  • Example: loop and if-statement branches
           - Predict backward branches taken (loops usually jump backward)
           - Predict forward branches not taken

2. Dynamic Branch Prediction
  • Hardware measures actual branch behavior
        * e.g., record recent history of each branch (as taken or untaken branch)
  • Assume future behavior (predict from past behavior) will continue the trend
        * When wrong, stall while re-fetching, and update history
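
One common way to record that history is a 2-bit saturating counter per branch; here is a sketch of that scheme in Python (the counter thresholds shown are one typical choice, not the only one).

    class TwoBitPredictor:
        # Counter states: 0,1 -> predict not taken; 2,3 -> predict taken.
        def __init__(self):
            self.counters = {}                     # branch address -> counter value

        def predict(self, pc):
            return self.counters.get(pc, 1) >= 2   # start in "weakly not taken"

        def update(self, pc, taken):
            c = self.counters.get(pc, 1)
            self.counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)

    bp = TwoBitPredictor()
    taken = True                                   # actual outcome once the branch resolves
    mispredicted = bp.predict(0x400048) != taken   # if so: stall while re-fetching
    bp.update(0x400048, taken)                     # then update the recorded history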

Conclusion

1. Pipelining improves performance by increasing instruction throughput
  • Executes multiple instructions in parallel
  • Each instruction has the same latency
2. Subject to hazards
  • Structural, data, control
3. Instruction set design affects complexity of pipeline implementation

NOR EZREEN FARA BINTI KHALID - B031410024
NUR DIYANA BINTI DZOLKIFLI - B031410414
NUR SYAZWANI BINTI MAHADZIR - B031410014
NURUL HASLINDA BINTI MOHAMMAD - B03141017
NOR NURUL AIN BINTI HASSAN - B031410032