#### RESISTIVE COMPUTATION: AVOIDING THE POWER WALL WITH LOW-LEAKAGE, STT-MRAM BASED COMPUTING

<u>Xiaochen Guo</u>, Engin Ipek, and Tolga Soyata

Rochester Computer Systems Architecture Laboratory



# Multicore Scaling Limited by Power

Traditional MOSFET scaling theory relies on reducing V<sub>DD</sub> in proportion to device dimensions

$$P = P_{dynamic} + P_{static} = N \cdot (C_{eff} \cdot V_{DD}^2 \cdot f + I_{leak} \cdot V_{DD})$$

- V<sub>DD</sub> has scaled very slowly since 90nm

Multicore scaling severely challenged by power
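The power relation on this slide, P = N (C_eff V_DD^2 f + I_leak V_DD), can be explored numerically. The sketch below is only an illustration: the function name `chip_power` and every numeric value are assumptions chosen for easy arithmetic, not figures from the talk, and holding I_leak fixed while V_DD changes is a simplification.

```python
def chip_power(n_cores, c_eff, v_dd, freq, i_leak):
    """Return (dynamic, static, total) power in watts for n_cores identical cores."""
    dynamic = n_cores * c_eff * v_dd ** 2 * freq  # switching power, quadratic in V_DD
    static = n_cores * i_leak * v_dd              # leakage power, linear in V_DD
    return dynamic, static, dynamic + static


# Illustrative per-core values only; none of these numbers come from the slides.
baseline = chip_power(n_cores=8, c_eff=1e-9, v_dd=1.0, freq=2e9, i_leak=0.5)
scaled_vdd = chip_power(n_cores=8, c_eff=1e-9, v_dd=0.5, freq=2e9, i_leak=0.5)

print("baseline   (dynamic, static, total):", baseline)
print("half V_DD  (dynamic, static, total):", scaled_vdd)
```

Under these assumptions, halving V_DD cuts dynamic power to a quarter and static power to a half, which is why a stalled V_DD leaves core count as the main driver of total power.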




#### Our Approach: Resistive Computation

- Opportunity: spin-torque transfer magnetoresistive RAM (STT-MRAM)
  - Near-zero leakage power
  - Low-energy read operation
- Goal: selectively migrate on-chip storage and combinational logic to STT-MRAM to reduce power
  - On-chip storage
    - Caches, TLBs, RF, queues
  - Combinational logic
    - Lookup-table (LUT) based computing



#### STT-MRAM

- Desirable properties
  - CMOS compatibility
  - Read speed as fast as SRAM
  - Density comparable to DRAM
  - Unlimited write endurance
- Access transistor
- Key challenge: expensive writes
  - Long switching latency (6.7ns @ 32nm)
  - High switching energy (0.3pJ/bit @ 32nm)



## Switching Time vs. Cell Size

- Faster switching with wider access transistors
  - Faster writes (+)
  - Slower reads (-)
  - Lower density (-)
  - Higher read energy (-)

[Figure: switching time (ns, axis 0 to 7) versus cell size (F<sup>2</sup>, axis 0 to 80). Plot annotations: "L2\$, L1I\$, LUTs, TLBs, MC Queues" and "RF, L1D\$".]

6/21/12

## Fundamental Building Blocks

#### **RAM Arrays and Lookup Tables**

# STT-MRAM Arrays


#### Problem: low write throughput



Existing solutions incur high overheads to sustain adequate write throughput in STT-MRAM arrays



# STT-MRAM Arrays

- CMOS subbank buffers
  - Latch in addr/data and release H-tree; complete write locally
  - Allow forwarding from ongoing writes
  - Facilitate local differential writes
- Reads access subbank via exclusive read port
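A toy model can make the subbank-buffer bullets concrete. This is a minimal sketch under stated assumptions, not the authors' design: the class `SubbankBuffer`, its single-entry buffer, and its timing interface are hypothetical. Only the 6.7 ns write latency and 0.3 pJ/bit write energy come from the slides.

```python
WRITE_LATENCY_NS = 6.7         # switching latency at 32nm, from the slides
WRITE_ENERGY_PJ_PER_BIT = 0.3  # switching energy at 32nm, from the slides


class SubbankBuffer:
    """Hypothetical model: one STT-MRAM subbank behind a one-entry CMOS buffer."""

    def __init__(self, num_words):
        self.cells = [0] * num_words  # contents of the STT-MRAM cells
        self.pending = None           # (addr, data, done_time) of the in-flight write
        self.energy_pj = 0.0          # accumulated write energy

    def _retire(self, now):
        # Commit the buffered write once its switching latency has elapsed.
        if self.pending is not None and now >= self.pending[2]:
            addr, data, _ = self.pending
            self.cells[addr] = data
            self.pending = None

    def write(self, addr, data, now):
        """Latch addr/data locally. Returns False on a buffer conflict."""
        self._retire(now)
        if self.pending is not None:
            return False  # buffer still busy with an earlier write
        # Differential write: only bits that differ from the stored word switch.
        flipped = bin(self.cells[addr] ^ data).count("1")
        self.energy_pj += flipped * WRITE_ENERGY_PJ_PER_BIT
        self.pending = (addr, data, now + WRITE_LATENCY_NS)
        return True

    def read(self, addr, now):
        """Read through a separate port, forwarding from an ongoing write."""
        self._retire(now)
        if self.pending is not None and self.pending[0] == addr:
            return self.pending[1]
        return self.cells[addr]


buf = SubbankBuffer(num_words=4)
accepted = buf.write(0, 0b1111, now=0.0)   # four bits switch
forwarded = buf.read(0, now=1.0)           # served from the buffer
conflict = buf.write(1, 0b0001, now=2.0)   # rejected: write still in flight
later = buf.write(0, 0b1110, now=10.0)     # earlier write has retired; one bit switches
```

In this model the shared H-tree is free as soon as `write` returns, because the long switching delay is absorbed inside the subbank. A real design would also need a policy for the rejected write, such as stalling the issuing thread.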





#### STT-MRAM LUTs [Suzuki09, Matsunaga08]






### Case Study: 3-bit Adder
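The figure for this case study is not reproduced here, so the following is only a sketch of how a 3-bit adder can be expressed with lookup tables. It assumes a ripple structure with one 3-input sum LUT and one 3-input carry LUT per bit; the names `SUM_LUT`, `CARRY_LUT`, and `add3` are hypothetical, and the decomposition in the talk may differ.

```python
# Each LUT is indexed by (a_bit << 2) | (b_bit << 1) | carry_in.
SUM_LUT = [0, 1, 1, 0, 1, 0, 0, 1]    # a XOR b XOR carry_in
CARRY_LUT = [0, 0, 0, 1, 0, 1, 1, 1]  # majority(a, b, carry_in)


def add3(a, b):
    """Add two 3-bit values using only table lookups; returns a 4-bit result."""
    carry = 0
    result = 0
    for i in range(3):
        index = (((a >> i) & 1) << 2) | (((b >> i) & 1) << 1) | carry
        result |= SUM_LUT[index] << i
        carry = CARRY_LUT[index]
    return result | (carry << 3)  # the final carry becomes bit 3


print(add3(5, 6))
```

Each table here holds 8 one-bit entries, so a bit slice needs 16 stored bits. A single flat table for the whole adder would instead need 64 entries of 4 bits, which illustrates the size trade-off when choosing LUT granularity.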







# Hybrid CMT Pipeline






## Front End


LUT-based carry-select adder to compute PC+4

LUT-based front-end thread selection logic

SRAM-based refill queue to avoid I\$ conflicts

Predecode and backend thread selection with MRAM-related stall conditions
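The carry-select idea behind the PC+4 adder can be sketched in software. This is a rough illustration only: the 4-bit block width, the 32-bit PC, the table layout, and the names `build_block_lut`, `LUTS`, and `pc_plus_4` are all assumptions rather than details from the talk. Each block looks up its result for both possible carry-in values, and the incoming carry then selects one.

```python
BLOCK_BITS = 4                 # assumed block width
MASK = (1 << BLOCK_BITS) - 1
INCREMENT = 4


def build_block_lut(const_bits, carry_in):
    """Table mapping a block's input bits to (sum_bits, carry_out)."""
    table = []
    for x in range(1 << BLOCK_BITS):
        total = x + const_bits + carry_in
        table.append((total & MASK, total >> BLOCK_BITS))
    return table


# Precomputed tables, keyed by (constant bits of the block, carry_in).
# Only the lowest block sees the constant 4; all higher blocks add 0.
LUTS = {(c, cin): build_block_lut(c, cin) for c in (0, INCREMENT) for cin in (0, 1)}


def pc_plus_4(pc, width=32):
    """Compute (pc + 4) modulo 2**width with per-block carry selection."""
    result = 0
    carry = 0
    for block in range(width // BLOCK_BITS):
        shift = block * BLOCK_BITS
        x = (pc >> shift) & MASK
        const_bits = (INCREMENT >> shift) & MASK
        # Both candidates depend only on x, so hardware can look them up in parallel.
        candidate0 = LUTS[(const_bits, 0)][x]
        candidate1 = LUTS[(const_bits, 1)][x]
        # The carry from the previous block acts as the select signal.
        sum_bits, carry = candidate1 if carry else candidate0
        result |= sum_bits << shift
    return result


print(hex(pc_plus_4(0x0000FFFC)))
```

The loop is sequential in software, but the point of the structure is that only the final selection waits on the carry chain, while the table lookups themselves do not.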





# **Register File**

Architectural registers of all threads aggregated in a unified STT-MRAM array to amortize subbank buffers

Registers of a single thread striped across subbanks to reduce subbank buffer conflicts
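One way to picture the striping is an address mapping from (thread, register) to (subbank, row). The slides do not give the actual mapping or the array dimensions, so the thread count, register count, subbank count, and the function `locate` below are all assumptions used purely for illustration.

```python
NUM_THREADS = 8        # assumed number of hardware threads
REGS_PER_THREAD = 32   # assumed architectural registers per thread
NUM_SUBBANKS = 8       # assumed number of subbanks in the unified array
ROWS_PER_THREAD = REGS_PER_THREAD // NUM_SUBBANKS


def locate(thread, reg):
    """Map a thread's architectural register to (subbank, row)."""
    subbank = reg % NUM_SUBBANKS                          # stripe across subbanks
    row = thread * ROWS_PER_THREAD + reg // NUM_SUBBANKS  # each thread owns a row range
    return subbank, row


# Consecutive registers of thread 0 land in different subbanks.
print([locate(0, r) for r in range(4)])
```

With this mapping, back-to-back writes from one thread to neighboring registers go to different subbank buffers, while all threads share the same set of buffers.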





# Floating-Point Unit

| Operation      | STT-MRAM FPU | CMOS FPU  |
|----------------|--------------|-----------|
| Add, Sub, Mult | 24 cycles    | 12 cycles |
| Div            | 64 cycles    | 64 cycles |





# Memory System

Use store buffers to avoid L1 D\$ subbank conflicts

L1s optimized for fast writes using 30F<sup>2</sup> cells

L2 and memory controllers optimized for density using 10F<sup>2</sup> cells

[Figure: memory system organization. Legend: pure CMOS, STT-MRAM LUTs, STT-MRAM arrays. Labeled blocks: L1 D\$ Bank<sub>0</sub> and Bank<sub>1</sub> with a store buffer; L2\$ Bank<sub>0</sub> through Bank<sub>7</sub>, each with subbanks Sub<sub>0</sub> to Sub<sub>3</sub>; memory controllers MC<sub>0</sub> to MC<sub>3</sub>, each with a queue and logic. Cell-size labels: 30F<sup>2</sup> and 10F<sup>2</sup>.]





### Performance






#### Power




#### Leakage Power Normalized to CMOS Leakage Power





## **Contributions and Findings**

- New technique to reduce leakage and dynamic power in a deep-submicron microprocessor
  - Selectively migrate on-chip storage and combinational logic from CMOS to STT-MRAM
  - Use subbank buffers to alleviate long write latency
- STT-MRAM is an attractive low-power solution beyond 32nm
  - Dramatically lower leakage power
  - Modest loss in performance


