Pushing the Boundaries of Moore's Law to Transition from FPGA to All Programmable Platform

Ivo Bolsens, SVP & CTO Xilinx
ISPD, March 2017
High Growth Markets Megatrends

- Cloud Computing
  - Accelerating Storage, Networking, and Compute
- Automotive
  - Enabling Autonomous Driving
- IIoT
  - Enabling Industrial IoT: Connected, Secure, and Real-Time Control
- 5G Wireless
  - Creating More Bandwidth and Lower Latency

Machine Learning
Machine Learning: FPGA best performance/watt

- Data Flow Compute
- Distributed Memory
- Programmable Interconnect
- Custom Precision Arithmetic

CPU/GPU

FPGA

© Copyright 2016 Xilinx
From FPGA to ‘All Programmable Platform’

1988

Logic
IO
Memory
Processing
Transceiver
DSP

2003

2012

2.5D IC
SOC
FPGA
From FPGA to ‘All Programmable Platform’

ASIC Refugees

Logic
BRAM
DSP

Dual A9
acc1
acc2

Quad A53
Dual R5
GPU
H.265
acc1
acc2

HW/SW co-design

Software Programming

C/C++
RTL
CPU
Accel

C + VHDL

C + High Level Synthesis

OpenCL

VHDL

Bring power of C++ to OpenCL

Intermediate Representation

OpenCL

Heterogeneous Parallel Programming

UltraScale family (20nm)

- Virtex UltraScale: 20 billion transistors
- > 6.00 GB database size
- > 24 hrs of DRC using 100 CPUs

- Vivado: 23 million lines of code
- 1000 person-years
- 10+ technology acquisitions
The ‘Virtual Foundry’

- Product Spec
- Configuration Bitstream
- Design Tools and Libraries
- “Blank” Programmable Logic Device
- SOC Product
Scalable Programmable Interconnect

Modular
Why do Foundries love FPGA: Product Lifetime

- DD
- +15 Years
Diagnostic for Continuous Improvement

FPGA is the Yield Learning Vehicle

FAB

FPGA

Wafer Sort

Root Cause Analysis

Solution

TIME
Yield Improvement is about:

1) **Defect Reduction** – *You have to find it to fix it*

2) **Process Control** – *You have to measure it to improve it*
Readback Analysis for Process Debugging

Each CLB has > 1K bits configuration & LUT memory.

<table>
<thead>
<tr>
<th>BRAM Col 0</th>
<th>BRAM Col 1</th>
<th>BRAM Col 2</th>
<th>BRAM Col 3</th>
</tr>
</thead>
<tbody>
<tr>
<td>143</td>
<td>143</td>
<td>143</td>
<td>143</td>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>

Failing add: 1: Col=1, Row=8, x=111, y=30
=> 0's failure
Total 1 bits failed
Metal Failure Test Patterns
(ASIC: the middle metal layers are a black box for fault isolation)

Bit Map
Embedded CFM and SRAM help check the process defect level from Si to M4
Readback Signature vs. Process Layers

- P/FG
- FG
- H_Dual
- V_Dual
- P/D
- D
- P/F
- F
- MB
- S

Horizontal dual-bit failure caused by a Via1 open.
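The signature idea above can be sketched in a few lines. This is a minimal illustration, not the actual readback tool: it assumes failing bits arrive as (x, y) coordinates, treats two horizontally adjacent failures as `H_Dual` and two vertically adjacent failures as `V_Dual` (names taken from the signature list), and calls everything else `single`. The function name and the sample coordinates are hypothetical.

```python
from collections import Counter


def classify_failures(failing_bits):
    """Count single, H_Dual and V_Dual signatures among failing (x, y) bits."""
    remaining = set(failing_bits)
    counts = Counter()
    for x, y in sorted(remaining):
        if (x, y) not in remaining:
            continue  # already paired with an earlier neighbour
        remaining.discard((x, y))
        if (x + 1, y) in remaining:
            remaining.discard((x + 1, y))
            counts["H_Dual"] += 1
        elif (x, y + 1) in remaining:
            remaining.discard((x, y + 1))
            counts["V_Dual"] += 1
        else:
            counts["single"] += 1
    return counts


# Hypothetical readback result: one isolated bit, one horizontal pair, one vertical pair.
failures = [(111, 30), (40, 12), (41, 12), (7, 8), (7, 9)]
print(dict(classify_failures(failures)))
```

Mapping each signature to a process layer (for example, horizontal pairs to Via1 opens) would still need the physical layout of the memory cell, which this sketch does not model.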
Process Characterisation: “Sea of Ring Oscillators”

Configure FPGA logic blocks into ring oscillators (thousands per die) and measure delays.
Variation of logic delay (Tilo)
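As a rough sketch of how a measured oscillation frequency turns into a logic delay: a ring oscillator completes one period after the signal has travelled around the loop twice, so the per-stage delay is the period divided by twice the number of stages. The stage count and the frequencies below are hypothetical, and routing delay is lumped into the per-stage figure.

```python
import statistics


def stage_delay_ps(freq_mhz, stages):
    """Per-stage delay in ps for a ring oscillator with `stages` delay elements."""
    period_ps = 1e6 / freq_mhz        # 1 MHz corresponds to a 1e6 ps period
    return period_ps / (2 * stages)   # a full period is two trips around the loop


# Hypothetical measured frequencies (MHz) from four oscillators on one die.
measured_mhz = [100.0, 101.0, 99.0, 100.5]
delays = [stage_delay_ps(f, stages=25) for f in measured_mhz]
print(f"mean = {statistics.mean(delays):.1f} ps, stdev = {statistics.stdev(delays):.2f} ps")
```

Repeating this over thousands of oscillators placed across the die gives the delay map whose spread is compared between die sizes and mask revisions.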

Large die (1x1 reticle)

Small die (2x2 reticle)
New (OPC) Poly Mask Release

Lot: xyz1:
Die Variation with Old Mask

<table>
<thead>
<tr>
<th>Statistic</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Max</td>
<td>20.7</td>
</tr>
<tr>
<td>Min</td>
<td>18.8</td>
</tr>
<tr>
<td>Average</td>
<td>19.9</td>
</tr>
<tr>
<td>Stdev</td>
<td>0.4</td>
</tr>
</tbody>
</table>

Lot: xyz1:
Die Variation with New Mask

<table>
<thead>
<tr>
<th>Statistic</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Max</td>
<td>20.1</td>
</tr>
<tr>
<td>Min</td>
<td>19.2</td>
</tr>
<tr>
<td>Average</td>
<td>19.7</td>
</tr>
<tr>
<td>Stdev</td>
<td>0.1</td>
</tr>
</tbody>
</table>

Method to ‘qualify’ Poly mask with new OPC.
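The two lots above can be compared with a short calculation. This is only a sketch using the four statistics shown for each mask; the slides do not state the unit of the measurements, so the numbers are treated as unitless, and the helper names are hypothetical.

```python
old_mask = {"max": 20.7, "min": 18.8, "average": 19.9, "stdev": 0.4}
new_mask = {"max": 20.1, "min": 19.2, "average": 19.7, "stdev": 0.1}


def spread(stats):
    """Max-to-min range across the die."""
    return stats["max"] - stats["min"]


def relative_sigma(stats):
    """Standard deviation as a fraction of the average."""
    return stats["stdev"] / stats["average"]


for name, stats in (("old", old_mask), ("new", new_mask)):
    print(f"{name} mask: range = {spread(stats):.1f}, sigma/mean = {relative_sigma(stats):.2%}")

sigma_reduction = 1 - new_mask["stdev"] / old_mask["stdev"]
print(f"stdev reduction with new mask: {sigma_reduction:.0%}")
```

On these figures the within-die range drops from 1.9 to 0.9 and the standard deviation falls by three quarters, which is the kind of evidence used to qualify the new OPC mask.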
Poly CD Variation Improvement

With smaller poly variation, product leakage current is reduced.
Poly CD Variation Improvement

Process (photo/etch) improvement => better wafer-level poly CD uniformity
Poly OPC improvement => better within-die poly CD uniformity
Moore’s Law: The Interconnect Challenge

- RC wiring delay
- On/Off Chip IO gap

![Graph showing relative delay vs. process technology node (250 nm down to 10 nm) for gate delay, 1 mm wire without repeaters, and 1 mm wire with repeaters]
Scalable Programmable Interconnect

Connectivity Pitch is limiting factor (not transistor)
Cost of Moore’s Law

- Cost for the same Capacity of logic is increasing
  - Same trend holds for FPGAs

<table>
<thead>
<tr>
<th>Technology</th>
<th>Gates/mm² (KU)</th>
<th>Gate utilization (%)</th>
<th>Used Gates/mm² (KU)</th>
<th>Parametric yield impact (Δ from D0 yield)</th>
<th>Actual used Gates/mm² (KU)</th>
<th>Gates/Wafer (MU)</th>
<th>Wafer cost ($)</th>
<th>Wafer cost (Δ %)</th>
<th>Cost per 100M gate ($)</th>
</tr>
</thead>
<tbody>
<tr>
<td>90nm</td>
<td>637</td>
<td>86</td>
<td>546</td>
<td>97</td>
<td>532</td>
<td>33,831</td>
<td>1,357.62</td>
<td>-</td>
<td>4.01</td>
</tr>
<tr>
<td>65nm</td>
<td>1,109</td>
<td>83</td>
<td>919</td>
<td>96</td>
<td>885</td>
<td>56,330</td>
<td>1,585.17</td>
<td>16.8</td>
<td>2.82</td>
</tr>
<tr>
<td>45/40nm</td>
<td>2,139</td>
<td>78</td>
<td>1,677</td>
<td>92</td>
<td>1,598</td>
<td>97,842</td>
<td>1,896.83</td>
<td>19.7</td>
<td>1.94</td>
</tr>
<tr>
<td>28nm</td>
<td>4,262</td>
<td>77</td>
<td>3,282</td>
<td>87</td>
<td>2,855</td>
<td>181,558</td>
<td>2,361.84</td>
<td>24.4</td>
<td>1.30</td>
</tr>
<tr>
<td>20nm</td>
<td>6,992</td>
<td>65</td>
<td>4,524</td>
<td>73</td>
<td>3,293</td>
<td>209,541</td>
<td>2,981.75</td>
<td>26.2</td>
<td>1.42</td>
</tr>
<tr>
<td>16/14nm</td>
<td>10,488</td>
<td>64</td>
<td>6,712</td>
<td>67</td>
<td>4,457</td>
<td>286,140</td>
<td>4,081.22</td>
<td>36.9</td>
<td>1.43</td>
</tr>
<tr>
<td>10nm</td>
<td>14,957</td>
<td>60</td>
<td>8,974</td>
<td>62</td>
<td>5,564</td>
<td>354,013</td>
<td>5,126.35</td>
<td>25.6</td>
<td>1.45</td>
</tr>
<tr>
<td>7nm</td>
<td>17,085</td>
<td>59</td>
<td>10,080</td>
<td>60</td>
<td>6,048</td>
<td>384,813</td>
<td>5,859.28</td>
<td>14.3</td>
<td>1.52</td>
</tr>
</tbody>
</table>

- Moore’s Law Pause @ 28nm

Cost per 100M gate ($):
- ▼ 308%
- ▼ 117%
- ▼ 49%
- ▲ 8%
- ▲ 10%
- ▲ 12%
- ▲ 17%
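The last column of the cost table can be recomputed from two of the others. The sketch below assumes "MU" means millions of gates, so cost per 100M gates is wafer cost divided by gates per wafer (in millions), times 100. Only the wafer-cost and gates-per-wafer figures are copied from the table; the function and variable names are hypothetical.

```python
# (node, gates per wafer in millions, wafer cost in $), copied from the table
rows = [
    ("90nm", 33_831, 1357.62),
    ("65nm", 56_330, 1585.17),
    ("45/40nm", 97_842, 1896.83),
    ("28nm", 181_558, 2361.84),
    ("20nm", 209_541, 2981.75),
    ("16/14nm", 286_140, 4081.22),
    ("10nm", 354_013, 5126.35),
    ("7nm", 384_813, 5859.28),
]


def cost_per_100m_gates(gates_per_wafer_m, wafer_cost):
    """Dollars per 100 million usable gates."""
    return wafer_cost / gates_per_wafer_m * 100


costs = {node: cost_per_100m_gates(g, c) for node, g, c in rows}
cheapest = min(costs, key=costs.get)

for node, cost in costs.items():
    print(f"{node:>8}: ${cost:.2f}")
print("lowest cost per gate:", cheapest)
```

The minimum falls at 28nm: beyond that node, wafer cost grows faster than the number of usable gates per wafer.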
Process technology is not helping interconnect

- Cost is increasing
  - Significant portion of cost is connectivity (BEOL)

- Resistivity is increasing
  - Gap between Transistor and wire delay is rapidly increasing
  - Slower global data movement

Source: EE Times
Interconnect technology process strategies

» 28nm-16nm nodes
  – Line width (Half pitch): 40-45 nm
  – Key strategies: thinner barriers, larger Cu grain size
  – Resistivity: 2.8-3.5 micro-ohm-cm

» 14nm-7nm nodes
  – Line width (half pitch): 20-30 nm
  – Key strategies: aggressive barrier engineering, Cu with TaN
  – Resistivity: 6-8.5 micro-ohm-cm

» 5nm node
  – Line width (half pitch): 10-12 nm
  – Resistivity: greater than 12 micro-ohm-cm

Resistivity Quadruples from 28nm to 5nm

Intel (Roberts et al, IITC 2015); IBM (Pyzyna et al, VLSI Tech 2015)
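Resistivity is only part of the picture, because the wire cross-section shrinks at the same time. The sketch below estimates resistance per micron from R/L = ρ/(w·h). It is a rough illustration under stated assumptions: a rectangular wire, a hypothetical height of twice the line width, midpoints of the ranges quoted above, and 12 micro-ohm-cm as a lower bound for the 5nm node.

```python
def resistance_ohm_per_um(resistivity_uohm_cm, width_nm, aspect_ratio=2.0):
    """Wire resistance per micron of length for a rectangular cross-section."""
    rho_ohm_m = resistivity_uohm_cm * 1e-8           # 1 micro-ohm-cm = 1e-8 ohm-m
    width_m = width_nm * 1e-9
    height_m = aspect_ratio * width_m                # assumed aspect ratio, not from the slides
    ohm_per_m = rho_ohm_m / (width_m * height_m)
    return ohm_per_m * 1e-6                          # convert per metre to per micron


# (label, resistivity in micro-ohm-cm, line width in nm): midpoints of the quoted ranges
nodes = [
    ("28nm-16nm", 3.15, 42.5),
    ("14nm-7nm", 7.25, 25.0),
    ("5nm", 12.0, 11.0),
]

for label, rho, width in nodes:
    print(f"{label:>10}: {resistance_ohm_per_um(rho, width):.1f} ohm/um")
```

Under these assumptions the resistance per micron rises from roughly 9 ohm/um to roughly 500 ohm/um, far more than the fourfold increase in resistivity alone.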
Increasing demand for global data movement

- Emerging abundant-data applications
  - Machine learning applications require access to 4-40 GB of external memory and very high data-movement bandwidth

- External memory trend
  - One 64-bit DDR4 channel BW is 0.17 Tb/s

- Emerging programmable devices
  - External memory access of up to 8GB
  - Up to 3.6 Tb/s edge data rates required
  - ~20X higher than a single DDR channel

Source: ISSCC 2013, Memory trends
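The bandwidth figures above follow from simple arithmetic. The sketch assumes a DDR4 data rate of 2666 MT/s, which is not stated on the slide but is consistent with the 0.17 Tb/s figure; the function name is hypothetical.

```python
def channel_bandwidth_tbps(width_bits, transfers_per_s):
    """Peak channel bandwidth in terabits per second."""
    return width_bits * transfers_per_s / 1e12


ddr4_tbps = channel_bandwidth_tbps(64, 2666e6)   # assumed DDR4-2666
required_tbps = 3.6                               # edge data rate quoted above

print(f"one 64-bit DDR4 channel: {ddr4_tbps:.2f} Tb/s")
print(f"required / single channel: {required_tbps / ddr4_tbps:.0f}x")
```

The ratio comes out at about 21, in line with the "~20X" quoted on the slide.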
Valuable Real Estate

» Building wide
  – Real estate silicon cost manageable

» Building up
  – Use all the air space possible for interconnect
  – Physics sets the limits

» Building smart
  – Economics of sharing
  – What is the most efficient way of sharing valuable silicon resources?
On-chip: mitigate scarcity of global wires

- Global wires do NOT shrink
  - But global data movement requirements are rising with applications
  - Need to share global wires

- Network on chip solutions are emerging
  - Enables systematic sharing of global wires
  - Majority of large SoCs use some form of NoC
On the FPGA: Connectivity without sharing

- 6 AXI master compute clients
  - Communicating to 2 AXI slave clients (external memory)
  - 64-bit, global connectivity resources are highlighted
  - Routing congestion
Connectivity with sharing global wires: SoftNOC

- **FPGA optimized soft NoC**
  - 2-3X better throughput per resources (LUTs and global wires)

- **Next: Customized overlay**
  - Take advantage of FPGA programmability

- **Result:**
  - 64-bit global data movement overlay running at GHz
  - 2-3X higher bandwidth per area
Off-chip : Bandwidth-per-Watt

**DDR4 DIMM**
Standard commodity memory used in servers and PCs.

- **Bandwidth**: 21.3 GB/s
- **Depth**: 16 GB
- **Cost / GB**: $
- **PCB Req**: High
- **pJ / bit**: ~27
- **Latency**: Med

* Single DDR4 DIMM

**HMC**
Hybrid-Memory Cube Serial DRAM

- **Bandwidth**: 160 GB/s
- **Depth**: 4 GB
- **Cost / GB**: $$$
- **PCB Req**: Med
- **pJ / bit**: ~30
- **Latency**: High

* Single HMC Device

**HBM**
High Bandwidth Memory
DRAM integrated into the FPGA package

- **Bandwidth**: 460 GB/s
- **Depth**: 8 GB
- **Cost / GB**: $
- **PCB Req**: None
- **pJ / bit**: ~7
- **Latency**: Med

* Single FPGA with HBM
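The bandwidth and energy-per-bit figures for the three memories combine into a bandwidth-per-watt estimate. This is a back-of-the-envelope sketch: it assumes each interface runs at its full quoted bandwidth and that the pJ/bit figure covers all of the transfer energy. The names are hypothetical.

```python
# name: (bandwidth in GB/s, energy in pJ/bit), taken from the comparison above
memories = {
    "DDR4 DIMM": (21.3, 27.0),
    "HMC": (160.0, 30.0),
    "HBM": (460.0, 7.0),
}


def interface_power_w(gbytes_per_s, pj_per_bit):
    """Power drawn when the interface runs at full bandwidth."""
    bits_per_s = gbytes_per_s * 8e9
    return bits_per_s * pj_per_bit * 1e-12


for name, (bw, energy) in memories.items():
    power = interface_power_w(bw, energy)
    print(f"{name:>9}: {power:5.1f} W at {bw:5.1f} GB/s -> {bw / power:4.1f} GB/s per W")
```

Because bandwidth per watt depends only on energy per bit, HBM at ~7 pJ/bit delivers about four times the bandwidth per watt of DDR4 or HMC.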
Xilinx leading SSIT technology

Passive Silicon Interposer (65nm)
- 4 Metal Layers Connecting Micro-Bumps & TSVs

Micro-Bumps
- Power / Ground / IOs / Routing

C4 Bumps
- Connects Silicon to Package

Through-Silicon Via (TSVs)
- Connects Power / Ground / IOs to C4 Bumps

- > 150,000 Micro-bumps
- > 10,000 TSVs
- > 10,000 C4 Bumps
- >90 Processing Steps in 3DIC Flow (From Bump to Completed Package)
System-in-Package: Massive Connectivity

- Higher wire counts require an interposer
  - Emerging HBM application products in data center

- Possible to extend beyond memory access
  - Heterogeneous system in a package
How can we meet BW requirement with limited available connections?
Next trend: sharing packaging wires

PCB trends

- Fast, high-bandwidth SiP bridge
  - Used for time multiplexing or sharing of package wires
  - Reduces package wire count significantly

- Emerging Ultra-Short Reach (USR) IO
  - Enables high BW delivery despite limited available MCM connections
Kandou USR 28nm IP
– Published in ISSCC2016
– Power as low as 0.94 pJ/bit for 12mm wire length
– JEDEC standard as of summer 2016

Beyond the spec
Power, throughput and cost trade-offs

- Parallel SiP Connectivity: Interposer, HBM2, EMIB
- Serial SiP connectivity: (6mm-12mm)
- Monolithic global wiring: (> 1mm)

![Graph showing energy (pJ/bit) vs. wire throughput (Gb/s)]
Conclusions

From FPGA to ‘All Programmable SOC’
  – FPGA technology at forefront of Moore’s Law

Increasing demand for global data movement
  – Big Data
  – Machine Learning

Connectivity Challenges
  – Trade-offs
    • Point-to-point
    • NOC
    • Serial
    • Parallel
    • Monolithic/SIP
  – New design methods and new figures of merit