# **Full-Stack Optimizations for Next-Generation Deep-Learning Accelerators**



ysshao@berkeley.edu

**Electrical Engineering and Computer Sciences** 

# Growing Demand in Computing

#### Two Distinct Eras of Compute Usage in Training AI Systems



**OpenAI** 



# **Slowing Supply in Computing**

AMD, HotChips, 2019





# Domain-Specific Accelerators

Growing Demand in Computing



Slowing Supply in Computing

## **Domain-Specific Accelerators**

 Customized hardware designed for a domain of applications.



Apple M1 Chip 2020






# Full-Stack Optimization for DL Accelerators



# **Design of Accelerators**



Integration of Accelerators



Scheduling of Accelerators

# **Scalable Inference Accelerators**

#### Motivation

• Need for fast and efficient inference accelerators from mobile to datacenter.

#### Challenge

• High design cost of building unique hardware for each design target.

#### Opportunities

- Deep learning inference is intrinsically scalable with abundant parallelism.
- Recent advances in package-level integration for multi-chip-module-based designs.

# The Multi-Chip-Module Approach

- Advantages:
  - Build systems larger than the reticle limit
  - Smaller chips are cheaper to design
  - Smaller chips have higher yield
  - Faster time-to-market
- Challenges:
  - Area, energy, and latency for chip-to-chip communication
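The yield advantage listed above can be made concrete with a simple Poisson defect model. This is a minimal sketch: the defect density and the 600 mm² monolithic die size are illustrative assumptions, not figures from the talk, and the model ignores packaging and chip-to-chip overheads.

```python
import math

def poisson_yield(area_mm2: float, defects_per_mm2: float) -> float:
    """Fraction of defect-free dies under a simple Poisson defect model."""
    return math.exp(-area_mm2 * defects_per_mm2)

# Assumed, illustrative values (not from the talk).
D0 = 0.001              # defects per mm^2
MONOLITHIC_AREA = 600.0 # mm^2, hypothetical large die
CHIPLET_AREA = 6.0      # mm^2, matches the chiplet size quoted for Simba

big_die_yield = poisson_yield(MONOLITHIC_AREA, D0)
chiplet_yield = poisson_yield(CHIPLET_AREA, D0)

# Silicon area that must be fabricated per good mm^2 of logic.
big_die_cost = 1.0 / big_die_yield
chiplet_cost = 1.0 / chiplet_yield

print(f"monolithic yield: {big_die_yield:.3f}, area per good mm^2: {big_die_cost:.2f}")
print(f"chiplet yield:    {chiplet_yield:.3f}, area per good mm^2: {chiplet_cost:.2f}")
```

The comparison assumes chiplets are tested before assembly, so a defective chiplet is discarded on its own rather than scrapping a whole package.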



Ref: Zimmer et al., VLSI 2019

# Simba: Scaling Inference with MCM-based Architecture

### Best Paper Award at MICRO'2019, CACM Research Highlights

#### Simba Testchip:

- Package and chiplet architecture
- Processing element design
- Baseline uniform tiling across chiplets and PEs

#### Simba Characterization:

- Comparison with GPUs
- NoP bandwidth sensitivity
- NoP latency sensitivity

#### Simba NoP-Aware Tiling:

- Non-uniform work partitioning
- Communication-aware data placement
- Cross-layer pipelining
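The idea behind non-uniform work partitioning can be sketched with a toy model: chiplets that sit farther from the data source pay a larger fixed communication latency, so they receive less work and all chiplets finish together. This is an illustrative sketch, not Simba's actual tiling algorithm; the latencies, work units, and function names are assumptions.

```python
def uniform_finish_time(total_work, comm_latency, rate=1.0):
    """Finish time when every chiplet gets the same share of work."""
    share = total_work / len(comm_latency)
    return max(c + share / rate for c in comm_latency)

def nonuniform_partition(total_work, comm_latency, rate=1.0):
    """Split work so that c_i + w_i / rate is equal for all chiplets.

    Solving sum(w_i) = total_work with a common finish time T gives
    T = (total_work / rate + sum(c_i)) / n and w_i = rate * (T - c_i).
    """
    n = len(comm_latency)
    finish = (total_work / rate + sum(comm_latency)) / n
    shares = [rate * (finish - c) for c in comm_latency]
    if min(shares) < 0:
        raise ValueError("some chiplets are too far away to be worth using")
    return shares, finish

# Hypothetical setup: 4 chiplets with increasing distance from the input.
latencies = [0.0, 100.0, 200.0, 300.0]  # cycles of fixed communication cost
work = 1200.0                           # abstract work units, 1 unit per cycle

shares, balanced = nonuniform_partition(work, latencies)
print("uniform finish time:    ", uniform_finish_time(work, latencies))
print("non-uniform finish time:", balanced)
print("per-chiplet shares:     ", shares)
```

In this toy setup the balanced split finishes earlier than the uniform one because the closest chiplet absorbs the work that the farthest chiplet cannot start on in time.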


## Simba: Scalable MCM-Based Architecture

47.5 mm

#### Package and chiplet spec

- 6 mm<sup>2</sup> chiplet in TSMC 16 nm
- 36 chiplets/package

### Chip-to-chip interconnect

Ground-Referenced Signaling

### **Efficient compute tiles**

- 128 TOPS
- 0.11 pJ/Op
- 8-bit integer datapath





Ref: Zimmer et al., VLSI 2019

| Parameter | Value                      |
|-----------|----------------------------|
| Voltage   | 0.52-1.1 V                 |
| Frequency | 0.48-1.8 GHz               |
| SRAM      | 624 KB/chip, 23 MB/package |

## **Simba Characterization**

• Comparison with GPUs running ResNet-50





## Simba Characterization

- Layer Sensitivity
  - Running three ResNet-50 layers across different numbers of chiplets.
  - Increasing the number of active chiplets does not always translate to performance gains.
  - The cost of communication hinders the ability to exploit parallelism.
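The trade-off in the bullets above can be illustrated with a toy latency model in which compute time shrinks with the number of chiplets while communication cost grows. The cycle counts and the linear communication term are assumptions chosen for illustration, not measurements from the Simba testchip.

```python
def layer_latency(num_chiplets, compute_cycles, comm_cycles_per_hop):
    """Toy model: perfectly parallel compute plus linearly growing communication."""
    compute = compute_cycles / num_chiplets
    communication = comm_cycles_per_hop * (num_chiplets - 1)
    return compute + communication

# Hypothetical layer: 3600 compute cycles on one chiplet,
# 25 extra cycles of communication for each additional chiplet.
COMPUTE_CYCLES = 3600
COMM_CYCLES_PER_HOP = 25

# Chiplet counts that divide the 36 chiplets in a package.
candidates = [1, 2, 3, 4, 6, 9, 12, 18, 36]
latencies = {n: layer_latency(n, COMPUTE_CYCLES, COMM_CYCLES_PER_HOP)
             for n in candidates}
best = min(latencies, key=latencies.get)

for n in candidates:
    print(f"{n:2d} chiplets -> {latencies[n]:7.1f} cycles")
print("best chiplet count in this model:", best)
```

Under these assumed costs the minimum falls well below 36 chiplets, which mirrors the observation that activating more chiplets does not always help.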



# Full-Stack Optimization for DL Accelerators



# **Design of Accelerators**



# Integration of Accelerators



Scheduling of Accelerators

## Accelerators don't exist in isolation.





http://vlsiarch.eecs.harvard.edu/research/accelerators/die-photoanalysis/

# **Mobile SoC Use Case**

- Mainstream architecture has long focused on general-purpose CPUs and GPUs.
- In an SoC, multiple IP blocks are active at the same time and communicate frequently with each other.
- Example:
  - Recording a 4K video
  - Camera -> ISP
    - "Preview stream" for display
    - "Video stream" for storage
  - DRAM for data sharing
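To get a feel for how much DRAM traffic the 4K recording example implies, the arithmetic can be sketched as below. The resolution (3840x2160), frame rate (30 fps), pixel format (1.5 bytes per pixel, as in 8-bit YUV 4:2:0), and the one-write-plus-one-read pattern are all assumptions for illustration; the talk does not give these numbers.

```python
def stream_bandwidth_mb_s(width, height, fps, bytes_per_pixel):
    """Bandwidth of one pass over an uncompressed video stream, in MB/s."""
    return width * height * fps * bytes_per_pixel / 1e6

# Assumed parameters (not from the talk).
WIDTH, HEIGHT = 3840, 2160   # 4K UHD frame
FPS = 30                     # frames per second
BYTES_PER_PIXEL = 1.5        # 8-bit YUV 4:2:0

one_pass = stream_bandwidth_mb_s(WIDTH, HEIGHT, FPS, BYTES_PER_PIXEL)

# If the ISP writes each frame to DRAM and another IP block reads it back,
# the shared DRAM sees two passes over the stream.
write_plus_read = 2 * one_pass

print(f"one pass over the video stream: {one_pass:.1f} MB/s")
print(f"ISP write + consumer read:      {write_plus_read:.1f} MB/s")
```

Each additional IP block that touches the frames through DRAM adds another pass, which is why data sharing between blocks matters at the SoC level.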



Two Billion Devices and Counting: An Industry Perspective on the State of Mobile Computer Architecture, IEEE Micro'2018



• Integrated **design**, **simulation** and **implementation** environment for specialized SoCs.



Chipyard components:

| Tooling | Rocket Chip Generators | Flows                   |
|---------|------------------------|-------------------------|
| Chisel  | Diplomacy              | FireSim                 |
| FIRRTL  | Rocket Core, BOOM Core | HAMMER                  |
| RISC-V  | Configuration System   | Software RTL Simulation |
|         | Accelerators, TileLink | FPGA-shells             |
|         | Caches, Peripherals    |                         |

https://github.com/ucb-bar/chipyard

[IEEE Micro'2020]

## **Gemmini: Full-System Co-Design of Hardware Accelerators**

- Full-stack
  - Includes OS
  - End-to-end workloads
  - "Multi-level" API
- Full-SoC
  - Host CPUs
  - Shared memory hierarchies
  - Virtual address translation



| Category                       | Property                          | NVDLA           | VTA    | PolySA         | DNNBuilder | MAGNet | DNNWeaver | MAERI         | Gemmini         |
|--------------------------------|-----------------------------------|-----------------|--------|----------------|------------|--------|-----------|---------------|-----------------|
| Hardware Architecture Template | Multiple Datatypes                | Int/Float       | Int    | Int            | Int        | Int    | Int       | Int           | Int/Float       |
|                                | Multiple Dataflows                |                 | ✗      | ✓              | ✓          | ✓      | ✓         | ✓             | ✓               |
|                                | Spatial Array                     | vector          | vector | systolic       | systolic   | vector | vector    | vector        | vector/systolic |
|                                | Direct Convolution                |                 | ✗      | ✗              |            |        |           |               |                 |
| Programming Support            | Software Ecosystem                | Custom Compiler | TVM    | Xilinx SDAccel | Caffe      | C      | Caffe     | Custom Mapper | ONNX/C          |
|                                | Hardware-Supported Virtual Memory | ✗               | ✗      | ✗              | ✗          | ✗      | ✗         | ✗             | ✓               |
| System Support                 | Full SoC                          | ✗               | ✗      | ✗              | ✗          | ✗      | ✗         | ✗             | ✓               |
|                                | OS Support                        | ✓               | ✓      | ✗              | ✗          | ✗      | ✗         | ✗             | ✓               |

https://github.com/ucb-bar/gemmini

## Gemmini Case Study: Allocating on-chip SRAM



# Where to allocate SRAM?

- Private within each IP
- Shared



Application dependent.



SoC configuration dependent.



https://github.com/ucb-bar/gemmini

# Full-Stack Optimization for DL Accelerators



# **Design of Accelerators**



Integration of Accelerators



Scheduling of Accelerators

## Large Space of Mapping Algorithms to ML Hardware

### Algorithm



[Figure: x-axis is Latency (Mcycles)]

### Scheduling

| Scheduler                      | Search Algorithm      |
|--------------------------------|-----------------------|
| **Brute-force approaches:**    |                       |
| Timeloop                       | Brute-force & Random  |
| dMazeRunner                    | Brute-force           |
| Interstellar                   | Brute-force           |
| Marvel                         | Decoupled Brute-force |
| **Feedback-based approaches:** |                       |

### Hardware



| Scheduler | Search Algorithm                |
|-----------|---------------------------------|
| CoSA      | Mixed Integer Programming (MIP) |
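To illustrate why the mapping space is large and what a brute-force scheduler has to do, here is a minimal sketch that enumerates tile sizes for a matrix multiplication under an on-chip buffer constraint. The cost model (estimated DRAM traffic), the buffer capacity, and all function names are hypothetical; this is not the search used by Timeloop, dMazeRunner, or CoSA.

```python
from itertools import product

def divisors(n):
    """All positive divisors of n, used as legal tile sizes."""
    return [d for d in range(1, n + 1) if n % d == 0]

def dram_traffic(M, N, K, tm, tn):
    """Toy estimate of words moved to/from DRAM for C[M,N] = A[M,K] @ B[K,N].

    A is re-read once per column tile of C, B once per row tile of C,
    and C is written once. The K tile size does not change this estimate.
    """
    return M * K * (N // tn) + K * N * (M // tm) + M * N

def brute_force_search(M, N, K, capacity_words):
    """Try every (tm, tn, tk) tiling whose three tiles fit in the buffer."""
    best_cost, best_tiles, num_valid = None, None, 0
    for tm, tn, tk in product(divisors(M), divisors(N), divisors(K)):
        footprint = tm * tk + tk * tn + tm * tn  # A tile + B tile + C tile
        if footprint > capacity_words:
            continue
        num_valid += 1
        cost = dram_traffic(M, N, K, tm, tn)
        if best_cost is None or cost < best_cost:
            best_cost, best_tiles = cost, (tm, tn, tk)
    return best_cost, best_tiles, num_valid

# Hypothetical problem: 64x64x64 matmul with a 1024-word on-chip buffer.
cost, tiles, valid = brute_force_search(64, 64, 64, capacity_words=1024)
print(f"valid mappings: {valid}, best tiles (tm, tn, tk): {tiles}, traffic: {cost} words")
```

Even this single-level, three-loop toy problem has hundreds of candidate tilings; real accelerators add more memory levels, loop orders, and spatial partitioning, which is what motivates replacing enumeration with a constrained-optimization formulation such as MIP.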

## **CoSA: Constrained-Optimization for Spatial Architecture**



#### **ML Operator**

#### **Spatial Accelerator**

[ISCA'2021]

## **CoSA: Constrained-Optimization for Spatial Architecture**



2.5x speedup compared to the state of the art (SoTA), with 90x faster time-to-solution.

## Acknowledgement







Hasan Genc

Jenny Huang

Seah Kim

• Thanks to collaborators from UC Berkeley and NVIDIA!
