

## Content

Growing compute demand and the power challenge in (embedded) computing

- Heterogeneous multi-core architectures with dedicated accelerators
- New paradigms, e.g. invasive computing

### New Challenges

Memory and bandwidth

Metrics for design space exploration

- Wireless baseband processing
- Impact of memories and data transfers on metrics
- Impact of application (communications) performance on metrics

#### **3D MPSoCs**

3D memories and memory controllers

















All architectures based on standard synthesis flows, 65 nm technology @ worst case, all data in-house available

| Decoder                | Flexibility                               | Max. block size | Payload throughput [Mbit/s]            | Freq. [MHz]  | Area [mm²]     | Dynamic power [mW] |
|------------------------|-------------------------------------------|-----------------|----------------------------------------|--------------|----------------|--------------------|
| ASIP<br>(Magali)       | Conv. Codes<br>Binary TC<br>Duo-binary TC | N=16k           | 40<br>14 (6 iter)<br>28 (6 iter)       | 385<br>(P&R) | 0.7<br>(P&R)   | ~100               |
| LTE Turbo<br>(Music)   | LTE turbo code                            | N=18k           | 150<br>(6 iter)                        | 300<br>(P&R) | 2.1<br>(P&R)   | ~300               |
| LDPC flex<br>(Magali)  | R=1/4 to<br>R=9/10                        | N=16k           | 150-300<br>(20-10 iter)                | 385<br>(P&R) | 1.172<br>(P&R) | ~389               |
| LDPC fixed<br>(Magali) | R=3/4                                     | N=1.2k          | 480 (6 iter)                           | 435<br>(P&R) | 0.583<br>(P&R) | ~202               |
| LDPC<br>WiMedia 1.5    | R=1/2-4/5                                 | N=1.3k          | 640 (R=1/2, 5 iter)<br>960 (R=3/4, 5 iter) | 265      | 0.51           | ~193               |
| CC Decoder             | 64-state NSC                              |                 | 500                                    | 500          | 0.1            | ~37                |
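The table lends itself to simple derived metrics. The following is a minimal sketch, not part of the original slides, that computes throughput per area and energy per payload bit for a few rows. The class and field names are illustrative, and the power figures are the approximate (~) values from the table, so the results are rough estimates only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decoder:
    """One table row; names are illustrative, values are taken from the table."""
    name: str
    throughput_mbps: float  # payload throughput [Mbit/s]
    area_mm2: float         # area [mm^2]
    power_mw: float         # approximate dynamic power [mW]

    @property
    def area_efficiency(self) -> float:
        """Throughput per area in Mbit/s/mm^2."""
        return self.throughput_mbps / self.area_mm2

    @property
    def energy_per_bit_nj(self) -> float:
        """Energy per payload bit: mW / (Mbit/s) = nJ/bit."""
        return self.power_mw / self.throughput_mbps


DECODERS = [
    Decoder("ASIP, conv. codes (Magali)", 40, 0.7, 100),
    Decoder("LTE Turbo, 6 iter (Music)", 150, 2.1, 300),
    Decoder("LDPC fixed, 6 iter (Magali)", 480, 0.583, 202),
    Decoder("CC Decoder, 64-state NSC", 500, 0.1, 37),
]

for d in DECODERS:
    print(f"{d.name:30s} {d.area_efficiency:8.1f} Mbit/s/mm^2 "
          f"{d.energy_per_bit_nj:6.3f} nJ/bit")
```

These figures compare implementation efficiency only; they say nothing about flexibility or communications performance, which differ strongly between the rows.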

Metric Assessment - Channel Decoders

# Algorithmic Throughput Calculations [GOPs]

Operations per decoded information bit are normalized to ~8-bit additions. Info-bit throughput ⇔ giga operations per second [GOPs].

| Code                            | Iterations | Operations per decoded information bit | 100 Mbit/s [GOPs] | 300 Mbit/s [GOPs] | 1 Gbit/s [GOPs] |
|---------------------------------|------------|----------------------------------------|-------------------|-------------------|-----------------|
| CC: states=64                   |            | ~200                                   | ~20               | ~60               | ~200            |
| LDPC Min-Sum<br>(x3.4 appr. BP) | 5 iter     | 75/R                                   | ~7.5/R            | ~22.5/R           | ~75/R           |
|                                 | 10 iter    | 150/R                                  | ~15/R             | ~45/R             | ~150/R          |
|                                 | 20 iter    | 300/R                                  | ~30/R             | ~90/R             | ~300/R          |
|                                 | 40 iter    | 600/R                                  | ~60/R             | ~180/R            | ~600/R          |
| Turbo Max-Log                   | 2 iter     | 280                                    | ~28               | ~84               | ~280            |
|                                 | 4 iter     | 560                                    | ~56               | ~168              | ~560            |
|                                 | 6 iter     | 840                                    | ~84               | ~252              | ~840            |
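The GOPs columns follow from multiplying the operations per decoded information bit by the information-bit throughput. As a rough sketch (function names are illustrative, and the per-iteration constants 15/R and 140 are simply read off the table rows, not stated as formulas in the source):

```python
def required_gops(ops_per_info_bit: float, info_throughput_mbps: float) -> float:
    """Giga operations per second needed to sustain the given info-bit throughput."""
    return ops_per_info_bit * info_throughput_mbps * 1e6 / 1e9


def ldpc_min_sum_ops(iterations: int, code_rate: float) -> float:
    """Ops per info bit for LDPC Min-Sum; 15/R per iteration as implied by the table."""
    return 15.0 * iterations / code_rate


def turbo_max_log_ops(iterations: int) -> float:
    """Ops per info bit for Turbo Max-Log; 140 per iteration as implied by the table."""
    return 140.0 * iterations


CC_64_STATE_OPS = 200.0  # ops per info bit for the 64-state convolutional code

print(required_gops(CC_64_STATE_OPS, 100))            # CC at 100 Mbit/s
print(required_gops(turbo_max_log_ops(6), 300))       # Turbo, 6 iter, 300 Mbit/s
print(required_gops(ldpc_min_sum_ops(10, 0.5), 1000)) # LDPC, 10 iter, R=0.5, 1 Gbit/s
```

For a rate-1/2 LDPC code the per-bit cost doubles relative to the 1/R-free number, which is why low-rate LDPC decoding at many iterations can exceed the turbo decoder in raw operation count.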



## What about Memory/Data Transfers

Current metric: energy efficiency = operations/energy only

Data transfers and accesses contribute substantially to the power consumption

Example (R=0.5)

- 150 Mbit/s Turbo: ~126 GOPs, ~40 Gaccesses
- 150 Mbit/s LDPC: ~90 GOPs, ~80 Gaccesses

Efficient data transfer is key for efficient implementation

- LTE TC: special interleaver structure to avoid access conflicts
- DVB-S2/WiMAX LDPC: special code structure to minimize access conflicts

Efficiency metrics based on operations only are not appropriate

- Power includes operations and accesses!
- Architectures are favored where operations dominate compared to accesses
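To make the argument concrete, the sketch below reproduces the operation counts of the example and adds a simple two-term power model. Two things are assumptions and not from the slides: that the example corresponds to 6 turbo iterations and 20 LDPC iterations (these happen to match the operation counts in the throughput table), and the energy-per-operation and energy-per-access values, which are placeholders chosen only to illustrate the effect.

```python
def gops(ops_per_info_bit: float, throughput_mbps: float) -> float:
    """Giga operations per second for a given info-bit throughput."""
    return ops_per_info_bit * throughput_mbps / 1e3


def power_w(gops_rate: float, gaccess_rate: float,
            e_op_pj: float, e_access_pj: float) -> float:
    """Two-term power model: (G events/s) * (pJ/event) = mW, converted to W."""
    return (gops_rate * e_op_pj + gaccess_rate * e_access_pj) * 1e-3


# Operation counts at 150 Mbit/s, R = 0.5 (iteration counts are an assumption).
turbo_gops = gops(840, 150)         # Turbo Max-Log, 6 iterations
ldpc_gops = gops(300 / 0.5, 150)    # LDPC Min-Sum, 20 iterations
turbo_gacc, ldpc_gacc = 40.0, 80.0  # access rates from the slide

# Placeholder energies, NOT measured values.
E_OP_PJ, E_ACCESS_PJ = 1.0, 5.0

for name, g_ops, g_acc in [("Turbo", turbo_gops, turbo_gacc),
                           ("LDPC", ldpc_gops, ldpc_gacc)]:
    total = power_w(g_ops, g_acc, E_OP_PJ, E_ACCESS_PJ)
    ops_only = power_w(g_ops, 0.0, E_OP_PJ, E_ACCESS_PJ)
    print(f"{name}: ops/access = {g_ops / g_acc:.2f}, "
          f"ops-only power = {ops_only:.3f} W, with accesses = {total:.3f} W")
```

Under any such model, the decoder with the lower operations-to-accesses ratio is the one whose efficiency an operations-only metric overstates the most.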



## **Communications Performance**

Overall efficiency of a baseband receiver depends on

- Implementation performance
- Communications performance
- Flexibility

Scenario 1: Fixed Communication performance

- Comparison of two iterative decoders with same communications performance but different parameters (codes, code rate, iterations)
- $\Rightarrow$  impact on implementation efficiency

Scenario 2: Implementation driven

- Comparison of iterative and non-iterative decoders with varying communications performance
- 64-state convolutional code 960 Mbit/s (WiMedia 1.2) and WiMedia 1.5 LDPC decoder
- ⇒ impact on implementation efficiency









## Lessons learned

- Understanding trade-offs between implementation efficiency, application performance and flexibility requirements is mandatory for efficient baseband receivers
- Operation based metrics for energy and area efficiency can be misleading
- Memory and data transfers have to be considered in metrics for design space exploration
- Implementation efficiency metrics have to be linked to application performance ⇒ trajectory

















## Example: 64Mb 3D-DRAM core tile

- TSV areas added
- Deep trench / buried WL / Stack
- Cell sizes: 8F<sup>2</sup> - 4F<sup>2</sup>
- Based on measured* & simulated data

[Figure: 64M array tile with column logic, control / power generators / signals, and TSVs for power & signals]

| No. | Techn. node | Cell size      | Cell type   | Area [mm<sup>2</sup>] | Row t<sub>RAS</sub> [ns] | Row -> Col. t<sub>RCD</sub> [ns] | Column t<sub>CCD</sub> [ns] |
|-----|-------------|----------------|-------------|-----------------------|--------------------------|----------------------------------|-----------------------------|
| 1   | 75nm        | 8F<sup>2</sup> | Deep Trench | 5.20                  | 39.0                     | 9.30                             | 6.05                        |
| 2   | 65nm        | 6F<sup>2</sup> | buried WL   | 3.54                  | 27.1                     | 7.45                             | 5.42                        |
| 3   | 58nm        | 6F<sup>2</sup> | Stack       | 3.00                  | 31.9                     | 7.31                             | 4.70                        |
| 4   | 46nm        | 6F<sup>2</sup> | buried WL   | 2.26                  | 26.4                     | 6.44                             | 3.59                        |





## Metrics for Exploration

#### Throughput (TP)

- Maximum theoretical bandwidth: $f_{max} \cdot \text{IO width}$
- f<sub>max</sub> is determined by architecture & technology; here: column-to-column access delay (t<sub>CCD</sub>)

#### Area efficiency

- Key lesson from commodity DRAM production: minimize cost/bit
- Maximize cell efficiency (CE) = memory cell area / total area [%]

#### Energy efficiency (EE)

TP / average power = access / energy [MB/mJ]
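The three metrics can be written down directly. The sketch below assumes, as the slide suggests, that f<sub>max</sub> is the reciprocal of t<sub>CCD</sub>; the 128-bit IO width, the average power, and the cell area used in the usage lines are hypothetical inputs chosen for illustration, not values from the slides.

```python
def f_max_mhz(t_ccd_ns: float) -> float:
    """Maximum column access frequency, assuming f_max = 1 / t_CCD."""
    return 1e3 / t_ccd_ns


def throughput_mb_per_s(t_ccd_ns: float, io_width_bits: int) -> float:
    """Maximum theoretical bandwidth: f_max * IO width, in MB/s."""
    return f_max_mhz(t_ccd_ns) * io_width_bits / 8


def cell_efficiency_percent(cell_area_mm2: float, total_area_mm2: float) -> float:
    """CE = memory cell area / total area, in percent."""
    return 100.0 * cell_area_mm2 / total_area_mm2


def energy_efficiency_mb_per_mj(tp_mb_per_s: float, avg_power_mw: float) -> float:
    """EE = TP / average power; (MB/s) / (mJ/s) = MB/mJ."""
    return tp_mb_per_s / avg_power_mw


# t_CCD = 4.70 ns is taken from the 58 nm core tile; the rest is hypothetical.
tp = throughput_mb_per_s(4.70, io_width_bits=128)
print(f"f_max = {f_max_mhz(4.70):.1f} MHz, TP = {tp:.0f} MB/s")
print(f"EE = {energy_efficiency_mb_per_mj(tp, avg_power_mw=100.0):.1f} MB/mJ")
print(f"CE = {cell_efficiency_percent(1.5, 3.0):.0f} %")
```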









## Multi-Channel 3D-DRAM Controller

Front End:

- Synchronization with dual-clock FIFOs
- Arbitration
- Buffering, scheduling, reordering

Back End:

- 3D-DRAM command encoding
- Tracking of the bank status
- Multi-IO reconfiguration and data latching for 32/64/128 bit

[Figure: memory controller front end (FE) connected to channel controllers (CC0, CC1) with 32/64/128-bit IO]
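Bank status tracking and command encoding in a back end can be illustrated with a small model. This is a generic sketch of an open-page policy, not the controller described on the slide: the class name, the open-page assumption, and the command mnemonics (ACT, PRE, RD, WR) are common DRAM conventions used here for illustration, and timing constraints such as t<sub>RAS</sub> and t<sub>RCD</sub> are ignored.

```python
from typing import List, Optional


class BankTracker:
    """Tracks the open row per bank and derives the commands an access needs."""

    def __init__(self, num_banks: int) -> None:
        # None means the bank is precharged (no row open).
        self.open_row: List[Optional[int]] = [None] * num_banks

    def commands_for(self, bank: int, row: int, is_write: bool) -> List[str]:
        """Return the command sequence for one access under an open-page policy."""
        cmds: List[str] = []
        current = self.open_row[bank]
        if current is None:
            cmds.append("ACT")            # bank idle: activate the row
        elif current != row:
            cmds.extend(["PRE", "ACT"])   # row miss: precharge, then activate
        # Row hit: no row command needed.
        cmds.append("WR" if is_write else "RD")
        self.open_row[bank] = row
        return cmds


tracker = BankTracker(num_banks=4)
print(tracker.commands_for(0, 5, is_write=False))  # first access to bank 0
print(tracker.commands_for(0, 5, is_write=True))   # row hit
print(tracker.commands_for(0, 7, is_write=False))  # row miss
```

A front end that reorders requests to increase row hits reduces the number of PRE/ACT pairs the back end has to issue, which is one reason scheduling and bank tracking appear together in the controller.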





3D-DRAM single channel configurations

| Dens. [Mb]   | Architecture<br># lay. x [org.] | # of banks | Techn. [nm] | Cell size | $A_{total}$ [mm²] | Freq. [MHz] |
|--------------|---------------------------------|------------|-------------|-----------|-------------------|-------------|
| **SDR x128** |                                 |            |             |           |                   |             |
| \*\*256      | 1 x [4x64Mb]                    | 4          | 58          | $6F^2$    | 16                | 200         |
| 512          | 2 x [4x64Mb]                    | 4          | 58          | $6F^2$    | 26                | 200         |
| 1024         | 8 x [2x64Mb]                    | 8          | 46          | $6F^2$    | 35                | 300         |
| \*2048       | 8 x [2x128Mb]                   | 8          | 46          | $6F^2$    | 60                | 167         |
| 4096         | 8 x [4x128Mb]                   | 8          | 45          | $4F^2$    | 97                | 200         |
| **DDR x128** |                                 |            |             |           |                   |             |
| 256          | 1 x [4x64Mb]                    | 4          | 58          | $6F^2$    | 22                | 200         |
| 512          | 2 x [4x64Mb]                    | 4          | 58          | $6F^2$    | 32                | 200         |
| 1024         | 8 x [2x64Mb]                    | 8          | 46          | $6F^2$    | 44                | 300         |
| \*2048       | 8 x [4x64Mb]                    | 8          | 46          | $6F^2$    | 69                | 300         |
| 4096         | 8 x [4x128Mb]                   | 8          | 45          | $4F^2$    | 98                | 200         |
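Peak bandwidth and areal density per configuration follow from the table. The sketch below assumes that "x128" denotes a 128-bit IO width and that DDR transfers data on both clock edges (two transfers per cycle); function and variable names are illustrative.

```python
def peak_bandwidth_gbyte_s(freq_mhz: float, io_bits: int = 128,
                           ddr: bool = False) -> float:
    """Peak bandwidth in GB/s: frequency * transfers per cycle * IO width."""
    transfers_per_cycle = 2 if ddr else 1
    return freq_mhz * 1e6 * transfers_per_cycle * io_bits / 8 / 1e9


def density_mb_per_mm2(density_mb: float, area_mm2: float) -> float:
    """Areal density in Mb/mm^2."""
    return density_mb / area_mm2


# (mode, density [Mb], total area [mm^2], frequency [MHz]) from the table
CONFIGS = [
    ("SDR", 1024, 35, 300),
    ("SDR", 4096, 97, 200),
    ("DDR", 1024, 44, 300),
    ("DDR", 4096, 98, 200),
]

for mode, dens, area, freq in CONFIGS:
    bw = peak_bandwidth_gbyte_s(freq, ddr=(mode == "DDR"))
    print(f"{mode} {dens:5d} Mb: {bw:4.1f} GB/s peak, "
          f"{density_mb_per_mm2(dens, area):5.1f} Mb/mm^2")
```

Under these assumptions the DDR variants double the peak bandwidth at equal frequency, at the cost of the additional area visible in the $A_{total}$ column.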







# Conclusion

- Bandwidth and memory will be big challenges in future computing systems
- We will see new memory devices, e.g. memristor-based (RRAM) or spin-based (MRAM) memories
- The future in computation will be 3D
- New heterogeneous memory architectures
- Large opportunity for research