테라와 (Tech. Life. Wine)


Google TPU Paper Review: "In-Datacenter Performance Analysis of a Tensor Processing Unit"

J9 2023. 2. 21. 15:58

"In-Datacenter Performance Analysis of a Tensor Processing Unit" is about the Google TPU v1.

TPU v4 has since been released, but the later versions extend the v1 design, so once you understand TPU v1 the rest is easy to follow.

@ What is a TPU?

A tensor processing unit (TPU) is a machine-learning accelerator designed by Google. It is a custom NPU ASIC deployed in Google's datacenters.

@ Why was the TPU developed instead of using GPUs?

Domain-specific hardware is required to improve performance per watt.

Unlike general-purpose programs, NN inference is largely deterministic, so hardware components such as caches, branch predictors, and speculative fetching can be removed.

As a result, even though the TPU consists of a myriad of MACs and a large on-chip memory, its power consumption is relatively low.

 

@ Main architecture

 

The architecture is quite simple. Its main characteristics are as follows:

- PCIe allows it to plug into existing servers, just as a GPU does
- TPU instructions are sent from the host over PCIe
- The heart of the TPU is a 256 x 256 MAC array, organized as a systolic array
- The matrix unit produces 256 partial sums per clock cycle
- Weights are staged through a weight FIFO (a form of weight-stationary dataflow)
- Input activations are delivered through the Unified Buffer to the matrix multiply unit
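As a sanity check on the numbers above, the peak throughput implied by a 256 x 256 MAC array at TPU v1's 700 MHz clock can be computed directly (each 8-bit MAC counts as two operations, a multiply and an add):

```python
# Peak throughput of the 256 x 256 MAC array at TPU v1's 700 MHz clock.
macs = 256 * 256        # 65,536 MACs in the matrix unit
ops_per_mac = 2         # one multiply-accumulate = 2 operations
clock_hz = 700e6        # 700 MHz

peak_tops = macs * ops_per_mac * clock_hz / 1e12
print(f"{peak_tops:.1f} TOPS")  # → 91.8 TOPS, the paper's ~92 TOPS figure
```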

@ How does systolic array work?

- Systolic operation saves energy by reducing reads and writes of the Unified Buffer
- The key point of the TPU is to keep the systolic array busy; if input activations or weight data are not ready, stalls can occur
- The computation proceeds like a pipeline:

 ① Weights are pre-loaded from the weight FIFO, entering from the top
 ② From the left, input data flows in one step per cycle, like a pipeline
 ③ Once the array is filled with 256 inputs, 256 results are produced every cycle
  -> This is the advantage of a systolic array: high throughput and regular latency
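The steps above can be sketched as a small cycle-level simulation. This is a toy weight-stationary array (not Google's implementation), and a matrix-vector product on a tiny grid stands in for the 256 x 256 case:

```python
def systolic_matvec(W, x):
    """Cycle-level sketch of a weight-stationary systolic array.

    W[i][j] is pre-loaded into PE(i, j) and never moves. Activation x[i]
    is injected at the left edge of row i at cycle i (the classic input
    skew); activations hop one PE right per cycle while partial sums hop
    one PE down per cycle. Computes y[j] = sum_i x[i] * W[i][j].
    """
    rows, cols = len(W), len(W[0])
    a = [[0] * cols for _ in range(rows)]   # activation registers
    p = [[0] * cols for _ in range(rows)]   # partial-sum registers
    y = [0] * cols
    for t in range(rows + cols):            # enough cycles to drain the array
        na = [[0] * cols for _ in range(rows)]
        np_ = [[0] * cols for _ in range(rows)]
        for i in range(rows):
            for j in range(cols):
                # activation arriving from the left neighbour (or the edge)
                a_in = x[i] if (j == 0 and t == i) else (a[i][j - 1] if j > 0 else 0)
                # partial sum arriving from the PE above (zero at the top row)
                p_in = p[i - 1][j] if i > 0 else 0
                np_[i][j] = p_in + W[i][j] * a_in
                na[i][j] = a_in
        a, p = na, np_                       # all registers clock together
        # column j's result leaves the bottom of the array at cycle rows-1+j
        if rows - 1 <= t < rows - 1 + cols:
            y[t - (rows - 1)] = p[rows - 1][t - (rows - 1)]
    return y

print(systolic_matvec([[1, 2], [3, 4]], [5, 6]))  # → [23, 34]
```

Note how one result column drains per cycle once the pipeline is full, which is exactly the "high throughput, regular latency" property described above.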

@ Roofline Performance

 

The TPU paper adopts the roofline performance model, which clearly exposes the causes of performance bottlenecks.
The roofline model shows the relationship between the compute limit and the memory-bandwidth limit.

Normally, performance is limited by memory bandwidth rather than by peak compute.

It is a good tool for evaluating different hardware DNN platforms and for checking that a design is well optimized.
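A minimal sketch of the roofline formula: attainable throughput is the lesser of the compute roof and the memory roof (bandwidth times operational intensity). The peak (~92 TOPS) and the 34 GB/s memory bandwidth below are TPU v1 figures from the paper; the operational-intensity values are made-up examples:

```python
def attainable_tops(op_intensity, peak_tops, mem_bw_gbs):
    """Roofline model: min(compute roof, memory roof).

    op_intensity : operations per byte read from memory (ops/B)
    peak_tops    : peak compute throughput (TOPS)
    mem_bw_gbs   : memory bandwidth (GB/s)
    """
    # GB/s * ops/B = Gops/s; divide by 1000 to get TOPS
    memory_roof = mem_bw_gbs * op_intensity / 1000.0
    return min(peak_tops, memory_roof)

# Low-intensity workloads sit on the bandwidth slope; only very
# high-intensity workloads reach the 92 TOPS compute roof.
for oi in (10, 100, 1000, 10000):
    print(oi, attainable_tops(oi, 92.0, 34.0))
```

With these numbers the ridge point is at roughly 92e12 / 34e9 ≈ 2700 ops/byte, which illustrates why the paper finds most workloads memory-bandwidth-bound.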

The paper's roofline results show the relative performance of the CPU, GPU, and TPU.