TNP: Introduction
TNP (Tensor Native Processor) is an experimental processor designed to run AI computation efficiently. TNP primarily relies on a systolic array to accelerate matrix operations (Google’s TPU also uses systolic arrays). To maximize systolic array performance, we make every part of the design “tensor native”:
- 2-diagonal cache read/write - Keeps the systolic array busy.
- Matrix-shaped registers - Each matrix register has the same dimensions as the systolic array, so we can read/write rows, columns, and diagonals of each matrix register (see the sketch after this list).
- Local memory for each core (NUMA) - Maximizes memory bandwidth and avoids a shared last-level cache.
- Vector-add ALU inside the matrix core - Adding two vectors turns out to be a very common operation during matrix multiplication, so we include it in the same core.
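To make the matrix-shaped register and diagonal-access ideas concrete, here is a minimal C++ sketch (not the actual SystemVerilog) of a register the size of the systolic array that can be read by row, by column, or by anti-diagonal; feeding one diagonal per cycle is what keeps every processing element busy. The names (`MatReg`, `DIM`, `read_diag`) are illustrative placeholders, not the real hardware interface.

```cpp
// Minimal behavioral sketch of a matrix-shaped register with row/column/
// anti-diagonal access. Names and sizes are hypothetical placeholders.
#include <array>
#include <cstdio>

constexpr int DIM = 4;  // systolic array (and matrix register) dimension

struct MatReg {
    std::array<std::array<float, DIM>, DIM> data{};

    // Conventional row/column reads.
    std::array<float, DIM> read_row(int r) const { return data[r]; }
    std::array<float, DIM> read_col(int c) const {
        std::array<float, DIM> out{};
        for (int r = 0; r < DIM; ++r) out[r] = data[r][c];
        return out;
    }

    // Read anti-diagonal d (d = 0 .. 2*DIM-2). Streaming one diagonal per
    // cycle means PE (r, c) with r + c == d receives its operand at cycle d.
    std::array<float, DIM> read_diag(int d) const {
        std::array<float, DIM> out{};  // unused slots stay zero (bubbles)
        for (int r = 0; r < DIM; ++r) {
            int c = d - r;
            if (c >= 0 && c < DIM) out[r] = data[r][c];
        }
        return out;
    }
};

int main() {
    MatReg a;
    for (int r = 0; r < DIM; ++r)
        for (int c = 0; c < DIM; ++c)
            a.data[r][c] = static_cast<float>(r * DIM + c);

    // Stream the register into the array diagonal-by-diagonal.
    for (int d = 0; d < 2 * DIM - 1; ++d) {
        auto diag = a.read_diag(d);
        std::printf("cycle %d:", d);
        for (float v : diag) std::printf(" %5.1f", v);
        std::printf("\n");
    }
}
```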
We also want the compiler to have maximum control over the underlying hardware, since the compiler is aware of the global computation structure:
- Software-managed cache (instead of LRU) - Enables more intelligent allocation of register resources.
- Message-passing interface between cores - Lets the compiler explicitly express inter-core communication.
- Deterministic execution time - Lets us predict the execution time of any partial program, which enables a feedback loop between code generation and execution-time evaluation at compile time (see the sketch after this list).
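As an illustration of the last point, here is a minimal C++ sketch of such a compile-time feedback loop: generate candidate schedules, predict their cycle counts with a deterministic cost model, and keep the best one. `Candidate`, `predicted_cycles()`, and the cost constants are hypothetical stand-ins, not the actual compiler's API or timing model.

```cpp
// Sketch of a compile-time schedule search driven by a deterministic cost
// model. All names and numbers below are illustrative placeholders.
#include <cstdio>
#include <vector>

struct Candidate {
    int tile_m, tile_n;  // one possible scheduling knob: output-tile size
};

// Because every instruction has a fixed, known latency, the compiler can sum
// per-instruction cycle counts instead of profiling on hardware. The formula
// below is a placeholder; a real model would follow the ISA timing.
long predicted_cycles(const Candidate& c, int M, int N) {
    long tiles = static_cast<long>((M + c.tile_m - 1) / c.tile_m) *
                 ((N + c.tile_n - 1) / c.tile_n);
    long per_tile = 64 /* fixed issue/drain overhead */ + c.tile_m + c.tile_n;
    return tiles * per_tile;
}

int main() {
    const int M = 256, N = 256;
    std::vector<Candidate> candidates = {{16, 16}, {32, 32}, {64, 64}};

    Candidate best = candidates[0];
    long best_cycles = predicted_cycles(best, M, N);
    for (const Candidate& c : candidates) {
        long cycles = predicted_cycles(c, M, N);
        std::printf("tile %3dx%-3d -> %ld predicted cycles\n",
                    c.tile_m, c.tile_n, cycles);
        if (cycles < best_cycles) {
            best = c;
            best_cycles = cycles;
        }
    }
    std::printf("selected tile %dx%d\n", best.tile_m, best.tile_n);
}
```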
We implemented the hardware (matrix cores, vector cores, switch) in SystemVerilog and the software stack (assembler, compiler, ONNX interface) in C++.
This project began as a 15418/15618 course project (project poster).