Virtual Expo 2026

D09 - MAC-Forge: Systolic Array Based AI Accelerator

Envision Diode

MAC-Forge: Architecting the Silicon Behind AI

Project Details


Aim

  1. Design a systolic array architecture for efficient parallel matrix multiplication built on multiply-accumulate (MAC) operations.
  2. Implement the design in synthesizable Verilog/SystemVerilog for hardware realization.
  3. Arrange processing elements in a pipelined, rhythmic data flow to maximize throughput.
  4. Understand hardware-level parallelism and dataflow in systolic array structures.

Introduction

This project implements a “Tiny TPU” matrix accelerator in Verilog HDL, based on the systolic array architecture popularized by Google’s TPU and comparable in purpose to dedicated matrix-multiply units such as NVIDIA’s Tensor Cores.

In an era where deep learning performance is limited more by silicon than by software, and general-purpose CPUs fall short for these workloads, this project offers a hands-on opportunity to explore domain-specific hardware acceleration. By designing the Tiny TPU, mentees gain a deep understanding of parallel computation and the building blocks of high-performance AI hardware.

Technologies Used

  • Verilog/SystemVerilog HDL – RTL hardware design (source files use the .sv extension)
  • Xilinx Vivado – Design, synthesis, simulation, and implementation
  • Python & HTML – Building a visualization dashboard from simulation data (.csv file)

Design Methodology

Phase 1: RTL Hardware Architecture Design

  • Processing Element (PE) Design:
    • Module: proc_element_ws.sv
    • Acts as a dedicated Multiply-Accumulate (MAC) unit.
    • Stores a 32-bit stationary weight.
    • Each cycle: multiplies the incoming activation by the stored weight, adds the partial sum arriving from above, routes the result downward, and passes the activation to the right.
  • Grid Instantiation & Routing:
    • Module: systolic_array_ws.sv
    • Verilog generate blocks are used to wire 16 PEs into a 4×4 matrix.
    • Boundary conditions are handled by padding the edge inputs with zeros.
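
The dataflow described above can be sketched as a small behavioral model in Python, the language already used for the dashboard. This is an illustrative sketch rather than the project's RTL: the function names, the input skew (row r is delayed by r cycles), the generous cycle count, and the 64-bit wraparound are all assumptions made for the example.

```python
def to_signed64(value):
    """Wrap a Python int to signed 64 bits, mimicking a 64-bit register (assumed)."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >= (1 << 63) else value


def systolic_matmul_ws(x, w):
    """Cycle-by-cycle model of an N x N weight-stationary array computing x @ w.

    PE (r, c) holds w[r][c]. Activations move right, partial sums move down.
    """
    n = len(w)
    m = len(x)
    act = [[0] * n for _ in range(n)]   # activation register of each PE
    psum = [[0] * n for _ in range(n)]  # partial-sum register of each PE
    y = [[0] * n for _ in range(m)]

    # m + 2n cycles is more than enough for the last result to drain (assumed bound).
    for t in range(m + 2 * n):
        new_act = [[0] * n for _ in range(n)]
        new_psum = [[0] * n for _ in range(n)]
        for r in range(n):
            for c in range(n):
                if c == 0:
                    # Left edge: row r receives x[t - r][r], zero-padded otherwise.
                    i = t - r
                    a_in = x[i][r] if 0 <= i < m else 0
                else:
                    a_in = act[r][c - 1]
                p_in = psum[r - 1][c] if r > 0 else 0  # top edge is padded with zero
                new_act[r][c] = a_in
                new_psum[r][c] = to_signed64(p_in + a_in * w[r][c])
        act, psum = new_act, new_psum
        # After this cycle, the bottom PE of column c holds y[t - c - (n - 1)][c].
        for c in range(n):
            i = t - c - (n - 1)
            if 0 <= i < m:
                y[i][c] = psum[n - 1][c]
    return y
```

Feeding an identity matrix as `x` should return `w` unchanged, which is a quick sanity check of the routing. The exact cycle on which each result appears depends on where registers are placed, so this model is not meant to reproduce the RTL's latency figure.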

Phase 2: Automated Testbench & State Extraction

  • Testbench: systolic_array_ws_tb.sv
  • Test Vector Generation: Randomized 32-bit signed integers, identity matrices, and negative arrays.
  • Cycle-by-Cycle CSV Logging:
    • $fopen and $fwrite are used to log the 64-bit register states of all 16 PEs on every cycle.
    • Output is stored in sim_data.csv as a permanent, parseable record.
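
The three kinds of test vectors listed above, together with a software reference to compare against, might look like the following Python sketch. The helper names are hypothetical, and wrapping the reference result to 64 bits is an assumption about how the accumulators behave (a sum of several extreme 32-bit products can exceed 64 bits).

```python
import random

INT32_MIN, INT32_MAX = -(2 ** 31), 2 ** 31 - 1


def random_matrix(n, rng):
    """n x n matrix of random signed 32-bit integers."""
    return [[rng.randint(INT32_MIN, INT32_MAX) for _ in range(n)] for _ in range(n)]


def identity_matrix(n):
    """n x n identity matrix."""
    return [[1 if r == c else 0 for c in range(n)] for r in range(n)]


def negative_matrix(n, rng):
    """n x n matrix whose entries are all strictly negative 32-bit integers."""
    return [[rng.randint(INT32_MIN, -1) for _ in range(n)] for _ in range(n)]


def wrap64(value):
    """Wrap to signed 64 bits (assumed accumulator behavior)."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >= (1 << 63) else value


def golden_matmul(a, b):
    """Reference product a @ b with each result wrapped to 64 bits."""
    n = len(a)
    return [
        [wrap64(sum(a[i][k] * b[k][j] for k in range(n))) for j in range(n)]
        for i in range(n)
    ]


rng = random.Random(2026)  # fixed seed so a failing case can be reproduced
a = random_matrix(4, rng)
expected = golden_matmul(a, identity_matrix(4))  # multiplying by identity returns a
```

Seeding the generator keeps the randomized scenario repeatable, so any mismatch in the logged CSV can be traced back to the same inputs.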

Phase 3: Software Dashboard & UI Generation

  • Python Tool: build_dashboard.py
  • Data Parsing: Filters out uninitialized states and organizes the hardware data by test scenario.
  • Dynamic HTML Rendering:
    • Generates a standalone HTML/JS dashboard.
    • 4×4 grid UI with forward/backward stepping through cycles.
    • Visualizes the diagonal wavefront as data populates the grid, validating the 3N–1 cycle latency (11 cycles for N=4).
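
The parsing step could be sketched as follows. The layout of sim_data.csv is not specified here, so the column names (`scenario`, `cycle`, `pe0` to `pe15`) and the convention that uninitialized registers are logged as `x` are assumptions for illustration only, as is the function name.

```python
import csv
import io
from collections import defaultdict

N = 4  # grid dimension


def parse_sim_data(lines):
    """Group per-cycle PE snapshots by scenario, skipping uninitialized rows.

    `lines` is any iterable of CSV text lines, such as an open file.
    Returns {scenario: [(cycle, 4x4 grid), ...]} in file order.
    """
    scenarios = defaultdict(list)
    for row in csv.DictReader(lines):
        cells = [row[f"pe{i}"] for i in range(N * N)]
        # Skip cycles where any register is still unknown (assumed to log as 'x').
        if any("x" in cell.lower() for cell in cells):
            continue
        values = [int(cell) for cell in cells]
        grid = [values[r * N:(r + 1) * N] for r in range(N)]
        scenarios[row["scenario"]].append((int(row["cycle"]), grid))
    return dict(scenarios)


# Small in-memory example standing in for sim_data.csv.
header = "scenario,cycle," + ",".join(f"pe{i}" for i in range(N * N))
sample = "\n".join([
    header,
    "identity,0," + ",".join(["x"] * 16),
    "identity,1," + ",".join(str(i) for i in range(16)),
    "negative,1," + ",".join(["-1"] * 16),
])
parsed = parse_sim_data(io.StringIO(sample))
```

Each scenario then maps to an ordered list of grids, which is a convenient shape for a dashboard that steps forward and backward through cycles.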

Simulation Conditions

  • Environment: Xilinx Vivado 2025.2
  • Software: Python 3.10+, HTML5/CSS3/JavaScript
  • Clock Period: 10 ns (100 MHz simulation clock; timing not yet verified on hardware)
  • Datapath: 32-bit signed inputs, 64-bit accumulators
  • Grid Dimensions: 4×4 (N=4)
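
As a quick check of the figures above, the following Python sketch works through the arithmetic: the clock frequency implied by the period, the wall-clock time of the 3N–1 cycle latency, and why a single 32-bit × 32-bit signed product fits in a 64-bit accumulator. The variable names are illustrative.

```python
CLOCK_PERIOD_NS = 10
N = 4
INPUT_BITS = 32
ACC_BITS = 64

# A 10 ns period corresponds to 1000 / 10 = 100 MHz.
freq_mhz = 1_000 / CLOCK_PERIOD_NS

# The reported latency of 3N - 1 cycles, in cycles and in nanoseconds.
latency_cycles = 3 * N - 1
latency_ns = latency_cycles * CLOCK_PERIOD_NS

# Largest magnitude of one signed product: (-2**31) * (-2**31) = 2**62.
max_product = (2 ** (INPUT_BITS - 1)) ** 2
acc_max = 2 ** (ACC_BITS - 1) - 1
single_product_fits = max_product <= acc_max

print(freq_mhz, latency_cycles, latency_ns, single_product_fits)
```

For N=4 this gives 11 cycles, or 110 ns at the simulated clock.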

Results

Hardware Verification (Waveform Analysis)

  • Activations and partial sums propagated between PEs with the expected cycle-by-cycle delays.
  • 32-bit signed values cascaded through the array correctly.
  • MAC results matched the software model outputs.
  • Zero mismatches were found.

Software Verification (Visual Dashboard)

  • The Python script parsed the logged hardware data correctly.
  • The interactive UI confirmed the diagonal flow of data through the grid.
  • The observed latency matched the expected 3N–1 cycles.


Conclusion

The project successfully designed and verified a 4×4 weight-stationary systolic array in simulation. By extracting hardware states into a custom software dashboard, it bridges low-level RTL debugging with high-level algorithmic visualization. The results illustrate the high arithmetic intensity of systolic architectures: once loaded, each weight and activation is reused across many MAC operations, which is how such arrays ease memory bottlenecks when performing 32-bit signed matrix multiplication.

Future Scope

  • FPGA Synthesis: Timing closure and area utilization analysis.
  • AXI4-Stream Integration: Interface with ARM processor memory subsystem via DMA.
  • Scalability: Expand N=4 to larger grids, analyze routing and utilization impacts.

