Abstract
For about fifty years, hardware designers relied on semiconductor scaling laws, such as Moore's Law and Dennard scaling, to achieve gains in performance. The industry grew accustomed to processor performance per watt doubling approximately every 18 months. Over the past decade, however, these scaling predictions have broken down. With the old certainties of scaling silicon geometries gone, the industry is already changing. The number of cores on a single die has increased, and SoCs such as mobile phone processors combine application-specific co-processors, GPUs, and DSPs in different configurations to maintain the performance scaling trend. However, in a post-Dennard, post-Moore world, further processor specialization is needed to achieve performance improvements. Emerging applications such as artificial intelligence and vision demand computational performance that conventional architectures cannot deliver. This inevitably leads to the creation of special-purpose, domain-specific accelerators. A domain-specific accelerator (DSA) is a processor, or set of processors, optimized to perform a narrow range of computations, tailored to the needs of the algorithms in its domain. For example, an AI accelerator might contain an array of processing elements with multiply-accumulate functionality to execute matrix operations efficiently. Google's Tensor Processing Unit (TPU), the Neural Engine in Apple's M1 processor, and the Xilinx Vitis-AI Engine are popular ASIC-based DSAs. ASIC-based DSAs provide significant gains in performance and power efficiency. However, owing to their long design cycles and high engineering costs, they may not keep pace with the ever-evolving computation landscape. Field-Programmable Gate Arrays (FPGAs) offer advantages over Application-Specific Integrated Circuits (ASICs) in certain scenarios due to their flexibility and shorter development times.
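The multiply-accumulate (MAC) operation mentioned above is the core primitive that such accelerators replicate in hardware. A minimal sketch (not from this work; the function name is illustrative) of how an array of MAC elements evaluates a matrix product:

```python
# Hypothetical sketch: each inner-loop step below corresponds to one
# multiply-accumulate (MAC) operation that a processing element of an
# AI accelerator performs in hardware.
def mac_matmul(a, b):
    """Multiply an (m x k) matrix a by a (k x n) matrix b using
    explicit multiply-accumulate steps."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0  # the accumulator register of one MAC element
            for p in range(k):
                acc += a[i][p] * b[p][j]  # one MAC operation
            c[i][j] = acc
    return c
```

In a hardware DSA, the i/j/p loops are unrolled spatially across the array of MAC elements, so many of these accumulations proceed in parallel rather than sequentially.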
FPGAs are programmable hardware: users can configure their functionality after manufacturing, whereas ASICs are hardwired for specific tasks. This flexibility makes FPGAs suitable for prototyping, testing, and adapting to changing requirements without costly chip redesigns. Developing an ASIC is a complex and time-consuming process, often taking months to years; FPGAs can be programmed and deployed much faster, which is advantageous when speed to market is crucial. ASICs are also expensive to design and manufacture, especially for low-volume or rapidly changing applications, while FPGAs have a lower initial cost because they do not require custom chip fabrication. This cost-effectiveness is notable for small production runs and research projects. FPGAs are therefore rapidly emerging as an alternative to custom ASICs for designing DSAs, owing to their low power consumption and high degree of parallelism. Designing a DSA on an FPGA requires carefully calibrating the device's compute and memory resources to achieve optimal throughput. Hardware Description Languages (HDLs) such as Verilog have traditionally been used to design FPGA hardware. HDLs are generic and not geared towards any particular domain, and the user must expend considerable effort to describe the hardware at the register-transfer level. A recent trend is to use existing HDLs to create carefully handwritten templates suited to a specific domain; a compiler framework then weaves these templates together to generate a DSA that accelerates the domain's computations. This approach requires expensive design synthesis and FPGA re-flashing to accelerate different algorithms from the domain, which may not be feasible in many edge and deeply embedded applications. Furthermore, cloud companies now offer FPGA-based acceleration as a service, supported at the backend by large clusters of custom accelerators.
In contrast to this fixed-function hardware approach, where the DSA is tied to a specific function, an alternative design approach based on overlay accelerators is gaining prominence. An overlay is a processor-like DSA that is synthesized and flashed onto the FPGA once, yet is flexible enough to process a broad class of computations through soft reconfiguration. Over the last few years, several design approaches for overlay accelerators have emerged. Some overlay designs resemble a processor controlled through an instruction set; the Xilinx Vitis-AI Engine is a prime example of such a design. However, this homogeneous approach often leads to inefficiencies arising from the fetch-decode-execute model of processor design: instruction-based overlays spend significant energy on instruction overhead rather than on actual computation. To address this shortcoming of homogeneous overlay accelerators, a heterogeneous design methodology is proposed, in which a heterogeneous overlay accelerator contains multiple small homogeneous units.