This page describes how the software and hardware elements of an FPGA accelerator design communicate to exchange data and maintain synchronization.
The hardware architecture of the Intel (Altera) Cyclone V SoC chip has two basic components: the Hard Processor System (HPS) and an FPGA fabric. The HPS is primarily a dual-core ARM Cortex-A9 MPU and an assortment of peripherals (Ethernet, USB, flash controllers, etc.) accessed through an L3 interconnect block. The FPGA is a standard Cyclone V fabric composed of traditional fine-grained logic elements, larger block RAMs, and some dedicated math ("DSP") blocks.
The FPGA and HPS communicate through two AXI4 buses connected to the L3 interconnect: the H2F bus and the F2H bus. As the names imply, the H2F bus is mastered by the HPS and the F2H bus is mastered by the FPGA. The H2F bus allows the CPUs to access logic on the FPGA as a memory mapped device (or devices). Conversely, the F2H AXI bus allows FPGA logic to access HPS devices like the DRAM (or pretty much anything else it wants).
It's worth noting here that while the FPGA fabric can access the DRAM directly via the L3 interconnect, it can also access it in a coherent manner via the L1 and L2 cache system of the processor. This is facilitated by an Accelerator Coherency Port (ACP) in the HPS. The method of access (coherent or not) depends on which address range the FPGA uses to access the DRAM. The 1 GByte DRAM can be accessed directly at addresses 0x00000000-0x3FFFFFFF. The same DRAM can be accessed coherently (via the L1/L2 caches) at addresses 0x80000000-0xBFFFFFFF.
The following diagram illustrates the SoC architecture.
The Intel Quartus SoC build environment for the DE1-SoC board is found here: DE1-SoC FPGA. This build environment can generate an FPGA image for any given Accelerator.v source file containing the two h2f and f2h AXI interfaces.
The SoC architecture supports a 32-bit physical address range, resulting in a 4 GByte address space. As can be seen, both the FPGA and the CPU have access to a 4 GByte address space. These address spaces are similar but not identical. For example, the CPU can access FPGA slaves (via the H2F bus) through the address range starting at 0xC0000000. Conversely, the FPGA can access the 1 GByte DRAM directly at addresses 0x00000000-0x3FFFFFFF or coherently via the ACP at addresses 0x80000000-0xBFFFFFFF.
As mentioned, accesses by the CPU to the 'FPGA Slaves' region will be sent to the FPGA logic via the H2F AXI bus. The FPGA accesses its 4 GByte address space by driving the F2H AXI bus.
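For reference, here is a minimal C sketch of these address ranges, including a helper that converts a physical DRAM address into its cache-coherent ACP alias. The macro and function names are illustrative, not part of any vendor API.

#include <stdint.h>

/* Key physical address ranges from the SoC address map (illustrative names). */
#define DRAM_DIRECT_BASE  0x00000000u  /* FPGA -> DRAM, non-coherent          */
#define DRAM_ACP_BASE     0x80000000u  /* FPGA -> DRAM via ACP, coherent      */
#define FPGA_SLAVE_BASE   0xC0000000u  /* CPU  -> FPGA slaves via the H2F bus */

/* Convert a physical DRAM address (0x00000000-0x3FFFFFFF) into its
 * cache-coherent alias (0x80000000-0xBFFFFFFF). An FPGA master would
 * drive the aliased address on the F2H bus to reach the same DRAM
 * word through the L1/L2 caches. */
static inline uint32_t acp_alias(uint32_t dram_phys)
{
    return dram_phys + DRAM_ACP_BASE;
}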
As mentioned above, the software interface to the FPGA is similar to that of any hardware peripheral in that it is simply memory mapped. In our case, the FPGA logic can be accessed starting at the physical address 0xC0000000. Since the CPU cores in our system are running Linux, a device driver is required to gain access to the physical address range. Happily, the "/dev/mem" device driver can be used by a user application to mmap() a physical address range to a virtual pointer.
#include <stdint.h>
#include <fcntl.h>
#include <sys/mman.h>

volatile uint32_t *vptr_fpga;

/* Open /dev/mem */
int fd = open( "/dev/mem", ( O_RDWR | O_SYNC ) );

/* Get a virtual pointer to a 4k window at the FPGA base address. */
vptr_fpga = mmap( NULL, 4096, ( PROT_READ | PROT_WRITE ), MAP_SHARED, fd, 0xC0000000 );
Any access through vptr_fpga will generate bus cycles on the H2F AXI bus to be processed by the FPGA logic.
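For example, a 32-bit store and load through the mapped pointer (the word offsets here are purely illustrative):

vptr_fpga[0] = 0x1;              /* 32-bit word write -> H2F AXI write */
uint32_t status = vptr_fpga[1];  /* 32-bit word read  -> H2F AXI read  */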
Typically, hardware/software synchronization is handled through the use of control and status registers. Control registers are implemented (as one might expect) using registers in hardware. All of the bits in control registers are signals that feed into the FPGA design logic to control its behavior. Control registers may be write-only in some designs, but generally they can be read back as well.
In the case of the SoC accelerator logic, we are building a hardware accelerator to replace a function call generated by the NEEDLE software. Control registers will be used to start (and possibly reset, if necessary) the data flow logic, as well as to hold the values of any arguments to the accelerator function call. Status registers will be used to signal the completion of the data flow logic and to provide any value returned by the function.
Consider the following example where the accelerator is replacing the following offload function:
uint32_t __offload_func_0(uint32_t len, uint32_t *vec1, uint32_t *vec2) {
    // Multiply two vectors and sum the results
    int i;
    uint32_t mac = 0;
    for (i = 0; i < len; i++) {
        mac += vec1[i] * vec2[i];
    }
    return mac;
}
There would be three control registers to contain the values of len, vec1, and vec2 and pass them into the accelerator data flow logic. Additionally, there would be a control register containing a 'start' (and possibly a 'reset') bit. A status register would store the 'mac' result. A second status register would provide some form of 'done' bit indicating the function is complete.
The exact meaning of the control and status registers will depend on the specific requirements of the algorithm being accelerated. However, it is expected there will be at least one general control register bit to start the accelerator, some number of 'arg' control registers, and status registers to signal completion and/or a result if required. Initially, the completion of the accelerator call can be determined by polling the 'done' status bit. However, long term it may be useful to provide an interrupt from the FPGA.
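Putting this together for the __offload_func_0 example, a minimal polling driver might look like the sketch below. The register offsets and bit positions are assumptions for illustration (the real layout is design dependent), and the vector arguments are passed as physical DRAM addresses so the FPGA can fetch them over the F2H bus.

#include <stdint.h>

/* Hypothetical register map (word offsets within the 4K FPGA slave window). */
#define REG_CTRL      0   /* control: bit 0 = start (assumed)   */
#define REG_ARG_LEN   1   /* control: len argument              */
#define REG_ARG_VEC1  2   /* control: physical address of vec1  */
#define REG_ARG_VEC2  3   /* control: physical address of vec2  */
#define REG_STAT      4   /* status:  bit 0 = done (assumed)    */
#define REG_RESULT    5   /* status:  mac return value          */

uint32_t run_offload_func_0(volatile uint32_t *regs, uint32_t len,
                            uint32_t vec1_pa, uint32_t vec2_pa)
{
    /* Load the function arguments into the 'arg' control registers. */
    regs[REG_ARG_LEN]  = len;
    regs[REG_ARG_VEC1] = vec1_pa;
    regs[REG_ARG_VEC2] = vec2_pa;

    /* Set the start bit to kick off the data flow logic. */
    regs[REG_CTRL] = 0x1;

    /* Poll the done bit until the accelerator signals completion. */
    while ((regs[REG_STAT] & 0x1) == 0)
        ;

    /* Read back the accumulated result. */
    return regs[REG_RESULT];
}

With the mapping from the previous section, this would be invoked as run_offload_func_0(vptr_fpga, len, vec1_pa, vec2_pa), where vec1_pa and vec2_pa are the physical DRAM addresses of the two vectors.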
The control and status registers are laid out in the FPGA Slave region of memory across a 4K window as shown below. The number of registers available is configurable and their usage is design dependent.
The accelerator code used to wrap a dataflow Scala file can be found in the repository here: Accelerator
The top level file is Accelerator.scala. It instantiates and connects three helper blocks: SimpleReg.scala, Cache.scala, and Core.scala. The relationship of the files is illustrated below:
SimpleReg.scala creates a configurable set of control and status registers as described in earlier sections. It makes the registers accessible on the io.h2f AXI (aka Nasti) bus so that they can be written and read by the CPU. It outputs a vector of control registers as io.ctrl and accepts a vector of status signals on io.stat.
Core.scala should contain the generated data flow logic that implements the function to be accelerated. The Core.scala file checked in to the repository is just test code that runs a memory test via the Cache.scala block and should be replaced by any design-specific accelerator code that has been generated. It can be used as an example of the interfaces necessary for the register and cache blocks. The Core.scala file is responsible for mapping the generic control and status register bits to the signals needed by the specific function being accelerated.
Cache.scala provides a cached interface to the io.f2h (Nasti) bus. It responds to requests presented by the Core block on the io.cache interface.