Benchmark kernel

This folder contains the source code to generate the benchmark application, which is a mix of RTL and HLS kernels.

This application is split into two independent kernels: the traffic generator and the collector.

Modes

First let's see what the benchmark application is capable of. The following figure shows a representation of the four modes available on the benchmark kernel.

Mode 0, PRODUCER. Generates payload with a specific dest and size. The size is a multiple of 64 bytes, from a minimum of 64 bytes up to 1472 bytes. Only the transmitting AXI4-Stream is enabled.
Mode 1, LATENCY. Generates an 18-Byte payload. It also consumes the incoming payload (reflected by the other endpoint) and generates a summary of it that is sent to the collector. All three AXI4-Streams are enabled.
Mode 2, LOOPBACK. Forwards the received payload; the dest can be updated on the transmitter side. Both the transmitting and receiving AXI4-Streams are enabled.
Mode 3, CONSUMER. Consumes incoming packets and does nothing else. Only the receiving AXI4-Stream is enabled.
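
As a host-side illustration of the modes listed above, the following Python sketch encodes the four mode values, the streams each one enables, and the PRODUCER payload size rule. The names `Mode`, `ENABLED_STREAMS` and `is_valid_producer_size` are hypothetical and are not part of this repository.

```python
from enum import IntEnum


class Mode(IntEnum):
    """Mode values as listed in this README."""
    PRODUCER = 0
    LATENCY = 1
    LOOPBACK = 2
    CONSUMER = 3


# Streams enabled per mode as (transmit, receive, summary).
# This table only restates the mode descriptions; it is not read from hardware.
ENABLED_STREAMS = {
    Mode.PRODUCER: (True, False, False),
    Mode.LATENCY: (True, True, True),
    Mode.LOOPBACK: (True, True, False),
    Mode.CONSUMER: (False, True, False),
}

BEAT_BYTES = 64           # one transaction carries 64 bytes
MAX_PAYLOAD_BYTES = 1472  # largest payload in PRODUCER mode


def is_valid_producer_size(size_bytes: int) -> bool:
    """Return True if size_bytes is a multiple of 64 between 64 and 1472."""
    return (BEAT_BYTES <= size_bytes <= MAX_PAYLOAD_BYTES
            and size_bytes % BEAT_BYTES == 0)
```

For example, `is_valid_producer_size(1472)` is true, while `is_valid_producer_size(1500)` is false because 1500 exceeds the maximum and is not a multiple of 64.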

Traffic Generator

The traffic generator is an RTL kernel. It is in charge not only of generating payload, but also of consuming payload and generating a summary of the incoming payload. The functionality depends on the mode.

Payload generator

The main FSM is in charge of generating the data that will be used as payload in the UDP packets. Each payload packet has the following header.

struct payload_header {
  bit[ 39:  0]  packet_id;
  bit[ 79: 40]  local_timestamp;
  bit[119: 80]  total_number_packets;
}

This header is useful when measuring latency. There is a secondary FSM which, depending on the mode, generates a summary that is sent to the collector. The summary stream has the following structure.

struct summary_struct {
  bit[ 39:  0]  packet_id;
  bit[ 79: 40]  tx_local_timestamp;
  bit[119: 80]  rx_local_timestamp;
}

This secondary FSM asserts tlast in the summary stream either when the experiment ends (the last payload packet was received) or when a timeout is reached.
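
Both the payload header and the summary carry three 40-bit fields in bits 119:0. The Python sketch below shows one way to pack and unpack such a word on the host, for example when inspecting captured data. The helper names `pack_fields` and `unpack_fields` are hypothetical, and the sketch assumes the word is already available as a single integer.

```python
FIELD_BITS = 40
FIELD_MASK = (1 << FIELD_BITS) - 1


def pack_fields(first: int, second: int, third: int) -> int:
    """Pack three 40-bit fields: first in bits 39:0, second in 79:40, third in 119:80."""
    for value in (first, second, third):
        if not 0 <= value <= FIELD_MASK:
            raise ValueError("field does not fit in 40 bits")
    return first | (second << FIELD_BITS) | (third << (2 * FIELD_BITS))


def unpack_fields(word: int) -> tuple:
    """Split a 120-bit word back into its three 40-bit fields."""
    return (
        word & FIELD_MASK,
        (word >> FIELD_BITS) & FIELD_MASK,
        (word >> (2 * FIELD_BITS)) & FIELD_MASK,
    )
```

For a summary word, the returned tuple corresponds to `(packet_id, tx_local_timestamp, rx_local_timestamp)`; for a payload header, to `(packet_id, local_timestamp, total_number_packets)`.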

Register map

Offset	Name	Mode	Description
0x00	crtl_signals	R/W	Kernel control signals, more info here
0x10	mode	R/W	Mode selector
0x14	outbound_dest	R/W	Set outbound `TDEST`
0x18	num_packets_lsb	R/W	Number of packets LSB
0x1C	num_packets_msb	R/W	Number of packets MSB
0x20	num_beats	R/W	Number of transactions per piece of payload. The payload size will be num_beats * 64-Byte. Max is 23
0x24	tbwp	R/W	Clock ticks between two consecutive payload packets
0x28	reset_fsm	W	Reset internal FSMs, self clear
0x2C	fsm_debug_info	R	Internal FSM state
0x30	reserved	-	Reserved for future use
0x34	out_traffic_cycles	R	Outbound traffic, number of cycles (64-bit)
0x3C	out_traffic_bytes	R	Outbound traffic, number of bytes (64-bit)
0x44	out_traffic_packets	R	Outbound traffic, number of payload packets (64-bit)
0x4C	in_traffic_cycles	R	Inbound traffic, number of cycles (64-bit)
0x54	in_traffic_bytes	R	Inbound traffic, number of bytes (64-bit)
0x5C	in_traffic_packets	R	Inbound traffic, number of payload packets (64-bit)
0x64	summary_cycles	R	Summary traffic, number of cycles (64-bit)
0x6C	summary_bytes	R	Summary traffic, number of bytes (64-bit)
0x74	summary_packets	R	Summary traffic, number of payload packets (64-bit)
0x7C	debug_reset	W	Reset debug probes, self clear
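
The register map above can be mirrored on the host as a table of offsets. The Python sketch below is one possible way to read the 64-bit counters and derive a throughput figure from them. It rests on several assumptions: `read32` is a hypothetical callable that returns the 32-bit value at a byte offset, each 64-bit counter is assumed to occupy two consecutive 32-bit words with the least significant word first, and the clock frequency must be supplied by the caller.

```python
# Byte offsets copied from the register map (only a subset is listed)
REGS = {
    "mode": 0x10,
    "outbound_dest": 0x14,
    "num_packets_lsb": 0x18,
    "num_packets_msb": 0x1C,
    "num_beats": 0x20,
    "tbwp": 0x24,
    "out_traffic_cycles": 0x34,
    "out_traffic_bytes": 0x3C,
    "out_traffic_packets": 0x44,
    "in_traffic_cycles": 0x4C,
    "in_traffic_bytes": 0x54,
    "in_traffic_packets": 0x5C,
}


def read_u64(read32, offset: int) -> int:
    """Read a 64-bit counter as two 32-bit words (assumed low word first)."""
    low = read32(offset)
    high = read32(offset + 4)
    return (high << 32) | low


def split_u64(value: int) -> tuple:
    """Split a 64-bit value into (lsb, msb) words, e.g. for num_packets."""
    return value & 0xFFFFFFFF, (value >> 32) & 0xFFFFFFFF


def throughput_gbps(num_bytes: int, cycles: int, clock_hz: float) -> float:
    """Bits transferred per second, in Gbit/s, from byte and cycle counters."""
    if cycles == 0:
        return 0.0
    return num_bytes * 8 * clock_hz / cycles / 1e9
```

With a real device, `read32` would be replaced by whatever register access function the host framework provides; a plain dictionary lookup is enough to try the helpers offline.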

Collector

The collector is an HLS kernel. It should only be enabled when measuring latency. This kernel reads the summary generated by the payload generator and computes rx_local_timestamp - tx_local_timestamp for each incoming packet. The result is the Round-Trip Time (RTT) of an individual packet in clock cycles, represented as a 32-bit value. The RTT of each packet is written to a local memory, where 16 results are arranged in a 512-bit vector to maximize throughput with global memory. Once the local memory is full, the results are copied to global memory, again to maximize performance with global memory. The local memory is 512 bits wide and 64 rows deep, so it can hold up to 1,024 RTT results.

When tlast is asserted this kernel finishes and moves the local memory to global memory. The total number of received packets can be read from the received_packets argument.
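
The behaviour described above can be modelled in a few lines of Python. This is only a behavioural sketch, not the HLS source: the function names are hypothetical, the position of each result inside the 512-bit word (first result in the lowest 32 bits) is an assumption, and so is the truncation of the timestamp difference to 32 bits.

```python
ELEMS_PER_WORD = 16    # 16 x 32-bit RTT results per 512-bit word
LOCAL_ROWS = 64        # depth of the local memory
RTT_MASK = 0xFFFFFFFF  # RTT results are 32-bit values


def collect(summaries):
    """Model the collector for an iterable of (tx_timestamp, rx_timestamp).

    Returns (global_memory, received_packets), where global_memory is a
    list of 512-bit words held as Python integers.
    """
    global_mem = []
    local_mem = []
    word = 0
    received_packets = 0
    for tx_ts, rx_ts in summaries:
        rtt = (rx_ts - tx_ts) & RTT_MASK             # RTT in clock cycles
        slot = received_packets % ELEMS_PER_WORD
        word |= rtt << (32 * slot)                   # assumed element order
        received_packets += 1
        if slot == ELEMS_PER_WORD - 1:               # 512-bit word complete
            local_mem.append(word)
            word = 0
            if len(local_mem) == LOCAL_ROWS:         # local memory full
                global_mem.extend(local_mem)
                local_mem.clear()
    # End of stream (tlast): flush whatever is left in local memory
    if received_packets % ELEMS_PER_WORD:
        local_mem.append(word)
    global_mem.extend(local_mem)
    return global_mem, received_packets


def unpack_word(word):
    """Split one 512-bit word into its 16 RTT results."""
    return [(word >> (32 * i)) & RTT_MASK for i in range(ELEMS_PER_WORD)]
```

Since a partially filled word is padded with zeros in this model, the packet count is needed to tell valid results from padding.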

From software, users should retrieve the results from global memory and multiply each result by the clock period to convert the RTT measurement to seconds. Statistics such as the arithmetic mean, max, min, and standard deviation can also be computed. The companion notebooks show how to do this.

Note: The clock frequency can vary between implementations and even at runtime. For this reason the conversion from clock cycles to seconds is performed in software.
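
A minimal sketch of that software-side conversion, using only the Python standard library, might look as follows. The function names are hypothetical, the clock frequency is a value the caller has to obtain for the actual design, and the choice of population standard deviation is an assumption; the companion notebooks may compute these figures differently.

```python
import statistics


def rtt_to_seconds(rtt_cycles, clock_hz):
    """Convert RTT results from clock cycles to seconds."""
    period = 1.0 / clock_hz  # clock period in seconds
    return [cycles * period for cycles in rtt_cycles]


def summarize(values):
    """Arithmetic mean, min, max and population standard deviation."""
    return {
        "mean": statistics.fmean(values),
        "min": min(values),
        "max": max(values),
        "stdev": statistics.pstdev(values),
    }
```

For instance, with a hypothetical 250 MHz clock, `rtt_to_seconds([250, 500], 250e6)` gives values close to 1 µs and 2 µs.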
