This folder contains the source code to generate the benchmark application, which is a mix of RTL and HLS kernels.
This application is actually split into two independent kernels, traffic generator and collector
First let's see what the benchmark application is capable of. The following figure shows a representation of the four modes available on the benchmark kernel.
-
Mode 0,
PRODUCER
. Generates payload with a specif dest and size, multiple of 64-Byte up to 1472-Byte, the minimum size is 64-Byte. Only transmitting AXI4-Stream is enabled. -
Mode 1,
LATENCY
. Generates an 18-Byte payload, it also consume incoming payload (which are reflected from other endpoint) and generate a summary of it that is populated to the collector. All three AXI4-Stream are enabled. -
Mode 2,
LOOPBACK
. The receiving payload is forwarded, the dest can be updated in the transmitter side. Both transmitting and receiving AXI4-Stream are enabled. -
Mode 3,
CONSUMER
. Consume incoming packets and do nothing else. Only receiving AXI4-Stream is enabled.
The traffic generator kernel is an RTL kernel, and it not only in charge of generating payload, but also consuming payload and generating a summary of incoming payload. The functionality depends on the mode.
The main FSM is in charge of generating the data that will be use as payload in the UDP packets, each payload packet has the following header.
struct payload_header {
bit[ 39: 0] packet_id;
bit[ 79: 40] local_timestamp;
bit[119: 80] total_number_packets;
}
This header is useful when measuring latency. There is a secondary FSM, which depending on the mode will generate a summary that it is populated to the collector. The summary stream has the following structure.
struct summary_struct {
bit[ 39: 0] packet_id;
bit[ 79: 40] tx_local_timestamp;
bit[119: 80] rx_local_timestamp;
}
This secondary FSM will assert tlast
in the summary stream either when the experiment ends, last payload packet was received or when a timeout is reached.
Offset | Name | Mode | Description |
---|---|---|---|
0x00 | crtl_signals | R/W | Kernel control signals, more info here |
0x10 | mode | R/W | Mode selector |
0x14 | outbound_dest | R/W | Set outbound TDEST |
0x18 | num_packets_lsb | R/W | Number of packets LSB |
0x1C | num_packets_msb | R/W | Number of packets MSB |
0x20 | num_beats | R/W | Number of transactions per piece of payload, the size the payload will be num_beats * 64-Byte. Max is 23 |
0x24 | tbwp | R/W | Clock ticks between two consecutive payload packets |
0x28 | reset_fsm | W | Reset internal FSMs, self clear |
0x2C | fsm_debug_info | R | Internal FSM state |
0x30 | reserved | - | Reserved for future use |
0x34 | out_traffic_cycles | R | Outbound traffic, number of cycles (64-bit) |
0x3C | out_traffic_bytes | R | Outbound traffic, number of bytes (64-bit) |
0x44 | out_traffic_packets | R | Outbound traffic, number of payload packets (64-bit) |
0x4C | in_traffic_cycles | R | Inbound traffic, number of cycles (64-bit) |
0x54 | in_traffic_bytes | R | Inbound traffic, number of bytes (64-bit) |
0x5C | in_traffic_packets | R | Inbound traffic, number of payload packets (64-bit) |
0x64 | summary_cycles | R | Summary traffic, number of cycles (64-bit) |
0x6C | summary_bytes | R | Summary traffic, number of bytes (64-bit) |
0x74 | summary_packets | R | Summary traffic, number of payload packets (64-bit) |
0x7C | debug_reset | W | Reset debug probes, self clear |
The collector is an HLS kernel. It should only be enabled when measuring latency. This kernels reads the summary generated by the payload generator and performs the following operation for each incoming packet rx_local_timestamp - tx_local_timestamp
, this result is the Round-Trip Time (RTT) for an individual packet (in clock cycles) and it is represented using 32-bit. The RTT result (32-bit value) for each packet is written to a local memory, in order to maximize throughput with global memory, 16 elements are arranged in a 512-bit vector. Once the local memory is full the results are copied to global memory, again this is done to maximize performance with global memory. The local memory is 512-bit width and 64 rows depth, so I can hold up to 1,024 RTT results.
When tlast
is asserted this kernel finishes and moves the local memory to global memory. The total number of received packets can be read from the received_packets
argument.
From Software, users should retrieve the results from global memory and multiply each individual result by the clock period in order to convert the RTT measurement to seconds. Statistics such as arithmetic mean, max, min standard deviation can also be computed. The companion notebooks shows how to do this.
Note: The clock frequency can vary between implementations and even at runtime. For this reason the conversion from clock cycles to seconds is perform in software.
Copyright© 2022 Xilinx