Commit

AIE - design 02,03,04,13: update doc and description file for 2021.2 (#71)

* update for 21.2

* update 13-performance-nofifo-hang to not print expected ERROR string

* fix issue in sw/host.cpp in 13-performance-ssfifo

* update sw/host.cpp in 13-performance-dmafifo-opt

* update 02-gmio to 2021.2 AIE API & programming model

* update perf_profile_aie_gmio.md

* update 03-rtp for 2021.2

* update 02-gmio README.md

* update for 21.2 AIE API & programming model

* update 02,03,04 description.json

* update 13 description.json

* add ROOTFS to description.json

* updae description.json

* update description.json

* update 02 aie/graph.cpp

* update description.json

* Techpubs edits

Co-authored-by: brucey <[email protected]>
Co-authored-by: Neha <[email protected]>
3 people authored and GitHub Enterprise committed Nov 3, 2021
1 parent 09de0dc commit 47a9f00
Showing 19 changed files with 52 additions and 45 deletions.
12 changes: 3 additions & 9 deletions AI_Engine_Development/Feature_Tutorials/02-using-gmio/README.md
@@ -11,21 +11,15 @@
## Introduction
A GMIO port attribute is used to make external memory-mapped connections to or from the global memory. These connections are made between AI Engine kernels or programmable logic kernels and the logical global memory ports of a hardware platform design. This tutorial is designed to demonstrate how to work with the GMIO interface in AI Engine simulator and hardware flows.

**IMPORTANT**: Before beginning the tutorial make sure you have read and followed the *Vitis Software Platform Release Notes* (v2021.1) for setting up software and installing the VCK190 base platform.
**IMPORTANT**: Before beginning the tutorial make sure you have read and followed the *Vitis Software Platform Release Notes* (v2021.2) for setting up software and installing the VCK190 base platform.

Before starting this tutorial, run the following steps:

1. Set up your platform by running the `xilinx-versal-common-v2021.2/environment-setup-cortexa72-cortexa53-xilinx-linux` script as provided in the platform download. This script sets up the `SYSROOT` and `CXX` variables. If the script is not present, you **must** run the `xilinx-versal-common-v2021.2/sdk.sh`.
2. Set up your ROOTFS to point to the xilinx-versal-common-v2021.2/rootfs.ext4
3. Set up your IMAGE to point to xilinx-versal-common-v2021.2/Image.
4. Set up your `PLATFORM_REPO_PATHS` environment variable based upon where you downloaded the platform.

This tutorial targets the VCK190 ES board (see https://www.xilinx.com/products/boards-and-kits/vck190.html). This board is currently available via early access. If you have already purchased this board, download the necessary files from the lounge and ensure you have the correct licenses installed. If you do not have a board and ES license please contact your Xilinx sales contact.

To target the VCK190 production board, modify `PLATFORM` variable in the `Makefile`(s) to:

PLATFORM = ${PLATFORM_REPO_PATHS}/xilinx_vck190_base_202120_1/xilinx_vck190_base_202120_1.xpfm

## Objectives
After completing this tutorial, you will be able to:
* Understand the programming model and software programmability of the AI Engine GMIO.
@@ -34,7 +28,7 @@ After completing this tutorial, you will be able to:
* Measure the NOC bandwidth and make trade-offs between GMIO and PLIO.

## Steps
__Note:__ This tutorial assumes that the user has basic understanding of Adaptive Data Flow (ADF) API and Xilinx® Runtime (XRT) API usage. For more information about ADF API and XRT usage, refer to AI Engine Runtime Parameter Reconfiguration Tutorial and Versal ACAP AI Engine Programming Environment User Guide (UG1076).
__Note:__ This tutorial assumes that the user has a basic understanding of the Adaptive Data Flow (ADF) API and Xilinx® Runtime (XRT) API usage. For more information about ADF API and XRT usage, refer to the AI Engine Runtime Parameter Reconfiguration Tutorial and the *Versal ACAP AI Engine Programming Environment User Guide* ([UG1076](https://docs.xilinx.com/access/sources/dita/map?Doc_Version=2021.2%20English&url=ug1076-ai-engine-environment)).

**Step 1 - AI Engine GMIO**: Introduces the programming model of AI Engine GMIO, including blocking and non-blocking GMIO transactions. See details in [AIE GMIO Programming Model](./single_aie_gmio.md).

@@ -9,12 +9,10 @@
</table>

## AI Engine GMIO Performance Profile
This tutorial targets the VCK190 ES board (see https://www.xilinx.com/products/boards-and-kits/vck190.html). This board is currently available via early access. If you have already purchased this board, download the necessary files from the lounge and ensure you have the correct licenses installed. If you do not have a board and ES license, contact your Xilinx sales contact.

AI Engine tools support mapping a GMIO port to a tile DMA channel one-to-one; mapping multiple GMIO ports to one tile DMA channel is not supported. There is a limit on the number of GMIO ports supported for a given device. For example, the XCVC1902 device on the VCK190 board has 16 AI Engine to NoC master units (NMUs) in total. Each NMU supports two MM2S and two S2MM channels. Hence, a maximum of 32 AI Engine GMIO inputs and 32 AI Engine GMIO outputs can be supported, though this number can be further limited by the existing hardware platform.
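The 32-port figure follows directly from this channel arithmetic; a minimal sanity-check sketch (the constant names are ours, the values are those quoted above for the XCVC1902):

```cpp
// XCVC1902 figures quoted above: 16 AI Engine to NoC master units (NMUs),
// each supporting two MM2S (input) and two S2MM (output) channels.
constexpr int kNmuCount   = 16;
constexpr int kMm2sPerNmu = 2;  // memory-to-stream: feeds GMIO inputs
constexpr int kS2mmPerNmu = 2;  // stream-to-memory: drains GMIO outputs

constexpr int kMaxGmioInputs  = kNmuCount * kMm2sPerNmu;   // 32
constexpr int kMaxGmioOutputs = kNmuCount * kS2mmPerNmu;   // 32
```

Remember that the hardware platform in use may expose fewer than these architectural maximums.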

In this example, we will utilize 32 AI Engine GMIO inputs, 32 AI Engine GMIO outputs in the graph, and profile the performance from one input and one output to 32 inputs and 32 outputs through various ways. Then you will learn about the NOC bandwidth and the advantages and disadvantages of choosing GMIO for data transfer.
In this example, 32 AI Engine GMIO inputs and 32 AI Engine GMIO outputs are utilized in the graph, and the performance is profiled in various ways, scaling from one input and one output up to 32 inputs and 32 outputs. You will then learn about the NoC bandwidth and the advantages and disadvantages of choosing GMIO for data transfer.

### Design Introduction
This design has a graph that has 32 AI Engine kernels. Each kernel has one input and one output. Thus, 32 AI Engine GMIO inputs and 32 AI Engine GMIO outputs are connected to the graph.
@@ -27,18 +25,18 @@ Change the working directory to `perf_profile_aie_gmio`. Take a look at the grap
{
private:
adf::kernel k[32];

public:
adf::input_gmio gmioIn[32];
adf::output_gmio gmioOut[32];

mygraph()
{
for(int i=0;i<32;i++){
gmioIn[i]=adf::input_gmio::create("gmioIn"+std::to_string(i),/*size_t burst_length*/256,/*size_t bandwidth*/100);
gmioOut[i]=adf::output_gmio::create("gmioOut"+std::to_string(i),/*size_t burst_length*/256,/*size_t bandwidth*/100);
k[i] = adf::kernel::create(vec_incr);
adf::connect<adf::window<1024>>(gmioIn[i].out[0], k[i].in[0]);
adf::connect<adf::window<1032>>(k[i].out[0], gmioOut[i].in[0]);
adf::source(k[i]) = "vec_incr.cc";
adf::runtime<adf::ratio>(k[i])= 1;
@@ -47,14 +45,14 @@ Change the working directory to `perf_profile_aie_gmio`. Take a look at the grap
};
};

In the code above, there are location constraints `adf::location` for each kernel. This is to save time for `aiecompiler`. Note that each kernel has an input window size of 1024 bytes and output window size of 1032 bytes.
In the previous code, there are location constraints `adf::location` for each kernel. This is to save time for `aiecompiler`. Note that each kernel has an input window size of 1024 bytes and output window size of 1032 bytes.

Next, examine the kernel code `aie/vec_incr.cc`. It increments each int32 input by one and additionally outputs the cycle counter of the AI Engine tile. As described later, this counter can be used to calculate the system throughput.

#include <aie_api/aie.hpp>
#include <aie_api/aie_adf.hpp>
#include <aie_api/utils.hpp>

void vec_incr(input_window<int32>* data,output_window<int32>* out){
aie::vector<int32,16> vec1=aie::broadcast<int32>(1);
for(int i=0;i<16;i++)
@@ -70,7 +68,7 @@ Next, examine the kernel code `aie/vec_incr.cc`. It adds each int32 input by one
window_writeincr(out,time);
}
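As a plain C++ model of what the vectorized kernel computes (the function name and the externally supplied cycle counter are our assumptions; the real kernel reads the tile counter itself and processes the data with 16-lane vectors):

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Scalar model of vec_incr: increment each int32 input by one, then append
// the tile cycle counter. With a 1024-byte input window (256 int32 values),
// the output is 1024 + 8 = 1032 bytes, matching the window sizes in graph.h.
std::pair<std::vector<int32_t>, uint64_t>
vec_incr_model(const std::vector<int32_t>& in, uint64_t cycle_counter) {
    std::vector<int32_t> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = in[i] + 1;  // the real kernel does this 16 lanes at a time
    return {out, cycle_counter};
}
```

The appended counter is why the output window is 1032 bytes rather than 1024.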

Next, examine the host code `aie/graph.cpp`. The concepts introduced in [AIE GMIO Programming Model](./single_aie_gmio.md) apply here. We will focus on the new concepts and how to do performance profiling. Some constants defined in the code are as follows:
Next, examine the host code `aie/graph.cpp`. The concepts introduced in [AIE GMIO Programming Model](./single_aie_gmio.md) apply here. This section explains new concepts and how performance profiling is done. Some constants defined in the code are as follows:

#if !defined(__AIESIM__) && !defined(__X86SIM__) && !defined(__ADF_FRONTEND__)
const int ITERATION=512;
@@ -123,7 +121,7 @@ In the main function, the PS code is going to profile `num` GMIO inputs and outp
}

### Performance Profiling Methods
In this example, we will introduce some methods for profiling the design. The code to be profiled is in `aie/graph.cpp`:
In this example, some methods for profiling the design are introduced. The code to be profiled is in `aie/graph.cpp`:

//Profile starts here
for(int i=0;i<num;i++){
@@ -194,7 +192,7 @@ The output in hardware is as follows:

2. Profile by AI Engine cycles got from AI Engine kernels

In this design, we output the AI Engine cycles at the end of each iteration. Each iteration produces 256 int32 data, plus a long long AI Engine cycle counter number. We record the very beginning cycle and the last cycle of all the AI Engine kernels to be profiled because multiple AI Engine kernels start at different cycles though they are enabled by the same `graph::run`. Thus, we can calculate the system throughput for all the kernels.
In this design, the AI Engine cycle count is output at the end of each iteration. Each iteration produces 256 int32 data values, plus a long long AI Engine cycle counter value. The very first cycle and the last cycle across all the profiled AI Engine kernels are recorded because multiple AI Engine kernels start at different cycles even though they are enabled by the same `graph::run`. Thus, the system throughput for all the kernels can be calculated.

Note that there is some gap between the actual performance and the calculated number because there is already some data transfer before the recorded starting cycle. However, the overhead is negligible when the total iteration number is high, which is 512 in this example.
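The cycle-to-throughput conversion described above can be sketched as follows (the helper name and the 1 GHz default clock are our assumptions; substitute the AI Engine clock frequency of your device):

```cpp
#include <cstdint>

// Convert recorded AI Engine cycle counts into throughput in MB/s.
// bytes_moved: total bytes produced by all profiled kernels.
// first_cycle: earliest start cycle recorded across the kernels.
// last_cycle:  latest end cycle recorded across the kernels.
double throughput_mb_per_s(uint64_t bytes_moved, uint64_t first_cycle,
                           uint64_t last_cycle, double freq_hz = 1e9) {
    double seconds = static_cast<double>(last_cycle - first_cycle) / freq_hz;
    return static_cast<double>(bytes_moved) / seconds / 1e6;
}
```

For example, if one kernel's 512 iterations of 1032 output bytes span roughly 600,000 cycles at 1 GHz, this works out to roughly 880 MB/s for that kernel.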

@@ -235,7 +233,7 @@ The commands to build and run in hardware are the same as previously shown. The

3. Profile by event API

The AI Engine has hardware performance counters and can be configured to count hardware events for measuring performance metrics. The API used in this example is to profile graph throughput regarding the specific GMIO port. There may be confliction when multiple GMIO ports are used for event API because of the restriction that performance counter is shared between GMIO ports that access the same AI Engine-PL interface column. Thus, we only profile one GMIO output to show this methodology.
The AI Engine has hardware performance counters that can be configured to count hardware events for measuring performance metrics. The API used in this example profiles graph throughput with respect to a specific GMIO port. There can be a conflict when multiple GMIO ports are used with the event API because of the restriction that a performance counter is shared between GMIO ports that access the same AI Engine-PL interface column. Thus, only one GMIO output is profiled to show this methodology.

The code to start profiling is as follows:

@@ -259,13 +257,13 @@ The code to end profiling and calculate performance is as follows:

In this example, `event::start_profiling` is called to configure the AI Engine to count the clock cycles from the stream start event until the event indicating that `BLOCK_SIZE_out_Bytes` bytes have been transferred, assuming that the stream stops right after the specified number of bytes are transferred.
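The byte count passed to the profiling call can be derived from the iteration count and the per-iteration output size (256 int32 values plus one long long cycle counter); a sketch with constant names mirroring the tutorial's, where the 512-iteration figure is the hardware value quoted earlier:

```cpp
#include <cstdint>

// Each iteration writes 256 int32 samples plus one long long cycle counter,
// i.e. 1024 + 8 = 1032 bytes per iteration on this target.
constexpr uint64_t kIterations        = 512;
constexpr uint64_t kBytesPerIteration = 256 * sizeof(int32_t) + sizeof(long long);
constexpr uint64_t kBlockSizeOutBytes = kIterations * kBytesPerIteration;  // 528384
```

Dividing this byte count by the time corresponding to the cycle count returned when profiling stops yields the per-port throughput.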

For detailed usage about event API, refer to the *Versal ACAP AI Engine Programming Environment User Guide* (UG1076).
For detailed usage of the event API, refer to the *Versal ACAP AI Engine Programming Environment User Guide* ([UG1076](https://docs.xilinx.com/access/sources/dita/map?Doc_Version=2021.2%20English&url=ug1076-ai-engine-environment)).

The code is guarded by macro `__USE_EVENT_PROFILE__`. To use this method of profiling, define `__USE_EVENT_PROFILE__` for g++ cross compiler in `sw/Makefile`:

CXXFLAGS += -std=c++14 -D__USE_EVENT_PROFILE__ -I$(XILINX_HLS)/include/ -I${SDKTARGETSYSROOT}/usr/include/xrt/ -O0 -g -Wall -c -fmessage-length=0 --sysroot=${SDKTARGETSYSROOT} -I${XILINX_VITIS}/aietools/include ${HOST_INC}

The commands to build and run in hardware is same as previously shown. The output in hardware is as follows:
The commands to build and run in hardware are the same as previously shown. The output in hardware is as follows:

GMIO::malloc completed
total input/output num=1
Expand All @@ -284,7 +282,7 @@ The commands to build and run in hardware is same as previously shown. The outpu
GMIO::free completed
PASS!

It is seen that when used GMIO ports number increases, the performance for the specific GMIO port drops, indicating that the total system throughput is limited by NOC and DDR bandwidth.
As the number of GMIO ports in use increases, the performance of each individual GMIO port drops, indicating that the total system throughput is limited by the NoC and DDR memory bandwidth.

### Conclusion
In this tutorial, you learned about:
@@ -164,11 +164,12 @@ int main(int argc, char ** argv) {
std::cout<<"GMIO::free completed"<<std::endl;

if(error==0){
std::cout<<"PASS!"<<std::endl;
std::cout<<"TEST PASSED!"<<std::endl;
}else{
std::cout<<"ERROR!"<<std::endl;
std::cout<<"TEST FAILED!"<<std::endl;
}

gr.end();
#if !defined(__AIESIM__) && !defined(__X86SIM__) && !defined(__ADF_FRONTEND__)
xrtDeviceClose(dhdl);
#endif
@@ -35,6 +35,8 @@
"EDGE_COMMON_SW=${EDGE_COMMON_SW}",
"SYSROOT=${SYSROOT}",
"SDKTARGETSYSROOT=${SYSROOT}",
"ROOTFS=${EDGE_COMMON_SW}/rootfs.ext4",
"IMAGE=${EDGE_COMMON_SW}/Image",
"EMU_CMD=\\\"./launch_hw_emu.sh -run-app embedded_exec.sh\\\"",
"EMBEDDED_PACKAGE_OUT=./",
"EMBEDDED_EXEC_SCRIPT=./embedded_exec.sh"
@@ -15,26 +15,26 @@ This example introduces the AI Engine GMIO programming model. It includes three
* [Step 2 - Asynchronous GMIO transfer for Input and Synchronous GMIO transfer for Output](#Step-2---Asynchronous-GMIO-transfer-for-Input-and-Synchronous-GMIO-transfer-for-Output)
* [Step 3 - Asynchronous GMIO Transfer and Hardware flow](#Step-3---Asynchronous-GMIO-Transfer-and-Hardware-flow)

We will use AI Engine simulator event trace in each step to see how performance can be improved step by step. The last step introduces code to make GMIO work in hardware.
The AI Engine simulator event trace is used to see how performance can be improved step by step. The last step introduces code to make GMIO work in hardware.

### Step 1 - Synchronous GMIO Transfer
In this step, we introduce the synchronous GMIO transfer mode. Change the working directory to `single_aie_gmio/step1`. Looking at the graph code `aie/graph.h`, it can be seen that the design has one output `gmioOut` with type `output_gmio`, one input `gmioIn` with type `input_gmio`, and an AI Engine kernel `weighted_sum_with_margin`.
In this step, the synchronous GMIO transfer mode is introduced. Change the working directory to `single_aie_gmio/step1`. Looking at the graph code `aie/graph.h`, it can be seen that the design has one output `gmioOut` with type `output_gmio`, one input `gmioIn` with type `input_gmio`, and an AI Engine kernel `weighted_sum_with_margin`.

class mygraph: public adf::graph
{
private:
adf::kernel k_m;

public:
adf::output_gmio gmioOut;
adf::input_gmio gmioIn;

mygraph()
{
k_m = adf::kernel::create(weighted_sum_with_margin);
gmioOut = adf::output_gmio::create("gmioOut",64,1000);
gmioIn = adf::input_gmio::create("gmioIn",64,1000);

adf::connect<adf::window<1024,32>>(gmioIn.out[0], k_m.in[0]);
adf::connect<adf::window<1024>>(k_m.out[0], gmioOut.in[0]);
adf::source(k_m) = "weighted_sum.cc";
@@ -53,7 +53,7 @@ The GMIO ports `gmioIn` and `gmioOut` are created and connected as follows:
The GMIO instantiation `gmioIn` represents the DDR memory space to be read by the AI Engine, and `gmioOut` represents the DDR memory space to be written by the AI Engine. The create function specifies the logical name of the GMIO, the burst length of the memory-mapped AXI4 transaction (64, 128, or 256 bytes), and the required bandwidth in MB/s (here, 1000 MB/s).

Inside the main function of `aie/graph.cpp`, two 256-element int32 arrays (1024 bytes) are allocated by `GMIO::malloc`. The `dinArray` points to the memory space to be read by the AI Engine and the `doutArray` points to the memory space to be written by the AI Engine. In Linux, the vitual address passed to `GMIO::gm2aie_nb`, `GMIO::aie2gm_nb`, `GMIO::gm2aie`, and `GMIO::aie2gm` must be allocated by `GMIO::malloc`. After the input data is allocated, it can be initialized.
Inside the main function of `aie/graph.cpp`, two 256-element int32 arrays (1024 bytes) are allocated by `GMIO::malloc`. The `dinArray` points to the memory space to be read by the AI Engine and the `doutArray` points to the memory space to be written by the AI Engine. In Linux, the virtual address passed to `GMIO::gm2aie_nb`, `GMIO::aie2gm_nb`, `GMIO::gm2aie`, and `GMIO::aie2gm` must be allocated by `GMIO::malloc`. After the input data is allocated, it can be initialized.

int32* dinArray=(int32*)GMIO::malloc(BLOCK_SIZE_in_Bytes);
int32* doutArray=(int32*)GMIO::malloc(BLOCK_SIZE_in_Bytes);
@@ -72,7 +72,7 @@ The blocking transfer (`gmioIn.gm2aie`) has to be completed before `gr.run()` be

Because `GMIO::aie2gm()` is working in synchronous mode, the output processing can be done just after it is completed.

__Note:__ The memory is non-cachable for GMIO in Linux.
__Note:__ The memory is non-cacheable for GMIO in Linux.

In the example program, the design runs four iterations in a loop. In the loop, pre-processing and post-processing are done before and after data transfer.

@@ -41,11 +41,5 @@
{
"all": "aiesim"
},
"custom_board_target": "run_test",
"tasks": {
"build": {
"pre_exec": "./env_setup.sh"
}
}
}
}
@@ -41,11 +41,5 @@
{
"all": "run"
},
"custom_board_target": "run_test",
"tasks": {
"build": {
"pre_exec": "./env_setup.sh"
}
}
}
}
@@ -36,6 +36,8 @@
"EDGE_COMMON_SW=${EDGE_COMMON_SW}",
"SYSROOT=${SYSROOT}",
"SDKTARGETSYSROOT=${SYSROOT}",
"ROOTFS=${EDGE_COMMON_SW}/rootfs.ext4",
"IMAGE=${EDGE_COMMON_SW}/Image",
"EMU_CMD=\\\"./launch_hw_emu.sh -run-app embedded_exec.sh\\\"",
"EMBEDDED_PACKAGE_OUT=./",
"EMBEDDED_EXEC_SCRIPT=./embedded_exec.sh"
@@ -36,6 +36,8 @@
"EDGE_COMMON_SW=${EDGE_COMMON_SW}",
"SYSROOT=${SYSROOT}",
"SDKTARGETSYSROOT=${SYSROOT}",
"ROOTFS=${EDGE_COMMON_SW}/rootfs.ext4",
"IMAGE=${EDGE_COMMON_SW}/Image",
"EMU_CMD=\\\"./launch_hw_emu.sh -run-app embedded_exec.sh\\\"",
"EMBEDDED_PACKAGE_OUT=./",
"EMBEDDED_EXEC_SCRIPT=./embedded_exec.sh"
@@ -36,6 +36,8 @@
"EDGE_COMMON_SW=${EDGE_COMMON_SW}",
"SYSROOT=${SYSROOT}",
"SDKTARGETSYSROOT=${SYSROOT}",
"ROOTFS=${EDGE_COMMON_SW}/rootfs.ext4",
"IMAGE=${EDGE_COMMON_SW}/Image",
"EMU_CMD=\\\"./launch_hw_emu.sh -run-app embedded_exec.sh\\\"",
"EMBEDDED_PACKAGE_OUT=./",
"EMBEDDED_EXEC_SCRIPT=./embedded_exec.sh"
@@ -36,6 +36,8 @@
"EDGE_COMMON_SW=${EDGE_COMMON_SW}",
"SYSROOT=${SYSROOT}",
"SDKTARGETSYSROOT=${SYSROOT}",
"ROOTFS=${EDGE_COMMON_SW}/rootfs.ext4",
"IMAGE=${EDGE_COMMON_SW}/Image",
"EMU_CMD=\\\"./launch_hw_emu.sh -run-app embedded_exec.sh\\\"",
"EMBEDDED_PACKAGE_OUT=./",
"EMBEDDED_EXEC_SCRIPT=./embedded_exec.sh"
@@ -36,6 +36,8 @@
"EDGE_COMMON_SW=${EDGE_COMMON_SW}",
"SYSROOT=${SYSROOT}",
"SDKTARGETSYSROOT=${SYSROOT}",
"ROOTFS=${EDGE_COMMON_SW}/rootfs.ext4",
"IMAGE=${EDGE_COMMON_SW}/Image",
"EMU_CMD=\\\"./launch_hw_emu.sh -run-app embedded_exec.sh\\\"",
"EMBEDDED_PACKAGE_OUT=./",
"EMBEDDED_EXEC_SCRIPT=./embedded_exec.sh"
