Hammerblade #101

Closed
amithmath opened this issue Sep 27, 2024 · 19 comments

Comments

@amithmath

I am trying to implement the hammerblade example on the pynqz2, ultra96v2, and vu47p, but the design runs out of resources. This issue was raised in #76, but even on the ultra96v2 it runs out of resources. Please let me know which board to use. Thanks.

@dpetrisko
Collaborator

It should definitely fit on vu47p. Perhaps the manycore config is set too large. Do you have a utilization report you can post?

@amithmath
Author

Following is the report:

ERROR: [DRC UTLZ-1] Resource utilization: LUT as Logic over-utilized in Top Level Design (This design requires more LUT as Logic cells than are available in the target device. This design requires 71984 of such cell types but only 70560 compatible sites are available in the target device. Please analyze your synthesis results and constraints to ensure the design is mapped to Xilinx primitives as expected. If so, please consider targeting a larger device. Please set tcl parameter "drc.disableLUTOverUtilError" to 1 to change this error to warning.)

@dpetrisko
Collaborator

Can you post the actual reports and not just the error? I would need to see the hierarchical breakdown to see where the LUTs are going.

@amithmath
Author

amithmath commented Sep 28, 2024

Copyright 1986-2022 Xilinx, Inc. All Rights Reserved.

| Tool Version : Vivado v.2022.1 (lin64) Build 3526262 Mon Apr 18 15:47:01 MDT 2022
| Date : Sat Sep 28 20:11:13 2024
| Host : amd64 running 64-bit CentOS Linux release 7.9.2009 (Core)
| Command : report_utilization -file hammerblade_bd_1_wrapper_utilization_synth.rpt -pb hammerblade_bd_1_wrapper_utilization_synth.pb
| Design : hammerblade_bd_1_wrapper
| Device : xczu3eg-sbva484-1-e
| Speed File : -1
| Design State : Synthesized

Utilization Design Information

Table of Contents

1. CLB Logic
1.1 Summary of Registers by Type
2. BLOCKRAM
3. ARITHMETIC
4. I/O
5. CLOCK
6. ADVANCED
7. CONFIGURATION
8. Primitives
9. Black Boxes
10. Instantiated Netlists

1. CLB Logic


+----------------------------+-------+-------+------------+-----------+--------+
| Site Type | Used | Fixed | Prohibited | Available | Util% |
+----------------------------+-------+-------+------------+-----------+--------+
| CLB LUTs* | 81591 | 0 | 0 | 70560 | 115.63 |
| LUT as Logic | 72479 | 0 | 0 | 70560 | 102.72 |
| LUT as Memory | 9112 | 0 | 0 | 28800 | 31.64 |
| LUT as Distributed RAM | 8933 | 0 | | | |
| LUT as Shift Register | 179 | 0 | | | |
| CLB Registers | 26938 | 0 | 0 | 141120 | 19.09 |
| Register as Flip Flop | 26669 | 0 | 0 | 141120 | 18.90 |
| Register as Latch | 269 | 0 | 0 | 141120 | 0.19 |
| CARRY8 | 1059 | 0 | 0 | 8820 | 12.01 |
| F7 Muxes | 591 | 0 | 0 | 35280 | 1.68 |
| F8 Muxes | 78 | 0 | 0 | 17640 | 0.44 |
| F9 Muxes | 0 | 0 | 0 | 8820 | 0.00 |
+----------------------------+-------+-------+------------+-----------+--------+

* Warning! The Final LUT count, after physical optimizations and full implementation, is typically lower. Run opt_design after synthesis, if not already completed, for a more realistic count.

1.1 Summary of Registers by Type

+-------+--------------+-------------+--------------+
| Total | Clock Enable | Synchronous | Asynchronous |
+-------+--------------+-------------+--------------+
| 0 | _ | - | - |
| 0 | _ | - | Set |
| 0 | _ | - | Reset |
| 0 | _ | Set | - |
| 0 | _ | Reset | - |
| 0 | Yes | - | - |
| 0 | Yes | - | Set |
| 395 | Yes | - | Reset |
| 1140 | Yes | Set | - |
| 25403 | Yes | Reset | - |
+-------+--------------+-------------+--------------+

2. BLOCKRAM

+-------------------+------+-------+------------+-----------+-------+
| Site Type | Used | Fixed | Prohibited | Available | Util% |
+-------------------+------+-------+------------+-----------+-------+
| Block RAM Tile | 40.5 | 0 | 0 | 216 | 18.75 |
| RAMB36/FIFO* | 38 | 0 | 0 | 216 | 17.59 |
| RAMB36E2 only | 38 | | | | |
| RAMB18 | 5 | 0 | 0 | 432 | 1.16 |
| RAMB18E2 only | 5 | | | | |
+-------------------+------+-------+------------+-----------+-------+

* Note: Each Block RAM Tile only has one FIFO logic available and therefore can accommodate only one FIFO36E2 or one FIFO18E2. However, if a FIFO18E2 occupies a Block RAM Tile, that tile can still accommodate a RAMB18E2

3. ARITHMETIC

+----------------+------+-------+------------+-----------+-------+
| Site Type | Used | Fixed | Prohibited | Available | Util% |
+----------------+------+-------+------------+-----------+-------+
| DSPs | 19 | 0 | 0 | 360 | 5.28 |
| DSP48E2 only | 19 | | | | |
+----------------+------+-------+------------+-----------+-------+

4. I/O

+------------+------+-------+------------+-----------+-------+
| Site Type | Used | Fixed | Prohibited | Available | Util% |
+------------+------+-------+------------+-----------+-------+
| Bonded IOB | 0 | 0 | 0 | 82 | 0.00 |
+------------+------+-------+------------+-----------+-------+

5. CLOCK

+----------------------+------+-------+------------+-----------+-------+
| Site Type | Used | Fixed | Prohibited | Available | Util% |
+----------------------+------+-------+------------+-----------+-------+
| GLOBAL CLOCK BUFFERs | 9 | 0 | 0 | 196 | 4.59 |
| BUFGCE | 2 | 0 | 0 | 88 | 2.27 |
| BUFGCE_DIV | 0 | 0 | 0 | 12 | 0.00 |
| BUFG_PS | 1 | 0 | 0 | 72 | 1.39 |
| BUFGCTRL* | 3 | 0 | 0 | 24 | 12.50 |
| PLL | 0 | 0 | 0 | 6 | 0.00 |
| MMCM | 1 | 0 | 0 | 3 | 33.33 |
+----------------------+------+-------+------------+-----------+-------+

* Note: Each used BUFGCTRL counts as two GLOBAL CLOCK BUFFERs. This table does not include global clocking resources, only buffer cell usage. See the Clock Utilization Report (report_clock_utilization) for detailed accounting of global clocking resource availability.

6. ADVANCED

+-----------+------+-------+------------+-----------+--------+
| Site Type | Used | Fixed | Prohibited | Available | Util% |
+-----------+------+-------+------------+-----------+--------+
| PS8 | 1 | 0 | 0 | 1 | 100.00 |
| SYSMONE4 | 0 | 0 | 0 | 1 | 0.00 |
+-----------+------+-------+------------+-----------+--------+

7. CONFIGURATION

+-------------+------+-------+------------+-----------+-------+
| Site Type | Used | Fixed | Prohibited | Available | Util% |
+-------------+------+-------+------------+-----------+-------+
| BSCANE2 | 0 | 0 | 0 | 4 | 0.00 |
| DNA_PORTE2 | 0 | 0 | 0 | 1 | 0.00 |
| EFUSE_USR | 0 | 0 | 0 | 1 | 0.00 |
| FRAME_ECCE4 | 0 | 0 | 0 | 1 | 0.00 |
| ICAPE3 | 0 | 0 | 0 | 2 | 0.00 |
| MASTER_JTAG | 0 | 0 | 0 | 1 | 0.00 |
| STARTUPE3 | 0 | 0 | 0 | 1 | 0.00 |
+-------------+------+-------+------------+-----------+-------+

8. Primitives

+------------+-------+---------------------+
| Ref Name | Used | Functional Category |
+------------+-------+---------------------+
| LUT6 | 36979 | CLB |
| FDRE | 25403 | Register |
| LUT5 | 14887 | CLB |
| RAMD32 | 13976 | CLB |
| LUT4 | 13086 | CLB |
| LUT3 | 10708 | CLB |
| LUT2 | 7359 | CLB |
| RAMS32 | 2078 | CLB |
| LUT1 | 1278 | CLB |
| FDSE | 1140 | Register |
| CARRY8 | 1059 | CLB |
| RAMD64E | 868 | CLB |
| MUXF7 | 591 | CLB |
| LDCE | 269 | Register |
| FDCE | 126 | Register |
| SRL16E | 111 | CLB |
| MUXF8 | 78 | CLB |
| SRLC32E | 68 | CLB |
| RAMB36E2 | 38 | BLOCKRAM |
| RAMS64E | 35 | CLB |
| DSP48E2 | 19 | Arithmetic |
| RAMB18E2 | 5 | BLOCKRAM |
| BUFGCTRL | 3 | Clock |
| BUFGCE | 2 | Clock |
| PS8 | 1 | Advanced |
| MMCME4_ADV | 1 | Clock |
| BUFG_PS | 1 | Clock |
+------------+-------+---------------------+

9. Black Boxes

+----------+------+
| Ref Name | Used |
+----------+------+

10. Instantiated Netlists

+----------+------+
| Ref Name | Used |
+----------+------+

@amithmath
Author

amithmath commented Sep 29, 2024

The report above is from the ultra96v2. There is no vps_zynq_bd.vu47p.tcl in https://github.com/black-parrot-hdk/zynq-parrot/tree/master/cosim/tcl/bd

@dpetrisko
Collaborator

Yes, so it should fit on the vu47p. There is no vps configuration file because the vu47p is not a Zynq part (no PS).

For the vu47p, we use a UART bridge to emulate the PS. We haven’t open-sourced a hardware configuration, as it’s a fairly custom solution, but you can see the connection here: https://github.com/black-parrot-hdk/zynq-parrot/blob/master/cosim/xdc/board.vu47p.xdc and the cosimulation here: https://github.com/black-parrot-hdk/zynq-parrot/blob/master/cosim/include/bridge/bsg_zynq_pl.h

For the ultra96v2, that report indicates the design is very close to fitting (LUT as Logic is at 102.72%). Reducing the sizes of the structures in the BlackParrot cores may get you there. Take a look at the TinyParrot configuration in the aviary and experiment with reducing the branch predictors and caches.

@amithmath
Author

Is it possible to port the hardware to the Alveo U250 data center card (https://www.amd.com/en/products/accelerators/alveo/u250/a-u250-a64g-pq-g.html)? If so, what changes are needed? Also, the bsg_manycore accelerator is 32-bit; can it be changed to 64-bit? If so, what changes are needed?

@dpetrisko
Collaborator

These are both very substantial projects.

The U250 has a Pynq port, so beginning there and working through the cosim examples is the way to start. Once cosim is working, the hardware examples should port in a straightforward manner.

There was a student who ported the manycore toolchain to 64b:
bespoke-silicon-group/bsg_manycore#720

The hardware would require more changes, but primarily in parameterization. The actual RV64I ISA difference is minimal, especially if only F support is needed.

Both projects would likely require a highly motivated student for two or more quarters. Feel free to reach out to discuss funding for these efforts.

@amithmath
Author

amithmath commented Sep 30, 2024

Thanks, let me see. I am running the VCS simulation from /home/ynq-parrot/cosim/hammerblade-example/vcs and getting the following errors:

"/home/zynq-parrot/cosim/v/bsg_zynq_pl_shell.sv", 405: bsg_nonsynth_zynq_testbench.dut.top_fpga_inst.zps.pl_to_ps[0].unnamed$$_0: started at 0ps failed at 0ps
Offending '(((~S_AXI_ARESETN) | (~slv_rd_sel_one_hot[(num_regs_ps_to_pl_p + 0)])) | pl_to_ps_fifo_valid_lo[0])'
Error: "/home/sonal/ViBram/zynq-parrot/cosim/v/bsg_zynq_pl_shell.sv", 405: bsg_nonsynth_zynq_testbench.dut.top_fpga_inst.zps.pl_to_ps[0].unnamed$$_0: at time 0 ps
read from empty fifo
"/home/zynq-parrot/cosim/v/bsg_zynq_pl_shell.sv", 405: bsg_nonsynth_zynq_testbench.dut.top_fpga_inst.zps.pl_to_ps[1].unnamed$$_0: started at 0ps failed at 0ps
Offending '(((~S_AXI_ARESETN) | (~slv_rd_sel_one_hot[(num_regs_ps_to_pl_p + 1)])) | pl_to_ps_fifo_valid_lo[1])'
Error: "/home/zynq-parrot/cosim/v/bsg_zynq_pl_shell.sv", 405: bsg_nonsynth_zynq_testbench.dut.top_fpga_inst.zps.pl_to_ps[1].unnamed$$_0: at time 0 ps
read from empty fifo

bsg_tag_master transitioning to error state; be sure to run gate-level netlist to avoid sim/synth mismatch (bsg_nonsynth_zynq_testbench.dut.top_fpga_inst.master)
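
To make the "read from empty fifo" failures easier to reason about, here is a minimal Python sketch of the boolean expression printed in the "Offending" lines. It is a two-state model only: it does not model X/Z values, which may well be what the simulator sees at 0 ps before reset has taken effect (an assumption, not something confirmed in this thread). The function names are illustrative.

```python
from itertools import product


def read_assert_holds(s_axi_aresetn: bool, rd_sel: bool, fifo_valid: bool) -> bool:
    """Mirror of the logged expression: (~S_AXI_ARESETN) | (~rd_sel) | fifo_valid."""
    return (not s_axi_aresetn) or (not rd_sel) or fifo_valid


def failing_cases():
    """Enumerate every two-state input combination for which the assertion fails."""
    return [combo for combo in product([False, True], repeat=3)
            if not read_assert_holds(*combo)]


if __name__ == "__main__":
    for aresetn, rd_sel, valid in failing_cases():
        print(f"fails when S_AXI_ARESETN={aresetn}, rd_sel={rd_sel}, fifo_valid={valid}")
```

In this model the assertion fails in exactly one case: reset is released (S_AXI_ARESETN high), the read-select bit for that FIFO is set, and the FIFO has no valid data.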

@amithmath
Author

amithmath commented Oct 3, 2024

Please help, I am getting these errors in VCS:

BSG ERROR (bsg_nonsynth_zynq_testbench.axil4.wready_gpio): final block executed before fini() was called
Fatal: "/home/sonal/ViBram/zynq-parrot/import/basejump_stl/bsg_test/bsg_nonsynth_dpi_gpio.sv", 64: bsg_nonsynth_zynq_testbench.axil4.bresp_gpio: at time 56425001 ps
BSG ERROR (bsg_nonsynth_zynq_testbench.axil4.bresp_gpio): final block executed before fini() was called
Fatal: "/home/sonal/ViBram/zynq-parrot/import/basejump_stl/bsg_test/bsg_nonsynth_dpi_gpio.sv", 64: bsg_nonsynth_zynq_testbench.axil4.bvalid_gpio: at time 56425001 ps
BSG ERROR (bsg_nonsynth_zynq_testbench.axil4.bvalid_gpio): final block executed before fini() was called
Fatal: "/home/sonal/ViBram/zynq-parrot/import/basejump_stl/bsg_test/bsg_nonsynth_dpi_gpio.sv", 64: bsg_nonsynth_zynq_testbench.axil4.bready_gpio: at time 56425001 ps
BSG ERROR (bsg_nonsynth_zynq_testbench.axil4.bready_gpio): final block executed before fini() was called
Fatal: "/home/sonal/ViBram/zynq-parrot/import/basejump_stl/bsg_test/bsg_nonsynth_dpi_gpio.sv", 64: bsg_nonsynth_zynq_testbench.axil4.araddr_gpio: at time 56425001 ps
BSG ERROR (bsg_nonsynth_zynq_testbench.axil4.araddr_gpio): final block executed before fini() was called
Fatal: "/home/sonal/ViBram/zynq-parrot/import/basejump_stl/bsg_test/bsg_nonsynth_dpi_gpio.sv", 64: bsg_nonsynth_zynq_testbench.axil4.arprot_gpio: at time 56425001 ps
BSG ERROR (bsg_nonsynth_zynq_testbench.axil4.arprot_gpio): final block executed before fini() was called
Fatal: "/home/sonal/ViBram/zynq-parrot/import/basejump_stl/bsg_test/bsg_nonsynth_dpi_gpio.sv", 64: bsg_nonsynth_zynq_testbench.axil4.arvalid_gpio: at time 56425001 ps
BSG ERROR (bsg_nonsynth_zynq_testbench.axil4.arvalid_gpio): final block executed before fini() was called
Fatal: "/home/sonal/ViBram/zynq-parrot/import/basejump_stl/bsg_test/bsg_nonsynth_dpi_gpio.sv", 64: bsg_nonsynth_zynq_testbench.axil4.arready_gpio: at time 56425001 ps
BSG ERROR (bsg_nonsynth_zynq_testbench.axil4.arready_gpio): final block executed before fini() was called
Fatal: "/home/sonal/ViBram/zynq-parrot/import/basejump_stl/bsg_test/bsg_nonsynth_dpi_gpio.sv", 64: bsg_nonsynth_zynq_testbench.axil4.rdata_gpio: at time 56425001 ps
BSG ERROR (bsg_nonsynth_zynq_testbench.axil4.rdata_gpio): final block executed before fini() was called
Fatal: "/home/sonal/ViBram/zynq-parrot/import/basejump_stl/bsg_test/bsg_nonsynth_dpi_gpio.sv", 64: bsg_nonsynth_zynq_testbench.axil4.rresp_gpio: at time 56425001 ps
BSG ERROR (bsg_nonsynth_zynq_testbench.axil4.rresp_gpio): final block executed before fini() was called
Fatal: "/home/sonal/ViBram/zynq-parrot/import/basejump_stl/bsg_test/bsg_nonsynth_dpi_gpio.sv", 64: bsg_nonsynth_zynq_testbench.axil4.rvalid_gpio: at time 56425001 ps
BSG ERROR (bsg_nonsynth_zynq_testbench.axil4.rvalid_gpio): final block executed before fini() was called
Fatal: "/home/sonal/ViBram/zynq-parrot/import/basejump_stl/bsg_test/bsg_nonsynth_dpi_gpio.sv", 64: bsg_nonsynth_zynq_testbench.axil4.rready_gpio: at time 56425001 ps
BSG ERROR (bsg_nonsynth_zynq_testbench.axil4.rready_gpio): final block executed before fini() was called
V C S S i m u l a t i o n R e p o r t
Time: 56425001 ps
CPU Time: 1.510 seconds; Data structure size: 4.5Mb
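
Long logs like the one above are easier to triage once the repeated lines are collapsed. Here is a minimal Python sketch that groups `BSG ERROR` lines by message and lists the affected leaf instances. The regular expression is written against the line format shown in this log only, and `summarize` and `SAMPLE_LOG` are illustrative names.

```python
import re
from collections import defaultdict

# Three lines copied from the log above.
SAMPLE_LOG = """\
BSG ERROR (bsg_nonsynth_zynq_testbench.axil4.wready_gpio): final block executed before fini() was called
Fatal: "/home/sonal/ViBram/zynq-parrot/import/basejump_stl/bsg_test/bsg_nonsynth_dpi_gpio.sv", 64: bsg_nonsynth_zynq_testbench.axil4.bresp_gpio: at time 56425001 ps
BSG ERROR (bsg_nonsynth_zynq_testbench.axil4.bresp_gpio): final block executed before fini() was called
"""

ERROR_RE = re.compile(r"^BSG ERROR \((?P<inst>[^)]+)\): (?P<msg>.+)$")


def summarize(log_text):
    """Group BSG ERROR lines by message; map each message to leaf instance names."""
    groups = defaultdict(list)
    for line in log_text.splitlines():
        match = ERROR_RE.match(line.strip())
        if match:
            leaf = match.group("inst").rsplit(".", 1)[-1]
            groups[match.group("msg")].append(leaf)
    return dict(groups)


if __name__ == "__main__":
    for msg, instances in summarize(SAMPLE_LOG).items():
        print(f"{len(instances)} x {msg}: {', '.join(instances)}")
```

Every error in the pasted log carries the same message and the same timestamp (56425001 ps), which suggests a single event (the simulation ending before `fini()` was called) rather than a separate fault on each GPIO.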

@amithmath
Author

Can you please point me to the file and line number where I can experiment with the aviary by reducing the branch predictors and caches?

@amithmath
Author

I have a question out of curiosity: why did you not use the MicroBlaze soft core in the project?

@dpetrisko
Collaborator

dpetrisko commented Jan 18, 2025

MicroBlaze RISC-V was released in 2024, whereas BlackParrot has been around since 2018, when this project was started.

MicroBlaze cannot be taped out; per its license, it can only be used in Xilinx FPGAs. This project is an ASIC prototype platform. There are currently one academic and two commercial tapeouts of BlackParrot+Manycore HammerBlades that I am aware of, and all were done with zero licensing cost for this IP.

@amithmath
Author

My question was: instead of using the Zynq processor as the PS, could you have used MicroBlaze as the PS? I think MicroBlaze is compatible with most boards.

@amithmath
Author

amithmath commented Jan 20, 2025

I am curious about the question above. Can you please clarify?

Many Thanks

@dpetrisko
Collaborator

Keeping in mind our goal to SSH into the PS and compile the co-emulation code on it, there are two primary reasons that we chose to use the hard core:

  • We started this project with Z2/U96 boards, using Pynq for bitstream management and peripheral access. On zynq boards, the peripherals are hardwired to the PS so using a microblaze would involve many extra steps to communicate.

  • Using a Linux-capable microblaze involves building and managing a petalinux image, which is not trivial. I’ve seen several companies hire consultants to manage this process, whereas Pynq publishes ready-to-go PetaLinux images for Z2 and U96 boards

By using Z2 boards in the way that we do, students can simply buy a board or ssh into a cluster and be ready to go immediately.

That said, Alveo is a non-Zynq board that has Pynq compatibility. However, I believe the “host code” runs entirely on the x86 and not on a MicroBlaze. I haven’t looked too closely into it, though.

@amithmath
Author

Thanks for the clarification. Yes, you are right: on Alveo you can run the host code on the x86, but the PCIe driver is giving errors. Therefore, we are planning to use the MicroBlaze soft IP core as the PS.

@dpetrisko
Collaborator

Yeah, PCIe is a nightmare to debug, which is why we try to do as much as possible over Ethernet. Well, if you get a working PetaLinux build for MicroBlaze+SSH, please open-source it and send a PR. I'd certainly be interested in seeing if the infrastructure can support non-Pynq boards.

@amithmath
Author

Sure. I will let you know if we get it working on MicroBlaze...
