Updated Performance test #844
Merged · 22 commits · Mar 22, 2024
Conversation

mikaelsky
Collaborator

Now with functional makefiles and tests

@mikaelsky
Collaborator Author

Created a "clone" of common.mk that makes the software test work.
In the local performance test directory the makefile runs across all tests, meaning make sim will run all the performance tests for all the test suites by default.

@stnolting
Owner

Hey @mikaelsky!

Thanks for sharing your CPI evaluation setup!

I'm still looking through your code. I do not quite understand why we need exclusive testbenches here. 🤔 Couldn't we just use the default one together with the existing sim target? To run the entire suite we could use a simple script that calls all the benchmarks' makefiles.

@stnolting added the SW (Software-related) label on Mar 11, 2024
@stnolting mentioned this pull request on Mar 11, 2024
@mikaelsky
Collaborator Author

The "exclusive" testbench was made to ensure that when we run the CPI test we only run with the core core and not all sorts of fun peripherals like cache and external memory. Basically this means you get just the core performance and not the "core+random peripheral that degrades performance" performance.
In theory this could be achieved by adding more variables to the simple test. But my thinking here was: lets keep the simple test simple :)

The change in common.mk, even though small, does allow for an easy addition of different testbenches. So e.g. let's say I want to make a testbench that has a new feature in it; I could make a new area "newFeatureTB" that contains my new feature setup. Then I can change the NEORV32_SIM_FOLDER in my local makefile to point to my new feature directory and hey presto, I can reuse all of the existing make and sim infrastructure while also testing my new feature :)

There is a simple script that runs all the benchmarks btw. "run_performance.sh".

I had to add the makefile in there as the GitHub pull request script runs through all the examples and calls make clean/sim on each of them. If that make target doesn't exist at that first hierarchy level, the pull request script fails out.
To make the pull request script pass I had to basically bang together a makefile whose only purpose is to run across a list of CPI tests and call the equivalent make target inside each of those directories.

The upside is you can now just call make sim from the performance_test directory and it will run all the performance tests automatically. So there is really no need for a simple script anymore.
I did change the default "all" target in each of the CPI tests to just be "rv32_all" so the makefile "just works".

@stnolting
Owner

The "exclusive" testbench was made to ensure that when we run the CPI test we only run with the core core and not all sorts of fun peripherals like cache and external memory. Basically this means you get just the core performance and not the "core+random peripheral that degrades performance" performance.

That's a good point.

However, the instruction timing description in https://stnolting.github.io/neorv32/#_instruction_sets_and_extensions already shows the architectural "best case" CPI values. These are of course heavily dependent on which ISA extensions are switched on and what the memory system looks like.

So, wouldn't it be better to use your benchmarks to profile a particular / application-specific setup for its average CPI values?

That's where we could stick to the default (simple) testbench: adjust all the configuration options (e.g. CPU extensions but also memory latencies) according to the real setup to get a full-featured simulation model that could then also be used to benchmark average CPI.

Don't get me wrong, I really like your proposal. I am just afraid of maintaining all these different testbench setups. 🙈

@stnolting added the DOC (Improvements or additions to documentation) and enhancement (New feature or request) labels on Mar 12, 2024
@mikaelsky
Collaborator Author

:best Hermione impression: "Actually..." ;) The documentation is outdated. The two motivations for the CPI test were to

  1. Update our internal simulation model with 'actual' CPI vs just a CPI of 1.
  2. To update the documentation as e.g. the Zfinx numbers are fairly dated, but I even see some of the I and M stuff being a tad off. With a CPI test suite one could imagine the documentation numbers could be automated.

I do agree the CPI should be measured on a given implementation. Which is what we are doing internally :)

The documentation should be theoretical best case, with the caveat that the theoretical best case is highly dependent on the execute engine state machine and all its cross dependencies. From there it can only go downhill based on the system implementer.

@stnolting
Owner

The documentation is outdated.

That's true (unfortunately) 🙈

Update our internal simulation model with 'actual' CPI vs just a CPI of 1.

Actually (😅), the optimum CPI (cycles per instruction) for the fastest instructions is 2.

To update the documentation as e.g. the Zfinx numbers are fairly dated, but I even see some of the I and M stuff being a tad off. With a CPI test suite one could imagine the documentation numbers could be automated.

Agreed!

But I think we should do this by hand for now (as we have no way to move up-to-date CI results to the documentation yet). So a local run with an adjusted configuration of the default testbench should be fine for this.

I do agree the CPI should be measured on a given implementation. Which is what we are doing internally :)

👍

The documentation should be theoretical best case, with the caveat that the theoretical best case is highly dependent on the execute engine state machine and all its cross dependencies. From there it can only go downhill based on the system implementer.

Right. We should add a note about this to the documentation.


Again, I like this simple-to-use setup: just run your proposed script and get all the interesting CPI stuff. But it feels oversized to have a separate testbench for this (I already find it difficult that we have a VUnit and a "boring normal" testbench). 🤔

But that's just my opinion. If this is something that can help people to easily benchmark their setups then I am fine with it. 😉

@mikaelsky
Collaborator Author

Oh, and I added a new feature to the testbench, which is used by the CPI testing system. If you look at the bottom I added a "finish" tester that looks at GPIO(32) assertion and, when that happens, calls std.env.finish, which stops the test from running and cleanly exits GHDL.

To use this I had to also update the GHDL call scripts to use VHDL-2008 vs. the default VHDL-92/93, as std.env is a VHDL-2008 thing.
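For illustration, a minimal sketch of what such a finish tester can look like inside the testbench (the signal names clk_gen and gpio_o are assumptions for this example, not necessarily the actual testbench signals):

-- illustrative sketch: end the simulation when the software asserts GPIO output 32
sim_finish: process(clk_gen)
begin
  if rising_edge(clk_gen) then
    if (gpio_o(32) = '1') then
      report "Finish request via GPIO(32) - ending simulation." severity note;
      std.env.finish; -- VHDL-2008 only: cleanly terminate the simulator
    end if;
  end if;
end process sim_finish;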

Beyond the above modifications to the GHDL calling scripts we would also need to add some parameters that we pass into GHDL for controlling features. Not sure how to achieve that with GHDL, though. Having e.g. a "turn off all the stuff" parameter would allow us to set the enabled core extensions and such in the simple testbench.

@stnolting
Owner

Beyond the above modifications to the GHDL calling scripts we would also need to add some parameters that we pass into GHDL for controlling features. Not sure how to achieve that with GHDL, though. Having e.g. a "turn off all the stuff" parameter would allow us to set the enabled core extensions and such in the simple testbench.

I really like that because that would be a very clean solution. We already do something like this in the RISCOF testbench:

entity neorv32_riscof_tb is
  generic (
    MEM_FILE : string;            -- memory initialization file
    MEM_SIZE : natural := 8*1024; -- total memory size in bytes
    RISCV_E  : boolean := false   -- embedded ISA extension
  );
end neorv32_riscof_tb;

https://github.com/stnolting/neorv32-riscof/blob/88f135e0f9bba7c57a1652460328f802168aa6a3/sim/neorv32_riscof_tb.vhd#L62-L68

There we have all the relevant configuration options exposed as generics (including default values) and we simply override them when calling GHDL, e.g. -gRISCV_E=true.

I would love to have something like this for the default testbench (wink 😉) but that would be a lot of coding effort... 🙈

Oh, and I added a new feature to the testbench, which is used by the CPI testing system. If you look at the bottom I added a "finish" tester that looks at GPIO(32) assertion and, when that happens, calls std.env.finish, which stops the test from running and cleanly exits GHDL.

This is a nice feature we are also using for the RISCOF testbench. Unfortunately, this is a VHDL-2008 feature and some users might have issues with this (e.g. #847), so I would suggest keeping this for the VUnit testbench only.

Anyway, I'm still not fully convinced that we should add another testbench here. The maintenance overhead is quite heavy already. 🙈 Can't we find a practical compromise? 🤔

@mikaelsky
Collaborator Author

I'll take a look at it; it should be okay to add a few extra generics that I can pass in via GHDL_PARAMS? For GHDL, I assume we are OK with me adding the VHDL-2008 option to the default script?

The one piece of good news is that Vivado xsim does indeed support some VHDL-2008 features. In particular it supports what I added, which is finish, assuming folks are on at least 2020.2, which in the year 2024 we can somewhat assume, or we can tell them to upgrade to at least 2020.2.
One note for Vivado/xsim: the latest 2023.2 version automatically detects the need for VHDL-2008; for earlier versions you might have to push a button / add -2008 on read-in.

https://docs.xilinx.com/r/2020.2-English/ug900-vivado-logic-simulation/Supported-Features


mikaelsky and others added 2 commits March 17, 2024 12:08
…e. Added generics to the simple testcase to be able to control performance features. Updated the GHDL scripts to accept more than 1 command line parameter. Updated the performance test makefiles to call the simple test bench with appropriate generics and GHDL settings
@mikaelsky
Collaborator Author

mikaelsky commented Mar 17, 2024

I've updated the performance test to use the simple testbench without VHDL-2008 features. I use a tuned timeout instead by setting the timeout value from the makefile. It's less flexible but works.
I had to update the GHDL scripts to pass down more than one command line parameter to be able to push down generics from the makefile. $1 just passes the first parameter down; $@ passes $1 through $N down instead.
The simple TB now has a few generics that set various settings related to measuring CPI; the default values match the existing behavior (a rough sketch is shown below).
I cleaned common.mk back to what it was and removed the benchmark testbench folder.
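For illustration only, such testbench generics could look roughly like this (the entity name, generic names and default values are made up for this sketch, not the exact ones added in this PR); they can then be overridden from the makefile via GHDL's -g option, e.g. -gPERF_ICACHE_EN=false:

entity neorv32_tb_simple is
  generic (
    -- illustrative names/defaults only --
    PERF_RISCV_M   : boolean := true;    -- enable the M extension for the CPI run
    PERF_ICACHE_EN : boolean := true;    -- enable the instruction cache
    PERF_TIMEOUT   : natural := 2000000  -- simulation timeout in clock cycles
  );
end entity neorv32_tb_simple;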

Should be a lot easier to maintain now :)

@stnolting
Owner

Hey @mikaelsky

I'll take a look at it; it should be okay to add a few extra generics that I can pass in via GHDL_PARAMS? For GHDL, I assume we are OK with me adding the VHDL-2008 option to the default script?

The one piece of good news is that Vivado xsim does indeed support some VHDL-2008 features. In particular it supports what I added, which is finish, assuming folks are on at least 2020.2, which in the year 2024 we can somewhat assume, or we can tell them to upgrade to at least 2020.2.

Well, you are right... The truth is that many people are still using older toolchains (like Xilinx ISE). 🙈 So I would suggest keeping VHDL-2008 out of the default simulation setup - at least for the "simple" testbench. The VUnit testbench requires VHDL-2008 anyway.

I've updated the performance test to use the simple testbench without VHDL-2008 features.

Oh, that's nice! I'll have a look at that!

I had to update the GHDL scripts to pass down more than

That should be just fine I think.

The simple TB now has a few generics that set various settings related to measuring CPI; the default values match the existing behavior.

Maybe we could add a matrix-style set of default configurations and use a single "ID" generic to select one specific configuration - just like the LiteX wrapper:

-- core complex configurations --
constant configs_c : configs_t := (
  --                minimal  lite   standard  full
  riscv_c      => ( false,   true,  true,     true ), -- RISC-V compressed instructions 'C'
  riscv_m      => ( false,   true,  true,     true ), -- RISC-V hardware mul/div 'M'
  riscv_u      => ( false,   false, true,     true ), -- RISC-V user mode 'U'
  riscv_zicntr => ( false,   false, true,     true ), -- RISC-V standard CPU counters 'Zicntr'
  riscv_zihpm  => ( false,   false, false,    true ), -- RISC-V hardware performance monitors 'Zihpm'
  fast_ops     => ( false,   false, true,     true ), -- use DSPs and barrel-shifters
  pmp_num      => ( 0,       0,     0,        8    ), -- number of PMP regions (0..16)
  hpm_num      => ( 0,       0,     0,        8    ), -- number of HPM counters (0..29)
  xcache_en    => ( false,   false, true,     true ), -- external bus cache enabled
  xcache_nb    => ( 0,       0,     32,       64   ), -- number of cache blocks (lines), power of two
  xcache_bs    => ( 0,       0,     32,       32   ), -- size of cache blocks (lines) in bytes, power of two
  mtime        => ( false,   true,  true,     true )  -- RISC-V machine system timer
);

🤔
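For illustration, a rough sketch of how a single CONFIG generic could select one column of such a matrix (the entity, type and field names below are made up for this example and only show two of the options):

-- illustrative sketch only: select one column of a LiteX-style configuration
-- matrix via a single CONFIG generic
entity neorv32_tb_config_sel is
  generic (
    CONFIG : string := "standard" -- "minimal", "lite", "standard" or "full"
  );
end entity neorv32_tb_config_sel;

architecture sim of neorv32_tb_config_sel is

  type bool_vec_t is array (0 to 3) of boolean;
  type natr_vec_t is array (0 to 3) of natural;
  type configs_t is record
    riscv_m : bool_vec_t;
    hpm_num : natr_vec_t;
  end record;

  --                  minimal  lite   standard  full
  constant configs_c : configs_t := (
    riscv_m =>      ( false,   true,  true,     true ),
    hpm_num =>      ( 0,       0,     0,        8    )
  );

  -- map the CONFIG string to a column index of the matrix
  function cfg_idx(sel : string) return natural is
  begin
    if    sel = "minimal"  then return 0;
    elsif sel = "lite"     then return 1;
    elsif sel = "standard" then return 2;
    else                        return 3; -- "full"
    end if;
  end function;

  constant id_c : natural := cfg_idx(CONFIG);

begin
  -- the DUT instantiation would then use entries like
  -- configs_c.riscv_m(id_c) and configs_c.hpm_num(id_c)
end architecture sim;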

Thanks again for all your work! :)

@stnolting
Owner

Btw, please revert neorv32_application_image.vhd to its original state.
It also happens to me all the time that I check in my local version... 🙈 😅

@mikaelsky
Collaborator Author

After a bit of a typo battle I've changed the performance options to be matrix style. Different from how I normally do this, but hey, it's a style thing :)
Should have undone the changes to the default application image 🤞

@stnolting
Owner

stnolting commented Mar 17, 2024

After a bit of a typo battle I've changed the performance options to be matrix style.

Looks really great!

Different from how I normally do this, but hey, it's a style thing :)

How would you have done this?


I think we have a simulation time-out issue right now because of the reworked cache system. Updating the cache sizes should fix this:

Upstream cache configuration: ✔️


Your current cache configuration: ❌


@mikaelsky
Collaborator Author

mikaelsky commented Mar 17, 2024

Got it :) Note to self: when GitHub is faster than your home server, it's time to upgrade :)
It's different styles. I typically spell it out with individual options. This allows DV engineers to control everything themselves vs. me having to go in and write/update testbenches for them :) It also makes it easier to do a boatload of sweeps.

As an example, our internal neorv32 regression suite runs ~520 randomized test runs every night that test the core in the design and the core by itself. The main "test" areas are riscv-dv-driven randomized instruction suites for the extensions we use vs. an ISA model, the riscv-arch compliance suite vs. an ISA model, and randomized IRQ, reset, etc. hits.
By exposing key parameters to my DV team, they were off to the races and began hammering the core in the design :)

This is part of what we call "maturing" the core by pushing it into all the little corners that aren't normally tested. I'm working on what can be shared, like coverage numbers and such, as that will help provide confidence in the quality of the core. At least for the features we've tested.

@stnolting
Owner

It's different styles. I typically spell it out with individual options.

Right, that seems to be the cleaner approach. Maybe we can implement that somewhere in the future 😉

This is part of what we call "maturing" the core by pushing it into all the little corners that aren't normally tested. I'm working on what can be shared, like coverage numbers and such, as that will help provide confidence in the quality of the core. At least for the features we've tested.

I'm curious. Do you only use the CPU core or do you also use parts from the SoC (interconnect, caches, peripherals)?

Btw, I think there is still a cache configuration issue that causes a simulation timeout. I'll try to fix that so we can finally merge this. :)

to avoid simulation timeout
@stnolting
Owner

Looks ready to me 👍

@mikaelsky
Collaborator Author

I'll need to double-check what I can disclose about features; beyond that, yes, we indeed use Zfinx.

As we are a Verilog house I've had to limit the exposure to VHDL, so it's really just the core (neorv32_cpu.vhd) and down. We are also using the two OCD modules with a few tweaks to fit into the Verilog interconnect. The amount of pushback I get from the core being VHDL... let's just say, a lot. It's definitely an adoption challenge.

Everything outside neorv32_cpu is homegrown Verilog. There are a few modules that I converted (mtime/wdt) to Verilog and fixed some feature choices. We are using the peripheral bus definition though, so we got that going for us :)
Not sure whether those feature tunings are worth talking about. Highlights are not using an IRQ for the WDT, as that breaks the concept of a WDT, and adding support for the SoC's lowest power states, which impacts how the WDT and mtime behave during power state transitions.

@stnolting
Owner

The amount of pushback I get from the core being VHDL... let's just say, a lot. It's definitely an adoption challenge.

I know these kinds of problems only too well... 🙈 😅

Highlights are not using an IRQ for the WDT, as that breaks the concept of a WDT

Hmm interesting point. I thought that having a "WDT timeout alarm IRQ" was quite common?! Do you think this is something that should not be implemented?

Apart from that... ready to merge? ;)

@mikaelsky
Collaborator Author

Just did a sync of the branch to get the PR up to date with HOT. It was blocking merging. So it will need to complete a new check run. Sorry.

"Hmm interesting point. I thought that having a "WDT timeout alarm IRQ" was quite common?! Do you think this is something that should not be implemented?"

If we look at the purpose of a watchdog timer, it's to "stop a core running wild". So if the core doesn't ping the watchdog before the timer runs out, it will stop the core (either by reset or forced clock gating, depending on the system in which the core exists).

When you use an IRQ to trigger the watchdog update, it is quite likely that a "core running wild" will still correctly respond to the ISR and keep pinging the WDT even though it's currently executing code out of a random location.

So for WDTs you normally don't use IRQs to trigger the WDT service event, as the ISR is likely to remain functional even when the firmware itself is dysfunctional. Instead, the firmware has to proactively "ping" the watchdog timer at certain points in the code. Imagine you have an application running that takes X time. Before jumping into the application you ping the WDT, and after you come back from the application you also ping the WDT. If your application breaks and never returns, the WDT will trigger.

The other change was to allow the firmware to write to the countdown timer. This is more a feature specific to SoC firmware applications: depending on the current state of the firmware you might want to time out every second, but for a lower power state where e.g. you are supposed to wake up every 10s you would want to reconfigure the WDT to time out after >10s.
So this way the firmware will "ping" the watchdog by writing a new countdown value into the watchdog timer register versus writing to a "reset timer" register. A countdown is used as that allows firmware to calculate the timeout as WDT clock period * timer_value.
This is quite conventional btw. not claiming any inventions here :)
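For illustration, a minimal sketch of this kind of IRQ-less countdown watchdog (all entity and port names are made up for this example; this is not the actual NEORV32 WDT):

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity wdt_sketch is
  port (
    clk_i       : in  std_ulogic;
    rstn_i      : in  std_ulogic;
    wdt_we_i    : in  std_ulogic;                     -- firmware "ping": write new countdown value
    wdt_wdata_i : in  std_ulogic_vector(31 downto 0); -- new countdown value
    sys_rstn_o  : out std_ulogic                      -- system reset when the counter expires
  );
end entity wdt_sketch;

architecture rtl of wdt_sketch is
  signal cnt : unsigned(31 downto 0);
begin
  process(clk_i, rstn_i)
  begin
    if (rstn_i = '0') then
      cnt        <= (others => '1');
      sys_rstn_o <= '1';
    elsif rising_edge(clk_i) then
      if (wdt_we_i = '1') then
        cnt <= unsigned(wdt_wdata_i); -- firmware proactively reloads the countdown
      elsif (cnt /= 0) then
        cnt <= cnt - 1;               -- timeout = WDT clock period * written value
      end if;
      if (cnt = 0) then
        sys_rstn_o <= '0';            -- timeout: stop/reset the core, no IRQ involved
      end if;
    end if;
  end process;
end architecture rtl;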

@stnolting
Owner

Just did a sync of the branch to get the PR up to date with HOT. It was blocking merging. So it will need to complete a new check run. Sorry.

👍

So for WDTs you normally don't use IRQs to trigger the WDT service event, as the ISR is likely to remain functional even when the firmware itself is dysfunctional.

Good point. The intention was to provide some kind of warning before a hard reset happens so the software can take actions like storing relevant data to nonvolatile memory. But I see your point here. Maybe we should remove that interrupt feature from the WDT? 🤔

(That would be nice because then we could use the WDT's FIRQ for another peripheral device 😉)

The other change was to allow the firmware to write to the countdown timer.

You can update the timeout register already. Couldn't that be used to implement the same feature?

Anyway, we should continue this discussion in a separate issue 😉

@stnolting
Owner

If there are no objections, I would like to merge this. 😉

@stnolting merged commit 3d7012b into stnolting:main on Mar 22, 2024 · 7 checks passed