Updated Performance test #844
Conversation
Now with functional makefiles and tests
Created a "clone" of common.mk that makes the software test work. |
Hey @mikaelsky! Thanks for sharing your CPI evaluation setup! I'm still looking through your code. I do not quite understand why we need exclusive testbenches here. 🤔 Couldn't we just use the default one together with the existing …?
The "exclusive" testbench was made to ensure that when we run the CPI test we only run with the core core and not all sorts of fun peripherals like cache and external memory. Basically this means you get just the core performance and not the "core+random peripheral that degrades performance" performance. The change in common.mk, even though small, does allow for an easy add of different testbenches. So e.g. lets say I want to make a testbench that had a new feature in it, I could make a new area "newFeatureTB" that has my new feature setup. Then I can change the NEORV32_SIM_FOLDER in my local makefile to point to my new feature directory and hey presto I can reuse all of the existing make and sim infrastructure while also testing my new feature :) There is a simple script that runs all the benchmarks btw. "run_performance.sh". I had to add the makefile in there as the github pull request script runs through all the examples and call make clean/sim something. If that make target doesn't exist in that first hierarchy level the pull request script fails out. The upside is you can now just call make sim from the performance_test directory and it will run all the performance tests automatically. So there is really no need for a simple script anymore. |
That's a good point. However, the instruction timing description in https://stnolting.github.io/neorv32/#_instruction_sets_and_extensions already shows the architectural "best case" CPI values. These are of course heavily dependent on which ISA extensions are switched on and what the memory system looks like. So, wouldn't it be better to use your benchmarks to profile a particular / application-specific setup for its average CPI values? That's where we could stick to the default (simple) testbench: adjust all the configuration options (e.g. CPU extensions but also memory latencies) according to the real setup to get a full-featured simulation model that could then also be used to benchmark average CPI. Don't get me wrong, I really like your proposal. I am just afraid of maintaining all these different testbench setups. 🙈
:best Hermione impression: "Actually..." ;) The documentation is outdated. The two motivations for the CPI test were to …
I do agree the CPI should be measured on a given implementation, which is what we are doing internally :) The documentation should show the theoretical best case, with the caveat that the theoretical best case is highly dependent on the execute engine state machine and all its cross dependencies. From there it can only go downhill based on the system implementer's choices.
That's true (unfortunately) 🙈
Actually (😅), the optimum CPI (cycles per instruction) for the fastest instructions is 2.
Agreed! But I think we should do this by hand for now (as we have no way to move up-to-date CI results to the documentation yet). So a local run with an adjusted configuration of the default testbench should be fine for this.
👍
Right. We should add a note about this to the documentation. Again, I like this simple-to-use setup: just run your proposed script and get all the interesting CPI stuff. But it feels oversized to have a separate testbench for this (I already find it difficult that we have a VUnit and a "boring normal" testbench). 🤔 But that's just my opinion. If this is something that can help people to easily benchmark their setups then I am fine with it. 😉
Oh, and I added a new feature to the testbench, which is used by the CPI testing system. If you look at the bottom I added a "finish" tester that looks at GPIO(32) assertion; when that happens it calls std.env.finish, which stops the test from running and cleanly exits GHDL. To use this I had to also update the GHDL call scripts to use VHDL-2008 instead of the default VHDL-92/93, as std.env is a VHDL-2008 thing. Beyond the above modifications to the GHDL calling scripts we would also need to add some parameters that we pass into GHDL for controlling features. Not sure how to achieve that with GHDL as well. Having e.g. a "turn off all the stuff" parameter would allow us to set the enabled core extensions and such in the simple testbench.
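For illustration, a minimal sketch of such a finish tester, assuming the testbench clock is named clk_gen and the GPIO output port is gpio_o (the actual names and code in this PR may differ):

  -- hypothetical "finish tester": end the simulation cleanly when firmware asserts GPIO(32)
  sim_finish: process(clk_gen)
  begin
    if rising_edge(clk_gen) then
      if (gpio_o(32) = '1') then -- firmware sets this bit when the benchmark is done
        report "Test finished by firmware." severity note;
        std.env.finish; -- VHDL-2008 only: stop the simulation and exit GHDL cleanly
      end if;
    end if;
  end process;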
I really like that because that would be a very clean solution. We already do something like this in the RISCOF testbench:

entity neorv32_riscof_tb is
  generic (
    MEM_FILE : string;            -- memory initialization file
    MEM_SIZE : natural := 8*1024; -- total memory size in bytes
    RISCV_E  : boolean := false   -- embedded ISA extension
  );
end neorv32_riscof_tb;

There we have all the relevant configuration options exposed as generics (including default values) and we simply override them when calling GHDL, e.g. … I would love to have something like that for the default testbench (wink 😉) but that would be a lot of coding effort... 🙈
This is a nice feature we are also using for the RISCOF testbench. Unfortunately, this is a VHDL-2008 feature and some users might have issues with this (e.g. #847), so I would suggest keeping this for the VUnit testbench only. Anyway, I'm still not fully convinced that we should add another testbench here. The maintaining overhead is quite heavy already. 🙈 Can't we find a practical compromise? 🤔
I'll take a look at it; it should be okay to add a few extra generics that I can pass in via GHDL_PARAMS. For GHDL, I assume we are ok with me adding the VHDL-2008 option to the default script? The one piece of good news is that the Vivado simulator does indeed support some VHDL-2008 features. In particular it supports what I added, which is finish, assuming folks are on at least 2020.2, which in the year 2024 we can somewhat assume, or tell them to upgrade to at least 2020.2. https://docs.xilinx.com/r/2020.2-English/ug900-vivado-logic-simulation/Supported-Features
…e. Added generics to the simple testcase to be able to control performance features. Updated the GHDL scripts to accept more than 1 command line parameter. Updated the performance test makefiles to call the simple test bench with appropriate generics and GHDL settings
I've updated the performance test to use the simple test bench without VHDL-2008 features. I use a tuned timeout instead by setting the timeout value from the makefile. It's less flexible but works. Should be a lot easier to maintain now :)
Hey @mikaelsky
Well, you are right... The truth is that many people are still using older toolchains (like Xilinx ISE). 🙈 So I would suggest keeping VHDL-2008 out of the default simulation setup - at least for the "simple" testbench. The VUnit testbench requires VHDL-2008 anyway.
Oh, that's nice! I'll have a look at that!
That should be just fine I think.
Maybe we could add a matrix-style set of default configurations and use a single "ID" generic to select one specific configuration - just like the LiteX wrapper: neorv32/rtl/system_integration/neorv32_litex_core_complex.vhd (lines 137 to 152 at 65c30d2).
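As a sketch of that idea only (this is not the actual LiteX wrapper code; all type and field names below are made up for illustration): a record describes one configuration, a constant array holds the presets, and a single integer generic selects the row to apply.

  -- hypothetical configuration matrix
  type config_t is record
    riscv_c   : boolean; -- compressed instructions
    riscv_m   : boolean; -- hardware mul/div
    icache_en : boolean; -- instruction cache
  end record;
  type config_table_t is array (natural range <>) of config_t;
  constant configs_c : config_table_t(0 to 2) := (
    0 => (riscv_c => false, riscv_m => false, icache_en => false), -- "minimal"
    1 => (riscv_c => true,  riscv_m => true,  icache_en => false), -- "lite"
    2 => (riscv_c => true,  riscv_m => true,  icache_en => true)   -- "full"
  );
  -- a single generic, e.g. CONFIG_ID : natural := 0, then picks the preset:
  -- CPU_EXTENSION_RISCV_C => configs_c(CONFIG_ID).riscv_c, ...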
🤔 Thanks again for all your work! :)
Btw, please revert …
After a bit of a typo battle I've changed the performance options to be matrix style. Different from how I normally do this, but hey, it's a style thing :)
Looks really great!
How would you have done this? I think we have a simulation time-out issue right now because of the reworked cache system. Updating the cache sizes should fix this:
Upstream cache configuration: ✔️
Your current cache configuration: ❌
Got it :) Note to self: when GitHub is faster than your home server, time to upgrade :)

As an example, our internal neorv32 regression suite runs ~520 randomized test runs every night that test the core in-design and the core by itself. The main "test" areas are riscv_dv-driven randomized instruction suites for our used extensions vs. an ISA model, the riscv-arch compliance suite vs. an ISA model, and randomized IRQ, reset, etc. hits. This is part of what we call "maturing" the core: pushing it into all the little corners that aren't normally tested. I'm working on what can be shared, like coverage numbers and such, as that will help provide confidence in the quality of the core. At least for the features we've tested.
The test passes locally with 10ms (default) sim time but for some reason fails on GitHub
Right, that seems to be the cleaner approach. Maybe we can implement that somewhere in the future 😉
I'm curious. Do you only use the CPU core or do you also use parts of the SoC (interconnect, caches, peripherals)? Btw, I think there is still a cache configuration issue that causes a simulation timeout. I'll try to fix that so we can finally merge this. :)
to avoid simulation timeout
Looks ready to me 👍
I'll need to double check what I can disclose of features; beyond that, yes, we indeed use Zfinx. As we are a Verilog house I've had to limit the exposure to VHDL, so it's really just the core (neorv32_cpu.vhd) and down. We are also using the 2 OCD modules with a few tweaks to fit into the Verilog interconnect. The amount of push back I get from the core being VHDL.. let's just say, a lot. It's definitely an adoption challenge. Everything outside neorv32_cpu is homegrown Verilog. There are a few modules that I converted (mtime/wdt) to Verilog and fixed some feature choices. We are using the peripheral bus definition though, so we got that going for us :)
I know these kinds of problems only too well... 🙈 😅
Hmm interesting point. I thought that having a "WDT timeout alarm IRQ" was quite common?! Do you think this is something that should not be implemented? Apart from that... ready to merge? ;)
Just did a sync of the branch to get the PR up to date with HOT. It was blocking merging. So it will need to complete a new check run. Sorry.

"Hmm interesting point. I thought that having a "WDT timeout alarm IRQ" was quite common?! Do you think this is something that should not be implemented?"

If we look at the purpose of a watchdog timer, it's to "stop a core run wild". So if the core doesn't ping the watchdog before the timer runs out, it will stop the core (either by reset or forced clock gate, depending on the system in which the core exists). When you use an IRQ to trigger the watchdog update, it is quite likely that a "core run wild" will still correctly respond to the ISR and keep pinging the WDT even though it is currently executing code out of a random location. So for WDTs you normally don't use IRQs to trigger the WDT service event, as the ISR is likely to remain functional even when the firmware itself is dysfunctional. Instead, the firmware has to proactively "ping" the watchdog timer at certain intervals in the firmware itself. Imagine you have an application running that takes X time. Before jumping into the application you ping the WDT, and after you come back from the application you also ping the WDT. If your application breaks and never returns, the WDT will trigger.

The other change was to allow the firmware to write to the count-down timer. This is more a feature specific to SoC firmware applications: depending on the current state of the firmware you might want to time out every second, but in a lower power state where e.g. you are supposed to wake up every 10s, you would want to reconfigure the WDT to time out after >10s.
👍
Good point. The intention was to provide some kind of warning before a hard reset happens so the software can take actions like storing relevant data to nonvolatile memory. But I see your point here. Maybe we should remove that interrupt feature from the WDT? 🤔 (That would be nice because then we could use the WDT's FIRQ for another peripheral device 😉)
You can update the …

Anyway, we should continue this discussion in a separate issue 😉
If there are no objections, I would like to merge this. 😉