Boost Verilator #657
…ions
* New replacement modules for encode_one_hot and mux_one_hot (the latter was taking 50% of simulation time!)
* Optimization flags for Verilator
* Comments for threading and profiling (Experts only!)
* Switched to static linking
Note the behavior is very different if multiple select lines are set. Very
scary. I would suggest an implementation that uses an OR instead of
assignment...
Also, add an initial begin block that warns that this deviates from the
synthesized source base...
M
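To make the multiple-select concern concrete, here is a minimal C++ behavioral sketch. It is not the code from this PR: the function names `mux_one_hot_assign` and `mux_one_hot_or` are hypothetical, the 32-bit data width is an assumption, and the "assignment" form is assumed to be a loop in which each set select bit overwrites the output.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical "assignment" model: every set select bit overwrites the
// output, so with several bits set the highest-indexed input wins.
// Assumes data.size() <= 32 so the shift below stays in range.
std::uint32_t mux_one_hot_assign(const std::vector<std::uint32_t>& data,
                                 std::uint32_t sel) {
    std::uint32_t out = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if ((sel >> i) & 1u) {
            out = data[i];
        }
    }
    return out;
}

// Hypothetical "OR" model: every selected input is OR-ed into the output,
// which is how an AND-OR mux behaves when several selects are set.
// Assumes data.size() <= 32 so the shift below stays in range.
std::uint32_t mux_one_hot_or(const std::vector<std::uint32_t>& data,
                             std::uint32_t sel) {
    std::uint32_t out = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if ((sel >> i) & 1u) {
            out |= data[i];
        }
    }
    return out;
}
```

For a truly one-hot select the two forms agree. With inputs 0x0F and 0xF0 and both select bits set (sel = 0x3), the assignment form returns 0xF0 while the OR form returns 0xFF, so only the OR form keeps the simulation model in step with an AND-OR implementation when the one-hot assumption is violated.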
On Tue, Dec 29, 2020 at 11:05 AM Dustin Richmond ***@***.***> wrote:
Pulling forward these changes for future reference & possible inclusion.
This was motivated by Dan suggesting that I profile the network for our
network simulator.
This is a couple of months' worth of on-and-off hacking. I figured out how to
profile Verilator (not straightforward...), and then implemented some fixes.
The two significant changes:
- Replaced bsg_mux_one_hot.v from basejump with a simulation-amenable
version. (FOR VERILATOR ONLY)
- Replaced bsg_encode_one_hot.v from basejump with a
simulation-amenable version. (FOR VERILATOR ONLY)
mux_one_hot was consuming 50% of the execution time for network-heavy
benchmarks. encode_one_hot was problematic because it has what Verilator
calls an "unopt-flat" (UNOPTFLAT) issue that can't be "fixed" without
substantially rewriting our tape-out RTL.
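For readers unfamiliar with the module, the function a one-hot encoder computes can be sketched in C++ as a flat loop over the input bits. This is only a behavioral illustration under assumptions: the struct `Encoded`, the function name `encode_one_hot`, and the 32-bit input width are hypothetical, and the thread does not say whether the replacement Verilog in this PR is written this way.

```cpp
#include <cstdint>

// Hypothetical result type: the encoded index plus a flag that is true
// when at least one input bit was set.
struct Encoded {
    std::uint32_t addr;
    bool valid;
};

// Flat-loop one-hot encoder (illustrative only). For a one-hot input the
// result is the index of the single set bit. The indices of all set bits
// are OR-ed together, so a non-one-hot input yields a meaningless address.
Encoded encode_one_hot(std::uint32_t one_hot) {
    Encoded result{0u, false};
    for (std::uint32_t i = 0; i < 32u; ++i) {
        if ((one_hot >> i) & 1u) {
            result.addr |= i;
            result.valid = true;
        }
    }
    return result;
}
```

Each output here depends only on the input vector, with no intermediate signal feeding back into itself, which is the property that makes a flat formulation easier for a simulator to schedule.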
I also added -O2 flags for GCC and Verilator. This makes the Verilator build
take a long time (20 minutes), but it only needs to be done once for anyone
who is not hacking on the RTL. (Otherwise, VCS is a better solution.)
I left behind some hints for threading and profiling as well.
I haven't done a full analysis, but I found that some testbench programs
now run as fast as they do under VCS.
Commit Summary
- Speedup verilator using non-synthesizable constructions and
optimizations
- Remove linking additions
File Changes
- *M* libraries/features/dma/simulation/feature.mk
<https://github.com/bespoke-silicon-group/bsg_replicant/pull/657/files#diff-334b61ef3be3fc92cfa11913838192f6bbb0b32c0a9b5559bc37602858b30ae6>
(4)
- *M* libraries/features/dma/simulation/libdmamem.mk
<https://github.com/bespoke-silicon-group/bsg_replicant/pull/657/files#diff-8438dcc1cd2ea1c3c89e427dc1403656fc2b31ae99905a84ae6fdf1b8cd96336>
(3)
- *M* libraries/libraries.mk
<https://github.com/bespoke-silicon-group/bsg_replicant/pull/657/files#diff-488f8812ad48fb783133447796c18adec2897c64a36f497d2ac5c122fa6418ad>
(4)
- *M*
libraries/platforms/aws-fpga/hardware/bsg_manycore_endpoint_to_fifos.v
<https://github.com/bespoke-silicon-group/bsg_replicant/pull/657/files#diff-f104ef2611b5f4526c9d7128b89c92d3325ee06df32b9340ac2d013284286d71>
(2)
- *M* libraries/platforms/dpi-verilator/compilation.mk
<https://github.com/bespoke-silicon-group/bsg_replicant/pull/657/files#diff-eb1a21b30efd18337b7f1074b6b674e0fce7742b61c40d106a6ed43d9ba15fe3>
(4)
- *M* libraries/platforms/dpi-verilator/execution.mk
<https://github.com/bespoke-silicon-group/bsg_replicant/pull/657/files#diff-4c1ccaa22d1e2eb345a6a32965e8b52ad54dcef48289cc4c530cbb1c9018d8f1>
(6)
- *M* libraries/platforms/dpi-verilator/hardware.mk
<https://github.com/bespoke-silicon-group/bsg_replicant/pull/657/files#diff-d8b0c1083523cc7ffa550f79c1f4c481f3b79e11956dc11adf8d799ae0e8ada3>
(1)
- *A*
libraries/platforms/dpi-verilator/hardware/bsg_nonsynth_encode_one_hot.v
<https://github.com/bespoke-silicon-group/bsg_replicant/pull/657/files#diff-f4b09fef5a25f1bff2fe93ea830e1aaefc6e5455523d447fad5121d8c162647a>
(26)
- *A*
libraries/platforms/dpi-verilator/hardware/bsg_nonsynth_mux_one_hot.v
<https://github.com/bespoke-silicon-group/bsg_replicant/pull/657/files#diff-c88e8e050055ba10f3665492a3b48a1f2c1b12efa706230764760348c4416d6d>
(26)
- *M* libraries/platforms/dpi-verilator/library.mk
<https://github.com/bespoke-silicon-group/bsg_replicant/pull/657/files#diff-bb8f58eea5acb4c1f825d26cfbbb8c3e96185275f5d4a9bf05be322917fd2f5a>
(4)
- *M* libraries/platforms/dpi-verilator/link.mk
<https://github.com/bespoke-silicon-group/bsg_replicant/pull/657/files#diff-90b01e115c9692dca689005ff78bb3378a4d432859f6db110d0f47e152968b03>
(58)
Patch Links:
- https://github.com/bespoke-silicon-group/bsg_replicant/pull/657.patch
- https://github.com/bespoke-silicon-group/bsg_replicant/pull/657.diff
|
But very intriguing about the performance differences!
|
Oops, did not mean to close.
This is good feedback; I'll implement it after the network simulator. I also don't know why these modules are so costly. That's something I didn't explore.
|
* Updated mux for OR