Fixes for atomic coalescing at L1, correlated with QV100 hardware #33

Open
wants to merge 7 commits into
base: dev

@abhaumick abhaumick commented Mar 26, 2022

  • fixed atomic coalescing at L1

    • modified warp_inst_t::memory_coalescing_arch_atomic()
    • behavior correlated with QV100 hardware
  • modified trace.h

    • added DPRINTF_RAW() to allow prints without gpu_sim_cycle
    • for prints from classes that lack the gpu_sim_cycle and gpu_tot_sim_cycle variables that DPRINTF uses
  • added config option gpgpu_shmem_atomic_warp_parts

    • added to option parser (default value: 2)
    • updated QV100 config
  • added trace streams

    • ATOMICS
    • ATOMICS_DETAIL
  • resolves mismatch reported in Meeting Minutes -- 3/20/20

    Benchmark 2 (Atomic bandwidth to the same address)
    atomic_add_bw_conflict:

    cycle error = 1554% (Sim/HW cycles = 16.5X) so simulator is 16.5X slower than HW.
    HW l2 atomics = 10240, Sim l2 atomics = 163840

    In the microbench, we generate atomic accesses wherein all threads access the same memory region.
    We generate 163840 threads, each executes 1 time (so total atomic insts = 163840)

    It seems that GPGPU-Sim serializes accesses to the same region, while the HW coalesces these accesses into groups of 16 threads.
    We can fix the simulator to coalesce conflicting accesses into groups of 16 threads; this may alleviate the problem here.
    A relatively simple change to "memory_coalescing_arch_atomic" in abstract_hardware_model.cc should fix this.
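
The coalescing behavior described above can be sketched as a small transaction counter. This is an illustration only, not the code in warp_inst_t::memory_coalescing_arch_atomic(): the function name `count_atomic_transactions` and its parameters are hypothetical, and merging on exact address (rather than on a larger memory region) is a simplifying assumption. The `warp_parts` argument plays the role that gpgpu_shmem_atomic_warp_parts appears to play in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical helper, NOT GPGPU-Sim code. Counts the transactions produced by
// one warp-wide atomic instruction when threads in the same warp part that
// target the same address are merged into a single transaction.
//   addrs      - one address per thread, in lane order
//   warp_parts - number of equal parts the warp is split into
//                (2 parts of a 32-thread warp gives 16-thread groups)
std::size_t count_atomic_transactions(const std::vector<std::uint64_t>& addrs,
                                      std::size_t warp_parts) {
  assert(warp_parts > 0 && addrs.size() % warp_parts == 0);
  const std::size_t part_size = addrs.size() / warp_parts;
  std::size_t transactions = 0;
  for (std::size_t part = 0; part < warp_parts; ++part) {
    const auto first =
        addrs.begin() + static_cast<std::ptrdiff_t>(part * part_size);
    const auto last = first + static_cast<std::ptrdiff_t>(part_size);
    // Assumption: every distinct address in this part costs one transaction.
    const std::set<std::uint64_t> unique_addrs(first, last);
    transactions += unique_addrs.size();
  }
  return transactions;
}
```

With all 32 lanes of a warp targeting one address, `count_atomic_transactions(addrs, 2)` yields 2 transactions per warp. Across 163840 / 32 = 5120 warps that is 10240, which is consistent with the HW l2 atomics figure quoted above, while one transaction per thread gives the 163840 reported for the simulator.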

mkhairy and others added 5 commits August 23, 2021 13:58
Sub core & some minor bug fix
- best case coalescing of atomic operations - full CAM based search
- integrated with DPRINTF with ATOMICS Flag
- replaced full CAM coalescing with common case coalescing
- correlated with QV100 GPU
- added ATOMICS_DETAIL trace flag
- made ATOMICS prints concise
- disabled tracing and restored default trace flags in QV100 tested-cfgs
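
For readers unfamiliar with the trace macros, the difference between DPRINTF and the DPRINTF_RAW() added in this PR might look roughly like the sketch below. The macro bodies, the `trace_enabled` and `trace_names` tables, and the two example structs are all assumptions made for illustration; only the macro names and the ATOMICS / ATOMICS_DETAIL stream names come from the PR, and the real definitions in trace.h may differ.

```cpp
#include <cstdio>

// Hypothetical stand-ins for the trace stream tables in trace.h.
enum TraceStream { ATOMICS, ATOMICS_DETAIL, NUM_TRACE_STREAMS };
static bool trace_enabled[NUM_TRACE_STREAMS] = {true, false};
static const char* trace_names[NUM_TRACE_STREAMS] = {"ATOMICS",
                                                     "ATOMICS_DETAIL"};

// Cycle-stamped print. It only compiles where gpu_sim_cycle and
// gpu_tot_sim_cycle are visible in the calling scope.
#define DPRINTF(stream, ...)                                               \
  do {                                                                     \
    if (trace_enabled[stream]) {                                           \
      std::printf("%8llu - %s: ",                                          \
                  static_cast<unsigned long long>(gpu_sim_cycle +          \
                                                  gpu_tot_sim_cycle),      \
                  trace_names[stream]);                                    \
      std::printf(__VA_ARGS__);                                            \
    }                                                                      \
  } while (0)

// Raw print. No cycle prefix, so it can be used from any scope.
#define DPRINTF_RAW(stream, ...)                                           \
  do {                                                                     \
    if (trace_enabled[stream]) std::printf(__VA_ARGS__);                   \
  } while (0)

// Hypothetical class that owns the cycle counters: DPRINTF works here.
struct core_like {
  unsigned long long gpu_sim_cycle = 100;
  unsigned long long gpu_tot_sim_cycle = 5;
  void tick() { DPRINTF(ATOMICS, "atomic issued\n"); }
};

// Hypothetical class without cycle counters: only DPRINTF_RAW compiles here.
struct inst_like {
  void coalesce() { DPRINTF_RAW(ATOMICS, "atomic coalesced\n"); }
};
```

Because both macros check the stream flag before touching their arguments, a disabled stream such as ATOMICS_DETAIL costs only the flag test in this sketch.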
@abhaumick abhaumick requested review from mkhairy and tgrogers April 4, 2022 17:09
@JRPan JRPan requested review from cesar-avalos3 and removed request for mkhairy May 15, 2023 17:54
@JRPan JRPan requested a review from mkhairy May 23, 2023 16:30
cesar-avalos3 previously approved these changes May 31, 2023
@cesar-avalos3 cesar-avalos3 left a comment

Correlation of the atomic ubenches does not look significantly better than on the latest (as of this review) dev branch of GPGPU-Sim; atomic_add_bw_diverge is still off by a lot. The code makes sense, though.
Still waiting for feedback from the other reviewers and the original author.

@cesar-avalos3 cesar-avalos3 dismissed their stale review June 19, 2023 17:25

Worse in SASS

3 participants