Make evicted_rows a UVA buffer #3079

Closed
wants to merge 2 commits into from

Commits on Sep 5, 2024

  1. Add compact_indices op

    Differential Revision: D62190777
    sarunya authored and facebook-github-bot committed Sep 5, 2024
    153fa62
  2. Make evicted_rows a UVA buffer (pytorch#3079)

    Summary:
    Pull Request resolved: pytorch#3079
    
    X-link: facebookresearch/FBGEMM#173
    
    Prior to this diff, SSD-TBE used a combination of a pinned CPU buffer
    and the GPU buffer for `evicted_rows` (the buffer for staging rows
    that are evicted from L1 cache).  It explicitly performed asynchronous
    memory copy (via `cudaMemcpyAsync`) to transfer `evicted_rows` from
    device to host.  Since the number of evicted rows is known only on the
    device, SSD-TBE overallocated the `evicted_rows` CPU and GPU buffers.
    Therefore, it transferred extra data during the device-host memory
    copy.  This extra data could be large and could make the memory
    copy a bottleneck during execution.
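
    To make the cost of overallocation concrete, here is a small
    back-of-the-envelope sketch.  All sizes (buffer capacity, evicted row
    count, row width, element size) are made-up illustrative values, not
    measurements from SSD-TBE, and `transfer_bytes` is a hypothetical
    helper.

```python
def transfer_bytes(num_rows, row_dim, bytes_per_elem):
    """Bytes moved when copying num_rows rows of row_dim elements each."""
    return num_rows * row_dim * bytes_per_elem


# Hypothetical sizes, chosen only for illustration.
capacity_rows = 1_000_000  # rows the overallocated buffer can hold
evicted_rows = 50_000      # rows actually evicted (known only on the device)
row_dim = 128              # elements per row
bytes_per_elem = 4         # e.g. 32-bit floats

# A host-issued copy cannot see the device-side count, so it copies everything.
full_copy = transfer_bytes(capacity_rows, row_dim, bytes_per_elem)
# Only this much of the transfer carries valid evicted rows.
needed = transfer_bytes(evicted_rows, row_dim, bytes_per_elem)
wasted_fraction = 1 - needed / full_copy

print(f"full copy: {full_copy} B, needed: {needed} B, "
      f"wasted: {wasted_fraction:.0%}")
```

    Under these assumed numbers most of the transfer carries no useful
    rows, which is the situation the paragraph above describes.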
    
    This diff mitigates the problem by making `evicted_rows` a unified
    virtual addressing (UVA) buffer and using a kernel (namely
    `masked_index_select`) to load/store data instead of a CUDA memory
    copy operation.  This avoids transferring the extra data.  However,
    the kernel-based copy can be less efficient (it might not fully
    saturate the available memory bandwidth) since it does not use the
    copy engine.  Moreover, since it uses SMs for the copy, it can
    compete for SM resources when the operator is overlapped with
    other computation.
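
    The idea of a masked, count-bounded row copy can be modeled on the
    host as follows.  This is a pure-Python sketch of the concept only:
    the function name `masked_index_select_sketch`, its argument order,
    and the convention that negative indices are skipped are assumptions
    made for illustration and are not taken from the FBGEMM operator.

```python
def masked_index_select_sketch(dst, src, indices, count):
    """Copy src[indices[i]] into dst[i] for i < count, skipping negative indices.

    Hypothetical host-side model: in a real kernel, `count` would be read
    from device memory and each row would be copied by GPU threads.
    Returns the number of rows actually copied.
    """
    copied = 0
    for i in range(count):
        idx = indices[i]
        if idx < 0:  # assumed convention: negative index means "no row here"
            continue
        dst[i] = list(src[idx])
        copied += 1
    return copied


# Toy data: a 4-row "cache" and a staging buffer overallocated to 4 rows.
cache = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
staging = [[None, None] for _ in range(4)]
indices = [2, -1, 0, 3]
count = 3  # only the first 3 index slots are valid

copied = masked_index_select_sketch(staging, cache, indices, count)
print(copied, staging)
```

    Only rows selected by a valid index within the first `count` slots
    are touched; the rest of the overallocated staging buffer is never
    read or written, which is how a kernel-based copy can avoid moving
    the extra data.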
    
    Reviewed By: q10
    
    Differential Revision: D62114877
    sryap authored and facebook-github-bot committed Sep 5, 2024
    b059883