Add cache conflict miss support (pytorch#2596)
Summary:
Pull Request resolved: pytorch#2596

Prior to this diff, SSD TBE lacked support for the conflict cache miss scenario. It operated under the assumption that the cache, located in GPU memory, was sufficiently large to hold all data prefetched from SSD. In the event of a conflict cache miss, the behavior of SSD TBE was unpredictable: it could either fail or potentially access illegal memory. Note that a conflict cache miss happens when an embedding row is absent from the cache and, after being fetched from SSD, cannot be inserted into the cache due to capacity constraints or associativity limitations.

This diff introduces support for conflict cache misses by storing rows that cannot be inserted into the cache in a scratch pad, a temporary GPU tensor. When rows are missed from the cache, TBE kernels can access the scratch pad instead.

Prior to this diff, during the SSD prefetch stage, any row that missed the cache and required fetching from SSD was first fetched into a CPU scratch pad and then transferred to GPU. Rows that could be inserted into the cache were subsequently copied from the GPU scratch pad into the cache. If conflict misses occurred, the prefetch behavior was unpredictable. With this diff, conflict-missed rows are now retained in the scratch pad, which is kept alive until the current iteration completes. Throughout the forward and backward + optimizer stages of TBE, the cache and the scratch pad are used equivalently. However, after the backward + optimizer step completes, rows in the scratch pad are flushed back to SSD, unlike rows residing in the cache, which are retained for future use.

Differential Revision: D55998215
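A minimal sketch of the flow described above, using a toy capacity-limited cache. `ToySSDStore`, `prefetch`, `lookup`, and `flush_scratch_pad` are hypothetical names for illustration, not FBGEMM APIs; the real SSD TBE uses a set-associative GPU cache and an SSD-backed embedding store, and its kernels index the GPU scratch-pad tensor directly.

```python
import torch

EMB_DIM = 4  # toy embedding dimension

class ToySSDStore:
    """CPU-side dict standing in for the SSD-backed embedding store."""
    def __init__(self):
        self.rows = {}

    def get(self, idx):
        # Materialize a random row on first access, mimicking a populated store.
        if idx not in self.rows:
            self.rows[idx] = torch.randn(EMB_DIM)
        return self.rows[idx]

    def put(self, idx, row):
        self.rows[idx] = row.clone()

def prefetch(indices, cache, cache_capacity, store):
    """Fetch cache-missed rows from the store; rows that cannot be
    inserted into the cache (conflict misses) stay in a scratch pad."""
    scratch_rows, scratch_ids = [], []
    for idx in indices.tolist():
        if idx in cache:
            continue                      # cache hit: nothing to do
        row = store.get(idx)              # fetch the missed row from "SSD"
        if len(cache) < cache_capacity:
            cache[idx] = row.clone()      # ordinary miss: insert into cache
        else:
            scratch_ids.append(idx)       # conflict miss: keep in scratch pad
            scratch_rows.append(row)
    scratch = torch.stack(scratch_rows) if scratch_rows else torch.empty(0, EMB_DIM)
    return scratch, scratch_ids

def lookup(idx, cache, scratch, scratch_ids):
    """Forward/backward treat the cache and the scratch pad as equivalent."""
    if idx in cache:
        return cache[idx]
    return scratch[scratch_ids.index(idx)]

def flush_scratch_pad(scratch, scratch_ids, store):
    """After backward + optimizer, write scratch-pad rows back to the store;
    cached rows are left in place for future iterations."""
    for pos, idx in enumerate(scratch_ids):
        store.put(idx, scratch[pos])
```

With a cache capacity of 2, prefetching three rows forces one conflict miss, which is then served from the scratch pad and flushed back after the (stand-in) optimizer update:

```python
store, cache = ToySSDStore(), {}
scratch, ids = prefetch(torch.tensor([0, 1, 2]), cache, cache_capacity=2, store=store)
row = lookup(2, cache, scratch, ids)    # conflict-missed row served from scratch pad
row.add_(0.1)                           # stand-in for the in-place optimizer update
flush_scratch_pad(scratch, ids, store)  # updated row written back to "SSD"
```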