From a9273cf4b3782e0df67db82470e99ab8124ff055 Mon Sep 17 00:00:00 2001
From: Benson Ma
Date: Tue, 11 Apr 2023 15:01:57 -0700
Subject: [PATCH] wip

---
 README.md                            |   3 +-
 fbgemm_gpu/README.md                 | 227 ++++++---------------------
 fbgemm_gpu/docs/BuildInstructions.md |  40 +++--
 3 files changed, 78 insertions(+), 192 deletions(-)

diff --git a/README.md b/README.md
index 842cf6d6f6..58c6426f79 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,6 @@
 # FBGEMM
 
-[![FBGEMMCI](https://github.com/pytorch/FBGEMM/actions/workflows/fbgemmci.yml/badge.svg)](https://github.com/pytorch/FBGEMM/actions/workflows/fbgemmci.yml)
-[![Nightly Build](https://github.com/pytorch/FBGEMM/actions/workflows/fbgemm_nightly_build.yml/badge.svg)](https://github.com/pytorch/FBGEMM/actions/workflows/fbgemm_nightly_build.yml)
+[![FBGEMM CI](https://github.com/pytorch/FBGEMM/actions/workflows/fbgemm_ci.yml/badge.svg)](https://github.com/pytorch/FBGEMM/actions/workflows/fbgemm_ci.yml)
 
 FBGEMM (Facebook GEneral Matrix Multiplication) is a low-precision,
 high-performance matrix-matrix multiplications and convolution library for
diff --git a/fbgemm_gpu/README.md b/fbgemm_gpu/README.md
index d51a9aa948..739cdf682a 100644
--- a/fbgemm_gpu/README.md
+++ b/fbgemm_gpu/README.md
@@ -1,39 +1,28 @@
 # FBGEMM_GPU
 
-[![FBGEMMCI](https://github.com/pytorch/FBGEMM/actions/workflows/fbgemmci.yml/badge.svg)](https://github.com/pytorch/FBGEMM/actions/workflows/fbgemmci.yml)
-[![Nightly Build](https://github.com/pytorch/FBGEMM/actions/workflows/fbgemm_nightly_build.yml/badge.svg)](https://github.com/pytorch/FBGEMM/actions/workflows/fbgemm_nightly_build.yml)
-[![Nightly Build CPU](https://github.com/pytorch/FBGEMM/actions/workflows/fbgemm_nightly_build_cpu.yml/badge.svg)](https://github.com/pytorch/FBGEMM/actions/workflows/fbgemm_nightly_build_cpu.yml)
+[![FBGEMM_GPU CI](https://github.com/pytorch/FBGEMM/actions/workflows/fbgemm_gpu_ci.yml/badge.svg)](https://github.com/pytorch/FBGEMM/actions/workflows/fbgemm_gpu_ci.yml)
+[![FBGEMM_GPU-CPU Nightly Build](https://github.com/pytorch/FBGEMM/actions/workflows/fbgemm_gpu_cpu_nightly.yml/badge.svg)](https://github.com/pytorch/FBGEMM/actions/workflows/fbgemm_gpu_cpu_nightly.yml)
+[![FBGEMM_GPU-CUDA Nightly Build](https://github.com/pytorch/FBGEMM/actions/workflows/fbgemm_gpu_cuda_nightly.yml/badge.svg)](https://github.com/pytorch/FBGEMM/actions/workflows/fbgemm_gpu_cuda_nightly.yml)
 
-FBGEMM_GPU (FBGEMM GPU kernel library) is a collection of
-high-performance CUDA GPU operator library for GPU training and inference.
+FBGEMM_GPU (FBGEMM GPU Kernels Library) is a collection of high-performance PyTorch
+GPU operator libraries for training and inference. The library provides efficient
+table batched embedding bag, data layout transformation, and quantization support.
 
-The library provides efficient table batched embedding bag,
-data layout transformation, and quantization supports.
-
-Currently tested with CUDA 11.3, 11.5, 11.6, and 11.7 in CI. In all cases, we test with PyTorch packages which are built with CUDA 11.7.
+FBGEMM_GPU is currently tested with CUDA 11.7.1 and 11.8 in CI, and with PyTorch
+packages that are built against those CUDA versions.
 
 Only Intel/AMD CPUs with AVX2 extensions are currently supported.
 
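+A quick way to confirm that the host CPU supports AVX2 (a sketch; Linux-only,
+since it reads `/proc/cpuinfo`):
+
+```sh
+# Prints "AVX2 available" if the CPU advertises the avx2 extension flag
+grep -q avx2 /proc/cpuinfo && echo "AVX2 available" || echo "AVX2 not available"
+```
+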
-General build and install instructions are as follows:
-
-Build dependencies: `scikit-build`, `cmake`, `ninja`, `jinja2`, `torch`, `cudatoolkit`,
-and for testing: `hypothesis`.
+## Build Instructions
 
-```
-conda install scikit-build jinja2 ninja cmake hypothesis
-```
+This section is for FBGEMM_GPU developers only. The full build instructions for
+the CUDA, ROCm, and CPU-only variants of FBGEMM_GPU can be found [here](docs/BuildInstructions.md).
 
-**If you're planning to build from source** and **don't** have `nvml.h` in your system, you can install it via the command
-below.
-```
-conda install -c conda-forge cudatoolkit-dev
-```
-Certain operations require this library to be present. Be sure to provide the path to `libnvidia-ml.so` to
-`--nvml_lib_path` if installing from source (e.g. `python setup.py install --nvml_lib_path path_to_libnvidia-ml.so`).
+## Installation
 
-
-## PIP install
+### Install through PIP
 
 Currently only built with sm70/80 (V100/A100 GPU) wheel supports:
 
@@ -53,194 +42,72 @@ pip install fbgemm-gpu-nightly
 
 # Nightly CPU-only
 pip install --pre torch -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html
 pip install fbgemm-gpu-nightly-cpu
-
 ```
 
-## Build from source
+### Running FBGEMM_GPU
 
-Additional dependencies: currently cuDNN is required to be installed.
-Please [download][4] and follow instructions [here][5] to install cuDNN.
+The tests (in the `test` folder) and benchmarks (in the `bench` folder) are
+great examples of how to use FBGEMM_GPU. To run the tests or benchmarks after
+building FBGEMM_GPU, use the following commands:
 
 ```
-# Requires PyTorch 1.13 or later
-conda install pytorch cuda -c pytorch-nightly -c "nvidia/label/cuda-11.7.1"
-git clone --recursive https://github.com/pytorch/FBGEMM.git
-cd FBGEMM/fbgemm_gpu
-# if you are updating an existing checkout
-git submodule sync
-git submodule update --init --recursive
-
-# Specify CUDA version to use
-# (may not be needed with only a single version installed)
-export CUDA_BIN_PATH=/usr/local/cuda-11.3/
-export CUDACXX=/usr/local/cuda-11.3/bin/nvcc
-
-# Specify cuDNN library and header paths. We tested CUDA 11.6 and 11.7 with
-# cuDNN version 8.5.0.96
-export CUDNN_LIBRARY=${HOME}/cudnn-linux-x86_64-8.5.0.96_cuda11-archive/lib
-export CUDNN_INCLUDE_DIR=${HOME}/cudnn-linux-x86_64-8.5.0.96_cuda11-archive/include
-
-# in fbgemm_gpu folder
-# build for the CUDA architecture supported by current system (or all architectures if no CUDA device present)
-python setup.py install
-# or build it for specific CUDA architectures (see PyTorch documentation for usage of TORCH_CUDA_ARCH_LIST)
-python setup.py install -DTORCH_CUDA_ARCH_LIST="7.0;8.0"
-```
-
-
-## Usage Example:
-```bash
-cd bench
-python split_table_batched_embeddings_benchmark.py uvm
+# Run the tests and benchmarks of the table batched embedding bag op,
+# data layout transform op, quantized ops, etc.
+cd test
+python split_table_batched_embeddings_test.py
+python quantize_ops_test.py
+python sparse_ops_test.py
+python split_embedding_inference_converter_test.py
+cd ../bench
+python split_table_batched_embeddings_benchmark.py
 ```
 
-## Build on ROCm
-
-FBGEMM_GPU supports running on AMD (ROCm) devices. A Docker container is recommended for setting up the ROCm environment. The installation on bare metal is also available. ROCm5.3 is used as an example of the installation below.
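+To run only a subset of the test cases while iterating on a change, the
+standard `unittest` `-k` filter can be used (a sketch; requires Python 3.7+,
+and the pattern below is only illustrative):
+
+```sh
+# Run only the test cases whose names contain the substring "uvm"
+python split_table_batched_embeddings_test.py -k uvm
+```
+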
-##### Build in a Docker container
-Pull Docker container and run
-```
-docker pull rocm/pytorch:rocm5.4_ubuntu20.04_py3.8_pytorch_staging_base
-sudo docker run -it --network=host --shm-size 16G --device=/dev/kfd --device=/dev/dri \
-  --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
-  --ipc=host --env PYTORCH_ROCM_ARCH="gfx906;gfx908;gfx90a" -u 0 \
-  rocm/pytorch:rocm5.4_ubuntu20.04_py3.8_pytorch_staging_base
-```
-In the container
+To run the tests and benchmarks on a GPU-capable device in CPU-only mode, set
+`CUDA_VISIBLE_DEVICES=-1`:
 ```
-pip3 install --pre torch torchvision --extra-index-url https://download.pytorch.org/whl/nightly/rocm5.3/
-cd ~
-git clone https://github.com/pytorch/FBGEMM.git
-cd FBGEMM/fbgemm_gpu
-# if you are updating an existing checkout
-git submodule sync
-git submodule update --init --recursive
-pip install -r requirements.txt
-pip install update hypothesis
-
-# in fbgemm_gpu folder
-# build for the current ROCm architecture
-gpu_arch="$(/opt/rocm/bin/rocminfo | grep -o -m 1 'gfx.*')"
-export PYTORCH_ROCM_ARCH=$gpu_arch
-python setup.py install develop
-# or build for specific ROCm architectures
-export PYTORCH_ROCM_ARCH="gfx906;gfx908"
-python setup.py install develop
-# otherwise the build will be for the default architectures gfx906;gfx908;gfx90a
+CUDA_VISIBLE_DEVICES=-1 python split_table_batched_embeddings_test.py
 ```
 
-##### Build on bare metal
-Please refer to the installation instructions of ROCm5.3 [here][6]. Take the installation on Ubuntu20.04 as an example
-```
-sudo apt-get update
-wget https://repo.radeon.com/amdgpu-install/5.3/ubuntu/focal/amdgpu-install_5.3.50300-1_all.deb
-sudo apt-get install ./amdgpu-install_5.3.50300-1_all.deb
-sudo amdgpu-install --usecase=hiplibsdk,rocm --no-dkms
-```
-MIOpen is required and needs to be installed separately.
-```
-sudo apt-get install miopen-hip miopen-hip-dev
-```
-The remaining steps are the same as the "in the container" section.
+### Run the tests on ROCm
 
-##### Run the tests on ROCm
 Please add `FBGEMM_TEST_WITH_ROCM=1` flag when running tests on ROCm.
 ```
 cd test
 FBGEMM_TEST_WITH_ROCM=1 python split_table_batched_embeddings_test.py
 ```
 
-## Issues
+### Benchmark Example
 
-Building is CMAKE based and keeps state across install runs.
-Specifying the CUDA architectures in the command line once is enough.
-However on failed builds (missing dependencies ..) this can cause problems
-and using
-```bash
-python setup.py clean
-```
-to remove stale cached state can be helpful.
-
-## Examples
-
-The tests (in test folder) and benchmarks (in bench folder) are some great
-examples of using FBGEMM_GPU.
-
-## Build Notes
-FBGEMM_GPU uses a scikit-build CMAKE-based build flow.
-
-### Dependencies
-FBGEMM_GPU requires nvcc and a Nvidia GPU with
-compute capability of 3.5+.
-
-+ ###### CUB
-
-CUB is now included with CUDA 11.1+ - the section below will still be needed for lower CUDA versions (once they are tested).
-
-For the [CUB][1] build time dependency, if you are using conda, you can continue with
-```
-conda install -c bottler nvidiacub
-```
-Otherwise download the CUB library from https://github.com/NVIDIA/cub/releases and unpack it to a folder of your choice. Define the environment variable CUB_DIR before building and point it to the directory that contains CMakeLists.txt for CUB. For example on Linux/Mac,
-
-```
-curl -LO https://github.com/NVIDIA/cub/archive/1.10.0.tar.gz
-tar xzf 1.10.0.tar.gz
-export CUB_DIR=$PWD/cub-1.10.0
-```
-
-+ ###### PyTorch, Jinja2, scikit-build
-
-[PyTorch][2], [Jinja2][3] and scikit-build are **required** to build and run the table
-batched embedding bag operator. One thing to note is that the implementation
-of this op relies on the version of PyTorch 1.9 or later.
-
 ```
-conda install scikit-build jinja2 ninja cmake
+cd bench
+python split_table_batched_embeddings_benchmark.py uvm
 ```
 
-## Running FBGEMM_GPU
-
-To run the tests or benchmarks after building FBGEMM_GPU (if tests or benchmarks
-are built), use the following command:
-```
-# run the tests and benchmarks of table batched embedding bag op,
-# data layout transform op, quantized ops, etc.
-cd test
-python split_table_batched_embeddings_test.py
-python quantize_ops_test.py
-python sparse_ops_test.py
-python split_embedding_inference_converter_test.py
-cd ../bench
-python split_table_batched_embeddings_benchmark.py
-```
-
-To run the tests and benchmarks on a GPU-capable device in CPU-only mode use CUDA_VISIBLE_DEVICES=-1
-```
-CUDA_VISIBLE_DEVICES=-1 python split_table_batched_embeddings_test.py
-```
+## Documentation
 
-## How FBGEMM_GPU works
+### How FBGEMM_GPU works
 
 For a high-level overview, design philosophy and brief descriptions of various
 parts of FBGEMM_GPU please see our Wiki (work in progress).
 
-## Full documentation
 We have extensively used comments in our source files. The best and up-to-date
 documentation is available in the source files.
 
-# Building API Documentation
+### Building the API Documentation
 
 See [docs/README.md](docs/README.md).
 
-## Join the FBGEMM community
-See the [`CONTRIBUTING`](../CONTRIBUTING.md) file for how to help out.
+
+## Join the FBGEMM_GPU Community
+
+For questions or feature requests, please file a ticket over on
+[GitHub Issues](https://github.com/pytorch/FBGEMM/issues) or reach out to us on
+the `#fbgemm` channel in [PyTorch Slack](https://bit.ly/ptslack).
+
+For contributions, please see the [`CONTRIBUTING`](../CONTRIBUTING.md) file for
+ways to help out.
+
 ## License
 
-FBGEMM is BSD licensed, as found in the [`LICENSE`](../LICENSE) file.
-
-[0]:https://pytorch.org/tutorials/advanced/torch_script_custom_ops.html
-[1]:https://github.com/NVIDIA/cub
-[2]:https://github.com/pytorch/pytorch
-[3]:https://jinja.palletsprojects.com/en/2.11.x/
-[4]:https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#download
-[5]:https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#installlinux-tar
-[6]:https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.3/page/How_to_Install_ROCm.html#_How_to_Install
+FBGEMM_GPU is BSD licensed, as found in the [`LICENSE`](../LICENSE) file.
diff --git a/fbgemm_gpu/docs/BuildInstructions.md b/fbgemm_gpu/docs/BuildInstructions.md
index 4f2c9c142b..d250640d1f 100644
--- a/fbgemm_gpu/docs/BuildInstructions.md
+++ b/fbgemm_gpu/docs/BuildInstructions.md
@@ -105,10 +105,10 @@ conda install -n "${env_name}" -y \
 
 ## Set Up for CUDA Build
 
 The CUDA build of FBGEMM_GPU requires `nvcc` that supports compute capability
-3.5+. Setting the machine up for CUDA builds of FBGEMM_GPU can be done either
-through pre-built Docker images or through Conda installation on bare metal.
-Note that neither a GPU nor the NVIDIA drivers need to be present for builds,
-since they are only used at runtime.
+**`3.5+`**. Setting the machine up for CUDA builds of FBGEMM_GPU can be done
+either through pre-built Docker images or through Conda installation on bare
+metal. Note that neither a GPU nor the NVIDIA drivers need to be present for
+builds, since they are only used at runtime.
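+
+As a quick check of the compute capability of an installed GPU (not needed for
+the build itself; a sketch that assumes a working CUDA-enabled PyTorch
+installation):
+
+```sh
+# Prints, e.g., (8, 0) for an A100
+python -c "import torch; print(torch.cuda.get_device_capability())"
+```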
 
 ### Docker Image
 
@@ -182,8 +182,9 @@ wget -q https://github.com/NVIDIA/cub/archive/1.10.0.tar.gz
 
 ## Set Up for ROCm Build
 
-Setting the machine up for ROCm builds of FBGEMM_GPU can be done either through
-pre-built Docker images or through bare metal.
+FBGEMM_GPU supports running on AMD (ROCm) devices. Setting the machine up for
+ROCm builds of FBGEMM_GPU can be done either through pre-built Docker images or
+through bare metal.
 
 ### Docker Image
 
@@ -415,11 +416,30 @@ python setup.py bdist_wheel \
 python setup.py install --cpu_only
 ```
 
-### Post-Build Checks
+### Post-Build Checks (For Developers)
 
-After the build completes, it is useful to check the built library and verify
-the version numbers of GLIBCXX referenced as well as the availability of certain
-function symbols:
+After the build completes, it is useful to run some checks to verify that the
+build is correct.
+
+#### Undefined Symbols Check
+
+Because FBGEMM_GPU contains a lot of template functions and their instantiations,
+it is important to check the built library for undefined template instantiations:
+
+```sh
+# !! Run in fbgemm_gpu/ directory inside the Conda environment !!
+
+# Locate the built .SO file
+fbgemm_gpu_lib_path=$(find . -name fbgemm_gpu_py.so)
+
+# Check that the undefined symbols don't include fbgemm_gpu-defined functions
+nm -gDCu "${fbgemm_gpu_lib_path}"
+```
+
+#### GLIBC Version Compatibility Check
+
+It is also useful to verify the version numbers of GLIBCXX referenced by the
+library, as well as the availability of certain function symbols:
 
 ```sh
 # !! Run in fbgemm_gpu/ directory inside the Conda environment !!
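+
+# A sketch of the check (assumes GNU binutils is available): locate the built
+# library, then list the GLIBCXX symbol versions it references
+fbgemm_gpu_lib_path=$(find . -name fbgemm_gpu_py.so)
+objdump -TC "${fbgemm_gpu_lib_path}" | grep -o 'GLIBCXX_[0-9.]*' | sort -Vu
+```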