diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
index ae895daf28a..2c5ecf68690 100644
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -19,15 +19,18 @@ Here are some guidelines to help the review process go smoothly.
    noted here: https://help.github.com/articles/closing-issues-using-keywords/
 
 5. If your pull request is not ready for review but you want to make use of the
-   continuous integration testing facilities please label it with `[WIP]`.
+   continuous integration testing facilities, please mark your pull request as Draft.
+   https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/changing-the-stage-of-a-pull-request#converting-a-pull-request-to-a-draft
 
 6. If your pull request is ready to be reviewed without requiring additional
-   work on top of it, then remove the `[WIP]` label (if present) and replace
-   it with `[REVIEW]`. If assistance is required to complete the functionality,
-   for example when the C/C++ code of a feature is complete but Python bindings
-   are still required, then add the label `[HELP-REQ]` so that others can triage
-   and assist. The additional changes then can be implemented on top of the
-   same PR. If the assistance is done by members of the rapidsAI team, then no
+   work on top of it, then remove it from "Draft" and make it "Ready for Review".
+   https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/changing-the-stage-of-a-pull-request#marking-a-pull-request-as-ready-for-review
+
+   If assistance is required to complete the functionality, for example when the
+   C/C++ code of a feature is complete but Python bindings are still required,
+   then add the label `help wanted` so that others can triage and assist.
+   The additional changes then can be implemented on top of the same PR.
+   If the assistance is done by members of the rapidsAI team, then no
    additional actions are required by the creator of the original PR for this,
    otherwise the original author of the PR needs to give permission to the
    person(s) assisting to commit to their personal fork of the project. If that
@@ -39,10 +42,10 @@ Here are some guidelines to help the review process go smoothly.
    features or make changes out of the scope of those requested by the reviewer
    (doing this just add delays as already reviewed code ends up having to be
    re-reviewed/it is hard to tell what is new etc!). Further, please do not
-   rebase your branch on main/force push/rewrite history, doing any of these
-   causes the context of any comments made by reviewers to be lost. If
-   conflicts occur against main they should be resolved by merging main
-   into the branch used for making the pull request.
+   rebase your branch on the target branch, force push, or rewrite history.
+   Doing any of these causes the context of any comments made by reviewers to be lost.
+   If conflicts occur against the target branch, they should be resolved by
+   merging the target branch into the branch used for making the pull request.
 
 Many thanks in advance for your cooperation!
 
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 21ab8ed3274..df002654aa7 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,3 +1,7 @@
+# cuDF 0.20.0 (Date TBD)
+
+Please see https://github.com/rapidsai/cudf/releases/tag/v0.20.0a for the latest changes to this development branch.
+
 # cuDF 0.19.0 (Date TBD)
 
 Please see https://github.com/rapidsai/cudf/releases/tag/v0.19.0a for the latest changes to this development branch.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 8c332539ec7..4edd6965c4b 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -131,14 +131,14 @@ run each time you commit changes.
 
 Compiler requirements:
 
-* `gcc`     version 5.4+
-* `nvcc`    version 10.0+
-* `cmake`   version 3.14.0+
+* `gcc`     version 9.3+
+* `nvcc`    version 11.0+
+* `cmake`   version 3.18.0+
 
 CUDA/GPU requirements:
 
-* CUDA 10.0+
-* NVIDIA driver 410.48+
+* CUDA 11.0+
+* NVIDIA driver 450.80.02+
 * Pascal architecture or better
 
 You can obtain CUDA from [https://developer.nvidia.com/cuda-downloads](https://developer.nvidia.com/cuda-downloads).
@@ -320,7 +320,7 @@ flag. Below is a list of the available arguments and their purpose:
 | `PYARROW_VERSION` | 1.0.1 | Not supported | set pyarrow version |
 | `CMAKE_VERSION` | newest | >=3.14 | set cmake version |
 | `CYTHON_VERSION` | 0.29 | Not supported | set Cython version |
-| `PYTHON_VERSION` | 3.6 | 3.7 | set python version |
+| `PYTHON_VERSION` | 3.7 | 3.8 | set python version |
 
 ---
 
diff --git a/README.md b/README.md
index 687d25c200b..044f3bffa1a 100644
--- a/README.md
+++ b/README.md
@@ -57,35 +57,35 @@ Please see the [Demo Docker Repository](https://hub.docker.com/r/rapidsai/rapids
 
 ### CUDA/GPU requirements
 
-* CUDA 10.1+
-* NVIDIA driver 418.39+
+* CUDA 11.0+
+* NVIDIA driver 450.80.02+
 * Pascal architecture or better (Compute Capability >=6.0)
 
 ### Conda
 
 cuDF can be installed with conda ([miniconda](https://conda.io/miniconda.html), or the full [Anaconda distribution](https://www.anaconda.com/download)) from the `rapidsai` channel:
 
-For `cudf version == 0.18` :
+For `cudf version == 0.19` :
 ```bash
 # for CUDA 10.1
 conda install -c rapidsai -c nvidia -c numba -c conda-forge \
-    cudf=0.18 python=3.7 cudatoolkit=10.1
+    cudf=0.19 python=3.7 cudatoolkit=10.1
 
 # or, for CUDA 10.2
 conda install -c rapidsai -c nvidia -c numba -c conda-forge \
-    cudf=0.18 python=3.7 cudatoolkit=10.2
+    cudf=0.19 python=3.7 cudatoolkit=10.2
 
 ```
 
 For the nightly version of `cudf` :
 ```bash
-# for CUDA 10.1
+# for CUDA 11.0
 conda install -c rapidsai-nightly -c nvidia -c numba -c conda-forge \
-    cudf python=3.7 cudatoolkit=10.1
+    cudf python=3.7 cudatoolkit=11.0
 
-# or, for CUDA 10.2
+# or, for CUDA 11.2
 conda install -c rapidsai-nightly -c nvidia -c numba -c conda-forge \
-    cudf python=3.7 cudatoolkit=10.2
+    cudf python=3.7 cudatoolkit=11.2
 ```
 
 Note: cuDF is supported only on Linux, and with Python versions 3.7 and later.
diff --git a/ci/release/update-version.sh b/ci/release/update-version.sh
index 819a0dcf6bf..78e85501796 100755
--- a/ci/release/update-version.sh
+++ b/ci/release/update-version.sh
@@ -47,11 +47,14 @@ function sed_runner() {
 }
 
 # cpp update
-sed_runner 's/'"CUDA_DATAFRAME VERSION .* LANGUAGES"'/'"CUDA_DATAFRAME VERSION ${NEXT_FULL_TAG} LANGUAGES"'/g' cpp/CMakeLists.txt
+sed_runner 's/'"CUDF VERSION .* LANGUAGES"'/'"CUDF VERSION ${NEXT_FULL_TAG} LANGUAGES"'/g' cpp/CMakeLists.txt
 
 # cpp libcudf_kafka update
 sed_runner 's/'"CUDA_KAFKA VERSION .* LANGUAGES"'/'"CUDA_KAFKA VERSION ${NEXT_FULL_TAG} LANGUAGES"'/g' cpp/libcudf_kafka/CMakeLists.txt
 
+# cpp cudf_jni update
+sed_runner 's/'"CUDF_JNI VERSION .* LANGUAGES"'/'"CUDF_JNI VERSION ${NEXT_FULL_TAG} LANGUAGES"'/g' java/src/main/native/CMakeLists.txt
+
 # doxyfile update
 sed_runner 's/PROJECT_NUMBER         = .*/PROJECT_NUMBER         = '${NEXT_FULL_TAG}'/g' cpp/doxygen/Doxyfile
 
diff --git a/conda/environments/cudf_dev_cuda10.1.yml b/conda/environments/cudf_dev_cuda10.1.yml
index fa0b1126190..3c26dedda20 100644
--- a/conda/environments/cudf_dev_cuda10.1.yml
+++ b/conda/environments/cudf_dev_cuda10.1.yml
@@ -11,10 +11,10 @@ dependencies:
   - clang=8.0.1
   - clang-tools=8.0.1
   - cupy>7.1.0,<9.0.0a0
-  - rmm=0.19.*
+  - rmm=0.20.*
   - cmake>=3.14
   - cmake_setuptools>=0.1.3
-  - python>=3.6,<3.8
+  - python>=3.7,<3.9
   - numba>=0.49.0,!=0.51.0
   - numpy
   - pandas>=1.0,<1.3.0dev0
@@ -36,15 +36,14 @@ dependencies:
   - pandoc=<2.0.0
   - cudatoolkit=10.1
   - pip
-  - partd
   - flake8=3.8.3
   - black=19.10
   - isort=5.0.7
   - mypy=0.782
   - typing_extensions
   - pre_commit
-  - dask>=2021.3.1
-  - distributed>=2.22.0
+  - dask==2021.4.0
+  - distributed>=2.22.0,<=2021.4.0
   - streamz
   - dlpack
   - arrow-cpp=1.0.1
diff --git a/conda/environments/cudf_dev_cuda10.2.yml b/conda/environments/cudf_dev_cuda10.2.yml
index 52d82c4f4ef..cc78894a99c 100644
--- a/conda/environments/cudf_dev_cuda10.2.yml
+++ b/conda/environments/cudf_dev_cuda10.2.yml
@@ -11,10 +11,10 @@ dependencies:
   - clang=8.0.1
   - clang-tools=8.0.1
   - cupy>7.1.0,<9.0.0a0
-  - rmm=0.19.*
+  - rmm=0.20.*
   - cmake>=3.14
   - cmake_setuptools>=0.1.3
-  - python>=3.6,<3.8
+  - python>=3.7,<3.9
   - numba>=0.49,!=0.51.0
   - numpy
   - pandas>=1.0,<1.3.0dev0
@@ -36,15 +36,14 @@ dependencies:
   - pandoc=<2.0.0
   - cudatoolkit=10.2
   - pip
-  - partd
   - flake8=3.8.3
   - black=19.10
   - isort=5.0.7
   - mypy=0.782
   - typing_extensions
   - pre_commit
-  - dask>=2021.3.1
-  - distributed>=2.22.0
+  - dask==2021.4.0
+  - distributed>=2.22.0,<=2021.4.0
   - streamz
   - dlpack
   - arrow-cpp=1.0.1
diff --git a/conda/environments/cudf_dev_cuda11.0.yml b/conda/environments/cudf_dev_cuda11.0.yml
index 2e64365bdf6..10eb683657b 100644
--- a/conda/environments/cudf_dev_cuda11.0.yml
+++ b/conda/environments/cudf_dev_cuda11.0.yml
@@ -11,10 +11,10 @@ dependencies:
   - clang=8.0.1
   - clang-tools=8.0.1
   - cupy>7.1.0,<9.0.0a0
-  - rmm=0.19.*
+  - rmm=0.20.*
   - cmake>=3.14
   - cmake_setuptools>=0.1.3
-  - python>=3.6,<3.8
+  - python>=3.7,<3.9
   - numba>=0.49,!=0.51.0
   - numpy
   - pandas>=1.0,<1.3.0dev0
@@ -36,15 +36,14 @@ dependencies:
   - pandoc=<2.0.0
   - cudatoolkit=11.0
   - pip
-  - partd
   - flake8=3.8.3
   - black=19.10
   - isort=5.0.7
   - mypy=0.782
   - typing_extensions
   - pre_commit
-  - dask>=2021.3.1
-  - distributed>=2.22.0
+  - dask==2021.4.0
+  - distributed>=2.22.0,<=2021.4.0
   - streamz
   - dlpack
   - arrow-cpp=1.0.1
diff --git a/conda/environments/cudf_dev_cuda11.1.yml b/conda/environments/cudf_dev_cuda11.1.yml
new file mode 100644
index 00000000000..30062e38021
--- /dev/null
+++ b/conda/environments/cudf_dev_cuda11.1.yml
@@ -0,0 +1,67 @@
+# Copyright (c) 2021, NVIDIA CORPORATION.
+
+name: cudf_dev
+channels:
+  - rapidsai
+  - nvidia
+  - rapidsai-nightly
+  - conda-forge
+  - defaults
+dependencies:
+  - clang=8.0.1
+  - clang-tools=8.0.1
+  - cupy>7.1.0,<9.0.0a0
+  - rmm=0.20.*
+  - cmake>=3.14
+  - cmake_setuptools>=0.1.3
+  - python>=3.7,<3.9
+  - numba>=0.49,!=0.51.0
+  - numpy
+  - pandas>=1.0,<1.3.0dev0
+  - pyarrow=1.0.1
+  - fastavro>=0.22.9
+  - notebook>=0.5.0
+  - cython>=0.29,<0.30
+  - fsspec>=0.6.0
+  - pytest
+  - pytest-benchmark
+  - pytest-xdist
+  - sphinx
+  - sphinx_rtd_theme
+  - sphinxcontrib-websupport
+  - nbsphinx
+  - numpydoc
+  - ipython
+  - recommonmark
+  - pandoc=<2.0.0
+  - cudatoolkit=11.1
+  - pip
+  - flake8=3.8.3
+  - black=19.10
+  - isort=5.0.7
+  - mypy=0.782
+  - typing_extensions
+  - pre_commit
+  - dask==2021.4.0
+  - distributed>=2.22.0,<=2021.4.0
+  - streamz
+  - dlpack
+  - arrow-cpp=1.0.1
+  - arrow-cpp-proc * cuda
+  - boost-cpp>=1.72.0
+  - double-conversion
+  - rapidjson
+  - flatbuffers
+  - hypothesis
+  - sphinx-markdown-tables
+  - sphinx-copybutton
+  - mimesis
+  - packaging
+  - protobuf
+  - nvtx>=0.2.1
+  - cachetools
+  - pip:
+      - git+https://github.com/dask/dask.git@main
+      - git+https://github.com/dask/distributed.git@main
+      - git+https://github.com/python-streamz/streamz.git
+      - pyorc
diff --git a/conda/environments/cudf_dev_cuda11.2.yml b/conda/environments/cudf_dev_cuda11.2.yml
new file mode 100644
index 00000000000..63821910790
--- /dev/null
+++ b/conda/environments/cudf_dev_cuda11.2.yml
@@ -0,0 +1,67 @@
+# Copyright (c) 2021, NVIDIA CORPORATION.
+
+name: cudf_dev
+channels:
+  - rapidsai
+  - nvidia
+  - rapidsai-nightly
+  - conda-forge
+  - defaults
+dependencies:
+  - clang=8.0.1
+  - clang-tools=8.0.1
+  - cupy>7.1.0,<9.0.0a0
+  - rmm=0.20.*
+  - cmake>=3.14
+  - cmake_setuptools>=0.1.3
+  - python>=3.7,<3.9
+  - numba>=0.49,!=0.51.0
+  - numpy
+  - pandas>=1.0,<1.3.0dev0
+  - pyarrow=1.0.1
+  - fastavro>=0.22.9
+  - notebook>=0.5.0
+  - cython>=0.29,<0.30
+  - fsspec>=0.6.0
+  - pytest
+  - pytest-benchmark
+  - pytest-xdist
+  - sphinx
+  - sphinx_rtd_theme
+  - sphinxcontrib-websupport
+  - nbsphinx
+  - numpydoc
+  - ipython
+  - recommonmark
+  - pandoc=<2.0.0
+  - cudatoolkit=11.2
+  - pip
+  - flake8=3.8.3
+  - black=19.10
+  - isort=5.0.7
+  - mypy=0.782
+  - typing_extensions
+  - pre_commit
+  - dask==2021.4.0
+  - distributed>=2.22.0,<=2021.4.0
+  - streamz
+  - dlpack
+  - arrow-cpp=1.0.1
+  - arrow-cpp-proc * cuda
+  - boost-cpp>=1.72.0
+  - double-conversion
+  - rapidjson
+  - flatbuffers
+  - hypothesis
+  - sphinx-markdown-tables
+  - sphinx-copybutton
+  - mimesis
+  - packaging
+  - protobuf
+  - nvtx>=0.2.1
+  - cachetools
+  - pip:
+      - git+https://github.com/dask/dask.git@main
+      - git+https://github.com/dask/distributed.git@main
+      - git+https://github.com/python-streamz/streamz.git
+      - pyorc
diff --git a/conda/recipes/cudf/meta.yaml b/conda/recipes/cudf/meta.yaml
index a119040bbcf..5635f54ba20 100644
--- a/conda/recipes/cudf/meta.yaml
+++ b/conda/recipes/cudf/meta.yaml
@@ -28,7 +28,7 @@ requirements:
     - numba >=0.49.0
     - dlpack
     - pyarrow 1.0.1
-    - libcudf {{ version }}=*_{{ GIT_DESCRIBE_NUMBER }}
+    - libcudf {{ version }}
     - rmm {{ minor_version }}
     - cudatoolkit {{ cuda_version }}
   run:
diff --git a/conda/recipes/cudf_kafka/meta.yaml b/conda/recipes/cudf_kafka/meta.yaml
index cc3f30091bf..0acd9ec4bb2 100644
--- a/conda/recipes/cudf_kafka/meta.yaml
+++ b/conda/recipes/cudf_kafka/meta.yaml
@@ -29,12 +29,12 @@ requirements:
     - python
     - cython >=0.29,<0.30
     - setuptools
-    - cudf {{ version }}=*_{{ GIT_DESCRIBE_NUMBER }}
-    - libcudf_kafka {{ version }}=*_{{ GIT_DESCRIBE_NUMBER }}
+    - cudf {{ version }}
+    - libcudf_kafka {{ version }}
   run:
-    - libcudf_kafka {{ version }}=*_{{ GIT_DESCRIBE_NUMBER }}
+    - libcudf_kafka {{ version }}
     - python-confluent-kafka
-    - cudf {{ version }}=*_{{ GIT_DESCRIBE_NUMBER }}
+    - cudf {{ version }}
 
 test:
   requires:
diff --git a/conda/recipes/custreamz/meta.yaml b/conda/recipes/custreamz/meta.yaml
index 8edca7a51d0..f65b3cafbd7 100644
--- a/conda/recipes/custreamz/meta.yaml
+++ b/conda/recipes/custreamz/meta.yaml
@@ -23,15 +23,15 @@ requirements:
   host:
     - python
     - python-confluent-kafka
-    - cudf_kafka {{ version }}=*_{{ GIT_DESCRIBE_NUMBER }}
+    - cudf_kafka {{ version }}
   run:
     - python
-    - streamz
-    - cudf {{ version }}=*_{{ GIT_DESCRIBE_NUMBER }}
-    - dask >=2.22.0
-    - distributed >=2.22.0
+    - streamz
+    - cudf {{ version }}
+    - dask >=2.22.0,<=2021.4.0
+    - distributed >=2.22.0,<=2021.4.0
     - python-confluent-kafka
-    - cudf_kafka {{ version }}=*_{{ GIT_DESCRIBE_NUMBER }}
+    - cudf_kafka {{ version }}
 
 test:
   requires:
diff --git a/conda/recipes/dask-cudf/meta.yaml b/conda/recipes/dask-cudf/meta.yaml
index a8768e26056..8b503840b34 100644
--- a/conda/recipes/dask-cudf/meta.yaml
+++ b/conda/recipes/dask-cudf/meta.yaml
@@ -22,15 +22,15 @@ build:
 requirements:
   host:
     - python
-    - cudf {{ version }}=*_{{ GIT_DESCRIBE_NUMBER }}
-    - dask>=2021.3.1
-    - distributed >=2.22.0
+    - cudf {{ version }}
+    - dask==2021.4.0
+    - distributed >=2.22.0,<=2021.4.0
   run:
     - python
-    - cudf {{ version }}=*_{{ GIT_DESCRIBE_NUMBER }}
-    - dask>=2021.3.1
-    - distributed >=2.22.0
-
+    - cudf {{ version }}
+    - dask==2021.4.0
+    - distributed >=2.22.0,<=2021.4.0
+
 test:
   requires:
     - cudatoolkit {{ cuda_version }}.*
diff --git a/conda/recipes/libcudf/meta.yaml b/conda/recipes/libcudf/meta.yaml
index 39587b4bd05..75955428eab 100644
--- a/conda/recipes/libcudf/meta.yaml
+++ b/conda/recipes/libcudf/meta.yaml
@@ -178,12 +178,14 @@ test:
     - test -f $PREFIX/include/cudf/strings/detail/converters.hpp
     - test -f $PREFIX/include/cudf/strings/detail/copying.hpp
     - test -f $PREFIX/include/cudf/strings/detail/fill.hpp
+    - test -f $PREFIX/include/cudf/strings/detail/json.hpp
     - test -f $PREFIX/include/cudf/strings/detail/replace.hpp
     - test -f $PREFIX/include/cudf/strings/detail/utilities.hpp
     - test -f $PREFIX/include/cudf/strings/extract.hpp
     - test -f $PREFIX/include/cudf/strings/findall.hpp
     - test -f $PREFIX/include/cudf/strings/find.hpp
     - test -f $PREFIX/include/cudf/strings/find_multiple.hpp
+    - test -f $PREFIX/include/cudf/strings/json.hpp
     - test -f $PREFIX/include/cudf/strings/padding.hpp
     - test -f $PREFIX/include/cudf/strings/replace.hpp
     - test -f $PREFIX/include/cudf/strings/replace_re.hpp
diff --git a/conda/recipes/libcudf_kafka/meta.yaml b/conda/recipes/libcudf_kafka/meta.yaml
index 81ff922b8d7..5348ec471e9 100644
--- a/conda/recipes/libcudf_kafka/meta.yaml
+++ b/conda/recipes/libcudf_kafka/meta.yaml
@@ -25,7 +25,7 @@ requirements:
   build:
     - cmake >=3.17.0
   host:
-    - libcudf {{ version }}=*_{{ GIT_DESCRIBE_NUMBER }}
+    - libcudf {{ version }}
     - librdkafka >=1.5.0,<1.5.3
   run:
     - {{ pin_compatible('librdkafka', max_pin='x.x') }} #TODO: librdkafka should be automatically included here by run_exports but is not
diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt
index 5cd82e52180..453707e4559 100644
--- a/cpp/CMakeLists.txt
+++ b/cpp/CMakeLists.txt
@@ -28,7 +28,7 @@ elseif(CMAKE_CUDA_ARCHITECTURES STREQUAL "")
   set(CUDF_BUILD_FOR_DETECTED_ARCHS TRUE)
 endif()
 
-project(CUDF VERSION 0.19.0 LANGUAGES C CXX)
+project(CUDF VERSION 0.20.0 LANGUAGES C CXX)
 
 # Needed because GoogleBenchmark changes the state of FindThreads.cmake,
 # causing subsequent runs to have different values for the `Threads::Threads` target.
@@ -137,8 +137,8 @@ include(cmake/thirdparty/CUDF_GetDLPack.cmake)
 include(cmake/thirdparty/CUDF_GetLibcudacxx.cmake)
 # find or install GoogleTest
 include(cmake/thirdparty/CUDF_GetGTest.cmake)
-# Stringify libcudf and libcudacxx headers used in JIT operations
-include(cmake/Modules/StringifyJITHeaders.cmake)
+# preprocess jitify-able kernels
+include(cmake/Modules/JitifyPreprocessKernels.cmake)
 # find cuFile
 include(cmake/Modules/FindcuFile.cmake)
 
@@ -153,9 +153,6 @@ add_library(cudf
     src/ast/transform.cu
     src/binaryop/binaryop.cpp
     src/binaryop/compiled/binary_ops.cu
-    src/binaryop/jit/code/kernel.cpp
-    src/binaryop/jit/code/operation.cpp
-    src/binaryop/jit/code/traits.cpp
     src/labeling/label_bins.cu
     src/bitmask/null_mask.cu
     src/column/column.cu
@@ -256,7 +253,6 @@ add_library(cudf
     src/io/utilities/parsing_utils.cu
     src/io/utilities/type_conversion.cpp
     src/jit/cache.cpp
-    src/jit/launcher.cpp
     src/jit/parser.cpp
     src/jit/type.cpp
     src/join/cross_join.cu
@@ -302,8 +298,6 @@ add_library(cudf
     src/reshape/interleave_columns.cu
     src/reshape/tile.cu
     src/rolling/grouped_rolling.cu
-    src/rolling/jit/code/kernel.cpp
-    src/rolling/jit/code/operation.cpp
     src/rolling/rolling.cu
     src/round/round.cu
     src/scalar/scalar.cpp
@@ -346,6 +340,7 @@ add_library(cudf
     src/strings/find.cu
     src/strings/find_multiple.cu
     src/strings/padding.cu
+    src/strings/json/json_path.cu
     src/strings/regex/regcomp.cpp
     src/strings/regex/regexec.cu
     src/strings/replace/backref_re.cu
@@ -386,7 +381,6 @@ add_library(cudf
     src/text/tokenize.cu
     src/transform/bools_to_mask.cu
     src/transform/encode.cu
-    src/transform/jit/code/kernel.cpp
     src/transform/mask_to_bools.cu
     src/transform/nans_to_nulls.cu
     src/transform/row_bit_count.cu
@@ -401,10 +395,11 @@ add_library(cudf
 
 set_target_properties(cudf
     PROPERTIES BUILD_RPATH                         "\$ORIGIN"
+               INSTALL_RPATH                       "\$ORIGIN"
                # set target compile options
-               CXX_STANDARD                        14
+               CXX_STANDARD                        17
                CXX_STANDARD_REQUIRED               ON
-               CUDA_STANDARD                       14
+               CUDA_STANDARD                       17
                CUDA_STANDARD_REQUIRED              ON
                POSITION_INDEPENDENT_CODE           ON
                INTERFACE_POSITION_INDEPENDENT_CODE ON
@@ -464,7 +459,7 @@ endif()
 target_compile_definitions(cudf PUBLIC "SPDLOG_ACTIVE_LEVEL=SPDLOG_LEVEL_${RMM_LOGGING_LEVEL}")
 
 # Compile stringified JIT sources first
-add_dependencies(cudf stringify_run)
+add_dependencies(cudf jitify_preprocess_run)
 
 # Specify the target module library dependencies
 target_link_libraries(cudf
@@ -475,9 +470,15 @@ target_link_libraries(cudf
                   rmm::rmm)
 
 if(CUDA_STATIC_RUNTIME)
-    target_link_libraries(cudf PUBLIC CUDA::nvrtc CUDA::cudart_static CUDA::cuda_driver)
+    # Tell CMake what CUDA language runtime to use
+    set_target_properties(cudf PROPERTIES CUDA_RUNTIME_LIBRARY Static)
+    # Make sure to export to consumers what runtime we used
+    target_link_libraries(cudf PUBLIC CUDA::cudart_static CUDA::cuda_driver)
 else()
-    target_link_libraries(cudf PUBLIC CUDA::nvrtc CUDA::cudart CUDA::cuda_driver)
+    # Tell CMake what CUDA language runtime to use
+    set_target_properties(cudf PROPERTIES CUDA_RUNTIME_LIBRARY Shared)
+    # Make sure to export to consumers what runtime we used
+    target_link_libraries(cudf PUBLIC CUDA::cudart CUDA::cuda_driver)
 endif()
 
 # Add cuFile interface if available
@@ -516,7 +517,7 @@ target_compile_options(cudftestutil
 )
 
 target_compile_features(cudftestutil
-    PUBLIC cxx_std_14 $<BUILD_INTERFACE:cuda_std_14>)
+    PUBLIC cxx_std_17 $<BUILD_INTERFACE:cuda_std_17>)
 
 target_link_libraries(cudftestutil
                PUBLIC GTest::gmock
@@ -582,7 +583,14 @@ install(DIRECTORY
         DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}/libcudf)
 
 install(DIRECTORY ${Thrust_SOURCE_DIR}/
-  DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}/libcudf/Thrust)
+  DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}/libcudf/Thrust
+  PATTERN "*.py" EXCLUDE
+  PATTERN "benchmark" EXCLUDE
+  PATTERN "build" EXCLUDE
+  PATTERN "doc" EXCLUDE
+  PATTERN "examples" EXCLUDE
+  PATTERN "test" EXCLUDE
+  PATTERN "testing" EXCLUDE)
 
 include(CMakePackageConfigHelpers)
 
diff --git a/cpp/benchmarks/CMakeLists.txt b/cpp/benchmarks/CMakeLists.txt
index 5aa7e0132f8..78cb35865e9 100644
--- a/cpp/benchmarks/CMakeLists.txt
+++ b/cpp/benchmarks/CMakeLists.txt
@@ -17,7 +17,7 @@
 find_package(Threads REQUIRED)
 
 add_library(cudf_datagen STATIC common/generate_benchmark_input.cpp)
-target_compile_features(cudf_datagen PUBLIC cxx_std_14 cuda_std_14)
+target_compile_features(cudf_datagen PUBLIC cxx_std_17 cuda_std_17)
 
 target_compile_options(cudf_datagen
             PUBLIC "$<$<COMPILE_LANGUAGE:CXX>:${CUDF_CXX_FLAGS}>"
@@ -202,3 +202,8 @@ ConfigureBench(STRINGS_BENCH
   string/substring_benchmark.cpp
   string/translate_benchmark.cpp
   string/url_decode_benchmark.cpp)
+
+###################################################################################################
+# - json benchmark -------------------------------------------------------------------
+ConfigureBench(JSON_BENCH
+  string/json_benchmark.cpp)
diff --git a/cpp/benchmarks/column/concatenate_benchmark.cpp b/cpp/benchmarks/column/concatenate_benchmark.cpp
index b04cfba7d07..3634b2f08a2 100644
--- a/cpp/benchmarks/column/concatenate_benchmark.cpp
+++ b/cpp/benchmarks/column/concatenate_benchmark.cpp
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2020, NVIDIA CORPORATION.
+ * Copyright (c) 2020-2021, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -62,7 +62,7 @@ static void BM_concatenate(benchmark::State& state)
   CHECK_CUDA(0);
 
   for (auto _ : state) {
-    cuda_event_timer raii(state, true, 0);
+    cuda_event_timer raii(state, true, rmm::cuda_stream_default);
     auto result = cudf::concatenate(column_views);
   }
 
@@ -124,7 +124,7 @@ static void BM_concatenate_tables(benchmark::State& state)
   CHECK_CUDA(0);
 
   for (auto _ : state) {
-    cuda_event_timer raii(state, true, 0);
+    cuda_event_timer raii(state, true, rmm::cuda_stream_default);
     auto result = cudf::concatenate(table_views);
   }
 
@@ -184,7 +184,7 @@ static void BM_concatenate_strings(benchmark::State& state)
   CHECK_CUDA(0);
 
   for (auto _ : state) {
-    cuda_event_timer raii(state, true, 0);
+    cuda_event_timer raii(state, true, rmm::cuda_stream_default);
     auto result = cudf::concatenate(column_views);
   }
 
diff --git a/cpp/benchmarks/join/join_benchmark.cu b/cpp/benchmarks/join/join_benchmark.cu
index fa6afdd908c..d1c11696ddd 100644
--- a/cpp/benchmarks/join/join_benchmark.cu
+++ b/cpp/benchmarks/join/join_benchmark.cu
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2019, NVIDIA CORPORATION.
+ * Copyright (c) 2019-2021, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -103,7 +103,7 @@ static void BM_join(benchmark::State &state)
   // Benchmark the inner join operation
 
   for (auto _ : state) {
-    cuda_event_timer raii(state, true, 0);
+    cuda_event_timer raii(state, true, rmm::cuda_stream_default);
 
     auto result = cudf::inner_join(
       probe_table, build_table, columns_to_join, columns_to_join, cudf::null_equality::UNEQUAL);
diff --git a/cpp/benchmarks/sort/sort_benchmark.cpp b/cpp/benchmarks/sort/sort_benchmark.cpp
index fb74469e7c0..fe68ddd0051 100644
--- a/cpp/benchmarks/sort/sort_benchmark.cpp
+++ b/cpp/benchmarks/sort/sort_benchmark.cpp
@@ -61,7 +61,7 @@ static void BM_sort(benchmark::State& state, bool nulls)
   auto input = cudf::table_view(column_views);
 
   for (auto _ : state) {
-    cuda_event_timer raii(state, true, 0);
+    cuda_event_timer raii(state, true, rmm::cuda_stream_default);
 
     auto result = (stable) ? cudf::stable_sorted_order(input) : cudf::sorted_order(input);
   }
diff --git a/cpp/benchmarks/sort/sort_strings_benchmark.cpp b/cpp/benchmarks/sort/sort_strings_benchmark.cpp
index 54e85b7ea8c..f5effcafcfb 100644
--- a/cpp/benchmarks/sort/sort_strings_benchmark.cpp
+++ b/cpp/benchmarks/sort/sort_strings_benchmark.cpp
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2020, NVIDIA CORPORATION.
+ * Copyright (c) 2020-2021, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -32,7 +32,7 @@ static void BM_sort(benchmark::State& state)
   auto const table = create_random_table({cudf::type_id::STRING}, 1, row_count{n_rows});
 
   for (auto _ : state) {
-    cuda_event_timer raii(state, true, 0);
+    cuda_event_timer raii(state, true, rmm::cuda_stream_default);
     cudf::sort(table->view());
   }
 }
diff --git a/cpp/benchmarks/string/case_benchmark.cpp b/cpp/benchmarks/string/case_benchmark.cpp
index 9c1c81da22a..508ae49e093 100644
--- a/cpp/benchmarks/string/case_benchmark.cpp
+++ b/cpp/benchmarks/string/case_benchmark.cpp
@@ -32,7 +32,7 @@ static void BM_case(benchmark::State& state)
   cudf::strings_column_view input(table->view().column(0));
 
   for (auto _ : state) {
-    cuda_event_timer raii(state, true, 0);
+    cuda_event_timer raii(state, true, rmm::cuda_stream_default);
     cudf::strings::to_lower(input);
   }
 
diff --git a/cpp/benchmarks/string/combine_benchmark.cpp b/cpp/benchmarks/string/combine_benchmark.cpp
index 2a5013a9ae7..7dabd32e874 100644
--- a/cpp/benchmarks/string/combine_benchmark.cpp
+++ b/cpp/benchmarks/string/combine_benchmark.cpp
@@ -43,7 +43,7 @@ static void BM_combine(benchmark::State& state)
   cudf::string_scalar separator("+");
 
   for (auto _ : state) {
-    cuda_event_timer raii(state, true, 0);
+    cuda_event_timer raii(state, true, rmm::cuda_stream_default);
     cudf::strings::concatenate(table->view(), separator);
   }
 
diff --git a/cpp/benchmarks/string/contains_benchmark.cpp b/cpp/benchmarks/string/contains_benchmark.cpp
index 1a2ac8ad602..79bdda77634 100644
--- a/cpp/benchmarks/string/contains_benchmark.cpp
+++ b/cpp/benchmarks/string/contains_benchmark.cpp
@@ -35,7 +35,7 @@ static void BM_contains(benchmark::State& state, contains_type ct)
   cudf::strings_column_view input(table->view().column(0));
 
   for (auto _ : state) {
-    cuda_event_timer raii(state, true, 0);
+    cuda_event_timer raii(state, true, rmm::cuda_stream_default);
     // contains_re(), matches_re(), and count_re() all have similar functions
     // with count_re() being the most regex intensive
     switch (ct) {
diff --git a/cpp/benchmarks/string/copy_benchmark.cpp b/cpp/benchmarks/string/copy_benchmark.cpp
index af9f5b4fa4a..b49bc878ca7 100644
--- a/cpp/benchmarks/string/copy_benchmark.cpp
+++ b/cpp/benchmarks/string/copy_benchmark.cpp
@@ -54,7 +54,7 @@ static void BM_copy(benchmark::State& state, copy_type ct)
                                                                     host_map_data.end());
 
   for (auto _ : state) {
-    cuda_event_timer raii(state, true, 0);
+    cuda_event_timer raii(state, true, rmm::cuda_stream_default);
     switch (ct) {
       case gather: cudf::gather(source->view(), index_map); break;
       case scatter: cudf::scatter(source->view(), index_map, target->view()); break;
diff --git a/cpp/benchmarks/string/extract_benchmark.cpp b/cpp/benchmarks/string/extract_benchmark.cpp
index dbae18dde3b..aa1e59a22bf 100644
--- a/cpp/benchmarks/string/extract_benchmark.cpp
+++ b/cpp/benchmarks/string/extract_benchmark.cpp
@@ -14,6 +14,8 @@
  * limitations under the License.
  */
 
+#include "string_bench_args.hpp"
+
 #include <benchmark/benchmark.h>
 #include <benchmarks/common/generate_benchmark_input.hpp>
 #include <benchmarks/fixture/benchmark_fixture.hpp>
@@ -23,43 +25,55 @@
 #include <cudf/strings/strings_column_view.hpp>
 #include <cudf_test/column_wrapper.hpp>
 
-#include "string_bench_args.hpp"
+#include <random>
 
 class StringExtract : public cudf::benchmark {
 };
 
-static void BM_extract(benchmark::State& state, int re_instructions)
+static void BM_extract(benchmark::State& state, int groups)
 {
-  cudf::size_type const n_rows{static_cast<cudf::size_type>(state.range(0))};
-  cudf::size_type const max_str_length{static_cast<cudf::size_type>(state.range(1))};
-  data_profile table_profile;
-  table_profile.set_distribution_params(
-    cudf::type_id::STRING, distribution_id::NORMAL, 0, max_str_length);
-  auto const table =
-    create_random_table({cudf::type_id::STRING}, 1, row_count{n_rows}, table_profile);
-  cudf::strings_column_view input(table->view().column(0));
-  std::string const raw_pattern =
-    "1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234"
-    "5678901234567890123456789012345678901234567890";
-  std::string const pattern = "(" + raw_pattern.substr(0, re_instructions) + ")";
+  auto const n_rows   = static_cast<cudf::size_type>(state.range(0));
+  auto const n_length = static_cast<cudf::size_type>(state.range(1));
+
+  std::default_random_engine generator;
+  std::uniform_int_distribution<int> words_dist(0, 999);
+
+  std::vector<std::string> samples(100);  // 100 unique rows of data to reuse
+  std::generate(samples.begin(), samples.end(), [&]() {
+    std::string row;  // build a row of random tokens
+    while (static_cast<int>(row.size()) < n_length) {
+      row += std::to_string(words_dist(generator)) + " ";
+    }
+    return row;
+  });
+
+  std::string pattern;
+  while (static_cast<int>(pattern.size()) < groups) { pattern += "(\\d+) "; }
+
+  std::uniform_int_distribution<int> distribution(0, samples.size() - 1);
+  auto elements = cudf::detail::make_counting_transform_iterator(
+    0, [&](auto idx) { return samples.at(distribution(generator)); });
+  cudf::test::strings_column_wrapper input(elements, elements + n_rows);
+  cudf::strings_column_view view(input);
 
   for (auto _ : state) {
-    cuda_event_timer raii(state, true, 0);
-    auto results = cudf::strings::extract(input, pattern);
+    cuda_event_timer raii(state, true);
+    auto results = cudf::strings::extract(view, pattern);
   }
 
-  state.SetBytesProcessed(state.iterations() * input.chars_size());
+  state.SetBytesProcessed(state.iterations() * view.chars_size());
 }
 
 static void generate_bench_args(benchmark::internal::Benchmark* b)
 {
-  int const min_rows   = 1 << 12;
-  int const max_rows   = 1 << 24;
-  int const row_mult   = 8;
-  int const min_rowlen = 1 << 5;
-  int const max_rowlen = 1 << 13;
-  int const len_mult   = 4;
-  generate_string_bench_args(b, min_rows, max_rows, row_mult, min_rowlen, max_rowlen, len_mult);
+  int const min_rows          = 1 << 12;
+  int const max_rows          = 1 << 24;
+  int const row_multiplier    = 8;
+  int const min_row_length    = 1 << 5;
+  int const max_row_length    = 1 << 13;
+  int const length_multiplier = 4;
+  generate_string_bench_args(
+    b, min_rows, max_rows, row_multiplier, min_row_length, max_row_length, length_multiplier);
 }
 
 #define STRINGS_BENCHMARK_DEFINE(name, instructions)          \
@@ -70,6 +84,6 @@ static void generate_bench_args(benchmark::internal::Benchmark* b)
     ->UseManualTime()                                         \
     ->Unit(benchmark::kMillisecond);
 
-STRINGS_BENCHMARK_DEFINE(small, 4)
-STRINGS_BENCHMARK_DEFINE(medium, 48)
-STRINGS_BENCHMARK_DEFINE(large, 128)
+STRINGS_BENCHMARK_DEFINE(small, 2)
+STRINGS_BENCHMARK_DEFINE(medium, 10)
+STRINGS_BENCHMARK_DEFINE(large, 30)
diff --git a/cpp/benchmarks/string/factory_benchmark.cu b/cpp/benchmarks/string/factory_benchmark.cu
index 6c5dceffaa8..802ca949976 100644
--- a/cpp/benchmarks/string/factory_benchmark.cu
+++ b/cpp/benchmarks/string/factory_benchmark.cu
@@ -63,7 +63,7 @@ static void BM_factory(benchmark::State& state)
                     string_view_to_pair{});
 
   for (auto _ : state) {
-    cuda_event_timer raii(state, true, 0);
+    cuda_event_timer raii(state, true, rmm::cuda_stream_default);
     cudf::make_strings_column(pairs);
   }
 
diff --git a/cpp/benchmarks/string/filter_benchmark.cpp b/cpp/benchmarks/string/filter_benchmark.cpp
index 123c5597df9..d510ca9baed 100644
--- a/cpp/benchmarks/string/filter_benchmark.cpp
+++ b/cpp/benchmarks/string/filter_benchmark.cpp
@@ -50,7 +50,7 @@ static void BM_filter_chars(benchmark::State& state, FilterAPI api)
     {cudf::char_utf8{'a'}, cudf::char_utf8{'c'}}};
 
   for (auto _ : state) {
-    cuda_event_timer raii(state, true, 0);
+    cuda_event_timer raii(state, true, rmm::cuda_stream_default);
     switch (api) {
       case filter: cudf::strings::filter_characters_of_type(input, types); break;
       case filter_chars: cudf::strings::filter_characters(input, filter_table); break;
diff --git a/cpp/benchmarks/string/find_benchmark.cpp b/cpp/benchmarks/string/find_benchmark.cpp
index 200527d606e..fd7c515eb0b 100644
--- a/cpp/benchmarks/string/find_benchmark.cpp
+++ b/cpp/benchmarks/string/find_benchmark.cpp
@@ -46,7 +46,7 @@ static void BM_find_scalar(benchmark::State& state, FindAPI find_api)
   cudf::test::strings_column_wrapper targets({"+", "-"});
 
   for (auto _ : state) {
-    cuda_event_timer raii(state, true, 0);
+    cuda_event_timer raii(state, true, rmm::cuda_stream_default);
     switch (find_api) {
       case find: cudf::strings::find(input, target); break;
       case find_multi:
diff --git a/cpp/benchmarks/string/json_benchmark.cpp b/cpp/benchmarks/string/json_benchmark.cpp
new file mode 100644
index 00000000000..6fb6a07a8d0
--- /dev/null
+++ b/cpp/benchmarks/string/json_benchmark.cpp
@@ -0,0 +1,140 @@
+/*
+ * Copyright (c) 2021, NVIDIA CORPORATION.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <benchmark/benchmark.h>
+#include <benchmarks/common/generate_benchmark_input.hpp>
+#include <benchmarks/fixture/benchmark_fixture.hpp>
+#include <benchmarks/synchronization/synchronization.hpp>
+
+#include <cudf_test/base_fixture.hpp>
+#include <cudf_test/column_wrapper.hpp>
+
+#include <cudf/strings/json.hpp>
+#include <cudf/strings/strings_column_view.hpp>
+
+class JsonPath : public cudf::benchmark {
+};
+
+float frand() { return static_cast<float>(rand()) / static_cast<float>(RAND_MAX); }
+
+int rand_range(int min, int max) { return min + static_cast<int>(frand() * (max - min)); }
+
+std::vector<std::string> Books{
+  "{\n\"category\": \"reference\",\n\"author\": \"Nigel Rees\",\n\"title\": \"Sayings of the "
+  "Century\",\n\"price\": 8.95\n}",
+  "{\n\"category\": \"fiction\",\n\"author\": \"Evelyn Waugh\",\n\"title\": \"Sword of "
+  "Honour\",\n\"price\": 12.99\n}",
+  "{\n\"category\": \"fiction\",\n\"author\": \"Herman Melville\",\n\"title\": \"Moby "
+  "Dick\",\n\"isbn\": \"0-553-21311-3\",\n\"price\": 8.99\n}",
+  "{\n\"category\": \"fiction\",\n\"author\": \"J. R. R. Tolkien\",\n\"title\": \"The Lord of the "
+  "Rings\",\n\"isbn\": \"0-395-19395-8\",\n\"price\": 22.99\n}"};
+constexpr int Approx_book_size = 110;
+std::vector<std::string> Bicycles{
+  "{\"color\": \"red\", \"price\": 9.95}",
+  "{\"color\": \"green\", \"price\": 29.95}",
+  "{\"color\": \"blue\", \"price\": 399.95}",
+  "{\"color\": \"yellow\", \"price\": 99.95}",
+  "{\"color\": \"mauve\", \"price\": 199.95}",
+};
+constexpr int Approx_bicycle_size = 33;
+std::string Misc{"\n\"expensive\": 10\n"};
+std::string generate_field(std::vector<std::string> const& values, int num_values)
+{
+  std::string res;
+  for (int idx = 0; idx < num_values; idx++) {
+    if (idx > 0) { res += std::string(",\n"); }
+    int vindex = std::min(static_cast<int>(floor(frand() * values.size())),
+                          static_cast<int>(values.size() - 1));
+    res += values[vindex];
+  }
+  return res;
+}
+
+std::string build_row(int desired_bytes)
+{
+  // always have at least 2 books and 2 bikes
+  int num_books    = 2;
+  int num_bicycles = 2;
+  int remaining_bytes =
+    desired_bytes - ((num_books * Approx_book_size) + (num_bicycles * Approx_bicycle_size));
+
+  // divide up the remainder between books and bikes
+  float book_pct    = frand();
+  float bicycle_pct = 1.0f - book_pct;
+  num_books += (remaining_bytes * book_pct) / Approx_book_size;
+  num_bicycles += (remaining_bytes * bicycle_pct) / Approx_bicycle_size;
+
+  std::string books    = "\"book\": [\n" + generate_field(Books, num_books) + "]\n";
+  std::string bicycles = "\"bicycle\": [\n" + generate_field(Bicycles, num_bicycles) + "]\n";
+
+  std::string store = "\"store\": {\n";
+  if (frand() <= 0.5f) {
+    store += books + std::string(",\n") + bicycles;
+  } else {
+    store += bicycles + std::string(",\n") + books;
+  }
+  store += std::string("}\n");
+
+  std::string row = std::string("{\n");
+  if (frand() <= 0.5f) {
+    row += store + std::string(",\n") + Misc;
+  } else {
+    row += Misc + std::string(",\n") + store;
+  }
+  row += std::string("}\n");
+  return row;
+}
+
+template <class... QueryArg>
+static void BM_case(benchmark::State& state, QueryArg&&... query_arg)
+{
+  srand(5236);
+  auto iter = thrust::make_transform_iterator(
+    thrust::make_counting_iterator(0),
+    [desired_bytes = state.range(1)](int index) { return build_row(desired_bytes); });
+  int num_rows = state.range(0);
+  cudf::test::strings_column_wrapper input(iter, iter + num_rows);
+  cudf::strings_column_view scv(input);
+  size_t num_chars = scv.chars().size();
+
+  std::string json_path(query_arg...);
+
+  for (auto _ : state) {
+    cuda_event_timer raii(state, true, 0);
+    auto result = cudf::strings::get_json_object(scv, json_path);
+    cudaStreamSynchronize(0);
+  }
+
+  // This is not strictly accurate: a given query will not necessarily visit
+  // every incoming character, but it is a reasonable approximation.
+  state.SetBytesProcessed(state.iterations() * num_chars);
+}
+
+#define JSON_BENCHMARK_DEFINE(name, query)                         \
+  BENCHMARK_CAPTURE(BM_case, name, query)                          \
+    ->ArgsProduct({{100, 1000, 100000, 400000}, {300, 600, 4096}}) \
+    ->UseManualTime()                                              \
+    ->Unit(benchmark::kMillisecond);
+
+JSON_BENCHMARK_DEFINE(query0, "$");
+JSON_BENCHMARK_DEFINE(query1, "$.store");
+JSON_BENCHMARK_DEFINE(query2, "$.store.book");
+JSON_BENCHMARK_DEFINE(query3, "$.store.*");
+JSON_BENCHMARK_DEFINE(query4, "$.store.book[*]");
+JSON_BENCHMARK_DEFINE(query5, "$.store.book[*].category");
+JSON_BENCHMARK_DEFINE(query6, "$.store['bicycle']");
+JSON_BENCHMARK_DEFINE(query7, "$.store.book[*]['isbn']");
+JSON_BENCHMARK_DEFINE(query8, "$.store.bicycle[1]");
diff --git a/cpp/benchmarks/string/replace_benchmark.cpp b/cpp/benchmarks/string/replace_benchmark.cpp
index 968b8f5abb0..0d785fd25aa 100644
--- a/cpp/benchmarks/string/replace_benchmark.cpp
+++ b/cpp/benchmarks/string/replace_benchmark.cpp
@@ -49,7 +49,7 @@ static void BM_replace(benchmark::State& state, replace_type rt)
   cudf::test::strings_column_wrapper repls({"", ""});
 
   for (auto _ : state) {
-    cuda_event_timer raii(state, true, 0);
+    cuda_event_timer raii(state, true, rmm::cuda_stream_default);
     switch (rt) {
       case scalar: cudf::strings::replace(input, target, repl); break;
       case slice: cudf::strings::replace_slice(input, repl, 1, 10); break;
diff --git a/cpp/benchmarks/string/replace_re_benchmark.cpp b/cpp/benchmarks/string/replace_re_benchmark.cpp
index 616e2c0f22c..18ec28371e3 100644
--- a/cpp/benchmarks/string/replace_re_benchmark.cpp
+++ b/cpp/benchmarks/string/replace_re_benchmark.cpp
@@ -43,7 +43,7 @@ static void BM_replace(benchmark::State& state, replace_type rt)
   cudf::test::strings_column_wrapper repls({"#", ""});
 
   for (auto _ : state) {
-    cuda_event_timer raii(state, true, 0);
+    cuda_event_timer raii(state, true, rmm::cuda_stream_default);
     switch (rt) {
       case replace_type::replace_re:  // contains_re and matches_re use the same main logic
         cudf::strings::replace_re(input, "\\d+");
diff --git a/cpp/benchmarks/string/split_benchmark.cpp b/cpp/benchmarks/string/split_benchmark.cpp
index 35bedb1b767..0494fba7642 100644
--- a/cpp/benchmarks/string/split_benchmark.cpp
+++ b/cpp/benchmarks/string/split_benchmark.cpp
@@ -44,7 +44,7 @@ static void BM_split(benchmark::State& state, split_type rt)
   cudf::string_scalar target("+");
 
   for (auto _ : state) {
-    cuda_event_timer raii(state, true, 0);
+    cuda_event_timer raii(state, true, rmm::cuda_stream_default);
     switch (rt) {
       case split: cudf::strings::split(input, target); break;
       case split_ws: cudf::strings::split(input); break;
diff --git a/cpp/benchmarks/string/substring_benchmark.cpp b/cpp/benchmarks/string/substring_benchmark.cpp
index d47c42e45be..e8a66f7b323 100644
--- a/cpp/benchmarks/string/substring_benchmark.cpp
+++ b/cpp/benchmarks/string/substring_benchmark.cpp
@@ -54,7 +54,7 @@ static void BM_substring(benchmark::State& state, substring_type rt)
   cudf::test::strings_column_wrapper delimiters(delim_itr, delim_itr + n_rows);
 
   for (auto _ : state) {
-    cuda_event_timer raii(state, true, 0);
+    cuda_event_timer raii(state, true, rmm::cuda_stream_default);
     switch (rt) {
       case position: cudf::strings::slice_strings(input, 1, max_str_length / 2); break;
       case multi_position: cudf::strings::slice_strings(input, starts, stops); break;
diff --git a/cpp/benchmarks/string/translate_benchmark.cpp b/cpp/benchmarks/string/translate_benchmark.cpp
index c49a986d744..49396b0ce71 100644
--- a/cpp/benchmarks/string/translate_benchmark.cpp
+++ b/cpp/benchmarks/string/translate_benchmark.cpp
@@ -54,7 +54,7 @@ static void BM_translate(benchmark::State& state, int entry_count)
                  });
 
   for (auto _ : state) {
-    cuda_event_timer raii(state, true, 0);
+    cuda_event_timer raii(state, true, rmm::cuda_stream_default);
     cudf::strings::translate(input, entries);
   }
 
diff --git a/cpp/benchmarks/string/url_decode_benchmark.cpp b/cpp/benchmarks/string/url_decode_benchmark.cpp
index 26c23ea23b4..fbb99bf3e8f 100644
--- a/cpp/benchmarks/string/url_decode_benchmark.cpp
+++ b/cpp/benchmarks/string/url_decode_benchmark.cpp
@@ -80,7 +80,7 @@ void BM_url_decode(benchmark::State& state)
   auto strings_view = cudf::strings_column_view(column);
 
   for (auto _ : state) {
-    cuda_event_timer raii(state, true, 0);
+    cuda_event_timer raii(state, true, rmm::cuda_stream_default);
     auto result = cudf::strings::url_decode(strings_view);
   }
 
diff --git a/cpp/benchmarks/text/normalize_benchmark.cpp b/cpp/benchmarks/text/normalize_benchmark.cpp
index 32c4fb7dcde..bb872fee0b3 100644
--- a/cpp/benchmarks/text/normalize_benchmark.cpp
+++ b/cpp/benchmarks/text/normalize_benchmark.cpp
@@ -41,7 +41,7 @@ static void BM_normalize(benchmark::State& state, bool to_lower)
   cudf::strings_column_view input(table->view().column(0));
 
   for (auto _ : state) {
-    cuda_event_timer raii(state, true, 0);
+    cuda_event_timer raii(state, true, rmm::cuda_stream_default);
     nvtext::normalize_characters(input, to_lower);
   }
 
diff --git a/cpp/benchmarks/text/normalize_spaces_benchmark.cpp b/cpp/benchmarks/text/normalize_spaces_benchmark.cpp
index dcabb0c225c..6260bb02c55 100644
--- a/cpp/benchmarks/text/normalize_spaces_benchmark.cpp
+++ b/cpp/benchmarks/text/normalize_spaces_benchmark.cpp
@@ -42,7 +42,7 @@ static void BM_normalize(benchmark::State& state)
   cudf::strings_column_view input(table->view().column(0));
 
   for (auto _ : state) {
-    cuda_event_timer raii(state, true, 0);
+    cuda_event_timer raii(state, true, rmm::cuda_stream_default);
     nvtext::normalize_spaces(input);
   }
 
diff --git a/cpp/benchmarks/text/tokenize_benchmark.cpp b/cpp/benchmarks/text/tokenize_benchmark.cpp
index f9e742f0f31..7bb84e11a4a 100644
--- a/cpp/benchmarks/text/tokenize_benchmark.cpp
+++ b/cpp/benchmarks/text/tokenize_benchmark.cpp
@@ -46,7 +46,7 @@ static void BM_tokenize(benchmark::State& state, tokenize_type tt)
   cudf::test::strings_column_wrapper delimiters({" ", "+", "-"});
 
   for (auto _ : state) {
-    cuda_event_timer raii(state, true, 0);
+    cuda_event_timer raii(state, true, rmm::cuda_stream_default);
     switch (tt) {
       case tokenize_type::single: nvtext::tokenize(input); break;
       case tokenize_type::multi:
diff --git a/cpp/cmake/Modules/FindcuFile.cmake b/cpp/cmake/Modules/FindcuFile.cmake
index 4f67e186f42..880ad773369 100644
--- a/cpp/cmake/Modules/FindcuFile.cmake
+++ b/cpp/cmake/Modules/FindcuFile.cmake
@@ -62,6 +62,7 @@ find_path(cuFile_INCLUDE_DIR
     cufile.h
   HINTS
     ${PKG_cuFile_INCLUDE_DIRS}
+    /usr/local/cuda/include
     /usr/local/cuda/lib64
 )
 
diff --git a/cpp/cmake/Modules/JitifyPreprocessKernels.cmake b/cpp/cmake/Modules/JitifyPreprocessKernels.cmake
new file mode 100644
index 00000000000..eb1ade61440
--- /dev/null
+++ b/cpp/cmake/Modules/JitifyPreprocessKernels.cmake
@@ -0,0 +1,66 @@
+#=============================================================================
+# Copyright (c) 2021, NVIDIA CORPORATION.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#=============================================================================
+
+# Create `jitify_preprocess` executable
+add_executable(jitify_preprocess "${JITIFY_INCLUDE_DIR}/jitify2_preprocess.cpp")
+
+target_link_libraries(jitify_preprocess CUDA::cudart ${CMAKE_DL_LIBS})
+
+function(jit_preprocess_files)
+    cmake_parse_arguments(ARG
+                          ""
+                          "SOURCE_DIRECTORY"
+                          "FILES"
+                          ${ARGN}
+                          )
+
+    foreach(ARG_FILE ${ARG_FILES})
+        set(ARG_OUTPUT ${CUDF_GENERATED_INCLUDE_DIR}/include/jit_preprocessed_files/${ARG_FILE}.jit.hpp)
+        get_filename_component(jit_output_directory "${ARG_OUTPUT}" DIRECTORY )
+        list(APPEND JIT_PREPROCESSED_FILES "${ARG_OUTPUT}")
+        add_custom_command(WORKING_DIRECTORY ${ARG_SOURCE_DIRECTORY}
+                           DEPENDS jitify_preprocess "${ARG_SOURCE_DIRECTORY}/${ARG_FILE}"
+                           OUTPUT ${ARG_OUTPUT}
+                           VERBATIM
+                           COMMAND ${CMAKE_COMMAND} -E make_directory "${jit_output_directory}"
+                           COMMAND jitify_preprocess ${ARG_FILE}
+                                    -o ${CUDF_GENERATED_INCLUDE_DIR}/include/jit_preprocessed_files
+                                    -i
+                                    -m
+                                    -std=c++17
+                                    -remove-unused-globals
+                                    -D__CUDACC_RTC__
+                                    -I${CUDF_SOURCE_DIR}/include
+                                    -I${CUDF_SOURCE_DIR}/src
+                                    -I${LIBCUDACXX_INCLUDE_DIR}
+                                    -I${CUDAToolkit_INCLUDE_DIRS}
+                                    --no-preinclude-workarounds
+                                    --no-replace-pragma-once
+                           )
+    endforeach()
+    set(JIT_PREPROCESSED_FILES "${JIT_PREPROCESSED_FILES}" PARENT_SCOPE)
+endfunction()
+
+jit_preprocess_files(SOURCE_DIRECTORY      ${CUDF_SOURCE_DIR}/src
+                     FILES                 binaryop/jit/kernel.cu
+                                           transform/jit/kernel.cu
+                                           rolling/jit/kernel.cu
+                     )
+
+add_custom_target(jitify_preprocess_run DEPENDS ${JIT_PREPROCESSED_FILES})
+
+file(COPY "${LIBCUDACXX_INCLUDE_DIR}/" DESTINATION "${CUDF_GENERATED_INCLUDE_DIR}/include/libcudacxx")
+file(COPY "${LIBCXX_INCLUDE_DIR}"      DESTINATION "${CUDF_GENERATED_INCLUDE_DIR}/include/libcxx")
diff --git a/cpp/cmake/Modules/SetGPUArchs.cmake b/cpp/cmake/Modules/SetGPUArchs.cmake
index f09d5ead8e2..8ab3c14d671 100644
--- a/cpp/cmake/Modules/SetGPUArchs.cmake
+++ b/cpp/cmake/Modules/SetGPUArchs.cmake
@@ -38,16 +38,6 @@ if(NOT DEFINED CUDAToolkit_VERSION AND CMAKE_CUDA_COMPILER)
   unset(NVCC_OUT)
 endif()
 
-if(CUDAToolkit_VERSION_MAJOR LESS 11)
-  list(REMOVE_ITEM SUPPORTED_CUDA_ARCHITECTURES "80")
-endif()
-if(CUDAToolkit_VERSION_MAJOR LESS 10)
-  list(REMOVE_ITEM SUPPORTED_CUDA_ARCHITECTURES "75")
-endif()
-if(CUDAToolkit_VERSION_MAJOR LESS 9)
-  list(REMOVE_ITEM SUPPORTED_CUDA_ARCHITECTURES "70")
-endif()
-
 if(${PROJECT_NAME}_BUILD_FOR_ALL_ARCHS)
   set(CMAKE_CUDA_ARCHITECTURES ${SUPPORTED_CUDA_ARCHITECTURES})
 
diff --git a/cpp/cmake/Modules/StringifyJITHeaders.cmake b/cpp/cmake/Modules/StringifyJITHeaders.cmake
deleted file mode 100644
index 0bfb37773dc..00000000000
--- a/cpp/cmake/Modules/StringifyJITHeaders.cmake
+++ /dev/null
@@ -1,168 +0,0 @@
-#=============================================================================
-# Copyright (c) 2018-2021, NVIDIA CORPORATION.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#=============================================================================
-
-file(MAKE_DIRECTORY "${CUDF_GENERATED_INCLUDE_DIR}/include")
-
-# Create `stringify` executable
-add_executable(stringify "${JITIFY_INCLUDE_DIR}/stringify.cpp")
-
-execute_process(WORKING_DIRECTORY ${CUDF_GENERATED_INCLUDE_DIR}
-    COMMAND ${CMAKE_COMMAND} -E make_directory
-        ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include
-    )
-
-# Use `stringify` to convert types.h to c-str for use in JIT code
-add_custom_command(WORKING_DIRECTORY ${CUDF_SOURCE_DIR}/include
-                   COMMENT "Stringify headers for use in JIT compiled code"
-                   DEPENDS stringify
-                   OUTPUT ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/types.h.jit
-                          ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/types.hpp.jit
-                          ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/bit.hpp.jit
-                          ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/timestamps.hpp.jit
-                          ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/fixed_point.hpp.jit
-                          ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/durations.hpp.jit
-                          ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/chrono.jit
-                          ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/climits.jit
-                          ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/cstddef.jit
-                          ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/cstdint.jit
-                          ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/ctime.jit
-                          ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/limits.jit
-                          ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/ratio.jit
-                          ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/type_traits.jit
-                          ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/version.jit
-                          ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/__config.jit
-                          ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/__pragma_pop.jit
-                          ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/__pragma_push.jit
-                          ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/__config.jit
-                          ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/__pragma_pop.jit
-                          ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/__pragma_push.jit
-                          ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/__undef_macros.jit
-                          ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/chrono.jit
-                          ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/climits.jit
-                          ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/cstddef.jit
-                          ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/cstdint.jit
-                          ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/ctime.jit
-                          ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/limits.jit
-                          ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/ratio.jit
-                          ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/type_traits.jit
-                          ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/version.jit
-                   MAIN_DEPENDENCY ${CUDF_SOURCE_DIR}/include/cudf/types.h
-                                   ${CUDF_SOURCE_DIR}/include/cudf/types.hpp
-                                   ${CUDF_SOURCE_DIR}/include/cudf/utilities/bit.hpp
-                                   ${CUDF_SOURCE_DIR}/include/cudf/wrappers/timestamps.hpp
-                                   ${CUDF_SOURCE_DIR}/include/cudf/fixed_point/fixed_point.hpp
-                                   ${CUDF_SOURCE_DIR}/include/cudf/wrappers/durations.hpp
-                                   ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/chrono
-                                   ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/climits
-                                   ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/cstddef
-                                   ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/cstdint
-                                   ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/ctime
-                                   ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/limits
-                                   ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/ratio
-                                   ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/type_traits
-                                   ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/version
-                                   ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/detail/__config
-                                   ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/detail/__pragma_pop
-                                   ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/detail/__pragma_push
-                                   ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/detail/libcxx/include/__config
-                                   ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/detail/libcxx/include/__pragma_pop
-                                   ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/detail/libcxx/include/__pragma_push
-                                   ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/detail/libcxx/include/__undef_macros
-                                   ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/detail/libcxx/include/chrono
-                                   ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/detail/libcxx/include/climits
-                                   ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/detail/libcxx/include/cstddef
-                                   ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/detail/libcxx/include/cstdint
-                                   ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/detail/libcxx/include/ctime
-                                   ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/detail/libcxx/include/limits
-                                   ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/detail/libcxx/include/ratio
-                                   ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/detail/libcxx/include/type_traits
-                                   ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/detail/libcxx/include/version
-
-                   # stringified headers are placed underneath the bin include jit directory and end in ".jit"
-                   COMMAND ${CUDF_BINARY_DIR}/stringify cudf/types.h > ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/types.h.jit
-                   COMMAND ${CUDF_BINARY_DIR}/stringify cudf/types.hpp > ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/types.hpp.jit
-                   COMMAND ${CUDF_BINARY_DIR}/stringify cudf/utilities/bit.hpp > ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/bit.hpp.jit
-                   COMMAND ${CUDF_BINARY_DIR}/stringify ../src/rolling/rolling_jit_detail.hpp > ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/rolling_jit_detail.hpp.jit
-                   COMMAND ${CUDF_BINARY_DIR}/stringify cudf/wrappers/timestamps.hpp > ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/timestamps.hpp.jit
-                   COMMAND ${CUDF_BINARY_DIR}/stringify cudf/fixed_point/fixed_point.hpp > ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/fixed_point.hpp.jit
-                   COMMAND ${CUDF_BINARY_DIR}/stringify cudf/wrappers/durations.hpp > ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/durations.hpp.jit
-                   COMMAND ${CUDF_BINARY_DIR}/stringify ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/chrono cuda_std_chrono > ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/chrono.jit
-                   COMMAND ${CUDF_BINARY_DIR}/stringify ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/climits cuda_std_climits > ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/climits.jit
-                   COMMAND ${CUDF_BINARY_DIR}/stringify ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/cstddef cuda_std_cstddef > ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/cstddef.jit
-                   COMMAND ${CUDF_BINARY_DIR}/stringify ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/cstdint cuda_std_cstdint > ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/cstdint.jit
-                   COMMAND ${CUDF_BINARY_DIR}/stringify ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/ctime cuda_std_ctime > ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/ctime.jit
-                   COMMAND ${CUDF_BINARY_DIR}/stringify ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/limits cuda_std_limits > ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/limits.jit
-                   COMMAND ${CUDF_BINARY_DIR}/stringify ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/ratio cuda_std_ratio > ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/ratio.jit
-                   COMMAND ${CUDF_BINARY_DIR}/stringify ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/type_traits cuda_std_type_traits > ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/type_traits.jit
-                   COMMAND ${CUDF_BINARY_DIR}/stringify ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/version cuda_std_version > ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/version.jit
-                   COMMAND ${CUDF_BINARY_DIR}/stringify ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/detail/__config cuda_std_detail___config > ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/__config.jit
-                   COMMAND ${CUDF_BINARY_DIR}/stringify ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/detail/__pragma_pop cuda_std_detail___pragma_pop > ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/__pragma_pop.jit
-                   COMMAND ${CUDF_BINARY_DIR}/stringify ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/detail/__pragma_push cuda_std_detail___pragma_push > ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/__pragma_push.jit
-                   COMMAND ${CUDF_BINARY_DIR}/stringify ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/detail/libcxx/include/__config cuda_std_detail_libcxx_include___config > ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/__config.jit
-                   COMMAND ${CUDF_BINARY_DIR}/stringify ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/detail/libcxx/include/__pragma_pop cuda_std_detail_libcxx_include___pragma_pop > ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/__pragma_pop.jit
-                   COMMAND ${CUDF_BINARY_DIR}/stringify ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/detail/libcxx/include/__pragma_push cuda_std_detail_libcxx_include___pragma_push > ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/__pragma_push.jit
-                   COMMAND ${CUDF_BINARY_DIR}/stringify ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/detail/libcxx/include/__undef_macros cuda_std_detail_libcxx_include___undef_macros > ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/__undef_macros.jit
-                   COMMAND ${CUDF_BINARY_DIR}/stringify ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/detail/libcxx/include/chrono cuda_std_detail_libcxx_include_chrono > ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/chrono.jit
-                   COMMAND ${CUDF_BINARY_DIR}/stringify ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/detail/libcxx/include/climits cuda_std_detail_libcxx_include_climits > ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/climits.jit
-                   COMMAND ${CUDF_BINARY_DIR}/stringify ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/detail/libcxx/include/cstddef cuda_std_detail_libcxx_include_cstddef > ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/cstddef.jit
-                   COMMAND ${CUDF_BINARY_DIR}/stringify ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/detail/libcxx/include/cstdint cuda_std_detail_libcxx_include_cstdint > ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/cstdint.jit
-                   COMMAND ${CUDF_BINARY_DIR}/stringify ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/detail/libcxx/include/ctime cuda_std_detail_libcxx_include_ctime > ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/ctime.jit
-                   COMMAND ${CUDF_BINARY_DIR}/stringify ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/detail/libcxx/include/limits cuda_std_detail_libcxx_include_limits > ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/limits.jit
-                   COMMAND ${CUDF_BINARY_DIR}/stringify ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/detail/libcxx/include/ratio cuda_std_detail_libcxx_include_ratio > ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/ratio.jit
-                   COMMAND ${CUDF_BINARY_DIR}/stringify ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/detail/libcxx/include/type_traits cuda_std_detail_libcxx_include_type_traits > ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/type_traits.jit
-                   COMMAND ${CUDF_BINARY_DIR}/stringify ${LIBCUDACXX_INCLUDE_DIR}/cuda/std/detail/libcxx/include/version cuda_std_detail_libcxx_include_version > ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/version.jit
-                   )
-
-add_custom_target(stringify_run DEPENDS
-                  ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/types.h.jit
-                  ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/types.hpp.jit
-                  ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/bit.hpp.jit
-                  ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/timestamps.hpp.jit
-                  ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/fixed_point.hpp.jit
-                  ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/durations.hpp.jit
-                  ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/chrono.jit
-                  ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/climits.jit
-                  ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/cstddef.jit
-                  ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/cstdint.jit
-                  ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/ctime.jit
-                  ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/limits.jit
-                  ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/ratio.jit
-                  ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/type_traits.jit
-                  ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/version.jit
-                  ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/__config.jit
-                  ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/__pragma_pop.jit
-                  ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/__pragma_push.jit
-                  ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/__config.jit
-                  ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/__pragma_pop.jit
-                  ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/__pragma_push.jit
-                  ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/__undef_macros.jit
-                  ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/chrono.jit
-                  ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/climits.jit
-                  ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/cstddef.jit
-                  ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/cstdint.jit
-                  ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/ctime.jit
-                  ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/limits.jit
-                  ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/ratio.jit
-                  ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/type_traits.jit
-                  ${CUDF_GENERATED_INCLUDE_DIR}/include/jit/libcudacxx/cuda/std/detail/libcxx/include/version.jit
-                  )
-
-###################################################################################################
-# - copy libcu++ ----------------------------------------------------------------------------------
-
-# `${LIBCUDACXX_INCLUDE_DIR}/` specifies that the contents of this directory will be installed (not the directory itself)
-file(COPY "${LIBCUDACXX_INCLUDE_DIR}/" DESTINATION "${CUDF_GENERATED_INCLUDE_DIR}/include/libcudacxx")
-file(COPY "${LIBCXX_INCLUDE_DIR}"      DESTINATION "${CUDF_GENERATED_INCLUDE_DIR}/include/libcxx")
diff --git a/cpp/cmake/thirdparty/CUDF_GetArrow.cmake b/cpp/cmake/thirdparty/CUDF_GetArrow.cmake
index 002085c2973..c1c29a693d5 100644
--- a/cpp/cmake/thirdparty/CUDF_GetArrow.cmake
+++ b/cpp/cmake/thirdparty/CUDF_GetArrow.cmake
@@ -43,6 +43,7 @@ function(find_and_configure_arrow VERSION BUILD_STATIC)
         GIT_SHALLOW     TRUE
         SOURCE_SUBDIR   cpp
         OPTIONS         "CMAKE_VERBOSE_MAKEFILE ON"
+                        "CUDA_USE_STATIC_CUDA_RUNTIME ${CUDA_STATIC_RUNTIME}"
                         "ARROW_IPC ON"
                         "ARROW_CUDA ON"
                         "ARROW_DATASET ON"
diff --git a/cpp/cmake/thirdparty/CUDF_GetCPM.cmake b/cpp/cmake/thirdparty/CUDF_GetCPM.cmake
index 19c07933d42..d0fe88eb398 100644
--- a/cpp/cmake/thirdparty/CUDF_GetCPM.cmake
+++ b/cpp/cmake/thirdparty/CUDF_GetCPM.cmake
@@ -1,6 +1,8 @@
-set(CPM_DOWNLOAD_VERSION 3b404296b539e596f39421c4e92bc803b299d964) # v0.27.5
+set(CPM_DOWNLOAD_VERSION 4fad2eac0a3741df3d9c44b791f9163b74aa7b07) # 0.32.0
 
 if(CPM_SOURCE_CACHE)
+  # Expand relative path. This is important if the provided path contains a tilde (~)
+  get_filename_component(CPM_SOURCE_CACHE ${CPM_SOURCE_CACHE} ABSOLUTE)
   set(CPM_DOWNLOAD_LOCATION "${CPM_SOURCE_CACHE}/cpm/CPM_${CPM_DOWNLOAD_VERSION}.cmake")
 elseif(DEFINED ENV{CPM_SOURCE_CACHE})
   set(CPM_DOWNLOAD_LOCATION "$ENV{CPM_SOURCE_CACHE}/cpm/CPM_${CPM_DOWNLOAD_VERSION}.cmake")
@@ -12,7 +14,7 @@ if(NOT (EXISTS ${CPM_DOWNLOAD_LOCATION}))
   message(VERBOSE "CUDF: Downloading CPM.cmake to ${CPM_DOWNLOAD_LOCATION}")
   file(
     DOWNLOAD
-    https://raw.githubusercontent.com/TheLartians/CPM.cmake/${CPM_DOWNLOAD_VERSION}/cmake/CPM.cmake
+    https://raw.githubusercontent.com/cpm-cmake/CPM.cmake/${CPM_DOWNLOAD_VERSION}/cmake/CPM.cmake
     ${CPM_DOWNLOAD_LOCATION})
 endif()
 
diff --git a/cpp/cmake/thirdparty/CUDF_GetJitify.cmake b/cpp/cmake/thirdparty/CUDF_GetJitify.cmake
index e041be26d64..6e853816ec5 100644
--- a/cpp/cmake/thirdparty/CUDF_GetJitify.cmake
+++ b/cpp/cmake/thirdparty/CUDF_GetJitify.cmake
@@ -1,5 +1,5 @@
 #=============================================================================
-# Copyright (c) 2020, NVIDIA CORPORATION.
+# Copyright (c) 2020-2021, NVIDIA CORPORATION.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -18,9 +18,9 @@
 
 function(find_and_configure_jitify)
     CPMFindPackage(NAME     jitify
-            VERSION         1.0.0
+            VERSION         2.0.0
             GIT_REPOSITORY  https://github.com/rapidsai/jitify.git
-            GIT_TAG         cudf_0.16
+            GIT_TAG         cudf_0.19
             GIT_SHALLOW     TRUE
             DOWNLOAD_ONLY   TRUE)
     set(JITIFY_INCLUDE_DIR "${jitify_SOURCE_DIR}" PARENT_SCOPE)
diff --git a/cpp/cmake/thirdparty/CUDF_GetRMM.cmake b/cpp/cmake/thirdparty/CUDF_GetRMM.cmake
index 136947674f9..9f6221d5d1f 100644
--- a/cpp/cmake/thirdparty/CUDF_GetRMM.cmake
+++ b/cpp/cmake/thirdparty/CUDF_GetRMM.cmake
@@ -14,19 +14,6 @@
 # limitations under the License.
 #=============================================================================
 
-function(cudf_save_if_enabled var)
-    if(CUDF_${var})
-        unset(${var} PARENT_SCOPE)
-        unset(${var} CACHE)
-    endif()
-endfunction()
-
-function(cudf_restore_if_enabled var)
-    if(CUDF_${var})
-        set(${var} ON CACHE INTERNAL "" FORCE)
-    endif()
-endfunction()
-
 function(find_and_configure_rmm VERSION)
 
     if(TARGET rmm::rmm)
@@ -37,9 +24,6 @@ function(find_and_configure_rmm VERSION)
     # 1. Pass `-D CPM_rmm_SOURCE=/path/to/rmm` to build a local RMM source tree
     # 2. Pass `-D CMAKE_PREFIX_PATH=/path/to/rmm/build` to use an existing local
     #    RMM build directory as the install location for find_package(rmm)
-    cudf_save_if_enabled(BUILD_TESTS)
-    cudf_save_if_enabled(BUILD_BENCHMARKS)
-
     CPMFindPackage(NAME rmm
         VERSION         ${VERSION}
         GIT_REPOSITORY  https://github.com/rapidsai/rmm.git
@@ -50,8 +34,6 @@ function(find_and_configure_rmm VERSION)
                         "CUDA_STATIC_RUNTIME ${CUDA_STATIC_RUNTIME}"
                         "DISABLE_DEPRECATION_WARNING ${DISABLE_DEPRECATION_WARNING}"
     )
-    cudf_restore_if_enabled(BUILD_TESTS)
-    cudf_restore_if_enabled(BUILD_BENCHMARKS)
 
     # Make sure consumers of cudf can also see rmm::rmm
     fix_cmake_global_defaults(rmm::rmm)
diff --git a/cpp/cmake/thirdparty/CUDF_GetThrust.cmake b/cpp/cmake/thirdparty/CUDF_GetThrust.cmake
index 5a304f234d2..daafe4a33a5 100644
--- a/cpp/cmake/thirdparty/CUDF_GetThrust.cmake
+++ b/cpp/cmake/thirdparty/CUDF_GetThrust.cmake
@@ -15,12 +15,23 @@
 #=============================================================================
 
 function(find_and_configure_thrust VERSION)
+    # We only want to set `UPDATE_DISCONNECTED` while
+    # the GIT tag hasn't moved from the last time we cloned
+    set(cpm_thrust_disconnect_update "UPDATE_DISCONNECTED TRUE")
+    set(CPM_THRUST_CURRENT_VERSION ${VERSION} CACHE STRING "version of thrust we checked out")
+    if(NOT VERSION VERSION_EQUAL CPM_THRUST_CURRENT_VERSION)
+        set(CPM_THRUST_CURRENT_VERSION ${VERSION} CACHE STRING "version of thrust we checked out" FORCE)
+        set(cpm_thrust_disconnect_update "")
+    endif()
+
     CPMAddPackage(NAME Thrust
         VERSION         ${VERSION}
         GIT_REPOSITORY  https://github.com/NVIDIA/thrust.git
         GIT_TAG         ${VERSION}
         GIT_SHALLOW     TRUE
-        PATCH_COMMAND   patch -p1 -N < ${CUDF_SOURCE_DIR}/cmake/thrust.patch || true)
+        ${cpm_thrust_disconnect_update}
+        PATCH_COMMAND   patch --reject-file=- -p1 -N < ${CUDF_SOURCE_DIR}/cmake/thrust.patch || true
+        )
 
     thrust_create_target(cudf::Thrust FROM_OPTIONS)
     set(THRUST_LIBRARY "cudf::Thrust" PARENT_SCOPE)
diff --git a/cpp/doxygen/Doxyfile b/cpp/doxygen/Doxyfile
index 8fde8098bd3..eaa632860e5 100644
--- a/cpp/doxygen/Doxyfile
+++ b/cpp/doxygen/Doxyfile
@@ -38,7 +38,7 @@ PROJECT_NAME           = "libcudf"
 # could be handy for archiving the generated documentation or if some version
 # control system is used.
 
-PROJECT_NUMBER         = 0.19.0
+PROJECT_NUMBER         = 0.20.0
 
 # Using the PROJECT_BRIEF tag one can provide an optional one line description
 # for a project that appears at the top of each page and should give viewer a
@@ -2167,7 +2167,7 @@ SKIP_FUNCTION_MACROS   = YES
 # the path). If a tag file is not located in the directory in which doxygen is
 # run, you must also specify the path to the tagfile here.
 
-TAGFILES               = rmm.tag=https://docs.rapids.ai/api/librmm/0.19
+TAGFILES               = rmm.tag=https://docs.rapids.ai/api/librmm/0.20
 
 # When a file name is specified after GENERATE_TAGFILE, doxygen will create a
 # tag file that is based on the input files it reads. See section "Linking to
diff --git a/cpp/include/cudf/detail/gather.cuh b/cpp/include/cudf/detail/gather.cuh
index bf488621d52..7a560e4c048 100644
--- a/cpp/include/cudf/detail/gather.cuh
+++ b/cpp/include/cudf/detail/gather.cuh
@@ -142,7 +142,11 @@ void gather_helper(InputItr source_itr,
 // Error case when no other overload or specialization is available
 template <typename Element, typename Enable = void>
 struct column_gatherer_impl {
-  std::unique_ptr<column> operator()(...) { CUDF_FAIL("Unsupported type in gather."); }
+  template <typename... Args>
+  std::unique_ptr<column> operator()(Args&&...)
+  {
+    CUDF_FAIL("Unsupported type in gather.");
+  }
 };
 
 /**
@@ -466,15 +470,20 @@ struct column_gatherer_impl<struct_view> {
                                                                          mr);
                    });
 
-    gather_bitmask(
-      // Table view of struct column.
-      cudf::table_view{
-        std::vector<cudf::column_view>{structs_column.child_begin(), structs_column.child_end()}},
-      gather_map_begin,
-      output_struct_members,
-      nullify_out_of_bounds ? gather_bitmask_op::NULLIFY : gather_bitmask_op::DONT_CHECK,
-      stream,
-      mr);
+    auto const nullable = std::any_of(structs_column.child_begin(),
+                                      structs_column.child_end(),
+                                      [](auto const& col) { return col.nullable(); });
+    if (nullable) {
+      gather_bitmask(
+        // Table view of struct column.
+        cudf::table_view{
+          std::vector<cudf::column_view>{structs_column.child_begin(), structs_column.child_end()}},
+        gather_map_begin,
+        output_struct_members,
+        nullify_out_of_bounds ? gather_bitmask_op::NULLIFY : gather_bitmask_op::DONT_CHECK,
+        stream,
+        mr);
+    }
 
     return cudf::make_structs_column(
       gather_map_size,
@@ -652,11 +661,15 @@ std::unique_ptr<table> gather(
                                                    mr));
   }
 
-  gather_bitmask_op const op = bounds_policy == out_of_bounds_policy::NULLIFY
-                                 ? gather_bitmask_op::NULLIFY
-                                 : gather_bitmask_op::DONT_CHECK;
-
-  gather_bitmask(source_table, gather_map_begin, destination_columns, op, stream, mr);
+  auto const nullable = bounds_policy == out_of_bounds_policy::NULLIFY ||
+                        std::any_of(source_table.begin(), source_table.end(), [](auto const& col) {
+                          return col.nullable();
+                        });
+  if (nullable) {
+    auto const op = bounds_policy == out_of_bounds_policy::NULLIFY ? gather_bitmask_op::NULLIFY
+                                                                   : gather_bitmask_op::DONT_CHECK;
+    gather_bitmask(source_table, gather_map_begin, destination_columns, op, stream, mr);
+  }
 
   return std::make_unique<table>(std::move(destination_columns));
 }
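Note on the `operator()(...)` change in the hunk above: the fallback functor now takes a variadic template parameter pack instead of a C-style ellipsis. The diff does not state the motivation. One plausible reason is that passing class types with non-trivial copy constructors or destructors through a C-style ellipsis is only conditionally supported in C++, while a forwarding-reference pack accepts any argument types. A minimal host-only sketch of the pattern follows, where `column` and `gatherer_impl` are hypothetical stand-ins rather than the cudf types:

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for cudf::column; only used for illustration.
struct column {
  int size;
};

// Primary template: the "unsupported type" fallback. The variadic pack accepts
// any number of arguments of any type and ignores them.
template <typename Element, typename Enable = void>
struct gatherer_impl {
  template <typename... Args>
  std::unique_ptr<column> operator()(Args&&...) const
  {
    throw std::logic_error("Unsupported type in gather.");
  }
};

// Specialization chosen for arithmetic element types.
template <typename Element>
struct gatherer_impl<Element, std::enable_if_t<std::is_arithmetic<Element>::value>> {
  std::unique_ptr<column> operator()(int size) const
  {
    return std::make_unique<column>(column{size});
  }
};

// Returns true when the fallback is selected and throws, even though one of
// the arguments is a non-trivially-copyable class type.
bool fallback_throws()
{
  try {
    gatherer_impl<std::string>{}(std::string{"map"}, 3);
  } catch (std::logic_error const&) {
    return true;
  }
  return false;
}
```

Here `std::logic_error` stands in for the exception that `CUDF_FAIL` raises in the real code.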
diff --git a/cpp/include/cudf/detail/scatter.cuh b/cpp/include/cudf/detail/scatter.cuh
index 30764b9b89f..d069ed06cae 100644
--- a/cpp/include/cudf/detail/scatter.cuh
+++ b/cpp/include/cudf/detail/scatter.cuh
@@ -25,6 +25,7 @@
 #include <cudf/dictionary/dictionary_column_view.hpp>
 #include <cudf/dictionary/dictionary_factories.hpp>
 #include <cudf/lists/detail/scatter.cuh>
+#include <cudf/null_mask.hpp>
 #include <cudf/strings/detail/scatter.cuh>
 #include <cudf/strings/string_view.cuh>
 #include <cudf/utilities/traits.hpp>
@@ -32,6 +33,8 @@
 #include <rmm/cuda_stream_view.hpp>
 #include <rmm/exec_policy.hpp>
 
+#include <thrust/uninitialized_fill.h>
+
 namespace cudf {
 namespace detail {
 
@@ -42,10 +45,9 @@ namespace detail {
  * function using the PASSTHROUGH op since the resulting map may contain index
  * values outside the target's range.
  *
- * First, the gather-map is initialized with invalid entries.
- * The gather_rows is used since it should always be outside the target size.
- *
- * Then, the `output[scatter_map[i]] = i`.
+ * First, the gather-map is initialized with an invalid index.
+ * The value `numeric_limits::lowest()` is used since it should always be outside the target size.
+ * Then, `output[scatter_map[i]] = i` for each `i`.
  *
  * @tparam MapIterator Iterator type of the input scatter map.
  * @param scatter_map_begin Beginning of scatter map.
@@ -62,11 +64,16 @@ auto scatter_to_gather(MapIterator scatter_map_begin,
 {
   using MapValueType = typename thrust::iterator_traits<MapIterator>::value_type;
 
-  // The gather_map is initialized with gather_rows value to identify pass-through entries
-  // when calling the gather_bitmask() which applies a pass-through whenever it finds a
+  // The gather_map is initialized with `numeric_limits::lowest()` value to identify pass-through
+  // entries when calling the gather_bitmask() which applies a pass-through whenever it finds a
   // value outside the range of the target column.
-  // We'll use the gather_rows value for this since it should always be outside the valid range.
-  auto gather_map = rmm::device_vector<size_type>(gather_rows, gather_rows);
+  // We'll use the `numeric_limits::lowest()` value for this since it should always be outside the
+  // valid range.
+  auto gather_map = rmm::device_uvector<size_type>(gather_rows, stream);
+  thrust::uninitialized_fill(rmm::exec_policy(stream),
+                             gather_map.begin(),
+                             gather_map.end(),
+                             std::numeric_limits<size_type>::lowest());
 
   // Convert scatter map to a gather map
   thrust::scatter(
@@ -79,9 +86,46 @@ auto scatter_to_gather(MapIterator scatter_map_begin,
   return gather_map;
 }
 
+/**
+ * @brief Create a complement map of `scatter_to_gather` map
+ *
+ * The purpose of this map is to create an identity-mapping for the rows that are not
+ * touched by the `scatter_map`.
+ *
+ * The output of this mapping is first initialized as an identity mapping
+ * (i.e., `output[i] = i`). Then, for each value `idx` from `scatter_map`, the value `output[idx]`
+ * is set to `numeric_limits::lowest()`, which is an invalid, out-of-bound index to identify the
+ * pass-through entries when calling the `gather_bitmask()` function.
+ *
+ */
+template <typename MapIterator>
+auto scatter_to_gather_complement(MapIterator scatter_map_begin,
+                                  MapIterator scatter_map_end,
+                                  size_type gather_rows,
+                                  rmm::cuda_stream_view stream)
+{
+  auto gather_map = rmm::device_uvector<size_type>(gather_rows, stream);
+  thrust::sequence(rmm::exec_policy(stream), gather_map.begin(), gather_map.end(), 0);
+
+  auto const out_of_bounds_begin =
+    thrust::make_constant_iterator(std::numeric_limits<size_type>::lowest());
+  auto const out_of_bounds_end =
+    out_of_bounds_begin + thrust::distance(scatter_map_begin, scatter_map_end);
+  thrust::scatter(rmm::exec_policy(stream),
+                  out_of_bounds_begin,
+                  out_of_bounds_end,
+                  scatter_map_begin,
+                  gather_map.begin());
+  return gather_map;
+}
+
 template <typename Element, typename Enable = void>
 struct column_scatterer_impl {
-  std::unique_ptr<column> operator()(...) const { CUDF_FAIL("Unsupported type for scatter."); }
+  template <typename... Args>
+  std::unique_ptr<column> operator()(Args&&...) const
+  {
+    CUDF_FAIL("Unsupported type for scatter.");
+  }
 };
 
 template <typename Element>
@@ -214,6 +258,89 @@ struct column_scatterer {
   }
 };
 
+template <>
+struct column_scatterer_impl<struct_view> {
+  template <typename MapItRoot>
+  std::unique_ptr<column> operator()(column_view const& source,
+                                     MapItRoot scatter_map_begin,
+                                     MapItRoot scatter_map_end,
+                                     column_view const& target,
+                                     rmm::cuda_stream_view stream,
+                                     rmm::mr::device_memory_resource* mr) const
+  {
+    CUDF_EXPECTS(source.num_children() == target.num_children(),
+                 "Scatter source and target are not of the same type.");
+
+    auto const scatter_map_size = std::distance(scatter_map_begin, scatter_map_end);
+    if (scatter_map_size == 0) { return std::make_unique<column>(target, stream, mr); }
+
+    structs_column_view const structs_src(source);
+    structs_column_view const structs_target(target);
+    std::vector<std::unique_ptr<column>> output_struct_members(structs_src.num_children());
+
+    std::transform(structs_src.child_begin(),
+                   structs_src.child_end(),
+                   structs_target.child_begin(),
+                   output_struct_members.begin(),
+                   [&scatter_map_begin, &scatter_map_end, stream, mr](auto const& source_col,
+                                                                      auto const& target_col) {
+                     return type_dispatcher<dispatch_storage_type>(source_col.type(),
+                                                                   column_scatterer{},
+                                                                   source_col,
+                                                                   scatter_map_begin,
+                                                                   scatter_map_end,
+                                                                   target_col,
+                                                                   stream,
+                                                                   mr);
+                   });
+
+    // We still need to call `gather_bitmask` even when the source's children are not nullable:
+    // if the target's children have null masks, those masks must be updated after the
+    // children have been scattered onto.
+    auto const child_nullable = std::any_of(structs_src.child_begin(),
+                                            structs_src.child_end(),
+                                            [](auto const& col) { return col.nullable(); }) or
+                                std::any_of(structs_target.child_begin(),
+                                            structs_target.child_end(),
+                                            [](auto const& col) { return col.nullable(); });
+    if (child_nullable) {
+      auto const gather_map =
+        scatter_to_gather(scatter_map_begin, scatter_map_end, source.size(), stream);
+      gather_bitmask(cudf::table_view{std::vector<cudf::column_view>{structs_src.child_begin(),
+                                                                     structs_src.child_end()}},
+                     gather_map.begin(),
+                     output_struct_members,
+                     gather_bitmask_op::PASSTHROUGH,
+                     stream,
+                     mr);
+    }
+
+    // Need to put the result column in a vector to call `gather_bitmask`
+    std::vector<std::unique_ptr<column>> result;
+    result.emplace_back(cudf::make_structs_column(source.size(),
+                                                  std::move(output_struct_members),
+                                                  0,
+                                                  rmm::device_buffer{0, stream, mr},
+                                                  stream,
+                                                  mr));
+
+    // Only gather bitmask from the target column for the rows that have not been scattered onto
+    // The bitmask from the source column will be gathered at the top level `scatter()` call
+    if (target.nullable()) {
+      auto const gather_map =
+        scatter_to_gather_complement(scatter_map_begin, scatter_map_end, target.size(), stream);
+      gather_bitmask(table_view{std::vector<cudf::column_view>{target}},
+                     gather_map.begin(),
+                     result,
+                     gather_bitmask_op::PASSTHROUGH,
+                     stream,
+                     mr);
+    }
+
+    return std::move(result.front());
+  }
+};
+
 /**
  * @brief Scatters the rows of the source table into a copy of the target table
  * according to a scatter map.
@@ -278,10 +405,8 @@ std::unique_ptr<table> scatter(
   // Transform negative indices to index + target size
   auto updated_scatter_map_begin =
     thrust::make_transform_iterator(scatter_map_begin, index_converter<MapType>{target.num_rows()});
-
   auto updated_scatter_map_end =
     thrust::make_transform_iterator(scatter_map_end, index_converter<MapType>{target.num_rows()});
-
   auto result = std::vector<std::unique_ptr<column>>(target.num_columns());
 
   std::transform(source.begin(),
@@ -299,11 +424,16 @@ std::unique_ptr<table> scatter(
                                                                  mr);
                  });
 
-  auto gather_map = scatter_to_gather(
-    updated_scatter_map_begin, updated_scatter_map_end, target.num_rows(), stream);
-
-  gather_bitmask(source, gather_map.begin(), result, gather_bitmask_op::PASSTHROUGH, stream, mr);
-
+  // We still need to call `gather_bitmask` even when the source columns are not nullable:
+  // if the target has a null mask, that mask must be updated after scattering.
+  auto const nullable =
+    std::any_of(source.begin(), source.end(), [](auto const& col) { return col.nullable(); }) or
+    std::any_of(target.begin(), target.end(), [](auto const& col) { return col.nullable(); });
+  if (nullable) {
+    auto const gather_map = scatter_to_gather(
+      updated_scatter_map_begin, updated_scatter_map_end, target.num_rows(), stream);
+    gather_bitmask(source, gather_map.begin(), result, gather_bitmask_op::PASSTHROUGH, stream, mr);
+  }
   return std::make_unique<table>(std::move(result));
 }
 }  // namespace detail
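Note on `scatter_to_gather` and `scatter_to_gather_complement` in the diff above: the two maps can be easier to follow in a host-only form. The sketch below replaces the Thrust calls with plain loops over `std::vector`. The `_host` function names and the `out_of_bounds` constant are hypothetical, and the sketch assumes every scatter index is already non-negative and in range (the real `scatter()` converts negative indices first).

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <numeric>
#include <vector>

using size_type = std::int32_t;

// Sentinel that can never be a valid row index.
constexpr size_type out_of_bounds = std::numeric_limits<size_type>::lowest();

// gather_map[scatter_map[i]] = i for each i; rows that the scatter map does
// not touch keep the sentinel, which marks them as pass-through entries.
std::vector<size_type> scatter_to_gather_host(std::vector<size_type> const& scatter_map,
                                              size_type gather_rows)
{
  std::vector<size_type> gather_map(gather_rows, out_of_bounds);
  for (std::size_t i = 0; i < scatter_map.size(); ++i) {
    gather_map[scatter_map[i]] = static_cast<size_type>(i);
  }
  return gather_map;
}

// Starts as the identity map (output[i] = i); every row named in the scatter
// map is then overwritten with the sentinel.
std::vector<size_type> scatter_to_gather_complement_host(
  std::vector<size_type> const& scatter_map, size_type gather_rows)
{
  std::vector<size_type> gather_map(gather_rows);
  std::iota(gather_map.begin(), gather_map.end(), 0);
  for (auto const idx : scatter_map) { gather_map[idx] = out_of_bounds; }
  return gather_map;
}
```

For a scatter map `{3, 0}` over five rows, the first function marks rows 1, 2 and 4 as pass-through, while the complement marks rows 0 and 3. The two maps therefore cover disjoint sets of rows.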
diff --git a/cpp/include/cudf/detail/utilities/hash_functions.cuh b/cpp/include/cudf/detail/utilities/hash_functions.cuh
index 31533a69487..7f3c05134e2 100644
--- a/cpp/include/cudf/detail/utilities/hash_functions.cuh
+++ b/cpp/include/cudf/detail/utilities/hash_functions.cuh
@@ -20,6 +20,7 @@
 #include <cudf/detail/utilities/assert.cuh>
 #include <cudf/fixed_point/fixed_point.hpp>
 #include <cudf/strings/string_view.cuh>
+#include <cudf/types.hpp>
 #include <hash/hash_constants.hpp>
 
 using hash_value_type = uint32_t;
@@ -231,6 +232,9 @@ MD5ListHasher::operator()<string_view>(column_device_view data_col,
 }
 
 struct MD5Hash {
+  MD5Hash() = default;
+  constexpr MD5Hash(uint32_t seed) : m_seed(seed) {}
+
   void __device__ finalize(md5_intermediate_data* hash_state, char* result_location) const
   {
     auto const full_length = (static_cast<uint64_t>(hash_state->message_length)) << 3;
@@ -302,6 +306,9 @@ struct MD5Hash {
   {
     md5_process(col.element<T>(row_index), hash_state);
   }
+
+ private:
+  uint32_t m_seed{cudf::DEFAULT_HASH_SEED};
 };
 
 template <>
@@ -372,7 +379,7 @@ struct MurmurHash3_32 {
   using result_type   = hash_value_type;
 
   MurmurHash3_32() = default;
-  CUDA_HOST_DEVICE_CALLABLE MurmurHash3_32(uint32_t seed) : m_seed(seed) {}
+  constexpr MurmurHash3_32(uint32_t seed) : m_seed(seed) {}
 
   CUDA_DEVICE_CALLABLE uint32_t rotl32(uint32_t x, int8_t r) const
   {
@@ -469,7 +476,7 @@ struct MurmurHash3_32 {
   }
 
  private:
-  uint32_t m_seed{0};
+  uint32_t m_seed{cudf::DEFAULT_HASH_SEED};
 };
 
 template <>
@@ -542,13 +549,29 @@ hash_value_type CUDA_DEVICE_CALLABLE MurmurHash3_32<double>::operator()(double c
   return this->compute_floating_point(key);
 }
 
+template <>
+hash_value_type CUDA_DEVICE_CALLABLE
+MurmurHash3_32<cudf::list_view>::operator()(cudf::list_view const& key) const
+{
+  cudf_assert(false && "List column hashing is not supported");
+  return 0;
+}
+
+template <>
+hash_value_type CUDA_DEVICE_CALLABLE
+MurmurHash3_32<cudf::struct_view>::operator()(cudf::struct_view const& key) const
+{
+  cudf_assert(false && "Direct hashing of struct_view is not supported");
+  return 0;
+}
+
 template <typename Key>
 struct SparkMurmurHash3_32 {
   using argument_type = Key;
   using result_type   = hash_value_type;
 
   SparkMurmurHash3_32() = default;
-  CUDA_HOST_DEVICE_CALLABLE SparkMurmurHash3_32(uint32_t seed) : m_seed(seed) {}
+  constexpr SparkMurmurHash3_32(uint32_t seed) : m_seed(seed) {}
 
   CUDA_DEVICE_CALLABLE uint32_t rotl32(uint32_t x, int8_t r) const
   {
@@ -620,7 +643,7 @@ struct SparkMurmurHash3_32 {
   }
 
  private:
-  uint32_t m_seed{0};
+  uint32_t m_seed{cudf::DEFAULT_HASH_SEED};
 };
 
 template <>
@@ -671,6 +694,22 @@ SparkMurmurHash3_32<numeric::decimal64>::operator()(numeric::decimal64 const& ke
   return this->compute<uint64_t>(key.value());
 }
 
+template <>
+hash_value_type CUDA_DEVICE_CALLABLE
+SparkMurmurHash3_32<cudf::list_view>::operator()(cudf::list_view const& key) const
+{
+  cudf_assert(false && "List column hashing is not supported");
+  return 0;
+}
+
+template <>
+hash_value_type CUDA_DEVICE_CALLABLE
+SparkMurmurHash3_32<cudf::struct_view>::operator()(cudf::struct_view const& key) const
+{
+  cudf_assert(false && "Direct hashing of struct_view is not supported");
+  return 0;
+}
+
 /**
  * @brief Specialization of MurmurHash3_32 operator for strings.
  */
@@ -740,6 +779,8 @@ SparkMurmurHash3_32<double>::operator()(double const& key) const
 template <typename Key>
 struct IdentityHash {
   using result_type = hash_value_type;
+  IdentityHash()    = default;
+  constexpr IdentityHash(uint32_t seed) : m_seed(seed) {}
 
   /**
    * @brief  Combines two hash values into a new single hash value. Called
@@ -752,7 +793,7 @@ struct IdentityHash {
    *
    * @returns A hash value that intelligently combines the lhs and rhs hash values
    */
-  CUDA_HOST_DEVICE_CALLABLE result_type hash_combine(result_type lhs, result_type rhs) const
+  constexpr result_type hash_combine(result_type lhs, result_type rhs) const
   {
     result_type combined{lhs};
 
@@ -762,19 +803,22 @@ struct IdentityHash {
   }
 
   template <typename return_type = result_type>
-  CUDA_HOST_DEVICE_CALLABLE std::enable_if_t<!std::is_arithmetic<Key>::value, return_type>
-  operator()(Key const& key) const
+  constexpr std::enable_if_t<!std::is_arithmetic<Key>::value, return_type> operator()(
+    Key const& key) const
   {
     cudf_assert(false && "IdentityHash does not support this data type");
     return 0;
   }
 
   template <typename return_type = result_type>
-  CUDA_HOST_DEVICE_CALLABLE std::enable_if_t<std::is_arithmetic<Key>::value, return_type>
-  operator()(Key const& key) const
+  constexpr std::enable_if_t<std::is_arithmetic<Key>::value, return_type> operator()(
+    Key const& key) const
   {
     return static_cast<result_type>(key);
   }
+
+ private:
+  uint32_t m_seed{cudf::DEFAULT_HASH_SEED};
 };
 
 template <typename Key>
diff --git a/cpp/include/cudf/detail/utilities/vector_factories.hpp b/cpp/include/cudf/detail/utilities/vector_factories.hpp
index 030d2c331c5..90e6a5c9643 100644
--- a/cpp/include/cudf/detail/utilities/vector_factories.hpp
+++ b/cpp/include/cudf/detail/utilities/vector_factories.hpp
@@ -14,6 +14,8 @@
  * limitations under the License.
  */
 
+#pragma once
+
 /**
  * @brief Convenience factories for creating device vectors from host spans
  * @file vector_factories.hpp
@@ -231,6 +233,93 @@ rmm::device_uvector<typename Container::value_type> make_device_uvector_sync(
   return make_device_uvector_sync(device_span<typename Container::value_type const>{c}, stream, mr);
 }
 
+/**
+ * @brief Asynchronously construct a `std::vector` containing a copy of data from a
+ * `device_span`
+ *
+ * @note This function does not synchronize `stream`.
+ *
+ * @tparam T The type of the data to copy
+ * @param v The device data to copy
+ * @param stream The stream on which to perform the copy
+ * @return The data copied to the host
+ */
+template <typename T>
+std::vector<T> make_std_vector_async(device_span<T const> v,
+                                     rmm::cuda_stream_view stream = rmm::cuda_stream_default)
+{
+  std::vector<T> result(v.size());
+  CUDA_TRY(cudaMemcpyAsync(
+    result.data(), v.data(), v.size() * sizeof(T), cudaMemcpyDeviceToHost, stream.value()));
+  return result;
+}
+
+/**
+ * @brief Asynchronously construct a `std::vector` containing a copy of data from a device
+ * container
+ *
+ * @note This function does not synchronize `stream`.
+ *
+ * @tparam Container The type of the container to copy from
+ * @tparam T The type of the data to copy
+ * @param c The input device container from which to copy
+ * @param stream The stream on which to perform the copy
+ * @return The data copied to the host
+ */
+template <
+  typename Container,
+  std::enable_if_t<
+    std::is_convertible<Container, device_span<typename Container::value_type const>>::value>* =
+    nullptr>
+std::vector<typename Container::value_type> make_std_vector_async(
+  Container const& c, rmm::cuda_stream_view stream = rmm::cuda_stream_default)
+{
+  return make_std_vector_async(device_span<typename Container::value_type const>{c}, stream);
+}
+
+/**
+ * @brief Synchronously construct a `std::vector` containing a copy of data from a
+ * `device_span`
+ *
+ * @note This function synchronizes `stream`.
+ *
+ * @tparam T The type of the data to copy
+ * @param v The device data to copy
+ * @param stream The stream on which to perform the copy
+ * @return The data copied to the host
+ */
+template <typename T>
+std::vector<T> make_std_vector_sync(device_span<T const> v,
+                                    rmm::cuda_stream_view stream = rmm::cuda_stream_default)
+{
+  auto result = make_std_vector_async(v, stream);
+  stream.synchronize();
+  return result;
+}
+
+/**
+ * @brief Synchronously construct a `std::vector` containing a copy of data from a device
+ * container
+ *
+ * @note This function synchronizes `stream`.
+ *
+ * @tparam Container The type of the container to copy from
+ * @tparam T The type of the data to copy
+ * @param c The input device container from which to copy
+ * @param stream The stream on which to perform the copy
+ * @return The data copied to the host
+ */
+template <
+  typename Container,
+  std::enable_if_t<
+    std::is_convertible<Container, device_span<typename Container::value_type const>>::value>* =
+    nullptr>
+std::vector<typename Container::value_type> make_std_vector_sync(
+  Container const& c, rmm::cuda_stream_view stream = rmm::cuda_stream_default)
+{
+  return make_std_vector_sync(device_span<typename Container::value_type const>{c}, stream);
+}
+
 }  // namespace detail
 
 }  // namespace cudf
diff --git a/cpp/include/cudf/hashing.hpp b/cpp/include/cudf/hashing.hpp
index 3f95b8b417b..0fb5002a953 100644
--- a/cpp/include/cudf/hashing.hpp
+++ b/cpp/include/cudf/hashing.hpp
@@ -39,7 +39,7 @@ std::unique_ptr<column> hash(
   table_view const& input,
   hash_id hash_function                     = hash_id::HASH_MURMUR3,
   std::vector<uint32_t> const& initial_hash = {},
-  uint32_t seed                             = 0,
+  uint32_t seed                             = DEFAULT_HASH_SEED,
   rmm::mr::device_memory_resource* mr       = rmm::mr::get_current_device_resource());
 
 /** @} */  // end of group
diff --git a/cpp/include/cudf/join.hpp b/cpp/include/cudf/join.hpp
index fcc0bcd444e..5a2c913d4c3 100644
--- a/cpp/include/cudf/join.hpp
+++ b/cpp/include/cudf/join.hpp
@@ -41,13 +41,14 @@ namespace cudf {
  * the matched row indices from the right table.
  *
  * @code{.pseudo}
- *     Left: {{0, 1, 2}}
- *     Right: {{1, 2, 3}}
- *     Result: {{1, 2}, {0, 1}}
+ * Left: {{0, 1, 2}}
+ * Right: {{1, 2, 3}}
+ * Result: {{1, 2}, {0, 1}}
  *
- *     Left: {{0, 1, 2}, {3, 4, 5}}
- *     Right: {{1, 2, 3}, {4, 6, 7}}
- *     Result: {{1}, {0}}
+ * Left: {{0, 1, 2}, {3, 4, 5}}
+ * Right: {{1, 2, 3}, {4, 6, 7}}
+ * Result: {{1}, {0}}
+ * @endcode
  *
  * @throw cudf::logic_error if number of elements in `left_keys` or `right_keys`
  * mismatch.
@@ -77,10 +78,10 @@ inner_join(cudf::table_view const& left_keys,
  * in the columns being joined on match.
  *
  * @code{.pseudo}
- *          Left: {{0, 1, 2}}
- *          Right: {{4, 9, 3}, {1, 2, 5}}
- *          left_on: {0}
- *          right_on: {1}
+ * Left: {{0, 1, 2}}
+ * Right: {{4, 9, 3}, {1, 2, 5}}
+ * left_on: {0}
+ * right_on: {1}
  * Result: {{1, 2}, {4, 9}, {1, 2}}
  * @endcode
  *
@@ -125,13 +126,14 @@ std::unique_ptr<cudf::table> inner_join(
  * out-of-bounds value.
  *
  * @code{.pseudo}
- *     Left: {{0, 1, 2}}
- *     Right: {{1, 2, 3}}
- *     Result: {{0, 1, 2}, {None, 0, 1}}
+ * Left: {{0, 1, 2}}
+ * Right: {{1, 2, 3}}
+ * Result: {{0, 1, 2}, {None, 0, 1}}
  *
- *     Left: {{0, 1, 2}, {3, 4, 5}}
- *     Right: {{1, 2, 3}, {4, 6, 7}}
- *     Result: {{0, 1, 2}, {None, 0, None}}
+ * Left: {{0, 1, 2}, {3, 4, 5}}
+ * Right: {{1, 2, 3}, {4, 6, 7}}
+ * Result: {{0, 1, 2}, {None, 0, None}}
+ * @endcode
  *
  * @throw cudf::logic_error if number of elements in `left_keys` or `right_keys`
  * mismatch.
@@ -163,16 +165,16 @@ left_join(cudf::table_view const& left_keys,
  * values in the left columns will be null.
  *
  * @code{.pseudo}
- *          Left: {{0, 1, 2}}
- *          Right: {{1, 2, 3}, {1, 2 ,5}}
- *          left_on: {0}
- *          right_on: {1}
+ * Left: {{0, 1, 2}}
+ * Right: {{1, 2, 3}, {1, 2, 5}}
+ * left_on: {0}
+ * right_on: {1}
  * Result: { {0, 1, 2}, {NULL, 1, 2}, {NULL, 1, 2} }
  *
- *          Left: {{0, 1, 2}}
- *          Right {{1, 2, 3}, {1, 2, 5}}
- *          left_on: {0}
- *          right_on: {0}
+ * Left: {{0, 1, 2}}
+ * Right: {{1, 2, 3}, {1, 2, 5}}
+ * left_on: {0}
+ * right_on: {0}
  * Result: { {0, 1, 2}, {NULL, 1, 2}, {NULL, 1, 2} }
  * @endcode
  *
@@ -216,13 +218,14 @@ std::unique_ptr<cudf::table> left_join(
  * representing a row from one table without a match in the other.
  *
  * @code{.pseudo}
- *     Left: {{0, 1, 2}}
- *     Right: {{1, 2, 3}}
- *     Result: {{0, 1, 2, None}, {None, 0, 1, 2}}
+ * Left: {{0, 1, 2}}
+ * Right: {{1, 2, 3}}
+ * Result: {{0, 1, 2, None}, {None, 0, 1, 2}}
  *
- *     Left: {{0, 1, 2}, {3, 4, 5}}
- *     Right: {{1, 2, 3}, {4, 6, 7}}
- *     Result: {{0, 1, 2, None, None}, {None, 0, None, 1, 2}}
+ * Left: {{0, 1, 2}, {3, 4, 5}}
+ * Right: {{1, 2, 3}, {4, 6, 7}}
+ * Result: {{0, 1, 2, None, None}, {None, 0, None, 1, 2}}
+ * @endcode
  *
  * @throw cudf::logic_error if number of elements in `left_keys` or `right_keys`
  * mismatch.
@@ -254,16 +257,16 @@ full_join(cudf::table_view const& left_keys,
  * values in the left columns will be null.
  *
  * @code{.pseudo}
- *          Left: {{0, 1, 2}}
- *          Right: {{1, 2, 3}, {1, 2, 5}}
- *          left_on: {0}
- *          right_on: {1}
+ * Left: {{0, 1, 2}}
+ * Right: {{1, 2, 3}, {1, 2, 5}}
+ * left_on: {0}
+ * right_on: {1}
  * Result: { {0, 1, 2, NULL}, {NULL, 1, 2, 3}, {NULL, 1, 2, 5} }
  *
- *          Left: {{0, 1, 2}}
- *          Right: {{1, 2, 3}, {1, 2, 5}}
- *          left_on: {0}
- *          right_on: {0}
+ * Left: {{0, 1, 2}}
+ * Right: {{1, 2, 3}, {1, 2, 5}}
+ * left_on: {0}
+ * right_on: {0}
  * Result: { {0, 1, 2, NULL}, {NULL, 1, 2, 3}, {NULL, 1, 2, 5} }
  * @endcode
  *
@@ -305,9 +308,9 @@ std::unique_ptr<cudf::table> full_join(
  * for which there is a matching row in the right table.
  *
  * @code{.pseudo}
- *          TableA: {{0, 1, 2}}
- *          TableB: {{1, 2, 3}}
- *          right_on: {1}
+ * TableA: {{0, 1, 2}}
+ * TableB: {{1, 2, 3}}
+ * right_on: {1}
  * Result: {1, 2}
  * @endcode
  *
@@ -338,16 +341,16 @@ std::unique_ptr<rmm::device_uvector<size_type>> left_semi_join(
  * returns rows that exist in the right table.
  *
  * @code{.pseudo}
- *          TableA: {{0, 1, 2}}
- *          TableB: {{1, 2, 3}, {1, 2, 5}}
- *          left_on: {0}
- *          right_on: {1}
+ * TableA: {{0, 1, 2}}
+ * TableB: {{1, 2, 3}, {1, 2, 5}}
+ * left_on: {0}
+ * right_on: {1}
  * Result: { {1, 2} }
  *
- *          TableA {{0, 1, 2}, {1, 2, 5}}
- *          TableB {{1, 2, 3}}
- *          left_on: {0}
- *          right_on: {0}
+ * TableA: {{0, 1, 2}, {1, 2, 5}}
+ * TableB: {{1, 2, 3}}
+ * left_on: {0}
+ * right_on: {0}
  * Result: { {1, 2}, {2, 5} }
  * @endcode
  *
@@ -386,8 +389,8 @@ std::unique_ptr<cudf::table> left_semi_join(
  * for which there is no matching row in the right table.
  *
  * @code{.pseudo}
- *          TableA: {{0, 1, 2}}
- *          TableB: {{1, 2, 3}}
+ * TableA: {{0, 1, 2}}
+ * TableB: {{1, 2, 3}}
  * Result: {0}
  * @endcode
  *
@@ -417,16 +420,16 @@ std::unique_ptr<rmm::device_uvector<size_type>> left_anti_join(
  * returns rows that do not exist in the right table.
  *
  * @code{.pseudo}
- *          TableA: {{0, 1, 2}}
- *          TableB: {{1, 2, 3},  {1, 2, 5}}
- *          left_on: {0}
- *          right_on: {1}
+ * TableA: {{0, 1, 2}}
+ * TableB: {{1, 2, 3}, {1, 2, 5}}
+ * left_on: {0}
+ * right_on: {1}
  * Result: {{0}, {1}}
  *
- *          TableA: {{0, 1, 2}, {1, 2, 5}}
- *          TableB: {{1, 2, 3}}
- *          left_on: {0}
- *          right_on: {0}
+ * TableA: {{0, 1, 2}, {1, 2, 5}}
+ * TableB: {{1, 2, 3}}
+ * left_on: {0}
+ * right_on: {0}
  * Result: { {0} {1} }
  * @endcode
  *
@@ -469,8 +472,8 @@ std::unique_ptr<cudf::table> left_anti_join(
  * equal to `left.num_rows() * right.num_rows()`. Use with caution.
  *
  * @code{.pseudo}
- *          Left a: {0, 1, 2}
- *          Right b: {3, 4, 5}
+ * Left a: {0, 1, 2}
+ * Right b: {3, 4, 5}
  * Result: { a: {0, 0, 0, 1, 1, 1, 2, 2, 2}, b: {3, 4, 5, 3, 4, 5, 3, 4, 5} }
  * @endcode
 
diff --git a/cpp/include/cudf/partitioning.hpp b/cpp/include/cudf/partitioning.hpp
index ddde26ec762..6b1ad7db08b 100644
--- a/cpp/include/cudf/partitioning.hpp
+++ b/cpp/include/cudf/partitioning.hpp
@@ -83,6 +83,9 @@ std::pair<std::unique_ptr<table>, std::vector<size_type>> partition(
  * @param input The table to partition
  * @param columns_to_hash Indices of input columns to hash
  * @param num_partitions The number of partitions to use
+ * @param hash_function Optional hash id that chooses the hash function to use
+ * @param seed Optional seed value to the hash function
+ * @param stream CUDA stream used for device memory operations and kernel launches
  * @param mr Device memory resource used to allocate the returned table's device memory.
  *
  * @returns An output table and a vector of row offsets to each partition
@@ -92,6 +95,7 @@ std::pair<std::unique_ptr<table>, std::vector<size_type>> hash_partition(
   std::vector<size_type> const& columns_to_hash,
   int num_partitions,
   hash_id hash_function               = hash_id::HASH_MURMUR3,
+  uint32_t seed                       = DEFAULT_HASH_SEED,
   rmm::cuda_stream_view stream        = rmm::cuda_stream_default,
   rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
 
diff --git a/cpp/include/cudf/scalar/scalar.hpp b/cpp/include/cudf/scalar/scalar.hpp
index ded833f4ca0..745f88572b4 100644
--- a/cpp/include/cudf/scalar/scalar.hpp
+++ b/cpp/include/cudf/scalar/scalar.hpp
@@ -151,7 +151,7 @@ class fixed_width_scalar : public scalar {
   /**
    * @brief Implicit conversion operator to get the value of the scalar on the host
    */
-  explicit operator value_type() const { return this->value(0); }
+  explicit operator value_type() const { return this->value(rmm::cuda_stream_default); }
 
   /**
    * @brief Get the value of the scalar
@@ -449,7 +449,7 @@ class string_scalar : public scalar {
   /**
    * @brief Implicit conversion operator to get the value of the scalar in a host std::string
    */
-  explicit operator std::string() const { return this->to_string(0); }
+  explicit operator std::string() const { return this->to_string(rmm::cuda_stream_default); }
 
   /**
    * @brief Get the value of the scalar in a host std::string
diff --git a/cpp/include/cudf/strings/detail/json.hpp b/cpp/include/cudf/strings/detail/json.hpp
new file mode 100644
index 00000000000..e6a0b49f102
--- /dev/null
+++ b/cpp/include/cudf/strings/detail/json.hpp
@@ -0,0 +1,40 @@
+/*
+ * Copyright (c) 2021, NVIDIA CORPORATION.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#pragma once
+
+#include <cudf/strings/strings_column_view.hpp>
+
+#include <rmm/cuda_stream_view.hpp>
+
+namespace cudf {
+namespace strings {
+namespace detail {
+
+/**
+ * @copydoc cudf::strings::get_json_object
+ *
+ * @param stream CUDA stream used for device memory operations and kernel launches
+ */
+std::unique_ptr<cudf::column> get_json_object(
+  cudf::strings_column_view const& col,
+  cudf::string_scalar const& json_path,
+  rmm::cuda_stream_view stream        = rmm::cuda_stream_default,
+  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
+
+}  // namespace detail
+}  // namespace strings
+}  // namespace cudf
diff --git a/cpp/include/cudf/strings/json.hpp b/cpp/include/cudf/strings/json.hpp
new file mode 100644
index 00000000000..b39e4a2027c
--- /dev/null
+++ b/cpp/include/cudf/strings/json.hpp
@@ -0,0 +1,50 @@
+/*
+ * Copyright (c) 2019-2021, NVIDIA CORPORATION.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+#pragma once
+
+#include <cudf/strings/strings_column_view.hpp>
+
+namespace cudf {
+namespace strings {
+
+/**
+ * @addtogroup strings_json
+ * @{
+ * @file
+ */
+
+/**
+ * @brief Apply a JSONPath string to all rows in an input strings column.
+ *
+ * Applies a JSONPath string to an incoming strings column where each row in the column
+ * is a valid JSON string. The output is returned by row as a strings column.
+ *
+ * https://tools.ietf.org/id/draft-goessner-dispatch-jsonpath-00.html
+ * Implements only the operators: $ . [] *
+ *
+ * @param col The input strings column. Each row must contain a valid JSON string.
+ * @param json_path The JSONPath string to be applied to each row
+ * @param mr Resource for allocating device memory.
+ * @return New strings column containing the retrieved JSON object strings
+ */
+std::unique_ptr<cudf::column> get_json_object(
+  cudf::strings_column_view const& col,
+  cudf::string_scalar const& json_path,
+  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
+
+/** @} */  // end of doxygen group
+}  // namespace strings
+}  // namespace cudf
diff --git a/cpp/include/cudf/strings/strings_column_view.hpp b/cpp/include/cudf/strings/strings_column_view.hpp
index 0c7270b3ba8..4d3c2dcdc56 100644
--- a/cpp/include/cudf/strings/strings_column_view.hpp
+++ b/cpp/include/cudf/strings/strings_column_view.hpp
@@ -19,7 +19,7 @@
 #include <cudf/column/column_view.hpp>
 
 #include <rmm/cuda_stream_view.hpp>
-#include <rmm/device_vector.hpp>
+#include <rmm/device_uvector.hpp>
 
 /**
  * @file
@@ -86,23 +86,6 @@ class strings_column_view : private column_view {
 
 //! Strings column APIs.
 namespace strings {
-/**
- * @brief Prints the strings to stdout.
- *
- * @param strings Strings instance for this operation.
- * @param start Index of first string to print.
- * @param end Index of last string to print. Specify -1 for all strings.
- * @param max_width Maximum number of characters to print per string.
- *        Specify -1 to print all characters.
- * @param delimiter The chars to print between each string.
- *        Default is new-line character.
- */
-void print(strings_column_view const& strings,
-           size_type start       = 0,
-           size_type end         = -1,
-           size_type max_width   = -1,
-           const char* delimiter = "\n");
-
 /**
  * @brief Create output per Arrow strings format.
  *
@@ -110,10 +93,10 @@ void print(strings_column_view const& strings,
  *
  * @param strings Strings instance for this operation.
  * @param stream CUDA stream used for device memory operations and kernel launches.
- * @param mr Device memory resource used to allocate the returned device_vectors.
+ * @param mr Device memory resource used to allocate the returned device vectors.
  * @return Pair containing a vector of chars and a vector of offsets.
  */
-std::pair<rmm::device_vector<char>, rmm::device_vector<size_type>> create_offsets(
+std::pair<rmm::device_uvector<char>, rmm::device_uvector<size_type>> create_offsets(
   strings_column_view const& strings,
   rmm::cuda_stream_view stream        = rmm::cuda_stream_default,
   rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
diff --git a/cpp/include/cudf/table/row_operators.cuh b/cpp/include/cudf/table/row_operators.cuh
index decd2879f54..61d714c5538 100644
--- a/cpp/include/cudf/table/row_operators.cuh
+++ b/cpp/include/cudf/table/row_operators.cuh
@@ -428,6 +428,7 @@ template <template <typename> class hash_function, bool has_nulls = true>
 class element_hasher_with_seed {
  public:
   element_hasher_with_seed() = default;
+  __device__ element_hasher_with_seed(uint32_t seed) : _seed{seed} {}
   __device__ element_hasher_with_seed(uint32_t seed, hash_value_type null_hash)
     : _seed{seed}, _null_hash(null_hash)
   {
@@ -448,7 +449,7 @@ class element_hasher_with_seed {
   }
 
  private:
-  uint32_t _seed{0};
+  uint32_t _seed{DEFAULT_HASH_SEED};
   hash_value_type _null_hash{std::numeric_limits<hash_value_type>::max()};
 };
 
@@ -463,6 +464,7 @@ class row_hasher {
  public:
   row_hasher() = delete;
   row_hasher(table_device_view t) : _table{t} {}
+  row_hasher(table_device_view t, uint32_t seed) : _table{t}, _seed(seed) {}
 
   __device__ auto operator()(size_type row_index) const
   {
@@ -470,6 +472,14 @@ class row_hasher {
       return hash_function<hash_value_type>{}.hash_combine(lhs, rhs);
     };
 
+    // Hash the first column with the seed
+    auto const initial_hash =
+      hash_combiner(hash_value_type{0},
+                    type_dispatcher(_table.column(0).type(),
+                                    element_hasher_with_seed<hash_function, has_nulls>{_seed},
+                                    _table.column(0),
+                                    row_index));
+
     // Hashes an element in a column
     auto hasher = [=](size_type column_index) {
       return cudf::type_dispatcher(_table.column(column_index).type(),
@@ -479,16 +489,19 @@ class row_hasher {
     };
 
     // Hash each element and combine all the hash values together
-    return thrust::transform_reduce(thrust::seq,
-                                    thrust::make_counting_iterator(0),
-                                    thrust::make_counting_iterator(_table.num_columns()),
-                                    hasher,
-                                    hash_value_type{0},
-                                    hash_combiner);
+    return thrust::transform_reduce(
+      thrust::seq,
+      // Start at column 1, not 0, because the first column was already hashed above
+      thrust::make_counting_iterator(1),
+      thrust::make_counting_iterator(_table.num_columns()),
+      hasher,
+      initial_hash,
+      hash_combiner);
   }
 
  private:
   table_device_view _table;
+  uint32_t _seed{DEFAULT_HASH_SEED};
 };
 
 /**
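Reviewer note (not part of the patch): the `row_hasher` change above hashes column 0 with the seed and then folds in the remaining columns starting from that value. A minimal host-only sketch of that pattern follows. The names `hash_element`, `hash_combine`, and `hash_row` are hypothetical stand-ins, the mixer is the MurmurHash3 32-bit finalizer rather than cudf's device hashers, and it assumes the remaining columns are hashed without the seed, which the elided `hasher` lambda does not confirm.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical seeded element hash: XOR the seed into the value, then apply
// the MurmurHash3 32-bit finalizer. This is not cudf's implementation.
inline std::uint32_t hash_element(std::int32_t value, std::uint32_t seed)
{
  std::uint32_t h = static_cast<std::uint32_t>(value) ^ seed;
  h ^= h >> 16;
  h *= 0x85ebca6bu;
  h ^= h >> 13;
  h *= 0xc2b2ae35u;
  h ^= h >> 16;
  return h;
}

// Boost-style combiner, used here only as a stand-in for hash_combine.
inline std::uint32_t hash_combine(std::uint32_t lhs, std::uint32_t rhs)
{
  return lhs ^ (rhs + 0x9e3779b9u + (lhs << 6) + (lhs >> 2));
}

// Hash one row: the first column is hashed with the seed and becomes the
// initial value of the reduction; the fold then starts at column 1.
inline std::uint32_t hash_row(std::vector<std::int32_t> const& row, std::uint32_t seed = 0)
{
  if (row.empty()) { return 0; }
  std::uint32_t const initial_hash = hash_combine(0u, hash_element(row[0], seed));
  return std::accumulate(row.begin() + 1,
                         row.end(),
                         initial_hash,
                         [](std::uint32_t acc, std::int32_t v) {
                           // Assumption: later columns use the unseeded hasher.
                           return hash_combine(acc, hash_element(v, 0u));
                         });
}
```

Because the seed enters only through the first column, omitting the seed argument gives the same result as passing the default seed explicitly. In this sketch the default is 0, matching the value `DEFAULT_HASH_SEED` is given in this patch.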
diff --git a/cpp/include/cudf/types.hpp b/cpp/include/cudf/types.hpp
index 789bb3037f4..b08fccc0d66 100644
--- a/cpp/include/cudf/types.hpp
+++ b/cpp/include/cudf/types.hpp
@@ -64,8 +64,8 @@ class list_scalar;
 class string_scalar;
 template <typename T> class numeric_scalar;
 template <typename T> class fixed_point_scalar;
-template <typename T> class timestamp_scalar;
-template <typename T> class duration_scalar;
+template <typename T> struct timestamp_scalar;
+template <typename T> struct duration_scalar;
 
 class string_scalar_device_view;
 template <typename T> class numeric_scalar_device_view;
@@ -339,5 +339,10 @@ enum class hash_id {
   HASH_SPARK_MURMUR3    ///< Spark Murmur3 hash function
 };
 
+/**
+ * @brief The default seed value for hash functions
+ */
+static constexpr uint32_t DEFAULT_HASH_SEED = 0;
+
 /** @} */
 }  // namespace cudf
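Reviewer note (not part of the patch): `DEFAULT_HASH_SEED` becomes the default argument for the new `seed` parameters on `hash` and `hash_partition`. A minimal host-only sketch of how a seeded hash typically drives partition assignment is shown below. `DEFAULT_SEED`, `seeded_hash`, and `assign_partitions` are hypothetical names for illustration, and the mixer is the MurmurHash3 32-bit finalizer, not the hash functions selected by `hash_id`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical constant mirroring the value given to DEFAULT_HASH_SEED.
constexpr std::uint32_t DEFAULT_SEED = 0;

// Hypothetical seeded hash (MurmurHash3 32-bit finalizer over value ^ seed).
inline std::uint32_t seeded_hash(std::int32_t key, std::uint32_t seed)
{
  std::uint32_t h = static_cast<std::uint32_t>(key) ^ seed;
  h ^= h >> 16;
  h *= 0x85ebca6bu;
  h ^= h >> 13;
  h *= 0xc2b2ae35u;
  h ^= h >> 16;
  return h;
}

// Returns, for each key, a partition index in [0, num_partitions).
// Callers that omit `seed` get the default, so existing call sites keep
// their behaviour when the parameter is added.
inline std::vector<int> assign_partitions(std::vector<std::int32_t> const& keys,
                                          int num_partitions,
                                          std::uint32_t seed = DEFAULT_SEED)
{
  std::vector<int> out;
  out.reserve(keys.size());
  for (auto const key : keys) {
    out.push_back(
      static_cast<int>(seeded_hash(key, seed) % static_cast<std::uint32_t>(num_partitions)));
  }
  return out;
}
```

Using one named constant instead of a literal `0` at each declaration keeps the default seed consistent across the public API and the internal hashers.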
diff --git a/cpp/include/cudf/utilities/bit.hpp b/cpp/include/cudf/utilities/bit.hpp
index 31c8835f4c6..458587946f2 100644
--- a/cpp/include/cudf/utilities/bit.hpp
+++ b/cpp/include/cudf/utilities/bit.hpp
@@ -17,7 +17,7 @@
 #pragma once
 
 #include <cassert>
-#include <climits>
+#include <cuda/std/climits>
 #include <cudf/types.hpp>
 
 /**
diff --git a/cpp/include/cudf_test/column_utilities.hpp b/cpp/include/cudf_test/column_utilities.hpp
index 66710960296..cd30090fa81 100644
--- a/cpp/include/cudf_test/column_utilities.hpp
+++ b/cpp/include/cudf_test/column_utilities.hpp
@@ -199,8 +199,14 @@ template <>
 inline std::pair<thrust::host_vector<std::string>, std::vector<bitmask_type>> to_host(column_view c)
 {
   auto strings_data = cudf::strings::create_offsets(strings_column_view(c));
-  thrust::host_vector<char> h_chars(strings_data.first);
-  thrust::host_vector<size_type> h_offsets(strings_data.second);
+  thrust::host_vector<char> h_chars(strings_data.first.size());
+  thrust::host_vector<size_type> h_offsets(strings_data.second.size());
+  CUDA_TRY(
+    cudaMemcpy(h_chars.data(), strings_data.first.data(), h_chars.size(), cudaMemcpyDeviceToHost));
+  CUDA_TRY(cudaMemcpy(h_offsets.data(),
+                      strings_data.second.data(),
+                      h_offsets.size() * sizeof(cudf::size_type),
+                      cudaMemcpyDeviceToHost));
 
   // build std::string vector from chars and offsets
   std::vector<std::string> host_data;
diff --git a/cpp/include/doxygen_groups.h b/cpp/include/doxygen_groups.h
index 65dd5c73475..f78ff98d49d 100644
--- a/cpp/include/doxygen_groups.h
+++ b/cpp/include/doxygen_groups.h
@@ -127,6 +127,7 @@
  *   @defgroup strings_modify Modifying
  *   @defgroup strings_replace Replacing
  *   @defgroup strings_split Splitting
+ *   @defgroup strings_json JSON
  * @}
  * @defgroup dictionary_apis Dictionary
  * @{
diff --git a/cpp/libcudf_kafka/CMakeLists.txt b/cpp/libcudf_kafka/CMakeLists.txt
index e178f5a6280..2f7fa5fc0fe 100644
--- a/cpp/libcudf_kafka/CMakeLists.txt
+++ b/cpp/libcudf_kafka/CMakeLists.txt
@@ -15,7 +15,7 @@
 #=============================================================================
 cmake_minimum_required(VERSION 3.18 FATAL_ERROR)
 
-project(CUDA_KAFKA VERSION 0.19.0 LANGUAGES CXX)
+project(CUDA_KAFKA VERSION 0.20.0 LANGUAGES CXX)
 
 ###################################################################################################
 # - Build options
diff --git a/cpp/libcudf_kafka/cmake/thirdparty/CUDF_KAFKA_GetCUDF.cmake b/cpp/libcudf_kafka/cmake/thirdparty/CUDF_KAFKA_GetCUDF.cmake
index 1f7c15d4f75..50c8b696d8c 100644
--- a/cpp/libcudf_kafka/cmake/thirdparty/CUDF_KAFKA_GetCUDF.cmake
+++ b/cpp/libcudf_kafka/cmake/thirdparty/CUDF_KAFKA_GetCUDF.cmake
@@ -14,22 +14,7 @@
 # limitations under the License.
 #=============================================================================
 
-function(cudfkafka_save_if_enabled var)
-    if(CUDF_KAFKA_${var})
-        unset(${var} PARENT_SCOPE)
-        unset(${var} CACHE)
-    endif()
-endfunction()
-
-function(cudfkafka_restore_if_enabled var)
-    if(CUDF_KAFKA_${var})
-        set(${var} ON CACHE INTERNAL "" FORCE)
-    endif()
-endfunction()
-
 function(find_and_configure_cudf VERSION)
-    cudfkafka_save_if_enabled(BUILD_TESTS)
-    cudfkafka_save_if_enabled(BUILD_BENCHMARKS)
     CPMFindPackage(NAME cudf
         VERSION         ${VERSION}
         GIT_REPOSITORY  https://github.com/rapidsai/cudf.git
@@ -38,9 +23,16 @@ function(find_and_configure_cudf VERSION)
         SOURCE_SUBDIR   cpp
         OPTIONS         "BUILD_TESTS OFF"
                         "BUILD_BENCHMARKS OFF")
-    cudfkafka_restore_if_enabled(BUILD_TESTS)
-    cudfkafka_restore_if_enabled(BUILD_BENCHMARKS)
+    if(cudf_ADDED)
+        set(cudf_ADDED TRUE PARENT_SCOPE)
+    endif()
 endfunction()
 
-set(CUDF_KAFKA_MIN_VERSION_cudf 0.19)
-find_and_configure_cudf(${CUDF_KAFKA_MIN_VERSION_cudf})
+set(CUDA_KAFKA_MIN_VERSION_cudf "${CUDA_KAFKA_VERSION_MAJOR}.${CUDA_KAFKA_VERSION_MINOR}")
+find_and_configure_cudf(${CUDA_KAFKA_MIN_VERSION_cudf})
+
+if(cudf_ADDED)
+    # Because cudf is built as part of this project, the CUDA
+    # language must be enabled in the top-most scope
+    enable_language(CUDA)
+endif()
diff --git a/cpp/src/binaryop/binaryop.cpp b/cpp/src/binaryop/binaryop.cpp
index 6b5afa69300..11a3383ee87 100644
--- a/cpp/src/binaryop/binaryop.cpp
+++ b/cpp/src/binaryop/binaryop.cpp
@@ -18,19 +18,13 @@
  */
 
 #include "compiled/binary_ops.hpp"
-#include "jit/code/code.h"
 #include "jit/util.hpp"
 
-#include <jit/launcher.h>
-#include <jit/parser.h>
-#include <jit/type.h>
+#include <jit_preprocessed_files/binaryop/jit/kernel.cu.jit.hpp>
 
-#include <jit/bit.hpp.jit>
-#include <jit/common_headers.hpp>
-#include <jit/durations.hpp.jit>
-#include <jit/fixed_point.hpp.jit>
-#include <jit/timestamps.hpp.jit>
-#include <jit/types.hpp.jit>
+#include <jit/cache.hpp>
+#include <jit/parser.hpp>
+#include <jit/type.hpp>
 
 #include <cudf/binaryop.hpp>
 #include <cudf/column/column_factories.hpp>
@@ -56,6 +50,7 @@ namespace cudf {
 
 namespace binops {
 namespace detail {
+
 /**
  * @brief Computes output valid mask for op between a column and a scalar
  */
@@ -78,69 +73,47 @@ rmm::device_buffer scalar_col_valid_mask_and(column_view const& col,
 
 namespace jit {
 
-const std::string hash = "prog_binop";
-
-const std::vector<std::string> header_names{"operation.h",
-                                            "traits.h",
-                                            cudf_types_hpp,
-                                            cudf_utilities_bit_hpp,
-                                            cudf_wrappers_timestamps_hpp,
-                                            cudf_wrappers_durations_hpp,
-                                            cudf_fixed_point_fixed_point_hpp};
-
-std::istream* headers_code(std::string filename, std::iostream& stream)
-{
-  if (filename == "operation.h") {
-    stream << code::operation;
-    return &stream;
-  }
-  if (filename == "traits.h") {
-    stream << code::traits;
-    return &stream;
-  }
-  auto it = cudf::jit::stringified_headers.find(filename);
-  if (it != cudf::jit::stringified_headers.end()) {
-    return cudf::jit::send_stringified_header(stream, it->second);
-  }
-  return nullptr;
-}
-
 void binary_operation(mutable_column_view& out,
-                      scalar const& lhs,
-                      column_view const& rhs,
+                      column_view const& lhs,
+                      scalar const& rhs,
                       binary_operator op,
+                      OperatorType op_type,
                       rmm::cuda_stream_view stream)
 {
   if (is_null_dependent(op)) {
-    cudf::jit::launcher(
-      hash, code::kernel, header_names, cudf::jit::compiler_flags, headers_code, stream)
-      .set_kernel_inst("kernel_v_s_with_validity",             // name of the kernel we are
-                                                               // launching
-                       {cudf::jit::get_type_name(out.type()),  // list of template arguments
-                        cudf::jit::get_type_name(rhs.type()),
-                        cudf::jit::get_type_name(lhs.type()),
-                        get_operator_name(op, OperatorType::Reverse)})
-      .launch(out.size(),
-              cudf::jit::get_data_ptr(out),
-              cudf::jit::get_data_ptr(rhs),
-              cudf::jit::get_data_ptr(lhs),
-              out.null_mask(),
-              rhs.null_mask(),
-              rhs.offset(),
-              lhs.is_valid());
+    std::string kernel_name =
+      jitify2::reflection::Template("cudf::binops::jit::kernel_v_s_with_validity")  //
+        .instantiate(cudf::jit::get_type_name(out.type()),  // list of template arguments
+                     cudf::jit::get_type_name(lhs.type()),
+                     cudf::jit::get_type_name(rhs.type()),
+                     get_operator_name(op, op_type));
+
+    cudf::jit::get_program_cache(*binaryop_jit_kernel_cu_jit)
+      .get_kernel(kernel_name, {}, {}, {"-arch=sm_."})       //
+      ->configure_1d_max_occupancy(0, 0, 0, stream.value())  //
+      ->launch(out.size(),
+               cudf::jit::get_data_ptr(out),
+               cudf::jit::get_data_ptr(lhs),
+               cudf::jit::get_data_ptr(rhs),
+               out.null_mask(),
+               lhs.null_mask(),
+               lhs.offset(),
+               rhs.is_valid());
   } else {
-    cudf::jit::launcher(
-      hash, code::kernel, header_names, cudf::jit::compiler_flags, headers_code, stream)
-      .set_kernel_inst("kernel_v_s",                           // name of the kernel we are
-                                                               // launching
-                       {cudf::jit::get_type_name(out.type()),  // list of template arguments
-                        cudf::jit::get_type_name(rhs.type()),
-                        cudf::jit::get_type_name(lhs.type()),
-                        get_operator_name(op, OperatorType::Reverse)})
-      .launch(out.size(),
-              cudf::jit::get_data_ptr(out),
-              cudf::jit::get_data_ptr(rhs),
-              cudf::jit::get_data_ptr(lhs));
+    std::string kernel_name =
+      jitify2::reflection::Template("cudf::binops::jit::kernel_v_s")  //
+        .instantiate(cudf::jit::get_type_name(out.type()),            // list of template arguments
+                     cudf::jit::get_type_name(lhs.type()),
+                     cudf::jit::get_type_name(rhs.type()),
+                     get_operator_name(op, op_type));
+
+    cudf::jit::get_program_cache(*binaryop_jit_kernel_cu_jit)
+      .get_kernel(kernel_name, {}, {}, {"-arch=sm_."})       //
+      ->configure_1d_max_occupancy(0, 0, 0, stream.value())  //
+      ->launch(out.size(),
+               cudf::jit::get_data_ptr(out),
+               cudf::jit::get_data_ptr(lhs),
+               cudf::jit::get_data_ptr(rhs));
   }
 }
 
@@ -150,37 +123,16 @@ void binary_operation(mutable_column_view& out,
                       binary_operator op,
                       rmm::cuda_stream_view stream)
 {
-  if (is_null_dependent(op)) {
-    cudf::jit::launcher(
-      hash, code::kernel, header_names, cudf::jit::compiler_flags, headers_code, stream)
-      .set_kernel_inst("kernel_v_s_with_validity",             // name of the kernel we are
-                                                               // launching
-                       {cudf::jit::get_type_name(out.type()),  // list of template arguments
-                        cudf::jit::get_type_name(lhs.type()),
-                        cudf::jit::get_type_name(rhs.type()),
-                        get_operator_name(op, OperatorType::Direct)})
-      .launch(out.size(),
-              cudf::jit::get_data_ptr(out),
-              cudf::jit::get_data_ptr(lhs),
-              cudf::jit::get_data_ptr(rhs),
-              out.null_mask(),
-              lhs.null_mask(),
-              lhs.offset(),
-              rhs.is_valid());
-  } else {
-    cudf::jit::launcher(
-      hash, code::kernel, header_names, cudf::jit::compiler_flags, headers_code, stream)
-      .set_kernel_inst("kernel_v_s",                           // name of the kernel we are
-                                                               // launching
-                       {cudf::jit::get_type_name(out.type()),  // list of template arguments
-                        cudf::jit::get_type_name(lhs.type()),
-                        cudf::jit::get_type_name(rhs.type()),
-                        get_operator_name(op, OperatorType::Direct)})
-      .launch(out.size(),
-              cudf::jit::get_data_ptr(out),
-              cudf::jit::get_data_ptr(lhs),
-              cudf::jit::get_data_ptr(rhs));
-  }
+  return binary_operation(out, lhs, rhs, op, OperatorType::Direct, stream);
+}
+
+void binary_operation(mutable_column_view& out,
+                      scalar const& lhs,
+                      column_view const& rhs,
+                      binary_operator op,
+                      rmm::cuda_stream_view stream)
+{
+  return binary_operation(out, rhs, lhs, op, OperatorType::Reverse, stream);
 }
 
 void binary_operation(mutable_column_view& out,
@@ -190,36 +142,40 @@ void binary_operation(mutable_column_view& out,
                       rmm::cuda_stream_view stream)
 {
   if (is_null_dependent(op)) {
-    cudf::jit::launcher(
-      hash, code::kernel, header_names, cudf::jit::compiler_flags, headers_code, stream)
-      .set_kernel_inst("kernel_v_v_with_validity",             // name of the kernel we are
-                                                               // launching
-                       {cudf::jit::get_type_name(out.type()),  // list of template arguments
-                        cudf::jit::get_type_name(lhs.type()),
-                        cudf::jit::get_type_name(rhs.type()),
-                        get_operator_name(op, OperatorType::Direct)})
-      .launch(out.size(),
-              cudf::jit::get_data_ptr(out),
-              cudf::jit::get_data_ptr(lhs),
-              cudf::jit::get_data_ptr(rhs),
-              out.null_mask(),
-              lhs.null_mask(),
-              rhs.offset(),
-              rhs.null_mask(),
-              rhs.offset());
+    std::string kernel_name =
+      jitify2::reflection::Template("cudf::binops::jit::kernel_v_v_with_validity")  //
+        .instantiate(cudf::jit::get_type_name(out.type()),  // list of template arguments
+                     cudf::jit::get_type_name(lhs.type()),
+                     cudf::jit::get_type_name(rhs.type()),
+                     get_operator_name(op, OperatorType::Direct));
+
+    cudf::jit::get_program_cache(*binaryop_jit_kernel_cu_jit)
+      .get_kernel(kernel_name, {}, {}, {"-arch=sm_."})       //
+      ->configure_1d_max_occupancy(0, 0, 0, stream.value())  //
+      ->launch(out.size(),
+               cudf::jit::get_data_ptr(out),
+               cudf::jit::get_data_ptr(lhs),
+               cudf::jit::get_data_ptr(rhs),
+               out.null_mask(),
+               lhs.null_mask(),
+               lhs.offset(),
+               rhs.null_mask(),
+               rhs.offset());
   } else {
-    cudf::jit::launcher(
-      hash, code::kernel, header_names, cudf::jit::compiler_flags, headers_code, stream)
-      .set_kernel_inst("kernel_v_v",                           // name of the kernel we are
-                                                               // launching
-                       {cudf::jit::get_type_name(out.type()),  // list of template arguments
-                        cudf::jit::get_type_name(lhs.type()),
-                        cudf::jit::get_type_name(rhs.type()),
-                        get_operator_name(op, OperatorType::Direct)})
-      .launch(out.size(),
-              cudf::jit::get_data_ptr(out),
-              cudf::jit::get_data_ptr(lhs),
-              cudf::jit::get_data_ptr(rhs));
+    std::string kernel_name =
+      jitify2::reflection::Template("cudf::binops::jit::kernel_v_v")  //
+        .instantiate(cudf::jit::get_type_name(out.type()),            // list of template arguments
+                     cudf::jit::get_type_name(lhs.type()),
+                     cudf::jit::get_type_name(rhs.type()),
+                     get_operator_name(op, OperatorType::Direct));
+
+    cudf::jit::get_program_cache(*binaryop_jit_kernel_cu_jit)
+      .get_kernel(kernel_name, {}, {}, {"-arch=sm_."})       //
+      ->configure_1d_max_occupancy(0, 0, 0, stream.value())  //
+      ->launch(out.size(),
+               cudf::jit::get_data_ptr(out),
+               cudf::jit::get_data_ptr(lhs),
+               cudf::jit::get_data_ptr(rhs));
   }
 }
 
@@ -232,23 +188,25 @@ void binary_operation(mutable_column_view& out,
   std::string const output_type_name = cudf::jit::get_type_name(out.type());
 
   std::string ptx_hash =
-    hash + "." + std::to_string(std::hash<std::string>{}(ptx + output_type_name));
+    "prog_binop." + std::to_string(std::hash<std::string>{}(ptx + output_type_name));
   std::string cuda_source =
-    "\n#include <cudf/types.hpp>\n" +
-    cudf::jit::parse_single_function_ptx(ptx, "GENERIC_BINARY_OP", output_type_name) + code::kernel;
-
-  cudf::jit::launcher(
-    ptx_hash, cuda_source, header_names, cudf::jit::compiler_flags, headers_code, stream)
-    .set_kernel_inst("kernel_v_v",       // name of the kernel
-                                         // we are launching
-                     {output_type_name,  // list of template arguments
-                      cudf::jit::get_type_name(lhs.type()),
-                      cudf::jit::get_type_name(rhs.type()),
-                      get_operator_name(binary_operator::GENERIC_BINARY, OperatorType::Direct)})
-    .launch(out.size(),
-            cudf::jit::get_data_ptr(out),
-            cudf::jit::get_data_ptr(lhs),
-            cudf::jit::get_data_ptr(rhs));
+    cudf::jit::parse_single_function_ptx(ptx, "GENERIC_BINARY_OP", output_type_name);
+
+  std::string kernel_name =
+    jitify2::reflection::Template("cudf::binops::jit::kernel_v_v")  //
+      .instantiate(output_type_name,                                // list of template arguments
+                   cudf::jit::get_type_name(lhs.type()),
+                   cudf::jit::get_type_name(rhs.type()),
+                   get_operator_name(binary_operator::GENERIC_BINARY, OperatorType::Direct));
+
+  cudf::jit::get_program_cache(*binaryop_jit_kernel_cu_jit)
+    .get_kernel(
+      kernel_name, {}, {{"binaryop/jit/operation-udf.hpp", cuda_source}}, {"-arch=sm_."})  //
+    ->configure_1d_max_occupancy(0, 0, 0, stream.value())                                  //
+    ->launch(out.size(),
+             cudf::jit::get_data_ptr(out),
+             cudf::jit::get_data_ptr(lhs),
+             cudf::jit::get_data_ptr(rhs));
 }
 
 }  // namespace jit
diff --git a/cpp/src/binaryop/jit/code/kernel.cpp b/cpp/src/binaryop/jit/code/kernel.cpp
deleted file mode 100644
index cfa1f1f82d2..00000000000
--- a/cpp/src/binaryop/jit/code/kernel.cpp
+++ /dev/null
@@ -1,124 +0,0 @@
-/*
- * Copyright (c) 2019, NVIDIA CORPORATION.
- *
- * Copyright 2018-2019 BlazingDB, Inc.
- *     Copyright 2018 Christian Noboa Mardini <christian@blazingdb.com>
- *     Copyright 2018 Rommel Quintanilla <rommel@blazingdb.com>
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-namespace cudf {
-namespace binops {
-namespace jit {
-namespace code {
-
-// clang-format off
-const char* kernel =
-  R"***(
-    #include "operation.h"
-
-    #include <cudf/types.hpp>
-    #include <cudf/utilities/bit.hpp>
-    #include <cudf/wrappers/timestamps.hpp>
-    #include <cudf/wrappers/durations.hpp>
-    #include <cudf/fixed_point/fixed_point.hpp>
-
-    template <typename TypeOut, typename TypeLhs, typename TypeRhs, typename TypeOpe>
-    __global__
-    void kernel_v_s_with_validity(cudf::size_type size, TypeOut* out_data, TypeLhs* lhs_data,
-                                  TypeRhs* rhs_data, cudf::bitmask_type* output_mask,
-                                  cudf::bitmask_type const* mask,
-                                  cudf::size_type offset, bool scalar_valid) {
-        int tid = threadIdx.x;
-        int blkid = blockIdx.x;
-        int blksz = blockDim.x;
-        int gridsz = gridDim.x;
-
-        int start = tid + blkid * blksz;
-        int step = blksz * gridsz;
-
-        for (cudf::size_type i=start; i<size; i+=step) {
-            bool output_valid = false;
-            out_data[i] = TypeOpe::template operate<TypeOut, TypeLhs, TypeRhs>(
-                lhs_data[i], rhs_data[0],
-                mask ? cudf::bit_is_set(mask, offset + i) : true, scalar_valid, output_valid);
-            if (output_mask && !output_valid) cudf::clear_bit(output_mask, i);
-        }
-    }
-
-    template <typename TypeOut, typename TypeLhs, typename TypeRhs, typename TypeOpe>
-    __global__
-    void kernel_v_s(cudf::size_type size,
-                    TypeOut* out_data, TypeLhs* lhs_data, TypeRhs* rhs_data) {
-        int tid = threadIdx.x;
-        int blkid = blockIdx.x;
-        int blksz = blockDim.x;
-        int gridsz = gridDim.x;
-
-        int start = tid + blkid * blksz;
-        int step = blksz * gridsz;
-
-        for (cudf::size_type i=start; i<size; i+=step) {
-            out_data[i] = TypeOpe::template operate<TypeOut, TypeLhs, TypeRhs>(lhs_data[i], rhs_data[0]);
-        }
-    }
-
-    template <typename TypeOut, typename TypeLhs, typename TypeRhs, typename TypeOpe>
-    __global__
-    void kernel_v_v(cudf::size_type size,
-                    TypeOut* out_data, TypeLhs* lhs_data, TypeRhs* rhs_data) {
-        int tid = threadIdx.x;
-        int blkid = blockIdx.x;
-        int blksz = blockDim.x;
-        int gridsz = gridDim.x;
-
-        int start = tid + blkid * blksz;
-        int step = blksz * gridsz;
-
-        for (cudf::size_type i=start; i<size; i+=step) {
-            out_data[i] = TypeOpe::template operate<TypeOut, TypeLhs, TypeRhs>(lhs_data[i], rhs_data[i]);
-        }
-    }
-
-    template <typename TypeOut, typename TypeLhs, typename TypeRhs, typename TypeOpe>
-    __global__
-    void kernel_v_v_with_validity(cudf::size_type size, TypeOut* out_data, TypeLhs* lhs_data,
-                                  TypeRhs* rhs_data, cudf::bitmask_type* output_mask,
-                                  cudf::bitmask_type const* lhs_mask, cudf::size_type lhs_offset,
-                                  cudf::bitmask_type const* rhs_mask, cudf::size_type rhs_offset) {
-        int tid = threadIdx.x;
-        int blkid = blockIdx.x;
-        int blksz = blockDim.x;
-        int gridsz = gridDim.x;
-
-        int start = tid + blkid * blksz;
-        int step = blksz * gridsz;
-
-        for (cudf::size_type i=start; i<size; i+=step) {
-            bool output_valid = false;
-            out_data[i] = TypeOpe::template operate<TypeOut, TypeLhs, TypeRhs>(
-                lhs_data[i], rhs_data[i],
-                lhs_mask ? cudf::bit_is_set(lhs_mask, lhs_offset + i) : true,
-                rhs_mask ? cudf::bit_is_set(rhs_mask, rhs_offset + i) : true,
-                output_valid);
-            if (output_mask && !output_valid) cudf::clear_bit(output_mask, i);
-        }
-    }
-)***";
-// clang-format on
-
-}  // namespace code
-}  // namespace jit
-}  // namespace binops
-}  // namespace cudf
diff --git a/cpp/src/binaryop/jit/code/operation.cpp b/cpp/src/binaryop/jit/code/operation.cpp
deleted file mode 100644
index 938ab0614d4..00000000000
--- a/cpp/src/binaryop/jit/code/operation.cpp
+++ /dev/null
@@ -1,574 +0,0 @@
-/*
- * Copyright (c) 2019-2020, NVIDIA CORPORATION.
- *
- * Copyright 2018-2019 BlazingDB, Inc.
- *     Copyright 2018 Christian Noboa Mardini <christian@blazingdb.com>
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-namespace cudf {
-namespace binops {
-namespace jit {
-namespace code {
-
-const char* operation =
-  R"***(
-    #pragma once
-
-    #include "traits.h"
-
-    #include <cmath>
-
-    #include <cuda/std/type_traits>
-
-    using namespace cuda::std;
-
-    struct Add {
-        // Allow sum between chronos only when both input and output types
-        // are chronos. Unsupported combinations will fail to compile
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs,
-                  enable_if_t<(is_chrono_v<TypeOut> &&
-                               is_chrono_v<TypeLhs> &&
-                               is_chrono_v<TypeRhs>)>* = nullptr>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return x + y;
-        }
-
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs,
-                  enable_if_t<(!is_chrono_v<TypeOut> ||
-                               !is_chrono_v<TypeLhs> ||
-                               !is_chrono_v<TypeRhs>)>* = nullptr>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            using TypeCommon = typename common_type<TypeOut, TypeLhs, TypeRhs>::type;
-            return static_cast<TypeOut>(static_cast<TypeCommon>(x) + static_cast<TypeCommon>(y));
-        }
-    };
-
-    using RAdd = Add;
-
-    struct Sub {
-        // Allow difference between chronos only when both input and output types
-        // are chronos. Unsupported combinations will fail to compile
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs,
-                  enable_if_t<(is_chrono_v<TypeOut> &&
-                               is_chrono_v<TypeLhs> &&
-                               is_chrono_v<TypeRhs>)>* = nullptr>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return x - y;
-        }
-
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs,
-                  enable_if_t<(!is_chrono_v<TypeOut> ||
-                               !is_chrono_v<TypeLhs> ||
-                               !is_chrono_v<TypeRhs>)>* = nullptr>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            using TypeCommon = typename common_type<TypeOut, TypeLhs, TypeRhs>::type;
-            return static_cast<TypeOut>(static_cast<TypeCommon>(x) - static_cast<TypeCommon>(y));
-        }
-    };
-
-    struct RSub {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return Sub::operate<TypeOut, TypeRhs, TypeLhs>(y, x);
-        }
-    };
-
-    struct Mul {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs,
-                  enable_if_t<(!is_duration_v<TypeOut>)>* = nullptr>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            using TypeCommon = typename common_type<TypeOut, TypeLhs, TypeRhs>::type;
-            return static_cast<TypeOut>(static_cast<TypeCommon>(x) * static_cast<TypeCommon>(y));
-        }
-
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs,
-                  enable_if_t<(is_duration_v<TypeOut>)>* = nullptr>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return DurationProduct<TypeOut>(x, y);
-        }
-
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs,
-                  enable_if_t<(is_duration_v<TypeLhs> && is_integral_v<TypeRhs>) ||
-                              (is_integral_v<TypeLhs> && is_duration_v<TypeRhs>)>* = nullptr>
-        static TypeOut DurationProduct(TypeLhs x, TypeRhs y) {
-            return x * y;
-        }
-    };
-
-    using RMul = Mul;
-
-    struct Div {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs,
-                  enable_if_t<(!is_duration_v<TypeLhs>)>* = nullptr>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            using TypeCommon = typename common_type<TypeOut, TypeLhs, TypeRhs>::type;
-            return static_cast<TypeOut>(static_cast<TypeCommon>(x) / static_cast<TypeCommon>(y));
-        }
-
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs,
-                  enable_if_t<(is_duration_v<TypeLhs>)>* = nullptr>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return DurationDivide<TypeOut>(x, y);
-        }
-
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs,
-                  enable_if_t<(is_integral_v<TypeRhs> || is_duration_v<TypeRhs>)>* = nullptr>
-        static TypeOut DurationDivide(TypeLhs x, TypeRhs y) {
-            return x / y;
-        }
-    };
-
-    struct RDiv {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return Div::operate<TypeOut, TypeRhs, TypeLhs>(y, x);
-        }
-    };
-
-    struct TrueDiv {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return (static_cast<double>(x) / static_cast<double>(y));
-        }
-    };
-
-    struct RTrueDiv {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return TrueDiv::operate<TypeOut, TypeRhs, TypeLhs>(y, x);
-        }
-    };
-
-    struct FloorDiv {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return floor(static_cast<double>(x) / static_cast<double>(y));
-        }
-    };
-
-    struct RFloorDiv {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return FloorDiv::operate<TypeOut, TypeRhs, TypeLhs>(y, x);
-        }
-    };
-
-    struct Mod {
-        template <typename TypeOut,
-                  typename TypeLhs,
-                  typename TypeRhs,
-                  enable_if_t<(is_integral_v<typename common_type<TypeOut, TypeLhs, TypeRhs>::type>)>* = nullptr>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            using TypeCommon = typename common_type<TypeOut, TypeLhs, TypeRhs>::type;
-            return static_cast<TypeOut>(static_cast<TypeCommon>(x) % static_cast<TypeCommon>(y));
-        }
-
-        template <typename TypeOut,
-                  typename TypeLhs,
-                  typename TypeRhs,
-                  enable_if_t<(isFloat<typename common_type<TypeOut, TypeLhs, TypeRhs>::type>)>* = nullptr>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return static_cast<TypeOut>(fmodf(static_cast<float>(x), static_cast<float>(y)));
-        }
-
-        template <typename TypeOut,
-                  typename TypeLhs,
-                  typename TypeRhs,
-                  enable_if_t<(isDouble<typename common_type<TypeOut, TypeLhs, TypeRhs>::type>)>* = nullptr>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return static_cast<TypeOut>(fmod(static_cast<double>(x), static_cast<double>(y)));
-        }
-
-        template <typename TypeOut,
-                  typename TypeLhs,
-                  typename TypeRhs,
-                  enable_if_t<(is_duration_v<TypeLhs> && is_duration_v<TypeOut>)>* = nullptr>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return x % y;
-        }
-    };
-
-    struct RMod {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return Mod::operate<TypeOut, TypeRhs, TypeLhs>(y, x);
-        }
-    };
-
-    struct PyMod {
-        template <typename TypeOut,
-                  typename TypeLhs,
-                  typename TypeRhs,
-                  enable_if_t<(is_integral_v<TypeOut>)>* = nullptr>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return ((x % y) + y) % y;
-        }
-
-        template <typename TypeOut,
-                  typename TypeLhs,
-                  typename TypeRhs,
-                  enable_if_t<(is_floating_point_v<TypeOut>)>* = nullptr>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            double x1 = static_cast<double>(x);
-            double y1 = static_cast<double>(y);
-            return fmod(fmod(x1, y1) + y1, y1);
-        }
-
-        template <typename TypeOut,
-                  typename TypeLhs,
-                  typename TypeRhs,
-                  enable_if_t<(is_duration_v<TypeLhs> && is_duration_v<TypeOut>)>* = nullptr>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return ((x % y) + y) % y;
-        }
-    };
-
-    struct RPyMod {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return PyMod::operate<TypeOut, TypeRhs, TypeLhs>(y, x);
-        }
-    };
-
-    struct Pow {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return pow(static_cast<double>(x), static_cast<double>(y));
-        }
-    };
-
-    struct RPow {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return Pow::operate<TypeOut, TypeRhs, TypeLhs>(y, x);
-        }
-    };
-
-    struct Equal {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return (x == y);
-        }
-    };
-
-    using REqual = Equal;
-
-    struct NotEqual {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return (x != y);
-        }
-    };
-
-    using RNotEqual = NotEqual;
-
-    struct Less {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return (x < y);
-        }
-    };
-
-    struct RLess {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return (y < x);
-        }
-    };
-
-    struct Greater {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return (x > y);
-        }
-    };
-
-    struct RGreater {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return (y > x);
-        }
-    };
-
-    struct LessEqual {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return (x <= y);
-        }
-    };
-
-    struct RLessEqual {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return (y <= x);
-        }
-    };
-
-    struct GreaterEqual {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return (x >= y);
-        }
-    };
-
-    struct RGreaterEqual {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return (y >= x);
-        }
-    };
-    
-    struct BitwiseAnd {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return (static_cast<TypeOut>(x) & static_cast<TypeOut>(y));
-        }
-    };
-
-    using RBitwiseAnd = BitwiseAnd;
-
-    struct BitwiseOr {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return (static_cast<TypeOut>(x) | static_cast<TypeOut>(y));
-        }
-    };
-
-    using RBitwiseOr = BitwiseOr;
-
-    struct BitwiseXor {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return (static_cast<TypeOut>(x) ^ static_cast<TypeOut>(y));
-        }
-    };
-
-    using RBitwiseXor = BitwiseXor;
-
-    struct LogicalAnd {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return (x && y);
-        }
-    };
-
-    using RLogicalAnd = LogicalAnd;
-
-    struct LogicalOr {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return (x || y);
-        }
-    };
-
-    using RLogicalOr = LogicalOr;
-
-    struct UserDefinedOp {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            TypeOut output;
-            using TypeCommon = typename common_type<TypeOut, TypeLhs, TypeRhs>::type;
-            GENERIC_BINARY_OP(&output, static_cast<TypeCommon>(x), static_cast<TypeCommon>(y));
-            return output;
-        }
-    };    
-    
-    struct ShiftLeft {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return (x << y);
-        }
-    };
-
-    struct RShiftLeft {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return (y << x);
-        }
-    };
-
-    struct ShiftRight {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return (x >> y);
-        }
-    };    
-
-    struct RShiftRight {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return (y >> x);
-        }
-    };
-
-    struct ShiftRightUnsigned {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return (static_cast<make_unsigned_t<TypeLhs>>(x) >> y);            
-        }
-    };    
-
-    struct RShiftRightUnsigned {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return (static_cast<make_unsigned_t<TypeRhs>>(y) >> x);            
-        }
-    };    
-
-    struct LogBase {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return (std::log(static_cast<double>(x)) / std::log(static_cast<double>(y)));
-        }
-    };
-
-    struct RLogBase {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return LogBase::operate<TypeOut, TypeRhs, TypeLhs>(y, x);
-        }
-    };
-
-    struct NullEquals {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y, bool lhs_valid, bool rhs_valid,
-                               bool& output_valid) {
-            output_valid = true;
-            if (!lhs_valid && !rhs_valid) return true;
-            if (lhs_valid && rhs_valid) return x == y;
-            return false;
-        }
-    };
-
-    struct RNullEquals {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y, bool lhs_valid, bool rhs_valid,
-                               bool& output_valid) {
-            output_valid = true;
-            return NullEquals::operate<TypeOut, TypeRhs, TypeLhs>(y, x, rhs_valid, lhs_valid,
-                                                                  output_valid);
-        }
-    };
-
-    struct NullMax {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y, bool lhs_valid, bool rhs_valid,
-                               bool& output_valid) {
-            output_valid = true;
-            if (!lhs_valid && !rhs_valid) {
-                output_valid = false;
-                return TypeOut{};
-            } else if (lhs_valid && rhs_valid) {
-                return (TypeOut{x} > TypeOut{y}) ? TypeOut{x} : TypeOut{y};
-            } else if (lhs_valid) return TypeOut{x};
-            else return TypeOut{y};
-        }
-    };
-
-    struct RNullMax {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y, bool lhs_valid, bool rhs_valid,
-                               bool& output_valid) {
-            return NullMax::operate<TypeOut, TypeRhs, TypeLhs>(y, x, rhs_valid, lhs_valid,
-                                                               output_valid);
-        }
-    };
-
-    struct NullMin {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y, bool lhs_valid, bool rhs_valid,
-                               bool& output_valid) {
-            output_valid = true;
-            if (!lhs_valid && !rhs_valid) {
-                output_valid = false;
-                return TypeOut{};
-            } else if (lhs_valid && rhs_valid) {
-                return (TypeOut{x} < TypeOut{y}) ? TypeOut{x} : TypeOut{y};
-            } else if (lhs_valid) return TypeOut{x};
-            else return TypeOut{y};
-        }
-    };
-
-    struct RNullMin {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y, bool lhs_valid, bool rhs_valid,
-                               bool& output_valid) {
-            return NullMin::operate<TypeOut, TypeRhs, TypeLhs>(y, x, rhs_valid, lhs_valid,
-                                                               output_valid);
-        }
-    };
-
-    struct PMod {
-        // Ideally, these two specializations - one for integral types and one for non integral
-        // types shouldn't be required, as std::fmod should promote integral types automatically
-        // to double and call the std::fmod overload for doubles. Sadly, doing this in jitified
-        // code does not work - it is having trouble deciding between float/double overloads
-        template <typename TypeOut,
-                  typename TypeLhs,
-                  typename TypeRhs,
-                  enable_if_t<(is_integral_v<typename cuda::std::common_type<TypeLhs, TypeRhs>::type>)>* = nullptr>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            using common_t = typename cuda::std::common_type<TypeLhs, TypeRhs>::type;
-            common_t xconv{x};
-            common_t yconv{y};
-            auto rem = xconv % yconv;
-            if (rem < 0) rem = (rem + yconv) % yconv;
-            return TypeOut{rem};
-        }
-
-        template <typename TypeOut,
-                  typename TypeLhs,
-                  typename TypeRhs,
-                  enable_if_t<!(is_integral_v<typename cuda::std::common_type<TypeLhs, TypeRhs>::type>)>* = nullptr>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            using common_t = typename cuda::std::common_type<TypeLhs, TypeRhs>::type;
-            common_t xconv{x};
-            common_t yconv{y};
-            auto rem = std::fmod(xconv, yconv);
-            if (rem < 0) rem = std::fmod(rem + yconv, yconv);
-            return TypeOut{rem};
-        }
-    };
-
-    struct RPMod {
-        template <typename TypeOut,
-                  typename TypeLhs,
-                  typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return PMod::operate<TypeOut, TypeRhs, TypeLhs>(y, x);
-        }
-    };
-
-    struct ATan2 {
-        template <typename TypeOut, typename TypeLhs, typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return TypeOut{std::atan2(double{x}, double{y})};
-        }
-    };
-
-    struct RATan2 {
-        template <typename TypeOut,
-                  typename TypeLhs,
-                  typename TypeRhs>
-        static TypeOut operate(TypeLhs x, TypeRhs y) {
-            return TypeOut{ATan2::operate<TypeOut, TypeRhs, TypeLhs>(y, x)};
-        }
-    };
-)***";
-
-}  // namespace code
-}  // namespace jit
-}  // namespace binops
-}  // namespace cudf
diff --git a/cpp/src/binaryop/jit/code/traits.cpp b/cpp/src/binaryop/jit/code/traits.cpp
deleted file mode 100644
index 53b980b1a02..00000000000
--- a/cpp/src/binaryop/jit/code/traits.cpp
+++ /dev/null
@@ -1,76 +0,0 @@
-/*
- * Copyright (c) 2019, NVIDIA CORPORATION.
- *
- * Copyright 2018-2019 BlazingDB, Inc.
- *     Copyright 2018 Christian Noboa Mardini <christian@blazingdb.com>
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-namespace cudf {
-namespace binops {
-namespace jit {
-namespace code {
-const char* traits =
-  R"***(
-    #pragma once
-
-    // Include Jitify's cstddef header first
-    #include <cstddef>
-
-    #include <cuda/std/climits>
-    #include <cuda/std/cstddef>
-    #include <cuda/std/limits>
-    #include <cuda/std/type_traits>
-
-    #include <cudf/wrappers/durations.hpp>
-    #include <cudf/wrappers/timestamps.hpp>
-
-    // -------------------------------------------------------------------------
-    // type_traits cannot tell the difference between float and double
-    template <typename Type>
-    constexpr bool isFloat = false;
-
-    template <typename T>
-    constexpr bool is_timestamp_v =
-        cuda::std::is_same<cudf::timestamp_D, T>::value ||
-        cuda::std::is_same<cudf::timestamp_s, T>::value ||
-        cuda::std::is_same<cudf::timestamp_ms, T>::value ||
-        cuda::std::is_same<cudf::timestamp_us, T>::value ||
-        cuda::std::is_same<cudf::timestamp_ns, T>::value;
-
-    template <typename T>
-    constexpr bool is_duration_v =
-        cuda::std::is_same<cudf::duration_D, T>::value ||
-        cuda::std::is_same<cudf::duration_s, T>::value ||
-        cuda::std::is_same<cudf::duration_ms, T>::value ||
-        cuda::std::is_same<cudf::duration_us, T>::value ||
-        cuda::std::is_same<cudf::duration_ns, T>::value;
-
-    template <typename T>
-    constexpr bool is_chrono_v = is_timestamp_v<T> || is_duration_v<T>;
-
-    template <>
-    constexpr bool isFloat<float> = true;
-
-    template <typename Type>
-    constexpr bool isDouble = false;
-
-    template <>
-    constexpr bool isDouble<double> = true;
-)***";
-
-}  // namespace code
-}  // namespace jit
-}  // namespace binops
-}  // namespace cudf
diff --git a/cpp/src/binaryop/jit/kernel.cu b/cpp/src/binaryop/jit/kernel.cu
new file mode 100644
index 00000000000..fcfe16f979d
--- /dev/null
+++ b/cpp/src/binaryop/jit/kernel.cu
@@ -0,0 +1,134 @@
+/*
+ * Copyright (c) 2019-2021, NVIDIA CORPORATION.
+ *
+ * Copyright 2018-2019 BlazingDB, Inc.
+ *     Copyright 2018 Christian Noboa Mardini <christian@blazingdb.com>
+ *     Copyright 2018 Rommel Quintanilla <rommel@blazingdb.com>
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <binaryop/jit/operation.hpp>
+
+#include <cudf/fixed_point/fixed_point.hpp>
+#include <cudf/types.hpp>
+#include <cudf/utilities/bit.hpp>
+#include <cudf/wrappers/durations.hpp>
+#include <cudf/wrappers/timestamps.hpp>
+
+namespace cudf {
+namespace binops {
+namespace jit {
+
+template <typename TypeOut, typename TypeLhs, typename TypeRhs, typename TypeOpe>
+__global__ void kernel_v_s_with_validity(cudf::size_type size,
+                                         TypeOut* out_data,
+                                         TypeLhs* lhs_data,
+                                         TypeRhs* rhs_data,
+                                         cudf::bitmask_type* output_mask,
+                                         cudf::bitmask_type const* mask,
+                                         cudf::size_type offset,
+                                         bool scalar_valid)
+{
+  int tid    = threadIdx.x;
+  int blkid  = blockIdx.x;
+  int blksz  = blockDim.x;
+  int gridsz = gridDim.x;
+
+  int start = tid + blkid * blksz;
+  int step  = blksz * gridsz;
+
+  for (cudf::size_type i = start; i < size; i += step) {
+    bool output_valid = false;
+    out_data[i]       = TypeOpe::template operate<TypeOut, TypeLhs, TypeRhs>(
+      lhs_data[i],
+      rhs_data[0],
+      mask ? cudf::bit_is_set(mask, offset + i) : true,
+      scalar_valid,
+      output_valid);
+    if (output_mask && !output_valid) cudf::clear_bit(output_mask, i);
+  }
+}
+
+template <typename TypeOut, typename TypeLhs, typename TypeRhs, typename TypeOpe>
+__global__ void kernel_v_s(cudf::size_type size,
+                           TypeOut* out_data,
+                           TypeLhs* lhs_data,
+                           TypeRhs* rhs_data)
+{
+  int tid    = threadIdx.x;
+  int blkid  = blockIdx.x;
+  int blksz  = blockDim.x;
+  int gridsz = gridDim.x;
+
+  int start = tid + blkid * blksz;
+  int step  = blksz * gridsz;
+
+  for (cudf::size_type i = start; i < size; i += step) {
+    out_data[i] = TypeOpe::template operate<TypeOut, TypeLhs, TypeRhs>(lhs_data[i], rhs_data[0]);
+  }
+}
+
+template <typename TypeOut, typename TypeLhs, typename TypeRhs, typename TypeOpe>
+__global__ void kernel_v_v(cudf::size_type size,
+                           TypeOut* out_data,
+                           TypeLhs* lhs_data,
+                           TypeRhs* rhs_data)
+{
+  int tid    = threadIdx.x;
+  int blkid  = blockIdx.x;
+  int blksz  = blockDim.x;
+  int gridsz = gridDim.x;
+
+  int start = tid + blkid * blksz;
+  int step  = blksz * gridsz;
+
+  for (cudf::size_type i = start; i < size; i += step) {
+    out_data[i] = TypeOpe::template operate<TypeOut, TypeLhs, TypeRhs>(lhs_data[i], rhs_data[i]);
+  }
+}
+
+template <typename TypeOut, typename TypeLhs, typename TypeRhs, typename TypeOpe>
+__global__ void kernel_v_v_with_validity(cudf::size_type size,
+                                         TypeOut* out_data,
+                                         TypeLhs* lhs_data,
+                                         TypeRhs* rhs_data,
+                                         cudf::bitmask_type* output_mask,
+                                         cudf::bitmask_type const* lhs_mask,
+                                         cudf::size_type lhs_offset,
+                                         cudf::bitmask_type const* rhs_mask,
+                                         cudf::size_type rhs_offset)
+{
+  int tid    = threadIdx.x;
+  int blkid  = blockIdx.x;
+  int blksz  = blockDim.x;
+  int gridsz = gridDim.x;
+
+  int start = tid + blkid * blksz;
+  int step  = blksz * gridsz;
+
+  for (cudf::size_type i = start; i < size; i += step) {
+    bool output_valid = false;
+    out_data[i]       = TypeOpe::template operate<TypeOut, TypeLhs, TypeRhs>(
+      lhs_data[i],
+      rhs_data[i],
+      lhs_mask ? cudf::bit_is_set(lhs_mask, lhs_offset + i) : true,
+      rhs_mask ? cudf::bit_is_set(rhs_mask, rhs_offset + i) : true,
+      output_valid);
+    if (output_mask && !output_valid) cudf::clear_bit(output_mask, i);
+  }
+}
+
+}  // namespace jit
+}  // namespace binops
+}  // namespace cudf
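Review note: the `*_with_validity` kernels in kernel.cu read one validity bit per input row and clear the output bit when the operation reports a null result. The host-only C++ sketch below mimics that flow for a NullMax-style operation on `int` rows. The helpers `bit_is_set`, `clear_bit`, `null_max` and `null_max_rows` are hypothetical stand-ins written for this note, assuming 32-bit mask words in which bit `i % 32` of word `i / 32` marks row `i` as valid; they are not the cudf implementations.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using bitmask_word = std::uint32_t;  // assumption: one mask word covers 32 rows

// Hypothetical host-side stand-ins for the device bit helpers used by the kernels.
inline bool bit_is_set(const bitmask_word* mask, std::size_t i)
{
  return ((mask[i / 32] >> (i % 32)) & 1u) != 0;
}

inline void clear_bit(bitmask_word* mask, std::size_t i)
{
  mask[i / 32] &= ~(bitmask_word{1} << (i % 32));
}

// Null-aware max on ints: a null input is ignored, and the result is null only
// when both inputs are null.
inline int null_max(int x, int y, bool lhs_valid, bool rhs_valid, bool& output_valid)
{
  output_valid = lhs_valid || rhs_valid;
  if (!output_valid) return 0;
  if (lhs_valid && rhs_valid) return x > y ? x : y;
  return lhs_valid ? x : y;
}

// Serial analogue of the vector-vector loop. A null input mask pointer means
// "every row is valid". out_mask is assumed to start with all bits set, since
// the loop only ever clears bits.
inline std::vector<int> null_max_rows(const std::vector<int>& lhs,
                                      const std::vector<int>& rhs,
                                      const bitmask_word* lhs_mask,
                                      const bitmask_word* rhs_mask,
                                      bitmask_word* out_mask)
{
  assert(lhs.size() == rhs.size());
  std::vector<int> out(lhs.size());
  for (std::size_t i = 0; i < lhs.size(); ++i) {
    bool output_valid = false;
    out[i] = null_max(lhs[i], rhs[i],
                      lhs_mask ? bit_is_set(lhs_mask, i) : true,
                      rhs_mask ? bit_is_set(rhs_mask, i) : true,
                      output_valid);
    if (out_mask && !output_valid) clear_bit(out_mask, i);
  }
  return out;
}
```

On the device the same loop body runs once per thread with a grid-sized stride; the serial loop here only illustrates the per-row validity handling.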
diff --git a/cpp/src/binaryop/jit/code/code.h b/cpp/src/binaryop/jit/operation-udf.hpp
similarity index 59%
rename from cpp/src/binaryop/jit/code/code.h
rename to cpp/src/binaryop/jit/operation-udf.hpp
index b8ff9e47c31..eaab2111d98 100644
--- a/cpp/src/binaryop/jit/code/code.h
+++ b/cpp/src/binaryop/jit/operation-udf.hpp
@@ -1,8 +1,5 @@
 /*
- * Copyright (c) 2019, NVIDIA CORPORATION.
- *
- * Copyright 2018-2019 BlazingDB, Inc.
- *     Copyright 2018 Christian Noboa Mardini <christian@blazingdb.com>
+ * Copyright (c) 2021, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -19,15 +16,5 @@
 
 #pragma once
 
-namespace cudf {
-namespace binops {
-namespace jit {
-namespace code {
-extern const char* kernel;
-extern const char* traits;
-extern const char* operation;
-
-}  // namespace code
-}  // namespace jit
-}  // namespace binops
-}  // namespace cudf
+// This file is a placeholder for user-defined functions, so that jitify can choose to override
+// it at runtime.
diff --git a/cpp/src/binaryop/jit/operation.hpp b/cpp/src/binaryop/jit/operation.hpp
new file mode 100644
index 00000000000..d117f2182f9
--- /dev/null
+++ b/cpp/src/binaryop/jit/operation.hpp
@@ -0,0 +1,646 @@
+/*
+ * Copyright (c) 2019-2021, NVIDIA CORPORATION.
+ *
+ * Copyright 2018-2019 BlazingDB, Inc.
+ *     Copyright 2018 Christian Noboa Mardini <christian@blazingdb.com>
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <cudf/types.hpp>
+
+#include <binaryop/jit/operation-udf.hpp>
+#include <binaryop/jit/traits.hpp>
+
+#include <cmath>
+
+#include <cuda/std/type_traits>
+
+#pragma once
+
+using namespace cuda::std;
+
+namespace cudf {
+namespace binops {
+namespace jit {
+
+struct Add {
+  // Allow a sum between chronos only when the output and both input types
+  // are chronos. Unsupported combinations will fail to compile.
+  template <
+    typename TypeOut,
+    typename TypeLhs,
+    typename TypeRhs,
+    enable_if_t<(is_chrono_v<TypeOut> && is_chrono_v<TypeLhs> && is_chrono_v<TypeRhs>)>* = nullptr>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return x + y;
+  }
+
+  template <typename TypeOut,
+            typename TypeLhs,
+            typename TypeRhs,
+            enable_if_t<(!is_chrono_v<TypeOut> || !is_chrono_v<TypeLhs> ||
+                         !is_chrono_v<TypeRhs>)>* = nullptr>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    using TypeCommon = typename common_type<TypeOut, TypeLhs, TypeRhs>::type;
+    return static_cast<TypeOut>(static_cast<TypeCommon>(x) + static_cast<TypeCommon>(y));
+  }
+};
+
+using RAdd = Add;
+
+struct Sub {
+  // Allow a difference between chronos only when the output and both input
+  // types are chronos. Unsupported combinations will fail to compile.
+  template <
+    typename TypeOut,
+    typename TypeLhs,
+    typename TypeRhs,
+    enable_if_t<(is_chrono_v<TypeOut> && is_chrono_v<TypeLhs> && is_chrono_v<TypeRhs>)>* = nullptr>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return x - y;
+  }
+
+  template <typename TypeOut,
+            typename TypeLhs,
+            typename TypeRhs,
+            enable_if_t<(!is_chrono_v<TypeOut> || !is_chrono_v<TypeLhs> ||
+                         !is_chrono_v<TypeRhs>)>* = nullptr>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    using TypeCommon = typename common_type<TypeOut, TypeLhs, TypeRhs>::type;
+    return static_cast<TypeOut>(static_cast<TypeCommon>(x) - static_cast<TypeCommon>(y));
+  }
+};
+
+struct RSub {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return Sub::operate<TypeOut, TypeRhs, TypeLhs>(y, x);
+  }
+};
+
+struct Mul {
+  template <typename TypeOut,
+            typename TypeLhs,
+            typename TypeRhs,
+            enable_if_t<(!is_duration_v<TypeOut>)>* = nullptr>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    using TypeCommon = typename common_type<TypeOut, TypeLhs, TypeRhs>::type;
+    return static_cast<TypeOut>(static_cast<TypeCommon>(x) * static_cast<TypeCommon>(y));
+  }
+
+  template <typename TypeOut,
+            typename TypeLhs,
+            typename TypeRhs,
+            enable_if_t<(is_duration_v<TypeOut>)>* = nullptr>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return DurationProduct<TypeOut>(x, y);
+  }
+
+  template <typename TypeOut,
+            typename TypeLhs,
+            typename TypeRhs,
+            enable_if_t<(is_duration_v<TypeLhs> && is_integral_v<TypeRhs>) ||
+                        (is_integral_v<TypeLhs> && is_duration_v<TypeRhs>)>* = nullptr>
+  static TypeOut DurationProduct(TypeLhs x, TypeRhs y)
+  {
+    return x * y;
+  }
+};
+
+using RMul = Mul;
+
+struct Div {
+  template <typename TypeOut,
+            typename TypeLhs,
+            typename TypeRhs,
+            enable_if_t<(!is_duration_v<TypeLhs>)>* = nullptr>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    using TypeCommon = typename common_type<TypeOut, TypeLhs, TypeRhs>::type;
+    return static_cast<TypeOut>(static_cast<TypeCommon>(x) / static_cast<TypeCommon>(y));
+  }
+
+  template <typename TypeOut,
+            typename TypeLhs,
+            typename TypeRhs,
+            enable_if_t<(is_duration_v<TypeLhs>)>* = nullptr>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return DurationDivide<TypeOut>(x, y);
+  }
+
+  template <typename TypeOut,
+            typename TypeLhs,
+            typename TypeRhs,
+            enable_if_t<(is_integral_v<TypeRhs> || is_duration_v<TypeRhs>)>* = nullptr>
+  static TypeOut DurationDivide(TypeLhs x, TypeRhs y)
+  {
+    return x / y;
+  }
+};
+
+struct RDiv {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return Div::operate<TypeOut, TypeRhs, TypeLhs>(y, x);
+  }
+};
+
+struct TrueDiv {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return (static_cast<double>(x) / static_cast<double>(y));
+  }
+};
+
+struct RTrueDiv {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return TrueDiv::operate<TypeOut, TypeRhs, TypeLhs>(y, x);
+  }
+};
+
+struct FloorDiv {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return floor(static_cast<double>(x) / static_cast<double>(y));
+  }
+};
+
+struct RFloorDiv {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return FloorDiv::operate<TypeOut, TypeRhs, TypeLhs>(y, x);
+  }
+};
+
+struct Mod {
+  template <
+    typename TypeOut,
+    typename TypeLhs,
+    typename TypeRhs,
+    enable_if_t<(is_integral_v<typename common_type<TypeOut, TypeLhs, TypeRhs>::type>)>* = nullptr>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    using TypeCommon = typename common_type<TypeOut, TypeLhs, TypeRhs>::type;
+    return static_cast<TypeOut>(static_cast<TypeCommon>(x) % static_cast<TypeCommon>(y));
+  }
+
+  template <
+    typename TypeOut,
+    typename TypeLhs,
+    typename TypeRhs,
+    enable_if_t<(isFloat<typename common_type<TypeOut, TypeLhs, TypeRhs>::type>)>* = nullptr>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return static_cast<TypeOut>(fmodf(static_cast<float>(x), static_cast<float>(y)));
+  }
+
+  template <
+    typename TypeOut,
+    typename TypeLhs,
+    typename TypeRhs,
+    enable_if_t<(isDouble<typename common_type<TypeOut, TypeLhs, TypeRhs>::type>)>* = nullptr>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return static_cast<TypeOut>(fmod(static_cast<double>(x), static_cast<double>(y)));
+  }
+
+  template <typename TypeOut,
+            typename TypeLhs,
+            typename TypeRhs,
+            enable_if_t<(is_duration_v<TypeLhs> && is_duration_v<TypeOut>)>* = nullptr>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return x % y;
+  }
+};
+
+struct RMod {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return Mod::operate<TypeOut, TypeRhs, TypeLhs>(y, x);
+  }
+};
+
+struct PyMod {
+  template <typename TypeOut,
+            typename TypeLhs,
+            typename TypeRhs,
+            enable_if_t<(is_integral_v<TypeOut>)>* = nullptr>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return ((x % y) + y) % y;
+  }
+
+  template <typename TypeOut,
+            typename TypeLhs,
+            typename TypeRhs,
+            enable_if_t<(is_floating_point_v<TypeOut>)>* = nullptr>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    double x1 = static_cast<double>(x);
+    double y1 = static_cast<double>(y);
+    return fmod(fmod(x1, y1) + y1, y1);
+  }
+
+  template <typename TypeOut,
+            typename TypeLhs,
+            typename TypeRhs,
+            enable_if_t<(is_duration_v<TypeLhs> && is_duration_v<TypeOut>)>* = nullptr>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return ((x % y) + y) % y;
+  }
+};
+
+struct RPyMod {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return PyMod::operate<TypeOut, TypeRhs, TypeLhs>(y, x);
+  }
+};
+
+struct Pow {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return pow(static_cast<double>(x), static_cast<double>(y));
+  }
+};
+
+struct RPow {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return Pow::operate<TypeOut, TypeRhs, TypeLhs>(y, x);
+  }
+};
+
+struct Equal {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return (x == y);
+  }
+};
+
+using REqual = Equal;
+
+struct NotEqual {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return (x != y);
+  }
+};
+
+using RNotEqual = NotEqual;
+
+struct Less {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return (x < y);
+  }
+};
+
+struct RLess {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return (y < x);
+  }
+};
+
+struct Greater {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return (x > y);
+  }
+};
+
+struct RGreater {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return (y > x);
+  }
+};
+
+struct LessEqual {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return (x <= y);
+  }
+};
+
+struct RLessEqual {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return (y <= x);
+  }
+};
+
+struct GreaterEqual {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return (x >= y);
+  }
+};
+
+struct RGreaterEqual {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return (y >= x);
+  }
+};
+
+struct BitwiseAnd {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return (static_cast<TypeOut>(x) & static_cast<TypeOut>(y));
+  }
+};
+
+using RBitwiseAnd = BitwiseAnd;
+
+struct BitwiseOr {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return (static_cast<TypeOut>(x) | static_cast<TypeOut>(y));
+  }
+};
+
+using RBitwiseOr = BitwiseOr;
+
+struct BitwiseXor {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return (static_cast<TypeOut>(x) ^ static_cast<TypeOut>(y));
+  }
+};
+
+using RBitwiseXor = BitwiseXor;
+
+struct LogicalAnd {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return (x && y);
+  }
+};
+
+using RLogicalAnd = LogicalAnd;
+
+struct LogicalOr {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return (x || y);
+  }
+};
+
+using RLogicalOr = LogicalOr;
+
+struct UserDefinedOp {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    TypeOut output;
+    using TypeCommon = typename common_type<TypeOut, TypeLhs, TypeRhs>::type;
+    GENERIC_BINARY_OP(&output, static_cast<TypeCommon>(x), static_cast<TypeCommon>(y));
+    return output;
+  }
+};
+
+struct ShiftLeft {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return (x << y);
+  }
+};
+
+struct RShiftLeft {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return (y << x);
+  }
+};
+
+struct ShiftRight {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return (x >> y);
+  }
+};
+
+struct RShiftRight {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return (y >> x);
+  }
+};
+
+struct ShiftRightUnsigned {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return (static_cast<make_unsigned_t<TypeLhs>>(x) >> y);
+  }
+};
+
+struct RShiftRightUnsigned {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return (static_cast<make_unsigned_t<TypeRhs>>(y) >> x);
+  }
+};
+
+struct LogBase {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return (std::log(static_cast<double>(x)) / std::log(static_cast<double>(y)));
+  }
+};
+
+struct RLogBase {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return LogBase::operate<TypeOut, TypeRhs, TypeLhs>(y, x);
+  }
+};
+
+struct NullEquals {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y, bool lhs_valid, bool rhs_valid, bool& output_valid)
+  {
+    output_valid = true;
+    if (!lhs_valid && !rhs_valid) return true;
+    if (lhs_valid && rhs_valid) return x == y;
+    return false;
+  }
+};
+
+struct RNullEquals {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y, bool lhs_valid, bool rhs_valid, bool& output_valid)
+  {
+    output_valid = true;
+    return NullEquals::operate<TypeOut, TypeRhs, TypeLhs>(y, x, rhs_valid, lhs_valid, output_valid);
+  }
+};
+
+struct NullMax {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y, bool lhs_valid, bool rhs_valid, bool& output_valid)
+  {
+    output_valid = true;
+    if (!lhs_valid && !rhs_valid) {
+      output_valid = false;
+      return TypeOut{};
+    } else if (lhs_valid && rhs_valid) {
+      return (TypeOut{x} > TypeOut{y}) ? TypeOut{x} : TypeOut{y};
+    } else if (lhs_valid)
+      return TypeOut{x};
+    else
+      return TypeOut{y};
+  }
+};
+
+struct RNullMax {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y, bool lhs_valid, bool rhs_valid, bool& output_valid)
+  {
+    return NullMax::operate<TypeOut, TypeRhs, TypeLhs>(y, x, rhs_valid, lhs_valid, output_valid);
+  }
+};
+
+struct NullMin {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y, bool lhs_valid, bool rhs_valid, bool& output_valid)
+  {
+    output_valid = true;
+    if (!lhs_valid && !rhs_valid) {
+      output_valid = false;
+      return TypeOut{};
+    } else if (lhs_valid && rhs_valid) {
+      return (TypeOut{x} < TypeOut{y}) ? TypeOut{x} : TypeOut{y};
+    } else if (lhs_valid)
+      return TypeOut{x};
+    else
+      return TypeOut{y};
+  }
+};
+
+struct RNullMin {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y, bool lhs_valid, bool rhs_valid, bool& output_valid)
+  {
+    return NullMin::operate<TypeOut, TypeRhs, TypeLhs>(y, x, rhs_valid, lhs_valid, output_valid);
+  }
+};
+
+struct PMod {
+  // Ideally, these two overloads (one for integral types and one for non-integral types) would
+  // not be required, because std::fmod should promote integral arguments to double and call the
+  // double overload. Sadly, this does not work in jitified code, which has trouble choosing
+  // between the float and double overloads of std::fmod.
+  template <typename TypeOut,
+            typename TypeLhs,
+            typename TypeRhs,
+            enable_if_t<(is_integral_v<typename cuda::std::common_type<TypeLhs, TypeRhs>::type>)>* =
+              nullptr>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    using common_t = typename cuda::std::common_type<TypeLhs, TypeRhs>::type;
+    common_t xconv{x};
+    common_t yconv{y};
+    auto rem = xconv % yconv;
+    if (rem < 0) rem = (rem + yconv) % yconv;
+    return TypeOut{rem};
+  }
+
+  template <typename TypeOut,
+            typename TypeLhs,
+            typename TypeRhs,
+            enable_if_t<
+              !(is_integral_v<typename cuda::std::common_type<TypeLhs, TypeRhs>::type>)>* = nullptr>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    using common_t = typename cuda::std::common_type<TypeLhs, TypeRhs>::type;
+    common_t xconv{x};
+    common_t yconv{y};
+    auto rem = std::fmod(xconv, yconv);
+    if (rem < 0) rem = std::fmod(rem + yconv, yconv);
+    return TypeOut{rem};
+  }
+};
+
+struct RPMod {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return PMod::operate<TypeOut, TypeRhs, TypeLhs>(y, x);
+  }
+};
+
+struct ATan2 {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return TypeOut{std::atan2(double{x}, double{y})};
+  }
+};
+
+struct RATan2 {
+  template <typename TypeOut, typename TypeLhs, typename TypeRhs>
+  static TypeOut operate(TypeLhs x, TypeRhs y)
+  {
+    return TypeOut{ATan2::operate<TypeOut, TypeRhs, TypeLhs>(y, x)};
+  }
+};
+
+}  // namespace jit
+}  // namespace binops
+}  // namespace cudf
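Review note: `PyMod` and `PMod` in operation.hpp exist alongside `Mod` because the built-in C++ `%` truncates toward zero, so the remainder takes the sign of the dividend. The host-only sketch below restates the two integral formulas from the patch on plain `int` operands. The free functions `py_mod` and `p_mod` are hypothetical names used for this note and are not part of cudf.

```cpp
#include <cassert>

// Formula used by the integral PyMod overload: the result takes the sign of the
// divisor, matching Python's % for ints.
inline int py_mod(int x, int y)
{
  assert(y != 0);
  return ((x % y) + y) % y;
}

// Formula used by the integral PMod overload: only a negative remainder is
// adjusted, so a positive divisor always yields a non-negative result.
inline int p_mod(int x, int y)
{
  assert(y != 0);
  int rem = x % y;
  if (rem < 0) rem = (rem + y) % y;
  return rem;
}
```

For a positive divisor the two agree: `py_mod(-7, 3)` and `p_mod(-7, 3)` both give 2, where `-7 % 3` gives -1. They differ for a negative divisor: `py_mod(7, -3)` is -2, while `p_mod(7, -3)` leaves the truncated remainder 1 untouched.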
diff --git a/cpp/src/binaryop/jit/traits.hpp b/cpp/src/binaryop/jit/traits.hpp
new file mode 100644
index 00000000000..1cca2b6e155
--- /dev/null
+++ b/cpp/src/binaryop/jit/traits.hpp
@@ -0,0 +1,70 @@
+/*
+ * Copyright (c) 2019-2021, NVIDIA CORPORATION.
+ *
+ * Copyright 2018-2019 BlazingDB, Inc.
+ *     Copyright 2018 Christian Noboa Mardini <christian@blazingdb.com>
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#pragma once
+
+// Include Jitify's cstddef header first
+#include <cstddef>
+
+#include <cuda/std/climits>
+#include <cuda/std/cstddef>
+#include <cuda/std/limits>
+#include <cuda/std/type_traits>
+
+#include <cudf/wrappers/durations.hpp>
+#include <cudf/wrappers/timestamps.hpp>
+
+namespace cudf {
+namespace binops {
+namespace jit {
+
+// -------------------------------------------------------------------------
+// isFloat and isDouble: is_floating_point alone cannot tell float from double
+template <typename Type>
+constexpr bool isFloat = false;
+
+template <typename T>
+constexpr bool is_timestamp_v = cuda::std::is_same<cudf::timestamp_D, T>::value ||
+                                cuda::std::is_same<cudf::timestamp_s, T>::value ||
+                                cuda::std::is_same<cudf::timestamp_ms, T>::value ||
+                                cuda::std::is_same<cudf::timestamp_us, T>::value ||
+                                cuda::std::is_same<cudf::timestamp_ns, T>::value;
+
+template <typename T>
+constexpr bool is_duration_v = cuda::std::is_same<cudf::duration_D, T>::value ||
+                               cuda::std::is_same<cudf::duration_s, T>::value ||
+                               cuda::std::is_same<cudf::duration_ms, T>::value ||
+                               cuda::std::is_same<cudf::duration_us, T>::value ||
+                               cuda::std::is_same<cudf::duration_ns, T>::value;
+
+template <typename T>
+constexpr bool is_chrono_v = is_timestamp_v<T> || is_duration_v<T>;
+
+template <>
+constexpr bool isFloat<float> = true;
+
+template <typename Type>
+constexpr bool isDouble = false;
+
+template <>
+constexpr bool isDouble<double> = true;
+
+}  // namespace jit
+}  // namespace binops
+}  // namespace cudf
diff --git a/cpp/src/binaryop/jit/util.hpp b/cpp/src/binaryop/jit/util.hpp
index 6b4085bf11b..34c42e28a8b 100644
--- a/cpp/src/binaryop/jit/util.hpp
+++ b/cpp/src/binaryop/jit/util.hpp
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2019, NVIDIA CORPORATION.
+ * Copyright (c) 2019-2021, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -72,11 +72,15 @@ std::string inline get_operator_name(binary_operator op, OperatorType type)
       case binary_operator::NULL_EQUALS:          return "NullEquals";
       case binary_operator::NULL_MAX:             return "NullMax";
       case binary_operator::NULL_MIN:             return "NullMin";
-      default:                              return "None";
+      default:                                    return "";
     }
     // clang-format on
   }();
-  return type == OperatorType::Direct ? operator_name : 'R' + operator_name;
+
+  if (operator_name.empty()) { return "None"; }
+
+  return "cudf::binops::jit::" +
+         (type == OperatorType::Direct ? operator_name : 'R' + operator_name);
 }
 
 }  // namespace jit
diff --git a/cpp/src/copying/concatenate.cu b/cpp/src/copying/concatenate.cu
index 1b948083982..e87cadbffe8 100644
--- a/cpp/src/copying/concatenate.cu
+++ b/cpp/src/copying/concatenate.cu
@@ -455,7 +455,8 @@ rmm::device_buffer concatenate_masks(host_span<column_view const> views,
     rmm::device_buffer null_mask =
       create_null_mask(total_element_count, mask_state::UNINITIALIZED, mr);
 
-    detail::concatenate_masks(views, static_cast<bitmask_type*>(null_mask.data()), 0);
+    detail::concatenate_masks(
+      views, static_cast<bitmask_type*>(null_mask.data()), rmm::cuda_stream_default);
 
     return null_mask;
   }
diff --git a/cpp/src/copying/copy.cu b/cpp/src/copying/copy.cu
index e6adc027acc..fecf7d18d46 100644
--- a/cpp/src/copying/copy.cu
+++ b/cpp/src/copying/copy.cu
@@ -31,7 +31,11 @@ namespace {
 
 template <typename T, typename Enable = void>
 struct copy_if_else_functor_impl {
-  std::unique_ptr<column> operator()(...) { CUDF_FAIL("Unsupported type for copy_if_else."); }
+  template <typename... Args>
+  std::unique_ptr<column> operator()(Args&&...)
+  {
+    CUDF_FAIL("Unsupported type for copy_if_else.");
+  }
 };
 
 template <typename T>
diff --git a/cpp/src/copying/pack.cpp b/cpp/src/copying/pack.cpp
index 38c95da6dc7..182e3ff0584 100644
--- a/cpp/src/copying/pack.cpp
+++ b/cpp/src/copying/pack.cpp
@@ -17,7 +17,7 @@
 #include <cudf/detail/copy.hpp>
 #include <cudf/detail/nvtx/ranges.hpp>
 
-#include <jit/type.h>
+#include <rmm/cuda_stream_view.hpp>
 
 namespace cudf {
 namespace detail {
@@ -218,7 +218,7 @@ table_view unpack(uint8_t const* metadata, uint8_t const* gpu_data)
 packed_columns pack(cudf::table_view const& input, rmm::mr::device_memory_resource* mr)
 {
   CUDF_FUNC_RANGE();
-  return detail::pack(input, 0, mr);
+  return detail::pack(input, rmm::cuda_stream_default, mr);
 }
 
 /**
diff --git a/cpp/src/hash/hashing.cu b/cpp/src/hash/hashing.cu
index 16efb666b3e..53be019f73b 100644
--- a/cpp/src/hash/hashing.cu
+++ b/cpp/src/hash/hashing.cu
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2019-2020, NVIDIA CORPORATION.
+ * Copyright (c) 2019-2021, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -29,6 +29,8 @@
 
 #include <rmm/cuda_stream_view.hpp>
 
+#include <algorithm>
+
 namespace cudf {
 namespace {
 
@@ -38,6 +40,22 @@ bool md5_type_check(data_type dt)
   return !is_chrono(dt) && (is_fixed_width(dt) || (dt.id() == type_id::STRING));
 }
 
+template <typename IterType>
+std::vector<column_view> to_leaf_columns(IterType iter_begin, IterType iter_end)
+{
+  std::vector<column_view> leaf_columns;
+  std::for_each(iter_begin, iter_end, [&leaf_columns](column_view const& col) {
+    if (is_nested(col.type())) {
+      CUDF_EXPECTS(col.type().id() == type_id::STRUCT, "unsupported nested type");
+      auto child_columns = to_leaf_columns(col.child_begin(), col.child_end());
+      leaf_columns.insert(leaf_columns.end(), child_columns.begin(), child_columns.end());
+    } else {
+      leaf_columns.emplace_back(col);
+    }
+  });
+  return leaf_columns;
+}
+
 }  // namespace
 
 namespace detail {
@@ -133,10 +151,11 @@ std::unique_ptr<column> serial_murmur_hash3_32(table_view const& input,
 
   if (input.num_columns() == 0 || input.num_rows() == 0) { return output; }
 
-  auto const device_input = table_device_view::create(input, stream);
+  table_view const leaf_table(to_leaf_columns(input.begin(), input.end()));
+  auto const device_input = table_device_view::create(leaf_table, stream);
   auto output_view        = output->mutable_view();
 
-  if (has_nulls(input)) {
+  if (has_nulls(leaf_table)) {
     thrust::tabulate(rmm::exec_policy(stream),
                      output_view.begin<int32_t>(),
                      output_view.end<int32_t>(),
diff --git a/cpp/src/io/csv/csv_gpu.cu b/cpp/src/io/csv/csv_gpu.cu
index 86e5f1fdcae..44acc7fc55f 100644
--- a/cpp/src/io/csv/csv_gpu.cu
+++ b/cpp/src/io/csv/csv_gpu.cu
@@ -196,7 +196,7 @@ __global__ void __launch_bounds__(csvparse_block_dim)
       } else if (serialized_trie_contains(opts.trie_true, {field_start, field_len}) ||
                  serialized_trie_contains(opts.trie_false, {field_start, field_len})) {
         atomicAdd(&d_columnData[actual_col].bool_count, 1);
-      } else if (cudf::io::gpu::is_infinity(field_start, next_delimiter)) {
+      } else if (cudf::io::is_infinity(field_start, next_delimiter)) {
         atomicAdd(&d_columnData[actual_col].float_count, 1);
       } else {
         long countNumber   = 0;
@@ -277,7 +277,7 @@ __inline__ __device__ T decode_value(char const *begin,
                                      char const *end,
                                      parse_options_view const &opts)
 {
-  return cudf::io::gpu::parse_numeric<T, base>(begin, end, opts);
+  return cudf::io::parse_numeric<T, base>(begin, end, opts);
 }
 
 template <typename T>
@@ -285,7 +285,7 @@ __inline__ __device__ T decode_value(char const *begin,
                                      char const *end,
                                      parse_options_view const &opts)
 {
-  return cudf::io::gpu::parse_numeric<T>(begin, end, opts);
+  return cudf::io::parse_numeric<T>(begin, end, opts);
 }
 
 template <>
diff --git a/cpp/src/io/csv/writer_impl.cu b/cpp/src/io/csv/writer_impl.cu
index f7e153d71f4..fcab4190071 100644
--- a/cpp/src/io/csv/writer_impl.cu
+++ b/cpp/src/io/csv/writer_impl.cu
@@ -37,6 +37,7 @@
 
 #include <rmm/cuda_stream_view.hpp>
 #include <rmm/exec_policy.hpp>
+#include <rmm/mr/device/per_device_resource.hpp>
 
 #include <thrust/count.h>
 #include <thrust/execution_policy.h>
@@ -211,10 +212,11 @@ struct column_to_strings_fn {
                (cudf::is_timestamp<column_type>()) || (cudf::is_duration<column_type>()));
   }
 
-  explicit column_to_strings_fn(csv_writer_options const& options,
-                                rmm::mr::device_memory_resource* mr = nullptr,
-                                rmm::cuda_stream_view stream        = nullptr)
-    : options_(options), mr_(mr), stream_(stream)
+  explicit column_to_strings_fn(
+    csv_writer_options const& options,
+    rmm::cuda_stream_view stream        = rmm::cuda_stream_default,
+    rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource())
+    : options_(options), stream_(stream), mr_(mr)
   {
   }
 
@@ -345,8 +347,8 @@ struct column_to_strings_fn {
 
  private:
   csv_writer_options const& options_;
-  rmm::mr::device_memory_resource* mr_;
   rmm::cuda_stream_view stream_;
+  rmm::mr::device_memory_resource* mr_;
 };
 }  // unnamed namespace
 
@@ -495,7 +497,7 @@ void writer::impl::write(table_view const& table,
 
     // convert each chunk to CSV:
     //
-    column_to_strings_fn converter{options_, mr_};
+    column_to_strings_fn converter{options_, stream, mr_};
     for (auto&& sub_view : vector_views) {
       // Skip if the table has no rows
       if (sub_view.num_rows() == 0) continue;
diff --git a/cpp/src/io/json/json_gpu.cu b/cpp/src/io/json/json_gpu.cu
index 5efb64fd4d5..75910ae6b5b 100644
--- a/cpp/src/io/json/json_gpu.cu
+++ b/cpp/src/io/json/json_gpu.cu
@@ -114,7 +114,7 @@ __inline__ __device__ T decode_value(const char *begin,
                                      uint64_t end,
                                      parse_options_view const &opts)
 {
-  return cudf::io::gpu::parse_numeric<T, base>(begin, end, opts);
+  return cudf::io::parse_numeric<T, base>(begin, end, opts);
 }
 
 /**
@@ -131,7 +131,7 @@ __inline__ __device__ T decode_value(const char *begin,
                                      const char *end,
                                      parse_options_view const &opts)
 {
-  return cudf::io::gpu::parse_numeric<T>(begin, end, opts);
+  return cudf::io::parse_numeric<T>(begin, end, opts);
 }
 
 /**
diff --git a/cpp/src/io/utilities/column_buffer.hpp b/cpp/src/io/utilities/column_buffer.hpp
index 88444d41206..75e9a4c18df 100644
--- a/cpp/src/io/utilities/column_buffer.hpp
+++ b/cpp/src/io/utilities/column_buffer.hpp
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2019-2020, NVIDIA CORPORATION.
+ * Copyright (c) 2019-2021, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -28,8 +28,6 @@
 #include <cudf/types.hpp>
 #include <cudf/utilities/traits.hpp>
 
-#include <cudf_test/column_utilities.hpp>
-
 #include <rmm/cuda_stream_view.hpp>
 #include <rmm/device_buffer.hpp>
 #include <rmm/device_uvector.hpp>
diff --git a/cpp/src/io/utilities/file_io_utilities.hpp b/cpp/src/io/utilities/file_io_utilities.hpp
index 0119484aee5..8a742076338 100644
--- a/cpp/src/io/utilities/file_io_utilities.hpp
+++ b/cpp/src/io/utilities/file_io_utilities.hpp
@@ -18,13 +18,13 @@
 
 #ifdef CUFILE_FOUND
 #include <cufile.h>
+#include <cudf_test/file_utilities.hpp>
 #endif
 
 #include <rmm/cuda_stream_view.hpp>
 
 #include <cudf/io/datasource.hpp>
 #include <cudf/utilities/error.hpp>
-#include <cudf_test/file_utilities.hpp>
 
 #include <string>
 
diff --git a/cpp/src/io/utilities/parsing_utils.cuh b/cpp/src/io/utilities/parsing_utils.cuh
index 584d2c9a74a..b7719cba580 100644
--- a/cpp/src/io/utilities/parsing_utils.cuh
+++ b/cpp/src/io/utilities/parsing_utils.cuh
@@ -20,6 +20,8 @@
 #include <cudf/io/types.hpp>
 #include <cudf/utilities/span.hpp>
 
+#include <io/utilities/column_type_histogram.hpp>
+
 #include <rmm/device_vector.hpp>
 
 using cudf::device_span;
@@ -82,67 +84,6 @@ struct parse_options {
   }
 };
 
-namespace gpu {
-/**
- * @brief CUDA kernel iterates over the data until the end of the current field
- *
- * Also iterates over (one or more) delimiter characters after the field.
- * Function applies to formats with field delimiters and line terminators.
- *
- * @param begin Pointer to the first element of the string
- * @param end Pointer to the first element after the string
- * @param opts A set of parsing options
- * @param escape_char A boolean value to signify whether to consider `\` as escape character or
- * just a character.
- *
- * @return Pointer to the last character in the field, including the
- *  delimiter(s) following the field data
- */
-__device__ __inline__ char const* seek_field_end(char const* begin,
-                                                 char const* end,
-                                                 parse_options_view const& opts,
-                                                 bool escape_char = false)
-{
-  bool quotation   = false;
-  auto current     = begin;
-  bool escape_next = false;
-  while (true) {
-    // Use simple logic to ignore control chars between any quote seq
-    // Handles nominal cases including doublequotes within quotes, but
-    // may not output exact failures as PANDAS for malformed fields.
-    // Check for instances such as "a2\"bc" and "\\" if `escape_char` is true.
-
-    if (*current == opts.quotechar and not escape_next) {
-      quotation = !quotation;
-    } else if (!quotation) {
-      if (*current == opts.delimiter) {
-        while (opts.multi_delimiter && current < end && *(current + 1) == opts.delimiter) {
-          ++current;
-        }
-        break;
-      } else if (*current == opts.terminator) {
-        break;
-      } else if (*current == '\r' && (current + 1 < end && *(current + 1) == '\n')) {
-        --end;
-        break;
-      }
-    }
-
-    if (escape_char == true) {
-      // If a escape character is encountered, escape next character in next loop.
-      if (escape_next == false and *current == '\\') {
-        escape_next = true;
-      } else {
-        escape_next = false;
-      }
-    }
-
-    if (current >= end) break;
-    current++;
-  }
-  return current;
-}
-
 /**
  * @brief Returns the numeric value of an ASCII/UTF-8 character. Specialization
  * for integral types. Handles hexadecimal digits, both uppercase and lowercase.
@@ -155,7 +96,7 @@ __device__ __inline__ char const* seek_field_end(char const* begin,
  * @return uint8_t Numeric value of the character, or `0`
  */
 template <typename T, typename std::enable_if_t<std::is_integral<T>::value>* = nullptr>
-__device__ __forceinline__ uint8_t decode_digit(char c, bool* valid_flag)
+constexpr uint8_t decode_digit(char c, bool* valid_flag)
 {
   if (c >= '0' && c <= '9') return c - '0';
   if (c >= 'a' && c <= 'f') return c - 'a' + 10;
@@ -176,7 +117,7 @@ __device__ __forceinline__ uint8_t decode_digit(char c, bool* valid_flag)
  * @return uint8_t Numeric value of the character, or `0`
  */
 template <typename T, typename std::enable_if_t<!std::is_integral<T>::value>* = nullptr>
-__device__ __forceinline__ uint8_t decode_digit(char c, bool* valid_flag)
+constexpr uint8_t decode_digit(char c, bool* valid_flag)
 {
   if (c >= '0' && c <= '9') return c - '0';
 
@@ -185,10 +126,7 @@ __device__ __forceinline__ uint8_t decode_digit(char c, bool* valid_flag)
 }
 
 // Converts character to lowercase.
-__inline__ __device__ char to_lower(char const c)
-{
-  return c >= 'A' && c <= 'Z' ? c + ('a' - 'A') : c;
-}
+constexpr char to_lower(char const c) { return c >= 'A' && c <= 'Z' ? c + ('a' - 'A') : c; }
 
 /**
  * @brief Checks if string is infinity, case insensitive with/without sign
@@ -199,7 +137,7 @@ __inline__ __device__ char to_lower(char const c)
  * @param end Pointer to the first element after the string
  * @return true if string is valid infinity, else false.
  */
-__inline__ __device__ bool is_infinity(char const* begin, char const* end)
+constexpr bool is_infinity(char const* begin, char const* end)
 {
   if (*begin == '-' || *begin == '+') begin++;
   char const* cinf = "infinity";
@@ -223,9 +161,10 @@ __inline__ __device__ bool is_infinity(char const* begin, char const* end)
  * @return The parsed and converted value
  */
 template <typename T, int base = 10>
-__inline__ __device__ T parse_numeric(const char* begin,
-                                      const char* end,
-                                      parse_options_view const& opts)
+constexpr T parse_numeric(const char* begin,
+                          const char* end,
+                          parse_options_view const& opts,
+                          T error_result = std::numeric_limits<T>::quiet_NaN())
 {
   T value{};
   bool all_digits_valid = true;
@@ -281,11 +220,72 @@ __inline__ __device__ T parse_numeric(const char* begin,
       if (exponent != 0) { value *= exp10(double(exponent * exponent_sign)); }
     }
   }
-  if (!all_digits_valid) { return std::numeric_limits<T>::quiet_NaN(); }
+  if (!all_digits_valid) { return error_result; }
 
   return value * sign;
 }
 
+namespace gpu {
+/**
+ * @brief Device function that iterates over the data until the end of the current field
+ *
+ * Also iterates over (one or more) delimiter characters after the field.
+ * Function applies to formats with field delimiters and line terminators.
+ *
+ * @param begin Pointer to the first element of the string
+ * @param end Pointer to the first element after the string
+ * @param opts A set of parsing options
+ * @param escape_char A boolean value to signify whether to consider `\` as escape character or
+ * just a character.
+ *
+ * @return Pointer to the last character in the field, including the
+ *  delimiter(s) following the field data
+ */
+__device__ __inline__ char const* seek_field_end(char const* begin,
+                                                 char const* end,
+                                                 parse_options_view const& opts,
+                                                 bool escape_char = false)
+{
+  bool quotation   = false;
+  auto current     = begin;
+  bool escape_next = false;
+  while (true) {
+    // Use simple logic to ignore control chars between any quote seq
+    // Handles nominal cases including doublequotes within quotes, but
+    // may not output exact failures as PANDAS for malformed fields.
+    // Check for instances such as "a2\"bc" and "\\" if `escape_char` is true.
+
+    if (*current == opts.quotechar and not escape_next) {
+      quotation = !quotation;
+    } else if (!quotation) {
+      if (*current == opts.delimiter) {
+        while (opts.multi_delimiter && current < end && *(current + 1) == opts.delimiter) {
+          ++current;
+        }
+        break;
+      } else if (*current == opts.terminator) {
+        break;
+      } else if (*current == '\r' && (current + 1 < end && *(current + 1) == '\n')) {
+        --end;
+        break;
+      }
+    }
+
+    if (escape_char == true) {
+      // If an escape character is encountered, escape the next character in the next iteration.
+      if (escape_next == false and *current == '\\') {
+        escape_next = true;
+      } else {
+        escape_next = false;
+      }
+    }
+
+    if (current >= end) break;
+    current++;
+  }
+  return current;
+}
+
 /**
  * @brief Lexicographically compare digits in input against string
  * representing an integer
diff --git a/cpp/src/jit/cache.cpp b/cpp/src/jit/cache.cpp
index c634aa8d06b..cb401c184ee 100644
--- a/cpp/src/jit/cache.cpp
+++ b/cpp/src/jit/cache.cpp
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2019-2020, NVIDIA CORPORATION.
+ * Copyright (c) 2019-2021, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -14,20 +14,15 @@
  * limitations under the License.
  */
 
-#include <jit/cache.h>
 #include <cudf/utilities/error.hpp>
 
-#include <errno.h>
-#include <fcntl.h>
-#include <pwd.h>
-#include <stdio.h>
-#include <unistd.h>
-#include <boost/filesystem.hpp>
-
 #include <cuda.h>
+#include <boost/filesystem.hpp>
+#include <jitify2.hpp>
 
 namespace cudf {
 namespace jit {
+
 // Get the directory in home to use for storing the cache
 boost::filesystem::path get_user_home_cache_dir()
 {
@@ -62,7 +57,7 @@ boost::filesystem::path get_user_home_cache_dir()
  * are used and if $HOME is not defined, returns an empty path and file
  * caching is not used.
  */
-boost::filesystem::path getCacheDir()
+boost::filesystem::path get_cache_dir()
 {
   // The environment variable always overrides the
   // default/compile-time value of `LIBCUDF_KERNEL_CACHE_PATH`
@@ -98,158 +93,33 @@ boost::filesystem::path getCacheDir()
   return kernel_cache_path;
 }
 
-cudfJitCache::cudfJitCache() {}
-
-cudfJitCache::~cudfJitCache() {}
-
-std::mutex cudfJitCache::_kernel_cache_mutex;
-std::mutex cudfJitCache::_program_cache_mutex;
-
-named_prog<jitify::experimental::Program> cudfJitCache::getProgram(
-  std::string const& prog_name,
-  std::string const& cuda_source,
-  std::vector<std::string> const& given_headers,
-  std::vector<std::string> const& given_options,
-  jitify::experimental::file_callback_type file_callback)
-{
-  // Lock for thread safety
-  std::lock_guard<std::mutex> lock(_program_cache_mutex);
-
-  return getCached(prog_name, program_map, [&]() {
-    CUDF_EXPECTS(not cuda_source.empty(), "Program not found in cache, Needs source string.");
-    return jitify::experimental::Program(cuda_source, given_headers, given_options, file_callback);
-  });
-}
-
-named_prog<jitify::experimental::KernelInstantiation> cudfJitCache::getKernelInstantiation(
-  std::string const& kern_name,
-  named_prog<jitify::experimental::Program> const& named_program,
-  std::vector<std::string> const& arguments)
-{
-  // Lock for thread safety
-  std::lock_guard<std::mutex> lock(_kernel_cache_mutex);
-
-  std::string prog_name                  = std::get<0>(named_program);
-  jitify::experimental::Program& program = *std::get<1>(named_program);
-
-  // Make instance name e.g. "prog_binop.kernel_v_v_int_int_long int_Add"
-  std::string kern_inst_name = prog_name + '.' + kern_name;
-  for (auto&& arg : arguments) kern_inst_name += '_' + arg;
-
-  CUcontext c;
-  cuCtxGetCurrent(&c);
-
-  auto& kernel_inst_map = kernel_inst_context_map[c];
-
-  return getCached(kern_inst_name, kernel_inst_map, [&]() {
-    return program.kernel(kern_name).instantiate(arguments);
-  });
-}
-
-// Another overload for getKernelInstantiation which might be useful to get
-// kernel instantiations in one step
-// ------------------------------------------------------------------------
-/*
-jitify::experimental::KernelInstantiation cudfJitCache::getKernelInstantiation(
-    std::string const& kern_name,
-    std::string const& prog_name,
-    std::string const& cuda_source = "",
-    std::vector<std::string> const& given_headers = {},
-    std::vector<std::string> const& given_options = {},
-    file_callback_type file_callback = nullptr)
-{
-    auto program = getProgram(prog_name,
-                              cuda_source,
-                              given_headers,
-                              given_options,
-                              file_callback);
-    return getKernelInstantiation(kern_name, program);
-}
-*/
-
-cudfJitCache::cacheFile::cacheFile(std::string file_name) : _file_name{file_name} {}
-
-cudfJitCache::cacheFile::~cacheFile() {}
-
-std::string cudfJitCache::cacheFile::read()
+std::string get_program_cache_dir()
 {
-  // Open file (duh)
-  int fd = open(_file_name.c_str(), O_RDWR);
-  if (fd == -1) {
-    successful_read = false;
-    return std::string();
-  }
-
-  // Create args for file locking
-  flock fl{};
-  fl.l_type   = F_RDLCK;  // Shared lock for reading
-  fl.l_whence = SEEK_SET;
-
-  // Lock the file descriptor. Only reading is allowed now
-  if (fcntl(fd, F_SETLKW, &fl) == -1) {
-    successful_read = false;
-    return std::string();
-  }
-
-  // Get file descriptor from file pointer
-  FILE* fp = fdopen(fd, "rb");
-
-  // Get file length
-  fseek(fp, 0L, SEEK_END);
-  size_t file_size = ftell(fp);
-  rewind(fp);
-
-  // Allocate memory of file length size
-  std::string content;
-  content.resize(file_size);
-  char* buffer = &content[0];
-
-  // Copy file into buffer
-  if (fread(buffer, file_size, 1, fp) != 1) {
-    successful_read = false;
-    fclose(fp);
-    free(buffer);
-    return std::string();
-  }
-  fclose(fp);
-  successful_read = true;
-
-  return content;
+#if defined(JITIFY_USE_CACHE)
+  return get_cache_dir().string();
+#else
+  return {};
+#endif
 }
 
-void cudfJitCache::cacheFile::write(std::string content)
+jitify2::ProgramCache<>& get_program_cache(jitify2::PreprocessedProgramData preprog)
 {
-  // Open file and create if it doesn't exist, with access 0600
-  int fd = open(_file_name.c_str(), O_RDWR | O_CREAT, S_IRUSR | S_IWUSR);
-  if (fd == -1) {
-    successful_write = false;
-    return;
-  }
+  static std::mutex caches_mutex{};
+  static std::unordered_map<std::string, std::unique_ptr<jitify2::ProgramCache<>>> caches{};
 
-  // Create args for file locking
-  flock fl{};
-  fl.l_type   = F_WRLCK;  // Exclusive lock for writing
-  fl.l_whence = SEEK_SET;
+  std::lock_guard<std::mutex> caches_lock(caches_mutex);
 
-  // Lock the file descriptor. we the only ones now
-  if (fcntl(fd, F_SETLKW, &fl) == -1) {
-    successful_write = false;
-    return;
-  }
+  auto existing_cache = caches.find(preprog.name());
 
-  // Get file descriptor from file pointer
-  FILE* fp = fdopen(fd, "wb");
+  if (existing_cache == caches.end()) {
+    auto res = caches.insert(
+      {preprog.name(),
+       std::make_unique<jitify2::ProgramCache<>>(100, preprog, nullptr, get_program_cache_dir())});
 
-  // Copy string into file
-  if (fwrite(content.c_str(), content.length(), 1, fp) != 1) {
-    successful_write = false;
-    fclose(fp);
-    return;
+    existing_cache = res.first;
   }
-  fclose(fp);
 
-  successful_write = true;
-  return;
+  return *(existing_cache->second);
 }
 
 }  // namespace jit
diff --git a/cpp/src/jit/cache.h b/cpp/src/jit/cache.h
deleted file mode 100644
index 071a951023b..00000000000
--- a/cpp/src/jit/cache.h
+++ /dev/null
@@ -1,208 +0,0 @@
-/*
- * Copyright (c) 2019-2020, NVIDIA CORPORATION.
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-#ifndef CUDF_JIT_CACHE_H_
-#define CUDF_JIT_CACHE_H_
-
-#include <boost/filesystem.hpp>
-#include <cudf/utilities/error.hpp>
-#include <jitify.hpp>
-#include <memory>
-#include <mutex>
-#include <string>
-#include <unordered_map>
-
-namespace cudf {
-namespace jit {
-template <typename Tv>
-using named_prog = std::pair<std::string, std::shared_ptr<Tv>>;
-
-/**
- * @brief Get the string path to the JITIFY kernel cache directory.
- *
- * This path can be overridden at runtime by defining an environment variable
- * named `LIBCUDF_KERNEL_CACHE_PATH`. The value of this variable must be a path
- * under which the process' user has read/write privileges.
- *
- * This function returns a path to the cache directory, creating it if it
- * doesn't exist.
- *
- * The default cache directory is `$HOME/.cudf/$CUDF_VERSION`. If no overrides
- * are used and if $HOME is not defined, returns an empty path and file
- * caching is not used.
- */
-boost::filesystem::path getCacheDir();
-
-class cudfJitCache {
- public:
-  /**
-   * @brief Get a process wide singleton cache object
-   *
-   */
-  static cudfJitCache& Instance()
-  {
-    // Meyers' singleton is thread safe in C++11
-    // Link: https://stackoverflow.com/a/1661564
-    static cudfJitCache cache;
-    return cache;
-  }
-
-  cudfJitCache();
-  ~cudfJitCache();
-
-  /**
-   * @brief Get the Kernel Instantiation object
-   *
-   * Searches an internal in-memory cache and file based cache for the kernel
-   * and if not found, JIT compiles and returns the kernel
-   *
-   * @param kern_name  name of kernel to return
-   * @param program    Jitify preprocessed program to get the kernel from
-   * @param arguments  template arguments for kernel in vector of strings
-   * @return  Pair of string kernel identifier and compiled kernel object
-   */
-  named_prog<jitify::experimental::KernelInstantiation> getKernelInstantiation(
-    std::string const& kern_name,
-    named_prog<jitify::experimental::Program> const& program,
-    std::vector<std::string> const& arguments);
-
-  /**
-   * @brief Get the Jitify preprocessed Program object
-   *
-   * Searches an internal in-memory cache and file based cache for the Jitify
-   * pre-processed program and if not found, JIT processes and returns it
-   *
-   * @param prog_file_name name of program to return
-   * @param cuda_source    string source code of program to compile
-   * @param given_headers  vector of strings representing source or names of each header included in
-   *                       cuda_source
-   * @param given_options  vector of strings options to pass to NVRTC
-   * @param file_callback  pointer to callback function to call whenever a header needs to be loaded
-   * @return named_prog<jitify::experimental::Program>
-   */
-  named_prog<jitify::experimental::Program> getProgram(
-    std::string const& prog_file_name,
-    std::string const& cuda_source                         = "",
-    std::vector<std::string> const& given_headers          = {},
-    std::vector<std::string> const& given_options          = {},
-    jitify::experimental::file_callback_type file_callback = nullptr);
-
- private:
-  template <typename Tv>
-  using umap_str_shptr = std::unordered_map<std::string, std::shared_ptr<Tv>>;
-
-  std::unordered_map<CUcontext, umap_str_shptr<jitify::experimental::KernelInstantiation>>
-    kernel_inst_context_map;
-  umap_str_shptr<jitify::experimental::Program> program_map;
-
-  /*
-    Even though this class can be used as a non-singleton, the file cache
-    access should remain limited to one thread per process. The lockf locks can
-    prevent multiple processes from accessing the file but are ineffective in
-    preventing multiple threads from doing so as the lock is shared by the
-    entire process.
-    Therefore the mutexes are static.
-    */
-  static std::mutex _kernel_cache_mutex;
-  static std::mutex _program_cache_mutex;
-
- private:
-  /**
-   * @brief Class to allow process wise exclusive access to cache files
-   *
-   */
-  class cacheFile {
-   private:
-    std::string _file_name;
-    bool successful_read  = false;
-    bool successful_write = false;
-
-   public:
-    cacheFile(std::string file_name);
-    ~cacheFile();
-
-    /**
-     * @brief Read this file and return the contents as a std::string
-     *
-     */
-    std::string read();
-
-    /**
-     * @brief Write the passed string to this file
-     *
-     */
-    void write(std::string);
-
-    /**
-     * @brief Check whether the read() operation on the file completed successfully
-     *
-     * @return true Read was successful. String returned by `read()` is valid
-     * @return false Read was unsuccessful. String returned by `read()` is empty
-     */
-    bool is_read_successful() { return successful_read; }
-
-    /**
-     * @brief Check whether the write() operation on the file completed successfully
-     *
-     * @return true Write was successful.
-     * @return false Write was unsuccessful. File state is undefined
-     */
-    bool is_write_successful() { return successful_write; }
-  };
-
- private:
-  template <typename T, typename FallbackFunc>
-  named_prog<T> getCached(std::string const& name, umap_str_shptr<T>& map, FallbackFunc func)
-  {
-    // Find memory cached T object
-    auto it = map.find(name);
-    if (it != map.end()) {
-      return std::make_pair(name, it->second);
-    } else {  // Find file cached T object
-      bool successful_read = false;
-      std::string serialized;
-#if defined(JITIFY_USE_CACHE)
-      boost::filesystem::path cache_dir = getCacheDir();
-      if (not cache_dir.empty()) {
-        boost::filesystem::path file_name = cache_dir / name;
-        cacheFile file{file_name.string()};
-        serialized      = file.read();
-        successful_read = file.is_read_successful();
-      }
-#endif
-      if (not successful_read) {
-        // JIT compile and write to file if possible
-        serialized = func().serialize();
-#if defined(JITIFY_USE_CACHE)
-        if (not cache_dir.empty()) {
-          boost::filesystem::path file_name = cache_dir / name;
-          cacheFile file{file_name.string()};
-          file.write(serialized);
-        }
-#endif
-      }
-      // Add deserialized T to cache and return
-      auto program = std::make_shared<T>(T::deserialize(serialized));
-      map[name]    = program;
-      return std::make_pair(name, program);
-    }
-  }
-};
-
-}  // namespace jit
-}  // namespace cudf
-
-#endif  // CUDF_JIT_CACHE_H_
diff --git a/cpp/src/transform/jit/code/code.h b/cpp/src/jit/cache.hpp
similarity index 71%
rename from cpp/src/transform/jit/code/code.h
rename to cpp/src/jit/cache.hpp
index cc3d6a8fe89..df8d4278f0f 100644
--- a/cpp/src/transform/jit/code/code.h
+++ b/cpp/src/jit/cache.hpp
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2019, NVIDIA CORPORATION.
+ * Copyright (c) 2019-2021, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -16,16 +16,13 @@
 
 #pragma once
 
+#include <jitify2.hpp>
+#include <memory>
+
 namespace cudf {
-namespace transformation {
 namespace jit {
-namespace code {
-extern const char* kernel_header;
-extern const char* kernel;
-extern const char* traits;
-extern const char* operation;
 
-}  // namespace code
+jitify2::ProgramCache<>& get_program_cache(jitify2::PreprocessedProgramData preprog);
+
 }  // namespace jit
-}  // namespace transformation
 }  // namespace cudf
diff --git a/cpp/src/jit/common_headers.hpp b/cpp/src/jit/common_headers.hpp
deleted file mode 100644
index 0f57790afe0..00000000000
--- a/cpp/src/jit/common_headers.hpp
+++ /dev/null
@@ -1,108 +0,0 @@
-/*
- * Copyright (c) 2019-2020, NVIDIA CORPORATION.
- *
- * Copyright 2018-2019 BlazingDB, Inc.
- *     Copyright 2018 Christian Noboa Mardini <christian@blazingdb.com>
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-#include <jit/libcudacxx/cuda/std/chrono.jit>
-#include <jit/libcudacxx/cuda/std/climits.jit>
-#include <jit/libcudacxx/cuda/std/cstddef.jit>
-#include <jit/libcudacxx/cuda/std/cstdint.jit>
-#include <jit/libcudacxx/cuda/std/ctime.jit>
-#include <jit/libcudacxx/cuda/std/detail/__config.jit>
-#include <jit/libcudacxx/cuda/std/detail/__pragma_pop.jit>
-#include <jit/libcudacxx/cuda/std/detail/__pragma_push.jit>
-#include <jit/libcudacxx/cuda/std/detail/libcxx/include/__config.jit>
-#include <jit/libcudacxx/cuda/std/detail/libcxx/include/__pragma_pop.jit>
-#include <jit/libcudacxx/cuda/std/detail/libcxx/include/__pragma_push.jit>
-#include <jit/libcudacxx/cuda/std/detail/libcxx/include/__undef_macros.jit>
-#include <jit/libcudacxx/cuda/std/detail/libcxx/include/chrono.jit>
-#include <jit/libcudacxx/cuda/std/detail/libcxx/include/climits.jit>
-#include <jit/libcudacxx/cuda/std/detail/libcxx/include/cstddef.jit>
-#include <jit/libcudacxx/cuda/std/detail/libcxx/include/cstdint.jit>
-#include <jit/libcudacxx/cuda/std/detail/libcxx/include/ctime.jit>
-#include <jit/libcudacxx/cuda/std/detail/libcxx/include/limits.jit>
-#include <jit/libcudacxx/cuda/std/detail/libcxx/include/ratio.jit>
-#include <jit/libcudacxx/cuda/std/detail/libcxx/include/type_traits.jit>
-#include <jit/libcudacxx/cuda/std/detail/libcxx/include/version.jit>
-#include <jit/libcudacxx/cuda/std/limits.jit>
-#include <jit/libcudacxx/cuda/std/ratio.jit>
-#include <jit/libcudacxx/cuda/std/type_traits.jit>
-#include <jit/libcudacxx/cuda/std/version.jit>
-
-#include <cstring>
-#include <iostream>
-#include <string>
-#include <unordered_map>
-#include <vector>
-
-namespace cudf {
-namespace jit {
-
-const std::vector<std::string> compiler_flags
-{
-  "-std=c++14",
-    // Have jitify prune unused global variables
-    "-remove-unused-globals",
-    // suppress all NVRTC warnings
-    "-w",
-    // force libcudacxx to not include system headers
-    "-D__CUDACC_RTC__",
-#if defined(__powerpc64__)
-    "-D__powerpc64__"
-#elif defined(__x86_64__)
-    "-D__x86_64__"
-#endif
-};
-
-const std::unordered_map<std::string, char const*> stringified_headers{
-  {"cuda/std/chrono", cuda_std_chrono},
-  {"cuda/std/climits", cuda_std_climits},
-  {"cuda/std/cstddef", cuda_std_cstddef},
-  {"cuda/std/cstdint", cuda_std_cstdint},
-  {"cuda/std/ctime", cuda_std_ctime},
-  {"cuda/std/limits", cuda_std_limits},
-  {"cuda/std/ratio", cuda_std_ratio},
-  {"cuda/std/type_traits", cuda_std_type_traits},
-  {"cuda/std/type_traits", cuda_std_type_traits},
-  {"cuda/std/version", cuda_std_version},
-  {"cuda/std/detail/__config", cuda_std_detail___config},
-  {"cuda/std/detail/__pragma_pop", cuda_std_detail___pragma_pop},
-  {"cuda/std/detail/__pragma_push", cuda_std_detail___pragma_push},
-  {"cuda/std/detail/libcxx/include/__config", cuda_std_detail_libcxx_include___config},
-  {"cuda/std/detail/libcxx/include/__pragma_pop", cuda_std_detail_libcxx_include___pragma_pop},
-  {"cuda/std/detail/libcxx/include/__pragma_push", cuda_std_detail_libcxx_include___pragma_push},
-  {"cuda/std/detail/libcxx/include/__undef_macros", cuda_std_detail_libcxx_include___undef_macros},
-  {"cuda/std/detail/libcxx/include/chrono", cuda_std_detail_libcxx_include_chrono},
-  {"cuda/std/detail/libcxx/include/climits", cuda_std_detail_libcxx_include_climits},
-  {"cuda/std/detail/libcxx/include/cstddef", cuda_std_detail_libcxx_include_cstddef},
-  {"cuda/std/detail/libcxx/include/cstdint", cuda_std_detail_libcxx_include_cstdint},
-  {"cuda/std/detail/libcxx/include/ctime", cuda_std_detail_libcxx_include_ctime},
-  {"cuda/std/detail/libcxx/include/limits", cuda_std_detail_libcxx_include_limits},
-  {"cuda/std/detail/libcxx/include/ratio", cuda_std_detail_libcxx_include_ratio},
-  {"cuda/std/detail/libcxx/include/type_traits", cuda_std_detail_libcxx_include_type_traits},
-  {"cuda/std/detail/libcxx/include/version", cuda_std_detail_libcxx_include_version},
-};
-
-inline std::istream* send_stringified_header(std::iostream& stream, char const* header)
-{
-  // skip the filename line added by stringify
-  stream << (std::strchr(header, '\n') + 1);
-  return &stream;
-}
-
-}  // namespace jit
-}  // namespace cudf
diff --git a/cpp/src/jit/launcher.cpp b/cpp/src/jit/launcher.cpp
deleted file mode 100644
index 2ddcac7d5ba..00000000000
--- a/cpp/src/jit/launcher.cpp
+++ /dev/null
@@ -1,51 +0,0 @@
-/*
- * Copyright (c) 2019-2020, NVIDIA CORPORATION.
- *
- * Copyright 2018-2019 BlazingDB, Inc.
- *     Copyright 2018 Christian Noboa Mardini <christian@blazingdb.com>
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-#include <jit/launcher.h>
-#include <jit/parser.h>
-#include <chrono>
-#include <cstdint>
-
-#include <rmm/cuda_stream_view.hpp>
-
-namespace cudf {
-namespace jit {
-
-launcher::launcher(const std::string& hash,
-                   const std::string& cuda_source,
-                   const std::vector<std::string>& header_names,
-                   const std::vector<std::string>& compiler_flags,
-                   jitify::experimental::file_callback_type file_callback,
-                   rmm::cuda_stream_view stream)
-  : cache_instance{cudf::jit::cudfJitCache::Instance()}, stream(stream)
-{
-  program = cache_instance.getProgram(
-    hash, cuda_source.c_str(), header_names, compiler_flags, file_callback);
-}
-
-launcher::launcher(launcher&& launcher)
-  : cache_instance{cudf::jit::cudfJitCache::Instance()},
-    program{std::move(launcher.program)},
-    kernel_inst{std::move(launcher.kernel_inst)},
-    stream{launcher.stream}
-{
-}
-
-}  // namespace jit
-}  // namespace cudf
diff --git a/cpp/src/jit/launcher.h b/cpp/src/jit/launcher.h
deleted file mode 100644
index 8bcd92149a8..00000000000
--- a/cpp/src/jit/launcher.h
+++ /dev/null
@@ -1,110 +0,0 @@
-/*
- * Copyright (c) 2019-2020, NVIDIA CORPORATION.
- *
- * Copyright 2018-2019 BlazingDB, Inc.
- *     Copyright 2018 Christian Noboa Mardini <christian@blazingdb.com>
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-#pragma once
-
-#include <jit/cache.h>
-
-#include <rmm/cuda_stream_view.hpp>
-
-#include <jitify.hpp>
-
-#include <chrono>
-#include <fstream>
-#include <memory>
-#include <string>
-#include <unordered_map>
-
-namespace cudf {
-namespace jit {
-/**
- * @brief Class used to handle compilation and execution of JIT kernels
- */
-class launcher {
- public:
-  launcher() = delete;
-
-  /**
-   * @brief Constructor of the launcher class
-   *
-   * Method to generate vector containing all template types for a JIT kernel.
-   *  This vector is used to get the compiled kernel for one set of types and set
-   *  it as the kernel to launch using this launcher.
-   *
-   * @param hash The hash to be used as the key for caching
-   * @param cuda_code The CUDA code that contains the kernel to be launched
-   * @param header_names Strings of header_names or strings that contain content
-   * of the header files
-   * @param compiler_flags Strings of compiler flags
-   * @param file_callback a function that returns header file contents given header
-   * file names.
-   * @param stream The non-owned stream to use for execution
-   */
-  launcher(const std::string& hash,
-           const std::string& cuda_source,
-           const std::vector<std::string>& header_names,
-           const std::vector<std::string>& compiler_flags,
-           jitify::experimental::file_callback_type file_callback,
-           rmm::cuda_stream_view stream = rmm::cuda_stream_default);
-  launcher(launcher&&);
-  launcher(const launcher&) = delete;
-  launcher& operator=(launcher&&) = delete;
-  launcher& operator=(const launcher&) = delete;
-
-  /**
-   * @brief Sets the kernel to launch using this launcher
-   *
-   * Method to generate vector containing all template types for a JIT kernel.
-   *  This vector is used to get the compiled kernel for one set of types and set
-   *  it as the kernel to launch using this launcher.
-   *
-   * @param kernel_name The kernel to be launched
-   * @param arguments   The template arguments to be used to instantiate the kernel
-   * @return launcher& ref to this launcher object
-   */
-  launcher& set_kernel_inst(const std::string& kernel_name,
-                            const std::vector<std::string>& arguments)
-  {
-    kernel_inst = cache_instance.getKernelInstantiation(kernel_name, program, arguments);
-    return *this;
-  }
-
-  /**
-   * @brief Handle the Jitify API to launch using information
-   *  contained in the members of `this`
-   *
-   * @tparam All parameters to launch the kernel
-   */
-  template <typename... Args>
-  void launch(Args... args)
-  {
-    get_kernel().configure_1d_max_occupancy(0, 0, 0, stream.value()).safe_launch(args...);
-  }
-
- private:
-  cudf::jit::cudfJitCache& cache_instance;
-  cudf::jit::named_prog<jitify::experimental::Program> program;
-  cudf::jit::named_prog<jitify::experimental::KernelInstantiation> kernel_inst;
-  rmm::cuda_stream_view stream;
-
-  jitify::experimental::KernelInstantiation& get_kernel() { return *std::get<1>(kernel_inst); }
-};
-
-}  // namespace jit
-}  // namespace cudf
diff --git a/cpp/src/jit/parser.cpp b/cpp/src/jit/parser.cpp
index 01fd3aea33a..8929d58be08 100644
--- a/cpp/src/jit/parser.cpp
+++ b/cpp/src/jit/parser.cpp
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2019, NVIDIA CORPORATION.
+ * Copyright (c) 2019-2021, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -14,15 +14,17 @@
  * limitations under the License.
  */
 
+#include "parser.hpp"
+
+#include <cudf/utilities/error.hpp>
+
 #include <algorithm>
 #include <cctype>
-#include <cudf/utilities/error.hpp>
 #include <map>
+#include <set>
 #include <string>
 #include <vector>
 
-#include "parser.h"
-
 namespace cudf {
 namespace jit {
 constexpr char percent_escape[] = "_";
diff --git a/cpp/src/jit/parser.h b/cpp/src/jit/parser.hpp
similarity index 100%
rename from cpp/src/jit/parser.h
rename to cpp/src/jit/parser.hpp
diff --git a/cpp/src/jit/type.cpp b/cpp/src/jit/type.cpp
index e833a6fa10f..16894168b31 100644
--- a/cpp/src/jit/type.cpp
+++ b/cpp/src/jit/type.cpp
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2019-2020, NVIDIA CORPORATION.
+ * Copyright (c) 2019-2021, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -18,6 +18,7 @@
 #include <cudf/scalar/scalar.hpp>
 #include <cudf/utilities/traits.hpp>
 #include <cudf/utilities/type_dispatcher.hpp>
+
 #include <string>
 
 namespace cudf {
diff --git a/cpp/src/jit/type.h b/cpp/src/jit/type.hpp
similarity index 100%
rename from cpp/src/jit/type.h
rename to cpp/src/jit/type.hpp
diff --git a/cpp/src/join/hash_join.cu b/cpp/src/join/hash_join.cu
index 5a6ad8892de..15eb122ef27 100644
--- a/cpp/src/join/hash_join.cu
+++ b/cpp/src/join/hash_join.cu
@@ -15,6 +15,7 @@
  */
 #include <thrust/uninitialized_fill.h>
 #include <join/hash_join.cuh>
+#include <structs/utilities.hpp>
 
 #include <cudf/detail/concatenate.cuh>
 #include <cudf/detail/gather.cuh>
@@ -299,13 +300,15 @@ hash_join::hash_join_impl::~hash_join_impl() = default;
 hash_join::hash_join_impl::hash_join_impl(cudf::table_view const &build,
                                           null_equality compare_nulls,
                                           rmm::cuda_stream_view stream)
-  : _build(build), _hash_table(nullptr)
+  : _hash_table(nullptr)
 {
   CUDF_FUNC_RANGE();
-  CUDF_EXPECTS(0 != _build.num_columns(), "Hash join build table is empty");
-  CUDF_EXPECTS(_build.num_rows() < cudf::detail::MAX_JOIN_SIZE,
+  CUDF_EXPECTS(0 != build.num_columns(), "Hash join build table is empty");
+  CUDF_EXPECTS(build.num_rows() < cudf::detail::MAX_JOIN_SIZE,
                "Build column size is too big for hash join");
 
+  _build = std::get<0>(structs::detail::flatten_nested_columns(build, {}, {}));
+
   if (0 == build.num_rows()) { return; }
 
   _hash_table = build_join_hash_table(_build, compare_nulls, stream);
@@ -355,22 +358,25 @@ hash_join::hash_join_impl::compute_hash_join(cudf::table_view const &probe,
   CUDF_EXPECTS(0 != probe.num_columns(), "Hash join probe table is empty");
   CUDF_EXPECTS(probe.num_rows() < cudf::detail::MAX_JOIN_SIZE,
                "Probe column size is too big for hash join");
-  CUDF_EXPECTS(_build.num_columns() == probe.num_columns(),
+
+  auto const _probe = std::get<0>(structs::detail::flatten_nested_columns(probe, {}, {}));
+
+  CUDF_EXPECTS(_build.num_columns() == _probe.num_columns(),
                "Mismatch in number of columns to be joined on");
 
-  if (is_trivial_join(probe, _build, JoinKind)) {
+  if (is_trivial_join(_probe, _build, JoinKind)) {
     return std::make_pair(std::make_unique<rmm::device_uvector<size_type>>(0, stream, mr),
                           std::make_unique<rmm::device_uvector<size_type>>(0, stream, mr));
   }
 
   CUDF_EXPECTS(std::equal(std::cbegin(_build),
                           std::cend(_build),
-                          std::cbegin(probe),
-                          std::cend(probe),
+                          std::cbegin(_probe),
+                          std::cend(_probe),
                           [](const auto &b, const auto &p) { return b.type() == p.type(); }),
                "Mismatch in joining column data types");
 
-  return probe_join_indices<JoinKind>(probe, compare_nulls, stream, mr);
+  return probe_join_indices<JoinKind>(_probe, compare_nulls, stream, mr);
 }
 
 template <cudf::detail::join_kind JoinKind>
diff --git a/cpp/src/partitioning/partitioning.cu b/cpp/src/partitioning/partitioning.cu
index 46f00ecb75d..209e2d16f87 100644
--- a/cpp/src/partitioning/partitioning.cu
+++ b/cpp/src/partitioning/partitioning.cu
@@ -448,6 +448,7 @@ std::pair<std::unique_ptr<table>, std::vector<size_type>> hash_partition_table(
   table_view const& input,
   table_view const& table_to_hash,
   size_type num_partitions,
+  uint32_t seed,
   rmm::cuda_stream_view stream,
   rmm::mr::device_memory_resource* mr)
 {
@@ -481,7 +482,7 @@ std::pair<std::unique_ptr<table>, std::vector<size_type>> hash_partition_table(
   auto row_partition_offset = rmm::device_vector<size_type>(num_rows);
 
   auto const device_input = table_device_view::create(table_to_hash, stream);
-  auto const hasher       = row_hasher<hash_function, hash_has_nulls>(*device_input);
+  auto const hasher       = row_hasher<hash_function, hash_has_nulls>(*device_input, seed);
 
   // If the number of partitions is a power of two, we can compute the partition
   // number of each row more efficiently with bitwise operations
@@ -725,6 +726,7 @@ std::pair<std::unique_ptr<table>, std::vector<size_type>> hash_partition(
   table_view const& input,
   std::vector<size_type> const& columns_to_hash,
   int num_partitions,
+  uint32_t seed,
   rmm::cuda_stream_view stream,
   rmm::mr::device_memory_resource* mr)
 {
@@ -737,10 +739,10 @@ std::pair<std::unique_ptr<table>, std::vector<size_type>> hash_partition(
 
   if (has_nulls(table_to_hash)) {
     return hash_partition_table<hash_function, true>(
-      input, table_to_hash, num_partitions, stream, mr);
+      input, table_to_hash, num_partitions, seed, stream, mr);
   } else {
     return hash_partition_table<hash_function, false>(
-      input, table_to_hash, num_partitions, stream, mr);
+      input, table_to_hash, num_partitions, seed, stream, mr);
   }
 }
 }  // namespace local
@@ -771,6 +773,7 @@ std::pair<std::unique_ptr<table>, std::vector<size_type>> hash_partition(
   std::vector<size_type> const& columns_to_hash,
   int num_partitions,
   hash_id hash_function,
+  uint32_t seed,
   rmm::cuda_stream_view stream,
   rmm::mr::device_memory_resource* mr)
 {
@@ -783,10 +786,10 @@ std::pair<std::unique_ptr<table>, std::vector<size_type>> hash_partition(
           CUDF_FAIL("IdentityHash does not support this data type");
       }
       return detail::local::hash_partition<IdentityHash>(
-        input, columns_to_hash, num_partitions, stream, mr);
+        input, columns_to_hash, num_partitions, seed, stream, mr);
     case (hash_id::HASH_MURMUR3):
       return detail::local::hash_partition<MurmurHash3_32>(
-        input, columns_to_hash, num_partitions, stream, mr);
+        input, columns_to_hash, num_partitions, seed, stream, mr);
     default: CUDF_FAIL("Unsupported hash function in hash_partition");
   }
 }
diff --git a/cpp/src/replace/nans.cu b/cpp/src/replace/nans.cu
index b34b0928847..d6cf7d2c385 100644
--- a/cpp/src/replace/nans.cu
+++ b/cpp/src/replace/nans.cu
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2020, NVIDIA CORPORATION.
+ * Copyright (c) 2020-2021, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -146,7 +146,7 @@ std::unique_ptr<column> replace_nans(column_view const& input,
                                      rmm::mr::device_memory_resource* mr)
 {
   CUDF_FUNC_RANGE();
-  return detail::replace_nans(input, replacement, 0, mr);
+  return detail::replace_nans(input, replacement, rmm::cuda_stream_default, mr);
 }
 
 std::unique_ptr<column> replace_nans(column_view const& input,
diff --git a/cpp/src/replace/nulls.cu b/cpp/src/replace/nulls.cu
index afc2bbb37bd..4cf6899116d 100644
--- a/cpp/src/replace/nulls.cu
+++ b/cpp/src/replace/nulls.cu
@@ -424,9 +424,9 @@ std::unique_ptr<cudf::column> replace_nulls(cudf::column_view const& input,
   CUDF_EXPECTS(replacement.size() == input.size(), "Column size mismatch");
 
   if (input.is_empty()) { return cudf::empty_like(input); }
-  if (!input.has_nulls()) { return std::make_unique<cudf::column>(input); }
+  if (!input.has_nulls()) { return std::make_unique<cudf::column>(input, stream, mr); }
 
-  return cudf::type_dispatcher(
+  return cudf::type_dispatcher<dispatch_storage_type>(
     input.type(), replace_nulls_column_kernel_forwarder{}, input, replacement, stream, mr);
 }
 
diff --git a/cpp/src/rolling/grouped_rolling.cu b/cpp/src/rolling/grouped_rolling.cu
index 34d6d5fa194..ca4913c1843 100644
--- a/cpp/src/rolling/grouped_rolling.cu
+++ b/cpp/src/rolling/grouped_rolling.cu
@@ -14,9 +14,11 @@
  * limitations under the License.
  */
 
+#include "rolling_detail.cuh"
+#include "rolling_jit_detail.hpp"
+
 #include <cudf/detail/iterator.cuh>
 #include <cudf/unary.hpp>
-#include "rolling_detail.cuh"
 
 namespace cudf {
 
diff --git a/cpp/src/rolling/jit/code/operation.cpp b/cpp/src/rolling/jit/code/operation.cpp
deleted file mode 100644
index 1fdc4080634..00000000000
--- a/cpp/src/rolling/jit/code/operation.cpp
+++ /dev/null
@@ -1,52 +0,0 @@
-/*
- * Copyright (c) 2019-2020, NVIDIA CORPORATION.
- *
- * Copyright 2018-2019 BlazingDB, Inc.
- *     Copyright 2018 Christian Noboa Mardini <christian@blazingdb.com>
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-namespace cudf {
-namespace rolling {
-namespace jit {
-namespace code {
-const char* operation_h =
-  R"***(operation.h
-#pragma once
-  struct rolling_udf_ptx {
-    template <typename OutType, typename InType>
-    static OutType operate(const InType* in_col, cudf::size_type start, cudf::size_type count) {
-      OutType ret;
-      rolling_udf(
-        &ret, 0, 0, 0, 0, &in_col[start], count, sizeof(InType));
-      return ret;
-    }
-  };
-
-  struct rolling_udf_cuda {
-    template <typename OutType, typename InType>
-    static OutType operate(const InType* in_col, cudf::size_type start, cudf::size_type count) {
-      OutType ret;
-      rolling_udf(
-        &ret, in_col, start, count);
-      return ret;
-    }
-  };
-
-)***";
-
-}  // namespace code
-}  // namespace jit
-}  // namespace rolling
-}  // namespace cudf
diff --git a/cpp/src/rolling/jit/code/kernel.cpp b/cpp/src/rolling/jit/kernel.cu
similarity index 61%
rename from cpp/src/rolling/jit/code/kernel.cpp
rename to cpp/src/rolling/jit/kernel.cu
index 2c612162f79..52e397b9351 100644
--- a/cpp/src/rolling/jit/code/kernel.cpp
+++ b/cpp/src/rolling/jit/kernel.cu
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2019-2020, NVIDIA CORPORATION.
+ * Copyright (c) 2019-2021, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -14,47 +14,50 @@
  * limitations under the License.
  */
 
-namespace cudf {
-namespace rolling {
-namespace jit {
-namespace code {
-const char* kernel_headers =
-  R"***(
-#include <../src/rolling/rolling_jit_detail.hpp>
+#include <rolling/jit/operation.hpp>
+#include <rolling/rolling_jit_detail.hpp>
+
 #include <cudf/types.hpp>
 #include <cudf/utilities/bit.hpp>
-)***";
 
-const char* kernel =
-  R"***(
-#include "operation.h"
+namespace cudf {
+namespace rolling {
+namespace jit {
 
 template <typename WindowType>
-cudf::size_type __device__ get_window(WindowType window, cudf::size_type index) { return window[index]; }
+cudf::size_type __device__ get_window(WindowType window, cudf::size_type index)
+{
+  return window[index];
+}
 
 template <>
-cudf::size_type __device__ get_window(cudf::size_type window, cudf::size_type index) { return window; }
-
-template <typename InType, typename OutType, class agg_op, typename PrecedingWindowType, typename FollowingWindowType>
-__global__
-void gpu_rolling_new(cudf::size_type nrows,
-                 InType const* const __restrict__ in_col, 
-                 cudf::bitmask_type const* const __restrict__ in_col_valid,
-                 OutType* __restrict__ out_col, 
-                 cudf::bitmask_type* __restrict__ out_col_valid,
-                 cudf::size_type * __restrict__ output_valid_count,
-                 PrecedingWindowType preceding_window_begin,
-                 FollowingWindowType following_window_begin,
-                 cudf::size_type min_periods)
+cudf::size_type __device__ get_window(cudf::size_type window, cudf::size_type index)
 {
-  cudf::size_type i = blockIdx.x * blockDim.x + threadIdx.x;
+  return window;
+}
+
+template <typename InType,
+          typename OutType,
+          class agg_op,
+          typename PrecedingWindowType,
+          typename FollowingWindowType>
+__global__ void gpu_rolling_new(cudf::size_type nrows,
+                                InType const* const __restrict__ in_col,
+                                cudf::bitmask_type const* const __restrict__ in_col_valid,
+                                OutType* __restrict__ out_col,
+                                cudf::bitmask_type* __restrict__ out_col_valid,
+                                cudf::size_type* __restrict__ output_valid_count,
+                                PrecedingWindowType preceding_window_begin,
+                                FollowingWindowType following_window_begin,
+                                cudf::size_type min_periods)
+{
+  cudf::size_type i      = blockIdx.x * blockDim.x + threadIdx.x;
   cudf::size_type stride = blockDim.x * gridDim.x;
 
   cudf::size_type warp_valid_count{0};
 
   auto active_threads = __ballot_sync(0xffffffff, i < nrows);
-  while(i < nrows)
-  {
+  while (i < nrows) {
     // declare this as volatile to avoid some compiler optimizations that lead to incorrect results
     // for CUDA 10.0 and below (fixed in CUDA 10.1)
     volatile cudf::size_type count = 0;
@@ -63,16 +66,16 @@ void gpu_rolling_new(cudf::size_type nrows,
     cudf::size_type following_window = get_window(following_window_begin, i);
 
     // compute bounds
-    cudf::size_type start = min(nrows, max(0, i - preceding_window + 1));
-    cudf::size_type end = min(nrows, max(0, i + following_window + 1));
+    cudf::size_type start       = min(nrows, max(0, i - preceding_window + 1));
+    cudf::size_type end         = min(nrows, max(0, i + following_window + 1));
     cudf::size_type start_index = min(start, end);
-    cudf::size_type end_index = max(start, end);
+    cudf::size_type end_index   = max(start, end);
 
     // aggregate
     // TODO: We should explore using shared memory to avoid redundant loads.
     //       This might require separating the kernel into a special version
     //       for dynamic and static sizes.
-    count = end_index - start_index;
+    count       = end_index - start_index;
     OutType val = agg_op::template operate<OutType, InType>(in_col, start_index, count);
 
     // check if we have enough input samples
@@ -82,9 +85,7 @@ void gpu_rolling_new(cudf::size_type nrows,
     const unsigned int result_mask = __ballot_sync(active_threads, output_is_valid);
 
     // store the output value, one per thread
-    if (output_is_valid) {
-      out_col[i] = val;
-    }
+    if (output_is_valid) { out_col[i] = val; }
 
     // only one thread writes the mask
     if (0 == cudf::intra_word_index(i)) {
@@ -92,20 +93,16 @@ void gpu_rolling_new(cudf::size_type nrows,
       warp_valid_count += __popc(result_mask);
     }
 
-    // process next element 
+    // process next element
     i += stride;
     active_threads = __ballot_sync(active_threads, i < nrows);
   }
 
   // TODO: likely faster to do a single_lane_block_reduce and a single
   // atomic per block but that requires jitifying single_lane_block_reduce...
-  if(0 == cudf::intra_word_index(threadIdx.x)) {
-    atomicAdd(output_valid_count, warp_valid_count);
-  }
+  if (0 == cudf::intra_word_index(threadIdx.x)) { atomicAdd(output_valid_count, warp_valid_count); }
 }
-)***";
 
-}  // namespace code
 }  // namespace jit
 }  // namespace rolling
 }  // namespace cudf
diff --git a/cpp/src/rolling/jit/code/code.h b/cpp/src/rolling/jit/operation-udf.hpp
similarity index 51%
rename from cpp/src/rolling/jit/code/code.h
rename to cpp/src/rolling/jit/operation-udf.hpp
index c5577d326c7..eaab2111d98 100644
--- a/cpp/src/rolling/jit/code/code.h
+++ b/cpp/src/rolling/jit/operation-udf.hpp
@@ -1,8 +1,5 @@
 /*
- * Copyright (c) 2019-2020, NVIDIA CORPORATION.
- *
- * Copyright 2018-2019 BlazingDB, Inc.
- *     Copyright 2018 Christian Noboa Mardini <christian@blazingdb.com>
+ * Copyright (c) 2021, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -19,21 +16,5 @@
 
 #pragma once
 
-namespace cudf {
-namespace rolling {
-namespace jit {
-namespace code {
-extern const char* kernel_headers;
-extern const char* kernel;
-extern const char* operation_h;
-
-extern const char* kernel_headers;
-extern const char* kernel;
-extern const char* operation_h;
-
-extern const char* grouped_window_wrapper;
-
-}  // namespace code
-}  // namespace jit
-}  // namespace rolling
-}  // namespace cudf
+// This file serves as a placeholder for user-defined functions, so jitify can choose to override it
+// at runtime.
diff --git a/cpp/src/rolling/jit/operation.hpp b/cpp/src/rolling/jit/operation.hpp
new file mode 100644
index 00000000000..9af8c2ac3fb
--- /dev/null
+++ b/cpp/src/rolling/jit/operation.hpp
@@ -0,0 +1,41 @@
+/*
+ * Copyright (c) 2019-2021, NVIDIA CORPORATION.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#pragma once
+
+#include <cudf/types.hpp>
+
+#include <rolling/jit/operation-udf.hpp>
+
+struct rolling_udf_ptx {
+  template <typename OutType, typename InType>
+  static OutType operate(const InType* in_col, cudf::size_type start, cudf::size_type count)
+  {
+    OutType ret;
+    rolling_udf(&ret, 0, 0, 0, 0, &in_col[start], count, sizeof(InType));
+    return ret;
+  }
+};
+
+struct rolling_udf_cuda {
+  template <typename OutType, typename InType>
+  static OutType operate(const InType* in_col, cudf::size_type start, cudf::size_type count)
+  {
+    OutType ret;
+    rolling_udf(&ret, in_col, start, count);
+    return ret;
+  }
+};
diff --git a/cpp/src/rolling/rolling_detail.cuh b/cpp/src/rolling/rolling_detail.cuh
index 42562507fa9..c6439486461 100644
--- a/cpp/src/rolling/rolling_detail.cuh
+++ b/cpp/src/rolling/rolling_detail.cuh
@@ -16,9 +16,7 @@
 
 #pragma once
 
-#include <rolling/jit/code/code.h>
 #include <rolling/rolling_detail.hpp>
-#include <rolling/rolling_jit_detail.hpp>
 
 #include <cudf/aggregation.hpp>
 #include <cudf/column/column_device_view.cuh>
@@ -44,12 +42,11 @@
 #include <cudf/utilities/error.hpp>
 #include <cudf/utilities/traits.hpp>
 
-#include <jit/launcher.h>
-#include <jit/parser.h>
-#include <jit/type.h>
-#include <jit/bit.hpp.jit>
-#include <jit/rolling_jit_detail.hpp.jit>
-#include <jit/types.hpp.jit>
+#include <jit/cache.hpp>
+#include <jit/parser.hpp>
+#include <jit/type.hpp>
+
+#include <jit_preprocessed_files/rolling/jit/kernel.cu.jit.hpp>
 
 #include <rmm/thrust_rmm_allocator.h>
 #include <rmm/cuda_stream_view.hpp>
@@ -1270,19 +1267,15 @@ std::unique_ptr<column> rolling_window_udf(column_view const& input,
   std::string cuda_source;
   switch (udf_agg->kind) {
     case aggregation::Kind::PTX:
-      cuda_source = cudf::rolling::jit::code::kernel_headers;
       cuda_source +=
         cudf::jit::parse_single_function_ptx(udf_agg->_source,
                                              udf_agg->_function_name,
                                              cudf::jit::get_type_name(udf_agg->_output_type),
                                              {0, 5});  // args 0 and 5 are pointers.
-      cuda_source += cudf::rolling::jit::code::kernel;
       break;
     case aggregation::Kind::CUDA:
-      cuda_source = cudf::rolling::jit::code::kernel_headers;
       cuda_source +=
         cudf::jit::parse_single_function_cuda(udf_agg->_source, udf_agg->_function_name);
-      cuda_source += cudf::rolling::jit::code::kernel;
       break;
     default: CUDF_FAIL("Unsupported UDF type.");
   }
@@ -1293,37 +1286,27 @@ std::unique_ptr<column> rolling_window_udf(column_view const& input,
   auto output_view = output->mutable_view();
   rmm::device_scalar<size_type> device_valid_count{0, stream};
 
-  const std::vector<std::string> compiler_flags{"-std=c++14",
-                                                // Have jitify prune unused global variables
-                                                "-remove-unused-globals",
-                                                // suppress all NVRTC warnings
-                                                "-w"};
-
-  // Launch the jitify kernel
-  cudf::jit::launcher(hash,
-                      cuda_source,
-                      {cudf_types_hpp,
-                       cudf_utilities_bit_hpp,
-                       cudf::rolling::jit::code::operation_h,
-                       ___src_rolling_rolling_jit_detail_hpp},
-                      compiler_flags,
-                      nullptr,
-                      stream)
-    .set_kernel_inst("gpu_rolling_new",  // name of the kernel we are launching
-                     {cudf::jit::get_type_name(input.type()),  // list of template arguments
-                      cudf::jit::get_type_name(output->type()),
-                      udf_agg->_operator_name,
-                      preceding_window_str.c_str(),
-                      following_window_str.c_str()})
-    .launch(input.size(),
-            cudf::jit::get_data_ptr(input),
-            input.null_mask(),
-            cudf::jit::get_data_ptr(output_view),
-            output_view.null_mask(),
-            device_valid_count.data(),
-            preceding_window,
-            following_window,
-            min_periods);
+  std::string kernel_name =
+    jitify2::reflection::Template("cudf::rolling::jit::gpu_rolling_new")  //
+      .instantiate(cudf::jit::get_type_name(input.type()),  // list of template arguments
+                   cudf::jit::get_type_name(output->type()),
+                   udf_agg->_operator_name,
+                   preceding_window_str.c_str(),
+                   following_window_str.c_str());
+
+  cudf::jit::get_program_cache(*rolling_jit_kernel_cu_jit)
+    .get_kernel(
+      kernel_name, {}, {{"rolling/jit/operation-udf.hpp", cuda_source}}, {"-arch=sm_."})  //
+    ->configure_1d_max_occupancy(0, 0, 0, stream.value())                                 //
+    ->launch(input.size(),
+             cudf::jit::get_data_ptr(input),
+             input.null_mask(),
+             cudf::jit::get_data_ptr(output_view),
+             output_view.null_mask(),
+             device_valid_count.data(),
+             preceding_window,
+             following_window,
+             min_periods);
 
   output->set_null_count(output->size() - device_valid_count.value(stream));
 
diff --git a/cpp/src/strings/json/json_path.cu b/cpp/src/strings/json/json_path.cu
new file mode 100644
index 00000000000..cd8aae12070
--- /dev/null
+++ b/cpp/src/strings/json/json_path.cu
@@ -0,0 +1,952 @@
+/*
+ * Copyright (c) 2021, NVIDIA CORPORATION.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <cudf/column/column_device_view.cuh>
+#include <cudf/column/column_factories.hpp>
+#include <cudf/detail/get_value.cuh>
+#include <cudf/detail/null_mask.hpp>
+#include <cudf/detail/utilities/cuda.cuh>
+#include <cudf/detail/utilities/vector_factories.hpp>
+#include <cudf/scalar/scalar.hpp>
+#include <cudf/strings/string_view.cuh>
+#include <cudf/strings/strings_column_view.hpp>
+#include <cudf/types.hpp>
+#include <cudf/utilities/bit.hpp>
+#include <cudf/utilities/error.hpp>
+
+#include <io/utilities/parsing_utils.cuh>
+
+#include <rmm/device_uvector.hpp>
+#include <rmm/exec_policy.hpp>
+
+#include <thrust/optional.h>
+
+namespace cudf {
+namespace strings {
+namespace detail {
+
+namespace {
+
+// debug accessibility
+
+// change to "\n" and 1 to make output more readable
+#define DEBUG_NEWLINE
+constexpr int DEBUG_NEWLINE_LEN = 0;
+
+/**
+ * @brief Result of calling a parse function.
+ *
+ * The primary use of this is to distinguish between "success" and
+ * "success but no data" return cases.  For example, if you are reading the
+ * values of an array you might call a parse function in a while loop. You
+ * would want to continue doing this until you either encounter an error (parse_result::ERROR)
+ * or you get nothing back (parse_result::EMPTY).
+ */
+enum class parse_result {
+  ERROR,    // failure
+  SUCCESS,  // success
+  EMPTY,    // success, but no data
+};
+
+/**
+ * @brief Base parser class inherited by the (device-side) json_state class and
+ * (host-side) path_state class.
+ *
+ * Contains a number of useful utility functions common to parsing json and
+ * JSONPath strings.
+ */
+class parser {
+ protected:
+  CUDA_HOST_DEVICE_CALLABLE parser() : input(nullptr), input_len(0), pos(nullptr) {}
+  CUDA_HOST_DEVICE_CALLABLE parser(const char* _input, int64_t _input_len)
+    : input(_input), input_len(_input_len), pos(_input)
+  {
+    parse_whitespace();
+  }
+
+  CUDA_HOST_DEVICE_CALLABLE parser(parser const& p)
+    : input(p.input), input_len(p.input_len), pos(p.pos)
+  {
+  }
+
+  CUDA_HOST_DEVICE_CALLABLE bool eof(const char* p) { return p - input >= input_len; }
+  CUDA_HOST_DEVICE_CALLABLE bool eof() { return eof(pos); }
+
+  CUDA_HOST_DEVICE_CALLABLE bool parse_whitespace()
+  {
+    while (!eof()) {
+      if (is_whitespace(*pos)) {
+        pos++;
+      } else {
+        return true;
+      }
+    }
+    return false;
+  }
+
+  CUDA_HOST_DEVICE_CALLABLE parse_result parse_string(string_view& str,
+                                                      bool can_be_empty,
+                                                      char quote)
+  {
+    str = string_view(nullptr, 0);
+
+    if (parse_whitespace() && *pos == quote) {
+      const char* start = ++pos;
+      while (!eof()) {
+        if (*pos == quote) {
+          str = string_view(start, pos - start);
+          pos++;
+          return parse_result::SUCCESS;
+        }
+        pos++;
+      }
+    }
+
+    return can_be_empty ? parse_result::EMPTY : parse_result::ERROR;
+  }
+
+  // a name means:
+  // - a string followed by a :
+  // - no string
+  CUDA_HOST_DEVICE_CALLABLE parse_result parse_name(string_view& name,
+                                                    bool can_be_empty,
+                                                    char quote)
+  {
+    if (parse_string(name, can_be_empty, quote) == parse_result::ERROR) {
+      return parse_result::ERROR;
+    }
+
+    // if we got a real string, the next char must be a :
+    if (name.size_bytes() > 0) {
+      if (!parse_whitespace()) { return parse_result::ERROR; }
+      if (*pos == ':') {
+        pos++;
+        return parse_result::SUCCESS;
+      }
+    }
+    return parse_result::EMPTY;
+  }
+
+  // numbers, true, false, null.
+  // this function is not particularly strong. badly formed values will get
+  // consumed without throwing any errors
+  CUDA_HOST_DEVICE_CALLABLE parse_result parse_non_string_value(string_view& val)
+  {
+    if (!parse_whitespace()) { return parse_result::ERROR; }
+
+    // parse to the end of the value
+    char const* start = pos;
+    char const* end   = start;
+    while (!eof(end)) {
+      char const c = *end;
+      if (c == ',' || c == '}' || c == ']' || is_whitespace(c)) { break; }
+
+      // illegal chars
+      if (c == '[' || c == '{' || c == ':' || c == '\"') { return parse_result::ERROR; }
+      end++;
+    }
+    pos = end;
+
+    val = string_view(start, end - start);
+
+    return parse_result::SUCCESS;
+  }
+
+ protected:
+  char const* input;
+  int64_t input_len;
+  char const* pos;
+
+ private:
+  CUDA_HOST_DEVICE_CALLABLE bool is_whitespace(char c) { return c <= ' '; }
+};
+
+/**
+ * @brief Output buffer object.  Used during the preprocess/size-computation step
+ * and the actual output step.
+ *
+ * There is an important distinction between two cases:
+ *
+ * - producing no output at all. that is, the query matched nothing in the input.
+ * - producing empty output. the query matched something in the input, but the
+ *   value of the result is an empty string.
+ *
+ * The `has_output` field is the flag which indicates whether or not the output
+ * from the query should be considered empty or null.
+ *
+ */
+struct json_output {
+  size_t output_max_len;
+  char* output;
+  thrust::optional<size_t> output_len;
+
+  __device__ void add_output(const char* str, size_t len)
+  {
+    if (output != nullptr) { memcpy(output + output_len.value_or(0), str, len); }
+    output_len = output_len.value_or(0) + len;
+  }
+
+  __device__ void add_output(string_view const& str) { add_output(str.data(), str.size_bytes()); }
+};
+
+enum json_element_type { NONE, OBJECT, ARRAY, VALUE };
+
+/**
+ * @brief Parsing class that holds the current state of the json to be parsed and provides
+ * functions for navigating through it.
+ */
+class json_state : private parser {
+ public:
+  __device__ json_state()
+    : parser(),
+      cur_el_start(nullptr),
+      cur_el_type(json_element_type::NONE),
+      parent_el_type(json_element_type::NONE)
+  {
+  }
+  __device__ json_state(const char* _input, int64_t _input_len)
+    : parser(_input, _input_len),
+      cur_el_start(nullptr),
+      cur_el_type(json_element_type::NONE),
+      parent_el_type(json_element_type::NONE)
+  {
+  }
+
+  __device__ json_state(json_state const& j)
+    : parser(j),
+      cur_el_start(j.cur_el_start),
+      cur_el_type(j.cur_el_type),
+      parent_el_type(j.parent_el_type)
+  {
+  }
+
+  // retrieve the entire current element into the output
+  __device__ parse_result extract_element(json_output* output, bool list_element)
+  {
+    char const* start = cur_el_start;
+    char const* end   = start;
+
+    // if we're a value type, do a simple value parse.
+    if (cur_el_type == VALUE) {
+      pos = cur_el_start;
+      if (parse_value() != parse_result::SUCCESS) { return parse_result::ERROR; }
+      end = pos;
+
+      // SPARK-specific behavior.  if this is a non-list-element wrapped in quotes,
+      // strip them. we may need to make this behavior configurable in some way
+      // later on.
+      if (!list_element && *start == '\"' && *(end - 1) == '\"') {
+        start++;
+        end--;
+      }
+    }
+    // otherwise, march through everything inside
+    else {
+      int obj_count = 0;
+      int arr_count = 0;
+
+      while (!eof(end)) {
+        // could do some additional checks here. we know our current
+        // element type, so we could be more strict on what kinds of
+        // characters we expect to see.
+        switch (*end++) {
+          case '{': obj_count++; break;
+          case '}': obj_count--; break;
+          case '[': arr_count++; break;
+          case ']': arr_count--; break;
+          default: break;
+        }
+        if (obj_count == 0 && arr_count == 0) { break; }
+      }
+      if (obj_count > 0 || arr_count > 0) { return parse_result::ERROR; }
+      pos = end;
+    }
+
+    // parse trailing ,
+    if (parse_whitespace()) {
+      if (*pos == ',') { pos++; }
+    }
+
+    if (output != nullptr) { output->add_output({start, static_cast<size_type>(end - start)}); }
+    return parse_result::SUCCESS;
+  }
+
+  // skip the next element
+  __device__ parse_result skip_element() { return extract_element(nullptr, false); }
+
+  // advance to the next element
+  __device__ parse_result next_element() { return next_element_internal(false); }
+
+  // advance inside the current element
+  __device__ parse_result child_element(json_element_type expected_type)
+  {
+    if (expected_type != NONE && cur_el_type != expected_type) { return parse_result::ERROR; }
+
+    // if we succeed, record our parent element type.
+    auto const prev_el_type = cur_el_type;
+    auto const result       = next_element_internal(true);
+    if (result == parse_result::SUCCESS) { parent_el_type = prev_el_type; }
+    return result;
+  }
+
+  // return the next element that matches the specified name.
+  __device__ parse_result next_matching_element(string_view const& name, bool inclusive)
+  {
+    // if we're not including the current element, skip it
+    if (!inclusive) {
+      parse_result result = next_element_internal(false);
+      if (result != parse_result::SUCCESS) { return result; }
+    }
+    // loop until we find a match or there's nothing left
+    do {
+      // wildcard matches anything
+      if (name.size_bytes() == 1 && name.data()[0] == '*') {
+        return parse_result::SUCCESS;
+      } else if (cur_el_name == name) {
+        return parse_result::SUCCESS;
+      }
+
+      // next
+      parse_result result = next_element_internal(false);
+      if (result != parse_result::SUCCESS) { return result; }
+    } while (1);
+
+    return parse_result::ERROR;
+  }
+
+ private:
+  // parse a value - either a string or a number/null/bool
+  __device__ parse_result parse_value()
+  {
+    if (!parse_whitespace()) { return parse_result::ERROR; }
+
+    // string or number?
+    string_view unused;
+    return *pos == '\"' ? parse_string(unused, false, '\"') : parse_non_string_value(unused);
+  }
+
+  __device__ parse_result next_element_internal(bool child)
+  {
+    // if we're not getting a child element, skip the current element.
+    // this will leave pos as the first character -after- the close of
+    // the current element
+    if (!child && cur_el_start != nullptr) {
+      if (skip_element() == parse_result::ERROR) { return parse_result::ERROR; }
+      cur_el_start = nullptr;
+    }
+    // otherwise pos will be at the first character within the current element
+
+    // can only get the child of an object or array.
+    // this could theoretically be handled as an error, but the evaluators I've found
+    // seem to treat this as "it's nothing"
+    if (child && (cur_el_type == VALUE || cur_el_type == NONE)) { return parse_result::EMPTY; }
+
+    // what's next
+    if (!parse_whitespace()) { return parse_result::EMPTY; }
+    // if we're closing off a parent element, we're done
+    char const c = *pos;
+    if (c == ']' || c == '}') { return parse_result::EMPTY; }
+
+    // if we're not accessing elements of an array, check for name.
+    bool const array_access =
+      (cur_el_type == ARRAY && child) || (parent_el_type == ARRAY && !child);
+    if (!array_access && parse_name(cur_el_name, true, '\"') == parse_result::ERROR) {
+      return parse_result::ERROR;
+    }
+
+    // element type
+    if (!parse_whitespace()) { return parse_result::EMPTY; }
+    switch (*pos++) {
+      case '[': cur_el_type = ARRAY; break;
+      case '{': cur_el_type = OBJECT; break;
+
+      case ',':
+      case ':':
+      case '\'': return parse_result::ERROR;
+
+      // value type
+      default: cur_el_type = VALUE; break;
+    }
+
+    // the start of the current element is always at the value, not the name
+    cur_el_start = pos - 1;
+    return parse_result::SUCCESS;
+  }
+
+  const char* cur_el_start;          // pointer to the first character of the -value- of the current
+                                     // element - not the name
+  string_view cur_el_name;           // name of the current element (if applicable)
+  json_element_type cur_el_type;     // type of the current element
+  json_element_type parent_el_type;  // parent element type
+};
+
+enum class path_operator_type { ROOT, CHILD, CHILD_WILDCARD, CHILD_INDEX, ERROR, END };
+
+/**
+ * @brief A "command" operator used to query a json string.  A full query is
+ * an array of these operators applied to the incoming json string,
+ */
+struct path_operator {
+  CUDA_HOST_DEVICE_CALLABLE path_operator()
+    : type(path_operator_type::ERROR), index(-1), expected_type{NONE}
+  {
+  }
+  CUDA_HOST_DEVICE_CALLABLE path_operator(path_operator_type _type,
+                                          json_element_type _expected_type = NONE)
+    : type(_type), index(-1), expected_type{_expected_type}
+  {
+  }
+
+  path_operator_type type;  // operator type
+  // the expected element type we're applying this operation to.
+  // for example:
+  //    - you cannot retrieve a subscripted field (eg [5]) from an object.
+  //    - you cannot retrieve a field by name (eg  .book) from an array.
+  //    - you -can- use .* for both arrays and objects
+  // a value of NONE implies any type is accepted
+  json_element_type expected_type;  // the expected type of the element we're working with
+  string_view name;                 // name to match against (if applicable)
+  int index;                        // index for subscript operator
+};
+
+/**
+ * @brief Parsing class that holds the current state of the JSONPath string to be parsed
+ * and provides functions for navigating through it. This is only called on the host
+ * during the preprocess step which builds a command buffer that the gpu uses.
+ */
+class path_state : private parser {
+ public:
+  path_state(const char* _path, size_t _path_len) : parser(_path, _path_len) {}
+
+  // get the next operator in the JSONPath string
+  path_operator get_next_operator()
+  {
+    if (eof()) { return {path_operator_type::END}; }
+
+    switch (*pos++) {
+      case '$': return {path_operator_type::ROOT};
+
+      case '.': {
+        path_operator op;
+        string_view term{".[", 2};
+        if (parse_path_name(op.name, term)) {
+          // this is another potential use case for __SPARK_BEHAVIORS / configurability
+          // Spark currently only handles the wildcard operator inside [*], it does
+          // not handle .*
+          if (op.name.size_bytes() == 1 && op.name.data()[0] == '*') {
+            op.type          = path_operator_type::CHILD_WILDCARD;
+            op.expected_type = NONE;
+          } else {
+            op.type          = path_operator_type::CHILD;
+            op.expected_type = OBJECT;
+          }
+          return op;
+        }
+      } break;
+
+      // 3 ways this can be used
+      // indices:   [0]
+      // name:      ['book']
+      // wildcard:  [*]
+      case '[': {
+        path_operator op;
+        string_view term{"]", 1};
+        bool const is_string = *pos == '\'' ? true : false;
+        if (parse_path_name(op.name, term)) {
+          pos++;
+          if (op.name.size_bytes() == 1 && op.name.data()[0] == '*') {
+            op.type          = path_operator_type::CHILD_WILDCARD;
+            op.expected_type = NONE;
+          } else {
+            if (is_string) {
+              op.type          = path_operator_type::CHILD;
+              op.expected_type = OBJECT;
+            } else {
+              op.type  = path_operator_type::CHILD_INDEX;
+              op.index = cudf::io::parse_numeric<int>(
+                op.name.data(), op.name.data() + op.name.size_bytes(), json_opts, -1);
+              CUDF_EXPECTS(op.index >= 0, "Invalid numeric index specified in JSONPath");
+              op.expected_type = ARRAY;
+            }
+          }
+          return op;
+        }
+      } break;
+
+      // wildcard operator
+      case '*': {
+        pos++;
+        return path_operator{path_operator_type::CHILD_WILDCARD};
+      } break;
+
+      default: CUDF_FAIL("Unrecognized JSONPath operator"); break;
+    }
+    return {path_operator_type::ERROR};
+  }
+
+ private:
+  cudf::io::parse_options_view json_opts{',', '\n', '\"', '.'};
+
+  bool parse_path_name(string_view& name, string_view const& terminators)
+  {
+    switch (*pos) {
+      case '*':
+        name = string_view(pos, 1);
+        pos++;
+        break;
+
+      case '\'':
+        if (parse_string(name, false, '\'') != parse_result::SUCCESS) { return false; }
+        break;
+
+      default: {
+        size_t const chars_left = input_len - (pos - input);
+        char const* end         = std::find_first_of(
+          pos, pos + chars_left, terminators.data(), terminators.data() + terminators.size_bytes());
+        if (end) {
+          name = string_view(pos, end - pos);
+          pos  = end;
+        } else {
+          name = string_view(pos, chars_left);
+          pos  = input + input_len;
+        }
+        break;
+      }
+    }
+
+    // an empty name is not valid
+    CUDF_EXPECTS(name.size_bytes() > 0, "Invalid empty name in JSONPath query string");
+
+    return true;
+  }
+};
+
+/**
+ * @brief Preprocess the incoming JSONPath string on the host to generate a
+ * command buffer for use by the GPU.
+ *
+ * @param json_path The incoming json path
+ * @param stream CUDA stream to perform any gpu actions on
+ * @returns A pair containing the command buffer, and maximum stack depth required.
+ */
+std::pair<thrust::optional<rmm::device_uvector<path_operator>>, int> build_command_buffer(
+  cudf::string_scalar const& json_path, rmm::cuda_stream_view stream)
+{
+  std::string h_json_path = json_path.to_string(stream);
+  path_state p_state(h_json_path.data(), static_cast<size_type>(h_json_path.size()));
+
+  std::vector<path_operator> h_operators;
+
+  path_operator op;
+  int max_stack_depth = 1;
+  do {
+    op = p_state.get_next_operator();
+    if (op.type == path_operator_type::ERROR) {
+      CUDF_FAIL("Encountered invalid JSONPath input string");
+    }
+    if (op.type == path_operator_type::CHILD_WILDCARD) { max_stack_depth++; }
+    // convert pointer to device pointer
+    if (op.name.size_bytes() > 0) {
+      op.name =
+        string_view(json_path.data() + (op.name.data() - h_json_path.data()), op.name.size_bytes());
+    }
+    if (op.type == path_operator_type::ROOT) {
+      CUDF_EXPECTS(h_operators.size() == 0, "Root operator ($) can only exist at the root");
+    }
+    // if we haven't gotten a root operator to start, and we're not empty, quietly push a
+    // root operator now.
+    if (h_operators.size() == 0 && op.type != path_operator_type::ROOT &&
+        op.type != path_operator_type::END) {
+      h_operators.push_back(path_operator{path_operator_type::ROOT});
+    }
+    h_operators.push_back(op);
+  } while (op.type != path_operator_type::END);
+
+  auto const is_empty = h_operators.size() == 1 && h_operators[0].type == path_operator_type::END;
+  return is_empty
+           ? std::make_pair(thrust::nullopt, 0)
+           : std::make_pair(
+               thrust::make_optional(cudf::detail::make_device_uvector_sync(h_operators, stream)),
+               max_stack_depth);
+}
+
+#define PARSE_TRY(_x)                                                       \
+  do {                                                                      \
+    last_result = _x;                                                       \
+    if (last_result == parse_result::ERROR) { return parse_result::ERROR; } \
+  } while (0)
+
+/**
+ * @brief Parse a single json string using the provided command buffer
+ *
+ * @param j_state The incoming json string and associated parser
+ * @param commands The command buffer to be applied to the string. Always ends with a
+ * path_operator_type::END
+ * @param output Buffer used to store the results of the query
+ * @returns A result code indicating success/fail/empty.
+ */
+template <int max_command_stack_depth>
+__device__ parse_result parse_json_path(json_state& j_state,
+                                        path_operator const* commands,
+                                        json_output& output)
+{
+  // manually maintained context stack in lieu of calling parse_json_path recursively.
+  struct context {
+    json_state j_state;
+    path_operator const* commands;
+    bool list_element;
+    bool state_flag;
+  };
+  context stack[max_command_stack_depth];
+  int stack_pos     = 0;
+  auto push_context = [&stack, &stack_pos](json_state const& _j_state,
+                                           path_operator const* _commands,
+                                           bool _list_element = false,
+                                           bool _state_flag   = false) {
+    if (stack_pos == max_command_stack_depth - 1) { return false; }
+    stack[stack_pos++] = context{_j_state, _commands, _list_element, _state_flag};
+    return true;
+  };
+  auto pop_context = [&stack, &stack_pos](context& c) {
+    if (stack_pos > 0) {
+      c = stack[--stack_pos];
+      return true;
+    }
+    return false;
+  };
+  push_context(j_state, commands, false);
+
+  parse_result last_result = parse_result::SUCCESS;
+  context ctx;
+  int element_count = 0;
+  while (pop_context(ctx)) {
+    path_operator op = *ctx.commands;
+
+    switch (op.type) {
+      // whatever the first object is
+      case path_operator_type::ROOT:
+        PARSE_TRY(ctx.j_state.next_element());
+        push_context(ctx.j_state, ctx.commands + 1);
+        break;
+
+      // .name
+      // ['name']
+      // (numeric subscripts such as [1] are handled by CHILD_INDEX)
+      // will return a single thing
+      case path_operator_type::CHILD: {
+        PARSE_TRY(ctx.j_state.child_element(op.expected_type));
+        if (last_result == parse_result::SUCCESS) {
+          PARSE_TRY(ctx.j_state.next_matching_element(op.name, true));
+          if (last_result == parse_result::SUCCESS) {
+            push_context(ctx.j_state, ctx.commands + 1, ctx.list_element);
+          }
+        }
+      } break;
+
+      // .*
+      // [*]
+      // will return an array of things
+      case path_operator_type::CHILD_WILDCARD: {
+        // if we're on the first element of this wildcard
+        if (!ctx.state_flag) {
+          // we will only ever be returning 1 array
+          if (!ctx.list_element) { output.add_output({"[" DEBUG_NEWLINE, 1 + DEBUG_NEWLINE_LEN}); }
+
+          // step into the child element
+          PARSE_TRY(ctx.j_state.child_element(op.expected_type));
+          if (last_result == parse_result::EMPTY) {
+            if (!ctx.list_element) {
+              output.add_output({"]" DEBUG_NEWLINE, 1 + DEBUG_NEWLINE_LEN});
+            }
+            last_result = parse_result::SUCCESS;
+            break;
+          }
+
+          // first element
+          PARSE_TRY(ctx.j_state.next_matching_element({"*", 1}, true));
+          if (last_result == parse_result::EMPTY) {
+            if (!ctx.list_element) {
+              output.add_output({"]" DEBUG_NEWLINE, 1 + DEBUG_NEWLINE_LEN});
+            }
+            last_result = parse_result::SUCCESS;
+            break;
+          }
+
+          // re-push ourselves
+          push_context(ctx.j_state, ctx.commands, ctx.list_element, true);
+          // push the next command
+          push_context(ctx.j_state, ctx.commands + 1, true);
+        } else {
+          // next element
+          PARSE_TRY(ctx.j_state.next_matching_element({"*", 1}, false));
+          if (last_result == parse_result::EMPTY) {
+            if (!ctx.list_element) {
+              output.add_output({"]" DEBUG_NEWLINE, 1 + DEBUG_NEWLINE_LEN});
+            }
+            last_result = parse_result::SUCCESS;
+            break;
+          }
+
+          // re-push ourselves
+          push_context(ctx.j_state, ctx.commands, ctx.list_element, true);
+          // push the next command
+          push_context(ctx.j_state, ctx.commands + 1, true);
+        }
+      } break;
+
+      // [0]
+      // [1]
+      // etc
+      // returns a single thing
+      case path_operator_type::CHILD_INDEX: {
+        PARSE_TRY(ctx.j_state.child_element(op.expected_type));
+        if (last_result == parse_result::SUCCESS) {
+          string_view const any{"*", 1};
+          PARSE_TRY(ctx.j_state.next_matching_element(any, true));
+          if (last_result == parse_result::SUCCESS) {
+            int idx;
+            for (idx = 1; idx <= op.index; idx++) {
+              PARSE_TRY(ctx.j_state.next_matching_element(any, false));
+              if (last_result == parse_result::EMPTY) { break; }
+            }
+            // if we didn't end up at the index we requested, this is an invalid index
+            if (idx - 1 != op.index) { return parse_result::ERROR; }
+            push_context(ctx.j_state, ctx.commands + 1, ctx.list_element);
+          }
+        }
+      } break;
+
+      // some sort of error.
+      case path_operator_type::ERROR: return parse_result::ERROR; break;
+
+      // END case
+      default: {
+        if (ctx.list_element && element_count > 0) {
+          output.add_output({"," DEBUG_NEWLINE, 1 + DEBUG_NEWLINE_LEN});
+        }
+        PARSE_TRY(ctx.j_state.extract_element(&output, ctx.list_element));
+        if (ctx.list_element && last_result != parse_result::EMPTY) { element_count++; }
+      } break;
+    }
+  }
+
+  return parse_result::SUCCESS;
+}
+
+// hardcoding this for now. to reach a stack depth of 8 would require
+// a JSONPath containing 7 nested wildcards so this is probably reasonable.
+constexpr int max_command_stack_depth = 8;
+
+/**
+ * @brief Parse a single json string using the provided command buffer
+ *
+ * This function exists primarily as a shim for debugging purposes.
+ *
+ * @param input The incoming json string
+ * @param input_len Size of the incoming json string
+ * @param commands The command buffer to be applied to the string. Always ends with a
+ * path_operator_type::END
+ * @param out_buf Buffer used to store the results of the query (nullptr in the size computation
+ * step)
+ * @param out_buf_size Size of the output buffer
+ * @returns A pair containing the result code and the output buffer.
+ */
+__device__ thrust::pair<parse_result, json_output> get_json_object_single(
+  char const* input,
+  size_t input_len,
+  path_operator const* const commands,
+  char* out_buf,
+  size_t out_buf_size)
+{
+  json_state j_state(input, input_len);
+  json_output output{out_buf_size, out_buf};
+
+  auto const result = parse_json_path<max_command_stack_depth>(j_state, commands, output);
+
+  return {result, output};
+}
+
+/**
+ * @brief Kernel for running the JSONPath query.
+ *
+ * This kernel is launched twice. The first pass computes the output sizes.
+ * The second pass fills in the provided output buffers (chars and validity)
+ * using the offsets derived from those sizes.
+ *
+ * @param col Device view of the incoming string
+ * @param commands JSONPath command buffer
+ * @param output_offsets Buffer used to store the string offsets for the results of the query
+ * @param out_buf Buffer used to store the results of the query
+ * @param out_validity Output validity buffer
+ * @param out_valid_count Output count of valid bits
+ */
+template <int block_size>
+__launch_bounds__(block_size) __global__
+  void get_json_object_kernel(column_device_view col,
+                              path_operator const* const commands,
+                              offset_type* output_offsets,
+                              thrust::optional<char*> out_buf,
+                              thrust::optional<bitmask_type*> out_validity,
+                              thrust::optional<size_type*> out_valid_count)
+{
+  size_type tid    = threadIdx.x + (blockDim.x * blockIdx.x);
+  size_type stride = blockDim.x * gridDim.x;
+
+  // out_valid_count is zero-initialized on the host; a reset here would race with atomicAdd
+  size_type warp_valid_count{0};
+
+  auto active_threads = __ballot_sync(0xffffffff, tid < col.size());
+  while (tid < col.size()) {
+    bool is_valid         = false;
+    string_view const str = col.element<string_view>(tid);
+    size_type output_size = 0;
+    if (str.size_bytes() > 0) {
+      char* dst = out_buf.has_value() ? out_buf.value() + output_offsets[tid] : nullptr;
+      size_t const dst_size =
+        out_buf.has_value() ? output_offsets[tid + 1] - output_offsets[tid] : 0;
+
+      parse_result result;
+      json_output out;
+      thrust::tie(result, out) =
+        get_json_object_single(str.data(), str.size_bytes(), commands, dst, dst_size);
+      output_size = out.output_len.value_or(0);
+      if (out.output_len.has_value() && result == parse_result::SUCCESS) { is_valid = true; }
+    }
+
+    // filled in only during the precompute step. during the compute step, the offsets
+    // are fed back in so we do -not- want to write them out
+    if (!out_buf.has_value()) { output_offsets[tid] = static_cast<offset_type>(output_size); }
+
+    // validity filled in only during the output step
+    if (out_validity.has_value()) {
+      uint32_t mask = __ballot_sync(active_threads, is_valid);
+      // 0th lane of the warp writes the validity
+      if (!(tid % cudf::detail::warp_size)) {
+        out_validity.value()[cudf::word_index(tid)] = mask;
+        warp_valid_count += __popc(mask);
+      }
+    }
+
+    tid += stride;
+    active_threads = __ballot_sync(active_threads, tid < col.size());
+  }
+
+  // sum the valid counts across the whole block
+  if (out_valid_count) {
+    size_type block_valid_count =
+      cudf::detail::single_lane_block_sum_reduce<block_size, 0>(warp_valid_count);
+    if (threadIdx.x == 0) { atomicAdd(out_valid_count.value(), block_valid_count); }
+  }
+}
+
+/**
+ * @copydoc cudf::strings::detail::get_json_object
+ */
+std::unique_ptr<cudf::column> get_json_object(cudf::strings_column_view const& col,
+                                              cudf::string_scalar const& json_path,
+                                              rmm::cuda_stream_view stream,
+                                              rmm::mr::device_memory_resource* mr)
+{
+  // preprocess the json_path into a command buffer
+  auto preprocess = build_command_buffer(json_path, stream);
+  CUDF_EXPECTS(std::get<1>(preprocess) <= max_command_stack_depth,
+               "Encountered JSONPath string that is too complex");
+
+  // allocate output offsets buffer.
+  auto offsets = cudf::make_fixed_width_column(
+    data_type{type_id::INT32}, col.size() + 1, mask_state::UNALLOCATED, stream, mr);
+  cudf::mutable_column_view offsets_view(*offsets);
+
+  // if the query is empty, return a string column containing all nulls
+  if (!std::get<0>(preprocess).has_value()) {
+    return std::make_unique<column>(
+      data_type{type_id::STRING},
+      col.size(),
+      rmm::device_buffer{0, stream, mr},  // no data
+      cudf::detail::create_null_mask(col.size(), mask_state::ALL_NULL, stream, mr),
+      col.size());  // null count
+  }
+
+  constexpr int block_size = 512;
+  cudf::detail::grid_1d const grid{col.size(), block_size};
+
+  auto cdv = column_device_view::create(col.parent(), stream);
+
+  // preprocess sizes (returned in the offsets buffer)
+  get_json_object_kernel<block_size>
+    <<<grid.num_blocks, grid.num_threads_per_block, 0, stream.value()>>>(
+      *cdv,
+      std::get<0>(preprocess).value().data(),
+      offsets_view.head<offset_type>(),
+      thrust::nullopt,
+      thrust::nullopt,
+      thrust::nullopt);
+
+  // convert sizes to offsets
+  thrust::exclusive_scan(rmm::exec_policy(stream),
+                         offsets_view.head<offset_type>(),
+                         offsets_view.head<offset_type>() + col.size() + 1,
+                         offsets_view.head<offset_type>(),
+                         0);
+  size_type const output_size =
+    cudf::detail::get_value<offset_type>(offsets_view, col.size(), stream);
+
+  // allocate output string column
+  auto chars = cudf::make_fixed_width_column(
+    data_type{type_id::INT8}, output_size, mask_state::UNALLOCATED, stream, mr);
+
+  // potential optimization : if we know that all outputs are valid, we could skip creating
+  // the validity mask altogether
+  rmm::device_buffer validity =
+    cudf::detail::create_null_mask(col.size(), mask_state::UNINITIALIZED, stream, mr);
+
+  // compute results
+  cudf::mutable_column_view chars_view(*chars);
+  rmm::device_scalar<size_type> d_valid_count{0, stream};
+  get_json_object_kernel<block_size>
+    <<<grid.num_blocks, grid.num_threads_per_block, 0, stream.value()>>>(
+      *cdv,
+      std::get<0>(preprocess).value().data(),
+      offsets_view.head<offset_type>(),
+      chars_view.head<char>(),
+      static_cast<bitmask_type*>(validity.data()),
+      d_valid_count.data());
+
+  return make_strings_column(col.size(),
+                             std::move(offsets),
+                             std::move(chars),
+                             col.size() - d_valid_count.value(),
+                             std::move(validity),
+                             stream,
+                             mr);
+}
+
+}  // namespace
+}  // namespace detail
+
+/**
+ * @copydoc cudf::strings::get_json_object
+ */
+std::unique_ptr<cudf::column> get_json_object(cudf::strings_column_view const& col,
+                                              cudf::string_scalar const& json_path,
+                                              rmm::mr::device_memory_resource* mr)
+{
+  CUDF_FUNC_RANGE();
+  return detail::get_json_object(col, json_path, rmm::cuda_stream_default, mr);
+}
+
+}  // namespace strings
+}  // namespace cudf
diff --git a/cpp/src/strings/strings_column_factories.cu b/cpp/src/strings/strings_column_factories.cu
index 4d6c9389173..c4c2ff86085 100644
--- a/cpp/src/strings/strings_column_factories.cu
+++ b/cpp/src/strings/strings_column_factories.cu
@@ -23,7 +23,6 @@
 #include <strings/utilities.cuh>
 
 #include <rmm/cuda_stream_view.hpp>
-#include <rmm/device_vector.hpp>
 #include <rmm/exec_policy.hpp>
 
 namespace cudf {
diff --git a/cpp/src/strings/strings_column_view.cu b/cpp/src/strings/strings_column_view.cu
index 3eb1841e467..3c98796bf2d 100644
--- a/cpp/src/strings/strings_column_view.cu
+++ b/cpp/src/strings/strings_column_view.cu
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2019-2020, NVIDIA CORPORATION.
+ * Copyright (c) 2019-2021, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -15,19 +15,17 @@
  */
 
 #include <cudf/column/column_device_view.cuh>
+#include <cudf/detail/get_value.cuh>
 #include <cudf/detail/nvtx/ranges.hpp>
 #include <cudf/strings/string_view.cuh>
 #include <cudf/strings/strings_column_view.hpp>
 #include <cudf/utilities/error.hpp>
 
 #include <rmm/cuda_stream_view.hpp>
+#include <rmm/device_uvector.hpp>
 #include <rmm/exec_policy.hpp>
 
-#include <thrust/for_each.h>
 #include <thrust/transform.h>
-#include <thrust/transform_scan.h>
-
-#include <iostream>
 
 namespace cudf {
 //
@@ -57,113 +55,39 @@ size_type strings_column_view::chars_size() const noexcept
 }
 
 namespace strings {
-// print strings to stdout
-void print(strings_column_view const& strings,
-           size_type first,
-           size_type last,
-           size_type max_width,
-           const char* delimiter)
-{
-  size_type count = strings.size();
-  if (last < 0 || last > count) last = count;
-  if (first < 0) first = 0;
-  CUDF_EXPECTS(((first >= 0) && (first < last)), "invalid start parameter");
-  count = last - first;
-
-  // stick with the default stream for this odd/rare stdout function
-  auto strings_column = column_device_view::create(strings.parent());
-  auto d_column       = *strings_column;
-
-  // create output strings offsets
-  rmm::device_vector<size_type> output_offsets(count + 1);
-  size_type* d_output_offsets = output_offsets.data().get();
-  thrust::transform_inclusive_scan(
-    thrust::device,
-    thrust::make_counting_iterator<size_type>(first),
-    thrust::make_counting_iterator<size_type>(last),
-    d_output_offsets + 1,
-    [d_column, max_width] __device__(size_type idx) {
-      if (d_column.is_null(idx)) return static_cast<size_type>(0);
-      string_view d_str = d_column.element<string_view>(idx);
-      size_type bytes   = d_str.size_bytes();
-      if ((max_width > 0) && (d_str.length() > max_width)) bytes = d_str.byte_offset(max_width);
-      return static_cast<size_type>(bytes + 1);  // allow for null-terminator on non-null strings
-    },
-    thrust::plus<size_type>());
-  CUDA_TRY(cudaMemset(d_output_offsets, 0, sizeof(*d_output_offsets)));
-  // build output buffer
-  size_type buffer_size = output_offsets.back();  // last element has total size
-  if (buffer_size == 0) {
-    std::cout << "all " << count << " strings are null\n";
-    return;
-  }
-  rmm::device_vector<char> buffer(buffer_size, 0);  // allocate and pre-null-terminate
-  char* d_buffer = buffer.data().get();
-  // copy strings into output buffer
-  thrust::for_each_n(
-    thrust::device,
-    thrust::make_counting_iterator<size_type>(0),
-    count,
-    [d_column, max_width, first, d_output_offsets, d_buffer] __device__(size_type idx) {
-      if (d_column.is_null(first + idx)) return;
-      string_view d_str = d_column.element<string_view>(first + idx);
-      size_type bytes   = d_str.size_bytes();
-      if ((max_width > 0) && (d_str.length() > max_width)) bytes = d_str.byte_offset(max_width);
-      memcpy(d_buffer + d_output_offsets[idx], d_str.data(), bytes);
-    });
-
-  // copy output buffer to host
-  std::vector<size_type> h_offsets(count + 1);
-  CUDA_TRY(cudaMemcpy(
-    h_offsets.data(), d_output_offsets, (count + 1) * sizeof(size_type), cudaMemcpyDeviceToHost));
-  std::vector<char> h_buffer(buffer_size);
-  CUDA_TRY(cudaMemcpy(h_buffer.data(), d_buffer, buffer_size, cudaMemcpyDeviceToHost));
-
-  // print out the strings to stdout
-  for (size_type idx = 0; idx < count; ++idx) {
-    size_type offset = h_offsets[idx];
-    size_type length = h_offsets[idx + 1] - offset;
-    std::cout << idx << ":";
-    if (length)
-      std::cout << "[" << std::string(h_buffer.data() + offset) << "]";
-    else
-      std::cout << "<null>";
-    std::cout << delimiter;
-  }
-}
 
-//
-std::pair<rmm::device_vector<char>, rmm::device_vector<size_type>> create_offsets(
+std::pair<rmm::device_uvector<char>, rmm::device_uvector<size_type>> create_offsets(
   strings_column_view const& strings,
   rmm::cuda_stream_view stream,
   rmm::mr::device_memory_resource* mr)
 {
   CUDF_FUNC_RANGE();
-  size_type count          = strings.size();
-  const int32_t* d_offsets = strings.offsets().data<int32_t>();
+  size_type const count = strings.size();
+
+  auto d_offsets = strings.offsets().data<int32_t>();
   d_offsets += strings.offset();  // nvbug-2808421 : do not combine with the previous line
-  int32_t first = 0;
-  CUDA_TRY(
-    cudaMemcpyAsync(&first, d_offsets, sizeof(int32_t), cudaMemcpyDeviceToHost, stream.value()));
-  rmm::device_vector<size_type> offsets(count + 1);
+
+  rmm::device_uvector<size_type> offsets(count + 1, stream);
   // normalize the offset values for the column offset
-  thrust::transform(
-    rmm::exec_policy(stream),
-    d_offsets,
-    d_offsets + count + 1,
-    offsets.begin(),
-    [first] __device__(int32_t offset) { return static_cast<size_type>(offset - first); });
-  // copy the chars column data
-  int32_t bytes = 0;  // last offset entry is the size in bytes
-  CUDA_TRY(cudaMemcpyAsync(
-    &bytes, d_offsets + count, sizeof(int32_t), cudaMemcpyDeviceToHost, stream.value()));
+  thrust::transform(rmm::exec_policy(stream),
+                    d_offsets,
+                    d_offsets + count + 1,
+                    offsets.begin(),
+                    [d_offsets] __device__(int32_t offset) {
+                      return static_cast<size_type>(offset - d_offsets[0]);
+                    });
+
+  // get the input chars column byte offset
+  auto const bytes = offsets.element(count, stream);
+  auto const chars_offset =
+    cudf::detail::get_value<offset_type>(strings.offsets(), strings.offset(), stream);
   stream.synchronize();
 
-  bytes -= first;
-  const char* d_chars = strings.chars().data<char>() + first;
-  rmm::device_vector<char> chars(bytes);
-  CUDA_TRY(
-    cudaMemcpyAsync(chars.data().get(), d_chars, bytes, cudaMemcpyDeviceToHost, stream.value()));
+  // copy the chars column data
+  const char* d_chars = strings.chars().data<char>() + chars_offset;
+  rmm::device_uvector<char> chars(bytes, stream);
+  CUDA_TRY(cudaMemcpyAsync(chars.data(), d_chars, bytes, cudaMemcpyDefault, stream.value()));
+
   // return offsets and chars
   return std::make_pair(std::move(chars), std::move(offsets));
 }
diff --git a/cpp/src/strings/substring.cu b/cpp/src/strings/substring.cu
index f712b0cb6aa..e8da3120c38 100644
--- a/cpp/src/strings/substring.cu
+++ b/cpp/src/strings/substring.cu
@@ -402,8 +402,9 @@ std::unique_ptr<column> slice_strings(strings_column_view const& strings,
                "Strings and delimiters column sizes do not match");
 
   CUDF_FUNC_RANGE();
-  auto delimiters_dev_view_ptr = cudf::column_device_view::create(delimiters.parent(), 0);
-  auto delimiters_dev_view     = *delimiters_dev_view_ptr;
+  auto delimiters_dev_view_ptr =
+    cudf::column_device_view::create(delimiters.parent(), rmm::cuda_stream_default);
+  auto delimiters_dev_view = *delimiters_dev_view_ptr;
   return (delimiters_dev_view.nullable())
            ? detail::slice_strings(
                strings,
diff --git a/cpp/src/text/subword/load_hash_file.cu b/cpp/src/text/subword/load_hash_file.cu
index f3f96933f19..3800339a6a2 100644
--- a/cpp/src/text/subword/load_hash_file.cu
+++ b/cpp/src/text/subword/load_hash_file.cu
@@ -36,17 +36,12 @@
 namespace nvtext {
 namespace detail {
 
-/**
- * @brief Retrieve the code point metadata table.
- *
- * Build the code point metadata table in device memory
- * using the vector pieces from codepoint_metadata.ah
- */
-const codepoint_metadata_type* get_codepoint_metadata(rmm::cuda_stream_view stream)
-{
-  static cudf::strings::detail::thread_safe_per_context_cache<codepoint_metadata_type>
-    g_codepoint_metadata;
-  return g_codepoint_metadata.find_or_initialize([stream](void) {
+namespace {
+struct get_codepoint_metadata_init {
+  rmm::cuda_stream_view stream;
+
+  codepoint_metadata_type* operator()() const
+  {
     codepoint_metadata_type* table =
       static_cast<codepoint_metadata_type*>(rmm::mr::get_current_device_resource()->allocate(
         codepoint_metadata_size * sizeof(codepoint_metadata_type), stream));
@@ -66,20 +61,14 @@ const codepoint_metadata_type* get_codepoint_metadata(rmm::cuda_stream_view stre
       cudaMemcpyHostToDevice,
       stream.value()));
     return table;
-  });
-}
+  }
+};
 
-/**
- * @brief Retrieve the aux code point data table.
- *
- * Build the aux code point data table in device memory
- * using the vector pieces from codepoint_metadata.ah
- */
-const aux_codepoint_data_type* get_aux_codepoint_data(rmm::cuda_stream_view stream)
-{
-  static cudf::strings::detail::thread_safe_per_context_cache<aux_codepoint_data_type>
-    g_aux_codepoint_data;
-  return g_aux_codepoint_data.find_or_initialize([stream](void) {
+struct get_aux_codepoint_data_init {
+  rmm::cuda_stream_view stream;
+
+  aux_codepoint_data_type* operator()() const
+  {
     aux_codepoint_data_type* table =
       static_cast<aux_codepoint_data_type*>(rmm::mr::get_current_device_resource()->allocate(
         aux_codepoint_data_size * sizeof(aux_codepoint_data_type), stream));
@@ -111,7 +100,37 @@ const aux_codepoint_data_type* get_aux_codepoint_data(rmm::cuda_stream_view stre
       cudaMemcpyHostToDevice,
       stream.value()));
     return table;
-  });
+  }
+};
+}  // namespace
+
+/**
+ * @brief Retrieve the code point metadata table.
+ *
+ * Build the code point metadata table in device memory
+ * using the vector pieces from codepoint_metadata.ah
+ */
+const codepoint_metadata_type* get_codepoint_metadata(rmm::cuda_stream_view stream)
+{
+  static cudf::strings::detail::thread_safe_per_context_cache<codepoint_metadata_type>
+    g_codepoint_metadata;
+
+  get_codepoint_metadata_init function = {stream};
+  return g_codepoint_metadata.find_or_initialize(function);
+}
+
+/**
+ * @brief Retrieve the aux code point data table.
+ *
+ * Build the aux code point data table in device memory
+ * using the vector pieces from codepoint_metadata.ah
+ */
+const aux_codepoint_data_type* get_aux_codepoint_data(rmm::cuda_stream_view stream)
+{
+  static cudf::strings::detail::thread_safe_per_context_cache<aux_codepoint_data_type>
+    g_aux_codepoint_data;
+  get_aux_codepoint_data_init function = {stream};
+  return g_aux_codepoint_data.find_or_initialize(function);
 }
 
 namespace {
diff --git a/cpp/src/text/subword/subword_tokenize.cu b/cpp/src/text/subword/subword_tokenize.cu
index 1639af0dbde..8c14f89d4d0 100644
--- a/cpp/src/text/subword/subword_tokenize.cu
+++ b/cpp/src/text/subword/subword_tokenize.cu
@@ -265,7 +265,7 @@ tokenizer_result subword_tokenize(cudf::strings_column_view const& strings,
                                   do_lower_case,
                                   do_truncate,
                                   max_rows_tensor,
-                                  0,
+                                  rmm::cuda_stream_default,
                                   mr);
 }
 
@@ -286,7 +286,7 @@ tokenizer_result subword_tokenize(cudf::strings_column_view const& strings,
                                   do_lower_case,
                                   do_truncate,
                                   max_rows_tensor,
-                                  0,
+                                  rmm::cuda_stream_default,
                                   mr);
 }
 
diff --git a/cpp/src/transform/jit/code/kernel.cpp b/cpp/src/transform/jit/code/kernel.cpp
deleted file mode 100644
index 58fdb945de3..00000000000
--- a/cpp/src/transform/jit/code/kernel.cpp
+++ /dev/null
@@ -1,59 +0,0 @@
-/*
- * Copyright (c) 2019, NVIDIA CORPORATION.
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-namespace cudf {
-namespace transformation {
-namespace jit {
-namespace code {
-const char* kernel_header =
-  R"***(
-    #pragma once
-
-    // Include Jitify's cstddef header first
-    #include <cstddef>
-
-    #include <cuda/std/climits>
-    #include <cuda/std/cstddef>
-    #include <cuda/std/limits>
-
-    #include <cudf/types.hpp>
-    #include <cudf/wrappers/timestamps.hpp>
-  )***";
-
-const char* kernel =
-  R"***(
-    template <typename TypeOut, typename TypeIn>
-    __global__
-    void kernel(cudf::size_type size,
-                    TypeOut* out_data, TypeIn* in_data) {
-        int tid = threadIdx.x;
-        int blkid = blockIdx.x;
-        int blksz = blockDim.x;
-        int gridsz = gridDim.x;
-
-        int start = tid + blkid * blksz;
-        int step = blksz * gridsz;
-
-        for (cudf::size_type i=start; i<size; i+=step) {
-          GENERIC_UNARY_OP(&out_data[i], in_data[i]);  
-        }
-    }
-  )***";
-
-}  // namespace code
-}  // namespace jit
-}  // namespace transformation
-}  // namespace cudf
diff --git a/cpp/src/transform/jit/kernel.cu b/cpp/src/transform/jit/kernel.cu
new file mode 100644
index 00000000000..3360ac8cf77
--- /dev/null
+++ b/cpp/src/transform/jit/kernel.cu
@@ -0,0 +1,55 @@
+/*
+ * Copyright (c) 2019-2021, NVIDIA CORPORATION.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+// Include Jitify's cstddef header first
+#include <cstddef>
+
+#include <cuda/std/climits>
+#include <cuda/std/cstddef>
+#include <cuda/std/limits>
+#include <cuda/std/type_traits>
+
+#include <cudf/wrappers/durations.hpp>
+#include <cudf/wrappers/timestamps.hpp>
+
+#include <transform/jit/operation-udf.hpp>
+
+#include <cudf/types.hpp>
+#include <cudf/wrappers/timestamps.hpp>
+
+namespace cudf {
+namespace transformation {
+namespace jit {
+
+template <typename TypeOut, typename TypeIn>
+__global__ void kernel(cudf::size_type size, TypeOut* out_data, TypeIn* in_data)
+{
+  int tid    = threadIdx.x;
+  int blkid  = blockIdx.x;
+  int blksz  = blockDim.x;
+  int gridsz = gridDim.x;
+
+  int start = tid + blkid * blksz;
+  int step  = blksz * gridsz;
+
+  for (cudf::size_type i = start; i < size; i += step) {
+    GENERIC_UNARY_OP(&out_data[i], in_data[i]);
+  }
+}
+
+}  // namespace jit
+}  // namespace transformation
+}  // namespace cudf
diff --git a/cpp/src/transform/jit/operation-udf.hpp b/cpp/src/transform/jit/operation-udf.hpp
new file mode 100644
index 00000000000..eaab2111d98
--- /dev/null
+++ b/cpp/src/transform/jit/operation-udf.hpp
@@ -0,0 +1,20 @@
+/*
+ * Copyright (c) 2021, NVIDIA CORPORATION.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#pragma once
+
+// This file is a placeholder for user-defined functions; Jitify can override its contents at
+// runtime.
diff --git a/cpp/src/transform/transform.cpp b/cpp/src/transform/transform.cpp
index 6da0f78687b..40feab00b3c 100644
--- a/cpp/src/transform/transform.cpp
+++ b/cpp/src/transform/transform.cpp
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2019, NVIDIA CORPORATION.
+ * Copyright (c) 2019-2021, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -14,12 +14,11 @@
  * limitations under the License.
  */
 
-#include "jit/code/code.h"
+#include <jit_preprocessed_files/transform/jit/kernel.cu.jit.hpp>
 
-#include <jit/launcher.h>
-#include <jit/parser.h>
-#include <jit/type.h>
-#include <jit/common_headers.hpp>
+#include <jit/cache.hpp>
+#include <jit/parser.hpp>
+#include <jit/type.hpp>
 
 #include <cudf/column/column.hpp>
 #include <cudf/column/column_factories.hpp>
@@ -29,27 +28,12 @@
 #include <cudf/utilities/traits.hpp>
 #include <cudf/utilities/type_dispatcher.hpp>
 
-#include <jit/timestamps.hpp.jit>
-#include <jit/types.hpp.jit>
-
 #include <rmm/cuda_stream_view.hpp>
 
 namespace cudf {
 namespace transformation {
-//! Jit functions
 namespace jit {
 
-const std::vector<std::string> header_names{cudf_types_hpp, cudf_wrappers_timestamps_hpp};
-
-std::istream* headers_code(std::string filename, std::iostream& stream)
-{
-  auto it = cudf::jit::stringified_headers.find(filename);
-  if (it != cudf::jit::stringified_headers.end()) {
-    return cudf::jit::send_stringified_header(stream, it->second);
-  }
-  return nullptr;
-}
-
 void unary_operation(mutable_column_view output,
                      column_view input,
                      const std::string& udf,
@@ -57,28 +41,26 @@ void unary_operation(mutable_column_view output,
                      bool is_ptx,
                      rmm::cuda_stream_view stream)
 {
-  std::string hash = "prog_transform" + std::to_string(std::hash<std::string>{}(udf));
-
-  std::string cuda_source = code::kernel_header;
-  if (is_ptx) {
-    cuda_source += cudf::jit::parse_single_function_ptx(
-                     udf, "GENERIC_UNARY_OP", cudf::jit::get_type_name(output_type), {0}) +
-                   code::kernel;
-  } else {
-    cuda_source += cudf::jit::parse_single_function_cuda(udf, "GENERIC_UNARY_OP") + code::kernel;
-  }
-
-  // Launch the jitify kernel
-  cudf::jit::launcher(hash,
-                      cuda_source,
-                      header_names,
-                      cudf::jit::compiler_flags,
-                      headers_code,
-                      stream)
-    .set_kernel_inst("kernel",  // name of the kernel we are launching
-                     {cudf::jit::get_type_name(output.type()),  // list of template arguments
-                      cudf::jit::get_type_name(input.type())})
-    .launch(output.size(), cudf::jit::get_data_ptr(output), cudf::jit::get_data_ptr(input));
+  std::string kernel_name =
+    jitify2::reflection::Template("cudf::transformation::jit::kernel")  //
+      .instantiate(cudf::jit::get_type_name(output.type()),  // list of template arguments
+                   cudf::jit::get_type_name(input.type()));
+
+  std::string cuda_source =
+    is_ptx ? cudf::jit::parse_single_function_ptx(udf,  //
+                                                  "GENERIC_UNARY_OP",
+                                                  cudf::jit::get_type_name(output_type),
+                                                  {0})
+           : cudf::jit::parse_single_function_cuda(udf,  //
+                                                   "GENERIC_UNARY_OP");
+
+  cudf::jit::get_program_cache(*transform_jit_kernel_cu_jit)
+    .get_kernel(
+      kernel_name, {}, {{"transform/jit/operation-udf.hpp", cuda_source}}, {"-arch=sm_."})  //
+    ->configure_1d_max_occupancy(0, 0, 0, stream.value())                                   //
+    ->launch(output.size(),                                                                 //
+             cudf::jit::get_data_ptr(output),
+             cudf::jit::get_data_ptr(input));
 }
 
 }  // namespace jit
diff --git a/cpp/tests/CMakeLists.txt b/cpp/tests/CMakeLists.txt
index 082f039054e..342ec9145fd 100644
--- a/cpp/tests/CMakeLists.txt
+++ b/cpp/tests/CMakeLists.txt
@@ -154,7 +154,8 @@ ConfigureTest(BINARY_TEST
     binaryop/binop-verify-input-test.cpp
     binaryop/binop-null-test.cpp
     binaryop/binop-integration-test.cpp
-    binaryop/binop-generic-ptx-test.cpp)
+    binaryop/binop-generic-ptx-test.cpp
+    )
 
 ###################################################################################################
 # - unary transform tests -------------------------------------------------------------------------
@@ -172,16 +173,6 @@ ConfigureTest(INTEROP_TEST
     interop/from_arrow_test.cpp
     interop/dlpack_test.cpp)
 
-###################################################################################################
-# - jit cache tests -------------------------------------------------------------------------------
-ConfigureTest(JITCACHE_TEST
-    "${CUDF_SOURCE_DIR}/src/jit/cache.cpp"
-    jit/jit-cache-test.cpp)
-
-ConfigureTest(JITCACHE_MULTIPROC_TEST
-    "${CUDF_SOURCE_DIR}/src/jit/cache.cpp"
-    jit/jit-cache-multiprocess-test.cpp)
-
 ###################################################################################################
 # - io tests --------------------------------------------------------------------------------------
 ConfigureTest(DECOMPRESSION_TEST io/comp/decomp_test.cu)
@@ -201,24 +192,25 @@ ConfigureTest(SORT_TEST
 ###################################################################################################
 # - copying tests ---------------------------------------------------------------------------------
 ConfigureTest(COPYING_TEST
-    copying/utility_tests.cpp
+    copying/concatenate_tests.cu
+    copying/copy_range_tests.cpp
+    copying/copy_tests.cu
+    copying/detail_gather_tests.cu
+    copying/gather_struct_tests.cu
     copying/gather_tests.cu
     copying/gather_str_tests.cu
     copying/gather_list_tests.cu
-    copying/segmented_gather_list_tests.cpp
-    copying/gather_struct_tests.cu
-    copying/detail_gather_tests.cu
+    copying/get_value_tests.cpp
+    copying/pack_tests.cu
+    copying/sample_tests.cpp
     copying/scatter_tests.cpp
     copying/scatter_list_tests.cu
-    copying/copy_range_tests.cpp
+    copying/scatter_struct_tests.cu
+    copying/segmented_gather_list_tests.cpp
+    copying/shift_tests.cpp
     copying/slice_tests.cpp
     copying/split_tests.cpp
-    copying/copy_tests.cu
-    copying/shift_tests.cpp
-    copying/get_value_tests.cpp
-    copying/sample_tests.cpp
-    copying/concatenate_tests.cu
-    copying/pack_tests.cu)
+    copying/utility_tests.cpp)
 
 ###################################################################################################
 # - utilities tests -------------------------------------------------------------------------------
@@ -276,7 +268,8 @@ ConfigureTest(ROLLING_TEST
     rolling/rolling_test.cpp 
     rolling/grouped_rolling_test.cpp
     rolling/lead_lag_test.cpp
-    rolling/collect_list_test.cpp)
+    rolling/collect_list_test.cpp
+    )
 
 ###################################################################################################
 # - filling test ----------------------------------------------------------------------------------
@@ -334,6 +327,7 @@ ConfigureTest(STRINGS_TEST
     strings/hash_string.cu
     strings/integers_tests.cu
     strings/ipv4_tests.cpp
+    strings/json_tests.cpp
     strings/pad_tests.cpp
     strings/replace_regex_tests.cpp
     strings/replace_tests.cpp
diff --git a/cpp/tests/copying/concatenate_tests.cu b/cpp/tests/copying/concatenate_tests.cu
index cea53326895..8c4259fb18b 100644
--- a/cpp/tests/copying/concatenate_tests.cu
+++ b/cpp/tests/copying/concatenate_tests.cu
@@ -703,7 +703,7 @@ TEST_F(ListsColumnTest, ConcatenateEmptyLists)
   }
 
   {
-    cudf::test::lists_column_wrapper<int> a{LCW{}};
+    cudf::test::lists_column_wrapper<int> a{{LCW{}}};
     cudf::test::lists_column_wrapper<int> b{4, 5, 6, 7};
     cudf::test::lists_column_wrapper<int> expected{LCW{}, {4, 5, 6, 7}};
 
@@ -713,7 +713,7 @@ TEST_F(ListsColumnTest, ConcatenateEmptyLists)
   }
 
   {
-    cudf::test::lists_column_wrapper<int> a{LCW{}}, b{LCW{}}, c{LCW{}};
+    cudf::test::lists_column_wrapper<int> a{{LCW{}}}, b{{LCW{}}}, c{{LCW{}}};
     cudf::test::lists_column_wrapper<int> d{4, 5, 6, 7};
     cudf::test::lists_column_wrapper<int> expected{LCW{}, LCW{}, LCW{}, {4, 5, 6, 7}};
 
@@ -724,7 +724,7 @@ TEST_F(ListsColumnTest, ConcatenateEmptyLists)
 
   {
     cudf::test::lists_column_wrapper<int> a{1, 2};
-    cudf::test::lists_column_wrapper<int> b{LCW{}}, c{LCW{}};
+    cudf::test::lists_column_wrapper<int> b{{LCW{}}}, c{{LCW{}}};
     cudf::test::lists_column_wrapper<int> d{4, 5, 6, 7};
     cudf::test::lists_column_wrapper<int> expected{{1, 2}, LCW{}, LCW{}, {4, 5, 6, 7}};
 
diff --git a/cpp/tests/copying/gather_struct_tests.cu b/cpp/tests/copying/gather_struct_tests.cu
index a40e10d5e83..bcb4f83e7cb 100644
--- a/cpp/tests/copying/gather_struct_tests.cu
+++ b/cpp/tests/copying/gather_struct_tests.cu
@@ -189,7 +189,8 @@ TYPED_TEST(TypedStructGatherTest, TestGatherStructOfLists)
       cudf::detail::make_counting_transform_iterator(0, [](auto i) { return !(i % 3); })};
   };
 
-  auto lists_column = std::make_unique<cudf::column>(cudf::column(lists_column_exemplar(), 0));
+  auto lists_column =
+    std::make_unique<cudf::column>(cudf::column(lists_column_exemplar(), rmm::cuda_stream_default));
 
   // Assemble struct column.
   std::vector<std::unique_ptr<cudf::column>> vector_of_columns;
@@ -242,7 +243,8 @@ TYPED_TEST(TypedStructGatherTest, TestGatherStructOfListsOfLists)
       cudf::detail::make_counting_transform_iterator(0, [](auto i) { return !(i % 3); })};
   };
 
-  auto lists_column = std::make_unique<cudf::column>(cudf::column(lists_column_exemplar(), 0));
+  auto lists_column =
+    std::make_unique<cudf::column>(cudf::column(lists_column_exemplar(), rmm::cuda_stream_default));
 
   // Assemble struct column.
   std::vector<std::unique_ptr<cudf::column>> vector_of_columns;
diff --git a/cpp/tests/copying/scatter_struct_tests.cu b/cpp/tests/copying/scatter_struct_tests.cu
new file mode 100644
index 00000000000..a9bb1980d53
--- /dev/null
+++ b/cpp/tests/copying/scatter_struct_tests.cu
@@ -0,0 +1,293 @@
+/*
+ * Copyright (c) 2020-2021, NVIDIA CORPORATION.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <cudf_test/base_fixture.hpp>
+#include <cudf_test/column_utilities.hpp>
+#include <cudf_test/column_wrapper.hpp>
+#include <cudf_test/type_lists.hpp>
+
+#include <cudf/copying.hpp>
+#include <cudf/detail/iterator.cuh>
+#include <cudf/lists/lists_column_view.hpp>
+#include <cudf/table/table_view.hpp>
+#include <cudf/utilities/error.hpp>
+
+#include <memory>
+
+using bools_col   = cudf::test::fixed_width_column_wrapper<bool>;
+using int32s_col  = cudf::test::fixed_width_column_wrapper<int32_t>;
+using structs_col = cudf::test::structs_column_wrapper;
+using strings_col = cudf::test::strings_column_wrapper;
+
+constexpr int32_t null{0};  // Placeholder value marking null child elements
+constexpr int32_t XXX{0};   // Placeholder value marking null struct elements
+
+template <typename T>
+struct TypedStructScatterTest : public cudf::test::BaseFixture {
+};
+
+using TestTypes = cudf::test::Concat<cudf::test::IntegralTypes,
+                                     cudf::test::FloatingPointTypes,
+                                     cudf::test::DurationTypes,
+                                     cudf::test::TimestampTypes>;
+
+TYPED_TEST_CASE(TypedStructScatterTest, TestTypes);
+
+namespace {
+void test_scatter(std::unique_ptr<cudf::column> const& structs_src,
+                  std::unique_ptr<cudf::column> const& structs_tgt,
+                  std::unique_ptr<cudf::column> const& structs_expected,
+                  std::unique_ptr<cudf::column> const& scatter_map)
+{
+  auto const source = cudf::table_view{std::vector<cudf::column_view>{structs_src->view()}};
+  auto const target = cudf::table_view{std::vector<cudf::column_view>{structs_tgt->view()}};
+  auto const result = cudf::scatter(source, scatter_map->view(), target);
+  CUDF_TEST_EXPECT_COLUMNS_EQUAL(structs_expected->view(), result->get_column(0));
+}
+}  // namespace
+
+// Test case when all input columns are empty
+TYPED_TEST(TypedStructScatterTest, EmptyInputTest)
+{
+  using col_wrapper = cudf::test::fixed_width_column_wrapper<TypeParam, int32_t>;
+
+  auto child_col_src     = col_wrapper{};
+  auto const structs_src = structs_col{{child_col_src}, std::vector<bool>{}}.release();
+
+  auto child_col_tgt     = col_wrapper{};
+  auto const structs_tgt = structs_col{{child_col_tgt}, std::vector<bool>{}}.release();
+
+  auto const scatter_map = int32s_col{}.release();
+  test_scatter(structs_src, structs_tgt, structs_src, scatter_map);
+  test_scatter(structs_src, structs_tgt, structs_tgt, scatter_map);
+}
+
+// Test case when only the scatter map is empty
+TYPED_TEST(TypedStructScatterTest, EmptyScatterMapTest)
+{
+  using col_wrapper = cudf::test::fixed_width_column_wrapper<TypeParam, int32_t>;
+
+  auto child_col_src =
+    col_wrapper{{0, 1, 2, 3, null, XXX},
+                cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i != 4; })};
+  auto const structs_src = structs_col{
+    {child_col_src}, cudf::detail::make_counting_transform_iterator(0, [](auto i) {
+      return i != 5;
+    })}.release();
+
+  auto child_col_tgt =
+    col_wrapper{{50, null, 70, XXX, 90, 100},
+                cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i != 1; })};
+  auto const structs_tgt = structs_col{
+    {child_col_tgt}, cudf::detail::make_counting_transform_iterator(0, [](auto i) {
+      return i != 3;
+    })}.release();
+
+  auto const scatter_map = int32s_col{}.release();
+  test_scatter(structs_src, structs_tgt, structs_tgt, scatter_map);
+}
+
+TYPED_TEST(TypedStructScatterTest, ScatterAsCopyTest)
+{
+  using col_wrapper = cudf::test::fixed_width_column_wrapper<TypeParam, int32_t>;
+
+  auto child_col_src =
+    col_wrapper{{0, 1, 2, 3, null, XXX},
+                cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i != 4; })};
+  auto const structs_src = structs_col{
+    {child_col_src}, cudf::detail::make_counting_transform_iterator(0, [](auto i) {
+      return i != 5;
+    })}.release();
+
+  auto child_col_tgt =
+    col_wrapper{{50, null, 70, XXX, 90, 100},
+                cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i != 1; })};
+  auto const structs_tgt = structs_col{
+    {child_col_tgt}, cudf::detail::make_counting_transform_iterator(0, [](auto i) {
+      return i != 3;
+    })}.release();
+
+  // Scatter as copy: every target row is overwritten, so the result equals the source
+  auto const scatter_map = int32s_col{0, 1, 2, 3, 4, 5}.release();
+  test_scatter(structs_src, structs_tgt, structs_src, scatter_map);
+}
+
+TYPED_TEST(TypedStructScatterTest, ScatterAsLeftShiftTest)
+{
+  using col_wrapper = cudf::test::fixed_width_column_wrapper<TypeParam, int32_t>;
+
+  auto child_col_src =
+    col_wrapper{{0, 1, 2, 3, null, XXX},
+                cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i != 4; })};
+  auto const structs_src = structs_col{
+    {child_col_src}, cudf::detail::make_counting_transform_iterator(0, [](auto i) {
+      return i != 5;
+    })}.release();
+
+  auto child_col_tgt =
+    col_wrapper{{50, null, 70, XXX, 90, 100},
+                cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i != 1; })};
+  auto const structs_tgt = structs_col{
+    {child_col_tgt}, cudf::detail::make_counting_transform_iterator(0, [](auto i) {
+      return i != 3;
+    })}.release();
+
+  auto child_col_expected =
+    col_wrapper{{2, 3, null, XXX, 0, 1},
+                cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i != 2; })};
+  auto structs_expected = structs_col{
+    {child_col_expected}, cudf::detail::make_counting_transform_iterator(0, [](auto i) {
+      return i != 3;
+    })}.release();
+
+  auto const scatter_map = int32s_col{-2, -1, 0, 1, 2, 3}.release();
+  test_scatter(structs_src, structs_tgt, structs_expected, scatter_map);
+}
+
+TYPED_TEST(TypedStructScatterTest, SimpleScatterTests)
+{
+  using col_wrapper = cudf::test::fixed_width_column_wrapper<TypeParam, int32_t>;
+
+  // Source data
+  auto child_col_src =
+    col_wrapper{{0, 1, 2, 3, null, XXX},
+                cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i != 4; })};
+  auto const structs_src = structs_col{
+    {child_col_src}, cudf::detail::make_counting_transform_iterator(0, [](auto i) {
+      return i != 5;
+    })}.release();
+
+  // Target data
+  auto child_col_tgt =
+    col_wrapper{{50, null, 70, XXX, 90, 100},
+                cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i != 1; })};
+  auto const structs_tgt = structs_col{
+    {child_col_tgt}, cudf::detail::make_counting_transform_iterator(0, [](auto i) {
+      return i != 3;
+    })}.release();
+
+  // Expected data
+  auto child_col_expected1 =
+    col_wrapper{{1, null, 70, XXX, 0, 2},
+                cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i != 1; })};
+  auto const structs_expected1 = structs_col{
+    {child_col_expected1}, cudf::detail::make_counting_transform_iterator(0, [](auto i) {
+      return i != 3;
+    })}.release();
+  auto const scatter_map1 = int32s_col{-2, 0, 5}.release();
+  test_scatter(structs_src, structs_tgt, structs_expected1, scatter_map1);
+
+  // Expected data
+  auto child_col_expected2 =
+    col_wrapper{{1, null, 70, 3, 0, 2},
+                cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i != 1; })};
+  auto const structs_expected2 = structs_col{
+    {child_col_expected2}, cudf::detail::make_counting_transform_iterator(0, [](auto i) {
+      return true;
+    })}.release();
+  auto const scatter_map2 = int32s_col{-2, 0, 5, 3}.release();
+  test_scatter(structs_src, structs_tgt, structs_expected2, scatter_map2);
+}
+
+TYPED_TEST(TypedStructScatterTest, ComplexDataScatterTest)
+{
+  // Testing scatter() on struct<string, numeric, bool>.
+  using col_wrapper = cudf::test::fixed_width_column_wrapper<TypeParam, int32_t>;
+
+  // Source data
+  auto names_column_src =
+    strings_col{{"Newton", "Washington", "Cherry", "Kiwi", "Lemon", "Tomato"},
+                cudf::detail::make_counting_transform_iterator(0, [](auto) { return true; })};
+  auto ages_column_src =
+    col_wrapper{{5, 10, 15, 20, 25, 30},
+                cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i != 4; })};
+  auto is_human_col_src =
+    bools_col{{true, true, false, false, false, false},
+              cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i != 3; })};
+
+  // Target data
+  auto names_column_tgt =
+    strings_col{{"String 0", "String 1", "String 2", "String 3", "String 4", "String 5"},
+                cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i != 0; })};
+  auto ages_column_tgt =
+    col_wrapper{{50, 60, 70, 80, 90, 100},
+                cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i != 1; })};
+  auto is_human_col_tgt =
+    bools_col{{true, true, true, true, true, true},
+              cudf::detail::make_counting_transform_iterator(0, [](auto) { return true; })};
+
+  // Expected data
+  auto names_column_expected =
+    strings_col{{"String 0", "Lemon", "Kiwi", "Cherry", "Washington", "Newton"},
+                cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i != 0; })};
+  auto ages_column_expected =
+    col_wrapper{{50, 25, 20, 15, 10, 5},
+                cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i != 1; })};
+  auto is_human_col_expected =
+    bools_col{{true, false, false, false, true, true},
+              cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i != 2; })};
+
+  auto const structs_src = structs_col{
+    {names_column_src, ages_column_src, is_human_col_src},
+    cudf::detail::make_counting_transform_iterator(0, [](auto i) {
+      return i != 5;
+    })}.release();
+  auto const structs_tgt = structs_col{
+    {names_column_tgt, ages_column_tgt, is_human_col_tgt},
+    cudf::detail::make_counting_transform_iterator(0, [](auto i) {
+      return i != 2;
+    })}.release();
+  auto const structs_expected = structs_col{
+    {names_column_expected, ages_column_expected, is_human_col_expected},
+    cudf::detail::make_counting_transform_iterator(0, [](auto i) {
+      return true;
+    })}.release();
+
+  // The first element of the target is not overwritten
+  auto const scatter_map = int32s_col{-1, 4, 3, 2, 1}.release();
+  test_scatter(structs_src, structs_tgt, structs_expected, scatter_map);
+}
+
+TYPED_TEST(TypedStructScatterTest, ScatterStructOfListsTest)
+{
+  // Testing scatter() on struct<list<numeric>>
+  using lists_col = cudf::test::lists_column_wrapper<TypeParam, int32_t>;
+
+  // Source data
+  auto lists_col_src =
+    lists_col{{{5}, {10, 15}, {20, 25, 30}, {35, 40, 45, 50}, {55, 60, 65}, {70, 75}, {80}, {}, {}},
+              // Valid for elements 0, 3, 6,...
+              cudf::detail::make_counting_transform_iterator(0, [](auto i) { return !(i % 3); })};
+  auto const structs_src = structs_col{{lists_col_src}}.release();
+
+  // Target data
+  auto lists_col_tgt =
+    lists_col{{{1}, {2, 3}, {4, 5, 6}, {7, 8}, {9}, {10, 11, 12, 13}, {}, {14}, {15, 16}},
+              // Valid for elements 1, 3, 5, 7,...
+              cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i % 2; })};
+  auto const structs_tgt = structs_col{{lists_col_tgt}}.release();
+
+  // Expected data
+  auto const validity_expected = std::vector<bool>{0, 1, 1, 0, 0, 1, 1, 0, 0};
+  auto lists_col_expected      = lists_col{
+    {{1}, {2, 3}, {80}, {70, 75}, {55, 60, 65}, {35, 40, 45, 50}, {5}, {10, 15}, {20, 25, 30}},
+    validity_expected.begin()};
+  auto const structs_expected = structs_col{{lists_col_expected}}.release();
+
+  // The first 2 elements of the target are not overwritten
+  auto const scatter_map = int32s_col{-3, -2, -1, 5, 4, 3, 2}.release();
+  test_scatter(structs_src, structs_tgt, structs_expected, scatter_map);
+}
diff --git a/cpp/tests/hashing/hash_test.cpp b/cpp/tests/hashing/hash_test.cpp
index 5641d445ff3..d928a17b3d1 100644
--- a/cpp/tests/hashing/hash_test.cpp
+++ b/cpp/tests/hashing/hash_test.cpp
@@ -257,20 +257,35 @@ TEST_F(SerialMurmurHash3Test, MultiValueWithSeeds)
   fixed_width_column_wrapper<bool> const bools_col1({0, 1, 1, 1, 0});
   fixed_width_column_wrapper<bool> const bools_col2({0, 1, 2, 255, 0});
 
-  auto const input1 = cudf::table_view({strings_col});
-  auto const input2 = cudf::table_view({ints_col});
-  auto const input3 = cudf::table_view({strings_col, ints_col, bools_col1});
-  auto const input4 = cudf::table_view({strings_col, ints_col, bools_col2});
-
-  auto const hashed_output1 = cudf::hash(input1, cudf::hash_id::HASH_SERIAL_MURMUR3, {}, 314);
-  auto const hashed_output2 = cudf::hash(input2, cudf::hash_id::HASH_SERIAL_MURMUR3, {}, 42);
-  auto const hashed_output3 = cudf::hash(input3, cudf::hash_id::HASH_SERIAL_MURMUR3, {});
-  auto const hashed_output4 = cudf::hash(input4, cudf::hash_id::HASH_SERIAL_MURMUR3, {});
-
-  CUDF_TEST_EXPECT_COLUMNS_EQUAL(hashed_output1->view(), strings_col_result, true);
-  CUDF_TEST_EXPECT_COLUMNS_EQUAL(hashed_output2->view(), ints_col_result, true);
-  EXPECT_EQ(input3.num_rows(), hashed_output3->size());
-  CUDF_TEST_EXPECT_COLUMNS_EQUAL(hashed_output3->view(), hashed_output4->view(), true);
+  std::vector<std::unique_ptr<cudf::column>> struct_field_cols;
+  struct_field_cols.emplace_back(std::make_unique<cudf::column>(strings_col));
+  struct_field_cols.emplace_back(std::make_unique<cudf::column>(ints_col));
+  struct_field_cols.emplace_back(std::make_unique<cudf::column>(bools_col1));
+  structs_column_wrapper structs_col(std::move(struct_field_cols));
+
+  auto const combo1 = cudf::table_view({strings_col, ints_col, bools_col1});
+  auto const combo2 = cudf::table_view({strings_col, ints_col, bools_col2});
+
+  constexpr auto hasher   = cudf::hash_id::HASH_SERIAL_MURMUR3;
+  auto const strings_hash = cudf::hash(cudf::table_view({strings_col}), hasher, {}, 314);
+  auto const ints_hash    = cudf::hash(cudf::table_view({ints_col}), hasher, {}, 42);
+  auto const combo1_hash  = cudf::hash(combo1, hasher, {});
+  auto const combo2_hash  = cudf::hash(combo2, hasher, {});
+  auto const structs_hash = cudf::hash(cudf::table_view({structs_col}), hasher, {});
+
+  CUDF_TEST_EXPECT_COLUMNS_EQUAL(*strings_hash, strings_col_result, true);
+  CUDF_TEST_EXPECT_COLUMNS_EQUAL(*ints_hash, ints_col_result, true);
+  EXPECT_EQ(combo1.num_rows(), combo1_hash->size());
+  CUDF_TEST_EXPECT_COLUMNS_EQUAL(*combo1_hash, *combo2_hash, true);
+  CUDF_TEST_EXPECT_COLUMNS_EQUAL(*structs_hash, *combo1_hash, true);
+}
+
+TEST_F(SerialMurmurHash3Test, ListThrows)
+{
+  lists_column_wrapper<cudf::string_view> strings_list_col({{""}, {"abc"}, {"123"}});
+  EXPECT_THROW(
+    cudf::hash(cudf::table_view({strings_list_col}), cudf::hash_id::HASH_SERIAL_MURMUR3, {}),
+    cudf::logic_error);
 }
 
 class SparkMurmurHash3Test : public cudf::test::BaseFixture {
@@ -280,31 +295,38 @@ TEST_F(SparkMurmurHash3Test, MultiValueWithSeeds)
 {
   // The hash values were determined by running the following Scala code in Apache Spark:
   // import org.apache.spark.sql.catalyst.util.DateTimeUtils
-  // val schema = new StructType().add("strings",StringType).add("doubles",DoubleType)
-  //   .add("timestamps",TimestampType).add("decimal64", DecimalType(18,7)).add("longs",LongType)
-  //   .add("floats",FloatType).add("dates",DateType).add("decimal32", DecimalType(9,3))
-  //   .add("ints",IntegerType).add("shorts",ShortType).add("bytes",ByteType)
-  //   .add("bools",BooleanType)
+  // val schema = new StructType().add("structs", new StructType().add("a",IntegerType)
+  //     .add("b",StringType).add("c",new StructType().add("x",FloatType).add("y",LongType)))
+  //   .add("strings",StringType).add("doubles",DoubleType).add("timestamps",TimestampType)
+  //   .add("decimal64", DecimalType(18,7)).add("longs",LongType).add("floats",FloatType)
+  //   .add("dates",DateType).add("decimal32", DecimalType(9,3)).add("ints",IntegerType)
+  //   .add("shorts",ShortType).add("bytes",ByteType).add("bools",BooleanType)
   // val data = Seq(
-  //  Row("", 0.toDouble, DateTimeUtils.toJavaTimestamp(0), BigDecimal(0), 0.toLong, 0.toFloat,
-  //      DateTimeUtils.toJavaDate(0), BigDecimal(0), 0, 0.toShort, 0.toByte, false),
-  //  Row("The quick brown fox", -(0.toDouble), DateTimeUtils.toJavaTimestamp(100),
-  //      BigDecimal("0.00001"), 100.toLong, -(0.toFloat), DateTimeUtils.toJavaDate(100),
-  //      BigDecimal("0.1"), 100, 100.toShort, 100.toByte, true),
-  //  Row("jumps over the lazy dog.", -Double.NaN, DateTimeUtils.toJavaTimestamp(-100),
-  //      BigDecimal("-0.00001"), -100.toLong, -Float.NaN, DateTimeUtils.toJavaDate(-100),
-  //      BigDecimal("-0.1"), -100, -100.toShort, -100.toByte, true),
-  //  Row("All work and no play makes Jack a dull boy", Double.MinValue,
-  //      DateTimeUtils.toJavaTimestamp(Long.MinValue/1000000), BigDecimal("-99999999999.9999999"),
-  //      Long.MinValue, Float.MinValue, DateTimeUtils.toJavaDate(Int.MinValue/100),
-  //      BigDecimal("-999999.999"), Int.MinValue, Short.MinValue, Byte.MinValue, true),
-  //  Row("!\"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\ud720\ud721", Double.MaxValue,
-  //      DateTimeUtils.toJavaTimestamp(Long.MaxValue/1000000), BigDecimal("99999999999.9999999"),
-  //      Long.MaxValue, Float.MaxValue, DateTimeUtils.toJavaDate(Int.MaxValue/100),
-  //      BigDecimal("999999.999"), Int.MaxValue, Short.MaxValue, Byte.MaxValue, false))
+  // Row(Row(0, "a", Row(0f, 0L)), "", 0.toDouble, DateTimeUtils.toJavaTimestamp(0), BigDecimal(0),
+  //     0.toLong, 0.toFloat, DateTimeUtils.toJavaDate(0), BigDecimal(0), 0, 0.toShort, 0.toByte,
+  //     false),
+  // Row(Row(100, "bc", Row(100f, 100L)), "The quick brown fox", -(0.toDouble),
+  //     DateTimeUtils.toJavaTimestamp(100), BigDecimal("0.00001"), 100.toLong, -(0.toFloat),
+  //     DateTimeUtils.toJavaDate(100), BigDecimal("0.1"), 100, 100.toShort, 100.toByte, true),
+  // Row(Row(-100, "def", Row(-100f, -100L)), "jumps over the lazy dog.", -Double.NaN,
+  //     DateTimeUtils.toJavaTimestamp(-100), BigDecimal("-0.00001"), -100.toLong, -Float.NaN,
+  //     DateTimeUtils.toJavaDate(-100), BigDecimal("-0.1"), -100, -100.toShort, -100.toByte,
+  //     true),
+  // Row(Row(0x12345678, "ghij", Row(Float.PositiveInfinity, 0x123456789abcdefL)),
+  //     "All work and no play makes Jack a dull boy", Double.MinValue,
+  //     DateTimeUtils.toJavaTimestamp(Long.MinValue/1000000), BigDecimal("-99999999999.9999999"),
+  //     Long.MinValue, Float.MinValue, DateTimeUtils.toJavaDate(Int.MinValue/100),
+  //     BigDecimal("-999999.999"), Int.MinValue, Short.MinValue, Byte.MinValue, true),
+  // Row(Row(-0x76543210, "klmno", Row(Float.NegativeInfinity, -0x123456789abcdefL)),
+  //     "!\"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\ud720\ud721", Double.MaxValue,
+  //     DateTimeUtils.toJavaTimestamp(Long.MaxValue/1000000), BigDecimal("99999999999.9999999"),
+  //     Long.MaxValue, Float.MaxValue, DateTimeUtils.toJavaDate(Int.MaxValue/100),
+  //     BigDecimal("999999.999"), Int.MaxValue, Short.MaxValue, Byte.MaxValue, false))
   // val df = spark.createDataFrame(sc.parallelize(data), schema)
   // df.columns.foreach(c => println(s"$c => ${df.select(hash(col(c))).collect.mkString(",")}"))
   // df.select(hash(col("*"))).collect
+  fixed_width_column_wrapper<int32_t> const hash_structs_expected(
+    {-105406170, 90479889, -678041645, 1667387937, 301478567});
   fixed_width_column_wrapper<int32_t> const hash_strings_expected(
     {1467149710, 723257560, -1620282500, -2001858707, 1588473657});
   fixed_width_column_wrapper<int32_t> const hash_doubles_expected(
@@ -330,18 +352,26 @@ TEST_F(SparkMurmurHash3Test, MultiValueWithSeeds)
   fixed_width_column_wrapper<int32_t> const hash_bools_expected(
     {933211791, -559580957, -559580957, -559580957, 933211791});
   fixed_width_column_wrapper<int32_t> const hash_combined_expected(
-    {-1947042614, -1731440908, 807283935, 725489209, 822276819});
+    {-1172364561, -442972638, 1213234395, 796626751, 214075225});
+
+  using double_limits = std::numeric_limits<double>;
+  using long_limits   = std::numeric_limits<int64_t>;
+  using float_limits  = std::numeric_limits<float>;
+  using int_limits    = std::numeric_limits<int32_t>;
+  fixed_width_column_wrapper<int32_t> a_col{0, 100, -100, 0x12345678, -0x76543210};
+  strings_column_wrapper b_col{"a", "bc", "def", "ghij", "klmno"};
+  fixed_width_column_wrapper<float> x_col{
+    0.f, 100.f, -100.f, float_limits::infinity(), -float_limits::infinity()};
+  fixed_width_column_wrapper<int64_t> y_col{
+    0L, 100L, -100L, 0x123456789abcdefL, -0x123456789abcdefL};
+  structs_column_wrapper c_col{{x_col, y_col}};
+  structs_column_wrapper const structs_col{{a_col, b_col, c_col}};
 
   strings_column_wrapper const strings_col({"",
                                             "The quick brown fox",
                                             "jumps over the lazy dog.",
                                             "All work and no play makes Jack a dull boy",
                                             "!\"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\ud720\ud721"});
-
-  using double_limits = std::numeric_limits<double>;
-  using long_limits   = std::numeric_limits<int64_t>;
-  using float_limits  = std::numeric_limits<float>;
-  using int_limits    = std::numeric_limits<int32_t>;
   fixed_width_column_wrapper<double> const doubles_col(
     {0., -0., -double_limits::quiet_NaN(), double_limits::lowest(), double_limits::max()});
   fixed_width_column_wrapper<cudf::timestamp_ms, cudf::timestamp_ms::rep> const timestamps_col(
@@ -364,6 +394,7 @@ TEST_F(SparkMurmurHash3Test, MultiValueWithSeeds)
   fixed_width_column_wrapper<bool> const bools_col2({0, 1, 2, 255, 0});
 
   constexpr auto hasher      = cudf::hash_id::HASH_SPARK_MURMUR3;
+  auto const hash_structs    = cudf::hash(cudf::table_view({structs_col}), hasher, {}, 42);
   auto const hash_strings    = cudf::hash(cudf::table_view({strings_col}), hasher, {}, 314);
   auto const hash_doubles    = cudf::hash(cudf::table_view({doubles_col}), hasher, {}, 42);
   auto const hash_timestamps = cudf::hash(cudf::table_view({timestamps_col}), hasher, {}, 42);
@@ -378,6 +409,7 @@ TEST_F(SparkMurmurHash3Test, MultiValueWithSeeds)
   auto const hash_bools1     = cudf::hash(cudf::table_view({bools_col1}), hasher, {}, 42);
   auto const hash_bools2     = cudf::hash(cudf::table_view({bools_col2}), hasher, {}, 42);
 
+  CUDF_TEST_EXPECT_COLUMNS_EQUAL(*hash_structs, hash_structs_expected, true);
   CUDF_TEST_EXPECT_COLUMNS_EQUAL(*hash_strings, hash_strings_expected, true);
   CUDF_TEST_EXPECT_COLUMNS_EQUAL(*hash_doubles, hash_doubles_expected, true);
   CUDF_TEST_EXPECT_COLUMNS_EQUAL(*hash_timestamps, hash_timestamps_expected, true);
@@ -392,7 +424,8 @@ TEST_F(SparkMurmurHash3Test, MultiValueWithSeeds)
   CUDF_TEST_EXPECT_COLUMNS_EQUAL(*hash_bools1, hash_bools_expected, true);
   CUDF_TEST_EXPECT_COLUMNS_EQUAL(*hash_bools2, hash_bools_expected, true);
 
-  auto const combined_table = cudf::table_view({strings_col,
+  auto const combined_table = cudf::table_view({structs_col,
+                                                strings_col,
                                                 doubles_col,
                                                 timestamps_col,
                                                 decimal64_col,
@@ -408,6 +441,14 @@ TEST_F(SparkMurmurHash3Test, MultiValueWithSeeds)
   CUDF_TEST_EXPECT_COLUMNS_EQUAL(*hash_combined, hash_combined_expected, true);
 }
 
+TEST_F(SparkMurmurHash3Test, ListThrows)
+{
+  lists_column_wrapper<cudf::string_view> strings_list_col({{""}, {"abc"}, {"123"}});
+  EXPECT_THROW(
+    cudf::hash(cudf::table_view({strings_list_col}), cudf::hash_id::HASH_SPARK_MURMUR3, {}),
+    cudf::logic_error);
+}
+
 class MD5HashTest : public cudf::test::BaseFixture {
 };
 
diff --git a/cpp/tests/jit/jit-cache-multiprocess-test.cpp b/cpp/tests/jit/jit-cache-multiprocess-test.cpp
deleted file mode 100644
index 2f0b353673e..00000000000
--- a/cpp/tests/jit/jit-cache-multiprocess-test.cpp
+++ /dev/null
@@ -1,128 +0,0 @@
-/*
- * Copyright (c) 2019-2020, NVIDIA CORPORATION.
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-#include <sys/types.h>
-#include <unistd.h>
-#include "jit-cache-test.hpp"
-#include "rmm/mr/device/per_device_resource.hpp"
-
-#if defined(JITIFY_USE_CACHE)
-
-/**
- * @brief This test runs two processes that try to access the same kernel
- *
- * This is a stress test.
- *
- * A single test process is forked before invocation of CUDA and then both the
- * parent and child processes try to get and run a kernel. The child process
- * clears the cache before each iteration of the test so that the cache has to
- * be re-written by it. The parent process runs on a changing time offset so
- * that it sometimes gets the kernel from cache and sometimes it doesn't.
- *
- * The aim of this test is to check that the file cache doesn't get corrupted
- * when multiple processes are reading/writing to it at the same time. Since
- * the public API of JitCache doesn't return the serialized string of the
- * cached kernel, the way to test its validity is to run it on test data.
- */
-TEST_F(JitCacheMultiProcessTest, MultiProcessTest)
-{
-  int num_tests = 20;
-  // Cannot initialize scalars before forking
-  rmm::device_scalar<int> *input;
-  rmm::device_scalar<int> *output;
-  int expect = 64;
-
-  auto tester = [&](int pid, int test_no) {
-    // Brand new cache object that has nothing in in-memory cache
-    cudf::jit::cudfJitCache cache;
-
-    auto const in{4};
-    auto const out{1};
-    input->set_value(in);
-    output->set_value(out);
-
-    // make program
-    auto program = cache.getProgram("FileCacheTestProg3", program3_source);
-    // make kernel
-    auto kernel = cache.getKernelInstantiation("my_kernel", program, {"3", "int"});
-    (*std::get<1>(kernel)).configure(grid, block).launch(input->data(), output->data());
-    CUDA_TRY(cudaDeviceSynchronize());
-
-    ASSERT_TRUE(expect == output->value()) << "Expected val: " << expect << '\n'
-                                           << "  Actual val: " << output->value();
-  };
-
-  // This pipe is how the child process will send output to parent
-  int pipefd[2];
-  ASSERT_NE(pipe(pipefd), -1) << "Unable to create pipe";
-
-  pid_t cpid = fork();
-  ASSERT_TRUE(cpid >= 0) << "Fork failed";
-
-  if (cpid > 0) {      // Parent
-    close(pipefd[1]);  // Close write end of pipe. Parent doesn't write.
-    usleep(100000);
-  } else {                           // Child
-    close(pipefd[0]);                // Close read end of pipe. Child doesn't read.
-    dup2(pipefd[1], STDOUT_FILENO);  // redirect stdout to pipe
-  }
-
-  input  = new rmm::device_scalar<int>();
-  output = new rmm::device_scalar<int>();
-
-  for (int i = 0; i < num_tests; i++) {
-    if (cpid > 0)
-      usleep(10000);
-    else
-      purgeFileCache();
-
-    tester(cpid, i);
-  }
-
-  // Child ends here --------------------------------------------------------
-
-  if (cpid > 0) {
-    int status;
-    wait(&status);
-
-    std::cout << "Child output begin:" << std::endl;
-    char buf;
-    while (read(pipefd[0], &buf, 1) > 0) ASSERT_EQ(write(STDOUT_FILENO, &buf, 1), 1);
-    ASSERT_EQ(write(STDOUT_FILENO, "\n", 1), 1);
-    std::cout << "Child output end" << std::endl;
-
-    ASSERT_TRUE(WIFEXITED(status)) << "Child did not exit normally.";
-    ASSERT_EQ(WEXITSTATUS(status), 0) << "Error in child.";
-  }
-}
-#endif
-
-int main(int argc, char **argv)
-{
-  ::testing::InitGoogleTest(&argc, argv);
-
-  // This test relies on the fact that the cuda context will be created in
-  // each process separately after the fork. With the default CUDF_TEST_MAIN,
-  // using rmm_mode=pool will cause the cuda context to be created at startup,
-  // before the fork. So we hardcode the rmm_mode to "cuda" for this test
-  // and explicitly set the device 0 resource to it. Note that using
-  // `set_current_device_resource` would result in a call to `cudaGetDevice()`
-  // which would also initialize the CUDA context before the fork.
-  auto const rmm_mode = "cuda";
-  auto resource       = cudf::test::create_memory_resource(rmm_mode);
-  rmm::mr::set_per_device_resource(rmm::cuda_device_id{0}, resource.get());
-  return RUN_ALL_TESTS();
-}
diff --git a/cpp/tests/jit/jit-cache-test.cpp b/cpp/tests/jit/jit-cache-test.cpp
deleted file mode 100644
index 43cd5911ae7..00000000000
--- a/cpp/tests/jit/jit-cache-test.cpp
+++ /dev/null
@@ -1,125 +0,0 @@
-/*
- * Copyright (c) 2019, NVIDIA CORPORATION.
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-#include "jit-cache-test.hpp"
-
-namespace cudf {
-namespace test {
-TEST_F(JitCacheTest, CacheExceptionTest)
-{
-  EXPECT_NO_THROW(auto program = getProgram("MemoryCacheTestProg"));
-  EXPECT_ANY_THROW(auto program1 = getProgram("MemoryCacheTestProg1"));
-}
-
-// Test the in memory caching ability
-TEST_F(JitCacheTest, MemoryCacheKernelTest)
-{
-  // Check the kernel caching
-
-  // Single value column
-  // TODO (dm): should be a scalar tho
-  auto column = cudf::test::fixed_width_column_wrapper<int>{{5, 0}};
-  auto expect = cudf::test::fixed_width_column_wrapper<int>{{125, 0}};
-
-  // make new program and rename it to match old program
-  auto program = getProgram("MemoryCacheTestProg1", program2_source);
-  // TODO: when I convert this pair to a class, make an inherited test class that can edit names
-  std::get<0>(program) = "MemoryCacheTestProg";
-
-  // remove any file cache so below kernel should not be obtained from file
-  purgeFileCache();
-
-  // make kernel that if the cache tried to compile, will use a different
-  // program than intended and give wrong result.
-  auto kernel = getKernelInstantiation("my_kernel", program, {"3", "int"});
-
-  (*std::get<1>(kernel))
-    .configure(grid, block)
-    .launch(column.operator cudf::mutable_column_view().data<int>());
-
-  CUDF_TEST_EXPECT_COLUMNS_EQUAL(expect, column);
-}
-
-TEST_F(JitCacheTest, MemoryCacheProgramTest)
-{
-  // Check program source caching
-
-  // Single value column
-  // TODO (dm): should be a scalar tho
-  auto column = cudf::test::fixed_width_column_wrapper<int>{{5, 0}};
-  auto expect = cudf::test::fixed_width_column_wrapper<int>{{625, 0}};
-
-  // remove any file cache so below program should not be obtained from file
-  purgeFileCache();
-
-  auto program = getProgram("MemoryCacheTestProg");
-  // make kernel that HAS to be compiled
-  auto kernel = getKernelInstantiation("my_kernel", program, {"4", "int"});
-
-  (*std::get<1>(kernel))
-    .configure(grid, block)
-    .launch(column.operator cudf::mutable_column_view().data<int>());
-
-  CUDF_TEST_EXPECT_COLUMNS_EQUAL(expect, column);
-}
-
-// Test the file caching ability
-#if defined(JITIFY_USE_CACHE)
-TEST_F(JitCacheTest, FileCacheProgramTest)
-{
-  // Brand new cache object that has nothing in in-memory cache
-  cudf::jit::cudfJitCache cache;
-
-  // Single value column
-  auto column = cudf::test::fixed_width_column_wrapper<int>{{5, 0}};
-  auto expect = cudf::test::fixed_width_column_wrapper<int>{{625, 0}};
-
-  // make program
-  auto program = cache.getProgram("FileCacheTestProg", program_source);
-  // make kernel that HAS to be compiled
-  auto kernel = cache.getKernelInstantiation("my_kernel", program, {"4", "int"});
-  (*std::get<1>(kernel))
-    .configure(grid, block)
-    .launch(column.operator cudf::mutable_column_view().data<int>());
-
-  CUDF_TEST_EXPECT_COLUMNS_EQUAL(expect, column);
-}
-
-TEST_F(JitCacheTest, FileCacheKernelTest)
-{
-  // Brand new cache object that has nothing in in-memory cache
-  cudf::jit::cudfJitCache cache;
-
-  // Single value column
-  auto column = cudf::test::fixed_width_column_wrapper<int>{{5, 0}};
-  auto expect = cudf::test::fixed_width_column_wrapper<int>{{125, 0}};
-
-  // make program
-  auto program = cache.getProgram("FileCacheTestProg", program_source);
-  // make kernel that should NOT need to be compiled
-  auto kernel = cache.getKernelInstantiation("my_kernel", program, {"3", "int"});
-  (*std::get<1>(kernel))
-    .configure(grid, block)
-    .launch(column.operator cudf::mutable_column_view().data<int>());
-
-  CUDF_TEST_EXPECT_COLUMNS_EQUAL(expect, column);
-}
-#endif
-
-}  // namespace test
-}  // namespace cudf
-
-CUDF_TEST_PROGRAM_MAIN()
diff --git a/cpp/tests/jit/jit-cache-test.hpp b/cpp/tests/jit/jit-cache-test.hpp
deleted file mode 100644
index 261cc0fd3b4..00000000000
--- a/cpp/tests/jit/jit-cache-test.hpp
+++ /dev/null
@@ -1,132 +0,0 @@
-/*
- * Copyright (c) 2019-2020, NVIDIA CORPORATION.
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-#pragma once
-
-#include <boost/filesystem.hpp>
-
-#include <cudf_test/base_fixture.hpp>
-#include <cudf_test/column_utilities.hpp>
-#include <cudf_test/column_wrapper.hpp>
-
-#include <jit/cache.h>
-
-// Note that this test does not inherit from cudf::test::BaseFixture because
-// doing so would cause the CUDA context to be created before the fork in
-// the JitCacheMultiProcessTest, where we need it to be created after the fork
-// to ensure the forked child has a context. These tests do not need the
-// memory_resource member of BaseFixture.
-struct JitCacheTest : public ::testing::Test, public cudf::jit::cudfJitCache {
-  JitCacheTest() : grid(1), block(1) {}
-
-  virtual ~JitCacheTest() {}
-
-  virtual void SetUp()
-  {
-    purgeFileCache();
-    warmUp();
-  }
-
-  virtual void TearDown() { purgeFileCache(); }
-
-  void purgeFileCache()
-  {
-#if defined(JITIFY_USE_CACHE)
-    // In the multi-process test there are two processes repeatedly creating and deleting the cache.
-    // While deleting the cache, we cannot use `filesystem::remove_all(cudf::jit::getCacheDir())`
-    // because it would recursively remove all files within the cache directory and then finally
-    // remove the directory itself. A non-empty directory cannot be removed and throws an exception.
-    // On slower disks, there would be times when one process would be deleting the cache and the
-    // other would be creating it. So while the process that’s trying to delete is done deleting the
-    // contents of the directory, and is about to delete the directory itself, the other process
-    // would go ahead and create a cache file in that directory. Thus causing an exception to be
-    // thrown on the process trying to delete the now non-empty directory.
-
-    // By recursing the cache directory and only deleting cache files, we leave the directory alone.
-    // That way the aforementioned scenario doesn’t occur
-    std::vector<boost::filesystem::path> file_paths;
-    for (auto& path : boost::filesystem::recursive_directory_iterator(cudf::jit::getCacheDir())) {
-      if (boost::filesystem::is_regular_file(path)) { file_paths.push_back(path); }
-    }
-    for (auto& file_path : file_paths) { boost::filesystem::remove(file_path); }
-#endif
-  }
-
-  void warmUp()
-  {
-    // Prime up the cache so that the in-memory and file cache is populated
-
-    // Single value column
-    auto column = cudf::test::fixed_width_column_wrapper<int>({4, 0});
-    auto expect = cudf::test::fixed_width_column_wrapper<int>({64, 0});
-
-    // make program
-    auto program = getProgram("MemoryCacheTestProg", program_source);
-    // make kernel
-    auto kernel = getKernelInstantiation("my_kernel", program, {"3", "int"});
-    (*std::get<1>(kernel))
-      .configure(grid, block)
-      .launch(column.operator cudf::mutable_column_view().data<int>());
-
-    CUDF_TEST_EXPECT_COLUMNS_EQUAL(expect, column);
-  }
-
-  const char* program_source =
-    "my_program\n"
-    "template<int N, typename T>\n"
-    "__global__\n"
-    "void my_kernel(T* data) {\n"
-    "    T data0 = data[0];\n"
-    "    for( int i=0; i<N-1; ++i ) {\n"
-    "        data[0] *= data0;\n"
-    "    }\n"
-    "}\n";
-
-  const char* program2_source =
-    "my_program\n"
-    "template<int N, typename T>\n"
-    "__global__\n"
-    "void my_kernel(T* data) {\n"
-    "    T data0 = data[0];\n"
-    "    for( int i=0; i<N-1; ++i ) {\n"
-    "        data[0] += data0;\n"
-    "    }\n"
-    "}\n";
-
-  const char* program3_source =
-    "my_program\n"
-    "template<int N, typename T>\n"
-    "__global__\n"
-    "void my_kernel(T* data, T* out) {\n"
-    "    T data0 = data[0];\n"
-    "    for( int i=0; i<N; ++i ) {\n"
-    "        out[0] *= data0;\n"
-    "    }\n"
-    "}\n";
-
-  dim3 grid;
-  dim3 block;
-};
-
-/**
- * @brief Similar to JitCacheTest but it doesn't run warmUp() test in SetUp and
- * purgeFileCache() in SetUp and TearDown
- */
-struct JitCacheMultiProcessTest : public JitCacheTest {
-  virtual void SetUp() {}
-
-  virtual void TearDown() {}
-};
diff --git a/cpp/tests/join/join_tests.cpp b/cpp/tests/join/join_tests.cpp
index 32192234c56..365653d701f 100644
--- a/cpp/tests/join/join_tests.cpp
+++ b/cpp/tests/join/join_tests.cpp
@@ -410,6 +410,97 @@ TEST_F(JoinTest, LeftJoinWithNulls)
   CUDF_TEST_EXPECT_TABLES_EQUIVALENT(*sorted_gold, *sorted_result);
 }
 
+TEST_F(JoinTest, LeftJoinWithStructsAndNulls)
+{
+  column_wrapper<int32_t> col0_0{{3, 1, 2, 0, 2}};
+  strcol_wrapper col0_1({"s1", "s1", "", "s4", "s0"}, {1, 1, 0, 1, 1});
+  column_wrapper<int32_t> col0_2{{0, 1, 2, 4, 1}};
+  auto col0_names_col = strcol_wrapper{
+    "Samuel Vimes", "Carrot Ironfoundersson", "Detritus", "Samuel Vimes", "Angua von Überwald"};
+  auto col0_ages_col = column_wrapper<int32_t>{{48, 27, 351, 31, 25}};
+
+  auto col0_is_human_col = column_wrapper<bool>{{true, true, false, false, false}, {1, 1, 0, 1, 0}};
+
+  auto col0_3 =
+    cudf::test::structs_column_wrapper{{col0_names_col, col0_ages_col, col0_is_human_col}};
+
+  column_wrapper<int32_t> col1_0{{2, 2, 0, 4, 3}};
+  strcol_wrapper col1_1({"s1", "s0", "s1", "s2", "s1"});
+  column_wrapper<int32_t> col1_2{{1, 0, 1, 2, 1}, {1, 0, 1, 1, 1}};
+  auto col1_names_col = strcol_wrapper{
+    "Samuel Vimes", "Detritus", "Detritus", "Carrot Ironfoundersson", "Angua von Überwald"};
+  auto col1_ages_col = column_wrapper<int32_t>{{48, 35, 351, 22, 25}};
+
+  auto col1_is_human_col = column_wrapper<bool>{{true, true, false, false, true}, {1, 1, 0, 1, 1}};
+
+  auto col1_3 =
+    cudf::test::structs_column_wrapper{{col1_names_col, col1_ages_col, col1_is_human_col}};
+
+  CVector cols0, cols1;
+  cols0.push_back(col0_0.release());
+  cols0.push_back(col0_1.release());
+  cols0.push_back(col0_2.release());
+  cols0.push_back(col0_3.release());
+  cols1.push_back(col1_0.release());
+  cols1.push_back(col1_1.release());
+  cols1.push_back(col1_2.release());
+  cols1.push_back(col1_3.release());
+
+  Table t0(std::move(cols0));
+  Table t1(std::move(cols1));
+
+  auto result            = cudf::left_join(t0, t1, {3}, {3});
+  auto result_sort_order = cudf::sorted_order(result->view());
+  auto sorted_result     = cudf::gather(result->view(), *result_sort_order);
+
+  column_wrapper<int32_t> col_gold_0{{3, 2, 1, 0, 2}, {1, 1, 1, 1, 1}};
+  strcol_wrapper col_gold_1({"s1", "", "s1", "s4", "s0"}, {1, 0, 1, 1, 1});
+  column_wrapper<int32_t> col_gold_2{{0, 2, 1, 4, 1}, {1, 1, 1, 1, 1}};
+  auto col0_gold_names_col = strcol_wrapper{
+    "Samuel Vimes", "Detritus", "Carrot Ironfoundersson", "Samuel Vimes", "Angua von Überwald"};
+  auto col0_gold_ages_col = column_wrapper<int32_t>{{48, 351, 27, 31, 25}};
+
+  auto col0_gold_is_human_col =
+    column_wrapper<bool>{{true, false, true, false, false}, {1, 0, 1, 1, 0}};
+
+  auto col_gold_3 = cudf::test::structs_column_wrapper{
+    {col0_gold_names_col, col0_gold_ages_col, col0_gold_is_human_col}};
+
+  column_wrapper<int32_t> col_gold_4{{2, 0, -1, -1, -1}, {1, 1, 0, 0, 0}};
+  strcol_wrapper col_gold_5{{"s1", "s1", "", "", ""}, {1, 1, 0, 0, 0}};
+  column_wrapper<int32_t> col_gold_6{{1, 1, -1, -1, -1}, {1, 1, 0, 0, 0}};
+  auto col1_gold_names_col = strcol_wrapper{{
+                                              "Samuel Vimes",
+                                              "Detritus",
+                                              "",
+                                              "",
+                                              "",
+                                            },
+                                            {1, 1, 0, 0, 0}};
+  auto col1_gold_ages_col  = column_wrapper<int32_t>{{48, 351, -1, -1, -1}, {1, 1, 0, 0, 0}};
+
+  auto col1_gold_is_human_col =
+    column_wrapper<bool>{{true, false, false, false, false}, {1, 0, 0, 0, 0}};
+
+  auto col_gold_7 = cudf::test::structs_column_wrapper{
+    {col1_gold_names_col, col1_gold_ages_col, col1_gold_is_human_col}, {1, 1, 0, 0, 0}};
+
+  CVector cols_gold;
+  cols_gold.push_back(col_gold_0.release());
+  cols_gold.push_back(col_gold_1.release());
+  cols_gold.push_back(col_gold_2.release());
+  cols_gold.push_back(col_gold_3.release());
+  cols_gold.push_back(col_gold_4.release());
+  cols_gold.push_back(col_gold_5.release());
+  cols_gold.push_back(col_gold_6.release());
+  cols_gold.push_back(col_gold_7.release());
+  Table gold(std::move(cols_gold));
+
+  auto gold_sort_order = cudf::sorted_order(gold.view());
+  auto sorted_gold     = cudf::gather(gold.view(), *gold_sort_order);
+  CUDF_TEST_EXPECT_TABLES_EQUIVALENT(*sorted_gold, *sorted_result);
+}
+
 TEST_F(JoinTest, LeftJoinOnNulls)
 {
   // clang-format off
@@ -629,6 +720,91 @@ TEST_F(JoinTest, InnerJoinWithNulls)
   CUDF_TEST_EXPECT_TABLES_EQUIVALENT(*sorted_gold, *sorted_result);
 }
 
+TEST_F(JoinTest, InnerJoinWithStructsAndNulls)
+{
+  column_wrapper<int32_t> col0_0{{3, 1, 2, 0, 2}};
+  strcol_wrapper col0_1({"s1", "s1", "s0", "s4", "s0"}, {1, 1, 0, 1, 1});
+  column_wrapper<int32_t> col0_2{{0, 1, 2, 4, 1}};
+  std::initializer_list<std::string> col0_names = {
+    "Samuel Vimes", "Carrot Ironfoundersson", "Detritus", "Samuel Vimes", "Angua von Überwald"};
+  auto col0_names_col = strcol_wrapper{col0_names.begin(), col0_names.end()};
+  auto col0_ages_col  = column_wrapper<int32_t>{{48, 27, 351, 31, 25}};
+
+  auto col0_is_human_col = column_wrapper<bool>{{true, true, false, false, false}, {1, 1, 0, 1, 0}};
+
+  auto col0_3 =
+    cudf::test::structs_column_wrapper{{col0_names_col, col0_ages_col, col0_is_human_col}};
+
+  column_wrapper<int32_t> col1_0{{2, 2, 0, 4, 3}};
+  strcol_wrapper col1_1({"s1", "s0", "s1", "s2", "s1"});
+  column_wrapper<int32_t> col1_2{{1, 0, 1, 2, 1}, {1, 0, 1, 1, 1}};
+  std::initializer_list<std::string> col1_names = {"Carrot Ironfoundersson",
+                                                   "Angua von Überwald",
+                                                   "Detritus",
+                                                   "Carrot Ironfoundersson",
+                                                   "Samuel Vimes"};
+  auto col1_names_col = strcol_wrapper{col1_names.begin(), col1_names.end()};
+  auto col1_ages_col  = column_wrapper<int32_t>{{351, 25, 27, 31, 48}};
+
+  auto col1_is_human_col = column_wrapper<bool>{{true, false, false, false, true}, {1, 0, 0, 1, 1}};
+
+  auto col1_3 =
+    cudf::test::structs_column_wrapper{{col1_names_col, col1_ages_col, col1_is_human_col}};
+
+  CVector cols0, cols1;
+  cols0.push_back(col0_0.release());
+  cols0.push_back(col0_1.release());
+  cols0.push_back(col0_2.release());
+  cols0.push_back(col0_3.release());
+  cols1.push_back(col1_0.release());
+  cols1.push_back(col1_1.release());
+  cols1.push_back(col1_2.release());
+  cols1.push_back(col1_3.release());
+
+  Table t0(std::move(cols0));
+  Table t1(std::move(cols1));
+
+  auto result            = cudf::inner_join(t0, t1, {0, 1, 3}, {0, 1, 3});
+  auto result_sort_order = cudf::sorted_order(result->view());
+  auto sorted_result     = cudf::gather(result->view(), *result_sort_order);
+
+  column_wrapper<int32_t> col_gold_0{{3, 2}};
+  strcol_wrapper col_gold_1({"s1", "s0"}, {1, 1});
+  column_wrapper<int32_t> col_gold_2{{0, 1}};
+  auto col_gold_3_names_col = strcol_wrapper{"Samuel Vimes", "Angua von Überwald"};
+  auto col_gold_3_ages_col  = column_wrapper<int32_t>{{48, 25}};
+
+  auto col_gold_3_is_human_col = column_wrapper<bool>{{true, false}, {1, 0}};
+
+  auto col_gold_3 = cudf::test::structs_column_wrapper{
+    {col_gold_3_names_col, col_gold_3_ages_col, col_gold_3_is_human_col}};
+
+  column_wrapper<int32_t> col_gold_4{{3, 2}};
+  strcol_wrapper col_gold_5({"s1", "s0"}, {1, 1});
+  column_wrapper<int32_t> col_gold_6{{1, -1}, {1, 0}};
+  auto col_gold_7_names_col = strcol_wrapper{"Samuel Vimes", "Angua von Überwald"};
+  auto col_gold_7_ages_col  = column_wrapper<int32_t>{{48, 25}};
+
+  auto col_gold_7_is_human_col = column_wrapper<bool>{{true, false}, {1, 0}};
+
+  auto col_gold_7 = cudf::test::structs_column_wrapper{
+    {col_gold_7_names_col, col_gold_7_ages_col, col_gold_7_is_human_col}};
+  CVector cols_gold;
+  cols_gold.push_back(col_gold_0.release());
+  cols_gold.push_back(col_gold_1.release());
+  cols_gold.push_back(col_gold_2.release());
+  cols_gold.push_back(col_gold_3.release());
+  cols_gold.push_back(col_gold_4.release());
+  cols_gold.push_back(col_gold_5.release());
+  cols_gold.push_back(col_gold_6.release());
+  cols_gold.push_back(col_gold_7.release());
+  Table gold(std::move(cols_gold));
+
+  auto gold_sort_order = cudf::sorted_order(gold.view());
+  auto sorted_gold     = cudf::gather(gold.view(), *gold_sort_order);
+  CUDF_TEST_EXPECT_TABLES_EQUIVALENT(*sorted_gold, *sorted_result);
+}
+
 // // Test to check join behaviour when join keys are null.
 TEST_F(JoinTest, InnerJoinOnNulls)
 {
@@ -1359,4 +1535,128 @@ TEST_F(JoinDictionaryTest, FullJoinWithNulls)
   CUDF_TEST_EXPECT_TABLES_EQUIVALENT(*gold, cudf::table_view(result_decoded));
 }
 
+TEST_F(JoinTest, FullJoinWithStructsAndNulls)
+{
+  column_wrapper<int32_t> col0_0{{3, 1, 2, 0, 3}};
+  strcol_wrapper col0_1({"s0", "s1", "s2", "s4", "s1"});
+  column_wrapper<int32_t> col0_2{{0, 1, 2, 4, 1}};
+
+  std::initializer_list<std::string> col0_names = {"Samuel Vimes",
+                                                   "Carrot Ironfoundersson",
+                                                   "Angua von Überwald",
+                                                   "Detritus",
+                                                   "Carrot Ironfoundersson"};
+  auto col0_names_col = strcol_wrapper{col0_names.begin(), col0_names.end()};
+  auto col0_ages_col  = column_wrapper<int32_t>{{48, 27, 25, 31, 351}};
+
+  auto col0_is_human_col = column_wrapper<bool>{{true, true, false, false, false}, {1, 1, 0, 1, 1}};
+
+  auto col0_3 =
+    cudf::test::structs_column_wrapper{{col0_names_col, col0_ages_col, col0_is_human_col}};
+
+  column_wrapper<int32_t> col1_0{{2, 2, 0, 4, 3}, {1, 1, 1, 0, 1}};
+  strcol_wrapper col1_1{{"s1", "s0", "s1", "s2", "s1"}};
+  column_wrapper<int32_t> col1_2{{1, 0, 1, 2, 1}};
+
+  std::initializer_list<std::string> col1_names = {"Carrot Ironfoundersson",
+                                                   "Samuel Vimes",
+                                                   "Carrot Ironfoundersson",
+                                                   "Angua von Überwald",
+                                                   "Carrot Ironfoundersson"};
+  auto col1_names_col = strcol_wrapper{col1_names.begin(), col1_names.end()};
+  auto col1_ages_col  = column_wrapper<int32_t>{{27, 48, 27, 25, 27}};
+
+  auto col1_is_human_col = column_wrapper<bool>{{true, true, true, false, true}, {1, 1, 1, 0, 1}};
+
+  auto col1_3 =
+    cudf::test::structs_column_wrapper{{col1_names_col, col1_ages_col, col1_is_human_col}};
+
+  CVector cols0, cols1;
+  cols0.push_back(col0_0.release());
+  cols0.push_back(col0_1.release());
+  cols0.push_back(col0_2.release());
+  cols0.push_back(col0_3.release());
+  cols1.push_back(col1_0.release());
+  cols1.push_back(col1_1.release());
+  cols1.push_back(col1_2.release());
+  cols1.push_back(col1_3.release());
+
+  Table t0(std::move(cols0));
+  Table t1(std::move(cols1));
+
+  auto result            = cudf::full_join(t0, t1, {0, 1, 3}, {0, 1, 3});
+  auto result_sort_order = cudf::sorted_order(result->view());
+  auto sorted_result     = cudf::gather(result->view(), *result_sort_order);
+
+  column_wrapper<int32_t> col_gold_0{{3, 1, 2, 0, 3, -1, -1, -1, -1, -1},
+                                     {1, 1, 1, 1, 1, 0, 0, 0, 0, 0}};
+  strcol_wrapper col_gold_1({"s0", "s1", "s2", "s4", "s1", "", "", "", "", ""},
+                            {1, 1, 1, 1, 1, 0, 0, 0, 0, 0});
+  column_wrapper<int32_t> col_gold_2{{0, 1, 2, 4, 1, -1, -1, -1, -1, -1},
+                                     {1, 1, 1, 1, 1, 0, 0, 0, 0, 0}};
+  auto gold_names0_col = strcol_wrapper{{"Samuel Vimes",
+                                         "Carrot Ironfoundersson",
+                                         "Angua von Überwald",
+                                         "Detritus",
+                                         "Carrot Ironfoundersson",
+                                         "",
+                                         "",
+                                         "",
+                                         "",
+                                         ""},
+                                        {1, 1, 1, 1, 1, 0, 0, 0, 0, 0}};
+  auto gold_ages0_col  = column_wrapper<int32_t>{{48, 27, 25, 31, 351, -1, -1, -1, -1, -1},
+                                                {1, 1, 1, 1, 1, 0, 0, 0, 0, 0}};
+
+  auto gold_is_human0_col =
+    column_wrapper<bool>{{true, true, false, false, false, false, false, false, false, false},
+                         {1, 1, 0, 1, 1, 0, 0, 0, 0, 0}};
+
+  auto col_gold_3 = cudf::test::structs_column_wrapper{
+    {gold_names0_col, gold_ages0_col, gold_is_human0_col}, {1, 1, 1, 1, 1, 0, 0, 0, 0, 0}};
+
+  column_wrapper<int32_t> col_gold_4{{-1, -1, -1, -1, -1, 3, 2, 2, 0, 4},
+                                     {0, 0, 0, 0, 0, 1, 1, 1, 1, 0}};
+  strcol_wrapper col_gold_5({"", "", "", "", "", "s1", "s1", "s0", "s1", "s2"},
+                            {0, 0, 0, 0, 0, 1, 1, 1, 1, 1});
+  column_wrapper<int32_t> col_gold_6{{-1, -1, -1, -1, -1, 1, 1, 0, 1, 2},
+                                     {0, 0, 0, 0, 0, 1, 1, 1, 1, 1}};
+  auto gold_names1_col = strcol_wrapper{{"",
+                                         "",
+                                         "",
+                                         "",
+                                         "",
+                                         "Carrot Ironfoundersson",
+                                         "Carrot Ironfoundersson",
+                                         "Samuel Vimes",
+                                         "Carrot Ironfoundersson",
+                                         "Angua von Überwald"},
+                                        {0, 0, 0, 0, 0, 1, 1, 1, 1, 1}};
+  auto gold_ages1_col  = column_wrapper<int32_t>{{-1, -1, -1, -1, -1, 27, 27, 48, 27, 25},
+                                                {0, 0, 0, 0, 0, 1, 1, 1, 1, 1}};
+
+  auto gold_is_human1_col =
+    column_wrapper<bool>{{false, false, false, false, false, true, true, true, true, false},
+                         {0, 0, 0, 0, 0, 1, 1, 1, 1, 0}};
+
+  auto col_gold_7 = cudf::test::structs_column_wrapper{
+    {gold_names1_col, gold_ages1_col, gold_is_human1_col}, {0, 0, 0, 0, 0, 1, 1, 1, 1, 1}};
+
+  CVector cols_gold;
+  cols_gold.push_back(col_gold_0.release());
+  cols_gold.push_back(col_gold_1.release());
+  cols_gold.push_back(col_gold_2.release());
+  cols_gold.push_back(col_gold_3.release());
+  cols_gold.push_back(col_gold_4.release());
+  cols_gold.push_back(col_gold_5.release());
+  cols_gold.push_back(col_gold_6.release());
+  cols_gold.push_back(col_gold_7.release());
+
+  Table gold(std::move(cols_gold));
+
+  auto gold_sort_order = cudf::sorted_order(gold.view());
+  auto sorted_gold     = cudf::gather(gold.view(), *gold_sort_order);
+  CUDF_TEST_EXPECT_TABLES_EQUIVALENT(*sorted_gold, *sorted_result);
+}
+
 CUDF_TEST_PROGRAM_MAIN()
diff --git a/cpp/tests/partitioning/hash_partition_test.cpp b/cpp/tests/partitioning/hash_partition_test.cpp
index bbe6fbc432a..97c61c10718 100644
--- a/cpp/tests/partitioning/hash_partition_test.cpp
+++ b/cpp/tests/partitioning/hash_partition_test.cpp
@@ -214,6 +214,34 @@ TEST_F(HashPartition, UnsupportedHashFunction)
     cudf::logic_error);
 }
 
+TEST_F(HashPartition, CustomSeedValue)
+{
+  fixed_width_column_wrapper<float> floats({1.f, 2.f, 3.f, 4.f, 5.f, 6.f, 7.f, 8.f});
+  fixed_width_column_wrapper<int16_t> integers({1, 2, 3, 4, 5, 6, 7, 8});
+  strings_column_wrapper strings({"a", "bb", "ccc", "d", "ee", "fff", "gg", "h"});
+  auto input = cudf::table_view({floats, integers, strings});
+
+  auto columns_to_hash = std::vector<cudf::size_type>({0, 2});
+
+  cudf::size_type const num_partitions = 3;
+  std::unique_ptr<cudf::table> output1, output2;
+  std::vector<cudf::size_type> offsets1, offsets2;
+  std::tie(output1, offsets1) = cudf::hash_partition(
+    input, columns_to_hash, num_partitions, cudf::hash_id::HASH_MURMUR3, 12345);
+  std::tie(output2, offsets2) = cudf::hash_partition(
+    input, columns_to_hash, num_partitions, cudf::hash_id::HASH_MURMUR3, 12345);
+
+  // Expect offsets to have size num_partitions
+  EXPECT_EQ(static_cast<size_t>(num_partitions), offsets1.size());
+  EXPECT_EQ(offsets1.size(), offsets2.size());
+
+  // Expect output to have same shape as input
+  CUDF_TEST_EXPECT_TABLE_PROPERTIES_EQUAL(input, output1->view());
+
+  // Expect deterministic result from hashing the same input
+  CUDF_TEST_EXPECT_TABLES_EQUAL(output1->view(), output2->view());
+}
+
 template <typename T>
 class HashPartitionFixedWidth : public cudf::test::BaseFixture {
 };
diff --git a/cpp/tests/partitioning/partition_test.cpp b/cpp/tests/partitioning/partition_test.cpp
index a6838112a54..ed994da20f8 100644
--- a/cpp/tests/partitioning/partition_test.cpp
+++ b/cpp/tests/partitioning/partition_test.cpp
@@ -141,6 +141,35 @@ TYPED_TEST(PartitionTest, Identity)
   run_partition_test(table_to_partition, map, 6, table_to_partition, expected_offsets);
 }
 
+TYPED_TEST(PartitionTest, Struct)
+{
+  using value_type = cudf::test::GetType<TypeParam, 0>;
+  using map_type   = cudf::test::GetType<TypeParam, 1>;
+
+  fixed_width_column_wrapper<value_type, int32_t> A({1, 2}, {0, 1});
+  auto struct_col         = cudf::test::structs_column_wrapper({A}, {0, 1}).release();
+  auto table_to_partition = cudf::table_view{{*struct_col}};
+
+  fixed_width_column_wrapper<map_type> map{9, 2};
+
+  fixed_width_column_wrapper<value_type, int32_t> A_expected({2, 1}, {1, 0});
+  auto struct_expected = cudf::test::structs_column_wrapper({A_expected}, {1, 0}).release();
+  auto expected        = cudf::table_view{{*struct_expected}};
+
+  std::vector<cudf::size_type> expected_offsets{0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2};
+
+  // run_partition_test cannot be used here because struct columns cannot be sorted yet:
+  // run_partition_test(table_to_partition, map, 12, expected, expected_offsets);
+  // The expected ordering is unambiguous, so the checks are inlined here for now.
+  auto num_partitions                  = 12;
+  auto result                          = cudf::partition(table_to_partition, map, num_partitions);
+  auto const& actual_partitioned_table = result.first;
+  auto const& actual_offsets           = result.second;
+  EXPECT_EQ(actual_offsets, expected_offsets);
+
+  CUDF_TEST_EXPECT_TABLES_EQUAL(expected, *actual_partitioned_table);
+}
+
 TYPED_TEST(PartitionTest, Reverse)
 {
   using value_type = cudf::test::GetType<TypeParam, 0>;
diff --git a/cpp/tests/replace/replace_nulls_tests.cpp b/cpp/tests/replace/replace_nulls_tests.cpp
index bd3bf7ddd03..f6937c29d04 100644
--- a/cpp/tests/replace/replace_nulls_tests.cpp
+++ b/cpp/tests/replace/replace_nulls_tests.cpp
@@ -1,5 +1,5 @@
 /*
- * Copyright 2019, NVIDIA CORPORATION.
+ * Copyright 2019-2021, NVIDIA CORPORATION.
  *
  * Copyright 2018 BlazingDB, Inc.
  *     Copyright 2018 Alexander Ocsa <cristhian@blazingdb.com>
@@ -23,6 +23,7 @@
 
 #include <cudf/dictionary/detail/replace.hpp>
 #include <cudf/dictionary/encode.hpp>
+#include <cudf/fixed_point/fixed_point.hpp>
 #include <cudf/scalar/scalar.hpp>
 #include <cudf/scalar/scalar_factories.hpp>
 #include <cudf/utilities/error.hpp>
@@ -167,8 +168,9 @@ TEST_F(ReplaceNullsStringsTest, SimpleReplaceScalar)
 {
   std::vector<std::string> input{"", "", "", "", "", "", "", ""};
   std::vector<cudf::valid_type> input_v{0, 0, 0, 0, 0, 0, 0, 0};
-  std::unique_ptr<cudf::scalar> repl = cudf::make_string_scalar("rep", 0, mr());
-  repl->set_valid(true, 0);
+  std::unique_ptr<cudf::scalar> repl =
+    cudf::make_string_scalar("rep", rmm::cuda_stream_default, mr());
+  repl->set_valid(true, rmm::cuda_stream_default);
   std::vector<std::string> expected{"rep", "rep", "rep", "rep", "rep", "rep", "rep", "rep"};
 
   cudf::test::strings_column_wrapper input_w{input.begin(), input.end(), input_v.begin()};
@@ -437,6 +439,151 @@ TYPED_TEST(ReplaceNullsPolicyTest, FollowingFillTrailingNulls)
     cudf::replace_policy::FOLLOWING);
 }
 
+template <typename T>
+struct ReplaceNullsFixedPointTest : public cudf::test::BaseFixture {
+};
+
+TYPED_TEST_CASE(ReplaceNullsFixedPointTest, cudf::test::FixedPointTypes);
+
+TYPED_TEST(ReplaceNullsFixedPointTest, ReplaceColumn)
+{
+  auto const scale = numeric::scale_type{0};
+  auto const sz    = std::size_t{1000};
+  auto data_begin  = cudf::detail::make_counting_transform_iterator(0, [&](auto i) {
+    return TypeParam{i, scale};
+  });
+  auto valid_begin =
+    cudf::detail::make_counting_transform_iterator(0, [&](auto i) { return i % 3 ? 1 : 0; });
+  auto replace_begin  = cudf::detail::make_counting_transform_iterator(0, [&](auto i) {
+    return TypeParam{-2, scale};
+  });
+  auto expected_begin = cudf::detail::make_counting_transform_iterator(0, [&](auto i) {
+    int val = i % 3 ? static_cast<int>(i) : -2;
+    return TypeParam{val, scale};
+  });
+
+  ReplaceNullsColumn<TypeParam>(
+    cudf::test::fixed_width_column_wrapper<TypeParam>(data_begin, data_begin + sz, valid_begin),
+    cudf::test::fixed_width_column_wrapper<TypeParam>(replace_begin, replace_begin + sz),
+    cudf::test::fixed_width_column_wrapper<TypeParam>(expected_begin, expected_begin + sz));
+}
+
+TYPED_TEST(ReplaceNullsFixedPointTest, ReplaceColumn_Empty)
+{
+  ReplaceNullsColumn<TypeParam>(cudf::test::fixed_width_column_wrapper<TypeParam>{},
+                                cudf::test::fixed_width_column_wrapper<TypeParam>{},
+                                cudf::test::fixed_width_column_wrapper<TypeParam>{});
+}
+
+TYPED_TEST(ReplaceNullsFixedPointTest, ReplaceScalar)
+{
+  auto const scale = numeric::scale_type{0};
+  auto const sz    = std::size_t{1000};
+  auto data_begin  = cudf::detail::make_counting_transform_iterator(0, [&](auto i) {
+    return TypeParam{i, scale};
+  });
+  auto valid_begin =
+    cudf::detail::make_counting_transform_iterator(0, [&](auto i) { return i % 3 ? 1 : 0; });
+  auto expected_begin = cudf::detail::make_counting_transform_iterator(0, [&](auto i) {
+    int val = i % 3 ? static_cast<int>(i) : -2;
+    return TypeParam{val, scale};
+  });
+
+  cudf::fixed_point_scalar<TypeParam> replacement{-2, scale};
+
+  ReplaceNullsScalar<TypeParam>(
+    cudf::test::fixed_width_column_wrapper<TypeParam>(data_begin, data_begin + sz, valid_begin),
+    replacement,
+    cudf::test::fixed_width_column_wrapper<TypeParam>(expected_begin, expected_begin + sz));
+}
+
+TYPED_TEST(ReplaceNullsFixedPointTest, ReplacementHasNulls)
+{
+  auto const scale = numeric::scale_type{0};
+  auto const sz    = std::size_t{1000};
+  auto data_begin  = cudf::detail::make_counting_transform_iterator(0, [&](auto i) {
+    return TypeParam{i, scale};
+  });
+  auto data_valid_begin =
+    cudf::detail::make_counting_transform_iterator(0, [&](auto i) { return i % 3 ? 1 : 0; });
+  auto replace_begin = cudf::detail::make_counting_transform_iterator(0, [&](auto i) {
+    return TypeParam{-2, scale};
+  });
+  auto replace_valid_begin =
+    cudf::detail::make_counting_transform_iterator(0, [&](auto i) { return i % 2 ? 1 : 0; });
+  auto expected_begin = cudf::detail::make_counting_transform_iterator(0, [&](auto i) {
+    int val = i % 3 ? static_cast<int>(i) : -2;
+    return TypeParam{val, scale};
+  });
+  auto expected_valid_begin =
+    cudf::detail::make_counting_transform_iterator(0, [&](auto i) { return i % 6 ? 1 : 0; });
+
+  ReplaceNullsColumn<TypeParam>(cudf::test::fixed_width_column_wrapper<TypeParam>(
+                                  data_begin, data_begin + sz, data_valid_begin),
+                                cudf::test::fixed_width_column_wrapper<TypeParam>(
+                                  replace_begin, replace_begin + sz, replace_valid_begin),
+                                cudf::test::fixed_width_column_wrapper<TypeParam>(
+                                  expected_begin, expected_begin + sz, expected_valid_begin));
+}
+
+template <typename T>
+struct ReplaceNullsPolicyFixedPointTest : public cudf::test::BaseFixture {
+};
+
+TYPED_TEST_CASE(ReplaceNullsPolicyFixedPointTest, cudf::test::FixedPointTypes);
+
+TYPED_TEST(ReplaceNullsPolicyFixedPointTest, PrecedingFill)
+{
+  using fp     = TypeParam;
+  auto const s = numeric::scale_type{0};
+  auto col     = cudf::test::fixed_width_column_wrapper<TypeParam>(
+    {fp{42, s}, fp{2, s}, fp{1, s}, fp{-10, s}, fp{20, s}, fp{-30, s}}, {1, 0, 0, 1, 0, 1});
+  auto expect_col = cudf::test::fixed_width_column_wrapper<TypeParam>(
+    {fp{42, s}, fp{42, s}, fp{42, s}, fp{-10, s}, fp{-10, s}, fp{-30, s}}, {1, 1, 1, 1, 1, 1});
+
+  TestReplaceNullsWithPolicy(
+    std::move(col), std::move(expect_col), cudf::replace_policy::PRECEDING);
+}
+
+TYPED_TEST(ReplaceNullsPolicyFixedPointTest, FollowingFill)
+{
+  using fp     = TypeParam;
+  auto const s = numeric::scale_type{0};
+  auto col     = cudf::test::fixed_width_column_wrapper<TypeParam>(
+    {fp{42, s}, fp{2, s}, fp{1, s}, fp{-10, s}, fp{20, s}, fp{-30, s}}, {1, 0, 0, 1, 0, 1});
+  auto expect_col = cudf::test::fixed_width_column_wrapper<TypeParam>(
+    {fp{42, s}, fp{-10, s}, fp{-10, s}, fp{-10, s}, fp{-30, s}, fp{-30, s}}, {1, 1, 1, 1, 1, 1});
+
+  TestReplaceNullsWithPolicy(
+    std::move(col), std::move(expect_col), cudf::replace_policy::FOLLOWING);
+}
+
+TYPED_TEST(ReplaceNullsPolicyFixedPointTest, PrecedingFillLeadingNulls)
+{
+  using fp     = TypeParam;
+  auto const s = numeric::scale_type{0};
+  auto col     = cudf::test::fixed_width_column_wrapper<TypeParam>(
+    {fp{1, s}, fp{2, s}, fp{3, s}, fp{4, s}, fp{5, s}}, {0, 0, 1, 0, 1});
+  auto expect_col = cudf::test::fixed_width_column_wrapper<TypeParam>(
+    {fp{1, s}, fp{2, s}, fp{3, s}, fp{3, s}, fp{5, s}}, {0, 0, 1, 1, 1});
+
+  TestReplaceNullsWithPolicy(
+    std::move(col), std::move(expect_col), cudf::replace_policy::PRECEDING);
+}
+
+TYPED_TEST(ReplaceNullsPolicyFixedPointTest, FollowingFillTrailingNulls)
+{
+  using fp     = TypeParam;
+  auto const s = numeric::scale_type{0};
+  auto col     = cudf::test::fixed_width_column_wrapper<TypeParam>(
+    {fp{1, s}, fp{2, s}, fp{3, s}, fp{4, s}, fp{5, s}}, {1, 0, 1, 0, 0});
+  auto expect_col = cudf::test::fixed_width_column_wrapper<TypeParam>(
+    {fp{1, s}, fp{3, s}, fp{3, s}, fp{4, s}, fp{5, s}}, {1, 1, 1, 0, 0});
+
+  TestReplaceNullsWithPolicy(
+    std::move(col), std::move(expect_col), cudf::replace_policy::FOLLOWING);
+}
+
 struct ReplaceDictionaryTest : public cudf::test::BaseFixture {
 };
 
diff --git a/cpp/tests/rolling/rolling_test.cpp b/cpp/tests/rolling/rolling_test.cpp
index e7eaeb7f415..b6e2b35e760 100644
--- a/cpp/tests/rolling/rolling_test.cpp
+++ b/cpp/tests/rolling/rolling_test.cpp
@@ -879,15 +879,14 @@ TEST_F(RollingTestUdf, StaticWindow)
 
   std::unique_ptr<cudf::column> output;
 
-  auto start = cudf::detail::make_counting_transform_iterator(0, [size] __device__(size_type row) {
+  auto start = cudf::detail::make_counting_transform_iterator(0, [size](size_type row) {
     return std::accumulate(thrust::make_counting_iterator(std::max(0, row - 2 + 1)),
                            thrust::make_counting_iterator(std::min(size, row + 2 + 1)),
                            0);
   });
 
-  auto valid = cudf::detail::make_counting_transform_iterator(0, [size] __device__(size_type row) {
-    return (row != 0 && row != size - 2 && row != size - 1);
-  });
+  auto valid = cudf::detail::make_counting_transform_iterator(
+    0, [size](size_type row) { return (row != 0 && row != size - 2 && row != size - 1); });
 
   fixed_width_column_wrapper<int64_t> expected{start, start + size, valid};
 
@@ -895,7 +894,7 @@ TEST_F(RollingTestUdf, StaticWindow)
   auto cuda_udf_agg = cudf::make_udf_aggregation(
     cudf::udf_type::CUDA, this->cuda_func, cudf::data_type{cudf::type_id::INT64});
 
-  EXPECT_NO_THROW(output = cudf::rolling_window(input, 2, 2, 4, cuda_udf_agg));
+  output = cudf::rolling_window(input, 2, 2, 4, cuda_udf_agg);
 
   CUDF_TEST_EXPECT_COLUMNS_EQUAL(*output, expected);
 
@@ -903,7 +902,7 @@ TEST_F(RollingTestUdf, StaticWindow)
   auto ptx_udf_agg = cudf::make_udf_aggregation(
     cudf::udf_type::PTX, this->ptx_func, cudf::data_type{cudf::type_id::INT64});
 
-  EXPECT_NO_THROW(output = cudf::rolling_window(input, 2, 2, 4, ptx_udf_agg));
+  output = cudf::rolling_window(input, 2, 2, 4, ptx_udf_agg);
 
   CUDF_TEST_EXPECT_COLUMNS_EQUAL(*output, expected);
 }
@@ -941,7 +940,7 @@ TEST_F(RollingTestUdf, DynamicWindow)
   auto cuda_udf_agg = cudf::make_udf_aggregation(
     cudf::udf_type::CUDA, this->cuda_func, cudf::data_type{cudf::type_id::INT64});
 
-  EXPECT_NO_THROW(output = cudf::rolling_window(input, preceding, following, 2, cuda_udf_agg));
+  output = cudf::rolling_window(input, preceding, following, 2, cuda_udf_agg);
 
   CUDF_TEST_EXPECT_COLUMNS_EQUAL(*output, expected);
 
@@ -949,7 +948,7 @@ TEST_F(RollingTestUdf, DynamicWindow)
   auto ptx_udf_agg = cudf::make_udf_aggregation(
     cudf::udf_type::PTX, this->ptx_func, cudf::data_type{cudf::type_id::INT64});
 
-  EXPECT_NO_THROW(output = cudf::rolling_window(input, preceding, following, 2, ptx_udf_agg));
+  output = cudf::rolling_window(input, preceding, following, 2, ptx_udf_agg);
 
   CUDF_TEST_EXPECT_COLUMNS_EQUAL(*output, expected);
 }
diff --git a/cpp/tests/strings/factories_test.cu b/cpp/tests/strings/factories_test.cu
index bd463a7ab0d..be592478b13 100644
--- a/cpp/tests/strings/factories_test.cu
+++ b/cpp/tests/strings/factories_test.cu
@@ -17,11 +17,13 @@
 #include <tests/strings/utilities.h>
 #include <cudf/column/column_factories.hpp>
 #include <cudf/copying.hpp>
+#include <cudf/detail/utilities/vector_factories.hpp>
 #include <cudf/scalar/scalar.hpp>
 #include <cudf/scalar/scalar_factories.hpp>
 #include <cudf/strings/string_view.cuh>
 #include <cudf/strings/strings_column_view.hpp>
 #include <cudf/types.hpp>
+#include <cudf/utilities/span.hpp>
 #include <cudf_test/base_fixture.hpp>
 #include <cudf_test/column_utilities.hpp>
 #include <cudf_test/column_wrapper.hpp>
@@ -90,9 +92,9 @@ TEST_F(StringsFactoriesTest, CreateColumnFromPair)
   EXPECT_EQ(strings_view.chars().size(), memsize);
 
   // check string data
-  auto strings_data = cudf::strings::create_offsets(strings_view);
-  thrust::host_vector<char> h_chars_data(strings_data.first);
-  thrust::host_vector<cudf::size_type> h_offsets_data(strings_data.second);
+  auto strings_data   = cudf::strings::create_offsets(strings_view);
+  auto h_chars_data   = cudf::detail::make_std_vector_sync(strings_data.first);
+  auto h_offsets_data = cudf::detail::make_std_vector_sync(strings_data.second);
   EXPECT_EQ(memcmp(h_buffer.data(), h_chars_data.data(), h_buffer.size()), 0);
   EXPECT_EQ(
     memcmp(h_offsets.data(), h_offsets_data.data(), h_offsets.size() * sizeof(cudf::size_type)), 0);
@@ -146,9 +148,9 @@ TEST_F(StringsFactoriesTest, CreateColumnFromOffsets)
   EXPECT_EQ(strings_view.chars().size(), memsize);
 
   // check string data
-  auto strings_data = cudf::strings::create_offsets(strings_view);
-  thrust::host_vector<char> h_chars_data(strings_data.first);
-  thrust::host_vector<cudf::size_type> h_offsets_data(strings_data.second);
+  auto strings_data   = cudf::strings::create_offsets(strings_view);
+  auto h_chars_data   = cudf::detail::make_std_vector_sync(strings_data.first);
+  auto h_offsets_data = cudf::detail::make_std_vector_sync(strings_data.second);
   EXPECT_EQ(memcmp(h_buffer.data(), h_chars_data.data(), h_buffer.size()), 0);
   EXPECT_EQ(
     memcmp(h_offsets.data(), h_offsets_data.data(), h_offsets.size() * sizeof(cudf::size_type)), 0);
@@ -192,9 +194,9 @@ TEST_F(StringsFactoriesTest, CreateOffsets)
     std::vector<std::string>{"column", "of", "strings"}  // [3,6)
   };
   for (size_t idx = 0; idx < result.size(); idx++) {
-    auto strings_data = cudf::strings::create_offsets(cudf::strings_column_view(result[idx]));
-    thrust::host_vector<char> h_chars(strings_data.first);
-    thrust::host_vector<cudf::size_type> h_offsets(strings_data.second);
+    auto strings_data     = cudf::strings::create_offsets(cudf::strings_column_view(result[idx]));
+    auto h_chars          = cudf::detail::make_std_vector_sync(strings_data.first);
+    auto h_offsets        = cudf::detail::make_std_vector_sync(strings_data.second);
     auto expected_strings = expecteds[idx];
     for (size_t jdx = 0; jdx < h_offsets.size() - 1; ++jdx) {
       auto offset = h_offsets[jdx];
diff --git a/cpp/tests/strings/json_tests.cpp b/cpp/tests/strings/json_tests.cpp
new file mode 100644
index 00000000000..44eb35d4163
--- /dev/null
+++ b/cpp/tests/strings/json_tests.cpp
@@ -0,0 +1,761 @@
+/*
+ * Copyright (c) 2021, NVIDIA CORPORATION.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <cudf/scalar/scalar_factories.hpp>
+#include <cudf/strings/json.hpp>
+#include <cudf/strings/replace.hpp>
+#include <cudf/strings/strings_column_view.hpp>
+
+#include <cudf_test/base_fixture.hpp>
+#include <cudf_test/column_wrapper.hpp>
+
+// reference:  https://jsonpath.herokuapp.com/
+
+// clang-format off
+std::string json_string{
+  "{"
+    "\"store\": {""\"book\": ["
+        "{"
+          "\"category\": \"reference\","
+          "\"author\": \"Nigel Rees\","
+          "\"title\": \"Sayings of the Century\","
+          "\"price\": 8.95"
+        "},"
+        "{"
+          "\"category\": \"fiction\","
+          "\"author\": \"Evelyn Waugh\","
+          "\"title\": \"Sword of Honour\","
+          "\"price\": 12.99"
+        "},"
+        "{"
+          "\"category\": \"fiction\","
+          "\"author\": \"Herman Melville\","
+          "\"title\": \"Moby Dick\","
+          "\"isbn\": \"0-553-21311-3\","
+          "\"price\": 8.99"
+        "},"
+        "{"
+          "\"category\": \"fiction\","
+          "\"author\": \"J. R. R. Tolkien\","
+          "\"title\": \"The Lord of the Rings\","
+          "\"isbn\": \"0-395-19395-8\","
+          "\"price\": 22.99"
+        "}"
+      "],"
+      "\"bicycle\": {"
+        "\"color\": \"red\","
+        "\"price\": 19.95"
+      "}"
+    "},"
+    "\"expensive\": 10"
+  "}"
+};
+// clang-format on
+
+std::unique_ptr<cudf::column> drop_whitespace(cudf::column_view const& col)
+{
+  cudf::test::strings_column_wrapper whitespace{"\n", "\r", "\t"};
+  cudf::test::strings_column_wrapper repl{"", "", ""};
+
+  cudf::strings_column_view strings(col);
+  cudf::strings_column_view targets(whitespace);
+  cudf::strings_column_view replacements(repl);
+  return cudf::strings::replace(strings, targets, replacements);
+}
+
+struct JsonTests : public cudf::test::BaseFixture {
+};
+
+TEST_F(JsonTests, GetJsonObjectRootOp)
+{
+  // root
+  cudf::test::strings_column_wrapper input{json_string};
+  std::string json_path("$");
+  auto result_raw = cudf::strings::get_json_object(cudf::strings_column_view(input), json_path);
+  auto result     = drop_whitespace(*result_raw);
+
+  auto expected = drop_whitespace(input);
+
+  CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*result, *expected);
+}
+
+TEST_F(JsonTests, GetJsonObjectChildOp)
+{
+  {
+    cudf::test::strings_column_wrapper input{json_string};
+    std::string json_path("$.store");
+    auto result_raw = cudf::strings::get_json_object(cudf::strings_column_view(input), json_path);
+    auto result     = drop_whitespace(*result_raw);
+
+    // clang-format off
+    cudf::test::strings_column_wrapper expected_raw{
+      "{"
+        "\"book\": ["
+          "{"
+            "\"category\": \"reference\","
+            "\"author\": \"Nigel Rees\","
+            "\"title\": \"Sayings of the Century\","
+            "\"price\": 8.95"
+          "},"
+          "{"
+            "\"category\": \"fiction\","
+            "\"author\": \"Evelyn Waugh\","
+            "\"title\": \"Sword of Honour\","
+            "\"price\": 12.99"
+          "},"
+          "{"
+            "\"category\": \"fiction\","
+            "\"author\": \"Herman Melville\","
+            "\"title\": \"Moby Dick\","
+            "\"isbn\": \"0-553-21311-3\","
+            "\"price\": 8.99"
+          "},"
+          "{"
+            "\"category\": \"fiction\","
+            "\"author\": \"J. R. R. Tolkien\","
+            "\"title\": \"The Lord of the Rings\","
+            "\"isbn\": \"0-395-19395-8\","
+            "\"price\": 22.99"
+          "}"
+        "],"
+        "\"bicycle\": {"
+          "\"color\": \"red\","
+          "\"price\": 19.95"
+        "}"
+      "}"
+    };
+    // clang-format on
+    auto expected = drop_whitespace(expected_raw);
+
+    CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*result, *expected);
+  }
+
+  {
+    cudf::test::strings_column_wrapper input{json_string};
+    std::string json_path("$.store.book");
+    auto result_raw = cudf::strings::get_json_object(cudf::strings_column_view(input), json_path);
+    auto result     = drop_whitespace(*result_raw);
+
+    // clang-format off
+    cudf::test::strings_column_wrapper expected_raw{
+      "["
+        "{"
+          "\"category\": \"reference\","
+          "\"author\": \"Nigel Rees\","
+          "\"title\": \"Sayings of the Century\","
+          "\"price\": 8.95"
+        "},"
+        "{"
+          "\"category\": \"fiction\","
+          "\"author\": \"Evelyn Waugh\","
+          "\"title\": \"Sword of Honour\","
+          "\"price\": 12.99"
+        "},"
+        "{"
+          "\"category\": \"fiction\","
+          "\"author\": \"Herman Melville\","
+          "\"title\": \"Moby Dick\","
+          "\"isbn\": \"0-553-21311-3\","
+          "\"price\": 8.99"
+        "},"
+        "{"
+          "\"category\": \"fiction\","
+          "\"author\": \"J. R. R. Tolkien\","
+          "\"title\": \"The Lord of the Rings\","
+          "\"isbn\": \"0-395-19395-8\","
+          "\"price\": 22.99"
+        "}"
+      "]"
+    };
+    // clang-format on
+    auto expected = drop_whitespace(expected_raw);
+
+    CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*result, *expected);
+  }
+}
+
+TEST_F(JsonTests, GetJsonObjectWildcardOp)
+{
+  {
+    cudf::test::strings_column_wrapper input{json_string};
+    std::string json_path("$.store.*");
+    auto result_raw = cudf::strings::get_json_object(cudf::strings_column_view(input), json_path);
+    auto result     = drop_whitespace(*result_raw);
+
+    // clang-format off
+    cudf::test::strings_column_wrapper expected_raw{
+      "["
+        "["
+          "{"
+            "\"category\": \"reference\","
+            "\"author\": \"Nigel Rees\","
+            "\"title\": \"Sayings of the Century\","
+            "\"price\": 8.95"
+          "},"
+          "{"
+            "\"category\": \"fiction\","
+            "\"author\": \"Evelyn Waugh\","
+            "\"title\": \"Sword of Honour\","
+            "\"price\": 12.99"
+          "},"
+          "{"
+            "\"category\": \"fiction\","
+            "\"author\": \"Herman Melville\","
+            "\"title\": \"Moby Dick\","
+            "\"isbn\": \"0-553-21311-3\","
+            "\"price\": 8.99"
+          "},"
+          "{"
+            "\"category\": \"fiction\","
+            "\"author\": \"J. R. R. Tolkien\","
+            "\"title\": \"The Lord of the Rings\","
+            "\"isbn\": \"0-395-19395-8\","
+            "\"price\": 22.99"
+          "}"
+        "],"
+        "{"
+          "\"color\": \"red\","
+          "\"price\": 19.95"
+        "}"
+      "]"
+    };
+    // clang-format on
+    auto expected = drop_whitespace(expected_raw);
+
+    CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*result, *expected);
+  }
+
+  {
+    cudf::test::strings_column_wrapper input{json_string};
+    std::string json_path("*");
+    auto result_raw = cudf::strings::get_json_object(cudf::strings_column_view(input), json_path);
+    auto result     = drop_whitespace(*result_raw);
+
+    // clang-format off
+    cudf::test::strings_column_wrapper expected_raw{
+      "["
+        "{"
+          "\"book\": ["
+            "{"
+              "\"category\": \"reference\","
+              "\"author\": \"Nigel Rees\","
+              "\"title\": \"Sayings of the Century\","
+              "\"price\": 8.95"
+            "},"
+            "{"
+              "\"category\": \"fiction\","
+              "\"author\": \"Evelyn Waugh\","
+              "\"title\": \"Sword of Honour\","
+              "\"price\": 12.99"
+            "},"
+            "{"
+              "\"category\": \"fiction\","
+              "\"author\": \"Herman Melville\","
+              "\"title\": \"Moby Dick\","
+              "\"isbn\": \"0-553-21311-3\","
+              "\"price\": 8.99"
+            "},"
+            "{"
+              "\"category\": \"fiction\","
+              "\"author\": \"J. R. R. Tolkien\","
+              "\"title\": \"The Lord of the Rings\","
+              "\"isbn\": \"0-395-19395-8\","
+              "\"price\": 22.99"
+            "}"
+          "],"
+          "\"bicycle\": {"
+            "\"color\": \"red\","
+            "\"price\": 19.95"
+          "}"
+        "},"
+        "10"
+      "]"
+    };
+    // clang-format on
+    auto expected = drop_whitespace(expected_raw);
+
+    CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*result, *expected);
+  }
+}
+
+TEST_F(JsonTests, GetJsonObjectSubscriptOp)
+{
+  {
+    cudf::test::strings_column_wrapper input{json_string};
+    std::string json_path("$.store.book[2]");
+    auto result_raw = cudf::strings::get_json_object(cudf::strings_column_view(input), json_path);
+    auto result     = drop_whitespace(*result_raw);
+
+    // clang-format off
+    cudf::test::strings_column_wrapper expected_raw{
+      "{"
+        "\"category\": \"fiction\","
+        "\"author\": \"Herman Melville\","
+        "\"title\": \"Moby Dick\","
+        "\"isbn\": \"0-553-21311-3\","
+        "\"price\": 8.99"
+      "}"
+    };
+    // clang-format on
+    auto expected = drop_whitespace(expected_raw);
+
+    CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*result, *expected);
+  }
+
+  {
+    cudf::test::strings_column_wrapper input{json_string};
+    std::string json_path("$.store['bicycle']");
+    auto result_raw = cudf::strings::get_json_object(cudf::strings_column_view(input), json_path);
+    auto result     = drop_whitespace(*result_raw);
+
+    // clang-format off
+    cudf::test::strings_column_wrapper expected_raw{
+      "{"
+        "\"color\": \"red\","
+        "\"price\": 19.95"
+      "}"
+    };
+    // clang-format on
+    auto expected = drop_whitespace(expected_raw);
+
+    CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*result, *expected);
+  }
+
+  {
+    cudf::test::strings_column_wrapper input{json_string};
+    std::string json_path("$.store.book[*]");
+    auto result_raw = cudf::strings::get_json_object(cudf::strings_column_view(input), json_path);
+    auto result     = drop_whitespace(*result_raw);
+
+    // clang-format off
+    cudf::test::strings_column_wrapper expected_raw{
+      "["
+        "{"
+          "\"category\": \"reference\","
+          "\"author\": \"Nigel Rees\","
+          "\"title\": \"Sayings of the Century\","
+          "\"price\": 8.95"
+        "},"
+        "{"
+          "\"category\": \"fiction\","
+          "\"author\": \"Evelyn Waugh\","
+          "\"title\": \"Sword of Honour\","
+          "\"price\": 12.99"
+        "},"
+        "{"
+          "\"category\": \"fiction\","
+          "\"author\": \"Herman Melville\","
+          "\"title\": \"Moby Dick\","
+          "\"isbn\": \"0-553-21311-3\","
+          "\"price\": 8.99"
+        "},"
+        "{"
+          "\"category\": \"fiction\","
+          "\"author\": \"J. R. R. Tolkien\","
+          "\"title\": \"The Lord of the Rings\","
+          "\"isbn\": \"0-395-19395-8\","
+          "\"price\": 22.99"
+        "}"
+      "]"
+    };
+    // clang-format on
+    auto expected = drop_whitespace(expected_raw);
+
+    CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*result, *expected);
+  }
+}
+
+TEST_F(JsonTests, GetJsonObjectFilter)
+{
+  // queries that result in filtering/collating results (mostly meaning: generates new
+  // json instead of just returning parts of the existing string)
+
+  {
+    cudf::test::strings_column_wrapper input{json_string};
+    std::string json_path("$.store.book[*]['isbn']");
+    auto result_raw = cudf::strings::get_json_object(cudf::strings_column_view(input), json_path);
+    auto result     = drop_whitespace(*result_raw);
+
+    cudf::test::strings_column_wrapper expected_raw{"[\"0-553-21311-3\",\"0-395-19395-8\"]"};
+    auto expected = drop_whitespace(expected_raw);
+
+    CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*result, *expected);
+  }
+
+  {
+    cudf::test::strings_column_wrapper input{json_string};
+    std::string json_path("$.store.book[*].category");
+    auto result_raw = cudf::strings::get_json_object(cudf::strings_column_view(input), json_path);
+    auto result     = drop_whitespace(*result_raw);
+
+    cudf::test::strings_column_wrapper expected_raw{
+      "[\"reference\",\"fiction\",\"fiction\",\"fiction\"]"};
+    auto expected = drop_whitespace(expected_raw);
+
+    CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*result, *expected);
+  }
+
+  {
+    cudf::test::strings_column_wrapper input{json_string};
+    std::string json_path("$.store.book[*].title");
+    auto result_raw = cudf::strings::get_json_object(cudf::strings_column_view(input), json_path);
+    auto result     = drop_whitespace(*result_raw);
+
+    cudf::test::strings_column_wrapper expected_raw{
+      "[\"Sayings of the Century\",\"Sword of Honour\",\"Moby Dick\",\"The Lord of the Rings\"]"};
+    auto expected = drop_whitespace(expected_raw);
+
+    CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*result, *expected);
+  }
+
+  {
+    cudf::test::strings_column_wrapper input{json_string};
+    std::string json_path("$.store.book.*.price");
+    auto result_raw = cudf::strings::get_json_object(cudf::strings_column_view(input), json_path);
+    auto result     = drop_whitespace(*result_raw);
+
+    cudf::test::strings_column_wrapper expected_raw{"[8.95,12.99,8.99,22.99]"};
+    auto expected = drop_whitespace(expected_raw);
+
+    CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*result, *expected);
+  }
+
+  {
+    // spark behavioral difference.
+    //  standard:     "fiction"
+    //  spark:        fiction
+    cudf::test::strings_column_wrapper input{json_string};
+    std::string json_path("$.store.book[2].category");
+    auto result_raw = cudf::strings::get_json_object(cudf::strings_column_view(input), json_path);
+    auto result     = drop_whitespace(*result_raw);
+
+    cudf::test::strings_column_wrapper expected_raw{"fiction"};
+    auto expected = drop_whitespace(expected_raw);
+
+    CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*result, *expected);
+  }
+}
+
+TEST_F(JsonTests, GetJsonObjectNullInputs)
+{
+  {
+    std::string str("{\"a\" : \"b\"}");
+    cudf::test::strings_column_wrapper input({str, str, str, str}, {1, 0, 1, 0});
+
+    std::string json_path("$.a");
+    auto result_raw = cudf::strings::get_json_object(cudf::strings_column_view(input), json_path);
+    auto result     = drop_whitespace(*result_raw);
+
+    cudf::test::strings_column_wrapper expected_raw({"b", "", "b", ""}, {1, 0, 1, 0});
+    auto expected = drop_whitespace(expected_raw);
+
+    CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*result, *expected);
+  }
+}
+
+TEST_F(JsonTests, GetJsonObjectEmptyQuery)
+{
+  // empty query -> null
+  {
+    cudf::test::strings_column_wrapper input{"{\"a\" : \"b\"}"};
+    std::string json_path("");
+    auto result = cudf::strings::get_json_object(cudf::strings_column_view(input), json_path);
+
+    cudf::test::strings_column_wrapper expected({""}, {0});
+
+    CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*result, expected);
+  }
+}
+
+TEST_F(JsonTests, GetJsonObjectEmptyInputsAndOutputs)
+{
+  // empty input -> null
+  {
+    cudf::test::strings_column_wrapper input{""};
+    std::string json_path("$");
+    auto result = cudf::strings::get_json_object(cudf::strings_column_view(input), json_path);
+
+    cudf::test::strings_column_wrapper expected({""}, {0});
+
+    CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*result, expected);
+  }
+
+  // slightly different from "empty output". in this case, we're
+  // returning something, but it happens to be empty. so we expect
+  // a valid, but empty row
+  {
+    cudf::test::strings_column_wrapper input{"{\"store\": { \"bicycle\" : \"\" } }"};
+    std::string json_path("$.store.bicycle");
+    auto result = cudf::strings::get_json_object(cudf::strings_column_view(input), json_path);
+
+    cudf::test::strings_column_wrapper expected({""}, {1});
+
+    CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*result, expected);
+  }
+}
+
+// badly formed JSONpath strings
+TEST_F(JsonTests, GetJsonObjectIllegalQuery)
+{
+  // can't have more than one root operator, or a root operator anywhere other
+  // than the beginning
+  {
+    cudf::test::strings_column_wrapper input{"{\"a\": \"b\"}"};
+    std::string json_path("$$");
+    auto query = [&]() {
+      auto result = cudf::strings::get_json_object(cudf::strings_column_view(input), json_path);
+    };
+    EXPECT_THROW(query(), cudf::logic_error);
+  }
+
+  // invalid index
+  {
+    cudf::test::strings_column_wrapper input{"{\"a\": \"b\"}"};
+    std::string json_path("$[auh46h-]");
+    auto query = [&]() {
+      auto result = cudf::strings::get_json_object(cudf::strings_column_view(input), json_path);
+    };
+    EXPECT_THROW(query(), cudf::logic_error);
+  }
+
+  // invalid index (nested brackets)
+  {
+    cudf::test::strings_column_wrapper input{"{\"a\": \"b\"}"};
+    std::string json_path("$[[]]");
+    auto query = [&]() {
+      auto result = cudf::strings::get_json_object(cudf::strings_column_view(input), json_path);
+    };
+    EXPECT_THROW(query(), cudf::logic_error);
+  }
+
+  // negative index
+  {
+    cudf::test::strings_column_wrapper input{"{\"a\": \"b\"}"};
+    std::string json_path("$[-1]");
+    auto query = [&]() {
+      auto result = cudf::strings::get_json_object(cudf::strings_column_view(input), json_path);
+    };
+    EXPECT_THROW(query(), cudf::logic_error);
+  }
+
+  // child operator with no name specified
+  {
+    cudf::test::strings_column_wrapper input{"{\"a\": \"b\"}"};
+    std::string json_path(".");
+    auto query = [&]() {
+      auto result = cudf::strings::get_json_object(cudf::strings_column_view(input), json_path);
+    };
+    EXPECT_THROW(query(), cudf::logic_error);
+  }
+
+  {
+    cudf::test::strings_column_wrapper input{"{\"a\": \"b\"}"};
+    std::string json_path("][");
+    auto query = [&]() {
+      auto result = cudf::strings::get_json_object(cudf::strings_column_view(input), json_path);
+    };
+    EXPECT_THROW(query(), cudf::logic_error);
+  }
+
+  {
+    cudf::test::strings_column_wrapper input{"{\"a\": \"b\"}"};
+    std::string json_path("6hw6,56i3");
+    auto query = [&]() {
+      auto result = cudf::strings::get_json_object(cudf::strings_column_view(input), json_path);
+    };
+    EXPECT_THROW(query(), cudf::logic_error);
+  }
+}
+
+// queries that are legal, but reference invalid parts of the input
+TEST_F(JsonTests, GetJsonObjectInvalidQuery)
+{
+  // non-existent field
+  {
+    cudf::test::strings_column_wrapper input{"{\"a\": \"b\"}"};
+    std::string json_path("$[*].c");
+    auto result = cudf::strings::get_json_object(cudf::strings_column_view(input), json_path);
+
+    cudf::test::strings_column_wrapper expected({""}, {0});
+
+    CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*result, expected);
+  }
+
+  // non-existent field
+  {
+    cudf::test::strings_column_wrapper input{"{\"a\": \"b\"}"};
+    std::string json_path("$[*].c[2]");
+    auto result = cudf::strings::get_json_object(cudf::strings_column_view(input), json_path);
+
+    cudf::test::strings_column_wrapper expected({""}, {0});
+
+    CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*result, expected);
+  }
+
+  // non-existent field (book is an array, so it has no direct price child)
+  {
+    cudf::test::strings_column_wrapper input{json_string};
+    std::string json_path("$.store.book.price");
+    auto result = cudf::strings::get_json_object(cudf::strings_column_view(input), json_path);
+
+    cudf::test::strings_column_wrapper expected({""}, {0});
+
+    CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*result, expected);
+  }
+
+  // out of bounds index
+  {
+    cudf::test::strings_column_wrapper input{json_string};
+    std::string json_path("$.store.book[4]");
+    auto result = cudf::strings::get_json_object(cudf::strings_column_view(input), json_path);
+
+    cudf::test::strings_column_wrapper expected({""}, {0});
+
+    CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*result, expected);
+  }
+}
+
+TEST_F(JsonTests, MixedOutput)
+{
+  // various queries on:
+  // clang-format off
+  std::vector<std::string> input_strings {
+    "{\"a\": {\"b\" : \"c\"}}",
+
+    "{"
+      "\"a\": {\"b\" : \"c\"},"
+      "\"d\": [{\"e\":123}, {\"f\":-10}]"
+    "}",
+
+    "{"
+      "\"b\": 123"
+    "}",
+
+    "{"
+      "\"a\": [\"y\",500]"
+    "}",
+
+    "{"
+      "\"a\": \"\""
+    "}",
+
+    "{"
+      "\"a\": {"
+                "\"z\": {\"i\": 10, \"j\": 100},"
+                "\"b\": [\"c\",null,true,-1]"
+              "}"
+    "}"
+  };
+  // clang-format on
+  cudf::test::strings_column_wrapper input(input_strings.begin(), input_strings.end());
+  {
+    std::string json_path("$.a");
+    auto result = cudf::strings::get_json_object(cudf::strings_column_view(input), json_path);
+
+    // clang-format off
+    cudf::test::strings_column_wrapper expected({
+      "{\"b\" : \"c\"}",
+      "{\"b\" : \"c\"}",
+      "",
+      "[\"y\",500]",
+      "",
+      "{"
+         "\"z\": {\"i\": 10, \"j\": 100},"
+         "\"b\": [\"c\",null,true,-1]"
+      "}"
+      }, 
+      {1, 1, 0, 1, 1, 1});
+    // clang-format on
+
+    CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*result, expected);
+  }
+
+  {
+    std::string json_path("$.a[1]");
+    auto result = cudf::strings::get_json_object(cudf::strings_column_view(input), json_path);
+
+    // clang-format off
+    cudf::test::strings_column_wrapper expected({
+        "",
+        "",
+        "",
+        "500",
+        "",
+        "",
+      },
+      {0, 0, 0, 1, 0, 0});
+    // clang-format on
+
+    CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*result, expected);
+  }
+
+  {
+    std::string json_path("$.a.b");
+    auto result = cudf::strings::get_json_object(cudf::strings_column_view(input), json_path);
+
+    // clang-format off
+    cudf::test::strings_column_wrapper expected({
+      "c", 
+      "c", 
+      "", 
+      "", 
+      "", 
+      "[\"c\",null,true,-1]"},
+      {1, 1, 0, 0, 0, 1});
+    // clang-format on
+
+    CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*result, expected);
+  }
+
+  {
+    std::string json_path("$.a[*]");
+    auto result = cudf::strings::get_json_object(cudf::strings_column_view(input), json_path);
+
+    // clang-format off
+    cudf::test::strings_column_wrapper expected({
+      "[\"c\"]", 
+      "[\"c\"]", 
+      "", 
+      "[\"y\",500]", 
+      "[]", 
+      "["
+        "{\"i\": 10, \"j\": 100},"
+        "[\"c\",null,true,-1]"
+      "]" },
+      {1, 1, 0, 1, 1, 1});
+    // clang-format on
+
+    CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*result, expected);
+  }
+
+  {
+    std::string json_path("$.a.b[*]");
+    auto result = cudf::strings::get_json_object(cudf::strings_column_view(input), json_path);
+
+    // clang-format off
+    cudf::test::strings_column_wrapper expected({
+      "[]", 
+      "[]", 
+      "", 
+      "",
+      "",      
+      "[\"c\",null,true,-1]"},
+      {1, 1, 0, 0, 0, 1});
+    // clang-format on
+
+    CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*result, expected);
+  }
+}
diff --git a/cpp/tests/unary/math_ops_test.cpp b/cpp/tests/unary/math_ops_test.cpp
index 2799c2f13df..08a40edb776 100644
--- a/cpp/tests/unary/math_ops_test.cpp
+++ b/cpp/tests/unary/math_ops_test.cpp
@@ -25,7 +25,7 @@
 #include <cudf_test/column_wrapper.hpp>
 #include <cudf_test/type_lists.hpp>
 
-#include <climits>
+#include <cuda/std/climits>
 #include <vector>
 
 template <typename T>
diff --git a/cpp/tests/utilities/column_utilities.cu b/cpp/tests/utilities/column_utilities.cu
index 78a67464654..f17446ca1dc 100644
--- a/cpp/tests/utilities/column_utilities.cu
+++ b/cpp/tests/utilities/column_utilities.cu
@@ -32,7 +32,7 @@
 #include <cudf_test/cudf_gtest.hpp>
 #include <cudf_test/detail/column_utilities.hpp>
 
-#include <jit/type.h>
+#include <jit/type.hpp>
 
 #include <rmm/exec_policy.hpp>
 
@@ -71,7 +71,7 @@ struct column_property_comparator {
 
     // equivalent, but not exactly equal columns can have a different number of children if their
     // sizes are both 0. Specifically, empty string columns may or may not have children.
-    if (check_exact_equality || lhs.size() > 0) {
+    if (check_exact_equality || (lhs.size() > 0 && lhs.null_count() < lhs.size())) {
       EXPECT_EQ(lhs.num_children(), rhs.num_children());
     }
   }
@@ -93,8 +93,8 @@ struct column_property_comparator {
     // recurse
     cudf::type_dispatcher(lhs_l.child().type(),
                           column_property_comparator<check_exact_equality>{},
-                          lhs_l.get_sliced_child(0),
-                          rhs_l.get_sliced_child(0));
+                          lhs_l.get_sliced_child(rmm::cuda_stream_default),
+                          rhs_l.get_sliced_child(rmm::cuda_stream_default));
   }
 };
 
@@ -283,8 +283,9 @@ struct column_comparator_impl<list_view, check_exact_equality> {
     // compare offsets, taking slicing into account
 
     // left side
-    size_type lhs_shift = cudf::detail::get_value<size_type>(lhs_l.offsets(), lhs_l.offset(), 0);
-    auto lhs_offsets    = thrust::make_transform_iterator(
+    size_type lhs_shift =
+      cudf::detail::get_value<size_type>(lhs_l.offsets(), lhs_l.offset(), rmm::cuda_stream_default);
+    auto lhs_offsets = thrust::make_transform_iterator(
       lhs_l.offsets().begin<size_type>() + lhs_l.offset(),
       [lhs_shift] __device__(size_type offset) { return offset - lhs_shift; });
     auto lhs_valids = thrust::make_transform_iterator(
@@ -294,8 +295,9 @@ struct column_comparator_impl<list_view, check_exact_equality> {
       });
 
     // right side
-    size_type rhs_shift = cudf::detail::get_value<size_type>(rhs_l.offsets(), rhs_l.offset(), 0);
-    auto rhs_offsets    = thrust::make_transform_iterator(
+    size_type rhs_shift =
+      cudf::detail::get_value<size_type>(rhs_l.offsets(), rhs_l.offset(), rmm::cuda_stream_default);
+    auto rhs_offsets = thrust::make_transform_iterator(
       rhs_l.offsets().begin<size_type>() + rhs_l.offset(),
       [rhs_shift] __device__(size_type offset) { return offset - rhs_shift; });
     auto rhs_valids = thrust::make_transform_iterator(
@@ -328,8 +330,8 @@ struct column_comparator_impl<list_view, check_exact_equality> {
         differences, lhs, rhs, print_all_differences, depth);
 
     // recurse
-    auto lhs_child = lhs_l.get_sliced_child(0);
-    auto rhs_child = rhs_l.get_sliced_child(0);
+    auto lhs_child = lhs_l.get_sliced_child(rmm::cuda_stream_default);
+    auto rhs_child = rhs_l.get_sliced_child(rmm::cuda_stream_default);
     cudf::type_dispatcher(lhs_child.type(),
                           column_comparator<check_exact_equality>{},
                           lhs_child,
@@ -518,7 +520,8 @@ std::string nested_offsets_to_string(NestedColumnView const& c, std::string cons
   size_type output_size = c.size() + 1;
 
   // the first offset value to normalize everything against
-  size_type first = cudf::detail::get_value<size_type>(offsets, c.offset(), 0);
+  size_type first =
+    cudf::detail::get_value<size_type>(offsets, c.offset(), rmm::cuda_stream_default);
   rmm::device_vector<size_type> shifted_offsets(output_size);
 
   // normalize the offset values for the column offset
@@ -687,7 +690,7 @@ struct column_view_printer {
     lists_column_view lcv(col);
 
     // propage slicing to the child if necessary
-    column_view child    = lcv.get_sliced_child(0);
+    column_view child    = lcv.get_sliced_child(rmm::cuda_stream_default);
     bool const is_sliced = lcv.offset() > 0 || child.offset() > 0;
 
     std::string tmp =
diff --git a/cpp/tests/utilities_tests/span_tests.cu b/cpp/tests/utilities_tests/span_tests.cu
index 24884c15f64..22e15809a2d 100644
--- a/cpp/tests/utilities_tests/span_tests.cu
+++ b/cpp/tests/utilities_tests/span_tests.cu
@@ -209,7 +209,7 @@ TEST(SpanTest, CanConstructFromDeviceContainers)
 {
   auto d_thrust_vector = thrust::device_vector<int>(1);
   auto d_vector        = rmm::device_vector<int>(1);
-  auto d_uvector       = rmm::device_uvector<int>(1, 0);
+  auto d_uvector       = rmm::device_uvector<int>(1, rmm::cuda_stream_default);
 
   (void)device_span<int>(d_thrust_vector);
   (void)device_span<int>(d_vector);
diff --git a/docker_build/Dockerfile b/docker_build/Dockerfile
new file mode 100644
index 00000000000..0c04cab152a
--- /dev/null
+++ b/docker_build/Dockerfile
@@ -0,0 +1,81 @@
+# Copyright (c) 2021, NVIDIA CORPORATION.
+
+ARG CUDA_VERSION=11.2.2
+FROM nvidia/cuda:${CUDA_VERSION}-devel
+ENV CUDA_SHORT_VERSION=11.2
+
+SHELL ["/bin/bash", "-c"]
+ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/lib:/repos/dist/lib
+
+ENV DEBIAN_FRONTEND=noninteractive
+
+ENV CUDA_HOME=/usr/local/cuda
+ENV CUDA_PATH=$CUDA_HOME
+ENV PATH=${CUDA_HOME}/lib64/:${PATH}:${CUDA_HOME}/bin
+
+# Build env variables for arrow
+ENV CMAKE_BUILD_TYPE=release
+ENV PYARROW_WITH_PARQUET=1
+ENV PYARROW_WITH_CUDA=1
+ENV PYARROW_WITH_ORC=1
+ENV PYARROW_WITH_DATASET=1
+
+ENV ARROW_HOME=/repos/dist
+
+# Build env variables for rmm
+ENV INSTALL_PREFIX=/usr
+
+
+RUN apt-get update -y --fix-missing && \
+    apt-get upgrade -y && \
+    apt-get install -y --no-install-recommends software-properties-common && \
+    add-apt-repository ppa:deadsnakes/ppa && \
+    apt-get update -y --fix-missing
+
+RUN apt-get install -y --no-install-recommends \
+      git \
+      libboost-all-dev \
+      python3.8-dev \
+      build-essential \
+      autoconf \
+      bison \
+      flex \
+      libboost-filesystem-dev \
+      libboost-system-dev \
+      libboost-regex-dev \
+      libjemalloc-dev \
+      wget \
+      libssl-dev \
+      protobuf-compiler && \
+    apt-get autoremove -y && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists/* && \
+    update-alternatives --install /usr/bin/python python /usr/bin/python3.8 1 && \
+    wget https://bootstrap.pypa.io/get-pip.py && \
+    python get-pip.py
+
+# Install cmake
+RUN version=3.18 && build=5 && mkdir ~/temp && cd ~/temp && wget https://cmake.org/files/v$version/cmake-$version.$build.tar.gz && \
+    tar -xzvf cmake-$version.$build.tar.gz && cd cmake-$version.$build/ && ./bootstrap && make -j$(nproc) && make install
+
+# Install arrow from source
+RUN git clone https://github.com/apache/arrow.git /repos/arrow && mkdir /repos/dist/ && cd /repos/arrow && git checkout apache-arrow-1.0.1 && git submodule init && \
+    git submodule update && export PARQUET_TEST_DATA="${PWD}/cpp/submodules/parquet-testing/data" && export ARROW_TEST_DATA="${PWD}/testing/data" && \
+    cd /repos/arrow/cpp && mkdir release && cd /repos/arrow/cpp/release && pip install -r /repos/arrow/python/requirements-build.txt && \
+    cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME -DCMAKE_INSTALL_LIBDIR=lib -DARROW_FLIGHT=ON -DARROW_GANDIVA=OFF -DARROW_ORC=ON -DARROW_WITH_BZ2=ON -DARROW_WITH_ZLIB=ON -DARROW_WITH_ZSTD=ON -DARROW_WITH_LZ4=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_BROTLI=ON -DARROW_PARQUET=ON -DARROW_PYTHON=ON -DARROW_PLASMA=ON -DARROW_BUILD_TESTS=ON -DARROW_CUDA=ON -DARROW_DATASET=ON .. && \
+    make -j$(nproc) && make install && cd /repos/arrow/python/ && python setup.py build_ext --build-type=release bdist_wheel && pip install /repos/arrow/python/dist/*.whl
+
+
+# Install rmm from source
+RUN cd /repos/ && git clone https://github.com/rapidsai/rmm.git && cd /repos/rmm/ && ./build.sh librmm && pip install /repos/rmm/python/.
+
+ADD . /repos/cudf/
+
+# Build env for CUDF build
+ENV CUDF_HOME=/repos/cudf/
+ENV CUDF_ROOT=/repos/cudf/cpp/build/
+
+# Install cudf from source
+RUN cd /repos/cudf/ && git submodule update --init --recursive && ./build.sh libcudf && \
+    pip install /repos/cudf/python/cudf/.
+
diff --git a/docs/cudf/source/basics.rst b/docs/cudf/source/basics.rst
index e270708df90..15b4b43662b 100644
--- a/docs/cudf/source/basics.rst
+++ b/docs/cudf/source/basics.rst
@@ -34,6 +34,8 @@ The following table lists all of cudf types. For methods requiring dtype argumen
 +------------------------+------------------+-------------------------------------------------------------------------------------+---------------------------------------------+
 | Boolean                |                  | np.bool_                                                                            | ``'bool'``                                  |
 +------------------------+------------------+-------------------------------------------------------------------------------------+---------------------------------------------+
+| Decimal                | Decimal64Dtype   | (none)                                                                              | (none)                                      |
++------------------------+------------------+-------------------------------------------------------------------------------------+---------------------------------------------+
 
 **Note: All dtypes above are Nullable**
 
diff --git a/docs/cudf/source/conf.py b/docs/cudf/source/conf.py
index b68d7b5849f..18ffbacca1f 100644
--- a/docs/cudf/source/conf.py
+++ b/docs/cudf/source/conf.py
@@ -77,9 +77,9 @@
 # built documents.
 #
 # The short X.Y version.
-version = "0.19"
+version = '0.20'
 # The full version, including alpha/beta/rc tags.
-release = "0.19.0"
+release = '0.20.0'
 
 # The language for content autogenerated by Sphinx. Refer to documentation
 # for a list of supported languages.
diff --git a/docs/cudf/source/groupby.md b/docs/cudf/source/groupby.md
index 5376df261e7..8a0e5dddba0 100644
--- a/docs/cudf/source/groupby.md
+++ b/docs/cudf/source/groupby.md
@@ -120,24 +120,24 @@ a
 
 The following table summarizes the available aggregations and the types that support them:
 
-| Aggregations\dtypes | Numeric  | Datetime | String   | Categorical | List | Struct |
-| ------------------- | -------- | -------  | -------- | ----------- | ---- | ------ |
-| count               | ✅       | ✅       | ✅       | ✅          |      |        |
-| size                | ✅       | ✅       | ✅       | ✅          |      |        |
-| sum                 | ✅       | ✅       |          |             |      |        |
-| idxmin              | ✅       | ✅       |          |             |      |        |
-| idxmax              | ✅       | ✅       |          |             |      |        |
-| min                 | ✅       | ✅       | ✅       |             |      |        |
-| max                 | ✅       | ✅       | ✅       |             |      |        |
-| mean                | ✅       | ✅       |          |             |      |        |
-| var                 | ✅       | ✅       |          |             |      |        |
-| std                 | ✅       | ✅       |          |             |      |        |
-| quantile            | ✅       | ✅       |          |             |      |        |
-| median              | ✅       | ✅       |          |             |      |        |
-| nunique             | ✅       | ✅       | ✅       | ✅          |      |        |
-| nth                 | ✅       | ✅       | ✅       |             |      |        |
-| collect             | ✅       | ✅       | ✅       |             | ✅   |        |
-| unique              | ✅       | ✅       | ✅       | ✅          |      |        |
+| Aggregations\dtypes | Numeric  | Datetime | String   | Categorical | List | Struct | Interval | Decimal |
+| ------------------- | -------- | -------  | -------- | ----------- | ---- | ------ | -------- | ------- |
+| count               | ✅       | ✅       | ✅       | ✅          |      |        |          | ✅      |
+| size                | ✅       | ✅       | ✅       | ✅          |      |        |          | ✅      |
+| sum                 | ✅       | ✅       |          |             |      |        |          | ✅      |
+| idxmin              | ✅       | ✅       |          |             |      |        |          | ✅      |
+| idxmax              | ✅       | ✅       |          |             |      |        |          | ✅      |
+| min                 | ✅       | ✅       | ✅       |             |      |        |          | ✅      |
+| max                 | ✅       | ✅       | ✅       |             |      |        |          | ✅      |
+| mean                | ✅       | ✅       |          |             |      |        |          |         |
+| var                 | ✅       | ✅       |          |             |      |        |          |         |
+| std                 | ✅       | ✅       |          |             |      |        |          |         |
+| quantile            | ✅       | ✅       |          |             |      |        |          |         |
+| median              | ✅       | ✅       |          |             |      |        |          |         |
+| nunique             | ✅       | ✅       | ✅       | ✅          |      |        |          | ✅      |
+| nth                 | ✅       | ✅       | ✅       |             |      |        |          | ✅      |
+| collect             | ✅       | ✅       | ✅       |             | ✅   |        |          | ✅      |
+| unique              | ✅       | ✅       | ✅       | ✅          |      |        |          |         |
 
 ## GroupBy apply
 
diff --git a/java/README.md b/java/README.md
index 6ca58496605..366d014db95 100644
--- a/java/README.md
+++ b/java/README.md
@@ -38,12 +38,12 @@ In some cases there may be a classifier to indicate the version of cuda required
 Build From Source section below for more information about when this can happen. No official
 release of the jar will have a classifier on it.
 
-CUDA 10.0:
+CUDA 11.0:
 ```xml
 <dependency>
     <groupId>ai.rapids</groupId>
     <artifactId>cudf</artifactId>
-    <classifier>cuda10</classifier>
+    <classifier>cuda11</classifier>
     <version>${cudf.version}</version>
 </dependency>
 ```
diff --git a/java/ci/README.md b/java/ci/README.md
index 3ffed71b27c..8f45c0f89af 100644
--- a/java/ci/README.md
+++ b/java/ci/README.md
@@ -11,16 +11,14 @@
 
 In the root path of cuDF repo, run below command to build the docker image.
 ```bash
-docker build -f java/ci/Dockerfile.centos7 --build-arg CUDA_VERSION=10.1 -t cudf-build:10.1-devel-centos7 .
+docker build -f java/ci/Dockerfile.centos7 --build-arg CUDA_VERSION=11.0 -t cudf-build:11.0-devel-centos7 .
 ```
 
-We support different CUDA versions as below:
-* CUDA 10.1
-* CUDA 10.2
+The following CUDA versions are supported:
 * CUDA 11.0
 
 Change the --build-arg CUDA_VERSION to what you need.
-You can replace the tag "cudf-build:10.1-devel-centos7" with another name you like.
+You can replace the tag "cudf-build:11.0-devel-centos7" with another name you like.
 
 ## Start the docker then build
 
@@ -28,7 +26,7 @@ You can replace the tag "cudf-build:10.1-devel-centos7" with another name you li
 
 Run below command to start a docker container with GPU.
 ```bash
-nvidia-docker run -it cudf-build:10.1-devel-centos7 bash
+nvidia-docker run -it cudf-build:11.0-devel-centos7 bash
 ```
 
 ### Download the cuDF source code
@@ -36,7 +34,7 @@ nvidia-docker run -it cudf-build:10.1-devel-centos7 bash
 You can download the cuDF repo in the docker container or you can mount it into the container.
 Here I choose to download again in the container.
 ```bash
-git clone --recursive https://github.com/rapidsai/cudf.git -b branch-0.19
+git clone --recursive https://github.com/rapidsai/cudf.git -b branch-0.20
 ```
 
 ### Build cuDF jar with devtoolset
@@ -49,5 +47,5 @@ scl enable devtoolset-8 "java/ci/build-in-docker.sh"
 
 ### The output
 
-You can find the cuDF jar in java/target/ like cudf-0.19-SNAPSHOT-cuda10-1.jar.
+You can find the cuDF jar in java/target/ like cudf-0.20-SNAPSHOT-cuda11.jar.
 
diff --git a/java/ci/build-in-docker.sh b/java/ci/build-in-docker.sh
index eee943cde38..10be5b9c639 100755
--- a/java/ci/build-in-docker.sh
+++ b/java/ci/build-in-docker.sh
@@ -24,6 +24,8 @@ SKIP_JAVA_TESTS=${SKIP_JAVA_TESTS:-true}
 BUILD_CPP_TESTS=${BUILD_CPP_TESTS:-OFF}
 ENABLE_PTDS=${ENABLE_PTDS:-ON}
 RMM_LOGGING_LEVEL=${RMM_LOGGING_LEVEL:-OFF}
+ENABLE_NVTX=${ENABLE_NVTX:-ON}
+ENABLE_GDS=${ENABLE_GDS:-OFF}
 OUT=${OUT:-out}
 
 SIGN_FILE=$1
@@ -35,6 +37,8 @@ echo "SIGN_FILE: $SIGN_FILE,\
  SKIP_JAVA_TESTS: $SKIP_JAVA_TESTS,\
  BUILD_CPP_TESTS: $BUILD_CPP_TESTS,\
  ENABLED_PTDS: $ENABLE_PTDS,\
+ ENABLE_NVTX: $ENABLE_NVTX,\
+ ENABLE_GDS: $ENABLE_GDS,\
  RMM_LOGGING_LEVEL: $RMM_LOGGING_LEVEL,\
  OUT_PATH: $OUT_PATH"
 
@@ -51,12 +55,12 @@ export PATH=/usr/local/cmake-3.19.0-Linux-x86_64/bin:$PATH
 rm -rf $WORKSPACE/cpp/build
 mkdir -p $WORKSPACE/cpp/build
 cd $WORKSPACE/cpp/build
-cmake .. -DUSE_NVTX=OFF -DCUDF_USE_ARROW_STATIC=ON -DBoost_USE_STATIC_LIBS=ON -DBUILD_TESTS=$SKIP_CPP_TESTS -DPER_THREAD_DEFAULT_STREAM=$ENABLE_PTDS -DRMM_LOGGING_LEVEL=$RMM_LOGGING_LEVEL
+cmake .. -DUSE_NVTX=$ENABLE_NVTX -DCUDF_USE_ARROW_STATIC=ON -DBoost_USE_STATIC_LIBS=ON -DBUILD_TESTS=$SKIP_CPP_TESTS -DPER_THREAD_DEFAULT_STREAM=$ENABLE_PTDS -DRMM_LOGGING_LEVEL=$RMM_LOGGING_LEVEL
 make -j$PARALLEL_LEVEL
 make install DESTDIR=$INSTALL_PREFIX
 
 ###### Build cudf jar ######
-BUILD_ARG="-Dmaven.repo.local=$WORKSPACE/.m2 -DskipTests=$SKIP_JAVA_TESTS -DPER_THREAD_DEFAULT_STREAM=$ENABLE_PTDS -DRMM_LOGGING_LEVEL=$RMM_LOGGING_LEVEL"
+BUILD_ARG="-Dmaven.repo.local=$WORKSPACE/.m2 -DskipTests=$SKIP_JAVA_TESTS -DPER_THREAD_DEFAULT_STREAM=$ENABLE_PTDS -DRMM_LOGGING_LEVEL=$RMM_LOGGING_LEVEL -DUSE_GDS=$ENABLE_GDS"
 if [ "$SIGN_FILE" == true ]; then
     # Build javadoc and sources only when SIGN_FILE is true
     BUILD_ARG="$BUILD_ARG -Prelease"
@@ -70,6 +74,9 @@ fi
 cd $WORKSPACE/java
 mvn -B clean package $BUILD_ARG
 
+###### Sanity test: fail if static cudart found ######
+find . -name '*.so' | xargs -I{} readelf -Ws {} | grep cuInit && echo "Found statically linked CUDA runtime, this is currently not tested" && exit 1
+
 ###### Stash Jar files ######
 rm -rf $OUT_PATH
 mkdir -p $OUT_PATH
diff --git a/java/pom.xml b/java/pom.xml
index a3fd464b320..d94d51944a0 100755
--- a/java/pom.xml
+++ b/java/pom.xml
@@ -21,7 +21,7 @@
 
     <groupId>ai.rapids</groupId>
     <artifactId>cudf</artifactId>
-    <version>0.19-SNAPSHOT</version>
+    <version>0.20-SNAPSHOT</version>
 
     <name>cudfjni</name>
     <description>
diff --git a/java/src/main/java/ai/rapids/cudf/ColumnVector.java b/java/src/main/java/ai/rapids/cudf/ColumnVector.java
index e6675591164..fcdb5d44ad3 100644
--- a/java/src/main/java/ai/rapids/cudf/ColumnVector.java
+++ b/java/src/main/java/ai/rapids/cudf/ColumnVector.java
@@ -570,8 +570,7 @@ public static ColumnVector serial32BitMurmurHash3(int seed, ColumnView columns[]
       assert columns[i] != null : "Column vectors passed may not be null";
       assert columns[i].getRowCount() == size : "Row count mismatch, all columns must be the same size";
       assert !columns[i].getType().isDurationType() : "Unsupported column type Duration";
-      assert !columns[i].getType().isTimestampType() : "Unsupported column type Timestamp";
-      assert !columns[i].getType().isNestedType() : "Unsupported column of nested type";
+      assert !columns[i].getType().equals(DType.LIST) : "List columns are not supported";
       columnViews[i] = columns[i].getNativeView();
     }
     return new ColumnVector(hash(columnViews, HashType.HASH_SERIAL_MURMUR3.getNativeId(), new int[0], seed));
@@ -606,7 +605,7 @@ public static ColumnVector spark32BitMurmurHash3(int seed, ColumnView columns[])
       assert columns[i] != null : "Column vectors passed may not be null";
       assert columns[i].getRowCount() == size : "Row count mismatch, all columns must be the same size";
       assert !columns[i].getType().isDurationType() : "Unsupported column type Duration";
-      assert !columns[i].getType().isNestedType() : "Unsupported column of nested type";
+      assert !columns[i].getType().equals(DType.LIST) : "List columns are not supported";
       columnViews[i] = columns[i].getNativeView();
     }
     return new ColumnVector(hash(columnViews, HashType.HASH_SPARK_MURMUR3.getNativeId(), new int[0], seed));
diff --git a/java/src/main/java/ai/rapids/cudf/ColumnView.java b/java/src/main/java/ai/rapids/cudf/ColumnView.java
index 5d869ab75fb..402c64dd83d 100644
--- a/java/src/main/java/ai/rapids/cudf/ColumnView.java
+++ b/java/src/main/java/ai/rapids/cudf/ColumnView.java
@@ -2083,6 +2083,23 @@ public final ColumnVector substring(ColumnView start, ColumnView end) {
     return new ColumnVector(substringColumn(getNativeView(), start.getNativeView(), end.getNativeView()));
   }
 
+  /**
+   * Apply a JSONPath string to all rows in an input strings column.
+   *
+   * Applies a JSONPath string to an incoming strings column where each row in the column
+   * is a valid JSON string. The output is returned by row as a strings column.
+   *
+   * For reference, see https://tools.ietf.org/id/draft-goessner-dispatch-jsonpath-00.html
+   * Note: Only the following operators are implemented: $ . [] *
+   *
+   * @param path The JSONPath string to be applied to each row
+   * @return new strings ColumnVector containing the retrieved JSON object strings
+   */
+  public final ColumnVector getJSONObject(Scalar path) {
+    assert(type.equals(DType.STRING)) : "column type must be a String";
+    return new ColumnVector(getJSONObject(getNativeView(), path.getScalarHandle()));
+  }
+
   /**
    * Returns a new strings column where target string within each string is replaced with the specified
    * replacement string.
@@ -2649,6 +2666,8 @@ static DeviceMemoryBufferView getOffsetsBuffer(long viewHandle) {
    */
   private static native long stringTimestampToTimestamp(long viewHandle, int unit, String format);
 
+  private static native long getJSONObject(long viewHandle, long scalarHandle) throws CudfException;
+
   /**
    * Native method to parse and convert a timestamp column vector to string column vector. A unix
    * timestamp is a long value representing how many units since 1970-01-01 00:00:00:000 in either
diff --git a/java/src/main/java/ai/rapids/cudf/DeviceMemoryBuffer.java b/java/src/main/java/ai/rapids/cudf/DeviceMemoryBuffer.java
index 5753ecea74d..fa888625d47 100644
--- a/java/src/main/java/ai/rapids/cudf/DeviceMemoryBuffer.java
+++ b/java/src/main/java/ai/rapids/cudf/DeviceMemoryBuffer.java
@@ -122,7 +122,17 @@ private DeviceMemoryBuffer(long address, long lengthInBytes, DeviceMemoryBuffer
    * @return the buffer
    */
   public static DeviceMemoryBuffer allocate(long bytes) {
-    return Rmm.alloc(bytes);
+    return allocate(bytes, Cuda.DEFAULT_STREAM);
+  }
+
+  /**
+   * Allocate memory for use on the GPU. You must close it when done.
+   * @param bytes size in bytes to allocate
+   * @param stream The stream in which to synchronize this command
+   * @return the buffer
+   */
+  public static DeviceMemoryBuffer allocate(long bytes, Cuda.Stream stream) {
+    return Rmm.alloc(bytes, stream);
   }
 
   /**
diff --git a/java/src/main/java/ai/rapids/cudf/MemoryBuffer.java b/java/src/main/java/ai/rapids/cudf/MemoryBuffer.java
index a1be9b561a0..9f0d9a451c0 100644
--- a/java/src/main/java/ai/rapids/cudf/MemoryBuffer.java
+++ b/java/src/main/java/ai/rapids/cudf/MemoryBuffer.java
@@ -146,6 +146,39 @@ public final long getAddress() {
     return address;
   }
 
+  /**
+   * Copy a subset of src to this buffer starting at destOffset using the specified CUDA stream.
+   * The copy has completed when this returns, but the memory copy could overlap with
+   * operations occurring on other streams.
+   * @param destOffset the offset in this buffer to start copying to
+   * @param src the buffer to copy from
+   * @param srcOffset the offset in src to start copying from
+   * @param length how many bytes to copy
+   * @param stream CUDA stream to use
+   */
+  public final void copyFromMemoryBuffer(
+          long destOffset, MemoryBuffer src, long srcOffset, long length, Cuda.Stream stream) {
+    addressOutOfBoundsCheck(address + destOffset, length, "copy range dest");
+    src.addressOutOfBoundsCheck(src.address + srcOffset, length, "copy range src");
+    Cuda.memcpy(address + destOffset, src.address + srcOffset, length, CudaMemcpyKind.DEFAULT, stream);
+  }
+
+  /**
+   * Copy a subset of src to this buffer starting at destOffset using the specified CUDA stream.
+   * The copy is async and may not have completed when this returns.
+   * @param destOffset the offset in this buffer to start copying to
+   * @param src the buffer to copy from
+   * @param srcOffset the offset in src to start copying from
+   * @param length how many bytes to copy
+   * @param stream CUDA stream to use
+   */
+  public final void copyFromMemoryBufferAsync(
+          long destOffset, MemoryBuffer src, long srcOffset, long length, Cuda.Stream stream) {
+    addressOutOfBoundsCheck(address + destOffset, length, "copy range dest");
+    src.addressOutOfBoundsCheck(src.address + srcOffset, length, "copy range src");
+    Cuda.asyncMemcpy(address + destOffset, src.address + srcOffset, length, CudaMemcpyKind.DEFAULT, stream);
+  }
+
   /**
    * Slice off a part of the buffer. Note that this is a zero copy operation and all
    * slices must be closed along with the original buffer before the memory is released.
diff --git a/java/src/main/native/CMakeLists.txt b/java/src/main/native/CMakeLists.txt
index 46b3f0c5a53..bd38c7ca0b6 100755
--- a/java/src/main/native/CMakeLists.txt
+++ b/java/src/main/native/CMakeLists.txt
@@ -32,7 +32,7 @@ elseif(CMAKE_CUDA_ARCHITECTURES STREQUAL "")
   set(CUDF_JNI_BUILD_FOR_DETECTED_ARCHS TRUE)
 endif()
 
-project(CUDF_JNI VERSION 0.19 LANGUAGES C CXX)
+project(CUDF_JNI VERSION 0.20.0 LANGUAGES C CXX)
 
 ###################################################################################################
 # - build options ---------------------------------------------------------------------------------
diff --git a/java/src/main/native/cmake/Modules/ConfigureNvcomp.cmake b/java/src/main/native/cmake/Modules/ConfigureNvcomp.cmake
index 0bef79116af..bff0f4ac606 100644
--- a/java/src/main/native/cmake/Modules/ConfigureNvcomp.cmake
+++ b/java/src/main/native/cmake/Modules/ConfigureNvcomp.cmake
@@ -16,7 +16,13 @@
 
 set(NVCOMP_ROOT "${CMAKE_BINARY_DIR}/nvcomp")
 
-set(NVCOMP_CMAKE_ARGS "-DUSE_RMM=ON -DCUB_DIR=${CUB_INCLUDE}")
+if(CUDA_STATIC_RUNTIME)
+  set(NVCOMP_CUDA_RUNTIME_LIBRARY Static)
+else()
+  set(NVCOMP_CUDA_RUNTIME_LIBRARY Shared)
+endif()
+
+set(NVCOMP_CMAKE_ARGS "-DCMAKE_CUDA_RUNTIME_LIBRARY=${NVCOMP_CUDA_RUNTIME_LIBRARY} -DUSE_RMM=ON -DCUB_DIR=${CUB_INCLUDE}")
 
 configure_file("${CMAKE_SOURCE_DIR}/cmake/Templates/Nvcomp.CMakeLists.txt.cmake"
                "${NVCOMP_ROOT}/CMakeLists.txt")
diff --git a/java/src/main/native/src/ColumnViewJni.cpp b/java/src/main/native/src/ColumnViewJni.cpp
index dc1acc50b5f..cec3a1a92a6 100644
--- a/java/src/main/native/src/ColumnViewJni.cpp
+++ b/java/src/main/native/src/ColumnViewJni.cpp
@@ -54,6 +54,7 @@
 #include <cudf/strings/split/split.hpp>
 #include <cudf/strings/strip.hpp>
 #include <cudf/strings/substring.hpp>
+#include <cudf/strings/json.hpp>
 #include <cudf/transform.hpp>
 #include <cudf/unary.hpp>
 #include <cudf/utilities/bit.hpp>
@@ -65,6 +66,8 @@
 
 #include "cudf_jni_apis.hpp"
 #include "dtype_utils.hpp"
+#include "jni.h"
+#include "jni_utils.hpp"
 
 namespace {
 
@@ -1835,4 +1838,24 @@ JNIEXPORT jlong JNICALL Java_ai_rapids_cudf_ColumnView_copyColumnViewToCV(JNIEnv
   }
   CATCH_STD(env, 0)
 }
+
+JNIEXPORT jlong JNICALL Java_ai_rapids_cudf_ColumnView_getJSONObject(JNIEnv *env, jclass,
+                                                                     jlong j_view_handle, jlong j_scalar_handle) {
+
+  JNI_NULL_CHECK(env, j_view_handle, "view cannot be null", 0);
+  JNI_NULL_CHECK(env, j_scalar_handle, "path cannot be null", 0);
+
+  try {
+    cudf::jni::auto_set_device(env);
+    cudf::column_view *n_column_view = reinterpret_cast<cudf::column_view *>(j_view_handle);
+    cudf::strings_column_view n_strings_col_view(*n_column_view);
+    cudf::string_scalar *n_scalar_path = reinterpret_cast<cudf::string_scalar *>(j_scalar_handle);
+
+    auto result = cudf::strings::get_json_object(n_strings_col_view, *n_scalar_path);
+
+    return reinterpret_cast<jlong>(result.release());
+  }
+  CATCH_STD(env, 0)
+
+}
 } // extern "C"
diff --git a/java/src/test/java/ai/rapids/cudf/ColumnVectorTest.java b/java/src/test/java/ai/rapids/cudf/ColumnVectorTest.java
index fe1cba5ceb1..36123704ae6 100644
--- a/java/src/test/java/ai/rapids/cudf/ColumnVectorTest.java
+++ b/java/src/test/java/ai/rapids/cudf/ColumnVectorTest.java
@@ -490,6 +490,25 @@ void testSerial32BitMurmur3HashMixed() {
     }
   }
 
+  @Test
+  void testSerial32BitMurmur3HashStruct() {
+    try (ColumnVector strings = ColumnVector.fromStrings(
+        "a", "B\n", "dE\"\u0100\t\u0101 \ud720\ud721",
+        "A very long (greater than 128 bytes/char string) to test a multi hash-step data point " +
+            "in the MD5 hash function. This string needed to be longer.",
+        null, null);
+         ColumnVector integers = ColumnVector.fromBoxedInts(0, 100, -100, Integer.MIN_VALUE, Integer.MAX_VALUE, null);
+         ColumnVector doubles = ColumnVector.fromBoxedDoubles(
+             0.0, 100.0, -100.0, POSITIVE_DOUBLE_NAN_LOWER_RANGE, POSITIVE_DOUBLE_NAN_UPPER_RANGE, null);
+         ColumnVector floats = ColumnVector.fromBoxedFloats(
+             0f, 100f, -100f, NEGATIVE_FLOAT_NAN_LOWER_RANGE, NEGATIVE_FLOAT_NAN_UPPER_RANGE, null);
+         ColumnVector bools = ColumnVector.fromBoxedBooleans(true, false, null, false, true, null);
+         ColumnVector result = ColumnVector.serial32BitMurmurHash3(1868, new ColumnVector[]{strings, integers, doubles, floats, bools});
+         ColumnVector expected = ColumnVector.fromBoxedInts(387200465, 1988790727, 774895031, 814731646, -1073686048, 1868)) {
+      assertColumnsAreEqual(expected, result);
+    }
+  }
+
   @Test
   void testSpark32BitMurmur3HashStrings() {
     try (ColumnVector v0 = ColumnVector.fromStrings(
@@ -529,6 +548,8 @@ void testSpark32BitMurmur3HashDoubles() {
 
   @Test
   void testSpark32BitMurmur3HashTimestamps() {
+    // The hash values were derived from Apache Spark in a manner similar to the one documented at
+    // https://github.com/rapidsai/cudf/blob/aa7ca46dcd9e/cpp/tests/hashing/hash_test.cpp#L281-L307
     try (ColumnVector v = ColumnVector.timestampMicroSecondsFromBoxedLongs(
         0L, null, 100L, -100L, 0x123456789abcdefL, null, -0x123456789abcdefL);
          ColumnVector result = ColumnVector.spark32BitMurmurHash3(42, new ColumnVector[]{v});
@@ -539,6 +560,8 @@ void testSpark32BitMurmur3HashTimestamps() {
 
   @Test
   void testSpark32BitMurmur3HashDecimal64() {
+    // The hash values were derived from Apache Spark in a manner similar to the one documented at
+    // https://github.com/rapidsai/cudf/blob/aa7ca46dcd9e/cpp/tests/hashing/hash_test.cpp#L281-L307
     try (ColumnVector v = ColumnVector.decimalFromLongs(-7,
         0L, 100L, -100L, 0x123456789abcdefL, -0x123456789abcdefL);
          ColumnVector result = ColumnVector.spark32BitMurmurHash3(42, new ColumnVector[]{v});
@@ -549,6 +572,8 @@ void testSpark32BitMurmur3HashDecimal64() {
 
   @Test
   void testSpark32BitMurmur3HashDecimal32() {
+    // The hash values were derived from Apache Spark in a manner similar to the one documented at
+    // https://github.com/rapidsai/cudf/blob/aa7ca46dcd9e/cpp/tests/hashing/hash_test.cpp#L281-L307
     try (ColumnVector v = ColumnVector.decimalFromInts(-3,
         0, 100, -100, 0x12345678, -0x12345678);
          ColumnVector result = ColumnVector.spark32BitMurmurHash3(42, new ColumnVector[]{v});
@@ -559,6 +584,8 @@ void testSpark32BitMurmur3HashDecimal32() {
 
   @Test
   void testSpark32BitMurmur3HashDates() {
+    // The hash values were derived from Apache Spark in a manner similar to the one documented at
+    // https://github.com/rapidsai/cudf/blob/aa7ca46dcd9e/cpp/tests/hashing/hash_test.cpp#L281-L307
     try (ColumnVector v = ColumnVector.timestampDaysFromBoxedInts(
         0, null, 100, -100, 0x12345678, null, -0x12345678);
          ColumnVector result = ColumnVector.spark32BitMurmurHash3(42, new ColumnVector[]{v});
@@ -587,7 +614,6 @@ void testSpark32BitMurmur3HashBools() {
          ColumnVector result = ColumnVector.spark32BitMurmurHash3(0, new ColumnVector[]{v0, v1});
          ColumnVector expected = ColumnVector.fromBoxedInts(0, -1589400010, -239939054, -68075478, 593689054, -1194558265)) {
       assertColumnsAreEqual(expected, result);
-
     }
   }
 
@@ -610,6 +636,26 @@ void testSpark32BitMurmur3HashMixed() {
     }
   }
 
+  @Test
+  void testSpark32BitMurmur3HashStruct() {
+    try (ColumnVector strings = ColumnVector.fromStrings(
+        "a", "B\n", "dE\"\u0100\t\u0101 \ud720\ud721",
+        "A very long (greater than 128 bytes/char string) to test a multi hash-step data point " +
+            "in the MD5 hash function. This string needed to be longer.",
+        null, null);
+         ColumnVector integers = ColumnVector.fromBoxedInts(0, 100, -100, Integer.MIN_VALUE, Integer.MAX_VALUE, null);
+         ColumnVector doubles = ColumnVector.fromBoxedDoubles(
+             0.0, 100.0, -100.0, POSITIVE_DOUBLE_NAN_LOWER_RANGE, POSITIVE_DOUBLE_NAN_UPPER_RANGE, null);
+         ColumnVector floats = ColumnVector.fromBoxedFloats(
+             0f, 100f, -100f, NEGATIVE_FLOAT_NAN_LOWER_RANGE, NEGATIVE_FLOAT_NAN_UPPER_RANGE, null);
+         ColumnVector bools = ColumnVector.fromBoxedBooleans(true, false, null, false, true, null);
+         ColumnView structs = ColumnView.makeStructView(strings, integers, doubles, floats, bools);
+         ColumnVector result = ColumnVector.spark32BitMurmurHash3(1868, new ColumnView[]{structs});
+         ColumnVector expected = ColumnVector.spark32BitMurmurHash3(1868, new ColumnVector[]{strings, integers, doubles, floats, bools})) {
+      assertColumnsAreEqual(expected, result);
+    }
+  }
+
   @Test
   void testAndNullReconfigureNulls() {
     try (ColumnVector v0 = ColumnVector.fromBoxedInts(0, 100, null, null, Integer.MIN_VALUE, null);
@@ -4132,6 +4178,50 @@ void testCopyToColumnVector() {
     }
   }
 
+  @Test
+  void testGetJSONObject() {
+    String jsonString = "{ \"store\": {\n" +
+        "    \"book\": [\n" +
+        "      { \"category\": \"reference\",\n" +
+        "        \"author\": \"Nigel Rees\",\n" +
+        "        \"title\": \"Sayings of the Century\",\n" +
+        "        \"price\": 8.95\n" +
+        "      },\n" +
+        "      { \"category\": \"fiction\",\n" +
+        "        \"author\": \"Evelyn Waugh\",\n" +
+        "        \"title\": \"Sword of Honour\",\n" +
+        "        \"price\": 12.99\n" +
+        "      },\n" +
+        "      { \"category\": \"fiction\",\n" +
+        "        \"author\": \"Herman Melville\",\n" +
+        "        \"title\": \"Moby Dick\",\n" +
+        "        \"isbn\": \"0-553-21311-3\",\n" +
+        "        \"price\": 8.99\n" +
+        "      },\n" +
+        "      { \"category\": \"fiction\",\n" +
+        "        \"author\": \"J. R. R. Tolkien\",\n" +
+        "        \"title\": \"The Lord of the Rings\",\n" +
+        "        \"isbn\": \"0-395-19395-8\",\n" +
+        "        \"price\": 22.99\n" +
+        "      }\n" +
+        "    ],\n" +
+        "    \"bicycle\": {\n" +
+        "      \"color\": \"red\",\n" +
+        "      \"price\": 19.95\n" +
+        "    }\n" +
+        "  }\n" +
+        "}";
+
+    try (ColumnVector json = ColumnVector.fromStrings(jsonString, jsonString);
+         ColumnVector expectedAuthors = ColumnVector.fromStrings("[\"Nigel Rees\",\"Evelyn " +
+             "Waugh\",\"Herman Melville\",\"J. R. R. Tolkien\"]", "[\"Nigel Rees\",\"Evelyn " +
+             "Waugh\",\"Herman Melville\",\"J. R. R. Tolkien\"]");
+         Scalar path = Scalar.fromString("$.store.book[*].author");
+         ColumnVector gotAuthors = json.getJSONObject(path)) {
+      assertColumnsAreEqual(expectedAuthors, gotAuthors);
+    }
+  }
+
   @Test
   void testMakeStructEmpty() {
     final int numRows = 10;
diff --git a/java/src/test/java/ai/rapids/cudf/MemoryBufferTest.java b/java/src/test/java/ai/rapids/cudf/MemoryBufferTest.java
new file mode 100644
index 00000000000..df710c71f63
--- /dev/null
+++ b/java/src/test/java/ai/rapids/cudf/MemoryBufferTest.java
@@ -0,0 +1,171 @@
+/*
+ *
+ *  Copyright (c) 2021, NVIDIA CORPORATION.
+ *
+ *  Licensed under the Apache License, Version 2.0 (the "License");
+ *  you may not use this file except in compliance with the License.
+ *  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ *  Unless required by applicable law or agreed to in writing, software
+ *  distributed under the License is distributed on an "AS IS" BASIS,
+ *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ *  See the License for the specific language governing permissions and
+ *  limitations under the License.
+ *
+ */
+
+package ai.rapids.cudf;
+
+import org.junit.jupiter.api.Test;
+
+import static org.junit.jupiter.api.Assertions.*;
+
+public class MemoryBufferTest extends CudfTestBase {
+  private static final byte[] BYTES = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15};
+  private static final byte[] EXPECTED = {0, 2, 3, 4, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15};
+
+  @Test
+  public void testAddressOutOfBoundsExceptionWhenCopying() {
+    try (HostMemoryBuffer from = HostMemoryBuffer.allocate(16);
+         HostMemoryBuffer to = HostMemoryBuffer.allocate(16)) {
+      assertThrows(AssertionError.class, () -> to.copyFromMemoryBuffer(-1, from, 0, 16, Cuda.DEFAULT_STREAM));
+      assertThrows(AssertionError.class, () -> to.copyFromMemoryBuffer(16, from, 0, 16, Cuda.DEFAULT_STREAM));
+      assertThrows(AssertionError.class, () -> to.copyFromMemoryBuffer(0, from, -1, 16, Cuda.DEFAULT_STREAM));
+      assertThrows(AssertionError.class, () -> to.copyFromMemoryBuffer(0, from, 16, 16, Cuda.DEFAULT_STREAM));
+      assertThrows(AssertionError.class, () -> to.copyFromMemoryBuffer(0, from, 0, -1, Cuda.DEFAULT_STREAM));
+      assertThrows(AssertionError.class, () -> to.copyFromMemoryBuffer(0, from, 0, 17, Cuda.DEFAULT_STREAM));
+      assertThrows(AssertionError.class, () -> to.copyFromMemoryBuffer(1, from, 0, 16, Cuda.DEFAULT_STREAM));
+      assertThrows(AssertionError.class, () -> to.copyFromMemoryBuffer(0, from, 1, 16, Cuda.DEFAULT_STREAM));
+    }
+  }
+
+  @Test
+  public void testAddressOutOfBoundsExceptionWhenCopyingAsync() {
+    try (HostMemoryBuffer from = HostMemoryBuffer.allocate(16);
+         HostMemoryBuffer to = HostMemoryBuffer.allocate(16)) {
+      assertThrows(AssertionError.class, () -> to.copyFromMemoryBufferAsync(-1, from, 0, 16, Cuda.DEFAULT_STREAM));
+      assertThrows(AssertionError.class, () -> to.copyFromMemoryBufferAsync(16, from, 0, 16, Cuda.DEFAULT_STREAM));
+      assertThrows(AssertionError.class, () -> to.copyFromMemoryBufferAsync(0, from, -1, 16, Cuda.DEFAULT_STREAM));
+      assertThrows(AssertionError.class, () -> to.copyFromMemoryBufferAsync(0, from, 16, 16, Cuda.DEFAULT_STREAM));
+      assertThrows(AssertionError.class, () -> to.copyFromMemoryBufferAsync(0, from, 0, -1, Cuda.DEFAULT_STREAM));
+      assertThrows(AssertionError.class, () -> to.copyFromMemoryBufferAsync(0, from, 0, 17, Cuda.DEFAULT_STREAM));
+      assertThrows(AssertionError.class, () -> to.copyFromMemoryBufferAsync(1, from, 0, 16, Cuda.DEFAULT_STREAM));
+      assertThrows(AssertionError.class, () -> to.copyFromMemoryBufferAsync(0, from, 1, 16, Cuda.DEFAULT_STREAM));
+    }
+  }
+
+  @Test
+  public void testCopyingFromDeviceToDevice() {
+    try (HostMemoryBuffer in = HostMemoryBuffer.allocate(16);
+         DeviceMemoryBuffer from = DeviceMemoryBuffer.allocate(16);
+         DeviceMemoryBuffer to = DeviceMemoryBuffer.allocate(16);
+         HostMemoryBuffer out = HostMemoryBuffer.allocate(16)) {
+      in.setBytes(0, BYTES, 0, 16);
+      from.copyFromHostBuffer(in);
+      to.copyFromMemoryBuffer(0, from, 0, 16, Cuda.DEFAULT_STREAM);
+      to.copyFromMemoryBuffer(1, from, 2, 3, Cuda.DEFAULT_STREAM);
+      out.copyFromDeviceBuffer(to);
+      verifyOutput(out);
+    }
+  }
+
+  @Test
+  public void testCopyingFromDeviceToDeviceAsync() {
+    try (HostMemoryBuffer in = HostMemoryBuffer.allocate(16);
+         DeviceMemoryBuffer from = DeviceMemoryBuffer.allocate(16);
+         DeviceMemoryBuffer to = DeviceMemoryBuffer.allocate(16);
+         HostMemoryBuffer out = HostMemoryBuffer.allocate(16)) {
+      in.setBytes(0, BYTES, 0, 16);
+      from.copyFromHostBuffer(in);
+      to.copyFromMemoryBufferAsync(0, from, 0, 16, Cuda.DEFAULT_STREAM);
+      to.copyFromMemoryBufferAsync(1, from, 2, 3, Cuda.DEFAULT_STREAM);
+      out.copyFromDeviceBufferAsync(to, Cuda.DEFAULT_STREAM);
+      Cuda.DEFAULT_STREAM.sync();
+      verifyOutput(out);
+    }
+  }
+
+  @Test
+  public void testCopyingFromHostToHost() {
+    try (HostMemoryBuffer from = HostMemoryBuffer.allocate(16);
+         HostMemoryBuffer to = HostMemoryBuffer.allocate(16)) {
+      from.setBytes(0, BYTES, 0, 16);
+      to.setBytes(0, BYTES, 0, 16);
+      to.copyFromMemoryBuffer(1, from, 2, 3, Cuda.DEFAULT_STREAM);
+      verifyOutput(to);
+    }
+  }
+
+  @Test
+  public void testCopyingFromHostToHostAsync() {
+    try (HostMemoryBuffer from = HostMemoryBuffer.allocate(16);
+         HostMemoryBuffer to = HostMemoryBuffer.allocate(16)) {
+      from.setBytes(0, BYTES, 0, 16);
+      to.setBytes(0, BYTES, 0, 16);
+      to.copyFromMemoryBufferAsync(1, from, 2, 3, Cuda.DEFAULT_STREAM);
+      verifyOutput(to);
+    }
+  }
+
+  @Test
+  public void testCopyingFromHostToDevice() {
+    try (HostMemoryBuffer from = HostMemoryBuffer.allocate(16);
+         DeviceMemoryBuffer to = DeviceMemoryBuffer.allocate(16);
+         HostMemoryBuffer out = HostMemoryBuffer.allocate(16)) {
+      from.setBytes(0, BYTES, 0, 16);
+      to.copyFromMemoryBuffer(0, from, 0, 16, Cuda.DEFAULT_STREAM);
+      to.copyFromMemoryBuffer(1, from, 2, 3, Cuda.DEFAULT_STREAM);
+      out.copyFromDeviceBuffer(to);
+      verifyOutput(out);
+    }
+  }
+
+  @Test
+  public void testCopyingFromHostToDeviceAsync() {
+    try (HostMemoryBuffer from = HostMemoryBuffer.allocate(16);
+         DeviceMemoryBuffer to = DeviceMemoryBuffer.allocate(16);
+         HostMemoryBuffer out = HostMemoryBuffer.allocate(16)) {
+      from.setBytes(0, BYTES, 0, 16);
+      to.copyFromMemoryBufferAsync(0, from, 0, 16, Cuda.DEFAULT_STREAM);
+      to.copyFromMemoryBufferAsync(1, from, 2, 3, Cuda.DEFAULT_STREAM);
+      out.copyFromDeviceBufferAsync(to, Cuda.DEFAULT_STREAM);
+      Cuda.DEFAULT_STREAM.sync();
+      verifyOutput(out);
+    }
+  }
+
+  @Test
+  public void testCopyingFromDeviceToHost() {
+    try (HostMemoryBuffer in = HostMemoryBuffer.allocate(16);
+         DeviceMemoryBuffer from = DeviceMemoryBuffer.allocate(16);
+         HostMemoryBuffer to = HostMemoryBuffer.allocate(16)) {
+      in.setBytes(0, BYTES, 0, 16);
+      from.copyFromHostBuffer(in);
+      to.setBytes(0, BYTES, 0, 16);
+      to.copyFromMemoryBuffer(1, from, 2, 3, Cuda.DEFAULT_STREAM);
+      verifyOutput(to);
+    }
+  }
+
+  @Test
+  public void testCopyingFromDeviceToHostAsync() {
+    try (HostMemoryBuffer in = HostMemoryBuffer.allocate(16);
+         DeviceMemoryBuffer from = DeviceMemoryBuffer.allocate(16);
+         HostMemoryBuffer to = HostMemoryBuffer.allocate(16)) {
+      in.setBytes(0, BYTES, 0, 16);
+      from.copyFromHostBuffer(in);
+      to.setBytes(0, BYTES, 0, 16);
+      to.copyFromMemoryBufferAsync(1, from, 2, 3, Cuda.DEFAULT_STREAM);
+      Cuda.DEFAULT_STREAM.sync();
+      verifyOutput(to);
+    }
+  }
+
+  private void verifyOutput(HostMemoryBuffer out) {
+    byte[] bytes = new byte[16];
+    out.getBytes(bytes, 0, 0, 16);
+    assertArrayEquals(EXPECTED, bytes);
+  }
+}
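Note on EXPECTED in MemoryBufferTest: every copy test starts from the 16 bytes in BYTES and then copies 3 bytes from source offset 2 to destination offset 1, so positions 1 through 3 end up holding 2, 3, 4 while the rest is unchanged. The following is a minimal host-only sketch of that arithmetic in Python. It uses plain byte arrays as stand-ins for the cudf buffers; the helper name copy_range is hypothetical and is not part of cudf, and its bounds checks are only an assumption about what the out-of-bounds tests are meant to exercise.

```python
def copy_range(dst, dst_offset, src, src_offset, length):
    """Copy `length` bytes from src[src_offset:] into dst[dst_offset:] in place."""
    # Assumed bounds checks, analogous to the assertions the Java tests expect.
    assert dst_offset >= 0 and src_offset >= 0 and length >= 0
    assert dst_offset + length <= len(dst)
    assert src_offset + length <= len(src)
    dst[dst_offset:dst_offset + length] = src[src_offset:src_offset + length]
    return dst


BYTES = bytes(range(16))

# Same arguments as the tests: destination offset 1, source offset 2, length 3.
result = copy_range(bytearray(BYTES), 1, BYTES, 2, 3)
print(list(result))
```

With these arguments the printed list matches the EXPECTED array declared at the top of the test class.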
diff --git a/java/src/test/java/ai/rapids/cudf/TableTest.java b/java/src/test/java/ai/rapids/cudf/TableTest.java
index 9c67966c16c..8b7ece5d60b 100644
--- a/java/src/test/java/ai/rapids/cudf/TableTest.java
+++ b/java/src/test/java/ai/rapids/cudf/TableTest.java
@@ -5067,7 +5067,7 @@ private Table[] buildExplodeTestTableWithPrimitiveTypes(boolean pos, boolean out
         .build()) {
       Table.TestBuilder expectedBuilder = new Table.TestBuilder();
       if (pos) {
-        Integer[] posData = outer ? new Integer[]{0, 1, 2, 0, 1, 0, 0, 0} : new Integer[]{0, 1, 2, 0, 1, 0};
+        Integer[] posData = outer ? new Integer[]{0, 1, 2, 0, 1, 0, null, null} : new Integer[]{0, 1, 2, 0, 1, 0};
         expectedBuilder.column(posData);
       }
       List<Object[]> expectedData = new ArrayList<Object[]>(){{
@@ -5109,10 +5109,11 @@ private Table[] buildExplodeTestTableWithNestedTypes(boolean pos, boolean outer)
         .build()) {
       Table.TestBuilder expectedBuilder = new Table.TestBuilder();
       if (pos) {
-        if (!outer)
+        if (outer) {
+          expectedBuilder.column(0, 1, 2, 0, 1, 0, null, null);
+        } else {
           expectedBuilder.column(0, 1, 2, 0, 1, 0, 0);
-        else
-          expectedBuilder.column(0, 1, 2, 0, 1, 0, 0, 0);
+        }
       }
       List<Object[]> expectedData = new ArrayList<Object[]>(){{
         if (!outer) {
diff --git a/python/cudf/cudf/__init__.py b/python/cudf/cudf/__init__.py
index 2d9438b515f..c8a4894f4be 100644
--- a/python/cudf/cudf/__init__.py
+++ b/python/cudf/cudf/__init__.py
@@ -18,6 +18,8 @@
 from cudf.core import (
     NA,
     CategoricalIndex,
+    interval_range,
+    IntervalIndex,
     DataFrame,
     DatetimeIndex,
     Float32Index,
@@ -40,7 +42,12 @@
     merge,
 )
 from cudf.core.algorithms import factorize
-from cudf.core.dtypes import CategoricalDtype, Decimal64Dtype
+from cudf.core.dtypes import (
+    CategoricalDtype,
+    Decimal64Dtype,
+    ListDtype,
+    StructDtype,
+)
 from cudf.core.groupby import Grouper
 from cudf.core.ops import (
     add,
diff --git a/python/cudf/cudf/_lib/copying.pyx b/python/cudf/cudf/_lib/copying.pyx
index 4c72ba2e055..548e16155dd 100644
--- a/python/cudf/cudf/_lib/copying.pyx
+++ b/python/cudf/cudf/_lib/copying.pyx
@@ -1,4 +1,4 @@
-# Copyright (c) 2020, NVIDIA CORPORATION.
+# Copyright (c) 2020-2021, NVIDIA CORPORATION.
 
 import pandas as pd
 
@@ -564,11 +564,11 @@ def copy_if_else(object lhs, object rhs, Column boolean_mask):
             return _copy_if_else_column_column(lhs, rhs, boolean_mask)
         else:
             return _copy_if_else_column_scalar(
-                lhs, as_device_scalar(rhs, lhs.dtype), boolean_mask)
+                lhs, as_device_scalar(rhs), boolean_mask)
     else:
         if isinstance(rhs, Column):
             return _copy_if_else_scalar_column(
-                as_device_scalar(lhs, rhs.dtype), rhs, boolean_mask)
+                as_device_scalar(lhs), rhs, boolean_mask)
         else:
             if lhs is None and rhs is None:
                 return lhs
@@ -685,7 +685,9 @@ def get_element(Column input_column, size_type index):
             cpp_copying.get_element(col_view, index)
         )
 
-    return DeviceScalar.from_unique_ptr(move(c_output))
+    return DeviceScalar.from_unique_ptr(
+        move(c_output), dtype=input_column.dtype
+    )
 
 
 def sample(Table input, size_type n,
diff --git a/python/cudf/cudf/_lib/cpp/lists/drop_list_duplicates.pxd b/python/cudf/cudf/_lib/cpp/lists/drop_list_duplicates.pxd
new file mode 100644
index 00000000000..40b1836f932
--- /dev/null
+++ b/python/cudf/cudf/_lib/cpp/lists/drop_list_duplicates.pxd
@@ -0,0 +1,15 @@
+# Copyright (c) 2021, NVIDIA CORPORATION.
+
+from libcpp.memory cimport unique_ptr
+
+from cudf._lib.cpp.lists.lists_column_view cimport lists_column_view
+from cudf._lib.cpp.column.column cimport column
+from cudf._lib.cpp.types cimport null_equality, nan_equality
+
+cdef extern from "cudf/lists/drop_list_duplicates.hpp" \
+        namespace "cudf::lists" nogil:
+    cdef unique_ptr[column] drop_list_duplicates(
+        const lists_column_view lists_column,
+        null_equality nulls_equal,
+        nan_equality nans_equal
+    ) except +
diff --git a/python/cudf/cudf/_lib/cpp/scalar/scalar.pxd b/python/cudf/cudf/_lib/cpp/scalar/scalar.pxd
index 3eb11c2bfd0..fec1c6382e6 100644
--- a/python/cudf/cudf/_lib/cpp/scalar/scalar.pxd
+++ b/python/cudf/cudf/_lib/cpp/scalar/scalar.pxd
@@ -7,6 +7,7 @@ from libcpp cimport bool
 from libcpp.string cimport string
 
 from cudf._lib.cpp.types cimport data_type
+from cudf._lib.cpp.wrappers.decimals cimport scale_type
 
 cdef extern from "cudf/scalar/scalar.hpp" namespace "cudf" nogil:
     cdef cppclass scalar:
@@ -51,3 +52,11 @@ cdef extern from "cudf/scalar/scalar.hpp" namespace "cudf" nogil:
         string_scalar(string st, bool is_valid) except +
         string_scalar(string_scalar other) except +
         string to_string() except +
+
+    cdef cppclass fixed_point_scalar[T](scalar):
+        fixed_point_scalar() except +
+        fixed_point_scalar(int64_t value,
+                           scale_type scale,
+                           bool is_valid) except +
+        int64_t value() except +
+        # TODO: Figure out how to add an int32 overload of value()
diff --git a/python/cudf/cudf/_lib/cpp/types.pxd b/python/cudf/cudf/_lib/cpp/types.pxd
index bd1108b2cdf..1f2094b3958 100644
--- a/python/cudf/cudf/_lib/cpp/types.pxd
+++ b/python/cudf/cudf/_lib/cpp/types.pxd
@@ -46,6 +46,10 @@ cdef extern from "cudf/types.hpp" namespace "cudf" nogil:
         EQUAL "cudf::null_equality::EQUAL"
         UNEQUAL "cudf::null_equality::UNEQUAL"
 
+    ctypedef enum nan_equality "cudf::nan_equality":
+        ALL_EQUAL "cudf::nan_equality::ALL_EQUAL"
+        UNEQUAL "cudf::nan_equality::UNEQUAL"
+
     ctypedef enum type_id "cudf::type_id":
         EMPTY                  "cudf::type_id::EMPTY"
         INT8                   "cudf::type_id::INT8"
diff --git a/python/cudf/cudf/_lib/cpp/wrappers/decimals.pxd b/python/cudf/cudf/_lib/cpp/wrappers/decimals.pxd
new file mode 100644
index 00000000000..9de23fb2595
--- /dev/null
+++ b/python/cudf/cudf/_lib/cpp/wrappers/decimals.pxd
@@ -0,0 +1,9 @@
+# Copyright (c) 2021, NVIDIA CORPORATION.
+from libc.stdint cimport int64_t, int32_t
+
+cdef extern from "cudf/fixed_point/fixed_point.hpp" namespace "numeric" nogil:
+    # Cython type stub to help resolve to numeric::decimal64
+    ctypedef int64_t decimal64
+
+    cdef cppclass scale_type:
+        scale_type(int32_t)
diff --git a/python/cudf/cudf/_lib/groupby.pyx b/python/cudf/cudf/_lib/groupby.pyx
index 713a2274a77..4584841dd33 100644
--- a/python/cudf/cudf/_lib/groupby.pyx
+++ b/python/cudf/cudf/_lib/groupby.pyx
@@ -3,6 +3,7 @@
 from collections import defaultdict
 
 import numpy as np
+import rmm
 
 from libcpp.pair cimport pair
 from libcpp.memory cimport unique_ptr
@@ -20,25 +21,9 @@ cimport cudf._lib.cpp.groupby as libcudf_groupby
 cimport cudf._lib.cpp.aggregation as libcudf_aggregation
 
 
-_GROUPBY_AGGS = {
-    "count",
-    "size",
-    "sum",
-    "idxmin",
-    "idxmax",
-    "min",
-    "max",
-    "mean",
-    "var",
-    "std",
-    "quantile",
-    "median",
-    "nunique",
-    "nth",
-    "collect",
-    "unique",
-}
-
+# The sets below define the possible aggregations that can be performed on
+# different dtypes. The uppercased versions of these strings correspond to
+# elements of the AggregationKind enum.
 _CATEGORICAL_AGGS = {
     "count",
     "size",
@@ -61,6 +46,24 @@ _LIST_AGGS = {
     "collect",
 }
 
+_STRUCT_AGGS = {
+}
+
+_INTERVAL_AGGS = {
+}
+
+_DECIMAL_AGGS = {
+    "count",
+    "sum",
+    "argmin",
+    "argmax",
+    "min",
+    "max",
+    "nunique",
+    "nth",
+    "collect"
+}
+
 
 cdef class GroupBy:
     cdef unique_ptr[libcudf_groupby.groupby] c_obj
@@ -197,7 +200,10 @@ def _drop_unsupported_aggs(Table values, aggs):
     from cudf.utils.dtypes import (
         is_categorical_dtype,
         is_string_dtype,
-        is_list_dtype
+        is_list_dtype,
+        is_interval_dtype,
+        is_struct_dtype,
+        is_decimal_dtype,
     )
     result = aggs.copy()
 
@@ -220,6 +226,29 @@ def _drop_unsupported_aggs(Table values, aggs):
             for i, agg_name in enumerate(aggs[col_name]):
                 if Aggregation(agg_name).kind not in _CATEGORICAL_AGGS:
                     del result[col_name][i]
+        elif (
+                is_struct_dtype(values._data[col_name].dtype)
+        ):
+            for i, agg_name in reversed(list(enumerate(aggs[col_name]))):
+                if Aggregation(agg_name).kind not in _STRUCT_AGGS:
+                    del result[col_name][i]
+        elif (
+                is_interval_dtype(values._data[col_name].dtype)
+        ):
+            for i, agg_name in reversed(list(enumerate(aggs[col_name]))):
+                if Aggregation(agg_name).kind not in _INTERVAL_AGGS:
+                    del result[col_name][i]
+        elif (
+                is_decimal_dtype(values._data[col_name].dtype)
+        ):
+            if rmm._cuda.gpu.runtimeGetVersion() < 11000:
+                raise RuntimeError(
+                    "Decimal aggregations are only supported on CUDA >= 11 "
+                    "due to an nvcc compiler bug."
+                )
+            for i, agg_name in reversed(list(enumerate(aggs[col_name]))):
+                if Aggregation(agg_name).kind not in _DECIMAL_AGGS:
+                    del result[col_name][i]
 
     if all(len(v) == 0 for v in result.values()):
         raise DataError("No numeric types to aggregate")
diff --git a/python/cudf/cudf/_lib/join.pyx b/python/cudf/cudf/_lib/join.pyx
index 69b8004cede..193c2ca9d67 100644
--- a/python/cudf/cudf/_lib/join.pyx
+++ b/python/cudf/cudf/_lib/join.pyx
@@ -2,7 +2,6 @@
 
 import cudf
 
-from collections import OrderedDict
 from itertools import chain
 
 from libcpp.memory cimport unique_ptr, make_unique
diff --git a/python/cudf/cudf/_lib/lists.pyx b/python/cudf/cudf/_lib/lists.pyx
index 7f745e58c67..9bc0550bdf0 100644
--- a/python/cudf/cudf/_lib/lists.pyx
+++ b/python/cudf/cudf/_lib/lists.pyx
@@ -10,6 +10,9 @@ from cudf._lib.cpp.lists.count_elements cimport (
 from cudf._lib.cpp.lists.explode cimport (
     explode_outer as cpp_explode_outer
 )
+from cudf._lib.cpp.lists.drop_list_duplicates cimport (
+    drop_list_duplicates as cpp_drop_list_duplicates
+)
 from cudf._lib.cpp.lists.sorting cimport (
     sort_lists as cpp_sort_lists
 )
@@ -22,7 +25,13 @@ from cudf._lib.cpp.scalar.scalar cimport scalar
 
 from cudf._lib.cpp.table.table cimport table
 from cudf._lib.cpp.table.table_view cimport table_view
-from cudf._lib.cpp.types cimport size_type, order, null_order
+from cudf._lib.cpp.types cimport (
+    size_type,
+    null_equality,
+    order,
+    null_order,
+    nan_equality
+)
 
 from cudf._lib.column cimport Column
 from cudf._lib.table cimport Table
@@ -71,6 +80,34 @@ def explode_outer(Table tbl, int explode_column_idx, bool ignore_index=False):
     )
 
 
+def drop_list_duplicates(Column col, bool nulls_equal, bool nans_all_equal):
+    """
+    If nans_all_equal is True, libcudf treats any two elements from
+    {+nan, -nan} as equal; otherwise it treats them as unequal.
+    If nulls_equal is True, libcudf treats any two nulls as equal;
+    otherwise it treats them as unequal.
+    """
+    cdef shared_ptr[lists_column_view] list_view = (
+        make_shared[lists_column_view](col.view())
+    )
+    cdef null_equality c_nulls_equal = (
+        null_equality.EQUAL if nulls_equal else null_equality.UNEQUAL
+    )
+    cdef nan_equality c_nans_equal = (
+        nan_equality.ALL_EQUAL if nans_all_equal else nan_equality.UNEQUAL
+    )
+
+    cdef unique_ptr[column] c_result
+
+    with nogil:
+        c_result = move(
+            cpp_drop_list_duplicates(list_view.get()[0],
+                                     c_nulls_equal,
+                                     c_nans_equal)
+        )
+    return Column.from_unique_ptr(move(c_result))
+
+
 def sort_lists(Column col, bool ascending, str na_position):
     cdef shared_ptr[lists_column_view] list_view = (
         make_shared[lists_column_view](col.view())
@@ -108,7 +145,10 @@ def extract_element(Column col, size_type index):
     return result
 
 
-def contains_scalar(Column col, DeviceScalar search_key):
+def contains_scalar(Column col, object py_search_key):
+
+    cdef DeviceScalar search_key = py_search_key.device_value
+
     cdef shared_ptr[lists_column_view] list_view = (
         make_shared[lists_column_view](col.view())
     )
@@ -121,6 +161,5 @@ def contains_scalar(Column col, DeviceScalar search_key):
             list_view.get()[0],
             search_key_value[0],
         ))
-
     result = Column.from_unique_ptr(move(c_result))
     return result
diff --git a/python/cudf/cudf/_lib/parquet.pyx b/python/cudf/cudf/_lib/parquet.pyx
index d8b4fbbbe4b..4ea2adec23a 100644
--- a/python/cudf/cudf/_lib/parquet.pyx
+++ b/python/cudf/cudf/_lib/parquet.pyx
@@ -312,6 +312,9 @@ cpdef write_parquet(
         num_index_cols_meta = 0
 
     for i, name in enumerate(table._column_names, num_index_cols_meta):
+        if not isinstance(name, str):
+            raise ValueError("parquet must have string column names")
+
         tbl_meta.get().column_metadata[i].set_name(name.encode())
         _set_col_metadata(
             table[name]._column, tbl_meta.get().column_metadata[i]
diff --git a/python/cudf/cudf/_lib/reduce.pyx b/python/cudf/cudf/_lib/reduce.pyx
index 2185cb089a7..62013ea88ae 100644
--- a/python/cudf/cudf/_lib/reduce.pyx
+++ b/python/cudf/cudf/_lib/reduce.pyx
@@ -1,6 +1,8 @@
-# Copyright (c) 2020, NVIDIA CORPORATION.
+# Copyright (c) 2020-2021, NVIDIA CORPORATION.
 
 import cudf
+from cudf.utils.dtypes import is_decimal_dtype
+from cudf.core.dtypes import Decimal64Dtype
 from cudf._lib.cpp.reduce cimport cpp_reduce, cpp_scan, scan_type, cpp_minmax
 from cudf._lib.cpp.scalar.scalar cimport scalar
 from cudf._lib.cpp.types cimport data_type, type_id
@@ -9,12 +11,14 @@ from cudf._lib.cpp.column.column cimport column
 from cudf._lib.scalar cimport DeviceScalar
 from cudf._lib.column cimport Column
 from cudf._lib.types import np_to_cudf_types
-from cudf._lib.types cimport underlying_type_t_type_id
+from cudf._lib.types cimport underlying_type_t_type_id, dtype_to_data_type
 from cudf._lib.aggregation cimport make_aggregation, aggregation
 from libcpp.memory cimport unique_ptr
 from libcpp.utility cimport move, pair
 import numpy as np
 
+cimport cudf._lib.cpp.types as libcudf_types
+
 
 def reduce(reduction_op, Column incol, dtype=None, **kwargs):
     """
@@ -32,7 +36,10 @@ def reduce(reduction_op, Column incol, dtype=None, **kwargs):
     """
 
     col_dtype = incol.dtype
-    if reduction_op in ['sum', 'sum_of_squares', 'product']:
+    if (
+        reduction_op in ['sum', 'sum_of_squares', 'product']
+        and not is_decimal_dtype(col_dtype)
+    ):
         col_dtype = np.find_common_type([col_dtype], [np.uint64])
     col_dtype = col_dtype if dtype is None else dtype
 
@@ -41,15 +48,8 @@ def reduce(reduction_op, Column incol, dtype=None, **kwargs):
     cdef unique_ptr[aggregation] c_agg = move(make_aggregation(
         reduction_op, kwargs
     ))
-    cdef type_id tid = (
-        <type_id> (
-            <underlying_type_t_type_id> (
-                np_to_cudf_types[np.dtype(col_dtype)]
-            )
-        )
-    )
 
-    cdef data_type c_out_dtype = data_type(tid)
+    cdef data_type c_out_dtype = dtype_to_data_type(col_dtype)
 
     # check empty case
     if len(incol) <= incol.null_count:
@@ -69,7 +69,14 @@ def reduce(reduction_op, Column incol, dtype=None, **kwargs):
             c_out_dtype
         ))
 
-    py_result = DeviceScalar.from_unique_ptr(move(c_result))
+    if c_result.get()[0].type().id() == libcudf_types.type_id.DECIMAL64:
+        scale = -c_result.get()[0].type().scale()
+        precision = _reduce_precision(col_dtype, reduction_op, len(incol))
+        py_result = DeviceScalar.from_unique_ptr(
+            move(c_result), dtype=Decimal64Dtype(precision, scale)
+        )
+    else:
+        py_result = DeviceScalar.from_unique_ptr(move(c_result))
     return py_result.value
 
 
@@ -132,3 +139,24 @@ def minmax(Column incol):
     py_result_max = DeviceScalar.from_unique_ptr(move(c_result.second))
 
     return cudf.Scalar(py_result_min), cudf.Scalar(py_result_max)
+
+
+def _reduce_precision(dtype, op, nrows):
+    """
+    Returns the result precision when performing the reduce
+    operation `op` for the given dtype and column size.
+
+    See: https://docs.microsoft.com/en-us/sql/t-sql/data-types/precision-scale-and-length-transact-sql
+    """  # noqa: E501
+    p = dtype.precision
+    if op in ("min", "max"):
+        new_p = p
+    elif op == "sum":
+        new_p = p + nrows - 1
+    elif op == "product":
+        new_p = p * nrows + nrows - 1
+    elif op == "sum_of_squares":
+        new_p = 2 * p + nrows
+    else:
+        raise NotImplementedError()
+    return max(min(new_p, Decimal64Dtype.MAX_PRECISION), 0)
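Worked example for the precision rules in _reduce_precision above: a standalone Python sketch of the same arithmetic, so the clamping behaviour can be checked without a GPU. The constant MAX_PRECISION = 18 is an assumption standing in for Decimal64Dtype.MAX_PRECISION, and the function takes the input precision directly instead of a dtype object; neither name is cudf API.

```python
# Assumed stand-in for Decimal64Dtype.MAX_PRECISION (not imported from cudf).
MAX_PRECISION = 18


def reduce_precision(p, op, nrows):
    """Result precision for reducing `nrows` decimals of precision `p` with `op`."""
    if op in ("min", "max"):
        new_p = p  # the result is one of the inputs
    elif op == "sum":
        new_p = p + nrows - 1
    elif op == "product":
        new_p = p * nrows + nrows - 1
    elif op == "sum_of_squares":
        new_p = 2 * p + nrows
    else:
        raise NotImplementedError(op)
    # Clamp into the representable range [0, MAX_PRECISION].
    return max(min(new_p, MAX_PRECISION), 0)


print(reduce_precision(5, "sum", 4))             # 5 + 4 - 1
print(reduce_precision(2, "product", 3))         # 2 * 3 + 3 - 1
print(reduce_precision(3, "sum_of_squares", 4))  # 2 * 3 + 4
print(reduce_precision(5, "sum", 100))           # clamped to MAX_PRECISION
```

Because the bound grows with the row count, any column of more than a few rows saturates at the maximum precision under this scheme.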
diff --git a/python/cudf/cudf/_lib/replace.pyx b/python/cudf/cudf/_lib/replace.pyx
index 2732142dd15..cdedd3ac022 100644
--- a/python/cudf/cudf/_lib/replace.pyx
+++ b/python/cudf/cudf/_lib/replace.pyx
@@ -1,4 +1,4 @@
-# Copyright (c) 2020, NVIDIA CORPORATION.
+# Copyright (c) 2020-2021, NVIDIA CORPORATION.
 
 from libcpp.memory cimport unique_ptr
 from libcpp.utility cimport move
diff --git a/python/cudf/cudf/_lib/scalar.pxd b/python/cudf/cudf/_lib/scalar.pxd
index d44bac0e435..2fafe0f2c67 100644
--- a/python/cudf/cudf/_lib/scalar.pxd
+++ b/python/cudf/cudf/_lib/scalar.pxd
@@ -8,10 +8,11 @@ from cudf._lib.cpp.scalar.scalar cimport scalar
 
 cdef class DeviceScalar:
     cdef unique_ptr[scalar] c_value
+    cdef object _dtype
 
     cdef const scalar* get_raw_ptr(self) except *
 
     @staticmethod
-    cdef DeviceScalar from_unique_ptr(unique_ptr[scalar] ptr)
+    cdef DeviceScalar from_unique_ptr(unique_ptr[scalar] ptr, dtype=*)
 
     cpdef bool is_valid(DeviceScalar s)
diff --git a/python/cudf/cudf/_lib/scalar.pyx b/python/cudf/cudf/_lib/scalar.pyx
index a5945bc72f0..b31f0675422 100644
--- a/python/cudf/cudf/_lib/scalar.pyx
+++ b/python/cudf/cudf/_lib/scalar.pyx
@@ -1,5 +1,5 @@
 # Copyright (c) 2020, NVIDIA CORPORATION.
-
+import decimal
 import numpy as np
 import pandas as pd
 
@@ -34,15 +34,19 @@ from cudf._lib.cpp.wrappers.durations cimport(
     duration_us,
     duration_ns
 )
+from cudf._lib.cpp.wrappers.decimals cimport decimal64, scale_type
 from cudf._lib.cpp.scalar.scalar cimport (
     scalar,
     numeric_scalar,
     timestamp_scalar,
     duration_scalar,
-    string_scalar
+    string_scalar,
+    fixed_point_scalar
 )
+from cudf.utils.dtypes import _decimal_to_int64
 cimport cudf._lib.cpp.types as libcudf_types
 
+
 cdef class DeviceScalar:
 
     def __init__(self, value, dtype):
@@ -59,14 +63,17 @@ cdef class DeviceScalar:
         dtype : dtype
             A NumPy dtype.
         """
-
-        self._set_value(value, dtype)
+        self._dtype = dtype if dtype.kind != 'U' else np.dtype('object')
+        self._set_value(value, self._dtype)
 
     def _set_value(self, value, dtype):
         # IMPORTANT: this should only ever be called from __init__
         valid = not _is_null_host_scalar(value)
 
-        if pd.api.types.is_string_dtype(dtype):
+        if isinstance(dtype, cudf.Decimal64Dtype):
+            _set_decimal64_from_scalar(
+                self.c_value, value, dtype, valid)
+        elif pd.api.types.is_string_dtype(dtype):
             _set_string_from_np_string(self.c_value, value, valid)
         elif pd.api.types.is_numeric_dtype(dtype):
             _set_numeric_from_np_scalar(self.c_value,
@@ -88,7 +95,9 @@ cdef class DeviceScalar:
             )
 
     def _to_host_scalar(self):
-        if pd.api.types.is_string_dtype(self.dtype):
+        if isinstance(self.dtype, cudf.Decimal64Dtype):
+            result = _get_py_decimal_from_fixed_point(self.c_value)
+        elif pd.api.types.is_string_dtype(self.dtype):
             result = _get_py_string_from_string(self.c_value)
         elif pd.api.types.is_numeric_dtype(self.dtype):
             result = _get_np_scalar_from_numeric(self.c_value)
@@ -108,8 +117,7 @@ cdef class DeviceScalar:
         The NumPy dtype corresponding to the data type of the underlying
         device scalar.
         """
-        cdef libcudf_types.data_type cdtype = self.get_raw_ptr()[0].type()
-        return cudf_to_np_types[<underlying_type_t_type_id>(cdtype.id())]
+        return self._dtype
 
     @property
     def value(self):
@@ -137,13 +145,27 @@ cdef class DeviceScalar:
             return f"{self.__class__.__name__}({self.value.__repr__()})"
 
     @staticmethod
-    cdef DeviceScalar from_unique_ptr(unique_ptr[scalar] ptr):
+    cdef DeviceScalar from_unique_ptr(unique_ptr[scalar] ptr, dtype=None):
         """
         Construct a Scalar object from a unique_ptr<cudf::scalar>.
         """
         cdef DeviceScalar s = DeviceScalar.__new__(DeviceScalar)
+        cdef libcudf_types.data_type cdtype
+
         s.c_value = move(ptr)
+        cdtype = s.get_raw_ptr()[0].type()
 
+        if cdtype.id() == libcudf_types.DECIMAL64 and dtype is None:
+            raise TypeError(
+                "Must pass a dtype when constructing from a fixed-point scalar"
+            )
+        else:
+            if dtype is not None:
+                s._dtype = dtype
+            else:
+                s._dtype = cudf_to_np_types[
+                    <underlying_type_t_type_id>(cdtype.id())
+                ]
         return s
 
 
@@ -235,6 +257,17 @@ cdef _set_timedelta64_from_np_scalar(unique_ptr[scalar]& s,
     else:
         raise ValueError(f"dtype not supported: {dtype}")
 
+cdef _set_decimal64_from_scalar(unique_ptr[scalar]& s,
+                                object value,
+                                object dtype,
+                                bool valid=True):
+    value = _decimal_to_int64(value) if valid else 0
+    s.reset(
+        new fixed_point_scalar[decimal64](
+            <int64_t>np.int64(value), scale_type(-dtype.scale), valid
+        )
+    )
+
 cdef _get_py_string_from_string(unique_ptr[scalar]& s):
     if not s.get()[0].is_valid():
         return cudf.NA
@@ -274,6 +307,20 @@ cdef _get_np_scalar_from_numeric(unique_ptr[scalar]& s):
         raise ValueError("Could not convert cudf::scalar to numpy scalar")
 
 
+cdef _get_py_decimal_from_fixed_point(unique_ptr[scalar]& s):
+    cdef scalar* s_ptr = s.get()
+    if not s_ptr[0].is_valid():
+        return cudf.NA
+
+    cdef libcudf_types.data_type cdtype = s_ptr[0].type()
+
+    if cdtype.id() == libcudf_types.DECIMAL64:
+        rep_val = int((<fixed_point_scalar[decimal64]*>s_ptr)[0].value())
+        scale = int((<fixed_point_scalar[decimal64]*>s_ptr)[0].type().scale())
+        return decimal.Decimal(rep_val).scaleb(scale)
+    else:
+        raise ValueError("Could not convert cudf::scalar to decimal.Decimal")
+
 cdef _get_np_scalar_from_timestamp64(unique_ptr[scalar]& s):
 
     cdef scalar* s_ptr = s.get()
diff --git a/python/cudf/cudf/core/__init__.py b/python/cudf/cudf/core/__init__.py
index 91a369c31f8..10abdaf0061 100644
--- a/python/cudf/cudf/core/__init__.py
+++ b/python/cudf/cudf/core/__init__.py
@@ -1,10 +1,12 @@
-# Copyright (c) 2018-2020, NVIDIA CORPORATION.
+# Copyright (c) 2018-2021, NVIDIA CORPORATION.
 
-from cudf.core import buffer, column, column_accessor, common
+from cudf.core import _internals, buffer, column, column_accessor, common
 from cudf.core.buffer import Buffer
 from cudf.core.dataframe import DataFrame, from_pandas, merge
 from cudf.core.index import (
     CategoricalIndex,
+    interval_range,
+    IntervalIndex,
     DatetimeIndex,
     Float32Index,
     Float64Index,
diff --git a/python/cudf/cudf/core/_internals/__init__.py b/python/cudf/cudf/core/_internals/__init__.py
new file mode 100644
index 00000000000..53d186def85
--- /dev/null
+++ b/python/cudf/cudf/core/_internals/__init__.py
@@ -0,0 +1,3 @@
+# Copyright (c) 2021, NVIDIA CORPORATION.
+
+from cudf.core._internals.where import where
diff --git a/python/cudf/cudf/core/_internals/where.py b/python/cudf/cudf/core/_internals/where.py
new file mode 100644
index 00000000000..1fdc907875e
--- /dev/null
+++ b/python/cudf/cudf/core/_internals/where.py
@@ -0,0 +1,383 @@
+# Copyright (c) 2021, NVIDIA CORPORATION.
+
+import warnings
+from typing import Any, Optional, Tuple, Union, cast
+
+import numpy as np
+import pandas as pd
+
+import cudf
+from cudf._typing import ColumnLike, ScalarLike
+from cudf.core.column import ColumnBase
+from cudf.core.dataframe import DataFrame
+from cudf.core.frame import Frame
+from cudf.core.index import Index
+from cudf.core.series import Series
+
+
+def _normalize_scalars(col: ColumnBase, other: ScalarLike) -> ScalarLike:
+    """
+    Try to normalize scalar values as per col dtype
+    """
+    if (isinstance(other, float) and not np.isnan(other)) and (
+        col.dtype.type(other) != other
+    ):
+        raise TypeError(
+            f"Cannot safely cast non-equivalent "
+            f"{type(other).__name__} to {col.dtype.name}"
+        )
+
+    return cudf.Scalar(other, dtype=col.dtype if other is None else None)
+
+
+def _check_and_cast_columns_with_other(
+    source_col: ColumnBase,
+    other: Union[ScalarLike, ColumnBase],
+    inplace: bool,
+) -> Tuple[ColumnBase, Union[ScalarLike, ColumnBase]]:
+    """
+    Returns type-casted column `source_col` & scalar `other_scalar`
+    based on `inplace` parameter.
+    """
+    if cudf.utils.dtypes.is_categorical_dtype(source_col.dtype):
+        return source_col, other
+
+    if cudf.utils.dtypes.is_scalar(other):
+        device_obj = _normalize_scalars(source_col, other)
+    else:
+        device_obj = other
+
+    if other is None:
+        return source_col, device_obj
+    elif cudf.utils.dtypes.is_mixed_with_object_dtype(device_obj, source_col):
+        raise TypeError(
+            "cudf does not support mixed types, please type-cast "
+            "the column of dataframe/series and other "
+            "to same dtypes."
+        )
+    if inplace:
+        if not cudf.utils.dtypes._can_cast(device_obj.dtype, source_col.dtype):
+            warnings.warn(
+                f"Type-casting from {device_obj.dtype} "
+                f"to {source_col.dtype}, there could be potential data loss"
+            )
+        return source_col, device_obj.astype(source_col.dtype)
+    else:
+        if (
+            cudf.utils.dtypes.is_scalar(other)
+            and cudf.utils.dtypes.is_numerical_dtype(source_col.dtype)
+            and cudf.utils.dtypes._can_cast(other, source_col.dtype)
+        ):
+            common_dtype = source_col.dtype
+            return (
+                source_col.astype(common_dtype),
+                cudf.Scalar(other, dtype=common_dtype),
+            )
+        else:
+            common_dtype = cudf.utils.dtypes.find_common_type(
+                [
+                    source_col.dtype,
+                    np.min_scalar_type(other)
+                    if cudf.utils.dtypes.is_scalar(other)
+                    else other.dtype,
+                ]
+            )
+            if cudf.utils.dtypes.is_scalar(device_obj):
+                device_obj = cudf.Scalar(other, dtype=common_dtype)
+            else:
+                device_obj = device_obj.astype(common_dtype)
+            return source_col.astype(common_dtype), device_obj
+
+
+def _normalize_columns_and_scalars_type(
+    frame: Union[Series, Index, DataFrame], other: Any, inplace: bool = False,
+) -> Tuple[
+    Union[Series, Index, DataFrame, ColumnLike], Any,
+]:
+    """
+    Try to normalize the other's dtypes as per frame.
+
+    Parameters
+    ----------
+
+    frame : DataFrame, Series or Index
+    other : DataFrame, Series, Index, array-like object
+        or scalar value
+
+        If frame is a DataFrame, other can only be a scalar,
+        an array-like with one entry per DataFrame column,
+        or a DataFrame with the same dimensions.
+
+        If frame is a Series, other can only be a scalar or
+        a series-like object with the same length as frame.
+
+    Returns
+    -------
+    A dataframe/series/list/scalar form of normalized other
+    """
+    if isinstance(frame, DataFrame) and isinstance(other, DataFrame):
+        source_df = frame.copy(deep=False)
+        other_df = other.copy(deep=False)
+        for self_col in source_df._column_names:
+            source_col, other_col = _check_and_cast_columns_with_other(
+                source_col=source_df._data[self_col],
+                other=other_df._data[self_col],
+                inplace=inplace,
+            )
+            source_df._data[self_col] = source_col
+            other_df._data[self_col] = other_col
+        return source_df, other_df
+
+    elif isinstance(
+        frame, (Series, Index)
+    ) and not cudf.utils.dtypes.is_scalar(other):
+        other = cudf.core.column.as_column(other)
+        input_col = frame._data[frame.name]
+        return _check_and_cast_columns_with_other(
+            source_col=input_col, other=other, inplace=inplace
+        )
+    else:
+        # Handles scalar or list/array like scalars
+        if isinstance(frame, (Series, Index)) and cudf.utils.dtypes.is_scalar(
+            other
+        ):
+            input_col = frame._data[frame.name]
+            return _check_and_cast_columns_with_other(
+                source_col=frame._data[frame.name],
+                other=other,
+                inplace=inplace,
+            )
+
+        elif isinstance(frame, DataFrame):
+            if cudf.utils.dtypes.is_scalar(other):
+                other = [other for i in range(len(frame._column_names))]
+
+            source_df = frame.copy(deep=False)
+            others = []
+            for col_name, other_sclr in zip(frame._column_names, other):
+
+                (
+                    source_col,
+                    other_scalar,
+                ) = _check_and_cast_columns_with_other(
+                    source_col=source_df._data[col_name],
+                    other=other_sclr,
+                    inplace=inplace,
+                )
+                source_df._data[col_name] = source_col
+                others.append(other_scalar)
+            return source_df, others
+        else:
+            raise ValueError(
+                f"Inappropriate input {type(frame)} "
+                f"and other {type(other)} combination"
+            )
+
+
+def where(
+    frame: Union[Series, Index, DataFrame],
+    cond: Any,
+    other: Any = None,
+    inplace: bool = False,
+) -> Optional[Frame]:
+    """
+    Replace values where the condition is False.
+
+    Parameters
+    ----------
+    cond : bool Series/DataFrame, array-like
+        Where cond is True, keep the original value.
+        Where False, replace with corresponding value from other.
+        Callables are not supported.
+    other: scalar, list of scalars, Series/DataFrame
+        Entries where cond is False are replaced with
+        corresponding value from other. Callables are not
+        supported. Default is None.
+
+        DataFrame expects only Scalar or array like with scalars or
+        dataframe with same dimension as frame.
+
+        Series expects only scalar or series like with same length
+    inplace : bool, default False
+        Whether to perform the operation in place on the data.
+
+    Returns
+    -------
+    Same type as caller
+
+    Examples
+    --------
+    >>> import cudf
+    >>> df = cudf.DataFrame({"A":[1, 4, 5], "B":[3, 5, 8]})
+    >>> df.where(df % 2 == 0, [-1, -1])
+       A  B
+    0 -1 -1
+    1  4 -1
+    2 -1  8
+
+    >>> ser = cudf.Series([4, 3, 2, 1, 0])
+    >>> ser.where(ser > 2, 10)
+    0     4
+    1     3
+    2    10
+    3    10
+    4    10
+    dtype: int64
+    >>> ser.where(ser > 2)
+    0       4
+    1       3
+    2    <NA>
+    3    <NA>
+    4    <NA>
+    dtype: int64
+    """
+
+    if isinstance(frame, DataFrame):
+        if hasattr(cond, "__cuda_array_interface__"):
+            cond = DataFrame(
+                cond, columns=frame._column_names, index=frame.index
+            )
+        elif (
+            hasattr(cond, "__array_interface__")
+            and cond.__array_interface__["shape"] != frame.shape
+        ):
+            raise ValueError("conditional must be same shape as self")
+        elif not isinstance(cond, DataFrame):
+            cond = frame.from_pandas(pd.DataFrame(cond))
+
+        common_cols = set(frame._column_names).intersection(
+            set(cond._column_names)
+        )
+        if len(common_cols) > 0:
+            # If `frame` and `cond` have unequal indexes,
+            # then reindex `cond`.
+            if not frame.index.equals(cond.index):
+                cond = cond.reindex(frame.index)
+        else:
+            if cond.shape != frame.shape:
+                raise ValueError(
+                    """Array conditional must be same shape as self"""
+                )
+            # Setting `frame` column names to `cond`
+            # as `cond` has no column names.
+            cond.columns = frame.columns
+
+        (source_df, others,) = _normalize_columns_and_scalars_type(
+            frame, other
+        )
+        if isinstance(other, Frame):
+            others = others._data.columns
+
+        out_df = DataFrame(index=frame.index)
+        if len(frame._columns) != len(others):
+            raise ValueError(
+                "Replacement list length or number of dataframe columns "
+                "should be equal to the number of columns of dataframe"
+            )
+        for i, column_name in enumerate(frame._column_names):
+            input_col = source_df._data[column_name]
+            other_column = others[i]
+            if column_name in cond._data:
+                if isinstance(input_col, cudf.core.column.CategoricalColumn):
+                    if cudf.utils.dtypes.is_scalar(other_column):
+                        try:
+                            other_column = input_col._encode(other_column)
+                        except ValueError:
+                            # When other is not present in categories,
+                            # fill with Null.
+                            other_column = None
+                        other_column = cudf.Scalar(
+                            other_column, dtype=input_col.codes.dtype
+                        )
+                    elif isinstance(
+                        other_column, cudf.core.column.CategoricalColumn
+                    ):
+                        other_column = other_column.codes
+                    input_col = input_col.codes
+
+                result = cudf._lib.copying.copy_if_else(
+                    input_col, other_column, cond._data[column_name]
+                )
+
+                if isinstance(
+                    frame._data[column_name],
+                    cudf.core.column.CategoricalColumn,
+                ):
+                    result = cudf.core.column.build_categorical_column(
+                        categories=frame._data[column_name].categories,
+                        codes=cudf.core.column.as_column(
+                            result.base_data, dtype=result.dtype
+                        ),
+                        mask=result.base_mask,
+                        size=result.size,
+                        offset=result.offset,
+                        ordered=frame._data[column_name].ordered,
+                    )
+            else:
+                out_mask = cudf._lib.null_mask.create_null_mask(
+                    len(input_col),
+                    state=cudf._lib.null_mask.MaskState.ALL_NULL,
+                )
+                result = input_col.set_mask(out_mask)
+            out_df[column_name] = frame[column_name].__class__(result)
+
+        return frame._mimic_inplace(out_df, inplace=inplace)
+
+    else:
+        if isinstance(other, DataFrame):
+            raise NotImplementedError(
+                "cannot align with a higher dimensional Frame"
+            )
+        input_col = frame._data[frame.name]
+        cond = cudf.core.column.as_column(cond)
+        if len(cond) != len(frame):
+            raise ValueError(
+                """Array conditional must be same shape as self"""
+            )
+
+        (input_col, other,) = _normalize_columns_and_scalars_type(
+            frame, other, inplace
+        )
+
+        if isinstance(input_col, cudf.core.column.CategoricalColumn):
+            if cudf.utils.dtypes.is_scalar(other):
+                try:
+                    other = input_col._encode(other)
+                except ValueError:
+                    # When other is not present in categories,
+                    # fill with Null.
+                    other = None
+                other = cudf.Scalar(other, dtype=input_col.codes.dtype)
+            elif isinstance(other, cudf.core.column.CategoricalColumn):
+                other = other.codes
+
+            input_col = input_col.codes
+
+        result = cudf._lib.copying.copy_if_else(input_col, other, cond)
+
+        if isinstance(
+            frame._data[frame.name], cudf.core.column.CategoricalColumn
+        ):
+            result = cudf.core.column.build_categorical_column(
+                categories=cast(
+                    cudf.core.column.CategoricalColumn,
+                    frame._data[frame.name],
+                ).categories,
+                codes=cudf.core.column.as_column(
+                    result.base_data, dtype=result.dtype
+                ),
+                mask=result.base_mask,
+                size=result.size,
+                offset=result.offset,
+                ordered=cast(
+                    cudf.core.column.CategoricalColumn,
+                    frame._data[frame.name],
+                ).ordered,
+            )
+
+        if isinstance(frame, Index):
+            result = Index(result, name=frame.name)
+        else:
+            result = frame._copy_construct(data=result)
+
+        return frame._mimic_inplace(result, inplace=inplace)
diff --git a/python/cudf/cudf/core/column/column.py b/python/cudf/cudf/core/column/column.py
index e59b395ec0f..6a1600d6461 100644
--- a/python/cudf/cudf/core/column/column.py
+++ b/python/cudf/cudf/core/column/column.py
@@ -42,6 +42,7 @@
 from cudf.core.abc import Serializable
 from cudf.core.buffer import Buffer
 from cudf.core.dtypes import CategoricalDtype
+from cudf.core.dtypes import IntervalDtype
 from cudf.utils import ioutils, utils
 from cudf.utils.dtypes import (
     NUMERIC_TYPES,
@@ -426,6 +427,8 @@ def from_arrow(cls, array: pa.Array) -> ColumnBase:
             array.type, pd.core.arrays._arrow_utils.ArrowIntervalType
         ):
             return cudf.core.column.IntervalColumn.from_arrow(array)
+        elif isinstance(array.type, pa.Decimal128Type):
+            return cudf.core.column.DecimalColumn.from_arrow(array)
 
         return libcudf.interop.from_arrow(data, data.column_names)._data[
             "None"
@@ -1044,11 +1047,7 @@ def astype(self, dtype: Dtype, **kwargs) -> ColumnBase:
                 )
             return self
         elif is_interval_dtype(self.dtype):
-            if not self.dtype == dtype:
-                raise NotImplementedError(
-                    "Casting interval columns not currently supported"
-                )
-            return self
+            return self.as_interval_column(dtype, **kwargs)
         elif is_decimal_dtype(dtype):
             return self.as_decimal_column(dtype, **kwargs)
         elif np.issubdtype(dtype, np.datetime64):
@@ -1111,6 +1110,11 @@ def as_datetime_column(
     ) -> "cudf.core.column.DatetimeColumn":
         raise NotImplementedError
 
+    def as_interval_column(
+        self, dtype: Dtype, **kwargs
+    ) -> "cudf.core.column.IntervalColumn":
+        raise NotImplementedError
+
     def as_timedelta_column(
         self, dtype: Dtype, **kwargs
     ) -> "cudf.core.column.TimeDeltaColumn":
@@ -1412,6 +1416,8 @@ def _copy_type_metadata(self: T, other: ColumnBase) -> ColumnBase:
           of `other`  and the categories of `self`.
         * when both `self` and `other` are StructColumns, rename the fields
           of `other` to the field names of `self`.
+        * when both `self` and `other` are DecimalColumns, copy the precision
+          from self.dtype to other.dtype
         * when `self` and `other` are nested columns of the same type,
           recursively apply this function on the children of `self` to the
           and the children of `other`.
@@ -1435,6 +1441,11 @@ def _copy_type_metadata(self: T, other: ColumnBase) -> ColumnBase:
         ):
             other = other._rename_fields(self.dtype.fields.keys())
 
+        if isinstance(other, cudf.core.column.DecimalColumn) and isinstance(
+            self, cudf.core.column.DecimalColumn
+        ):
+            other.dtype.precision = self.dtype.precision
+
         if type(self) is type(other):
             if self.base_children and other.base_children:
                 base_children = tuple(
@@ -1510,7 +1521,7 @@ def column_empty(
                 dtype="int32",
             ),
         )
-    elif dtype.kind in "OU":
+    elif dtype.kind in "OU" and not is_decimal_dtype(dtype):
         data = None
         children = (
             full(row_count + 1, 0, dtype="int32"),
@@ -1624,6 +1635,15 @@ def build_column(
             null_count=null_count,
             children=children,
         )
+    elif is_interval_dtype(dtype):
+        return cudf.core.column.IntervalColumn(
+            dtype=dtype,
+            mask=mask,
+            size=size,
+            offset=offset,
+            children=children,
+            null_count=null_count,
+        )
     elif is_struct_dtype(dtype):
         if size is None:
             raise TypeError("Must specify size")
@@ -1641,6 +1661,7 @@ def build_column(
         return cudf.core.column.DecimalColumn(
             data=data,
             size=size,
+            offset=offset,
             dtype=dtype,
             mask=mask,
             null_count=null_count,
@@ -1704,6 +1725,52 @@ def build_categorical_column(
     return cast("cudf.core.column.CategoricalColumn", result)
 
 
+def build_interval_column(
+    left_col,
+    right_col,
+    mask=None,
+    size=None,
+    offset=0,
+    null_count=None,
+    closed="right",
+):
+    """
+    Build an IntervalColumn
+
+    Parameters
+    ----------
+    left_col : Column
+        Column of values representing the left of the interval
+    right_col : Column
+        Column of values representing the right of the interval
+    mask : Buffer
+        Null mask
+    size : int, optional
+    offset : int, optional
+    closed : {"left", "right", "both", "neither"}, default "right"
+            Whether the intervals are closed on the left-side, right-side,
+            both or neither.
+    """
+    left = as_column(left_col)
+    right = as_column(right_col)
+    if closed not in {"left", "right", "both", "neither"}:
+        closed = "right"
+    if type(left_col) is not list:
+        dtype = IntervalDtype(left_col.dtype, closed)
+    else:
+        dtype = IntervalDtype("int64", closed)
+    size = len(left)
+    return build_column(
+        data=None,
+        dtype=dtype,
+        mask=mask,
+        size=size,
+        offset=offset,
+        null_count=null_count,
+        children=(left, right),
+    )
+
+
 def as_column(
     arbitrary: Any,
     nan_as_null: bool = None,
@@ -1846,10 +1913,14 @@ def as_column(
                 cupy.asarray(arbitrary), nan_as_null=nan_as_null, dtype=dtype
             )
         else:
-            data = as_column(
-                pa.array(arbitrary, from_pandas=nan_as_null),
-                dtype=arbitrary.dtype,
-            )
+            pyarrow_array = pa.array(arbitrary, from_pandas=nan_as_null)
+            if isinstance(pyarrow_array.type, pa.Decimal128Type):
+                pyarrow_type = cudf.Decimal64Dtype.from_arrow(
+                    pyarrow_array.type
+                )
+            else:
+                pyarrow_type = arbitrary.dtype
+            data = as_column(pyarrow_array, dtype=pyarrow_type)
         if dtype is not None:
             data = data.astype(dtype)
 
@@ -2088,7 +2159,7 @@ def as_column(
                     data = as_column(sr, nan_as_null=nan_as_null)
                 elif is_interval_dtype(dtype):
                     sr = pd.Series(arbitrary, dtype="interval")
-                    data = as_column(sr, nan_as_null=nan_as_null)
+                    data = as_column(sr, nan_as_null=nan_as_null, dtype=dtype)
                 else:
                     data = as_column(
                         _construct_array(arbitrary, dtype),
diff --git a/python/cudf/cudf/core/column/decimal.py b/python/cudf/cudf/core/column/decimal.py
index 96e09a5abb5..d9e4610832d 100644
--- a/python/cudf/cudf/core/column/decimal.py
+++ b/python/cudf/cudf/core/column/decimal.py
@@ -1,25 +1,29 @@
 # Copyright (c) 2021, NVIDIA CORPORATION.
 
-import cudf
+from decimal import Decimal
+from typing import cast, Any
+
 import cupy as cp
 import numpy as np
 import pyarrow as pa
-from typing import cast
+from pandas.api.types import is_integer_dtype
 
+import cudf
 from cudf import _lib as libcudf
-from cudf.core.buffer import Buffer
-from cudf.core.column import ColumnBase
-from cudf.core.dtypes import Decimal64Dtype
-from cudf.utils.utils import pa_mask_buffer_to_mask
-
-from cudf._typing import Dtype
 from cudf._lib.strings.convert.convert_fixed_point import (
     from_decimal as cpp_from_decimal,
 )
-from cudf.core.column import as_column
+from cudf._typing import Dtype
+from cudf.core.buffer import Buffer
+from cudf.core.column import ColumnBase, as_column
+from cudf.core.dtypes import Decimal64Dtype
+from cudf.utils.dtypes import is_scalar
+from cudf.utils.utils import pa_mask_buffer_to_mask
 
 
 class DecimalColumn(ColumnBase):
+    dtype: Decimal64Dtype
+
     @classmethod
     def from_arrow(cls, data: pa.Array):
         dtype = Decimal64Dtype.from_arrow(data.type)
@@ -35,6 +39,7 @@ def from_arrow(cls, data: pa.Array):
             data=Buffer(data_64.view("uint8")),
             size=len(data),
             dtype=dtype,
+            offset=data.offset,
             mask=mask,
         )
 
@@ -56,6 +61,7 @@ def to_arrow(self):
         )
         return pa.Array.from_buffers(
             type=self.dtype.to_arrow(),
+            offset=self._offset,
             length=self.size,
             buffers=[mask_buf, data_buf],
         )
@@ -63,16 +69,53 @@ def to_arrow(self):
     def binary_operator(self, op, other, reflect=False):
         if reflect:
             self, other = other, self
-        scale = _binop_scale(self.dtype, other.dtype, op)
-        output_type = Decimal64Dtype(
-            scale=scale, precision=Decimal64Dtype.MAX_PRECISION
-        )  # precision will be ignored, libcudf has no notion of precision
-        result = libcudf.binaryop.binaryop(self, other, op, output_type)
-        result.dtype.precision = _binop_precision(self.dtype, other.dtype, op)
+
+        # Binary arithmetic between decimal columns. `Scale` and `precision`
+        # are computed outside of libcudf.
+        if op in ("add", "sub", "mul"):
+            scale = _binop_scale(self.dtype, other.dtype, op)
+            output_type = Decimal64Dtype(
+                scale=scale, precision=Decimal64Dtype.MAX_PRECISION
+            )  # precision will be ignored, libcudf has no notion of precision
+            result = libcudf.binaryop.binaryop(self, other, op, output_type)
+            result.dtype.precision = _binop_precision(
+                self.dtype, other.dtype, op
+            )
+        elif op in ("eq", "lt", "gt", "le", "ge"):
+            if not isinstance(
+                other,
+                (DecimalColumn, cudf.core.column.NumericalColumn, cudf.Scalar),
+            ):
+                raise TypeError(
+                    f"Operator {op} not supported between "
+                    f"{str(type(self))} and {str(type(other))}"
+                )
+            if isinstance(
+                other, cudf.core.column.NumericalColumn
+            ) and not is_integer_dtype(other.dtype):
+                raise TypeError(
+                    f"Only decimal and integer columns are supported for {op}."
+                )
+            if isinstance(other, cudf.core.column.NumericalColumn):
+                other = other.as_decimal_column(
+                    Decimal64Dtype(Decimal64Dtype.MAX_PRECISION, 0)
+                )
+            result = libcudf.binaryop.binaryop(self, other, op, bool)
         return result
 
+    def normalize_binop_value(self, other):
+        if is_scalar(other) and isinstance(other, (int, np.int, Decimal)):
+            return cudf.Scalar(Decimal(other))
+        elif isinstance(other, cudf.Scalar) and isinstance(
+            other.dtype, cudf.Decimal64Dtype
+        ):
+            return other
+        else:
+            raise TypeError(f"cannot normalize {type(other)}")
+
     def _apply_scan_op(self, op: str) -> ColumnBase:
-        return libcudf.reduce.scan(op, self, True)
+        result = libcudf.reduce.scan(op, self, True)
+        return self._copy_type_metadata(result)
 
     def as_decimal_column(
         self, dtype: Dtype, **kwargs
@@ -96,6 +139,49 @@ def as_string_column(
                 "cudf.core.column.StringColumn", as_column([], dtype="object")
             )
 
+    def reduce(self, op: str, skipna: bool = None, **kwargs) -> Decimal:
+        min_count = kwargs.pop("min_count", 0)
+        preprocessed = self._process_for_reduction(
+            skipna=skipna, min_count=min_count
+        )
+        if isinstance(preprocessed, ColumnBase):
+            return libcudf.reduce.reduce(op, preprocessed, **kwargs)
+        else:
+            return preprocessed
+
+    def sum(
+        self, skipna: bool = None, dtype: Dtype = None, min_count: int = 0
+    ) -> Decimal:
+        return self.reduce(
+            "sum", skipna=skipna, dtype=dtype, min_count=min_count
+        )
+
+    def product(
+        self, skipna: bool = None, dtype: Dtype = None, min_count: int = 0
+    ) -> Decimal:
+        return self.reduce(
+            "product", skipna=skipna, dtype=dtype, min_count=min_count
+        )
+
+    def sum_of_squares(
+        self, skipna: bool = None, dtype: Dtype = None, min_count: int = 0
+    ) -> Decimal:
+        return self.reduce(
+            "sum_of_squares", skipna=skipna, dtype=dtype, min_count=min_count
+        )
+
+    def fillna(
+        self, value: Any = None, method: str = None, dtype: Dtype = None
+    ):
+        """Fill null values with ``value``.
+
+        Returns a copy with nulls filled.
+        """
+        result = libcudf.replace.replace_nulls(
+            input_col=self, replacement=value, method=method, dtype=dtype
+        )
+        return self._copy_type_metadata(result)
+
 
 def _binop_scale(l_dtype, r_dtype, op):
     # This should at some point be hooked up to libcudf's
diff --git a/python/cudf/cudf/core/column/interval.py b/python/cudf/cudf/core/column/interval.py
index e9991bef071..d8bea6b1658 100644
--- a/python/cudf/cudf/core/column/interval.py
+++ b/python/cudf/cudf/core/column/interval.py
@@ -2,6 +2,8 @@
 import pyarrow as pa
 import cudf
 from cudf.core.column import StructColumn
+from cudf.core.dtypes import IntervalDtype
+from cudf.utils.dtypes import is_interval_dtype
 
 
 class IntervalColumn(StructColumn):
@@ -38,7 +40,7 @@ def closed(self):
     def from_arrow(self, data):
         new_col = super().from_arrow(data.storage)
         size = len(data)
-        dtype = cudf.core.dtypes.IntervalDtype.from_arrow(data.type)
+        dtype = IntervalDtype.from_arrow(data.type)
         mask = data.buffers()[0]
         if mask is not None:
             mask = cudf.utils.utils.pa_mask_buffer_to_mask(mask, len(data))
@@ -60,14 +62,17 @@ def from_arrow(self, data):
 
     def to_arrow(self):
         typ = self.dtype.to_arrow()
-        return pa.ExtensionArray.from_storage(typ, super().to_arrow())
+        struct_arrow = super().to_arrow()
+        if len(struct_arrow) == 0:
+            # An empty struct array comes back with null-typed children,
+            # so rebuild it from the storage type to give them real types.
+            struct_arrow = pa.array([], typ.storage_type)
+        return pa.ExtensionArray.from_storage(typ, struct_arrow)
 
     def from_struct_column(self, closed="right"):
         return IntervalColumn(
             size=self.size,
-            dtype=cudf.core.dtypes.IntervalDtype(
-                self.dtype.fields["left"], closed
-            ),
+            dtype=IntervalDtype(self.dtype.fields["left"], closed),
             mask=self.base_mask,
             offset=self.offset,
             null_count=self.null_count,
@@ -80,12 +85,28 @@ def copy(self, deep=True):
         struct_copy = super().copy(deep=deep)
         return IntervalColumn(
             size=struct_copy.size,
-            dtype=cudf.core.dtypes.IntervalDtype(
-                struct_copy.dtype.fields["left"], closed
-            ),
+            dtype=IntervalDtype(struct_copy.dtype.fields["left"], closed),
             mask=struct_copy.base_mask,
             offset=struct_copy.offset,
             null_count=struct_copy.null_count,
             children=struct_copy.base_children,
             closed=closed,
         )
+
+    def as_interval_column(self, dtype, **kwargs):
+        if is_interval_dtype(dtype):
+            # a user can directly input the string `interval` as the dtype
+            # when creating an interval series or interval dataframe
+            if dtype == "interval":
+                dtype = IntervalDtype(self.dtype.fields["left"], self.closed)
+            return IntervalColumn(
+                size=self.size,
+                dtype=dtype,
+                mask=self.mask,
+                offset=self.offset,
+                null_count=self.null_count,
+                children=self.children,
+                closed=dtype.closed,
+            )
+        else:
+            raise ValueError("dtype must be IntervalDtype")
diff --git a/python/cudf/cudf/core/column/lists.py b/python/cudf/cudf/core/column/lists.py
index b7f34e8c007..da953df5478 100644
--- a/python/cudf/cudf/core/column/lists.py
+++ b/python/cudf/cudf/core/column/lists.py
@@ -10,6 +10,7 @@
 from cudf._lib.lists import (
     contains_scalar,
     count_elements,
+    drop_list_duplicates,
     extract_element,
     sort_lists,
 )
@@ -236,9 +237,10 @@ def contains(self, search_key):
         Series([False, True, True])
         dtype: bool
         """
+        search_key = cudf.Scalar(search_key)
         try:
             res = self._return_or_inplace(
-                contains_scalar(self._column, search_key.device_value)
+                contains_scalar(self._column, search_key)
             )
         except RuntimeError as e:
             if (
@@ -361,6 +363,41 @@ def take(self, lists_indices):
         else:
             return res
 
+    def unique(self):
+        """
+        Return the unique elements of each list in the column. The order of
+        the unique elements within each list is not guaranteed.
+
+        Returns
+        -------
+        ListColumn
+
+        Examples
+        --------
+        >>> s = cudf.Series([[1, 1, 2, None, None], None, [4, 4], []])
+        >>> s
+        0    [1.0, 1.0, 2.0, nan, nan]
+        1                         None
+        2                   [4.0, 4.0]
+        3                           []
+        dtype: list
+        >>> s.list.unique()  # Order of list elements is not guaranteed
+        0              [1.0, 2.0, nan]
+        1                         None
+        2                        [4.0]
+        3                           []
+        dtype: list
+        """
+
+        if is_list_dtype(self._column.children[1].dtype):
+            raise NotImplementedError("Nested lists unique is not supported.")
+
+        return self._return_or_inplace(
+            drop_list_duplicates(
+                self._column, nulls_equal=True, nans_all_equal=True
+            )
+        )
+
     def sort_values(
         self,
         ascending=True,
diff --git a/python/cudf/cudf/core/column/numerical.py b/python/cudf/cudf/core/column/numerical.py
index f58a47a918c..10a9ffbfbae 100644
--- a/python/cudf/cudf/core/column/numerical.py
+++ b/python/cudf/cudf/core/column/numerical.py
@@ -22,6 +22,7 @@
     column,
     string,
 )
+from cudf.core.dtypes import Decimal64Dtype
 from cudf.utils import cudautils, utils
 from cudf.utils.dtypes import (
     min_column_type,
@@ -103,11 +104,23 @@ def binary_operator(
             out_dtype = self.dtype
         else:
             if not (
-                isinstance(rhs, (NumericalColumn, cudf.Scalar,),)
+                isinstance(
+                    rhs,
+                    (
+                        NumericalColumn,
+                        cudf.Scalar,
+                        cudf.core.column.DecimalColumn,
+                    ),
+                )
                 or np.isscalar(rhs)
             ):
                 msg = "{!r} operator not supported between {} and {}"
                 raise TypeError(msg.format(binop, type(self), type(rhs)))
+            if isinstance(rhs, cudf.core.column.DecimalColumn):
+                lhs = self.as_decimal_column(
+                    Decimal64Dtype(Decimal64Dtype.MAX_PRECISION, 0)
+                )
+                return lhs.binary_operator(binop, rhs)
             out_dtype = np.result_type(self.dtype, rhs.dtype)
             if binop in ["mod", "floordiv"]:
                 tmp = self if reflect else rhs
diff --git a/python/cudf/cudf/core/column/struct.py b/python/cudf/cudf/core/column/struct.py
index adaf62ffc25..266e448cdf3 100644
--- a/python/cudf/cudf/core/column/struct.py
+++ b/python/cudf/cudf/core/column/struct.py
@@ -5,9 +5,19 @@
 
 import cudf
 from cudf.core.column import ColumnBase
+from cudf.core.column.methods import ColumnMethodsMixin
+from cudf.utils.dtypes import is_struct_dtype
 
 
 class StructColumn(ColumnBase):
+    """
+    Column that stores fields of values.
+
+    Every column has n children, where n is
+    the number of fields in the Struct Dtype.
+
+    """
+
     dtype: cudf.core.dtypes.StructDtype
 
     @property
@@ -74,6 +84,9 @@ def copy(self, deep=True):
             result = result._rename_fields(self.dtype.fields.keys())
         return result
 
+    def struct(self, parent=None):
+        return StructMethods(self, parent=parent)
+
     def _rename_fields(self, names):
         """
         Return a StructColumn with the same field values as this StructColumn,
@@ -91,3 +104,50 @@ def _rename_fields(self, names):
             null_count=self.null_count,
             children=self.base_children,
         )
+
+
+class StructMethods(ColumnMethodsMixin):
+    """
+    Struct methods for Series
+    """
+
+    def __init__(self, column, parent=None):
+        if not is_struct_dtype(column.dtype):
+            raise AttributeError(
+                "Can only use .struct accessor with a 'struct' dtype"
+            )
+        super().__init__(column=column, parent=parent)
+
+    def field(self, key):
+        """
+        Extract children of the specified struct column
+        in the Series
+
+        Parameters
+        ----------
+        key: int or str
+            index/position or field name of the respective
+            struct column
+
+        Returns
+        -------
+        Series
+
+        Examples
+        --------
+        >>> s = cudf.Series([{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])
+        >>> s.struct.field(0)
+        0    1
+        1    3
+        dtype: int64
+        >>> s.struct.field('a')
+        0    1
+        1    3
+        dtype: int64
+        """
+        fields = list(self._column.dtype.fields.keys())
+        if key in fields:
+            pos = fields.index(key)
+            return self._return_or_inplace(self._column.children[pos])
+        else:
+            return self._return_or_inplace(self._column.children[key])
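Review note: the lookup in `StructMethods.field` tries the key as a field name first and only then as a position. The sketch below mirrors that order with plain Python lists so it runs without a GPU. `select_field` and its arguments are hypothetical stand-ins for illustration, not part of cudf.

```python
def select_field(field_names, children, key):
    """Return the child for `key`, which is a field name or a position.

    Hypothetical stand-in for the name-then-position lookup in the diff.
    """
    if key in field_names:
        # A matching field name wins, even if `key` is also a valid index.
        return children[field_names.index(key)]
    # Otherwise treat `key` as a positional index.
    return children[key]


names = ["a", "b"]
children = [[1, 3], [2, 4]]

by_name = select_field(names, children, "a")
by_position = select_field(names, children, 1)

# With integer field names, the name match takes precedence over position.
shadowed = select_field([1, 0], children, 0)
```

One consequence of this order is that a struct whose field names are integers resolves an integer key as a name rather than as a position.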
diff --git a/python/cudf/cudf/core/column_accessor.py b/python/cudf/cudf/core/column_accessor.py
index 33bae5c1328..f0681b330da 100644
--- a/python/cudf/cudf/core/column_accessor.py
+++ b/python/cudf/cudf/core/column_accessor.py
@@ -4,6 +4,7 @@
 
 import itertools
 from collections.abc import MutableMapping
+from functools import reduce
 from typing import (
     TYPE_CHECKING,
     Any,
@@ -19,12 +20,65 @@
 
 import cudf
 from cudf.core import column
-from cudf.utils.utils import cached_property, to_flat_dict, to_nested_dict
+from cudf.utils.utils import cached_property
 
 if TYPE_CHECKING:
     from cudf.core.column import ColumnBase
 
 
+class _NestedGetItemDict(dict):
+    """A dictionary whose __getitem__ method accesses nested dicts.
+
+    This class directly subclasses dict for performance, so there are a number
+    of gotchas: 1) the only safe accessor for nested elements is
+    `__getitem__` (all other accessors will fail to perform nested lookups), 2)
+    nested mappings will not exhibit the same behavior (they will be raw
+    dictionaries unless explicitly created to be of this class), and 3) to
+    construct this class you _must_ use `from_zip` to get appropriate treatment
+    of tuple keys.
+    """
+
+    @classmethod
+    def from_zip(cls, data):
+        """Create from zip, specialized factory for nesting."""
+        obj = cls()
+        for key, value in data:
+            d = obj
+            for k in key[:-1]:
+                d = d.setdefault(k, {})
+            d[key[-1]] = value
+        return obj
+
+    def __getitem__(self, key):
+        """Recursively apply dict.__getitem__ for nested elements."""
+        # As described in the pandas docs
+        # https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#advanced-indexing-with-hierarchical-index  # noqa: E501
+        # accessing nested elements of a multiindex must be done using a tuple.
+        # Lists and other sequences are treated as accessing multiple elements
+        # at the top level of the index.
+        if isinstance(key, tuple):
+            return reduce(dict.__getitem__, key, self)
+        return super().__getitem__(key)
+
+
+def _to_flat_dict_inner(d, parents=()):
+    for k, v in d.items():
+        if not isinstance(v, d.__class__):
+            if parents:
+                k = parents + (k,)
+            yield (k, v)
+        else:
+            yield from _to_flat_dict_inner(d=v, parents=parents + (k,))
+
+
+def _to_flat_dict(d):
+    """
+    Convert the given nested dictionary to a flat dictionary
+    with tuple keys.
+    """
+    return {k: v for k, v in _to_flat_dict_inner(d)}
+
+
 class ColumnAccessor(MutableMapping):
 
     _data: "Dict[Any, ColumnBase]"
@@ -166,7 +220,7 @@ def _grouped_data(self) -> MutableMapping:
         return the underlying mapping as a nested mapping.
         """
         if self.multiindex:
-            return to_nested_dict(dict(zip(self.names, self.columns)))
+            return _NestedGetItemDict.from_zip(zip(self.names, self.columns))
         else:
             return self._data
 
@@ -343,10 +397,11 @@ def set_by_label(self, key: Any, value: Any, validate: bool = True):
         self._clear_cache()
 
     def _select_by_label_list_like(self, key: Any) -> ColumnAccessor:
+        data = {k: self._grouped_data[k] for k in key}
+        if self.multiindex:
+            data = _to_flat_dict(data)
         return self.__class__(
-            to_flat_dict({k: self._grouped_data[k] for k in key}),
-            multiindex=self.multiindex,
-            level_names=self.level_names,
+            data, multiindex=self.multiindex, level_names=self.level_names,
         )
 
     def _select_by_label_grouped(self, key: Any) -> ColumnAccessor:
@@ -354,7 +409,8 @@ def _select_by_label_grouped(self, key: Any) -> ColumnAccessor:
         if isinstance(result, cudf.core.column.ColumnBase):
             return self.__class__({key: result})
         else:
-            result = to_flat_dict(result)
+            if self.multiindex:
+                result = _to_flat_dict(result)
             if not isinstance(key, tuple):
                 key = (key,)
             return self.__class__(
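Review note: the nested lookup and the flattening step added in `column_accessor.py` can be tried in isolation. The sketch below copies the logic from the diff into standalone names (without the leading underscores) and uses strings in place of columns. The sample keys and values are made up for illustration.

```python
from functools import reduce


class NestedGetItemDict(dict):
    """Dict whose __getitem__ follows tuple keys into nested dicts."""

    @classmethod
    def from_zip(cls, data):
        obj = cls()
        for key, value in data:
            d = obj
            for k in key[:-1]:
                # Inner levels are plain dicts, as in the diff.
                d = d.setdefault(k, {})
            d[key[-1]] = value
        return obj

    def __getitem__(self, key):
        if isinstance(key, tuple):
            return reduce(dict.__getitem__, key, self)
        return super().__getitem__(key)


def to_flat_dict_inner(d, parents=()):
    for k, v in d.items():
        if not isinstance(v, d.__class__):
            if parents:
                k = parents + (k,)
            yield (k, v)
        else:
            yield from to_flat_dict_inner(v, parents + (k,))


def to_flat_dict(d):
    return dict(to_flat_dict_inner(d))


names = [("a", "x"), ("a", "y"), ("b", "x")]
columns = ["col0", "col1", "col2"]  # placeholders for real columns
grouped = NestedGetItemDict.from_zip(zip(names, columns))

leaf = grouped["a", "x"]    # tuple key: nested lookup
group = grouped["a"]        # plain key: top-level lookup
selected = to_flat_dict({k: grouped[k] for k in ["a"]})
```

The flattening helper recurses only into values of the same class as the mapping it is given, which is why the selection is first collected into a plain dict, as `_select_by_label_list_like` does.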
diff --git a/python/cudf/cudf/core/dataframe.py b/python/cudf/cudf/core/dataframe.py
index 01b96151485..d5393a724ec 100644
--- a/python/cudf/cudf/core/dataframe.py
+++ b/python/cudf/cudf/core/dataframe.py
@@ -8,7 +8,7 @@
 import pickle
 import sys
 import warnings
-from collections import OrderedDict, defaultdict
+from collections import defaultdict
 from collections.abc import Iterable, Sequence
 from typing import Any, Optional, Set, TypeVar
 
@@ -240,7 +240,7 @@ def __init__(self, data=None, index=None, columns=None, dtype=None):
                 self._index = as_index(index)
             if columns is not None:
                 self._data = ColumnAccessor(
-                    OrderedDict.fromkeys(
+                    dict.fromkeys(
                         columns,
                         column.column_empty(
                             len(self), dtype="object", masked=True
@@ -1658,8 +1658,9 @@ def update(
         if not self.index.equals(other.index):
             other = other.reindex(self.index, axis=0)
 
-        for col in self.columns:
-            this = self[col]
+        source_df = self.copy(deep=False)
+        for col in source_df._column_names:
+            this = source_df[col]
             that = other[col]
 
             if errors == "raise":
@@ -1676,8 +1677,9 @@ def update(
             # don't overwrite columns unnecessarily
             if mask.all():
                 continue
+            source_df[col] = source_df[col].where(mask, that)
 
-            self[col] = this.where(mask, that)
+        self._mimic_inplace(source_df, inplace=True)
 
     def __add__(self, other):
         return self._apply_op("__add__", other)
@@ -2804,7 +2806,7 @@ def reindex(
 
         df = self
         cols = columns
-        dtypes = OrderedDict(df.dtypes)
+        dtypes = dict(df.dtypes)
         idx = labels if index is None and axis in (0, "index") else index
         cols = labels if cols is None and axis in (1, "columns") else cols
         df = df if cols is None else df[list(set(df.columns) & set(cols))]
@@ -5375,7 +5377,7 @@ def describe(
             ldesc_indexes = sorted(
                 (x.index for x in describe_series_list), key=len
             )
-            names = OrderedDict.fromkeys(
+            names = dict.fromkeys(
                 [
                     name
                     for idxnames in ldesc_indexes
@@ -7192,10 +7194,9 @@ def _columns_view(self, columns):
         """
         Return a subset of the DataFrame's columns as a view.
         """
-        result_columns = OrderedDict({})
-        for col in columns:
-            result_columns[col] = self._data[col]
-        return DataFrame(result_columns, index=self.index)
+        return DataFrame(
+            {col: self._data[col] for col in columns}, index=self.index
+        )
 
     def select_dtypes(self, include=None, exclude=None):
         """Return a subset of the DataFrame’s columns based on the column dtypes.
diff --git a/python/cudf/cudf/core/dtypes.py b/python/cudf/cudf/core/dtypes.py
index a18aad3872b..0c436cf36e7 100644
--- a/python/cudf/cudf/core/dtypes.py
+++ b/python/cudf/cudf/core/dtypes.py
@@ -14,7 +14,12 @@
 from cudf._typing import Dtype
 
 
-class CategoricalDtype(ExtensionDtype):
+class _BaseDtype(ExtensionDtype):
+    # Base type for all cudf-specific dtypes
+    pass
+
+
+class CategoricalDtype(_BaseDtype):
 
     ordered: Optional[bool]
 
@@ -121,7 +126,7 @@ def deserialize(cls, header, frames):
         return cls(categories=categories, ordered=ordered)
 
 
-class ListDtype(ExtensionDtype):
+class ListDtype(_BaseDtype):
     _typ: pa.ListType
     name: str = "list"
 
@@ -180,7 +185,7 @@ def __hash__(self):
         return hash(self._typ)
 
 
-class StructDtype(ExtensionDtype):
+class StructDtype(_BaseDtype):
 
     name = "struct"
 
@@ -231,7 +236,7 @@ def __hash__(self):
         return hash(self._typ)
 
 
-class Decimal64Dtype(ExtensionDtype):
+class Decimal64Dtype(_BaseDtype):
 
     name = "decimal"
     _metadata = ("precision", "scale")
@@ -311,6 +316,15 @@ def _validate(cls, precision, scale=0):
         if abs(scale) > precision:
             raise ValueError(f"scale={scale} exceeds precision={precision}")
 
+    @classmethod
+    def _from_decimal(cls, decimal):
+        """
+        Create a cudf.Decimal64Dtype from a decimal.Decimal object
+        """
+        metadata = decimal.as_tuple()
+        precision = max(len(metadata.digits), -metadata.exponent)
+        return cls(precision, -metadata.exponent)
+
 
 class IntervalDtype(StructDtype):
     name = "interval"
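Review note: the arithmetic in `Decimal64Dtype._from_decimal` depends only on `decimal.Decimal.as_tuple()`, so it can be checked without cudf. The sketch below returns the `(precision, scale)` pair that would be passed to the dtype constructor. `precision_and_scale` is a hypothetical helper name, and the sketch assumes finite values only.

```python
from decimal import Decimal


def precision_and_scale(value):
    """Mirror the precision/scale computation from the diff.

    Assumes `value` is finite; NaN and Infinity carry a non-numeric exponent.
    """
    meta = value.as_tuple()
    # Precision must cover both the stored digits and any leading
    # fractional zeros implied by the exponent (e.g. 0.001).
    precision = max(len(meta.digits), -meta.exponent)
    scale = -meta.exponent
    return precision, scale


simple = precision_and_scale(Decimal("1.23"))
leading_zeros = precision_and_scale(Decimal("0.001"))
integer = precision_and_scale(Decimal("100"))
positive_exponent = precision_and_scale(Decimal("1E+2"))
```

The `max` matters for values such as `0.001`, which store a single digit but need three places after the decimal point.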
diff --git a/python/cudf/cudf/core/frame.py b/python/cudf/cudf/core/frame.py
index fb746d6c794..dcf5044ed2f 100644
--- a/python/cudf/cudf/core/frame.py
+++ b/python/cudf/cudf/core/frame.py
@@ -5,8 +5,8 @@
 import copy
 import functools
 import warnings
-from collections import OrderedDict, abc as abc
-from typing import TYPE_CHECKING, Any, Dict, Tuple, TypeVar, Union, overload
+from collections import abc
+from typing import TYPE_CHECKING, Any, Dict, Optional, Tuple, TypeVar, Union
 
 import cupy
 import numpy as np
@@ -14,7 +14,6 @@
 import pyarrow as pa
 from nvtx import annotate
 from pandas.api.types import is_dict_like, is_dtype_equal
-from typing_extensions import Literal
 
 import cudf
 from cudf import _lib as libcudf
@@ -53,19 +52,9 @@ class Frame(libcudf.table.Table):
     def _from_table(cls, table: Frame):
         return cls(table._data, index=table._index)
 
-    @overload
-    def _mimic_inplace(self, result: Frame) -> Frame:
-        ...
-
-    @overload
-    def _mimic_inplace(self, result: Frame, inplace: Literal[True]):
-        ...
-
-    @overload
-    def _mimic_inplace(self, result: Frame, inplace: Literal[False]) -> Frame:
-        ...
-
-    def _mimic_inplace(self, result, inplace=False):
+    def _mimic_inplace(
+        self: T, result: Frame, inplace: bool = False
+    ) -> Optional[Frame]:
         if inplace:
             for col in self._data:
                 if col in result._data:
@@ -74,6 +63,7 @@ def _mimic_inplace(self, result, inplace=False):
                     )
             self._data = result._data
             self._index = result._index
+            return None
         else:
             return result
 
@@ -348,7 +338,7 @@ def _concat(
             non_intersecting_columns = union_of_columns.symmetric_difference(
                 intersecting_columns
             )
-            names = OrderedDict.fromkeys(intersecting_columns).keys()
+            names = dict.fromkeys(intersecting_columns).keys()
 
             if axis == 0:
                 if ignore_index and (
@@ -374,7 +364,7 @@ def _concat(
         elif join == "outer":
             # Get a list of the unique table column names
             names = [name for f in objs for name in f._column_names]
-            names = OrderedDict.fromkeys(names).keys()
+            names = dict.fromkeys(names).keys()
 
         else:
             raise ValueError(
@@ -796,87 +786,6 @@ def clip(self, lower=None, upper=None, inplace=False, axis=1):
 
         return self._mimic_inplace(output, inplace=inplace)
 
-    def _normalize_scalars(self, other):
-        """
-        Try to normalizes scalar values as per self dtype
-        """
-        if (
-            other is not None
-            and (isinstance(other, float) and not np.isnan(other))
-        ) and (self.dtype.type(other) != other):
-            raise TypeError(
-                f"Cannot safely cast non-equivalent "
-                f"{type(other).__name__} to {self.dtype.name}"
-            )
-
-        return (
-            self.dtype.type(other)
-            if (
-                other is not None
-                and (isinstance(other, float) and not np.isnan(other))
-            )
-            else other
-        )
-
-    def _normalize_columns_and_scalars_type(self, other):
-        """
-        Try to normalize the other's dtypes as per self.
-
-        Parameters
-        ----------
-
-        self : Can be a DataFrame or Series or Index
-        other : Can be a DataFrame, Series, Index, Array
-            like object or a scalar value
-
-            if self is DataFrame, other can be only a
-            scalar or array like with size of number of columns
-            in DataFrame or a DataFrame with same dimension
-
-            if self is Series, other can be only a scalar or
-            a series like with same length as self
-
-        Returns:
-        --------
-        A dataframe/series/list/scalar form of normalized other
-        """
-        if isinstance(self, cudf.DataFrame) and isinstance(
-            other, cudf.DataFrame
-        ):
-            return [
-                other[self_col].astype(self._data[self_col].dtype)._column
-                for self_col in self._data.names
-            ]
-
-        elif isinstance(self, (cudf.Series, cudf.Index)) and not is_scalar(
-            other
-        ):
-            other = as_column(other)
-            return other.astype(self.dtype)
-
-        else:
-            # Handles scalar or list/array like scalars
-            if isinstance(self, (cudf.Series, cudf.Index)) and is_scalar(
-                other
-            ):
-                return self._normalize_scalars(other)
-
-            elif isinstance(self, cudf.DataFrame):
-                out = []
-                if is_scalar(other):
-                    other = [other for i in range(len(self._data.names))]
-                out = [
-                    self[in_col_name]._normalize_scalars(sclr)
-                    for in_col_name, sclr in zip(self._data.names, other)
-                ]
-
-                return out
-            else:
-                raise ValueError(
-                    f"Inappropriate input {type(self)} "
-                    f"and other {type(other)} combination"
-                )
-
     def where(self, cond, other=None, inplace=False):
         """
         Replace values where the condition is False.
@@ -930,133 +839,9 @@ def where(self, cond, other=None, inplace=False):
         dtype: int64
         """
 
-        if isinstance(self, cudf.DataFrame):
-            if hasattr(cond, "__cuda_array_interface__"):
-                cond = cudf.DataFrame(
-                    cond, columns=self._data.names, index=self.index
-                )
-            elif not isinstance(cond, cudf.DataFrame):
-                cond = self.from_pandas(pd.DataFrame(cond))
-
-            common_cols = set(self._data.names).intersection(
-                set(cond._data.names)
-            )
-            if len(common_cols) > 0:
-                # If `self` and `cond` are having unequal index,
-                # then re-index `cond`.
-                if not self.index.equals(cond.index):
-                    cond = cond.reindex(self.index)
-            else:
-                if cond.shape != self.shape:
-                    raise ValueError(
-                        """Array conditional must be same shape as self"""
-                    )
-                # Setting `self` column names to `cond`
-                # as `cond` has no column names.
-                cond.columns = self.columns
-
-            other = self._normalize_columns_and_scalars_type(other)
-            out_df = cudf.DataFrame(index=self.index)
-            if len(self._columns) != len(other):
-                raise ValueError(
-                    """Replacement list length or number of dataframe columns
-                    should be equal to Number of columns of dataframe"""
-                )
-
-            for column_name, other_column in zip(self._data.names, other):
-                input_col = self._data[column_name]
-                if column_name in cond._data:
-                    if isinstance(
-                        input_col, cudf.core.column.CategoricalColumn
-                    ):
-                        if np.isscalar(other_column):
-                            try:
-                                other_column = input_col._encode(other_column)
-                            except ValueError:
-                                # When other is not present in categories,
-                                # fill with Null.
-                                other_column = None
-                        elif hasattr(other_column, "codes"):
-                            other_column = other_column.codes
-                        input_col = input_col.codes
-
-                    result = libcudf.copying.copy_if_else(
-                        input_col, other_column, cond._data[column_name]
-                    )
-
-                    if isinstance(
-                        self._data[column_name],
-                        cudf.core.column.CategoricalColumn,
-                    ):
-                        result = build_categorical_column(
-                            categories=self._data[column_name].categories,
-                            codes=as_column(
-                                result.base_data, dtype=result.dtype
-                            ),
-                            mask=result.base_mask,
-                            size=result.size,
-                            offset=result.offset,
-                            ordered=self._data[column_name].ordered,
-                        )
-                else:
-                    from cudf._lib.null_mask import MaskState, create_null_mask
-
-                    out_mask = create_null_mask(
-                        len(input_col), state=MaskState.ALL_NULL
-                    )
-                    result = input_col.set_mask(out_mask)
-                out_df[column_name] = self[column_name].__class__(result)
-
-            return self._mimic_inplace(out_df, inplace=inplace)
-
-        else:
-
-            if isinstance(other, cudf.DataFrame):
-                raise NotImplementedError(
-                    "cannot align with a higher dimensional Frame"
-                )
-
-            other = self._normalize_columns_and_scalars_type(other)
-
-            cond = as_column(cond)
-            if len(cond) != len(self):
-                raise ValueError(
-                    """Array conditional must be same shape as self"""
-                )
-            input_col = self._data[self.name]
-            if isinstance(input_col, cudf.core.column.CategoricalColumn):
-                if np.isscalar(other):
-                    try:
-                        other = input_col._encode(other)
-                    except ValueError:
-                        # When other is not present in categories,
-                        # fill with Null.
-                        other = None
-                elif hasattr(other, "codes"):
-                    other = other.codes
-
-                input_col = input_col.codes
-
-            result = libcudf.copying.copy_if_else(input_col, other, cond)
-
-            if is_categorical_dtype(self.dtype):
-                result = build_categorical_column(
-                    categories=self._data[self.name].categories,
-                    codes=as_column(result.base_data, dtype=result.dtype),
-                    mask=result.base_mask,
-                    size=result.size,
-                    offset=result.offset,
-                    ordered=self._data[self.name].ordered,
-                )
-
-            if isinstance(self, cudf.Index):
-                from cudf.core.index import as_index
-
-                result = as_index(result, name=self.name)
-            else:
-                result = self._copy_construct(data=result)
-
-            return self._mimic_inplace(result, inplace=inplace)
+        return cudf.core._internals.where(
+            frame=self, cond=cond, other=other, inplace=inplace
+        )
 
     def mask(self, cond, other=None, inplace=False):
         """
@@ -2387,12 +2172,14 @@ def replace(self, to_replace: Any, replacement: Any) -> Frame:
                         replacements_per_column[name],
                         all_na_per_column[name],
                     )
-                except KeyError:
-                    # We need to create a deep copy if `find_and_replace`
-                    # was not successful or any of
-                    # `to_replace_per_column`, `replacements_per_column`,
-                    # `all_na_per_column` don't contain the `name`
-                    # that exists in `copy_data`
+                except (KeyError, OverflowError):
+                    # We need to create a deep copy if:
+                    # i. `find_and_replace` was not successful or any of
+                    #    `to_replace_per_column`, `replacements_per_column`,
+                    #    `all_na_per_column` don't contain the `name`
+                    #    that exists in `copy_data`.
+                    # ii. There is an OverflowError while trying to cast
+                    #     `to_replace_per_column` to `replacements_per_column`.
                     copy_data[name] = col.copy(deep=True)
         else:
             copy_data = self._data.copy(deep=True)
@@ -2735,7 +2522,6 @@ def searchsorted(
         array([4, 4, 4, 0], dtype=int32)
         """
         # Call libcudf++ search_sorted primitive
-        from cudf.utils.dtypes import is_scalar
 
         scalar_flag = None
         if is_scalar(values):
@@ -3471,17 +3257,21 @@ def _reindex(
                 # double-argsort to map back from sorted to unsorted positions
                 df = df.take(index.argsort(ascending=True).argsort())
 
-        cols = OrderedDict()
         index = index if index is not None else df.index
         names = columns if columns is not None else list(df.columns)
-        for name in names:
-            if name in df._data:
-                cols[name] = df._data[name].copy(deep=deep)
-            else:
-                dtype = dtypes.get(name, np.float64)
-                cols[name] = column_empty(
-                    dtype=dtype, masked=True, row_count=len(index)
+        cols = {
+            name: (
+                df._data[name].copy(deep=deep)
+                if name in df._data
+                else column_empty(
+                    dtype=dtypes.get(name, np.float64),
+                    masked=True,
+                    row_count=len(index),
                 )
+            )
+            for name in names
+        }
+
         result = self.__class__._from_table(
             Frame(
                 data=cudf.core.column_accessor.ColumnAccessor(
diff --git a/python/cudf/cudf/core/groupby/groupby.py b/python/cudf/cudf/core/groupby/groupby.py
index 86e1f5cfe30..cc94548d9a2 100644
--- a/python/cudf/cudf/core/groupby/groupby.py
+++ b/python/cudf/cudf/core/groupby/groupby.py
@@ -13,6 +13,8 @@
 from cudf.utils.utils import cached_property
 
 
+# Note that all valid aggregation methods (e.g. GroupBy.min) are bound to the
+# class after its definition (see below).
 class GroupBy(Serializable):
 
     _MAX_GROUPS_BEFORE_WARN = 100
@@ -58,14 +60,6 @@ def __init__(
         else:
             self.grouping = _Grouping(obj, by, level)
 
-    def __getattribute__(self, key):
-        try:
-            return super().__getattribute__(key)
-        except AttributeError:
-            if key in libgroupby._GROUPBY_AGGS:
-                return functools.partial(self._agg_func_name_with_args, key)
-            raise
-
     def __iter__(self):
         group_names, offsets, _, grouped_values = self._grouped()
         if isinstance(group_names, cudf.Index):
@@ -267,19 +261,6 @@ def _grouped(self):
         group_names = grouped_keys.unique()
         return (group_names, offsets, grouped_keys, grouped_values)
 
-    def _agg_func_name_with_args(self, func_name, *args, **kwargs):
-        """
-        Aggregate given an aggregate function name
-        and arguments to the function, e.g.,
-        `_agg_func_name_with_args("quantile", 0.5)`
-        """
-
-        def func(x):
-            return getattr(x, func_name)(*args, **kwargs)
-
-        func.__name__ = func_name
-        return self.agg(func)
-
     def _normalize_aggs(self, aggs):
         """
         Normalize aggs to a dict mapping column names
@@ -590,6 +571,48 @@ def rolling(self, *args, **kwargs):
         return cudf.core.window.rolling.RollingGroupby(self, *args, **kwargs)
 
 
+# Set of valid groupby aggregations that are monkey-patched into the GroupBy
+# namespace.
+_VALID_GROUPBY_AGGS = {
+    "count",
+    "sum",
+    "idxmin",
+    "idxmax",
+    "min",
+    "max",
+    "mean",
+    "var",
+    "std",
+    "quantile",
+    "median",
+    "nunique",
+    "collect",
+    "unique",
+}
+
+
+# Dynamically bind the different aggregation methods.
+def _agg_func_name_with_args(self, func_name, *args, **kwargs):
+    """
+    Aggregate given an aggregate function name and arguments to the
+    function, e.g., `_agg_func_name_with_args("quantile", 0.5)`. The named
+    aggregations must be members of _AggregationFactory.
+    """
+
+    def func(x):
+        return getattr(x, func_name)(*args, **kwargs)
+
+    func.__name__ = func_name
+    func.__doc__ = f"Compute the {func_name} of the group."
+    return self.agg(func)
+
+
+for key in _VALID_GROUPBY_AGGS:
+    setattr(
+        GroupBy, key, functools.partialmethod(_agg_func_name_with_args, key)
+    )
+
+
 class DataFrameGroupBy(GroupBy):
     def __init__(
         self, obj, by=None, level=None, sort=False, as_index=True, dropna=True
@@ -685,15 +708,16 @@ def __init__(
             dropna=dropna,
         )
 
-    def __getattribute__(self, key):
+    def __getattr__(self, key):
+        # Without this check, copying can trigger a RecursionError. See
+        # https://nedbatchelder.com/blog/201010/surprising_getattr_recursion.html  # noqa: E501
+        # for an explanation.
+        if key == "obj":
+            raise AttributeError
         try:
-            return super().__getattribute__(key)
-        except AttributeError:
-            if key in self.obj:
-                return self.obj[key].groupby(
-                    self.grouping, dropna=self._dropna, sort=self._sort
-                )
-            raise
+            return self[key]
+        except KeyError:
+            raise AttributeError
 
     def __getitem__(self, key):
         return self.obj[key].groupby(
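
Reviewer sketch (not part of the patch): the two techniques this file's hunks rely on, binding aggregation methods with `functools.partialmethod` after the class body and guarding `__getattr__` against recursion during copying, can be reproduced in isolation. `ToyGroupBy` and `_agg_by_name` are hypothetical stand-ins, and the builtin `min`/`max`/`sum` lookup is an assumption of this toy rather than how cudf dispatches aggregations.

```python
import copy
import functools


class ToyGroupBy:
    """Hypothetical stand-in for GroupBy; not the cudf class."""

    def __init__(self, obj):
        self.obj = obj  # mapping of column name -> list of values

    def agg(self, func):
        return {name: func(values) for name, values in self.obj.items()}

    def __getitem__(self, key):
        return ToyGroupBy({key: self.obj[key]})

    def __getattr__(self, key):
        # __getattr__ runs only after normal lookup fails. copy.copy()
        # creates the instance without calling __init__, so "obj" may be
        # missing; without this guard, self[key] would read self.obj,
        # re-enter __getattr__, and recurse until RecursionError.
        if key == "obj":
            raise AttributeError(key)
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)


def _agg_by_name(self, func_name):
    # Toy assumption: resolve the aggregation among Python builtins.
    return self.agg({"min": min, "max": max, "sum": sum}[func_name])


# Bind one method per aggregation name after the class is defined.
for _name in ("min", "max", "sum"):
    setattr(ToyGroupBy, _name, functools.partialmethod(_agg_by_name, _name))

gb = ToyGroupBy({"a": [3, 1, 2], "b": [10, 20, 30]})
```

Because the methods live on the class, they show up in `dir(ToyGroupBy)` and tab completion, which a `__getattribute__` fallback does not offer.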
diff --git a/python/cudf/cudf/core/index.py b/python/cudf/cudf/core/index.py
index 5104629eee0..f65afb6a1d4 100644
--- a/python/cudf/cudf/core/index.py
+++ b/python/cudf/cudf/core/index.py
@@ -13,25 +13,32 @@
 from pandas._config import get_option
 
 import cudf
+from cudf._lib.filling import sequence
 from cudf._typing import DtypeObj
 from cudf.core.abc import Serializable
 from cudf.core.column import (
     CategoricalColumn,
     ColumnBase,
     DatetimeColumn,
+    IntervalColumn,
     NumericalColumn,
     StringColumn,
     TimeDeltaColumn,
+    arange,
     column,
 )
 from cudf.core.column.string import StringMethods as StringMethods
+from cudf.core.dtypes import IntervalDtype
 from cudf.core.frame import Frame
 from cudf.utils import ioutils, utils
 from cudf.utils.docutils import copy_docstring
 from cudf.utils.dtypes import (
+    find_common_type,
     is_categorical_dtype,
+    is_interval_dtype,
     is_list_like,
     is_mixed_with_object_dtype,
+    is_numerical_dtype,
     is_scalar,
     numeric_normalize_types,
 )
@@ -2111,6 +2118,10 @@ def __repr__(self):
         return "\n".join(lines)
 
     def __getitem__(self, index):
+        if type(self) == IntervalIndex:
+            raise NotImplementedError(
+                "Getting a scalar from an IntervalIndex is not yet supported"
+            )
         res = self._values[index]
         if not isinstance(index, int):
             res = as_index(res)
@@ -2635,7 +2646,8 @@ def inferred_freq(self):
 
 
 class CategoricalIndex(GenericIndex):
-    """An categorical of orderable values that represent the indices of another
+    """
+    A categorical of orderable values that represent the indices of another
     Column
 
     Parameters
@@ -2752,6 +2764,236 @@ def categories(self):
         return self._values.cat().categories
 
 
+def interval_range(
+    start=None, end=None, periods=None, freq=None, name=None, closed="right",
+) -> "IntervalIndex":
+    """
+    Returns a fixed frequency IntervalIndex.
+
+    Parameters
+    ----------
+    start : numeric, default None
+        Left bound for generating intervals.
+    end : numeric, default None
+        Right bound for generating intervals.
+    periods : int, default None
+        Number of periods to generate.
+    freq : numeric, default None
+        The length of each interval. Must be consistent
+        with the type of start and end.
+    name : str, default None
+        Name of the resulting IntervalIndex.
+    closed : {"left", "right", "both", "neither"}, default "right"
+        Whether the intervals are closed on the left-side, right-side,
+        both or neither.
+
+    Returns
+    -------
+    IntervalIndex
+
+    Examples
+    --------
+    >>> import cudf
+    >>> import pandas as pd
+    >>> cudf.interval_range(start=0, end=5)
+    IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]],
+                  closed='right', dtype='interval')
+    >>> cudf.interval_range(start=0, end=10, freq=2, closed='left')
+    IntervalIndex([[0, 2), [2, 4), [4, 6), [6, 8), [8, 10)],
+                  closed='left', dtype='interval')
+    >>> cudf.interval_range(start=0, end=10, periods=3, closed='left')
+    IntervalIndex([[0.0, 3.3333333333333335),
+                   [3.3333333333333335, 6.666666666666667),
+                   [6.666666666666667, 10.0)],
+                  closed='left',
+                  dtype='interval')
+    """
+    if freq and periods and start and end:
+        raise ValueError(
+            "Of the four parameters: start, end, periods, and "
+            "freq, exactly three must be specified"
+        )
+    args = [
+        cudf.Scalar(x) if x is not None else None
+        for x in (start, end, freq, periods)
+    ]
+    if any(
+        not is_numerical_dtype(x.dtype) if x is not None else False
+        for x in args
+    ):
+        raise ValueError("start, end, periods, freq must be numeric values.")
+    *rargs, periods = args
+    common_dtype = find_common_type([x.dtype for x in rargs if x])
+    start, end, freq = rargs
+    periods = periods.astype("int64") if periods is not None else None
+
+    if periods and not freq:
+        # if statement for mypy to pass
+        if end is not None and start is not None:
+            # divmod only supported on host side scalars
+            quotient, remainder = divmod((end - start).value, periods.value)
+            if remainder:
+                freq_step = cudf.Scalar((end - start) / periods)
+            else:
+                freq_step = cudf.Scalar(quotient)
+            if start.dtype != freq_step.dtype:
+                start = start.astype(freq_step.dtype)
+            bin_edges = sequence(
+                size=periods + 1,
+                init=start.device_value,
+                step=freq_step.device_value,
+            )
+            left_col = bin_edges[:-1]
+            right_col = bin_edges[1:]
+    elif freq and periods:
+        if end:
+            start = end - (freq * periods)
+        if start:
+            end = freq * periods + start
+        if end is not None and start is not None:
+            left_col = arange(
+                start.value, end.value, freq.value, dtype=common_dtype
+            )
+            end = end + 1
+            start = start + freq
+            right_col = arange(
+                start.value, end.value, freq.value, dtype=common_dtype
+            )
+    elif freq and not periods:
+        if end is not None and start is not None:
+            end = end - freq + 1
+            left_col = arange(
+                start.value, end.value, freq.value, dtype=common_dtype
+            )
+            end = end + freq + 1
+            start = start + freq
+            right_col = arange(
+                start.value, end.value, freq.value, dtype=common_dtype
+            )
+    elif start is not None and end is not None:
+        # if statements for mypy to pass
+        if freq:
+            left_col = arange(
+                start.value, end.value, freq.value, dtype=common_dtype
+            )
+        else:
+            left_col = arange(start.value, end.value, dtype=common_dtype)
+        start = start + 1
+        end = end + 1
+        if freq:
+            right_col = arange(
+                start.value, end.value, freq.value, dtype=common_dtype
+            )
+        else:
+            right_col = arange(start.value, end.value, dtype=common_dtype)
+    else:
+        raise ValueError(
+            "Of the four parameters: start, end, periods, and "
+            "freq, at least two must be specified"
+        )
+    if len(right_col) == 0 or len(left_col) == 0:
+        dtype = IntervalDtype("int64", closed)
+        data = column.column_empty_like_same_mask(left_col, dtype)
+        return cudf.IntervalIndex(data, closed=closed)
+
+    interval_col = column.build_interval_column(
+        left_col, right_col, closed=closed
+    )
+    return IntervalIndex(interval_col)
+
+
+class IntervalIndex(GenericIndex):
+    """
+    Immutable index of intervals that are closed on the same side.
+
+    Parameters
+    ----------
+    data : array-like (1-dimensional)
+        Array-like containing Interval objects from which to build the
+        IntervalIndex.
+    closed : {"left", "right", "both", "neither"}, default "right"
+        Whether the intervals are closed on the left-side, right-side,
+        both or neither.
+    dtype : dtype or None, default None
+        If None, dtype will be inferred.
+    copy : bool, default False
+        Copy the input data.
+    name : object, optional
+        Name to be stored in the index.
+
+    Returns
+    -------
+    IntervalIndex
+    """
+
+    def __new__(
+        cls, data, closed=None, dtype=None, copy=False, name=None,
+    ) -> "IntervalIndex":
+        if copy:
+            data = column.as_column(data, dtype=dtype).copy()
+        out = Frame.__new__(cls)
+        kwargs = _setdefault_name(data, name=name)
+        if isinstance(data, IntervalColumn):
+            data = data
+        elif isinstance(data, pd.Series) and (is_interval_dtype(data.dtype)):
+            data = column.as_column(data, data.dtype)
+        elif isinstance(data, (pd._libs.interval.Interval, pd.IntervalIndex)):
+            data = column.as_column(data, dtype=dtype,)
+        elif not data:
+            dtype = IntervalDtype("int64", closed)
+            data = column.column_empty_like_same_mask(
+                column.as_column(data), dtype
+            )
+        else:
+            data = column.as_column(data)
+            data.dtype.closed = closed
+
+        out._initialize(data, **kwargs)
+        return out
+
+    def from_breaks(breaks, closed="right", name=None, copy=False, dtype=None):
+        """
+        Construct an IntervalIndex from an array of splits.
+
+        Parameters
+        ----------
+        breaks : array-like (1-dimensional)
+            Left and right bounds for each interval.
+        closed : {"left", "right", "both", "neither"}, default "right"
+            Whether the intervals are closed on the left-side, right-side,
+            both or neither.
+        copy : bool, default False
+            Copy the input data.
+        name : object, optional
+            Name to be stored in the index.
+        dtype : dtype or None, default None
+            If None, dtype will be inferred.
+
+        Returns
+        -------
+        IntervalIndex
+
+        Examples
+        --------
+        >>> import cudf
+        >>> import pandas as pd
+        >>> cudf.IntervalIndex.from_breaks([0, 1, 2, 3])
+        IntervalIndex([(0, 1], (1, 2], (2, 3]],
+                    closed='right',
+                    dtype='interval[int64]')
+        """
+        if copy:
+            breaks = column.as_column(breaks, dtype=dtype).copy()
+        left_col = breaks[:-1:]
+        right_col = breaks[+1::]
+
+        interval_col = column.build_interval_column(
+            left_col, right_col, closed=closed
+        )
+
+        return IntervalIndex(interval_col, name=name)
+
+
 class StringIndex(GenericIndex):
     """String defined indices into another Column
 
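
Reviewer sketch (not part of the patch): the `periods`-without-`freq` branch of `interval_range` derives a step from `(end - start) / periods` and slices one sequence of bin edges into left and right bounds. The host-side version below mirrors that arithmetic with plain Python lists; `interval_edges` is a hypothetical helper and does not call any cudf API.

```python
def interval_edges(start, end, periods):
    """Return (left, right) bounds for `periods` equal-width intervals."""
    span = end - start
    quotient, remainder = divmod(span, periods)
    # Keep an integer step when the span divides evenly, else use a float.
    step = quotient if remainder == 0 else span / periods
    # periods + 1 edges describe `periods` adjacent intervals.
    edges = [start + i * step for i in range(periods + 1)]
    return edges[:-1], edges[1:]


left, right = interval_edges(0, 10, 5)
uneven_left, uneven_right = interval_edges(0, 10, 3)
```

Each right bound equals the next left bound, which is why a single edge sequence is enough to build both columns.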
diff --git a/python/cudf/cudf/core/join/_join_helpers.py b/python/cudf/cudf/core/join/_join_helpers.py
index 3807f408369..5e15ddfc359 100644
--- a/python/cudf/cudf/core/join/_join_helpers.py
+++ b/python/cudf/cudf/core/join/_join_helpers.py
@@ -97,6 +97,14 @@ def _match_join_keys(
     if pd.api.types.is_dtype_equal(ltype, rtype):
         return lcol, rcol
 
+    if isinstance(ltype, cudf.Decimal64Dtype) or isinstance(
+        rtype, cudf.Decimal64Dtype
+    ):
+        raise TypeError(
+            "Decimal columns can only be merged with decimal columns "
+            "of the same precision and scale"
+        )
+
     if (np.issubdtype(ltype, np.number)) and (np.issubdtype(rtype, np.number)):
         common_type = (
             max(ltype, rtype)
diff --git a/python/cudf/cudf/core/join/join.py b/python/cudf/cudf/core/join/join.py
index 1a4826d0570..3f5776b4ea4 100644
--- a/python/cudf/cudf/core/join/join.py
+++ b/python/cudf/cudf/core/join/join.py
@@ -196,14 +196,14 @@ def perform_merge(self) -> Frame:
 
     def _compute_join_keys(self):
         # Computes self._keys
+        left_keys = []
+        right_keys = []
         if (
             self.left_index
             or self.right_index
             or self.left_on
             or self.right_on
         ):
-            left_keys = []
-            right_keys = []
             if self.left_index:
                 left_keys.extend(
                     [
@@ -234,14 +234,25 @@ def _compute_join_keys(self):
                         for on in _coerce_to_tuple(self.right_on)
                     ]
                 )
+        elif self.on:
+            on_names = _coerce_to_tuple(self.on)
+            for on in on_names:
+                # If `on` is provided, merge on columns if present,
+                # otherwise default to indexes.
+                if on in self.lhs._data:
+                    left_keys.append(_Indexer(name=on, column=True))
+                else:
+                    left_keys.append(_Indexer(name=on, index=True))
+                if on in self.rhs._data:
+                    right_keys.append(_Indexer(name=on, column=True))
+                else:
+                    right_keys.append(_Indexer(name=on, index=True))
+
         else:
-            # Use `on` if provided. Otherwise,
-            # implicitly use identically named columns as the key columns:
-            on_names = (
-                _coerce_to_tuple(self.on)
-                if self.on is not None
-                else set(self.lhs._data) & set(self.rhs._data)
-            )
+            # If `on` is not provided and we're not merging
+            # index with column or on both indexes, then use
+            # the intersection of columns in both frames.
+            on_names = set(self.lhs._data) & set(self.rhs._data)
             left_keys = [_Indexer(name=on, column=True) for on in on_names]
             right_keys = [_Indexer(name=on, column=True) for on in on_names]
 
@@ -384,12 +395,16 @@ def _validate_merge_params(
         if how not in {"left", "inner", "outer", "leftanti", "leftsemi"}:
             raise NotImplementedError(f"{how} merge not supported yet")
 
-        # Passing 'on' with 'left_on' or 'right_on' is ambiguous
-        if on and (left_on or right_on):
-            raise ValueError(
-                'Can only pass argument "on" OR "left_on" '
-                'and "right_on", not a combination of both.'
-            )
+        if on:
+            if left_on or right_on:
+                # Passing 'on' with 'left_on' or 'right_on' is ambiguous
+                raise ValueError(
+                    'Can only pass argument "on" OR "left_on" '
+                    'and "right_on", not a combination of both.'
+                )
+            else:
+                # The validity of 'on' is checked by _Indexer.
+                return
 
         # Can't merge on unnamed Series
         if (isinstance(lhs, cudf.Series) and not lhs.name) or (
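
Reviewer sketch (not part of the patch): the new `elif self.on:` branch resolves each `on` name independently per side, preferring a column and falling back to an index level. The standalone version below shows that rule; `KeyRef` and `resolve_on` are hypothetical names standing in for `_Indexer` and the method body, and column membership is modelled with plain sets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyRef:
    """Hypothetical stand-in for the patch's _Indexer."""

    name: str
    kind: str  # "column" or "index"


def _resolve_side(names, columns):
    # Prefer a column with that name; otherwise assume an index level.
    return [
        KeyRef(name, "column" if name in columns else "index")
        for name in names
    ]


def resolve_on(on, left_columns, right_columns):
    """Resolve `on` to one list of key references per side."""
    names = (on,) if isinstance(on, str) else tuple(on)
    return _resolve_side(names, left_columns), _resolve_side(names, right_columns)


left_keys, right_keys = resolve_on("k", {"k", "x"}, {"y"})
```

Here `k` is a column on the left and is assumed to be an index level on the right, so the two sides get different kinds of key for the same name.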
diff --git a/python/cudf/cudf/core/multiindex.py b/python/cudf/cudf/core/multiindex.py
index 1c1e48e7372..a4748632aab 100644
--- a/python/cudf/cudf/core/multiindex.py
+++ b/python/cudf/cudf/core/multiindex.py
@@ -5,7 +5,6 @@
 import numbers
 import pickle
 import warnings
-from collections import OrderedDict
 from collections.abc import Sequence
 from typing import Any, List, Tuple, Union
 
@@ -1248,7 +1247,7 @@ def _poplevels(self, level):
         if not ilevels:
             return None
 
-        popped_data = OrderedDict({})
+        popped_data = {}
         popped_names = []
         names = list(self.names)
 
diff --git a/python/cudf/cudf/core/reshape.py b/python/cudf/cudf/core/reshape.py
index 1c339d79aaf..146fc22b77f 100644
--- a/python/cudf/cudf/core/reshape.py
+++ b/python/cudf/cudf/core/reshape.py
@@ -524,16 +524,13 @@ def _tile(A, reps):
             return cudf.Series([], dtype=A.dtype)
 
     # Step 1: tile id_vars
-    mdata = collections.OrderedDict()
-    for col in id_vars:
-        mdata[col] = _tile(frame[col], K)
+    mdata = {col: _tile(frame[col], K) for col in id_vars}
 
     # Step 2: add variable
-    var_cols = []
-    for i, _ in enumerate(value_vars):
-        var_cols.append(
-            cudf.Series(cudf.core.column.full(N, i, dtype=np.int8))
-        )
+    var_cols = [
+        cudf.Series(cudf.core.column.full(N, i, dtype=np.int8))
+        for i in range(len(value_vars))
+    ]
     temp = cudf.Series._concat(objs=var_cols, index=None)
 
     if not var_name:
diff --git a/python/cudf/cudf/core/scalar.py b/python/cudf/cudf/core/scalar.py
index 1e998ae37e2..d879b2ec4e2 100644
--- a/python/cudf/cudf/core/scalar.py
+++ b/python/cudf/cudf/core/scalar.py
@@ -1,9 +1,11 @@
 # Copyright (c) 2020-2021, NVIDIA CORPORATION.
+import decimal
 
 import numpy as np
 
 from cudf._lib.scalar import DeviceScalar, _is_null_host_scalar
 from cudf.core.column.column import ColumnBase
+from cudf.core.dtypes import Decimal64Dtype
 from cudf.core.index import Index
 from cudf.core.series import Series
 from cudf.utils.dtypes import (
@@ -112,29 +114,44 @@ def _device_value_to_host(self):
         self._host_value = self._device_value._to_host_scalar()
 
     def _preprocess_host_value(self, value, dtype):
+        if isinstance(dtype, Decimal64Dtype):
+            # TODO: Support coercion from decimal.Decimal to different dtype
+            # TODO: Support coercion from integer to Decimal64Dtype
+            raise NotImplementedError(
+                "dtype as cudf.Decimal64Dtype is not supported. Pass a "
+                "decimal.Decimal to construct a DecimalScalar."
+            )
+        if isinstance(value, decimal.Decimal) and dtype is not None:
+            raise TypeError(f"Can not coerce decimal to {dtype}")
+
         value = to_cudf_compatible_scalar(value, dtype=dtype)
         valid = not _is_null_host_scalar(value)
 
-        if dtype is None:
-            if not valid:
-                if isinstance(value, (np.datetime64, np.timedelta64)):
-                    unit, _ = np.datetime_data(value)
-                    if unit == "generic":
+        if isinstance(value, decimal.Decimal):
+            # 0.0042 -> Decimal64Dtype(2, 4)
+            dtype = Decimal64Dtype._from_decimal(value)
+
+        else:
+            if dtype is None:
+                if not valid:
+                    if isinstance(value, (np.datetime64, np.timedelta64)):
+                        unit, _ = np.datetime_data(value)
+                        if unit == "generic":
+                            raise TypeError(
+                                "Cant convert generic NaT to null scalar"
+                            )
+                        else:
+                            dtype = value.dtype
+                    else:
                         raise TypeError(
-                            "Cant convert generic NaT to null scalar"
+                            "dtype required when constructing a null scalar"
                         )
-                    else:
-                        dtype = value.dtype
                 else:
-                    raise TypeError(
-                        "dtype required when constructing a null scalar"
-                    )
-            else:
-                dtype = value.dtype
-        dtype = np.dtype(dtype)
+                    dtype = value.dtype
+            dtype = np.dtype(dtype)
 
-        # temporary
-        dtype = np.dtype("object") if dtype.char == "U" else dtype
+            # temporary
+            dtype = np.dtype("object") if dtype.char == "U" else dtype
 
         if not valid:
             value = NA
@@ -341,7 +358,7 @@ def _dispatch_scalar_unaop(self, op):
         return getattr(self.value, op)()
 
     def astype(self, dtype):
-        return Scalar(self.device_value, dtype)
+        return Scalar(self.value, dtype)
 
 
 class _NAType(object):
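
Reviewer sketch (not part of the patch): the comment `0.0042 -> Decimal64Dtype(2, 4)` implies that precision and scale are read off the `decimal.Decimal` value itself. One way to reproduce that example with the standard library is shown below; `infer_precision_scale` is a hypothetical helper, and `Decimal64Dtype._from_decimal` may compute these values differently.

```python
import decimal


def infer_precision_scale(value):
    """Return (precision, scale) for a finite decimal.Decimal."""
    _sign, digits, exponent = value.as_tuple()
    if not isinstance(exponent, int):
        # NaN and Infinity carry a string exponent ("n", "N" or "F").
        raise ValueError("value must be a finite Decimal")
    # precision: number of stored digits; scale: digits after the point.
    return len(digits), -exponent


example = infer_precision_scale(decimal.Decimal("0.0042"))
```

A negative scale falls out naturally for values such as `Decimal("1E+2")`, which matches the negative-scale dtypes exercised in the decimal binop tests.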
diff --git a/python/cudf/cudf/core/series.py b/python/cudf/cudf/core/series.py
index 71a4a48a07a..55fd510f03a 100644
--- a/python/cudf/cudf/core/series.py
+++ b/python/cudf/cudf/core/series.py
@@ -36,6 +36,7 @@
 )
 from cudf.core.column.lists import ListMethods
 from cudf.core.column.string import StringMethods
+from cudf.core.column.struct import StructMethods
 from cudf.core.column_accessor import ColumnAccessor
 from cudf.core.frame import Frame, _drop_rows_by_labels
 from cudf.core.groupby.groupby import SeriesGroupBy
@@ -2675,6 +2676,11 @@ def str(self):
     def list(self):
         return ListMethods(column=self._column, parent=self)
 
+    @copy_docstring(StructMethods.__init__)  # type: ignore
+    @property
+    def struct(self):
+        return StructMethods(column=self._column, parent=self)
+
     @property
     def dtype(self):
         """dtype of the Series"""
@@ -3923,6 +3929,110 @@ def replace(
 
         return self._mimic_inplace(result, inplace=inplace)
 
+    def update(self, other):
+        """
+        Modify Series in place using values from passed Series.
+        Uses non-NA values from passed Series to make updates. Aligns
+        on index.
+
+        Parameters
+        ----------
+        other : Series, or object coercible into Series
+
+        Examples
+        --------
+        >>> import cudf
+        >>> s = cudf.Series([1, 2, 3])
+        >>> s
+        0    1
+        1    2
+        2    3
+        dtype: int64
+        >>> s.update(cudf.Series([4, 5, 6]))
+        >>> s
+        0    4
+        1    5
+        2    6
+        dtype: int64
+        >>> s = cudf.Series(['a', 'b', 'c'])
+        >>> s
+        0    a
+        1    b
+        2    c
+        dtype: object
+        >>> s.update(cudf.Series(['d', 'e'], index=[0, 2]))
+        >>> s
+        0    d
+        1    b
+        2    e
+        dtype: object
+        >>> s = cudf.Series([1, 2, 3])
+        >>> s
+        0    1
+        1    2
+        2    3
+        dtype: int64
+        >>> s.update(cudf.Series([4, 5, 6, 7, 8]))
+        >>> s
+        0    4
+        1    5
+        2    6
+        dtype: int64
+
+        If ``other`` contains NaNs the corresponding values are not updated
+        in the original Series.
+
+        >>> s = cudf.Series([1, 2, 3])
+        >>> s
+        0    1
+        1    2
+        2    3
+        dtype: int64
+        >>> s.update(cudf.Series([4, np.nan, 6], nan_as_null=False))
+        >>> s
+        0    4
+        1    2
+        2    6
+        dtype: int64
+
+        ``other`` can also be a non-Series object type
+        that is coercible into a Series.
+
+        >>> s = cudf.Series([1, 2, 3])
+        >>> s
+        0    1
+        1    2
+        2    3
+        dtype: int64
+        >>> s.update([4, np.nan, 6])
+        >>> s
+        0    4
+        1    2
+        2    6
+        dtype: int64
+        >>> s = cudf.Series([1, 2, 3])
+        >>> s
+        0    1
+        1    2
+        2    3
+        dtype: int64
+        >>> s.update({1: 9})
+        >>> s
+        0    1
+        1    9
+        2    3
+        dtype: int64
+        """
+
+        if not isinstance(other, cudf.Series):
+            other = cudf.Series(other)
+
+        if not self.index.equals(other.index):
+            other = other.reindex(index=self.index)
+        mask = other.notna()
+
+        self.mask(mask, other, inplace=True)
+
     def reverse(self):
         """
         Reverse the Series
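
Reviewer sketch (not part of the patch): `Series.update` reindexes `other` onto the caller's index, builds a not-null mask, and overwrites only the masked positions. The same three steps are shown below on plain Python data so they can be checked without a GPU; `update_values` is a hypothetical helper, and `None` stands in for a null value.

```python
def update_values(index, values, other):
    """Return `values` updated from `other`, a mapping of label -> value."""
    # Step 1: align `other` on `index`; missing labels become None.
    aligned = [other.get(label) for label in index]
    # Step 2: mark positions where `other` supplies a non-null value.
    has_new = [value is not None for value in aligned]
    # Step 3: take the new value where marked, else keep the old one.
    return [
        new if flag else old
        for old, new, flag in zip(values, aligned, has_new)
    ]


updated = update_values([0, 1, 2], [1, 2, 3], {1: 9})
```

Labels that exist only in `other` are dropped by the alignment step, which matches the docstring example where a five-element `other` leaves a three-element result.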
diff --git a/python/cudf/cudf/tests/test_binops.py b/python/cudf/cudf/tests/test_binops.py
index eb8aaaadd51..ac80071c8e4 100644
--- a/python/cudf/cudf/tests/test_binops.py
+++ b/python/cudf/cudf/tests/test_binops.py
@@ -1615,6 +1615,12 @@ def test_binops_with_NA_consistent(dtype, op):
         assert result._column.null_count == len(data)
 
 
+def _decimal_series(input, dtype):
+    return cudf.Series(
+        [x if x is None else decimal.Decimal(x) for x in input], dtype=dtype,
+    )
+
+
 @pytest.mark.parametrize(
     "args",
     [
@@ -1753,26 +1759,703 @@ def test_binops_with_NA_consistent(dtype, op):
             ["10.0", None],
             cudf.Decimal64Dtype(scale=1, precision=8),
         ),
+        (
+            operator.eq,
+            ["0.18", "0.42"],
+            cudf.Decimal64Dtype(scale=2, precision=3),
+            ["0.18", "0.21"],
+            cudf.Decimal64Dtype(scale=2, precision=3),
+            [True, False],
+            bool,
+        ),
+        (
+            operator.eq,
+            ["0.18", "0.42"],
+            cudf.Decimal64Dtype(scale=2, precision=3),
+            ["0.1800", "0.2100"],
+            cudf.Decimal64Dtype(scale=4, precision=5),
+            [True, False],
+            bool,
+        ),
+        (
+            operator.eq,
+            ["100", None],
+            cudf.Decimal64Dtype(scale=-2, precision=3),
+            ["100", "200"],
+            cudf.Decimal64Dtype(scale=-1, precision=4),
+            [True, None],
+            bool,
+        ),
+        (
+            operator.lt,
+            ["0.18", "0.42", "1.00"],
+            cudf.Decimal64Dtype(scale=2, precision=3),
+            ["0.10", "0.87", "1.00"],
+            cudf.Decimal64Dtype(scale=2, precision=3),
+            [False, True, False],
+            bool,
+        ),
+        (
+            operator.lt,
+            ["0.18", "0.42", "1.00"],
+            cudf.Decimal64Dtype(scale=2, precision=3),
+            ["0.1000", "0.8700", "1.0000"],
+            cudf.Decimal64Dtype(scale=4, precision=5),
+            [False, True, False],
+            bool,
+        ),
+        (
+            operator.lt,
+            ["200", None, "100"],
+            cudf.Decimal64Dtype(scale=-2, precision=3),
+            ["100", "200", "100"],
+            cudf.Decimal64Dtype(scale=-1, precision=4),
+            [False, None, False],
+            bool,
+        ),
+        (
+            operator.gt,
+            ["0.18", "0.42", "1.00"],
+            cudf.Decimal64Dtype(scale=2, precision=3),
+            ["0.10", "0.87", "1.00"],
+            cudf.Decimal64Dtype(scale=2, precision=3),
+            [True, False, False],
+            bool,
+        ),
+        (
+            operator.gt,
+            ["0.18", "0.42", "1.00"],
+            cudf.Decimal64Dtype(scale=2, precision=3),
+            ["0.1000", "0.8700", "1.0000"],
+            cudf.Decimal64Dtype(scale=4, precision=5),
+            [True, False, False],
+            bool,
+        ),
+        (
+            operator.gt,
+            ["300", None, "100"],
+            cudf.Decimal64Dtype(scale=-2, precision=3),
+            ["100", "200", "100"],
+            cudf.Decimal64Dtype(scale=-1, precision=4),
+            [True, None, False],
+            bool,
+        ),
+        (
+            operator.le,
+            ["0.18", "0.42", "1.00"],
+            cudf.Decimal64Dtype(scale=2, precision=3),
+            ["0.10", "0.87", "1.00"],
+            cudf.Decimal64Dtype(scale=2, precision=3),
+            [False, True, True],
+            bool,
+        ),
+        (
+            operator.le,
+            ["0.18", "0.42", "1.00"],
+            cudf.Decimal64Dtype(scale=2, precision=3),
+            ["0.1000", "0.8700", "1.0000"],
+            cudf.Decimal64Dtype(scale=4, precision=5),
+            [False, True, True],
+            bool,
+        ),
+        (
+            operator.le,
+            ["300", None, "100"],
+            cudf.Decimal64Dtype(scale=-2, precision=3),
+            ["100", "200", "100"],
+            cudf.Decimal64Dtype(scale=-1, precision=4),
+            [False, None, True],
+            bool,
+        ),
+        (
+            operator.ge,
+            ["0.18", "0.42", "1.00"],
+            cudf.Decimal64Dtype(scale=2, precision=3),
+            ["0.10", "0.87", "1.00"],
+            cudf.Decimal64Dtype(scale=2, precision=3),
+            [True, False, True],
+            bool,
+        ),
+        (
+            operator.ge,
+            ["0.18", "0.42", "1.00"],
+            cudf.Decimal64Dtype(scale=2, precision=3),
+            ["0.1000", "0.8700", "1.0000"],
+            cudf.Decimal64Dtype(scale=4, precision=5),
+            [True, False, True],
+            bool,
+        ),
+        (
+            operator.ge,
+            ["300", None, "100"],
+            cudf.Decimal64Dtype(scale=-2, precision=3),
+            ["100", "200", "100"],
+            cudf.Decimal64Dtype(scale=-1, precision=4),
+            [True, None, True],
+            bool,
+        ),
     ],
 )
 def test_binops_decimal(args):
     op, lhs, l_dtype, rhs, r_dtype, expect, expect_dtype = args
 
+    a = _decimal_series(lhs, l_dtype)
+    b = _decimal_series(rhs, r_dtype)
+    expect = (
+        _decimal_series(expect, expect_dtype)
+        if isinstance(expect_dtype, cudf.Decimal64Dtype)
+        else cudf.Series(expect, dtype=expect_dtype)
+    )
+
+    got = op(a, b)
+    assert expect.dtype == got.dtype
+    utils.assert_eq(expect, got)
+
+
+@pytest.mark.parametrize(
+    "args",
+    [
+        (
+            operator.eq,
+            ["100", "41", None],
+            cudf.Decimal64Dtype(scale=0, precision=5),
+            [100, 42, 12],
+            cudf.Series([True, False, None], dtype=bool),
+            cudf.Series([True, False, None], dtype=bool),
+        ),
+        (
+            operator.eq,
+            ["100.000", "42.001", None],
+            cudf.Decimal64Dtype(scale=3, precision=6),
+            [100, 42, 12],
+            cudf.Series([True, False, None], dtype=bool),
+            cudf.Series([True, False, None], dtype=bool),
+        ),
+        (
+            operator.eq,
+            ["100", "40", None],
+            cudf.Decimal64Dtype(scale=-1, precision=3),
+            [100, 42, 12],
+            cudf.Series([True, False, None], dtype=bool),
+            cudf.Series([True, False, None], dtype=bool),
+        ),
+        (
+            operator.lt,
+            ["100", "40", "28", None],
+            cudf.Decimal64Dtype(scale=0, precision=3),
+            [100, 42, 24, 12],
+            cudf.Series([False, True, False, None], dtype=bool),
+            cudf.Series([False, False, True, None], dtype=bool),
+        ),
+        (
+            operator.lt,
+            ["100.000", "42.002", "23.999", None],
+            cudf.Decimal64Dtype(scale=3, precision=6),
+            [100, 42, 24, 12],
+            cudf.Series([False, False, True, None], dtype=bool),
+            cudf.Series([False, True, False, None], dtype=bool),
+        ),
+        (
+            operator.lt,
+            ["100", "40", "10", None],
+            cudf.Decimal64Dtype(scale=-1, precision=3),
+            [100, 42, 8, 12],
+            cudf.Series([False, True, False, None], dtype=bool),
+            cudf.Series([False, False, True, None], dtype=bool),
+        ),
+        (
+            operator.gt,
+            ["100", "42", "20", None],
+            cudf.Decimal64Dtype(scale=0, precision=3),
+            [100, 40, 24, 12],
+            cudf.Series([False, True, False, None], dtype=bool),
+            cudf.Series([False, False, True, None], dtype=bool),
+        ),
+        (
+            operator.gt,
+            ["100.000", "42.002", "23.999", None],
+            cudf.Decimal64Dtype(scale=3, precision=6),
+            [100, 42, 24, 12],
+            cudf.Series([False, True, False, None], dtype=bool),
+            cudf.Series([False, False, True, None], dtype=bool),
+        ),
+        (
+            operator.gt,
+            ["100", "40", "10", None],
+            cudf.Decimal64Dtype(scale=-1, precision=3),
+            [100, 42, 8, 12],
+            cudf.Series([False, False, True, None], dtype=bool),
+            cudf.Series([False, True, False, None], dtype=bool),
+        ),
+        (
+            operator.le,
+            ["100", "40", "28", None],
+            cudf.Decimal64Dtype(scale=0, precision=3),
+            [100, 42, 24, 12],
+            cudf.Series([True, True, False, None], dtype=bool),
+            cudf.Series([True, False, True, None], dtype=bool),
+        ),
+        (
+            operator.le,
+            ["100.000", "42.002", "23.999", None],
+            cudf.Decimal64Dtype(scale=3, precision=6),
+            [100, 42, 24, 12],
+            cudf.Series([True, False, True, None], dtype=bool),
+            cudf.Series([True, True, False, None], dtype=bool),
+        ),
+        (
+            operator.le,
+            ["100", "40", "10", None],
+            cudf.Decimal64Dtype(scale=-1, precision=3),
+            [100, 42, 8, 12],
+            cudf.Series([True, True, False, None], dtype=bool),
+            cudf.Series([True, False, True, None], dtype=bool),
+        ),
+        (
+            operator.ge,
+            ["100", "42", "20", None],
+            cudf.Decimal64Dtype(scale=0, precision=3),
+            [100, 40, 24, 12],
+            cudf.Series([True, True, False, None], dtype=bool),
+            cudf.Series([True, False, True, None], dtype=bool),
+        ),
+        (
+            operator.ge,
+            ["100.000", "42.002", "23.999", None],
+            cudf.Decimal64Dtype(scale=3, precision=6),
+            [100, 42, 24, 12],
+            cudf.Series([True, True, False, None], dtype=bool),
+            cudf.Series([True, False, True, None], dtype=bool),
+        ),
+        (
+            operator.ge,
+            ["100", "40", "10", None],
+            cudf.Decimal64Dtype(scale=-1, precision=3),
+            [100, 42, 8, 12],
+            cudf.Series([True, False, True, None], dtype=bool),
+            cudf.Series([True, True, False, None], dtype=bool),
+        ),
+    ],
+)
+@pytest.mark.parametrize("integer_dtype", cudf.tests.utils.INTEGER_TYPES)
+@pytest.mark.parametrize("reflected", [True, False])
+def test_binops_decimal_comp_mixed_integer(args, integer_dtype, reflected):
+    """
+    Tested compare operations:
+        eq, lt, gt, le, ge
+    Each operation has 3 decimal data setups, with scale from {==0, >0, <0}.
+    Decimal precisions are sufficient to hold the digits.
+    For each decimal data setup, there is at least one row that leads to each
+    of the following compare results: {True, False, None}.
+    """
+    if not reflected:
+        op, ldata, ldtype, rdata, expected, _ = args
+    else:
+        op, ldata, ldtype, rdata, _, expected = args
+
+    lhs = _decimal_series(ldata, ldtype)
+    rhs = cudf.Series(rdata, dtype=integer_dtype)
+
+    if reflected:
+        rhs, lhs = lhs, rhs
+
+    actual = op(lhs, rhs)
+
+    utils.assert_eq(expected, actual)
+
+
+@pytest.mark.parametrize(
+    "args",
+    [
+        (
+            operator.add,
+            ["100", "200"],
+            cudf.Decimal64Dtype(scale=-2, precision=3),
+            decimal.Decimal(1),
+            ["101", "201"],
+            cudf.Decimal64Dtype(scale=0, precision=6),
+            False,
+        ),
+        (
+            operator.add,
+            ["100", "200"],
+            cudf.Decimal64Dtype(scale=-2, precision=3),
+            1,
+            ["101", "201"],
+            cudf.Decimal64Dtype(scale=0, precision=6),
+            False,
+        ),
+        (
+            operator.add,
+            ["100", "200"],
+            cudf.Decimal64Dtype(scale=-2, precision=3),
+            decimal.Decimal("1.5"),
+            ["101.5", "201.5"],
+            cudf.Decimal64Dtype(scale=1, precision=7),
+            False,
+        ),
+        (
+            operator.add,
+            ["100", "200"],
+            cudf.Decimal64Dtype(scale=-2, precision=3),
+            cudf.Scalar(decimal.Decimal("1.5")),
+            ["101.5", "201.5"],
+            cudf.Decimal64Dtype(scale=1, precision=7),
+            False,
+        ),
+        (
+            operator.add,
+            ["100", "200"],
+            cudf.Decimal64Dtype(scale=-2, precision=3),
+            decimal.Decimal(1),
+            ["101", "201"],
+            cudf.Decimal64Dtype(scale=0, precision=6),
+            True,
+        ),
+        (
+            operator.add,
+            ["100", "200"],
+            cudf.Decimal64Dtype(scale=-2, precision=3),
+            1,
+            ["101", "201"],
+            cudf.Decimal64Dtype(scale=0, precision=6),
+            True,
+        ),
+        (
+            operator.add,
+            ["100", "200"],
+            cudf.Decimal64Dtype(scale=-2, precision=3),
+            decimal.Decimal("1.5"),
+            ["101.5", "201.5"],
+            cudf.Decimal64Dtype(scale=1, precision=7),
+            True,
+        ),
+        (
+            operator.add,
+            ["100", "200"],
+            cudf.Decimal64Dtype(scale=-2, precision=3),
+            cudf.Scalar(decimal.Decimal("1.5")),
+            ["101.5", "201.5"],
+            cudf.Decimal64Dtype(scale=1, precision=7),
+            True,
+        ),
+        (
+            operator.mul,
+            ["100", "200"],
+            cudf.Decimal64Dtype(scale=-2, precision=3),
+            1,
+            ["100", "200"],
+            cudf.Decimal64Dtype(scale=-2, precision=5),
+            False,
+        ),
+        (
+            operator.mul,
+            ["100", "200"],
+            cudf.Decimal64Dtype(scale=-2, precision=3),
+            decimal.Decimal(2),
+            ["200", "400"],
+            cudf.Decimal64Dtype(scale=-2, precision=5),
+            False,
+        ),
+        (
+            operator.mul,
+            ["100", "200"],
+            cudf.Decimal64Dtype(scale=-2, precision=3),
+            decimal.Decimal("1.5"),
+            ["150", "300"],
+            cudf.Decimal64Dtype(scale=-1, precision=6),
+            False,
+        ),
+        (
+            operator.mul,
+            ["100", "200"],
+            cudf.Decimal64Dtype(scale=-2, precision=3),
+            cudf.Scalar(decimal.Decimal("1.5")),
+            ["150", "300"],
+            cudf.Decimal64Dtype(scale=-1, precision=6),
+            False,
+        ),
+        (
+            operator.mul,
+            ["100", "200"],
+            cudf.Decimal64Dtype(scale=-2, precision=3),
+            1,
+            ["100", "200"],
+            cudf.Decimal64Dtype(scale=-2, precision=5),
+            True,
+        ),
+        (
+            operator.mul,
+            ["100", "200"],
+            cudf.Decimal64Dtype(scale=-2, precision=3),
+            decimal.Decimal(2),
+            ["200", "400"],
+            cudf.Decimal64Dtype(scale=-2, precision=5),
+            True,
+        ),
+        (
+            operator.mul,
+            ["100", "200"],
+            cudf.Decimal64Dtype(scale=-2, precision=3),
+            decimal.Decimal("1.5"),
+            ["150", "300"],
+            cudf.Decimal64Dtype(scale=-1, precision=6),
+            True,
+        ),
+        (
+            operator.mul,
+            ["100", "200"],
+            cudf.Decimal64Dtype(scale=-2, precision=3),
+            cudf.Scalar(decimal.Decimal("1.5")),
+            ["150", "300"],
+            cudf.Decimal64Dtype(scale=-1, precision=6),
+            True,
+        ),
+        (
+            operator.sub,
+            ["100", "200"],
+            cudf.Decimal64Dtype(scale=-2, precision=3),
+            decimal.Decimal(2),
+            ["98", "198"],
+            cudf.Decimal64Dtype(scale=0, precision=6),
+            False,
+        ),
+        (
+            operator.sub,
+            ["100", "200"],
+            cudf.Decimal64Dtype(scale=-2, precision=3),
+            decimal.Decimal("2.5"),
+            ["97.5", "197.5"],
+            cudf.Decimal64Dtype(scale=1, precision=7),
+            False,
+        ),
+        (
+            operator.sub,
+            ["100", "200"],
+            cudf.Decimal64Dtype(scale=-2, precision=3),
+            4,
+            ["96", "196"],
+            cudf.Decimal64Dtype(scale=0, precision=6),
+            False,
+        ),
+        (
+            operator.sub,
+            ["100", "200"],
+            cudf.Decimal64Dtype(scale=-2, precision=3),
+            cudf.Scalar(decimal.Decimal("2.5")),
+            ["97.5", "197.5"],
+            cudf.Decimal64Dtype(scale=1, precision=7),
+            False,
+        ),
+        (
+            operator.sub,
+            ["100", "200"],
+            cudf.Decimal64Dtype(scale=-2, precision=3),
+            decimal.Decimal(2),
+            ["-98", "-198"],
+            cudf.Decimal64Dtype(scale=0, precision=6),
+            True,
+        ),
+        (
+            operator.sub,
+            ["100", "200"],
+            cudf.Decimal64Dtype(scale=-2, precision=3),
+            4,
+            ["-96", "-196"],
+            cudf.Decimal64Dtype(scale=0, precision=6),
+            True,
+        ),
+        (
+            operator.sub,
+            ["100", "200"],
+            cudf.Decimal64Dtype(scale=-2, precision=3),
+            decimal.Decimal("2.5"),
+            ["-97.5", "-197.5"],
+            cudf.Decimal64Dtype(scale=1, precision=7),
+            True,
+        ),
+        (
+            operator.sub,
+            ["100", "200"],
+            cudf.Decimal64Dtype(scale=-2, precision=3),
+            cudf.Scalar(decimal.Decimal("2.5")),
+            ["-97.5", "-197.5"],
+            cudf.Decimal64Dtype(scale=1, precision=7),
+            True,
+        ),
+    ],
+)
+def test_binops_decimal_scalar(args):
+    op, lhs, l_dtype, rhs, expect, expect_dtype, reflect = args
+
     def decimal_series(input, dtype):
         return cudf.Series(
             [x if x is None else decimal.Decimal(x) for x in input],
             dtype=dtype,
         )
 
-    a = decimal_series(lhs, l_dtype)
-    b = decimal_series(rhs, r_dtype)
+    lhs = decimal_series(lhs, l_dtype)
     expect = decimal_series(expect, expect_dtype)
 
-    got = op(a, b)
+    if reflect:
+        lhs, rhs = rhs, lhs
+
+    got = op(lhs, rhs)
     assert expect.dtype == got.dtype
     utils.assert_eq(expect, got)
 
 
+@pytest.mark.parametrize(
+    "args",
+    [
+        (
+            operator.eq,
+            ["100.00", "41", None],
+            cudf.Decimal64Dtype(scale=0, precision=5),
+            100,
+            cudf.Series([True, False, None], dtype=bool),
+            cudf.Series([True, False, None], dtype=bool),
+        ),
+        (
+            operator.eq,
+            ["100.123", "41", None],
+            cudf.Decimal64Dtype(scale=3, precision=6),
+            decimal.Decimal("100.123"),
+            cudf.Series([True, False, None], dtype=bool),
+            cudf.Series([True, False, None], dtype=bool),
+        ),
+        (
+            operator.eq,
+            ["100.123", "41", None],
+            cudf.Decimal64Dtype(scale=3, precision=6),
+            cudf.Scalar(decimal.Decimal("100.123")),
+            cudf.Series([True, False, None], dtype=bool),
+            cudf.Series([True, False, None], dtype=bool),
+        ),
+        (
+            operator.gt,
+            ["100.00", "41", "120.21", None],
+            cudf.Decimal64Dtype(scale=2, precision=5),
+            100,
+            cudf.Series([False, False, True, None], dtype=bool),
+            cudf.Series([False, True, False, None], dtype=bool),
+        ),
+        (
+            operator.gt,
+            ["100.123", "41", "120.21", None],
+            cudf.Decimal64Dtype(scale=3, precision=6),
+            decimal.Decimal("100.123"),
+            cudf.Series([False, False, True, None], dtype=bool),
+            cudf.Series([False, True, False, None], dtype=bool),
+        ),
+        (
+            operator.gt,
+            ["100.123", "41", "120.21", None],
+            cudf.Decimal64Dtype(scale=3, precision=6),
+            cudf.Scalar(decimal.Decimal("100.123")),
+            cudf.Series([False, False, True, None], dtype=bool),
+            cudf.Series([False, True, False, None], dtype=bool),
+        ),
+        (
+            operator.ge,
+            ["100.00", "41", "120.21", None],
+            cudf.Decimal64Dtype(scale=2, precision=5),
+            100,
+            cudf.Series([True, False, True, None], dtype=bool),
+            cudf.Series([True, True, False, None], dtype=bool),
+        ),
+        (
+            operator.ge,
+            ["100.123", "41", "120.21", None],
+            cudf.Decimal64Dtype(scale=3, precision=6),
+            decimal.Decimal("100.123"),
+            cudf.Series([True, False, True, None], dtype=bool),
+            cudf.Series([True, True, False, None], dtype=bool),
+        ),
+        (
+            operator.ge,
+            ["100.123", "41", "120.21", None],
+            cudf.Decimal64Dtype(scale=3, precision=6),
+            cudf.Scalar(decimal.Decimal("100.123")),
+            cudf.Series([True, False, True, None], dtype=bool),
+            cudf.Series([True, True, False, None], dtype=bool),
+        ),
+        (
+            operator.lt,
+            ["100.00", "41", "120.21", None],
+            cudf.Decimal64Dtype(scale=2, precision=5),
+            100,
+            cudf.Series([False, True, False, None], dtype=bool),
+            cudf.Series([False, False, True, None], dtype=bool),
+        ),
+        (
+            operator.lt,
+            ["100.123", "41", "120.21", None],
+            cudf.Decimal64Dtype(scale=3, precision=6),
+            decimal.Decimal("100.123"),
+            cudf.Series([False, True, False, None], dtype=bool),
+            cudf.Series([False, False, True, None], dtype=bool),
+        ),
+        (
+            operator.lt,
+            ["100.123", "41", "120.21", None],
+            cudf.Decimal64Dtype(scale=3, precision=6),
+            cudf.Scalar(decimal.Decimal("100.123")),
+            cudf.Series([False, True, False, None], dtype=bool),
+            cudf.Series([False, False, True, None], dtype=bool),
+        ),
+        (
+            operator.le,
+            ["100.00", "41", "120.21", None],
+            cudf.Decimal64Dtype(scale=2, precision=5),
+            100,
+            cudf.Series([True, True, False, None], dtype=bool),
+            cudf.Series([True, False, True, None], dtype=bool),
+        ),
+        (
+            operator.le,
+            ["100.123", "41", "120.21", None],
+            cudf.Decimal64Dtype(scale=3, precision=6),
+            decimal.Decimal("100.123"),
+            cudf.Series([True, True, False, None], dtype=bool),
+            cudf.Series([True, False, True, None], dtype=bool),
+        ),
+        (
+            operator.le,
+            ["100.123", "41", "120.21", None],
+            cudf.Decimal64Dtype(scale=3, precision=6),
+            cudf.Scalar(decimal.Decimal("100.123")),
+            cudf.Series([True, True, False, None], dtype=bool),
+            cudf.Series([True, False, True, None], dtype=bool),
+        ),
+    ],
+)
+@pytest.mark.parametrize("reflected", [True, False])
+def test_binops_decimal_scalar_compare(args, reflected):
+    """
+    Tested compare operations:
+        eq, lt, gt, le, ge
+    Each operation has 3 data setups: Python int, decimal.Decimal, and
+    decimal cudf.Scalar.
+    For each data setup, there is at least one row that leads to each of the
+    following compare results: {True, False, None}.
+    """
+    if not reflected:
+        op, ldata, ldtype, rdata, expected, _ = args
+    else:
+        op, ldata, ldtype, rdata, _, expected = args
+
+    lhs = _decimal_series(ldata, ldtype)
+    rhs = rdata
+
+    if reflected:
+        rhs, lhs = lhs, rhs
+
+    actual = op(lhs, rhs)
+
+    utils.assert_eq(expected, actual)
+
+
 @pytest.mark.parametrize(
     "dtype",
     [
diff --git a/python/cudf/cudf/tests/test_column.py b/python/cudf/cudf/tests/test_column.py
index 9509cabc117..51745c9bd03 100644
--- a/python/cudf/cudf/tests/test_column.py
+++ b/python/cudf/cudf/tests/test_column.py
@@ -54,7 +54,7 @@ def test_column_offset_and_size(pandas_input, offset, size):
     if cudf.utils.dtypes.is_categorical_dtype(col.dtype):
         assert col.size == col.codes.size
         assert col.size == (col.codes.data.size / col.codes.dtype.itemsize)
-    elif pd.api.types.is_string_dtype(col.dtype):
+    elif cudf.utils.dtypes.is_string_dtype(col.dtype):
         if col.size > 0:
             assert col.size == (col.children[0].size - 1)
             assert col.size == (
@@ -79,6 +79,49 @@ def test_column_offset_and_size(pandas_input, offset, size):
     assert_eq(expect, got)
 
 
+def column_slicing_test(col, offset, size, cast_to_float=False):
+    sl = slice(offset, offset + size)
+    col_slice = col[sl]
+    series = cudf.Series(col)
+    sliced_series = cudf.Series(col_slice)
+
+    if cast_to_float:
+        pd_series = series.astype(float).to_pandas()
+        sliced_series = sliced_series.astype(float)
+    else:
+        pd_series = series.to_pandas()
+
+    if cudf.utils.dtypes.is_categorical_dtype(col.dtype):
+        # The cudf.Series is constructed from an already sliced column, whereas
+        # the pandas.Series is constructed from the unsliced series and then
+        # sliced, so the indexes will differ and we must ignore them.
+        # However, we must compare these as frames, not raw arrays, because
+        # numpy comparison of categorical values won't work.
+        assert_eq(
+            pd_series[sl].reset_index(drop=True),
+            sliced_series.reset_index(drop=True),
+        )
+    else:
+        assert_eq(np.asarray(pd_series[sl]), sliced_series.to_array())
+
+
+@pytest.mark.parametrize("offset", [0, 1, 15])
+@pytest.mark.parametrize("size", [50, 10, 0])
+def test_column_slicing(pandas_input, offset, size):
+    col = cudf.core.column.as_column(pandas_input)
+    column_slicing_test(col, offset, size)
+
+
+@pytest.mark.parametrize("offset", [0, 1, 15])
+@pytest.mark.parametrize("size", [50, 10, 0])
+@pytest.mark.parametrize("precision", [2, 3, 5])
+@pytest.mark.parametrize("scale", [0, 1, 2])
+def test_decimal_column_slicing(offset, size, precision, scale):
+    col = cudf.core.column.as_column(pd.Series(np.random.rand(1000)))
+    col = col.astype(cudf.Decimal64Dtype(precision, scale))
+    column_slicing_test(col, offset, size, True)
+
+
 @pytest.mark.parametrize(
     "data",
     [
diff --git a/python/cudf/cudf/tests/test_contains.py b/python/cudf/cudf/tests/test_contains.py
index 4737faf65a4..b669c40022e 100644
--- a/python/cudf/cudf/tests/test_contains.py
+++ b/python/cudf/cudf/tests/test_contains.py
@@ -1,11 +1,17 @@
 from datetime import datetime as dt
 
+import numpy as np
 import pandas as pd
 import pytest
 
 from cudf import Series
 from cudf.core.index import RangeIndex, as_index
-from cudf.tests.utils import assert_eq
+from cudf.tests.utils import (
+    DATETIME_TYPES,
+    NUMERIC_TYPES,
+    TIMEDELTA_TYPES,
+    assert_eq,
+)
 
 
 def cudf_date_series(start, stop, freq):
@@ -72,3 +78,43 @@ def test_index_contains(values, item, expected):
 def test_rangeindex_contains():
     assert_eq(True, 9 in RangeIndex(start=0, stop=10, name="Index"))
     assert_eq(False, 10 in RangeIndex(start=0, stop=10, name="Index"))
+
+
+@pytest.mark.parametrize("dtype", NUMERIC_TYPES)
+def test_lists_contains(dtype):
+    dtype = np.dtype(dtype)
+    inner_data = np.array([1, 2, 3], dtype=dtype)
+
+    data = Series([inner_data])
+
+    contained_scalar = inner_data.dtype.type(2)
+    not_contained_scalar = inner_data.dtype.type(42)
+
+    assert data.list.contains(contained_scalar)[0]
+    assert not data.list.contains(not_contained_scalar)[0]
+
+
+@pytest.mark.parametrize("dtype", DATETIME_TYPES + TIMEDELTA_TYPES)
+def test_lists_contains_datetime(dtype):
+    dtype = np.dtype(dtype)
+    inner_data = np.array([1, 2, 3])
+
+    unit, _ = np.datetime_data(dtype)
+
+    data = Series([inner_data])
+
+    contained_scalar = inner_data.dtype.type(2)
+    not_contained_scalar = inner_data.dtype.type(42)
+
+    assert data.list.contains(contained_scalar)[0]
+    assert not data.list.contains(not_contained_scalar)[0]
+
+
+def test_lists_contains_bool():
+    data = Series([[True, True, True]])
+
+    contained_scalar = True
+    not_contained_scalar = False
+
+    assert data.list.contains(contained_scalar)[0]
+    assert not data.list.contains(not_contained_scalar)[0]
diff --git a/python/cudf/cudf/tests/test_dataframe.py b/python/cudf/cudf/tests/test_dataframe.py
index d72b88f1713..f068d02d575 100644
--- a/python/cudf/cudf/tests/test_dataframe.py
+++ b/python/cudf/cudf/tests/test_dataframe.py
@@ -8215,9 +8215,6 @@ def test_agg_for_dataframe_with_string_columns(aggs):
 @pytest.mark.parametrize(
     "overwrite", [True, False],
 )
-@pytest.mark.parametrize(
-    "filter_func", [None],
-)
 @pytest.mark.parametrize(
     "errors", ["ignore"],
 )
@@ -8262,19 +8259,17 @@ def test_agg_for_dataframe_with_string_columns(aggs):
         },
     ],
 )
-def test_update_for_dataframes(
-    data, data2, join, overwrite, filter_func, errors
-):
+def test_update_for_dataframes(data, data2, join, overwrite, errors):
     pdf = pd.DataFrame(data)
     gdf = cudf.DataFrame(data)
 
     other_pd = pd.DataFrame(data2)
     other_gd = cudf.DataFrame(data2)
 
-    expect = pdf.update(other_pd, join, overwrite, filter_func, errors)
-    got = gdf.update(other_gd, join, overwrite, filter_func, errors)
+    pdf.update(other=other_pd, join=join, overwrite=overwrite, errors=errors)
+    gdf.update(other=other_gd, join=join, overwrite=overwrite, errors=errors)
 
-    assert_eq(expect, got)
+    assert_eq(pdf, gdf, check_dtype=False)
 
 
 @pytest.mark.parametrize(
diff --git a/python/cudf/cudf/tests/test_decimal.py b/python/cudf/cudf/tests/test_decimal.py
index 80ff9d5734c..70fc63baba8 100644
--- a/python/cudf/cudf/tests/test_decimal.py
+++ b/python/cudf/cudf/tests/test_decimal.py
@@ -5,15 +5,14 @@
 import numpy as np
 import pyarrow as pa
 import pytest
-import cudf
 
-from cudf.core.dtypes import Decimal64Dtype
+import cudf
 from cudf.core.column import DecimalColumn, NumericalColumn
-
+from cudf.core.dtypes import Decimal64Dtype
 from cudf.tests.utils import (
-    NUMERIC_TYPES,
     FLOAT_TYPES,
     INTEGER_TYPES,
+    NUMERIC_TYPES,
     assert_eq,
 )
 
@@ -88,7 +87,6 @@ def test_typecast_from_float_to_decimal(data, from_dtype, to_dtype):
     got = got.astype(to_dtype)
 
     assert_eq(got, expected)
-    assert_eq(got.dtype, expected.dtype)
 
 
 @pytest.mark.parametrize(
@@ -129,7 +127,6 @@ def test_typecast_from_int_to_decimal(data, from_dtype, to_dtype):
     got = got.astype(to_dtype)
 
     assert_eq(got, expected)
-    assert_eq(got.dtype, expected.dtype)
 
 
 @pytest.mark.parametrize(
@@ -170,7 +167,6 @@ def test_typecast_to_from_decimal(data, from_dtype, to_dtype):
     got = got.astype(to_dtype)
 
     assert_eq(got, expected)
-    assert_eq(got.dtype, expected.dtype)
 
 
 @pytest.mark.parametrize(
@@ -205,4 +201,3 @@ def test_typecast_from_decimal(data, from_dtype, to_dtype):
     expected = cudf.Series(NumericalColumn.from_arrow(pa_arr))
 
     assert_eq(got, expected)
-    assert_eq(got.dtype, expected.dtype)
diff --git a/python/cudf/cudf/tests/test_groupby.py b/python/cudf/cudf/tests/test_groupby.py
index a96db59dee3..4dbe608af82 100644
--- a/python/cudf/cudf/tests/test_groupby.py
+++ b/python/cudf/cudf/tests/test_groupby.py
@@ -2,6 +2,7 @@
 
 import datetime
 import itertools
+from decimal import Decimal
 
 import numpy as np
 import pandas as pd
@@ -9,6 +10,8 @@
 from numba import cuda
 from numpy.testing import assert_array_equal
 
+import rmm
+
 import cudf
 from cudf.core import DataFrame, Series
 from cudf.core._compat import PANDAS_GE_110
@@ -27,6 +30,16 @@
 _index_type_aggs = {"count", "idxmin", "idxmax"}
 
 
+def assert_groupby_results_equal(expect, got, sort=True, **kwargs):
+    # Because we don't sort by index by default in groupby,
+    # sort expect and got by index before comparing
+    if sort:
+        expect = expect.sort_index()
+        got = got.sort_index()
+
+    assert_eq(expect, got, **kwargs)
+
+
 def make_frame(
     dataframe_class,
     nelem,
@@ -73,17 +86,11 @@ def pdf(gdf):
 
 @pytest.mark.parametrize("nelem", [2, 3, 100, 1000])
 def test_groupby_mean(nelem):
-    got_df = (
-        make_frame(DataFrame, nelem=nelem)
-        .groupby(["x", "y"], sort=True)
-        .mean()
-    )
+    got_df = make_frame(DataFrame, nelem=nelem).groupby(["x", "y"]).mean()
     expect_df = (
-        make_frame(pd.DataFrame, nelem=nelem)
-        .groupby(["x", "y"], sort=True)
-        .mean()
+        make_frame(pd.DataFrame, nelem=nelem).groupby(["x", "y"]).mean()
     )
-    assert_eq(got_df, expect_df)
+    assert_groupby_results_equal(got_df, expect_df)
 
 
 @pytest.mark.parametrize("nelem", [2, 3, 100, 1000])
@@ -92,87 +99,67 @@ def test_groupby_mean_3level(nelem):
     bys = list("xyz")
     got_df = (
         make_frame(DataFrame, nelem=nelem, extra_levels=lvls)
-        .groupby(bys, sort=True)
+        .groupby(bys)
         .mean()
     )
     expect_df = (
         make_frame(pd.DataFrame, nelem=nelem, extra_levels=lvls)
-        .groupby(bys, sort=True)
+        .groupby(bys)
         .mean()
     )
-    assert_eq(got_df, expect_df)
+    assert_groupby_results_equal(got_df, expect_df)
 
 
 @pytest.mark.parametrize("nelem", [2, 3, 100, 1000])
 def test_groupby_agg_mean_min(nelem):
     got_df = (
         make_frame(DataFrame, nelem=nelem)
-        .groupby(["x", "y"], sort=True)
+        .groupby(["x", "y"])
         .agg(["mean", "min"])
     )
     expect_df = (
         make_frame(pd.DataFrame, nelem=nelem)
-        .groupby(["x", "y"], sort=True)
+        .groupby(["x", "y"])
         .agg(["mean", "min"])
     )
-    assert_eq(got_df, expect_df)
+    assert_groupby_results_equal(got_df, expect_df)
 
 
 @pytest.mark.parametrize("nelem", [2, 3, 100, 1000])
 def test_groupby_agg_min_max_dictargs(nelem):
     expect_df = (
         make_frame(pd.DataFrame, nelem=nelem, extra_vals="ab")
-        .groupby(["x", "y"], sort=True)
+        .groupby(["x", "y"])
         .agg({"a": "min", "b": "max"})
     )
     got_df = (
         make_frame(DataFrame, nelem=nelem, extra_vals="ab")
-        .groupby(["x", "y"], sort=True)
+        .groupby(["x", "y"])
         .agg({"a": "min", "b": "max"})
     )
-    assert_eq(expect_df, got_df)
+    assert_groupby_results_equal(expect_df, got_df)
 
 
 @pytest.mark.parametrize("nelem", [2, 3, 100, 1000])
 def test_groupby_agg_min_max_dictlist(nelem):
     expect_df = (
         make_frame(pd.DataFrame, nelem=nelem, extra_vals="ab")
-        .groupby(["x", "y"], sort=True)
+        .groupby(["x", "y"])
         .agg({"a": ["min", "max"], "b": ["min", "max"]})
     )
     got_df = (
         make_frame(DataFrame, nelem=nelem, extra_vals="ab")
-        .groupby(["x", "y"], sort=True)
+        .groupby(["x", "y"])
         .agg({"a": ["min", "max"], "b": ["min", "max"]})
     )
-    assert_eq(got_df, expect_df)
-
-
-@pytest.mark.parametrize("nelem", [2, 3, 100, 1000])
-@pytest.mark.parametrize(
-    "func", ["mean", "min", "max", "idxmin", "idxmax", "count", "sum"]
-)
-def test_groupby_2keys_agg(nelem, func):
-    # gdf (Note: lack of multiIndex)
-    expect_df = (
-        make_frame(pd.DataFrame, nelem=nelem)
-        .groupby(["x", "y"], sort=True)
-        .agg(func)
-    )
-    got_df = (
-        make_frame(DataFrame, nelem=nelem)
-        .groupby(["x", "y"], sort=True)
-        .agg(func)
-    )
-    check_dtype = False if func in _index_type_aggs else True
-    assert_eq(got_df, expect_df, check_dtype=check_dtype)
+    assert_groupby_results_equal(got_df, expect_df)
 
 
 @pytest.mark.parametrize("as_index", [True, False])
 def test_groupby_as_index_single_agg(pdf, gdf, as_index):
-    gdf = gdf.groupby("y", as_index=as_index, sort=True).agg({"x": "mean"})
-    pdf = pdf.groupby("y", as_index=as_index, sort=True).agg({"x": "mean"})
-    assert_eq(pdf, gdf)
+    gdf = gdf.groupby("y", as_index=as_index).agg({"x": "mean"})
+    pdf = pdf.groupby("y", as_index=as_index).agg({"x": "mean"})
+    assert_groupby_results_equal(pdf, gdf)
 
 
 @pytest.mark.parametrize("as_index", [True, False])
@@ -198,36 +185,33 @@ def test_groupby_as_index_multiindex(pdf, gdf, as_index):
 
 
 def test_groupby_default(pdf, gdf):
-    gdf = gdf.groupby("y", sort=True).agg({"x": "mean"})
-    pdf = pdf.groupby("y", sort=True).agg({"x": "mean"})
-    assert_eq(pdf, gdf)
+    gdf = gdf.groupby("y").agg({"x": "mean"})
+    pdf = pdf.groupby("y").agg({"x": "mean"})
+    assert_groupby_results_equal(pdf, gdf)
 
 
 def test_group_keys_true(pdf, gdf):
-    gdf = gdf.groupby("y", group_keys=True, sort=True).sum()
-    pdf = pdf.groupby("y", group_keys=True, sort=True).sum()
-    assert_eq(pdf, gdf)
+    gdf = gdf.groupby("y", group_keys=True).sum()
+    pdf = pdf.groupby("y", group_keys=True).sum()
+    assert_groupby_results_equal(pdf, gdf)
 
 
 @pytest.mark.parametrize("as_index", [True, False])
 def test_groupby_getitem_getattr(as_index):
     pdf = pd.DataFrame({"x": [1, 3, 1], "y": [1, 2, 3], "z": [1, 4, 5]})
     gdf = cudf.from_pandas(pdf)
-    assert_eq(
-        pdf.groupby("x", sort=True)["y"].sum(),
-        gdf.groupby("x", sort=True)["y"].sum(),
+    assert_groupby_results_equal(
+        pdf.groupby("x")["y"].sum(), gdf.groupby("x")["y"].sum(),
     )
-    assert_eq(
-        pdf.groupby("x", sort=True).y.sum(),
-        gdf.groupby("x", sort=True).y.sum(),
+    assert_groupby_results_equal(
+        pdf.groupby("x").y.sum(), gdf.groupby("x").y.sum(),
     )
-    assert_eq(
-        pdf.groupby("x", sort=True)[["y"]].sum(),
-        gdf.groupby("x", sort=True)[["y"]].sum(),
+    assert_groupby_results_equal(
+        pdf.groupby("x")[["y"]].sum(), gdf.groupby("x")[["y"]].sum(),
     )
-    assert_eq(
-        pdf.groupby(["x", "y"], as_index=as_index, sort=True).sum(),
-        gdf.groupby(["x", "y"], as_index=as_index, sort=True).sum(),
+    assert_groupby_results_equal(
+        pdf.groupby(["x", "y"], as_index=as_index).sum(),
+        gdf.groupby(["x", "y"], as_index=as_index).sum(),
     )
 
 
@@ -277,10 +261,8 @@ def test_groupby_apply():
     df["val1"] = np.random.random(nelem)
     df["val2"] = np.random.random(nelem)
 
-    expect_grpby = df.to_pandas().groupby(
-        ["key1", "key2"], as_index=False, sort=True
-    )
-    got_grpby = df.groupby(["key1", "key2"], sort=True)
+    expect_grpby = df.to_pandas().groupby(["key1", "key2"], as_index=False)
+    got_grpby = df.groupby(["key1", "key2"])
 
     def foo(df):
         df["out"] = df["val1"] + df["val2"]
@@ -288,7 +270,7 @@ def foo(df):
 
     expect = expect_grpby.apply(foo)
     got = got_grpby.apply(foo)
-    assert_eq(expect, got)
+    assert_groupby_results_equal(expect, got)
 
 
 def test_groupby_apply_grouped():
@@ -300,10 +282,8 @@ def test_groupby_apply_grouped():
     df["val1"] = np.random.random(nelem)
     df["val2"] = np.random.random(nelem)
 
-    expect_grpby = df.to_pandas().groupby(
-        ["key1", "key2"], as_index=False, sort=True
-    )
-    got_grpby = df.groupby(["key1", "key2"], sort=True)
+    expect_grpby = df.to_pandas().groupby(["key1", "key2"], as_index=False)
+    got_grpby = df.groupby(["key1", "key2"])
 
     def foo(key1, val1, com1, com2):
         for i in range(cuda.threadIdx.x, len(key1), cuda.blockDim.x):
@@ -328,29 +308,85 @@ def emulate(df):
     expect = expect_grpby.apply(emulate)
     expect = expect.sort_values(["key1", "key2"])
 
-    assert_eq(expect, got)
+    assert_groupby_results_equal(expect, got)
 
 
-@pytest.mark.parametrize("nelem", [100, 500])
+@pytest.mark.parametrize("nelem", [2, 3, 100, 500, 1000])
 @pytest.mark.parametrize(
     "func",
     ["mean", "std", "var", "min", "max", "idxmin", "idxmax", "count", "sum"],
 )
-def test_groupby_cudf_2keys_agg(nelem, func):
-    got_df = (
-        make_frame(DataFrame, nelem=nelem)
-        .groupby(["x", "y"], sort=True)
-        .agg(func)
-    )
-
-    # pandas
+def test_groupby_2keys_agg(nelem, func):
+    # gdf (Note: lack of multiIndex)
     expect_df = (
-        make_frame(pd.DataFrame, nelem=nelem)
-        .groupby(["x", "y"], sort=True)
-        .agg(func)
+        make_frame(pd.DataFrame, nelem=nelem).groupby(["x", "y"]).agg(func)
     )
+    got_df = make_frame(DataFrame, nelem=nelem).groupby(["x", "y"]).agg(func)
+
     check_dtype = False if func in _index_type_aggs else True
-    assert_eq(got_df, expect_df, check_dtype=check_dtype)
+    assert_groupby_results_equal(got_df, expect_df, check_dtype=check_dtype)
+
+
+@pytest.mark.parametrize("num_groups", [2, 3, 10, 50, 100])
+@pytest.mark.parametrize("nelem_per_group", [1, 10, 100])
+@pytest.mark.parametrize(
+    "func",
+    ["min", "max", "count", "sum"],
+    # TODO: Replace the above line with the one below once
+    # https://github.com/pandas-dev/pandas/issues/40685 is resolved.
+    # "func", ["min", "max", "idxmin", "idxmax", "count", "sum"],
+)
+def test_groupby_agg_decimal(num_groups, nelem_per_group, func):
+    # The number of digits after the decimal to use.
+    decimal_digits = 2
+    # The number of digits before the decimal to use.
+    whole_digits = 2
+
+    scale = 10 ** whole_digits
+    nelem = num_groups * nelem_per_group
+
+    # np.unique is necessary because, if there are duplicates, idxmin and
+    # idxmax may return different results than pandas (see
+    # https://github.com/rapidsai/cudf/issues/7756). This is not relevant to
+    # the current version of the test, because idxmin and idxmax simply don't
+    # work with pandas Series composed of Decimal objects (see
+    # https://github.com/pandas-dev/pandas/issues/40685). However, if that is
+    # ever enabled, then this issue will crop up again so we may as well have
+    # it fixed now.
+    x = np.unique((np.random.rand(nelem) * scale).round(decimal_digits))
+    y = np.unique((np.random.rand(nelem) * scale).round(decimal_digits))
+
+    if x.size < y.size:
+        total_elements = x.size
+        y = y[: x.size]
+    else:
+        total_elements = y.size
+        x = x[: y.size]
+
+    # Note that this filtering can lead to one group with fewer elements, but
+    # that shouldn't be a problem and is probably useful to test.
+    idx_col = np.tile(np.arange(num_groups), nelem_per_group)[:total_elements]
+
+    decimal_x = pd.Series([Decimal(str(d)) for d in x])
+    decimal_y = pd.Series([Decimal(str(d)) for d in y])
+
+    pdf = pd.DataFrame({"idx": idx_col, "x": decimal_x, "y": decimal_y})
+    gdf = DataFrame(
+        {
+            "idx": idx_col,
+            "x": cudf.Series(decimal_x),
+            "y": cudf.Series(decimal_y),
+        }
+    )
+
+    expect_df = pdf.groupby("idx", sort=True).agg(func)
+    if rmm._cuda.gpu.runtimeGetVersion() < 11000:
+        with pytest.raises(RuntimeError):
+            got_df = gdf.groupby("idx", sort=True).agg(func)
+    else:
+        got_df = gdf.groupby("idx", sort=True).agg(func)
+        assert_eq(expect_df["x"], got_df["x"], check_dtype=False)
+        assert_eq(expect_df["y"], got_df["y"], check_dtype=False)
 
 
 @pytest.mark.parametrize(
@@ -364,7 +400,7 @@ def test_series_groupby(agg):
     sa = getattr(sg, agg)()
     ga = getattr(gg, agg)()
     check_dtype = False if agg in _index_type_aggs else True
-    assert_eq(sa, ga, check_dtype=check_dtype)
+    assert_groupby_results_equal(sa, ga, check_dtype=check_dtype)
 
 
 @pytest.mark.parametrize(
@@ -376,7 +412,7 @@ def test_series_groupby_agg(agg):
     sg = s.groupby(s // 2).agg(agg)
     gg = g.groupby(g // 2).agg(agg)
     check_dtype = False if agg in _index_type_aggs else True
-    assert_eq(sg, gg, check_dtype=check_dtype)
+    assert_groupby_results_equal(sg, gg, check_dtype=check_dtype)
 
 
 @pytest.mark.parametrize(
@@ -405,7 +441,7 @@ def test_groupby_level_zero(agg):
     pdresult = getattr(pdg, agg)()
     gdresult = getattr(gdg, agg)()
     check_dtype = False if agg in _index_type_aggs else True
-    assert_eq(pdresult, gdresult, check_dtype=check_dtype)
+    assert_groupby_results_equal(pdresult, gdresult, check_dtype=check_dtype)
 
 
 @pytest.mark.parametrize(
@@ -434,59 +470,59 @@ def test_groupby_series_level_zero(agg):
     pdresult = getattr(pdg, agg)()
     gdresult = getattr(gdg, agg)()
     check_dtype = False if agg in _index_type_aggs else True
-    assert_eq(pdresult, gdresult, check_dtype=check_dtype)
+    assert_groupby_results_equal(pdresult, gdresult, check_dtype=check_dtype)
 
 
 def test_groupby_column_name():
     pdf = pd.DataFrame({"xx": [1.0, 2.0, 3.0], "yy": [1, 2, 3]})
     gdf = DataFrame.from_pandas(pdf)
-    g = gdf.groupby("yy", sort=True)
-    p = pdf.groupby("yy", sort=True)
+    g = gdf.groupby("yy")
+    p = pdf.groupby("yy")
     gxx = g["xx"].sum()
     pxx = p["xx"].sum()
-    assert_eq(pxx, gxx)
+    assert_groupby_results_equal(pxx, gxx)
 
     gxx = g["xx"].count()
     pxx = p["xx"].count()
-    assert_eq(pxx, gxx, check_dtype=False)
+    assert_groupby_results_equal(pxx, gxx, check_dtype=False)
 
     gxx = g["xx"].min()
     pxx = p["xx"].min()
-    assert_eq(pxx, gxx)
+    assert_groupby_results_equal(pxx, gxx)
 
     gxx = g["xx"].max()
     pxx = p["xx"].max()
-    assert_eq(pxx, gxx)
+    assert_groupby_results_equal(pxx, gxx)
 
     gxx = g["xx"].idxmin()
     pxx = p["xx"].idxmin()
-    assert_eq(pxx, gxx, check_dtype=False)
+    assert_groupby_results_equal(pxx, gxx, check_dtype=False)
 
     gxx = g["xx"].idxmax()
     pxx = p["xx"].idxmax()
-    assert_eq(pxx, gxx, check_dtype=False)
+    assert_groupby_results_equal(pxx, gxx, check_dtype=False)
 
     gxx = g["xx"].mean()
     pxx = p["xx"].mean()
-    assert_eq(pxx, gxx)
+    assert_groupby_results_equal(pxx, gxx)
 
 
 def test_groupby_column_numeral():
     pdf = pd.DataFrame({0: [1.0, 2.0, 3.0], 1: [1, 2, 3]})
     gdf = DataFrame.from_pandas(pdf)
-    p = pdf.groupby(1, sort=True)
-    g = gdf.groupby(1, sort=True)
+    p = pdf.groupby(1)
+    g = gdf.groupby(1)
     pxx = p[0].sum()
     gxx = g[0].sum()
-    assert_eq(pxx, gxx)
+    assert_groupby_results_equal(pxx, gxx)
 
     pdf = pd.DataFrame({0.5: [1.0, 2.0, 3.0], 1.5: [1, 2, 3]})
     gdf = DataFrame.from_pandas(pdf)
-    p = pdf.groupby(1.5, sort=True)
-    g = gdf.groupby(1.5, sort=True)
+    p = pdf.groupby(1.5)
+    g = gdf.groupby(1.5)
     pxx = p[0.5].sum()
     gxx = g[0.5].sum()
-    assert_eq(pxx, gxx)
+    assert_groupby_results_equal(pxx, gxx)
 
 
 @pytest.mark.parametrize(
@@ -496,18 +532,18 @@ def test_groupby_column_numeral():
 def test_groupby_external_series(series):
     pdf = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [1, 2, 1]})
     gdf = DataFrame.from_pandas(pdf)
-    pxx = pdf.groupby(pd.Series(series), sort=True).x.sum()
-    gxx = gdf.groupby(cudf.Series(series), sort=True).x.sum()
-    assert_eq(pxx, gxx)
+    pxx = pdf.groupby(pd.Series(series)).x.sum()
+    gxx = gdf.groupby(cudf.Series(series)).x.sum()
+    assert_groupby_results_equal(pxx, gxx)
 
 
 @pytest.mark.parametrize("series", [[0.0, 1.0], [1.0, 1.0, 1.0, 1.0]])
 def test_groupby_external_series_incorrect_length(series):
     pdf = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [1, 2, 1]})
     gdf = DataFrame.from_pandas(pdf)
-    pxx = pdf.groupby(pd.Series(series), sort=True).x.sum()
-    gxx = gdf.groupby(cudf.Series(series), sort=True).x.sum()
-    assert_eq(pxx, gxx)
+    pxx = pdf.groupby(pd.Series(series)).x.sum()
+    gxx = gdf.groupby(cudf.Series(series)).x.sum()
+    assert_groupby_results_equal(pxx, gxx)
 
 
 @pytest.mark.parametrize(
@@ -517,52 +553,51 @@ def test_groupby_levels(level):
     idx = pd.MultiIndex.from_tuples([(1, 1), (1, 2), (2, 2)], names=("a", "b"))
     pdf = pd.DataFrame({"c": [1, 2, 3], "d": [2, 3, 4]}, index=idx)
     gdf = cudf.from_pandas(pdf)
-    assert_eq(
-        pdf.groupby(level=level, sort=True).sum(),
-        gdf.groupby(level=level, sort=True).sum(),
+    assert_groupby_results_equal(
+        pdf.groupby(level=level).sum(), gdf.groupby(level=level).sum(),
     )
 
 
 def test_advanced_groupby_levels():
     pdf = pd.DataFrame({"x": [1, 2, 3], "y": [1, 2, 1], "z": [1, 1, 1]})
     gdf = cudf.from_pandas(pdf)
-    pdg = pdf.groupby(["x", "y"], sort=True).sum()
-    gdg = gdf.groupby(["x", "y"], sort=True).sum()
-    assert_eq(pdg, gdg)
-    pdh = pdg.groupby(level=1, sort=True).sum()
-    gdh = gdg.groupby(level=1, sort=True).sum()
-    assert_eq(pdh, gdh)
-    pdg = pdf.groupby(["x", "y", "z"], sort=True).sum()
-    gdg = gdf.groupby(["x", "y", "z"], sort=True).sum()
-    assert_eq(pdg, gdg)
-    pdg = pdf.groupby(["z"], sort=True).sum()
-    gdg = gdf.groupby(["z"], sort=True).sum()
-    assert_eq(pdg, gdg)
-    pdg = pdf.groupby(["y", "z"], sort=True).sum()
-    gdg = gdf.groupby(["y", "z"], sort=True).sum()
-    assert_eq(pdg, gdg)
-    pdg = pdf.groupby(["x", "z"], sort=True).sum()
-    gdg = gdf.groupby(["x", "z"], sort=True).sum()
-    assert_eq(pdg, gdg)
-    pdg = pdf.groupby(["y"], sort=True).sum()
-    gdg = gdf.groupby(["y"], sort=True).sum()
-    assert_eq(pdg, gdg)
-    pdg = pdf.groupby(["x"], sort=True).sum()
-    gdg = gdf.groupby(["x"], sort=True).sum()
-    assert_eq(pdg, gdg)
-    pdh = pdg.groupby(level=0, sort=True).sum()
-    gdh = gdg.groupby(level=0, sort=True).sum()
-    assert_eq(pdh, gdh)
-    pdg = pdf.groupby(["x", "y"], sort=True).sum()
-    gdg = gdf.groupby(["x", "y"], sort=True).sum()
-    pdh = pdg.groupby(level=[0, 1], sort=True).sum()
-    gdh = gdg.groupby(level=[0, 1], sort=True).sum()
-    assert_eq(pdh, gdh)
-    pdh = pdg.groupby(level=[1, 0], sort=True).sum()
-    gdh = gdg.groupby(level=[1, 0], sort=True).sum()
-    assert_eq(pdh, gdh)
-    pdg = pdf.groupby(["x", "y"], sort=True).sum()
-    gdg = gdf.groupby(["x", "y"], sort=True).sum()
+    pdg = pdf.groupby(["x", "y"]).sum()
+    gdg = gdf.groupby(["x", "y"]).sum()
+    assert_groupby_results_equal(pdg, gdg)
+    pdh = pdg.groupby(level=1).sum()
+    gdh = gdg.groupby(level=1).sum()
+    assert_groupby_results_equal(pdh, gdh)
+    pdg = pdf.groupby(["x", "y", "z"]).sum()
+    gdg = gdf.groupby(["x", "y", "z"]).sum()
+    assert_groupby_results_equal(pdg, gdg)
+    pdg = pdf.groupby(["z"]).sum()
+    gdg = gdf.groupby(["z"]).sum()
+    assert_groupby_results_equal(pdg, gdg)
+    pdg = pdf.groupby(["y", "z"]).sum()
+    gdg = gdf.groupby(["y", "z"]).sum()
+    assert_groupby_results_equal(pdg, gdg)
+    pdg = pdf.groupby(["x", "z"]).sum()
+    gdg = gdf.groupby(["x", "z"]).sum()
+    assert_groupby_results_equal(pdg, gdg)
+    pdg = pdf.groupby(["y"]).sum()
+    gdg = gdf.groupby(["y"]).sum()
+    assert_groupby_results_equal(pdg, gdg)
+    pdg = pdf.groupby(["x"]).sum()
+    gdg = gdf.groupby(["x"]).sum()
+    assert_groupby_results_equal(pdg, gdg)
+    pdh = pdg.groupby(level=0).sum()
+    gdh = gdg.groupby(level=0).sum()
+    assert_groupby_results_equal(pdh, gdh)
+    pdg = pdf.groupby(["x", "y"]).sum()
+    gdg = gdf.groupby(["x", "y"]).sum()
+    pdh = pdg.groupby(level=[0, 1]).sum()
+    gdh = gdg.groupby(level=[0, 1]).sum()
+    assert_groupby_results_equal(pdh, gdh)
+    pdh = pdg.groupby(level=[1, 0]).sum()
+    gdh = gdg.groupby(level=[1, 0]).sum()
+    assert_groupby_results_equal(pdh, gdh)
+    pdg = pdf.groupby(["x", "y"]).sum()
+    gdg = gdf.groupby(["x", "y"]).sum()
 
     assert_exceptions_equal(
         lfunc=pdg.groupby,
@@ -595,7 +630,7 @@ def test_advanced_groupby_levels():
 def test_empty_groupby(func):
     pdf = pd.DataFrame({"x": [], "y": [], "z": []})
     gdf = cudf.from_pandas(pdf)
-    assert_eq(func(pdf), func(gdf), check_index_type=False)
+    assert_groupby_results_equal(func(pdf), func(gdf), check_index_type=False)
 
 
 def test_groupby_unsupported_columns():
@@ -613,21 +648,21 @@ def test_groupby_unsupported_columns():
     )
     pdf["b"] = pd_cat
     gdf = cudf.from_pandas(pdf)
-    pdg = pdf.groupby("x", sort=True).sum()
-    gdg = gdf.groupby("x", sort=True).sum()
-    assert_eq(pdg, gdg)
+    pdg = pdf.groupby("x").sum()
+    gdg = gdf.groupby("x").sum()
+    assert_groupby_results_equal(pdg, gdg)
 
 
 def test_list_of_series():
     pdf = pd.DataFrame({"x": [1, 2, 3], "y": [1, 2, 1]})
     gdf = cudf.from_pandas(pdf)
-    pdg = pdf.groupby([pdf.x], sort=True).y.sum()
-    gdg = gdf.groupby([gdf.x], sort=True).y.sum()
-    assert_eq(pdg, gdg)
-    pdg = pdf.groupby([pdf.x, pdf.y], sort=True).y.sum()
-    gdg = gdf.groupby([gdf.x, gdf.y], sort=True).y.sum()
+    pdg = pdf.groupby([pdf.x]).y.sum()
+    gdg = gdf.groupby([gdf.x]).y.sum()
+    assert_groupby_results_equal(pdg, gdg)
+    pdg = pdf.groupby([pdf.x, pdf.y]).y.sum()
+    gdg = gdf.groupby([gdf.x, gdf.y]).y.sum()
     pytest.skip()
-    assert_eq(pdg, gdg)
+    assert_groupby_results_equal(pdg, gdg)
 
 
 def test_groupby_use_agg_column_as_index():
@@ -637,7 +672,7 @@ def test_groupby_use_agg_column_as_index():
     gdf["a"] = [1, 1, 1, 3, 5]
     pdg = pdf.groupby("a").agg({"a": "count"})
     gdg = gdf.groupby("a").agg({"a": "count"})
-    assert_eq(pdg, gdg, check_dtype=False)
+    assert_groupby_results_equal(pdg, gdg, check_dtype=False)
 
 
 def test_groupby_list_then_string():
@@ -646,13 +681,13 @@ def test_groupby_list_then_string():
     gdf["b"] = [11, 2, 15, 12, 2]
     gdf["c"] = [6, 7, 6, 7, 6]
     pdf = gdf.to_pandas()
-    gdg = gdf.groupby("a", as_index=True, sort=True).agg(
+    gdg = gdf.groupby("a", as_index=True).agg(
         {"b": ["min", "max"], "c": "max"}
     )
-    pdg = pdf.groupby("a", as_index=True, sort=True).agg(
+    pdg = pdf.groupby("a", as_index=True).agg(
         {"b": ["min", "max"], "c": "max"}
     )
-    assert_eq(gdg, pdg)
+    assert_groupby_results_equal(gdg, pdg)
 
 
 def test_groupby_different_unequal_length_column_aggregations():
@@ -661,13 +696,13 @@ def test_groupby_different_unequal_length_column_aggregations():
     gdf["b"] = [11, 2, 15, 12, 2]
     gdf["c"] = [11, 2, 15, 12, 2]
     pdf = gdf.to_pandas()
-    gdg = gdf.groupby("a", as_index=True, sort=True).agg(
+    gdg = gdf.groupby("a", as_index=True).agg(
         {"b": "min", "c": ["max", "min"]}
     )
-    pdg = pdf.groupby("a", as_index=True, sort=True).agg(
+    pdg = pdf.groupby("a", as_index=True).agg(
         {"b": "min", "c": ["max", "min"]}
     )
-    assert_eq(pdg, gdg)
+    assert_groupby_results_equal(pdg, gdg)
 
 
 def test_groupby_single_var_two_aggs():
@@ -676,9 +711,9 @@ def test_groupby_single_var_two_aggs():
     gdf["b"] = [11, 2, 15, 12, 2]
     gdf["c"] = [11, 2, 15, 12, 2]
     pdf = gdf.to_pandas()
-    gdg = gdf.groupby("a", as_index=True, sort=True).agg({"b": ["min", "max"]})
-    pdg = pdf.groupby("a", as_index=True, sort=True).agg({"b": ["min", "max"]})
-    assert_eq(pdg, gdg)
+    gdg = gdf.groupby("a", as_index=True).agg({"b": ["min", "max"]})
+    pdg = pdf.groupby("a", as_index=True).agg({"b": ["min", "max"]})
+    assert_groupby_results_equal(pdg, gdg)
 
 
 def test_groupby_double_var_two_aggs():
@@ -687,13 +722,9 @@ def test_groupby_double_var_two_aggs():
     gdf["b"] = [11, 2, 15, 12, 2]
     gdf["c"] = [11, 2, 15, 12, 2]
     pdf = gdf.to_pandas()
-    gdg = gdf.groupby(["a", "b"], as_index=True, sort=True).agg(
-        {"c": ["min", "max"]}
-    )
-    pdg = pdf.groupby(["a", "b"], as_index=True, sort=True).agg(
-        {"c": ["min", "max"]}
-    )
-    assert_eq(pdg, gdg)
+    gdg = gdf.groupby(["a", "b"], as_index=True).agg({"c": ["min", "max"]})
+    pdg = pdf.groupby(["a", "b"], as_index=True).agg({"c": ["min", "max"]})
+    assert_groupby_results_equal(pdg, gdg)
 
 
 def test_groupby_apply_basic_agg_single_column():
@@ -703,9 +734,9 @@ def test_groupby_apply_basic_agg_single_column():
     gdf["mult"] = gdf["key"] * gdf["val"]
     pdf = gdf.to_pandas()
 
-    gdg = gdf.groupby(["key", "val"], sort=True).mult.sum()
-    pdg = pdf.groupby(["key", "val"], sort=True).mult.sum()
-    assert_eq(pdg, gdg)
+    gdg = gdf.groupby(["key", "val"]).mult.sum()
+    pdg = pdf.groupby(["key", "val"]).mult.sum()
+    assert_groupby_results_equal(pdg, gdg)
 
 
 def test_groupby_multi_agg_single_groupby_series():
@@ -716,10 +747,10 @@ def test_groupby_multi_agg_single_groupby_series():
         }
     )
     gdf = cudf.from_pandas(pdf)
-    pdg = pdf.groupby("x", sort=True).y.agg(["sum", "max"])
-    gdg = gdf.groupby("x", sort=True).y.agg(["sum", "max"])
+    pdg = pdf.groupby("x").y.agg(["sum", "max"])
+    gdg = gdf.groupby("x").y.agg(["sum", "max"])
 
-    assert_eq(pdg, gdg)
+    assert_groupby_results_equal(pdg, gdg)
 
 
 def test_groupby_multi_agg_multi_groupby():
@@ -732,9 +763,9 @@ def test_groupby_multi_agg_multi_groupby():
         }
     )
     gdf = cudf.from_pandas(pdf)
-    pdg = pdf.groupby(["a", "b"], sort=True).agg(["sum", "max"])
-    gdg = gdf.groupby(["a", "b"], sort=True).agg(["sum", "max"])
-    assert_eq(pdg, gdg)
+    pdg = pdf.groupby(["a", "b"]).agg(["sum", "max"])
+    gdg = gdf.groupby(["a", "b"]).agg(["sum", "max"])
+    assert_groupby_results_equal(pdg, gdg)
 
 
 def test_groupby_datetime_multi_agg_multi_groupby():
@@ -751,10 +782,10 @@ def test_groupby_datetime_multi_agg_multi_groupby():
         }
     )
     gdf = cudf.from_pandas(pdf)
-    pdg = pdf.groupby(["a", "b"], sort=True).agg(["sum", "max"])
-    gdg = gdf.groupby(["a", "b"], sort=True).agg(["sum", "max"])
+    pdg = pdf.groupby(["a", "b"]).agg(["sum", "max"])
+    gdg = gdf.groupby(["a", "b"]).agg(["sum", "max"])
 
-    assert_eq(pdg, gdg)
+    assert_groupby_results_equal(pdg, gdg)
 
 
 @pytest.mark.parametrize(
@@ -778,9 +809,9 @@ def test_groupby_multi_agg_hash_groupby(agg):
     ).reset_index(drop=True)
     pdf = gdf.to_pandas()
     check_dtype = False if "count" in agg else True
-    pdg = pdf.groupby("id", sort=True).agg(agg)
-    gdg = gdf.groupby("id", sort=True).agg(agg)
-    assert_eq(pdg, gdg, check_dtype=check_dtype)
+    pdg = pdf.groupby("id").agg(agg)
+    gdg = gdf.groupby("id").agg(agg)
+    assert_groupby_results_equal(pdg, gdg, check_dtype=check_dtype)
 
 
 @pytest.mark.parametrize(
@@ -791,9 +822,9 @@ def test_groupby_nulls_basic(agg):
 
     pdf = pd.DataFrame({"a": [0, 0, 1, 1, 2, 2], "b": [1, 2, 1, 2, 1, None]})
     gdf = cudf.from_pandas(pdf)
-    assert_eq(
-        getattr(pdf.groupby("a", sort=True), agg)(),
-        getattr(gdf.groupby("a", sort=True), agg)(),
+    assert_groupby_results_equal(
+        getattr(pdf.groupby("a"), agg)(),
+        getattr(gdf.groupby("a"), agg)(),
         check_dtype=check_dtype,
     )
 
@@ -805,9 +836,9 @@ def test_groupby_nulls_basic(agg):
         }
     )
     gdf = cudf.from_pandas(pdf)
-    assert_eq(
-        getattr(pdf.groupby("a", sort=True), agg)(),
-        getattr(gdf.groupby("a", sort=True), agg)(),
+    assert_groupby_results_equal(
+        getattr(pdf.groupby("a"), agg)(),
+        getattr(gdf.groupby("a"), agg)(),
         check_dtype=check_dtype,
     )
 
@@ -822,9 +853,9 @@ def test_groupby_nulls_basic(agg):
 
     # TODO: fillna() used here since we don't follow
     # Pandas' null semantics. Should we change it?
-    assert_eq(
-        getattr(pdf.groupby("a", sort=True), agg)().fillna(0),
-        getattr(gdf.groupby("a", sort=True), agg)().fillna(0),
+    assert_groupby_results_equal(
+        getattr(pdf.groupby("a"), agg)().fillna(0),
+        getattr(gdf.groupby("a"), agg)().fillna(0),
         check_dtype=check_dtype,
     )
 
@@ -833,7 +864,9 @@ def test_groupby_nulls_in_index():
     pdf = pd.DataFrame({"a": [None, 2, 1, 1], "b": [1, 2, 3, 4]})
     gdf = cudf.from_pandas(pdf)
 
-    assert_eq(pdf.groupby("a").sum(), gdf.groupby("a").sum())
+    assert_groupby_results_equal(
+        pdf.groupby("a").sum(), gdf.groupby("a").sum()
+    )
 
 
 def test_groupby_all_nulls_index():
@@ -844,13 +877,17 @@ def test_groupby_all_nulls_index():
         }
     )
     pdf = gdf.to_pandas()
-    assert_eq(pdf.groupby("a").sum(), gdf.groupby("a").sum())
+    assert_groupby_results_equal(
+        pdf.groupby("a").sum(), gdf.groupby("a").sum()
+    )
 
     gdf = cudf.DataFrame(
         {"a": cudf.Series([np.nan, np.nan, np.nan, np.nan]), "b": [1, 2, 3, 4]}
     )
     pdf = gdf.to_pandas()
-    assert_eq(pdf.groupby("a").sum(), gdf.groupby("a").sum())
+    assert_groupby_results_equal(
+        pdf.groupby("a").sum(), gdf.groupby("a").sum()
+    )
 
 
 @pytest.mark.parametrize("sort", [True, False])
@@ -902,10 +939,8 @@ def test_groupby_cat():
         {"a": [1, 1, 2], "b": pd.Series(["b", "b", "a"], dtype="category")}
     )
     gdf = cudf.from_pandas(pdf)
-    assert_eq(
-        pdf.groupby("a", sort=True).count(),
-        gdf.groupby("a", sort=True).count(),
-        check_dtype=False,
+    assert_groupby_results_equal(
+        pdf.groupby("a").count(), gdf.groupby("a").count(), check_dtype=False,
     )
 
 
@@ -947,7 +982,7 @@ def test_groupby_quantile(interpolation, q):
             "Pandas NaN Rounding will fail nearest interpolation at 0.5"
         )
 
-    assert_eq(pdresult, gdresult)
+    assert_groupby_results_equal(pdresult, gdresult)
 
 
 def test_groupby_std():
@@ -957,16 +992,12 @@ def test_groupby_std():
     }
     pdf = pd.DataFrame(raw_data)
     gdf = DataFrame.from_pandas(pdf)
-    pdg = pdf.groupby("x", sort=True)
-    gdg = gdf.groupby("x", sort=True)
+    pdg = pdf.groupby("x")
+    gdg = gdf.groupby("x")
     pdresult = pdg.std()
     gdresult = gdg.std()
 
-    # There's a lot left to add to python bindings like index name
-    # so this is a temporary workaround
-    pdresult = pdresult["y"].reset_index(drop=True)
-    gdresult = gdresult["y"].reset_index(drop=True)
-    assert_eq(pdresult, gdresult)
+    assert_groupby_results_equal(pdresult, gdresult)
 
 
 def test_groupby_size():
@@ -979,23 +1010,19 @@ def test_groupby_size():
     )
     gdf = cudf.from_pandas(pdf)
 
-    assert_eq(
-        pdf.groupby("a", sort=True).size(),
-        gdf.groupby("a", sort=True).size(),
-        check_dtype=False,
+    assert_groupby_results_equal(
+        pdf.groupby("a").size(), gdf.groupby("a").size(), check_dtype=False,
     )
 
-    assert_eq(
-        pdf.groupby(["a", "b", "c"], sort=True).size(),
-        gdf.groupby(["a", "b", "c"], sort=True).size(),
+    assert_groupby_results_equal(
+        pdf.groupby(["a", "b", "c"]).size(),
+        gdf.groupby(["a", "b", "c"]).size(),
         check_dtype=False,
     )
 
     sr = pd.Series(range(len(pdf)))
-    assert_eq(
-        pdf.groupby(sr, sort=True).size(),
-        gdf.groupby(sr, sort=True).size(),
-        check_dtype=False,
+    assert_groupby_results_equal(
+        pdf.groupby(sr).size(), gdf.groupby(sr).size(), check_dtype=False,
     )
 
 
@@ -1010,15 +1037,15 @@ def test_groupby_datetime(nelem, as_index, agg):
     check_dtype = agg not in ("mean", "count", "idxmin", "idxmax")
     pdf = make_frame(pd.DataFrame, nelem=nelem, with_datetime=True)
     gdf = make_frame(cudf.DataFrame, nelem=nelem, with_datetime=True)
-    pdg = pdf.groupby("datetime", as_index=as_index, sort=True)
-    gdg = gdf.groupby("datetime", as_index=as_index, sort=True)
+    pdg = pdf.groupby("datetime", as_index=as_index)
+    gdg = gdf.groupby("datetime", as_index=as_index)
     if as_index is False:
         pdres = getattr(pdg, agg)()
         gdres = getattr(gdg, agg)()
     else:
         pdres = pdg.agg({"datetime": agg})
         gdres = gdg.agg({"datetime": agg})
-    assert_eq(pdres, gdres, check_dtype=check_dtype)
+    assert_groupby_results_equal(pdres, gdres, check_dtype=check_dtype)
 
 
 def test_groupby_dropna():
@@ -1026,8 +1053,8 @@ def test_groupby_dropna():
     expect = cudf.DataFrame(
         {"b": [3, 3]}, index=cudf.Series([1, None], name="a")
     )
-    got = df.groupby("a", dropna=False, sort=True).sum()
-    assert_eq(expect, got)
+    got = df.groupby("a", dropna=False).sum()
+    assert_groupby_results_equal(expect, got)
 
     df = cudf.DataFrame(
         {"a": [1, 1, 1, None], "b": [1, None, 1, None], "c": [1, 2, 3, 4]}
@@ -1037,22 +1064,22 @@ def test_groupby_dropna():
         names=["a", "b"],
     )
     expect = cudf.DataFrame({"c": [4, 2, 4]}, index=idx)
-    got = df.groupby(["a", "b"], dropna=False, sort=True).sum()
+    got = df.groupby(["a", "b"], dropna=False).sum()
 
-    assert_eq(expect, got)
+    assert_groupby_results_equal(expect, got)
 
 
 def test_groupby_dropna_getattr():
     df = cudf.DataFrame()
     df["id"] = [0, 1, 1, None, None, 3, 3]
     df["val"] = [0, 1, 1, 2, 2, 3, 3]
-    got = df.groupby("id", dropna=False, sort=True).val.sum()
+    got = df.groupby("id", dropna=False).val.sum()
 
     expect = cudf.Series(
         [0, 2, 6, 4], name="val", index=cudf.Series([0, 1, 3, None], name="id")
     )
 
-    assert_eq(expect, got)
+    assert_groupby_results_equal(expect, got)
 
 
 def test_groupby_categorical_from_string():
@@ -1060,9 +1087,9 @@ def test_groupby_categorical_from_string():
     gdf["id"] = ["a", "b", "c"]
     gdf["val"] = [0, 1, 2]
     gdf["id"] = gdf["id"].astype("category")
-    assert_eq(
+    assert_groupby_results_equal(
         cudf.DataFrame({"val": gdf["val"]}).set_index(keys=gdf["id"]),
-        gdf.groupby("id", sort=True).sum(),
+        gdf.groupby("id").sum(),
     )
 
 
@@ -1076,7 +1103,7 @@ def test_groupby_arbitrary_length_series():
     expect = pdf.groupby(psr).sum()
     got = gdf.groupby(gsr).sum()
 
-    assert_eq(expect, got)
+    assert_groupby_results_equal(expect, got)
 
 
 def test_groupby_series_same_name_as_dataframe_column():
@@ -1089,7 +1116,7 @@ def test_groupby_series_same_name_as_dataframe_column():
     expect = pdf.groupby(psr).sum()
     got = gdf.groupby(gsr).sum()
 
-    assert_eq(expect, got)
+    assert_groupby_results_equal(expect, got)
 
 
 def test_group_by_series_and_column_name_in_by():
@@ -1106,7 +1133,7 @@ def test_group_by_series_and_column_name_in_by():
     expect = pdf.groupby(["x", psr0, psr1]).sum()
     got = gdf.groupby(["x", gsr0, gsr1]).sum()
 
-    assert_eq(expect, got)
+    assert_groupby_results_equal(expect, got)
 
 
 @pytest.mark.parametrize(
@@ -1147,10 +1174,10 @@ def test_groupby_count(agg, by):
     )
     gdf = cudf.from_pandas(pdf)
 
-    expect = pdf.groupby(by, sort=True).agg(agg)
-    got = gdf.groupby(by, sort=True).agg(agg)
+    expect = pdf.groupby(by).agg(agg)
+    got = gdf.groupby(by).agg(agg)
 
-    assert_eq(expect, got, check_dtype=False)
+    assert_groupby_results_equal(expect, got, check_dtype=False)
 
 
 @pytest.mark.parametrize("agg", [lambda x: x.median(), "median"])
@@ -1164,7 +1191,7 @@ def test_groupby_median(agg, by):
     expect = pdf.groupby(by).agg(agg)
     got = gdf.groupby(by).agg(agg)
 
-    assert_eq(expect, got, check_dtype=False)
+    assert_groupby_results_equal(expect, got, check_dtype=False)
 
 
 @pytest.mark.parametrize("agg", [lambda x: x.nunique(), "nunique"])
@@ -1180,7 +1207,7 @@ def test_groupby_nunique(agg, by):
     expect = pdf.groupby(by).nunique()
     got = gdf.groupby(by).nunique()
 
-    assert_eq(expect, got, check_dtype=False)
+    assert_groupby_results_equal(expect, got, check_dtype=False)
 
 
 @pytest.mark.parametrize(
@@ -1198,10 +1225,10 @@ def test_groupby_nth(n, by):
     )
     gdf = cudf.from_pandas(pdf)
 
-    expect = pdf.groupby(by, sort=True).nth(n)
-    got = gdf.groupby(by, sort=True).nth(n)
+    expect = pdf.groupby(by).nth(n)
+    got = gdf.groupby(by).nth(n)
 
-    assert_eq(expect, got, check_dtype=False)
+    assert_groupby_results_equal(expect, got, check_dtype=False)
 
 
 def test_raise_data_error():
@@ -1217,7 +1244,7 @@ def test_drop_unsupported_multi_agg():
     gdf = cudf.DataFrame(
         {"a": [1, 1, 2, 2], "b": [1, 2, 3, 4], "c": ["a", "b", "c", "d"]}
     )
-    assert_eq(
+    assert_groupby_results_equal(
         gdf.groupby("a").agg(["count", "mean"]),
         gdf.groupby("a").agg({"b": ["count", "mean"], "c": ["count"]}),
     )
@@ -1245,9 +1272,9 @@ def test_groupby_agg_combinations(agg):
     )
     gdf = cudf.from_pandas(pdf)
 
-    assert_eq(
-        pdf.groupby("a", sort=True).agg(agg),
-        gdf.groupby("a", sort=True).agg(agg),
+    assert_groupby_results_equal(
+        pdf.groupby("a").agg(agg),
+        gdf.groupby("a").agg(agg),
         check_dtype=False,
     )
 
@@ -1257,7 +1284,7 @@ def test_groupby_apply_noempty_group():
         {"a": [1, 1, 2, 2], "b": [1, 2, 1, 2], "c": [1, 2, 3, 4]}
     )
     gdf = cudf.from_pandas(pdf)
-    assert_eq(
+    assert_groupby_results_equal(
         pdf.groupby("a")
         .apply(lambda x: x.iloc[[0, 1]])
         .reset_index(drop=True),
@@ -1272,9 +1299,9 @@ def test_reset_index_after_empty_groupby():
     pdf = pd.DataFrame({"a": [1, 2, 3]})
     gdf = cudf.from_pandas(pdf)
 
-    assert_eq(
-        pdf.groupby("a", sort=True).sum().reset_index(),
-        gdf.groupby("a", sort=True).sum().reset_index(),
+    assert_groupby_results_equal(
+        pdf.groupby("a").sum().reset_index(),
+        gdf.groupby("a").sum().reset_index(),
     )
 
 
@@ -1354,7 +1381,7 @@ def test_groupby_nunique_series():
     pdf = pd.DataFrame({"a": [1, 1, 1, 2, 2, 2], "b": [1, 2, 3, 1, 1, 2]})
     gdf = cudf.from_pandas(pdf)
 
-    assert_eq(
+    assert_groupby_results_equal(
         pdf.groupby("a")["b"].nunique(),
         gdf.groupby("a")["b"].nunique(),
         check_dtype=False,
@@ -1366,7 +1393,7 @@ def test_groupby_list_simple(list_agg):
     pdf = pd.DataFrame({"a": [1, 1, 1, 2, 2, 2], "b": [1, 2, None, 4, 5, 6]})
     gdf = cudf.from_pandas(pdf)
 
-    assert_eq(
+    assert_groupby_results_equal(
         pdf.groupby("a").agg({"b": list}),
         gdf.groupby("a").agg({"b": list_agg}),
         check_dtype=False,
@@ -1383,7 +1410,7 @@ def test_groupby_list_of_lists(list_agg):
     )
     gdf = cudf.from_pandas(pdf)
 
-    assert_eq(
+    assert_groupby_results_equal(
         pdf.groupby("a").agg({"b": list}),
         gdf.groupby("a").agg({"b": list_agg}),
         check_dtype=False,
@@ -1395,7 +1422,7 @@ def test_groupby_list_single_element(list_agg):
     pdf = pd.DataFrame({"a": [1, 2], "b": [3, None]})
     gdf = cudf.from_pandas(pdf)
 
-    assert_eq(
+    assert_groupby_results_equal(
         pdf.groupby("a").agg({"b": list}),
         gdf.groupby("a").agg({"b": list_agg}),
         check_dtype=False,
@@ -1415,7 +1442,7 @@ def test_groupby_list_strings(agg):
     )
     gdf = cudf.from_pandas(pdf)
 
-    assert_eq(
+    assert_groupby_results_equal(
         pdf.groupby("a").agg(agg),
         gdf.groupby("a").agg(agg),
         check_dtype=False,
@@ -1432,11 +1459,11 @@ def test_groupby_list_columns_excluded():
     )
     gdf = cudf.from_pandas(pdf)
 
-    assert_eq(
+    assert_groupby_results_equal(
         pdf.groupby("a").mean(), gdf.groupby("a").mean(), check_dtype=False
     )
 
-    assert_eq(
+    assert_groupby_results_equal(
         pdf.groupby("a").agg("mean"),
         gdf.groupby("a").agg("mean"),
         check_dtype=False,
@@ -1450,7 +1477,7 @@ def test_groupby_pipe():
     expected = pdf.groupby("A").pipe(lambda x: x.max() - x.min())
     actual = gdf.groupby("A").pipe(lambda x: x.max() - x.min())
 
-    assert_eq(expected, actual)
+    assert_groupby_results_equal(expected, actual)
 
 
 def test_groupby_apply_return_scalars():
@@ -1482,7 +1509,7 @@ def custom_map_func(x):
     expected = pdf.groupby("A").apply(lambda x: custom_map_func(x))
     actual = gdf.groupby("A").apply(lambda x: custom_map_func(x))
 
-    assert_eq(expected, actual)
+    assert_groupby_results_equal(expected, actual)
 
 
 @pytest.mark.parametrize(
@@ -1498,7 +1525,7 @@ def test_groupby_apply_return_series_dataframe(cust_func):
     expected = pdf.groupby(["key"]).apply(cust_func)
     actual = gdf.groupby(["key"]).apply(cust_func)
 
-    assert_eq(expected, actual)
+    assert_groupby_results_equal(expected, actual)
 
 
 @pytest.mark.parametrize(
@@ -1507,7 +1534,7 @@ def test_groupby_apply_return_series_dataframe(cust_func):
 )
 def test_groupby_no_keys(pdf):
     gdf = cudf.from_pandas(pdf)
-    assert_eq(
+    assert_groupby_results_equal(
         pdf.groupby([]).max(),
         gdf.groupby([]).max(),
         check_dtype=False,
@@ -1521,7 +1548,7 @@ def test_groupby_no_keys(pdf):
 )
 def test_groupby_apply_no_keys(pdf):
     gdf = cudf.from_pandas(pdf)
-    assert_eq(
+    assert_groupby_results_equal(
         pdf.groupby([]).apply(lambda x: x.max()),
         gdf.groupby([]).apply(lambda x: x.max()),
     )
@@ -1560,4 +1587,4 @@ def test_groupby_unique(by, data, dtype):
 
     expect = pdf.groupby("by")["data"].unique()
     got = gdf.groupby("by")["data"].unique()
-    assert_eq(expect, got)
+    assert_groupby_results_equal(expect, got)
diff --git a/python/cudf/cudf/tests/test_index.py b/python/cudf/cudf/tests/test_index.py
index 688efef555b..21a431dd540 100644
--- a/python/cudf/cudf/tests/test_index.py
+++ b/python/cudf/cudf/tests/test_index.py
@@ -17,6 +17,7 @@
     DatetimeIndex,
     GenericIndex,
     Int64Index,
+    IntervalIndex,
     RangeIndex,
     as_index,
 )
@@ -1360,6 +1361,256 @@ def test_categorical_index_basic(data, categories, dtype, ordered, name):
     assert_eq(pindex, gindex)
 
 
+INTERVAL_BOUNDARY_TYPES = [
+    int,
+    np.int8,
+    np.int16,
+    np.int32,
+    np.int64,
+    np.float32,
+    np.float64,
+    cudf.Scalar,
+]
+
+
+@pytest.mark.parametrize("closed", ["left", "right", "both", "neither"])
+@pytest.mark.parametrize("start", [0, 1, 2, 3])
+@pytest.mark.parametrize("end", [4, 5, 6, 7])
+def test_interval_range_basic(start, end, closed):
+    pindex = pd.interval_range(start=start, end=end, closed=closed)
+    gindex = cudf.interval_range(start=start, end=end, closed=closed)
+
+    assert_eq(pindex, gindex)
+
+
+@pytest.mark.parametrize("start_t", INTERVAL_BOUNDARY_TYPES)
+@pytest.mark.parametrize("end_t", INTERVAL_BOUNDARY_TYPES)
+def test_interval_range_dtype_basic(start_t, end_t):
+    start, end = start_t(24), end_t(42)
+    start_val = start.value if isinstance(start, cudf.Scalar) else start
+    end_val = end.value if isinstance(end, cudf.Scalar) else end
+    pindex = pd.interval_range(start=start_val, end=end_val, closed="left")
+    gindex = cudf.interval_range(start=start, end=end, closed="left")
+
+    assert_eq(pindex, gindex)
+
+
+@pytest.mark.parametrize("closed", ["left", "right", "both", "neither"])
+@pytest.mark.parametrize("start", [0])
+@pytest.mark.parametrize("end", [0])
+def test_interval_range_empty(start, end, closed):
+    pindex = pd.interval_range(start=start, end=end, closed=closed)
+    gindex = cudf.interval_range(start=start, end=end, closed=closed)
+
+    assert_eq(pindex, gindex)
+
+
+@pytest.mark.parametrize("closed", ["left", "right", "both", "neither"])
+@pytest.mark.parametrize("freq", [1, 2, 3])
+@pytest.mark.parametrize("start", [0, 1, 2, 3, 5])
+@pytest.mark.parametrize("end", [6, 8, 10, 43, 70])
+def test_interval_range_freq_basic(start, end, freq, closed):
+    pindex = pd.interval_range(start=start, end=end, freq=freq, closed=closed)
+    gindex = cudf.interval_range(
+        start=start, end=end, freq=freq, closed=closed
+    )
+
+    assert_eq(pindex, gindex)
+
+
+@pytest.mark.parametrize("start_t", INTERVAL_BOUNDARY_TYPES)
+@pytest.mark.parametrize("end_t", INTERVAL_BOUNDARY_TYPES)
+@pytest.mark.parametrize("freq_t", INTERVAL_BOUNDARY_TYPES)
+def test_interval_range_freq_basic_dtype(start_t, end_t, freq_t):
+    start, end, freq = start_t(5), end_t(70), freq_t(3)
+    start_val = start.value if isinstance(start, cudf.Scalar) else start
+    end_val = end.value if isinstance(end, cudf.Scalar) else end
+    freq_val = freq.value if isinstance(freq, cudf.Scalar) else freq
+    pindex = pd.interval_range(
+        start=start_val, end=end_val, freq=freq_val, closed="left"
+    )
+    gindex = cudf.interval_range(
+        start=start, end=end, freq=freq, closed="left"
+    )
+
+    assert_eq(pindex, gindex)
+
+
+@pytest.mark.parametrize("closed", ["left", "right", "both", "neither"])
+@pytest.mark.parametrize("periods", [1, 1.0, 2, 2.0, 3.0, 3])
+@pytest.mark.parametrize("start", [0, 0.0, 1.0, 1, 2, 2.0, 3.0, 3])
+@pytest.mark.parametrize("end", [4, 4.0, 5.0, 5, 6, 6.0, 7.0, 7])
+def test_interval_range_periods_basic(start, end, periods, closed):
+    pindex = pd.interval_range(
+        start=start, end=end, periods=periods, closed=closed
+    )
+    gindex = cudf.interval_range(
+        start=start, end=end, periods=periods, closed=closed
+    )
+
+    assert_eq(pindex, gindex)
+
+
+@pytest.mark.parametrize("start_t", INTERVAL_BOUNDARY_TYPES)
+@pytest.mark.parametrize("end_t", INTERVAL_BOUNDARY_TYPES)
+@pytest.mark.parametrize("periods_t", INTERVAL_BOUNDARY_TYPES)
+def test_interval_range_periods_basic_dtype(start_t, end_t, periods_t):
+    start, end, periods = start_t(0), end_t(4), periods_t(1.0)
+    start_val = start.value if isinstance(start, cudf.Scalar) else start
+    end_val = end.value if isinstance(end, cudf.Scalar) else end
+    periods_val = (
+        periods.value if isinstance(periods, cudf.Scalar) else periods
+    )
+    pindex = pd.interval_range(
+        start=start_val, end=end_val, periods=periods_val, closed="left"
+    )
+    gindex = cudf.interval_range(
+        start=start, end=end, periods=periods, closed="left"
+    )
+
+    assert_eq(pindex, gindex)
+
+
+@pytest.mark.parametrize("closed", ["left", "right", "both", "neither"])
+@pytest.mark.parametrize("periods", [1, 2, 3])
+@pytest.mark.parametrize("freq", [1, 2, 3, 4])
+@pytest.mark.parametrize("end", [4, 8, 9, 10])
+def test_interval_range_periods_freq_end(end, freq, periods, closed):
+    pindex = pd.interval_range(
+        end=end, freq=freq, periods=periods, closed=closed
+    )
+    gindex = cudf.interval_range(
+        end=end, freq=freq, periods=periods, closed=closed
+    )
+
+    assert_eq(pindex, gindex)
+
+
+@pytest.mark.parametrize("periods_t", INTERVAL_BOUNDARY_TYPES)
+@pytest.mark.parametrize("freq_t", INTERVAL_BOUNDARY_TYPES)
+@pytest.mark.parametrize("end_t", INTERVAL_BOUNDARY_TYPES)
+def test_interval_range_periods_freq_end_dtype(periods_t, freq_t, end_t):
+    periods, freq, end = periods_t(2), freq_t(3), end_t(10)
+    freq_val = freq.value if isinstance(freq, cudf.Scalar) else freq
+    end_val = end.value if isinstance(end, cudf.Scalar) else end
+    periods_val = (
+        periods.value if isinstance(periods, cudf.Scalar) else periods
+    )
+    pindex = pd.interval_range(
+        end=end_val, freq=freq_val, periods=periods_val, closed="left"
+    )
+    gindex = cudf.interval_range(
+        end=end, freq=freq, periods=periods, closed="left"
+    )
+
+    assert_eq(pindex, gindex)
+
+
+@pytest.mark.parametrize("closed", ["left", "right", "both", "neither"])
+@pytest.mark.parametrize("periods", [1, 2, 3])
+@pytest.mark.parametrize("freq", [1, 2, 3, 4])
+@pytest.mark.parametrize("start", [1, 4, 9, 12])
+def test_interval_range_periods_freq_start(start, freq, periods, closed):
+    pindex = pd.interval_range(
+        start=start, freq=freq, periods=periods, closed=closed
+    )
+    gindex = cudf.interval_range(
+        start=start, freq=freq, periods=periods, closed=closed
+    )
+
+    assert_eq(pindex, gindex)
+
+
+@pytest.mark.parametrize("periods_t", INTERVAL_BOUNDARY_TYPES)
+@pytest.mark.parametrize("freq_t", INTERVAL_BOUNDARY_TYPES)
+@pytest.mark.parametrize("start_t", INTERVAL_BOUNDARY_TYPES)
+def test_interval_range_periods_freq_start_dtype(periods_t, freq_t, start_t):
+    periods, freq, start = periods_t(2), freq_t(3), start_t(9)
+    freq_val = freq.value if isinstance(freq, cudf.Scalar) else freq
+    start_val = start.value if isinstance(start, cudf.Scalar) else start
+    periods_val = (
+        periods.value if isinstance(periods, cudf.Scalar) else periods
+    )
+    pindex = pd.interval_range(
+        start=start_val, freq=freq_val, periods=periods_val, closed="left"
+    )
+    gindex = cudf.interval_range(
+        start=start, freq=freq, periods=periods, closed="left"
+    )
+
+    assert_eq(pindex, gindex)
+
+
+@pytest.mark.parametrize("closed", ["right", "left", "both", "neither"])
+@pytest.mark.parametrize(
+    "data",
+    [
+        ([pd.Interval(30, 50)]),
+        ([pd.Interval(0, 3), pd.Interval(1, 7)]),
+        ([pd.Interval(0.2, 60.3), pd.Interval(1, 7), pd.Interval(0, 0)]),
+        ([]),
+    ],
+)
+def test_interval_index_basic(data, closed):
+    pindex = pd.IntervalIndex(data, closed=closed)
+    gindex = IntervalIndex(data, closed=closed)
+
+    assert_eq(pindex, gindex)
+
+
+@pytest.mark.parametrize("closed", ["right", "left", "both", "neither"])
+def test_interval_index_empty(closed):
+    pindex = pd.IntervalIndex([], closed=closed)
+    gindex = IntervalIndex([], closed=closed)
+
+    assert_eq(pindex, gindex)
+
+
+@pytest.mark.parametrize("closed", ["right", "left", "both", "neither"])
+@pytest.mark.parametrize(
+    "data",
+    [
+        ([pd.Interval(1, 6), pd.Interval(1, 10), pd.Interval(1, 3)]),
+        (
+            [
+                pd.Interval(3.5, 6.0),
+                pd.Interval(1.0, 7.0),
+                pd.Interval(0.0, 10.0),
+            ]
+        ),
+        (
+            [
+                pd.Interval(50, 100, closed="left"),
+                pd.Interval(1.0, 7.0, closed="left"),
+                pd.Interval(16, 322, closed="left"),
+            ]
+        ),
+        (
+            [
+                pd.Interval(50, 100, closed="right"),
+                pd.Interval(1.0, 7.0, closed="right"),
+                pd.Interval(16, 322, closed="right"),
+            ]
+        ),
+    ],
+)
+def test_interval_index_many_params(data, closed):
+
+    pindex = pd.IntervalIndex(data, closed=closed)
+    gindex = IntervalIndex(data, closed=closed)
+
+    assert_eq(pindex, gindex)
+
+
+@pytest.mark.parametrize("closed", ["left", "right", "both", "neither"])
+def test_interval_index_from_breaks(closed):
+    breaks = [0, 3, 6, 10]
+    pindex = pd.IntervalIndex.from_breaks(breaks, closed=closed)
+    gindex = IntervalIndex.from_breaks(breaks, closed=closed)
+
+    assert_eq(pindex, gindex)
+
+
 @pytest.mark.parametrize("n", [0, 2, 5, 10, None])
 @pytest.mark.parametrize("frac", [0.1, 0.5, 1, 2, None])
 @pytest.mark.parametrize("replace", [True, False])
diff --git a/python/cudf/cudf/tests/test_indexing.py b/python/cudf/cudf/tests/test_indexing.py
index cec2623027f..086d59ab0f2 100644
--- a/python/cudf/cudf/tests/test_indexing.py
+++ b/python/cudf/cudf/tests/test_indexing.py
@@ -1401,3 +1401,14 @@ def test_iloc_before_zero_terminate(arg, pobj):
     gobj = cudf.from_pandas(pobj)
 
     assert_eq(pobj.iloc[arg], gobj.iloc[arg])
+
+
+def test_iloc_decimal():
+    sr = cudf.Series(["1.00", "2.00", "3.00", "4.00"]).astype(
+        cudf.Decimal64Dtype(scale=2, precision=3)
+    )
+    got = sr.iloc[[3, 2, 1, 0]]
+    expect = cudf.Series(["4.00", "3.00", "2.00", "1.00"]).astype(
+        cudf.Decimal64Dtype(scale=2, precision=3)
+    )
+    assert_eq(expect.reset_index(drop=True), got.reset_index(drop=True))
diff --git a/python/cudf/cudf/tests/test_joining.py b/python/cudf/cudf/tests/test_joining.py
index 9164bfe98d1..183385bacc1 100644
--- a/python/cudf/cudf/tests/test_joining.py
+++ b/python/cudf/cudf/tests/test_joining.py
@@ -6,7 +6,7 @@
 
 import cudf
 from cudf.core._compat import PANDAS_GE_120
-from cudf.core.dtypes import CategoricalDtype
+from cudf.core.dtypes import CategoricalDtype, Decimal64Dtype
 from cudf.tests.utils import (
     INTEGER_TYPES,
     NUMERIC_TYPES,
@@ -89,7 +89,7 @@ def assert_join_results_equal(expect, got, how, **kwargs):
         ):  # can't sort_values() on a df without columns
             return assert_eq(expect, got, **kwargs)
 
-        return assert_eq(
+        assert_eq(
             expect.sort_values(expect.columns.to_list()).reset_index(
                 drop=True
             ),
@@ -1152,6 +1152,137 @@ def test_typecast_on_join_overflow_unsafe(dtypes):
         merged = lhs.merge(rhs, on="a", how="left")  # noqa: F841
 
 
+@pytest.mark.parametrize(
+    "dtype",
+    [Decimal64Dtype(5, 2), Decimal64Dtype(7, 5), Decimal64Dtype(12, 7)],
+)
+def test_decimal_typecast_inner(dtype):
+    other_data = ["a", "b", "c", "d", "e"]
+
+    join_data_l = cudf.Series(["1.6", "9.5", "7.2", "8.7", "2.3"]).astype(
+        dtype
+    )
+    join_data_r = cudf.Series(["1.6", "9.5", "7.2", "4.5", "2.3"]).astype(
+        dtype
+    )
+
+    gdf_l = cudf.DataFrame({"join_col": join_data_l, "B": other_data})
+    gdf_r = cudf.DataFrame({"join_col": join_data_r, "B": other_data})
+
+    exp_join_data = ["1.6", "9.5", "7.2", "2.3"]
+    exp_other_data = ["a", "b", "c", "e"]
+
+    exp_join_col = cudf.Series(exp_join_data).astype(dtype)
+
+    expected = cudf.DataFrame(
+        {
+            "join_col": exp_join_col,
+            "B_x": exp_other_data,
+            "B_y": exp_other_data,
+        }
+    )
+
+    got = gdf_l.merge(gdf_r, on="join_col", how="inner")
+
+    assert_join_results_equal(expected, got, how="inner")
+    assert_eq(dtype, got["join_col"].dtype)
+
+
+@pytest.mark.parametrize(
+    "dtype",
+    [Decimal64Dtype(7, 3), Decimal64Dtype(9, 5), Decimal64Dtype(14, 10)],
+)
+def test_decimal_typecast_left(dtype):
+    other_data = ["a", "b", "c", "d"]
+
+    join_data_l = cudf.Series(["95.05", "384.26", "74.22", "1456.94"]).astype(
+        dtype
+    )
+    join_data_r = cudf.Series(
+        ["95.05", "62.4056", "74.22", "1456.9472"]
+    ).astype(dtype)
+
+    gdf_l = cudf.DataFrame({"join_col": join_data_l, "B": other_data})
+    gdf_r = cudf.DataFrame({"join_col": join_data_r, "B": other_data})
+
+    exp_join_data = ["95.05", "74.22", "384.26", "1456.94"]
+    exp_other_data_x = ["a", "c", "b", "d"]
+    exp_other_data_y = ["a", "c", None, None]
+
+    exp_join_col = cudf.Series(exp_join_data).astype(dtype)
+
+    expected = cudf.DataFrame(
+        {
+            "join_col": exp_join_col,
+            "B_x": exp_other_data_x,
+            "B_y": exp_other_data_y,
+        }
+    )
+
+    got = gdf_l.merge(gdf_r, on="join_col", how="left")
+
+    assert_join_results_equal(expected, got, how="left")
+    assert_eq(dtype, got["join_col"].dtype)
+
+
+@pytest.mark.parametrize(
+    "dtype",
+    [Decimal64Dtype(7, 3), Decimal64Dtype(10, 5), Decimal64Dtype(18, 9)],
+)
+def test_decimal_typecast_outer(dtype):
+    other_data = ["a", "b", "c"]
+    join_data_l = cudf.Series(["741.248", "1029.528", "3627.292"]).astype(
+        dtype
+    )
+    join_data_r = cudf.Series(["9284.103", "1029.528", "948.637"]).astype(
+        dtype
+    )
+    gdf_l = cudf.DataFrame({"join_col": join_data_l, "B": other_data})
+    gdf_r = cudf.DataFrame({"join_col": join_data_r, "B": other_data})
+    exp_join_data = ["9284.103", "948.637", "1029.528", "741.248", "3627.292"]
+    exp_other_data_x = [None, None, "b", "a", "c"]
+    exp_other_data_y = ["a", "c", "b", None, None]
+    exp_join_col = cudf.Series(exp_join_data).astype(dtype)
+    expected = cudf.DataFrame(
+        {
+            "join_col": exp_join_col,
+            "B_x": exp_other_data_x,
+            "B_y": exp_other_data_y,
+        }
+    )
+    got = gdf_l.merge(gdf_r, on="join_col", how="outer")
+
+    assert_join_results_equal(expected, got, how="outer")
+    assert_eq(dtype, got["join_col"].dtype)
+
+
+@pytest.mark.parametrize(
+    "dtype_l", [Decimal64Dtype(7, 3), Decimal64Dtype(9, 5)],
+)
+@pytest.mark.parametrize(
+    "dtype_r", [Decimal64Dtype(8, 3), Decimal64Dtype(11, 6)],
+)
+def test_mixed_decimal_typecast(dtype_l, dtype_r):
+    other_data = ["a", "b", "c", "d"]
+
+    join_data_l = cudf.Series(["95.05", "34.6", "74.22", "14.94"]).astype(
+        dtype_l
+    )
+    join_data_r = cudf.Series(["95.05", "62.4056", "74.22", "1.42"]).astype(
+        dtype_r
+    )
+
+    gdf_l = cudf.DataFrame({"join_col": join_data_l, "B": other_data})
+    gdf_r = cudf.DataFrame({"join_col": join_data_r, "B": other_data})
+
+    with pytest.raises(
+        TypeError,
+        match="Decimal columns can only be merged with decimal columns "
+        "of the same precision and scale",
+    ):
+        gdf_l.merge(gdf_r, on="join_col", how="inner")
+
+
 @pytest.mark.parametrize(
     "dtype_l",
     ["datetime64[s]", "datetime64[ms]", "datetime64[us]", "datetime64[ns]"],
@@ -1738,3 +1869,56 @@ def test_join_renamed_index():
     )
     got = df.merge(df, left_index=True, right_index=True, how="inner")
     assert_join_results_equal(expect, got, how="inner")
+
+
+@pytest.mark.parametrize(
+    "lhs_col, lhs_idx, rhs_col, rhs_idx, on",
+    [
+        (["A", "B"], "L0", ["B", "C"], "L0", ["B"]),
+        (["A", "B"], "L0", ["B", "C"], "L0", ["L0"]),
+        (["A", "B"], "L0", ["B", "C"], "L0", ["B", "L0"]),
+        (["A", "B"], "L0", ["C", "L0"], "A", ["A"]),
+        (["A", "B"], "L0", ["C", "L0"], "A", ["L0"]),
+        (["A", "B"], "L0", ["C", "L0"], "A", ["A", "L0"]),
+    ],
+)
+@pytest.mark.parametrize(
+    "how", ["left", "inner", "right", "outer", "leftanti", "leftsemi"]
+)
+def test_join_merge_with_on(lhs_col, lhs_idx, rhs_col, rhs_idx, on, how):
+    lhs_data = {col_name: [4, 5, 6] for col_name in lhs_col}
+    lhs_index = cudf.Index([0, 1, 2], name=lhs_idx)
+
+    rhs_data = {col_name: [4, 5, 6] for col_name in rhs_col}
+    rhs_index = cudf.Index([2, 3, 4], name=rhs_idx)
+
+    gd_left = cudf.DataFrame(lhs_data, lhs_index)
+    gd_right = cudf.DataFrame(rhs_data, rhs_index)
+    pd_left = gd_left.to_pandas()
+    pd_right = gd_right.to_pandas()
+
+    expect = pd_left.merge(pd_right, on=on).sort_index(axis=1, ascending=False)
+    got = gd_left.merge(gd_right, on=on).sort_index(axis=1, ascending=False)
+
+    assert_join_results_equal(expect, got, how=how)
+
+
+@pytest.mark.parametrize(
+    "on", ["A", "L0"],
+)
+@pytest.mark.parametrize(
+    "how", ["left", "inner", "right", "outer", "leftanti", "leftsemi"]
+)
+def test_join_merge_invalid_keys(on, how):
+    gd_left = cudf.DataFrame(
+        {"A": [1, 2, 3], "B": [4, 5, 6]}, index=cudf.Index([0, 1, 2], name="C")
+    )
+    gd_right = cudf.DataFrame(
+        {"D": [2, 3, 4], "E": [7, 8, 0]}, index=cudf.Index([0, 2, 4], name="F")
+    )
+    pd_left = gd_left.to_pandas()
+    pd_right = gd_right.to_pandas()
+
+    for left, right in ((pd_left, pd_right), (gd_left, gd_right)):
+        with pytest.raises(KeyError):
+            left.merge(right, on=on)
diff --git a/python/cudf/cudf/tests/test_list.py b/python/cudf/cudf/tests/test_list.py
index 5645ce60596..9906600304b 100644
--- a/python/cudf/cudf/tests/test_list.py
+++ b/python/cudf/cudf/tests/test_list.py
@@ -1,6 +1,7 @@
 # Copyright (c) 2020-2021, NVIDIA CORPORATION.
 import functools
 
+import numpy as np
 import pandas as pd
 import pyarrow as pa
 import pytest
@@ -162,6 +163,39 @@ def test_take_invalid(invalid, exception):
         gs.list.take(invalid)
 
 
+@pytest.mark.parametrize(
+    ("data", "expected"),
+    [
+        ([[1, 1, 2, 2], [], None, [3, 4, 5]], [[1, 2], [], None, [3, 4, 5]]),
+        (
+            [[1.233, np.nan, 1.234, 3.141, np.nan, 1.234]],
+            [[1.233, 1.234, np.nan, 3.141]],
+        ),  # duplicate nans
+        ([[1, 1, 2, 2, None, None]], [[1, 2, None]]),  # duplicate nulls
+        (
+            [[1.233, np.nan, None, 1.234, 3.141, np.nan, 1.234, None]],
+            [[1.233, 1.234, np.nan, None, 3.141]],
+        ),  # duplicate nans and nulls
+        ([[2, None, 1, None, 2]], [[1, 2, None]]),
+        ([[], []], [[], []]),
+        ([[], None], [[], None]),
+    ],
+)
+def test_unique(data, expected):
+    """
+    Pandas de-duplicates NaNs and nulls separately in Series.unique;
+    `expected` is set up to mimic that behavior.
+    """
+    gs = cudf.Series(data, nan_as_null=False)
+
+    got = gs.list.unique()
+    expected = cudf.Series(expected, nan_as_null=False).list.sort_values()
+
+    got = got.list.sort_values()
+
+    assert_eq(expected, got)
+
+
 def key_func_builder(x, na_position):
     if x is None:
         if na_position == "first":
diff --git a/python/cudf/cudf/tests/test_parquet.py b/python/cudf/cudf/tests/test_parquet.py
index fe418d1ade1..4781ff995b0 100644
--- a/python/cudf/cudf/tests/test_parquet.py
+++ b/python/cudf/cudf/tests/test_parquet.py
@@ -19,7 +19,7 @@
 import cudf
 from cudf.io.parquet import ParquetWriter, merge_parquet_filemetadata
 from cudf.tests import dataset_generator as dg
-from cudf.tests.utils import assert_eq
+from cudf.tests.utils import assert_eq, assert_exceptions_equal
 
 
 @pytest.fixture(scope="module")
@@ -1937,3 +1937,15 @@ def test_parquet_writer_decimal(tmpdir):
 
     got = pd.read_parquet(fname)
     assert_eq(gdf, got)
+
+
+def test_parquet_writer_column_validation():
+    df = cudf.DataFrame({1: [1, 2, 3], "1": ["a", "b", "c"]})
+    pdf = df.to_pandas()
+
+    assert_exceptions_equal(
+        lfunc=df.to_parquet,
+        rfunc=pdf.to_parquet,
+        lfunc_args_and_kwargs=(["cudf.parquet"],),
+        rfunc_args_and_kwargs=(["pandas.parquet"],),
+    )
diff --git a/python/cudf/cudf/tests/test_reductions.py b/python/cudf/cudf/tests/test_reductions.py
index 80a2e89bf46..c998f308417 100644
--- a/python/cudf/cudf/tests/test_reductions.py
+++ b/python/cudf/cudf/tests/test_reductions.py
@@ -7,12 +7,15 @@
 from itertools import product
 
 import numpy as np
+import pandas as pd
 import pytest
+from decimal import Decimal
 
 import cudf
 from cudf.core import Series
+from cudf.core.dtypes import Decimal64Dtype
 from cudf.tests import utils
-from cudf.tests.utils import NUMERIC_TYPES, gen_rand
+from cudf.tests.utils import NUMERIC_TYPES, assert_eq, gen_rand
 
 params_dtype = NUMERIC_TYPES
 
@@ -50,6 +53,20 @@ def test_sum_string():
     assert got == expected
 
 
+@pytest.mark.parametrize(
+    "dtype",
+    [Decimal64Dtype(6, 3), Decimal64Dtype(10, 6), Decimal64Dtype(16, 7)],
+)
+@pytest.mark.parametrize("nelem", params_sizes)
+def test_sum_decimal(dtype, nelem):
+    data = [str(x) for x in gen_rand("int64", nelem) / 100]
+
+    expected = pd.Series([Decimal(x) for x in data]).sum()
+    got = cudf.Series(data).astype(dtype).sum()
+
+    assert_eq(expected, got)
+
+
 @pytest.mark.parametrize("dtype,nelem", params)
 def test_product(dtype, nelem):
     dtype = np.dtype(dtype).type
@@ -70,6 +87,19 @@ def test_product(dtype, nelem):
     np.testing.assert_approx_equal(expect, got, significant=significant)
 
 
+@pytest.mark.parametrize(
+    "dtype",
+    [Decimal64Dtype(6, 2), Decimal64Dtype(8, 4), Decimal64Dtype(10, 5)],
+)
+def test_product_decimal(dtype):
+    data = [str(x) for x in gen_rand("int8", 3) / 10]
+
+    expected = pd.Series([Decimal(x) for x in data]).product()
+    got = cudf.Series(data).astype(dtype).product()
+
+    assert_eq(expected, got)
+
+
 accuracy_for_dtype = {np.float64: 6, np.float32: 5}
 
 
@@ -94,6 +124,19 @@ def test_sum_of_squares(dtype, nelem):
         )
 
 
+@pytest.mark.parametrize(
+    "dtype",
+    [Decimal64Dtype(6, 2), Decimal64Dtype(8, 4), Decimal64Dtype(10, 5)],
+)
+def test_sum_of_squares_decimal(dtype):
+    data = [str(x) for x in gen_rand("int8", 3) / 10]
+
+    expected = pd.Series([Decimal(x) for x in data]).pow(2).sum()
+    got = cudf.Series(data).astype(dtype).sum_of_squares()
+
+    assert_eq(expected, got)
+
+
 @pytest.mark.parametrize("dtype,nelem", params)
 def test_min(dtype, nelem):
     dtype = np.dtype(dtype).type
@@ -106,6 +149,20 @@ def test_min(dtype, nelem):
     assert expect == got
 
 
+@pytest.mark.parametrize(
+    "dtype",
+    [Decimal64Dtype(6, 3), Decimal64Dtype(10, 6), Decimal64Dtype(16, 7)],
+)
+@pytest.mark.parametrize("nelem", params_sizes)
+def test_min_decimal(dtype, nelem):
+    data = [str(x) for x in gen_rand("int64", nelem) / 100]
+
+    expected = pd.Series([Decimal(x) for x in data]).min()
+    got = cudf.Series(data).astype(dtype).min()
+
+    assert_eq(expected, got)
+
+
 @pytest.mark.parametrize("dtype,nelem", params)
 def test_max(dtype, nelem):
     dtype = np.dtype(dtype).type
@@ -118,6 +175,20 @@ def test_max(dtype, nelem):
     assert expect == got
 
 
+@pytest.mark.parametrize(
+    "dtype",
+    [Decimal64Dtype(6, 3), Decimal64Dtype(10, 6), Decimal64Dtype(16, 7)],
+)
+@pytest.mark.parametrize("nelem", params_sizes)
+def test_max_decimal(dtype, nelem):
+    data = [str(x) for x in gen_rand("int64", nelem) / 100]
+
+    expected = pd.Series([Decimal(x) for x in data]).max()
+    got = cudf.Series(data).astype(dtype).max()
+
+    assert_eq(expected, got)
+
+
 @pytest.mark.parametrize("nelem", params_sizes)
 def test_sum_masked(nelem):
     dtype = np.float64
diff --git a/python/cudf/cudf/tests/test_replace.py b/python/cudf/cudf/tests/test_replace.py
index e7baa4ee926..9009b018ce2 100644
--- a/python/cudf/cudf/tests/test_replace.py
+++ b/python/cudf/cudf/tests/test_replace.py
@@ -7,7 +7,6 @@
 import pytest
 
 import cudf
-from cudf.core import DataFrame, Series
 from cudf.tests.utils import (
     INTEGER_TYPES,
     NUMERIC_TYPES,
@@ -36,6 +35,7 @@
         ([1, 2, 3], cudf.Series([10, 11, 12])),
         (cudf.Series([1, 2, 3]), None),
         ({1: 10, 2: 22}, None),
+        (np.inf, 4),
     ],
 )
 def test_series_replace_all(gsr, to_replace, value):
@@ -64,14 +64,14 @@ def test_series_replace():
 
     # Numerical
     a2 = np.array([5, 1, 2, 3, 4])
-    sr1 = Series(a1)
+    sr1 = cudf.Series(a1)
     sr2 = sr1.replace(0, 5)
     assert_eq(a2, sr2.to_array())
 
     # Categorical
     psr3 = pd.Series(["one", "two", "three"], dtype="category")
     psr4 = psr3.replace("one", "two")
-    sr3 = Series.from_pandas(psr3)
+    sr3 = cudf.from_pandas(psr3)
     sr4 = sr3.replace("one", "two")
     assert_eq(psr4, sr4)
 
@@ -94,7 +94,7 @@ def test_series_replace():
     assert_eq(a8, sr8.to_array())
 
     # large input containing null
-    sr9 = Series(list(range(400)) + [None])
+    sr9 = cudf.Series(list(range(400)) + [None])
     sr10 = sr9.replace([22, 323, 27, 0], None)
     assert sr10.null_count == 5
     assert len(sr10.to_array()) == (401 - 5)
@@ -119,7 +119,7 @@ def test_series_replace_with_nulls():
 
     # Numerical
     a2 = np.array([-10, 1, 2, 3, 4])
-    sr1 = Series(a1)
+    sr1 = cudf.Series(a1)
     sr2 = sr1.replace(0, None).fillna(-10)
     assert_eq(a2, sr2.to_array())
 
@@ -128,7 +128,7 @@ def test_series_replace_with_nulls():
     sr6 = sr1.replace([0, 1], [None, 6]).fillna(-10)
     assert_eq(a6, sr6.to_array())
 
-    sr1 = Series([0, 1, 2, 3, 4, None])
+    sr1 = cudf.Series([0, 1, 2, 3, 4, None])
     with pytest.raises(TypeError):
         sr1.replace([0, 1], [5.5, 6.5]).fillna(-10)
 
@@ -230,7 +230,7 @@ def test_dataframe_replace(df, to_replace, value):
 def test_dataframe_replace_with_nulls():
     # numerical
     pdf1 = pd.DataFrame({"a": [0, 1, 2, 3], "b": [0, 1, 2, 3]})
-    gdf1 = DataFrame.from_pandas(pdf1)
+    gdf1 = cudf.from_pandas(pdf1)
     pdf2 = pdf1.replace(0, 4)
     gdf2 = gdf1.replace(0, None).fillna(4)
     assert_eq(gdf2, pdf2)
@@ -249,7 +249,7 @@ def test_dataframe_replace_with_nulls():
     gdf8 = gdf1.replace({"a": 0, "b": 0}, {"a": None, "b": 5}).fillna(4)
     assert_eq(gdf8, pdf8)
 
-    gdf1 = DataFrame({"a": [0, 1, 2, 3], "b": [0, 1, 2, None]})
+    gdf1 = cudf.DataFrame({"a": [0, 1, 2, 3], "b": [0, 1, 2, None]})
     gdf9 = gdf1.replace([0, 1], [4, 5]).fillna(3)
     assert_eq(gdf9, pdf6)
 
@@ -375,7 +375,7 @@ def test_fillna_method_numerical(data, container, data_dtype, method, inplace):
 @pytest.mark.parametrize("inplace", [True, False])
 def test_fillna_categorical(psr_data, fill_value, inplace):
     psr = psr_data.copy(deep=True)
-    gsr = Series.from_pandas(psr)
+    gsr = cudf.from_pandas(psr)
 
     if isinstance(fill_value, pd.Series):
         fill_value_cudf = cudf.from_pandas(fill_value)
@@ -620,7 +620,7 @@ def test_fillna_method_fixed_width_non_num(data, container, method, inplace):
 @pytest.mark.parametrize("inplace", [True, False])
 def test_fillna_dataframe(df, value, inplace):
     pdf = df.copy(deep=True)
-    gdf = DataFrame.from_pandas(pdf)
+    gdf = cudf.from_pandas(pdf)
 
     fill_value_pd = value
     if isinstance(fill_value_pd, (pd.Series, pd.DataFrame)):
@@ -688,7 +688,7 @@ def test_fillna_string(ps_data, fill_value, inplace):
 
 @pytest.mark.parametrize("data_dtype", INTEGER_TYPES)
 def test_series_fillna_invalid_dtype(data_dtype):
-    gdf = Series([1, 2, None, 3], dtype=data_dtype)
+    gdf = cudf.Series([1, 2, None, 3], dtype=data_dtype)
     fill_value = 2.5
     with pytest.raises(TypeError) as raises:
         gdf.fillna(fill_value)
@@ -702,38 +702,53 @@ def test_series_fillna_invalid_dtype(data_dtype):
 @pytest.mark.parametrize("fill_value", [100, 100.0, 128.5])
 def test_series_where(data_dtype, fill_value):
     psr = pd.Series(list(range(10)), dtype=data_dtype)
-    sr = Series.from_pandas(psr)
+    sr = cudf.from_pandas(psr)
 
     if sr.dtype.type(fill_value) != fill_value:
         with pytest.raises(TypeError):
             sr.where(sr > 0, fill_value)
     else:
         # Cast back to original dtype as pandas automatically upcasts
-        expect = psr.where(psr > 0, fill_value).astype(psr.dtype)
+        expect = psr.where(psr > 0, fill_value)
         got = sr.where(sr > 0, fill_value)
-        assert_eq(expect, got)
+        # pandas returns 'float16' dtype, which is not supported in cudf
+        assert_eq(
+            expect,
+            got,
+            check_dtype=expect.dtype.kind != "f",
+        )
 
     if sr.dtype.type(fill_value) != fill_value:
         with pytest.raises(TypeError):
             sr.where(sr < 0, fill_value)
     else:
-        expect = psr.where(psr < 0, fill_value).astype(psr.dtype)
+        expect = psr.where(psr < 0, fill_value)
         got = sr.where(sr < 0, fill_value)
-        assert_eq(expect, got)
+        # pandas returns 'float16' dtype, which is not supported in cudf
+        assert_eq(
+            expect,
+            got,
+            check_dtype=expect.dtype.kind != "f",
+        )
 
     if sr.dtype.type(fill_value) != fill_value:
         with pytest.raises(TypeError):
             sr.where(sr == 0, fill_value)
     else:
-        expect = psr.where(psr == 0, fill_value).astype(psr.dtype)
+        expect = psr.where(psr == 0, fill_value)
         got = sr.where(sr == 0, fill_value)
-        assert_eq(expect, got)
+        # pandas returns 'float16' dtype, which is not supported in cudf
+        assert_eq(
+            expect,
+            got,
+            check_dtype=expect.dtype.kind != "f",
+        )
 
 
 @pytest.mark.parametrize("fill_value", [100, 100.0, 100.5])
 def test_series_with_nulls_where(fill_value):
     psr = pd.Series([None] * 3 + list(range(5)))
-    sr = Series.from_pandas(psr)
+    sr = cudf.from_pandas(psr)
 
     expect = psr.where(psr > 0, fill_value)
     got = sr.where(sr > 0, fill_value)
@@ -756,7 +771,7 @@ def test_dataframe_with_nulls_where_with_scalars(fill_value):
             "B": [4, -2, 3, None, 7, 6, 8, 0],
         }
     )
-    gdf = DataFrame.from_pandas(pdf)
+    gdf = cudf.from_pandas(pdf)
 
     expect = pdf.where(pdf % 3 == 0, fill_value)
     got = gdf.where(gdf % 3 == 0, fill_value)
@@ -770,7 +785,7 @@ def test_dataframe_with_different_types():
     pdf = pd.DataFrame(
         {"A": [111, 22, 31, 410, 56], "B": [-10.12, 121.2, 45.7, 98.4, 87.6]}
     )
-    gdf = DataFrame.from_pandas(pdf)
+    gdf = cudf.from_pandas(pdf)
     expect = pdf.where(pdf > 50, -pdf)
     got = gdf.where(gdf > 50, -gdf)
 
@@ -778,9 +793,9 @@ def test_dataframe_with_different_types():
 
     # Testing for string
     pdf = pd.DataFrame({"A": ["a", "bc", "cde", "fghi"]})
-    gdf = DataFrame.from_pandas(pdf)
+    gdf = cudf.from_pandas(pdf)
     pdf_mask = pd.DataFrame({"A": [True, False, True, False]})
-    gdf_mask = DataFrame.from_pandas(pdf_mask)
+    gdf_mask = cudf.from_pandas(pdf_mask)
     expect = pdf.where(pdf_mask, ["cudf"])
     got = gdf.where(gdf_mask, ["cudf"])
 
@@ -789,7 +804,7 @@ def test_dataframe_with_different_types():
     # Testing for categoriacal
     pdf = pd.DataFrame({"A": ["a", "b", "b", "c"]})
     pdf["A"] = pdf["A"].astype("category")
-    gdf = DataFrame.from_pandas(pdf)
+    gdf = cudf.from_pandas(pdf)
     expect = pdf.where(pdf_mask, "c")
     got = gdf.where(gdf_mask, ["c"])
 
@@ -798,7 +813,7 @@ def test_dataframe_with_different_types():
 
 def test_dataframe_where_with_different_options():
     pdf = pd.DataFrame({"A": [1, 2, 3], "B": [3, 4, 5]})
-    gdf = DataFrame.from_pandas(pdf)
+    gdf = cudf.from_pandas(pdf)
 
     # numpy array
     boolean_mask = np.array([[False, True], [True, False], [False, True]])
@@ -822,8 +837,8 @@ def test_dataframe_where_with_different_options():
 
 
 def test_series_multiple_times_with_nulls():
-    sr = Series([1, 2, 3, None])
-    expected = Series([None, None, None, None], dtype=np.int64)
+    sr = cudf.Series([1, 2, 3, None])
+    expected = cudf.Series([None, None, None, None], dtype=np.int64)
 
     for i in range(3):
         got = sr.replace([1, 2, 3], None)
@@ -835,7 +850,7 @@ def test_series_multiple_times_with_nulls():
         # subsequent calls and the memory used for mask may have junk values.
         # So, if it is not updated properly, the result would be wrong.
         # So, this will help verify that scenario.
-        Series([1, 1, 1, None])
+        cudf.Series([1, 1, 1, None])
 
 
 @pytest.mark.parametrize("series_dtype", NUMERIC_TYPES)
@@ -844,7 +859,7 @@ def test_series_multiple_times_with_nulls():
 )
 def test_numeric_series_replace_dtype(series_dtype, replacement):
     psr = pd.Series([0, 1, 2, 3, 4, 5], dtype=series_dtype)
-    sr = Series.from_pandas(psr)
+    sr = cudf.from_pandas(psr)
 
     # Both Scalar
     if sr.dtype.type(replacement) != replacement:
@@ -889,7 +904,7 @@ def test_numeric_series_replace_dtype(series_dtype, replacement):
 
 def test_replace_inplace():
     data = np.array([5, 1, 2, 3, 4])
-    sr = Series(data)
+    sr = cudf.Series(data)
     psr = pd.Series(data)
 
     sr_copy = sr.copy()
@@ -902,7 +917,7 @@ def test_replace_inplace():
     assert_eq(sr, psr)
     assert_eq(sr_copy, psr_copy)
 
-    sr = Series(data)
+    sr = cudf.Series(data)
     psr = pd.Series(data)
 
     sr_copy = sr.copy()
@@ -919,7 +934,7 @@ def test_replace_inplace():
     assert_eq(srr, psrr)
 
     psr = pd.Series(["one", "two", "three"], dtype="category")
-    sr = Series.from_pandas(psr)
+    sr = cudf.from_pandas(psr)
 
     sr_copy = sr.copy()
     psr_copy = psr.copy()
@@ -932,7 +947,7 @@ def test_replace_inplace():
     assert_eq(sr_copy, psr_copy)
 
     pdf = pd.DataFrame({"A": [0, 1, 2, 3, 4], "B": [5, 6, 7, 8, 9]})
-    gdf = DataFrame.from_pandas(pdf)
+    gdf = cudf.from_pandas(pdf)
 
     pdf_copy = pdf.copy()
     gdf_copy = gdf.copy()
@@ -944,7 +959,7 @@ def test_replace_inplace():
     assert_eq(pdf_copy, gdf_copy)
 
     pds = pd.Series([1, 2, 3, 45])
-    gds = Series.from_pandas(pds)
+    gds = cudf.from_pandas(pds)
     vals = np.array([]).astype(int)
 
     assert_eq(pds.replace(vals, -1), gds.replace(vals, -1))
@@ -954,7 +969,7 @@ def test_replace_inplace():
     assert_eq(pds, gds)
 
     pdf = pd.DataFrame({"a": [1, 2, 3, 4, 5, 666]})
-    gdf = DataFrame.from_pandas(pdf)
+    gdf = cudf.from_pandas(pdf)
 
     assert_eq(
         pdf.replace({"a": 2}, {"a": -33}), gdf.replace({"a": 2}, {"a": -33})
@@ -987,7 +1002,7 @@ def test_dataframe_clip(lower, upper, inplace):
     pdf = pd.DataFrame(
         {"a": [1, 2, 3, 4, 5], "b": [7.1, 7.24, 7.5, 7.8, 8.11]}
     )
-    gdf = DataFrame.from_pandas(pdf)
+    gdf = cudf.from_pandas(pdf)
 
     got = gdf.clip(lower=lower, upper=upper, inplace=inplace)
     expect = pdf.clip(lower=lower, upper=upper, axis=1)
@@ -1005,7 +1020,7 @@ def test_dataframe_clip(lower, upper, inplace):
 def test_dataframe_category_clip(lower, upper, inplace):
     data = ["a", "b", "c", "d", "e"]
     pdf = pd.DataFrame({"a": data})
-    gdf = DataFrame.from_pandas(pdf)
+    gdf = cudf.from_pandas(pdf)
     gdf["a"] = gdf["a"].astype("category")
 
     expect = pdf.clip(lower=lower, upper=upper)
@@ -1022,7 +1037,9 @@ def test_dataframe_category_clip(lower, upper, inplace):
     [([2, 7.4], [4, 7.9, "d"]), ([2, 7.4, "a"], [4, 7.9, "d"])],
 )
 def test_dataframe_exceptions_for_clip(lower, upper):
-    gdf = DataFrame({"a": [1, 2, 3, 4, 5], "b": [7.1, 7.24, 7.5, 7.8, 8.11]})
+    gdf = cudf.DataFrame(
+        {"a": [1, 2, 3, 4, 5], "b": [7.1, 7.24, 7.5, 7.8, 8.11]}
+    )
 
     with pytest.raises(ValueError):
         gdf.clip(lower=lower, upper=upper)
@@ -1045,7 +1062,7 @@ def test_dataframe_exceptions_for_clip(lower, upper):
 @pytest.mark.parametrize("inplace", [True, False])
 def test_series_clip(data, lower, upper, inplace):
     psr = pd.Series(data)
-    gsr = Series.from_pandas(data)
+    gsr = cudf.Series.from_pandas(data)
 
     expect = psr.clip(lower=lower, upper=upper)
     got = gsr.clip(lower=lower, upper=upper, inplace=inplace)
@@ -1059,10 +1076,10 @@ def test_series_clip(data, lower, upper, inplace):
 def test_series_exceptions_for_clip():
 
     with pytest.raises(ValueError):
-        Series([1, 2, 3, 4]).clip([1, 2], [2, 3])
+        cudf.Series([1, 2, 3, 4]).clip([1, 2], [2, 3])
 
     with pytest.raises(NotImplementedError):
-        Series([1, 2, 3, 4]).clip(1, 2, axis=0)
+        cudf.Series([1, 2, 3, 4]).clip(1, 2, axis=0)
 
 
 @pytest.mark.parametrize(
@@ -1080,7 +1097,7 @@ def test_series_exceptions_for_clip():
 @pytest.mark.parametrize("inplace", [True, False])
 def test_index_clip(data, lower, upper, inplace):
     pdf = pd.DataFrame({"a": data})
-    index = DataFrame.from_pandas(pdf).set_index("a").index
+    index = cudf.from_pandas(pdf).set_index("a").index
 
     expect = pdf.clip(lower=lower, upper=upper)
     got = index.clip(lower=lower, upper=upper, inplace=inplace)
@@ -1097,7 +1114,7 @@ def test_index_clip(data, lower, upper, inplace):
 @pytest.mark.parametrize("inplace", [True, False])
 def test_multiindex_clip(lower, upper, inplace):
     df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [1, 2, 3, 4, 5]})
-    gdf = DataFrame.from_pandas(df)
+    gdf = cudf.from_pandas(df)
 
     index = gdf.set_index(["a", "b"]).index
 
@@ -1128,13 +1145,13 @@ def test_series_fillna(data, index, value):
         data,
         index=index if index is not None and len(index) == len(data) else None,
     )
-    gsr = Series(
+    gsr = cudf.Series(
         data,
         index=index if index is not None and len(index) == len(data) else None,
     )
 
     expect = psr.fillna(pd.Series(value))
-    got = gsr.fillna(Series(value))
+    got = gsr.fillna(cudf.Series(value))
     assert_eq(expect, got)
 
 
diff --git a/python/cudf/cudf/tests/test_scalar.py b/python/cudf/cudf/tests/test_scalar.py
index 58115cecee7..916e73ea381 100644
--- a/python/cudf/cudf/tests/test_scalar.py
+++ b/python/cudf/cudf/tests/test_scalar.py
@@ -1,6 +1,7 @@
 import datetime
 import datetime as dt
 import re
+from decimal import Decimal
 
 import numpy as np
 import pandas as pd
@@ -16,6 +17,12 @@
     TIMEDELTA_TYPES,
 )
 
+TEST_DECIMAL_TYPES = [
+    cudf.Decimal64Dtype(1, 1),
+    cudf.Decimal64Dtype(4, 2),
+    cudf.Decimal64Dtype(4, -2),
+]
+
 SCALAR_VALUES = [
     0,
     -1,
@@ -103,8 +110,14 @@
     np.object_("asdf"),
 ]
 
+DECIMAL_VALUES = [
+    Decimal("100"),
+    Decimal("0.0042"),
+    Decimal("1.0042"),
+]
+
 
-@pytest.mark.parametrize("value", SCALAR_VALUES)
+@pytest.mark.parametrize("value", SCALAR_VALUES + DECIMAL_VALUES)
 def test_scalar_host_initialization(value):
     s = cudf.Scalar(value)
 
@@ -130,7 +143,24 @@ def test_scalar_device_initialization(value):
     assert s._is_host_value_current
 
 
-@pytest.mark.parametrize("value", SCALAR_VALUES)
+@pytest.mark.parametrize("value", DECIMAL_VALUES)
+def test_scalar_device_initialization_decimal(value):
+    dtype = cudf.Decimal64Dtype._from_decimal(value)
+    column = cudf.Series([str(value)]).astype(dtype)._column
+    dev_slr = get_element(column, 0)
+
+    s = cudf.Scalar(dev_slr)
+
+    assert s._is_device_value_current
+    assert not s._is_host_value_current
+
+    assert s.value == value
+
+    assert s._is_device_value_current
+    assert s._is_host_value_current
+
+
+@pytest.mark.parametrize("value", SCALAR_VALUES + DECIMAL_VALUES)
 def test_scalar_roundtrip(value):
     s = cudf.Scalar(value)
 
@@ -156,9 +186,19 @@ def test_scalar_roundtrip(value):
 
 
 @pytest.mark.parametrize(
-    "dtype", NUMERIC_TYPES + DATETIME_TYPES + TIMEDELTA_TYPES + ["object"]
+    "dtype",
+    NUMERIC_TYPES
+    + DATETIME_TYPES
+    + TIMEDELTA_TYPES
+    + ["object"]
+    + TEST_DECIMAL_TYPES,
 )
 def test_null_scalar(dtype):
+    if isinstance(dtype, cudf.Decimal64Dtype):
+        with pytest.raises(NotImplementedError):
+            s = cudf.Scalar(None, dtype=dtype)
+        return
+
     s = cudf.Scalar(None, dtype=dtype)
     assert s.value is cudf.NA
     assert s.dtype == np.dtype(dtype)
@@ -194,9 +234,19 @@ def test_generic_null_scalar_construction_fails(value):
 
 
 @pytest.mark.parametrize(
-    "dtype", NUMERIC_TYPES + DATETIME_TYPES + TIMEDELTA_TYPES + ["object"]
+    "dtype",
+    NUMERIC_TYPES
+    + DATETIME_TYPES
+    + TIMEDELTA_TYPES
+    + ["object"]
+    + TEST_DECIMAL_TYPES,
 )
 def test_scalar_dtype_and_validity(dtype):
+    if isinstance(dtype, cudf.Decimal64Dtype):
+        with pytest.raises(NotImplementedError):
+            s = cudf.Scalar(None, dtype=dtype)
+        return
+
     s = cudf.Scalar(1, dtype=dtype)
 
     assert s.dtype == np.dtype(dtype)
@@ -277,24 +327,33 @@ def test_scalar_invalid_implicit_conversion(cls, dtype):
             cls(slr)
 
 
-@pytest.mark.parametrize("value", SCALAR_VALUES)
+@pytest.mark.parametrize("value", SCALAR_VALUES + DECIMAL_VALUES)
 def test_device_scalar_direct_construction(value):
     value = cudf.utils.utils.to_cudf_compatible_scalar(value)
-    dtype = value.dtype
+    dtype = (
+        value.dtype
+        if not isinstance(value, Decimal)
+        else cudf.Decimal64Dtype._from_decimal(value)
+    )
 
     s = cudf._lib.scalar.DeviceScalar(value, dtype)
 
     assert s.value == value or np.isnan(s.value) and np.isnan(value)
-    if dtype.char == "U":
+    if isinstance(dtype, cudf.Decimal64Dtype):
+        assert s.dtype.precision == dtype.precision
+        assert s.dtype.scale == dtype.scale
+    elif dtype.char == "U":
         assert s.dtype == "object"
     else:
         assert s.dtype == dtype
 
 
-@pytest.mark.parametrize("value", SCALAR_VALUES)
+@pytest.mark.parametrize("value", SCALAR_VALUES + DECIMAL_VALUES)
 def test_construct_from_scalar(value):
     value = cudf.utils.utils.to_cudf_compatible_scalar(value)
-    x = cudf.Scalar(value, value.dtype)
+    x = cudf.Scalar(
+        value, value.dtype if not isinstance(value, Decimal) else None
+    )
     y = cudf.Scalar(x)
     assert x.value == y.value or np.isnan(x.value) and np.isnan(y.value)
 
diff --git a/python/cudf/cudf/tests/test_series.py b/python/cudf/cudf/tests/test_series.py
index beda14934ca..0dc53fa29e9 100644
--- a/python/cudf/cudf/tests/test_series.py
+++ b/python/cudf/cudf/tests/test_series.py
@@ -921,6 +921,42 @@ def custom_add_func(sr, val):
     )
 
 
+@pytest.mark.parametrize(
+    "data",
+    [cudf.Series([1, 2, 3]), cudf.Series([10, 11, 12], index=[1, 2, 3])],
+)
+@pytest.mark.parametrize(
+    "other",
+    [
+        cudf.Series([4, 5, 6]),
+        cudf.Series([4, 5, 6, 7, 8]),
+        cudf.Series([4, np.nan, 6], nan_as_null=False),
+        [4, np.nan, 6],
+        {1: 9},
+    ],
+)
+def test_series_update(data, other):
+    gs = data.copy(deep=True)
+    if isinstance(other, cudf.Series):
+        g_other = other.copy(deep=True)
+        p_other = g_other.to_pandas()
+    else:
+        g_other = other
+        p_other = other
+
+    ps = gs.to_pandas()
+
+    gs_column_before = gs._column
+    gs.update(g_other)
+    gs_column_after = gs._column
+
+    assert_eq(gs_column_before.to_array(), gs_column_after.to_array())
+
+    ps.update(p_other)
+
+    assert_eq(gs, ps)
+
+
 @pytest.mark.parametrize(
     "data",
     [
@@ -942,6 +978,19 @@ def test_fillna_with_nan(data, nan_as_null, fill_value):
     assert_eq(expected, actual)
 
 
+def test_series_mask_mixed_dtypes_error():
+    s = cudf.Series(["a", "b", "c"])
+    with pytest.raises(
+        TypeError,
+        match=re.escape(
+            "cudf does not support mixed types, please type-cast "
+            "the column of dataframe/series and other "
+            "to same dtypes."
+        ),
+    ):
+        s.where([True, False, True], [1, 2, 3])
+
+
 @pytest.mark.parametrize(
     "ps",
     [
diff --git a/python/cudf/cudf/tests/test_struct.py b/python/cudf/cudf/tests/test_struct.py
index c7efb55c089..3c211951dff 100644
--- a/python/cudf/cudf/tests/test_struct.py
+++ b/python/cudf/cudf/tests/test_struct.py
@@ -34,3 +34,13 @@ def test_struct_of_struct_loc():
     df = cudf.DataFrame({"col": [{"a": {"b": 1}}]})
     expect = cudf.Series([{"a": {"b": 1}}], name="col")
     assert_eq(expect, df["col"])
+
+
+@pytest.mark.parametrize(
+    "key, expect", [(0, [1, 3]), (1, [2, 4]), ("a", [1, 3]), ("b", [2, 4])]
+)
+def test_struct_for_field(key, expect):
+    sr = cudf.Series([{"a": 1, "b": 2}, {"a": 3, "b": 4}])
+    expect = cudf.Series(expect)
+    got = sr.struct.field(key)
+    assert_eq(expect, got)
diff --git a/python/cudf/cudf/tests/utils.py b/python/cudf/cudf/tests/utils.py
index 1163c3085e4..37a74ab4760 100644
--- a/python/cudf/cudf/tests/utils.py
+++ b/python/cudf/cudf/tests/utils.py
@@ -74,6 +74,17 @@ def assert_eq(left, right, **kwargs):
     without switching between assert_frame_equal/assert_series_equal/...
     functions.
     """
+    # dtypes that we support but Pandas doesn't will convert to
+    # `object`. Check equality before that happens:
+    if kwargs.get("check_dtype", True):
+        if hasattr(left, "dtype") and hasattr(right, "dtype"):
+            if isinstance(
+                left.dtype, cudf.core.dtypes._BaseDtype
+            ) and not isinstance(
+                left.dtype, cudf.CategoricalDtype
+            ):  # leave categorical comparison to Pandas
+                assert_eq(left.dtype, right.dtype)
+
     if hasattr(left, "to_pandas"):
         left = left.to_pandas()
     if hasattr(right, "to_pandas"):
diff --git a/python/cudf/cudf/utils/dtypes.py b/python/cudf/cudf/utils/dtypes.py
index 8af225ecb58..a8ff2177154 100644
--- a/python/cudf/cudf/utils/dtypes.py
+++ b/python/cudf/cudf/utils/dtypes.py
@@ -4,6 +4,7 @@
 import numbers
 from collections import namedtuple
 from collections.abc import Sequence
+from decimal import Decimal
 
 import cupy as cp
 import numpy as np
@@ -294,15 +295,20 @@ def cudf_dtype_to_pa_type(dtype):
     """
     if is_categorical_dtype(dtype):
         raise NotImplementedError()
-    elif is_list_dtype(dtype):
-        return dtype.to_arrow()
-    elif is_struct_dtype(dtype):
+    elif (
+        is_list_dtype(dtype)
+        or is_struct_dtype(dtype)
+        or is_decimal_dtype(dtype)
+    ):
         return dtype.to_arrow()
     else:
         return np_to_pa_dtype(np.dtype(dtype))
 
 
 def cudf_dtype_from_pa_type(typ):
+    """
+    Given a pyarrow type, converts it into the equivalent cudf or pandas dtype.
+    """
     if pa.types.is_list(typ):
         return cudf.core.dtypes.ListDtype.from_arrow(typ)
     elif pa.types.is_struct(typ):
@@ -345,9 +351,12 @@ def to_cudf_compatible_scalar(val, dtype=None):
     if not is_scalar(val):
         raise ValueError(
             f"Cannot convert value of type {type(val).__name__} "
-            " to cudf scalar"
+            "to cudf scalar"
         )
 
+    if isinstance(val, Decimal):
+        return val
+
     if isinstance(val, (np.ndarray, cp.ndarray)) and val.ndim == 0:
         val = val.item()
 
@@ -578,6 +587,24 @@ def _get_nan_for_dtype(dtype):
         return np.float64("nan")
 
 
+def _decimal_to_int64(decimal: Decimal) -> int:
+    """
+    Scale a Decimal such that the result is the integer
+    that would result from removing the decimal point.
+
+    Examples
+    --------
+    >>> _decimal_to_int64(Decimal('1.42'))
+    142
+    >>> _decimal_to_int64(Decimal('0.0042'))
+    42
+    >>> _decimal_to_int64(Decimal('-1.004201'))
+    -1004201
+
+    """
+    return int(f"{decimal:0f}".replace(".", ""))
+
+
 def get_allowed_combinations_for_operator(dtype_l, dtype_r, op):
     error = TypeError(
         f"{op} not supported between {dtype_l} and {dtype_r} scalars"
@@ -637,6 +664,11 @@ def find_common_type(dtypes):
     # Aggregate same types
     dtypes = set(dtypes)
 
+    if any(is_decimal_dtype(dtype) for dtype in dtypes):
+        raise NotImplementedError(
+            "DecimalDtype is not yet supported in find_common_type"
+        )
+
     # Corner case 1:
     # Resort to np.result_type to handle "M" and "m" types separately
     dt_dtypes = set(filter(lambda t: is_datetime_dtype(t), dtypes))
@@ -651,7 +683,64 @@ def find_common_type(dtypes):
         dtypes = dtypes - td_dtypes
         dtypes.add(np.result_type(*td_dtypes))
 
-    return np.find_common_type(list(dtypes), [])
+    common_dtype = np.find_common_type(list(dtypes), [])
+    if common_dtype == np.dtype("float16"):
+        # cuDF does not support float16 dtype
+        return np.dtype("float32")
+    else:
+        return common_dtype
+
+
+def _can_cast(from_dtype, to_dtype):
+    """
+    Utility function to determine if we can cast
+    from `from_dtype` to `to_dtype`. This function primarily calls
+    `np.can_cast` but with some special handling around
+    cudf specific dtypes.
+    """
+    if isinstance(from_dtype, type):
+        from_dtype = np.dtype(from_dtype)
+    if isinstance(to_dtype, type):
+        to_dtype = np.dtype(to_dtype)
+
+    # TODO : Add precision & scale checking for
+    # decimal types in future
+    if isinstance(from_dtype, cudf.core.dtypes.Decimal64Dtype):
+        if isinstance(to_dtype, cudf.core.dtypes.Decimal64Dtype):
+            return True
+        elif isinstance(to_dtype, np.dtype):
+            if to_dtype.kind in {"i", "f", "u", "U", "O"}:
+                return True
+            else:
+                return False
+    elif isinstance(from_dtype, np.dtype):
+        if isinstance(to_dtype, np.dtype):
+            return np.can_cast(from_dtype, to_dtype)
+        elif isinstance(to_dtype, cudf.core.dtypes.Decimal64Dtype):
+            if from_dtype.kind in {"i", "f", "u", "U", "O"}:
+                return True
+            else:
+                return False
+        elif isinstance(to_dtype, cudf.core.dtypes.CategoricalDtype):
+            return True
+        else:
+            return False
+    elif isinstance(from_dtype, cudf.core.dtypes.ListDtype):
+        # TODO: Add level based checks too once casting of
+        # list columns is supported
+        if isinstance(to_dtype, cudf.core.dtypes.ListDtype):
+            return np.can_cast(from_dtype.leaf_type, to_dtype.leaf_type)
+        else:
+            return False
+    elif isinstance(from_dtype, cudf.core.dtypes.CategoricalDtype):
+        if isinstance(to_dtype, cudf.core.dtypes.CategoricalDtype):
+            return True
+        elif isinstance(to_dtype, np.dtype):
+            return np.can_cast(from_dtype._categories.dtype, to_dtype)
+        else:
+            return False
+    else:
+        return np.can_cast(from_dtype, to_dtype)
 
 
 # Type dispatch loops similar to what are found in `np.add.types`
diff --git a/python/cudf/cudf/utils/utils.py b/python/cudf/cudf/utils/utils.py
index ba9fa734248..c69ccb0f42e 100644
--- a/python/cudf/cudf/utils/utils.py
+++ b/python/cudf/cudf/utils/utils.py
@@ -1,7 +1,6 @@
 # Copyright (c) 2020-2021, NVIDIA CORPORATION.
 
 import functools
-from collections import OrderedDict
 from collections.abc import Sequence
 from math import floor, isinf, isnan
 
@@ -280,62 +279,6 @@ def __get__(self, instance, cls):
             return value
 
 
-class NestedMappingMixin:
-    """
-    Make missing values of a mapping empty instances
-    of the same type as the mapping.
-    """
-
-    def __getitem__(self, key):
-        if isinstance(key, tuple):
-            d = self
-            for k in key[:-1]:
-                d = d[k]
-            return d.__getitem__(key[-1])
-        else:
-            return super().__getitem__(key)
-
-    def __setitem__(self, key, value):
-        if isinstance(key, tuple):
-            d = self
-            for k in key[:-1]:
-                d = d.setdefault(k, self.__class__())
-            d.__setitem__(key[-1], value)
-        else:
-            super().__setitem__(key, value)
-
-
-class NestedOrderedDict(NestedMappingMixin, OrderedDict):
-    pass
-
-
-def to_flat_dict(d):
-    """
-    Convert the given nested dictionary to a flat dictionary
-    with tuple keys.
-    """
-
-    def _inner(d, parents=None):
-        if parents is None:
-            parents = []
-        for k, v in d.items():
-            if not isinstance(v, d.__class__):
-                if parents:
-                    k = tuple(parents + [k])
-                yield (k, v)
-            else:
-                yield from _inner(d=v, parents=parents + [k])
-
-    return {k: v for k, v in _inner(d)}
-
-
-def to_nested_dict(d):
-    """
-    Convert the given dictionary with tuple keys to a NestedOrderedDict.
-    """
-    return NestedOrderedDict(d)
-
-
 def time_col_replace_nulls(input_col):
 
     null = column.column_empty_like(input_col, masked=True, newsize=1)
diff --git a/python/cudf/pyproject.toml b/python/cudf/pyproject.toml
new file mode 100644
index 00000000000..630efd5b9ec
--- /dev/null
+++ b/python/cudf/pyproject.toml
@@ -0,0 +1,29 @@
+# Copyright (c) 2021, NVIDIA CORPORATION.
+
+[build-system]
+
+requires = [
+    "wheel",
+    "setuptools",
+    "cython>=0.29,<0.30",
+]
+
+[tool.black]
+line-length = 79
+target-version = ["py36"]
+include = '\.pyi?$'
+exclude = '''
+/(
+    thirdparty |
+    \.eggs |
+    \.git |
+    \.hg |
+    \.mypy_cache |
+    \.tox |
+    \.venv |
+    _build |
+    buck-out |
+    build |
+    dist
+)/
+'''
diff --git a/python/cudf/requirements/cuda-10.1/dev_requirements.txt b/python/cudf/requirements/cuda-10.1/dev_requirements.txt
new file mode 100644
index 00000000000..967974d38b5
--- /dev/null
+++ b/python/cudf/requirements/cuda-10.1/dev_requirements.txt
@@ -0,0 +1,41 @@
+# Copyright (c) 2021, NVIDIA CORPORATION.
+
+# pyarrow gpu package will have to be built from source :
+# https://arrow.apache.org/docs/python/install.html#installing-from-source
+
+cupy-cuda101
+cachetools
+cmake
+cmake-setuptools>=0.1.3
+cython>=0.29,<0.30
+dlpack
+fastavro>=0.22.9
+flatbuffers
+fsspec>=0.6.0
+hypothesis
+mimesis
+mypy==0.782
+nbsphinx
+numba>=0.49.0,!=0.51.0
+numpy
+numpydoc
+nvtx>=0.2.1
+packaging
+pandas>=1.0,<1.3.0dev0
+pandoc==2.0a4
+protobuf
+pyorc
+pytest
+pytest-benchmark
+pytest-xdist
+rapidjson
+recommonmark
+setuptools
+sphinx
+sphinx-copybutton
+sphinx-markdown-tables
+sphinx_rtd_theme
+sphinxcontrib-websupport
+typing_extensions
+typing_extensions
+wheel
\ No newline at end of file
diff --git a/python/cudf/requirements/cuda-10.2/dev_requirements.txt b/python/cudf/requirements/cuda-10.2/dev_requirements.txt
new file mode 100644
index 00000000000..34450456b5a
--- /dev/null
+++ b/python/cudf/requirements/cuda-10.2/dev_requirements.txt
@@ -0,0 +1,41 @@
+# Copyright (c) 2021, NVIDIA CORPORATION.
+
+# pyarrow gpu package will have to be built from source :
+# https://arrow.apache.org/docs/python/install.html#installing-from-source
+
+cupy-cuda102
+cachetools
+cmake
+cmake-setuptools>=0.1.3
+cython>=0.29,<0.30
+dlpack
+fastavro>=0.22.9
+flatbuffers
+fsspec>=0.6.0
+hypothesis
+mimesis
+mypy==0.782
+nbsphinx
+numba>=0.49.0,!=0.51.0
+numpy
+numpydoc
+nvtx>=0.2.1
+packaging
+pandas>=1.0,<1.3.0dev0
+pandoc==2.0a4
+protobuf
+pyorc
+pytest
+pytest-benchmark
+pytest-xdist
+rapidjson
+recommonmark
+setuptools
+sphinx
+sphinx-copybutton
+sphinx-markdown-tables
+sphinx_rtd_theme
+sphinxcontrib-websupport
+typing_extensions
+typing_extensions
+wheel
\ No newline at end of file
diff --git a/python/cudf/requirements/cuda-11.0/dev_requirements.txt b/python/cudf/requirements/cuda-11.0/dev_requirements.txt
new file mode 100644
index 00000000000..278b1a6bf61
--- /dev/null
+++ b/python/cudf/requirements/cuda-11.0/dev_requirements.txt
@@ -0,0 +1,41 @@
+# Copyright (c) 2021, NVIDIA CORPORATION.
+
+# pyarrow gpu package will have to be built from source :
+# https://arrow.apache.org/docs/python/install.html#installing-from-source
+
+cupy-cuda110
+cachetools
+cmake
+cmake-setuptools>=0.1.3
+cython>=0.29,<0.30
+dlpack
+fastavro>=0.22.9
+flatbuffers
+fsspec>=0.6.0
+hypothesis
+mimesis
+mypy==0.782
+nbsphinx
+numba>=0.49.0,!=0.51.0
+numpy
+numpydoc
+nvtx>=0.2.1
+packaging
+pandas>=1.0,<1.3.0dev0
+pandoc==2.0a4
+protobuf
+pyorc
+pytest
+pytest-benchmark
+pytest-xdist
+rapidjson
+recommonmark
+setuptools
+sphinx
+sphinx-copybutton
+sphinx-markdown-tables
+sphinx_rtd_theme
+sphinxcontrib-websupport
+typing_extensions
+typing_extensions
+wheel
\ No newline at end of file
diff --git a/python/cudf/requirements/cuda-11.1/dev_requirements.txt b/python/cudf/requirements/cuda-11.1/dev_requirements.txt
new file mode 100644
index 00000000000..fafdc7d7d4f
--- /dev/null
+++ b/python/cudf/requirements/cuda-11.1/dev_requirements.txt
@@ -0,0 +1,41 @@
+# Copyright (c) 2021, NVIDIA CORPORATION.
+
+# pyarrow gpu package will have to be built from source :
+# https://arrow.apache.org/docs/python/install.html#installing-from-source
+
+cupy-cuda111
+cachetools
+cmake
+cmake-setuptools>=0.1.3
+cython>=0.29,<0.30
+dlpack
+fastavro>=0.22.9
+flatbuffers
+fsspec>=0.6.0
+hypothesis
+mimesis
+mypy==0.782
+nbsphinx
+numba>=0.49.0,!=0.51.0
+numpy
+numpydoc
+nvtx>=0.2.1
+packaging
+pandas>=1.0,<1.3.0dev0
+pandoc==2.0a4
+protobuf
+pyorc
+pytest
+pytest-benchmark
+pytest-xdist
+rapidjson
+recommonmark
+setuptools
+sphinx
+sphinx-copybutton
+sphinx-markdown-tables
+sphinx_rtd_theme
+sphinxcontrib-websupport
+typing_extensions
+typing_extensions
+wheel
\ No newline at end of file
diff --git a/python/cudf/requirements/cuda-11.2/dev_requirements.txt b/python/cudf/requirements/cuda-11.2/dev_requirements.txt
new file mode 100644
index 00000000000..db434b7c8ec
--- /dev/null
+++ b/python/cudf/requirements/cuda-11.2/dev_requirements.txt
@@ -0,0 +1,41 @@
+# Copyright (c) 2021, NVIDIA CORPORATION.
+
+# pyarrow gpu package will have to be built from source :
+# https://arrow.apache.org/docs/python/install.html#installing-from-source
+
+cupy-cuda112
+cachetools
+cmake
+cmake-setuptools>=0.1.3
+cython>=0.29,<0.30
+dlpack
+fastavro>=0.22.9
+flatbuffers
+fsspec>=0.6.0
+hypothesis
+mimesis
+mypy==0.782
+nbsphinx
+numba>=0.49.0,!=0.51.0
+numpy
+numpydoc
+nvtx>=0.2.1
+packaging
+pandas>=1.0,<1.3.0dev0
+pandoc==2.0a4
+protobuf
+pyorc
+pytest
+pytest-benchmark
+pytest-xdist
+rapidjson
+recommonmark
+setuptools
+sphinx
+sphinx-copybutton
+sphinx-markdown-tables
+sphinx_rtd_theme
+sphinxcontrib-websupport
+typing_extensions
+typing_extensions
+wheel
\ No newline at end of file
diff --git a/python/cudf/setup.cfg b/python/cudf/setup.cfg
index 3067d2daafd..697a689801f 100644
--- a/python/cudf/setup.cfg
+++ b/python/cudf/setup.cfg
@@ -1,4 +1,4 @@
-# Copyright (c) 2018-2020, NVIDIA CORPORATION.
+# Copyright (c) 2018-2021, NVIDIA CORPORATION.
 
 # See the docstring in versioneer.py for instructions. Note that you must
 # re-run 'versioneer.py setup' after changing this section, and commit the
diff --git a/python/cudf/setup.py b/python/cudf/setup.py
index fdef940bc88..5d95516c0dd 100644
--- a/python/cudf/setup.py
+++ b/python/cudf/setup.py
@@ -1,5 +1,7 @@
-# Copyright (c) 2018-2020, NVIDIA CORPORATION.
+# Copyright (c) 2018-2021, NVIDIA CORPORATION.
+
 import os
+import re
 import shutil
 import subprocess
 import sys
@@ -16,10 +18,56 @@
 
 import versioneer
 
-install_requires = ["numba", "cython"]
+install_requires = [
+    "numba>=0.49.0,!=0.51.0",
+    "Cython>=0.29,<0.30",
+    "fastavro>=0.22.9",
+    "fsspec>=0.6.0",
+    "numpy",
+    "pandas>=1.0,<1.3.0dev0",
+    "typing_extensions",
+    "protobuf",
+    "nvtx>=0.2.1",
+    "cachetools",
+    "packaging",
+]
+
+extras_require = {
+    "test": [
+        "pytest",
+        "pytest-benchmark",
+        "pytest-xdist",
+        "hypothesis", "mimesis",
+        "pyorc",
+        "msgpack",
+    ]
+}
 
 cython_files = ["cudf/**/*.pyx"]
 
+
+def get_cuda_version_from_header(cuda_include_dir, delimiter=""):
+    """Return CUDA major and minor from cuda.h, e.g. "112" for 11.2."""
+    cuda_version = None
+
+    with open(
+        os.path.join(cuda_include_dir, "cuda.h"), "r", encoding="utf-8"
+    ) as f:
+        for line in f:
+            if re.search(r"#define CUDA_VERSION ", line) is not None:
+                cuda_version = line
+                break
+
+    if cuda_version is None:
+        raise TypeError("CUDA_VERSION not found in cuda.h")
+    cuda_version = int(cuda_version.split()[2])
+    return "%d%s%d" % (
+        cuda_version // 1000,
+        delimiter,
+        (cuda_version % 1000) // 10,
+    )
+
+
 CUDA_HOME = os.environ.get("CUDA_HOME", False)
 if not CUDA_HOME:
     path_to_cuda_gdb = shutil.which("cuda-gdb")
@@ -36,8 +84,25 @@
     raise OSError(f"Invalid CUDA_HOME: directory does not exist: {CUDA_HOME}")
 
 cuda_include_dir = os.path.join(CUDA_HOME, "include")
+cuda_lib_dir = os.path.join(CUDA_HOME, "lib64")
+install_requires.append(
+    "cupy-cuda" + get_cuda_version_from_header(cuda_include_dir)
+)
 
-CUDF_ROOT = os.environ.get("CUDF_ROOT", "../../cpp/build/")
+CUDF_HOME = os.environ.get(
+    "CUDF_HOME",
+    os.path.abspath(
+        os.path.join(os.path.dirname(os.path.abspath(__file__)), "../../")
+    ),
+)
+CUDF_ROOT = os.environ.get(
+    "CUDF_ROOT",
+    os.path.abspath(
+        os.path.join(
+            os.path.dirname(os.path.abspath(__file__)), "../../cpp/build/"
+        )
+    ),
+)
 
 try:
     nthreads = int(os.environ.get("PARALLEL_LEVEL", "0") or "0")
@@ -102,10 +167,11 @@ def run(self):
         "*",
         sources=cython_files,
         include_dirs=[
-            "../../cpp/include/cudf",
-            "../../cpp/include",
-            os.path.join(CUDF_ROOT, "include"),
+            os.path.abspath(os.path.join(CUDF_HOME, "cpp/include/cudf")),
+            os.path.abspath(os.path.join(CUDF_HOME, "cpp/include")),
+            os.path.abspath(os.path.join(CUDF_ROOT, "include")),
             os.path.join(CUDF_ROOT, "_deps/libcudacxx-src/include"),
+            os.path.join(CUDF_ROOT, "_deps/dlpack-src/include"),
             os.path.join(
                 os.path.dirname(sysconfig.get_path("include")),
                 "libcudf/libcudacxx",
@@ -117,9 +183,13 @@ def run(self):
         ],
         library_dirs=(
             pa.get_library_dirs()
-            + [get_python_lib(), os.path.join(os.sys.prefix, "lib")]
+            + [
+                get_python_lib(),
+                os.path.join(os.sys.prefix, "lib"),
+                cuda_lib_dir,
+            ]
         ),
-        libraries=["cudf"] + pa.get_libraries() + ["arrow_cuda"],
+        libraries=["cudart", "cudf"] + pa.get_libraries() + ["arrow_cuda"],
         language="c++",
         extra_compile_args=["-std=c++14"],
     )
@@ -138,8 +208,8 @@ def run(self):
         "Topic :: Scientific/Engineering",
         "License :: OSI Approved :: Apache Software License",
         "Programming Language :: Python",
-        "Programming Language :: Python :: 3.6",
         "Programming Language :: Python :: 3.7",
+        "Programming Language :: Python :: 3.8",
     ],
     # Include the separately-compiled shared library
     setup_requires=["cython", "protobuf"],
@@ -157,4 +227,5 @@ def run(self):
     cmdclass=cmdclass,
     install_requires=install_requires,
     zip_safe=False,
+    extras_require=extras_require,
 )
diff --git a/python/cudf_kafka/dev_requirements.txt b/python/cudf_kafka/dev_requirements.txt
new file mode 100644
index 00000000000..e7f0d8edf99
--- /dev/null
+++ b/python/cudf_kafka/dev_requirements.txt
@@ -0,0 +1,10 @@
+# Copyright (c) 2021, NVIDIA CORPORATION.
+
+flake8==3.8.3
+black==19.10b0
+isort==5.0.7
+python-confluent-kafka
+pytest
+setuptools
+wheel
+cython>=0.29,<0.30
\ No newline at end of file
diff --git a/python/cudf_kafka/pyproject.toml b/python/cudf_kafka/pyproject.toml
new file mode 100644
index 00000000000..9855188ac6c
--- /dev/null
+++ b/python/cudf_kafka/pyproject.toml
@@ -0,0 +1,30 @@
+# Copyright (c) 2021, NVIDIA CORPORATION.
+
+[build-system]
+
+requires = [
+    "wheel",
+    "setuptools",
+    "Cython>=0.29,<0.30",
+]
+
+
+[tool.black]
+line-length = 79
+target-version = ["py36"]
+include = '\.pyi?$'
+exclude = '''
+/(
+    thirdparty |
+    \.eggs |
+    \.git |
+    \.hg |
+    \.mypy_cache |
+    \.tox |
+    \.venv |
+    _build |
+    buck-out |
+    build |
+    dist
+)/
+'''
diff --git a/python/cudf_kafka/setup.py b/python/cudf_kafka/setup.py
index 290dcc036af..f7523dda503 100644
--- a/python/cudf_kafka/setup.py
+++ b/python/cudf_kafka/setup.py
@@ -32,7 +32,14 @@
 
 cuda_include_dir = os.path.join(CUDA_HOME, "include")
 
-CUDF_ROOT = os.environ.get("CUDF_ROOT", "../../cpp/build/")
+CUDF_ROOT = os.environ.get(
+    "CUDF_ROOT",
+    os.path.abspath(
+        os.path.join(
+            os.path.dirname(os.path.abspath(__file__)), "../../cpp/build/"
+        )
+    ),
+)
 CUDF_KAFKA_ROOT = os.environ.get(
     "CUDF_KAFKA_ROOT", "../../libcudf_kafka/build"
 )
@@ -47,9 +54,11 @@
         "*",
         sources=cython_files,
         include_dirs=[
-            "../../cpp/include/cudf",
-            "../../cpp/include",
-            "../../cpp/libcudf_kafka/include/cudf_kafka",
+            os.path.abspath(os.path.join(CUDF_ROOT, "../include/cudf")),
+            os.path.abspath(os.path.join(CUDF_ROOT, "../include")),
+            os.path.abspath(
+                os.path.join(CUDF_ROOT, "../libcudf_kafka/include/cudf_kafka")
+            ),
             os.path.join(CUDF_ROOT, "include"),
             os.path.join(CUDF_ROOT, "_deps/libcudacxx-src/include"),
             os.path.join(
@@ -81,11 +90,11 @@
         "Topic :: Apache Kafka",
         "License :: OSI Approved :: Apache Software License",
         "Programming Language :: Python",
-        "Programming Language :: Python :: 3.6",
         "Programming Language :: Python :: 3.7",
+        "Programming Language :: Python :: 3.8",
     ],
     # Include the separately-compiled shared library
-    setup_requires=["cython"],
+    setup_requires=["Cython>=0.29,<0.30"],
     ext_modules=cythonize(
         extensions,
         nthreads=nthreads,
@@ -99,5 +108,6 @@
     ),
     cmdclass=versioneer.get_cmdclass(),
     install_requires=install_requires,
+    extras_require={"test": ["pytest", "pytest-xdist"]},
     zip_safe=False,
 )
diff --git a/python/custreamz/dev_requirements.txt b/python/custreamz/dev_requirements.txt
new file mode 100644
index 00000000000..4234d7ee2ab
--- /dev/null
+++ b/python/custreamz/dev_requirements.txt
@@ -0,0 +1,12 @@
+# Copyright (c) 2021, NVIDIA CORPORATION.
+
+flake8==3.8.3
+black==19.10b0
+isort==5.0.7
+dask==2021.4.0
+distributed>=2.22.0,<=2021.4.0
+streamz
+python-confluent-kafka
+pytest
+setuptools
+wheel
diff --git a/python/custreamz/pyproject.toml b/python/custreamz/pyproject.toml
new file mode 100644
index 00000000000..dfe475a2e46
--- /dev/null
+++ b/python/custreamz/pyproject.toml
@@ -0,0 +1,28 @@
+# Copyright (c) 2021, NVIDIA CORPORATION.
+
+[build-system]
+
+requires = [
+    "wheel",
+    "setuptools",
+]
+
+[tool.black]
+line-length = 79
+target-version = ["py36"]
+include = '\.pyi?$'
+exclude = '''
+/(
+    thirdparty |
+    \.eggs |
+    \.git |
+    \.hg |
+    \.mypy_cache |
+    \.tox |
+    \.venv |
+    _build |
+    buck-out |
+    build |
+    dist
+)/
+'''
diff --git a/python/custreamz/setup.py b/python/custreamz/setup.py
index 976412f6e18..07a6b92f65d 100644
--- a/python/custreamz/setup.py
+++ b/python/custreamz/setup.py
@@ -20,11 +20,12 @@
         "Topic :: Apache Kafka",
         "License :: OSI Approved :: Apache Software License",
         "Programming Language :: Python",
-        "Programming Language :: Python :: 3.6",
         "Programming Language :: Python :: 3.7",
+        "Programming Language :: Python :: 3.8",
     ],
     packages=find_packages(include=["custreamz", "custreamz.*"]),
     cmdclass=versioneer.get_cmdclass(),
     install_requires=install_requires,
     zip_safe=False,
+    extras_require={"test": ["pytest", "pytest-xdist"]},
 )
diff --git a/python/dask_cudf/dask_cudf/backends.py b/python/dask_cudf/dask_cudf/backends.py
index 66b06acc858..0570654fde3 100644
--- a/python/dask_cudf/dask_cudf/backends.py
+++ b/python/dask_cudf/dask_cudf/backends.py
@@ -1,13 +1,10 @@
 # Copyright (c) 2020-2021, NVIDIA CORPORATION.
 
-from distutils.version import LooseVersion
-
 import cupy as cp
 import numpy as np
 import pandas as pd
 import pyarrow as pa
 
-import dask
 from dask.dataframe.categorical import categorical_dtype_dispatch
 from dask.dataframe.core import get_parallel_type, make_meta, meta_nonempty
 from dask.dataframe.methods import (
@@ -31,7 +28,6 @@
 get_parallel_type.register(cudf.DataFrame, lambda _: DataFrame)
 get_parallel_type.register(cudf.Series, lambda _: Series)
 get_parallel_type.register(cudf.Index, lambda _: Index)
-DASK_VERSION = LooseVersion(dask.__version__)
 
 
 @meta_nonempty.register(cudf.Index)
@@ -205,45 +201,26 @@ def make_meta_object(x, index=None):
     raise TypeError(f"Don't know how to create metadata from {x}")
 
 
-if DASK_VERSION > "2021.03.1":
-
-    @concat_dispatch.register((cudf.DataFrame, cudf.Series, cudf.Index))
-    def concat_cudf(
-        dfs,
-        axis=0,
-        join="outer",
-        uniform=False,
-        filter_warning=True,
-        sort=None,
-        ignore_index=False,
-        **kwargs,
-    ):
-        assert join == "outer"
-
-        ignore_order = kwargs.get("ignore_order", False)
-        if ignore_order:
-            raise NotImplementedError(
-                "ignore_order parameter is not yet supported in dask-cudf"
-            )
-
-        return cudf.concat(dfs, axis=axis, ignore_index=ignore_index)
-
-
-else:
-
-    @concat_dispatch.register((cudf.DataFrame, cudf.Series, cudf.Index))
-    def concat_cudf(
-        dfs,
-        axis=0,
-        join="outer",
-        uniform=False,
-        filter_warning=True,
-        sort=None,
-        ignore_index=False,
-    ):
-        assert join == "outer"
+@concat_dispatch.register((cudf.DataFrame, cudf.Series, cudf.Index))
+def concat_cudf(
+    dfs,
+    axis=0,
+    join="outer",
+    uniform=False,
+    filter_warning=True,
+    sort=None,
+    ignore_index=False,
+    **kwargs,
+):
+    assert join == "outer"
+
+    ignore_order = kwargs.get("ignore_order", False)
+    if ignore_order:
+        raise NotImplementedError(
+            "ignore_order parameter is not yet supported in dask-cudf"
+        )
 
-        return cudf.concat(dfs, axis=axis, ignore_index=ignore_index)
+    return cudf.concat(dfs, axis=axis, ignore_index=ignore_index)
 
 
 @categorical_dtype_dispatch.register((cudf.DataFrame, cudf.Series, cudf.Index))
diff --git a/python/dask_cudf/dev_requirements.txt b/python/dask_cudf/dev_requirements.txt
new file mode 100644
index 00000000000..c157c0be86f
--- /dev/null
+++ b/python/dask_cudf/dev_requirements.txt
@@ -0,0 +1,14 @@
+# Copyright (c) 2021, NVIDIA CORPORATION.
+
+dask==2021.4.0
+distributed>=2.22.0,<=2021.4.0
+fsspec>=0.6.0
+numba>=0.49.0,!=0.51.0
+numpy
+pandas>=1.0,<1.3.0dev0
+pytest
+setuptools
+wheel
+flake8==3.8.3
+black==19.10b0
+isort==5.0.7
diff --git a/python/dask_cudf/pyproject.toml b/python/dask_cudf/pyproject.toml
new file mode 100644
index 00000000000..dfe475a2e46
--- /dev/null
+++ b/python/dask_cudf/pyproject.toml
@@ -0,0 +1,28 @@
+# Copyright (c) 2021, NVIDIA CORPORATION.
+
+[build-system]
+
+requires = [
+    "wheel",
+    "setuptools",
+]
+
+[tool.black]
+line-length = 79
+target-version = ["py36"]
+include = '\.pyi?$'
+exclude = '''
+/(
+    thirdparty |
+    \.eggs |
+    \.git |
+    \.hg |
+    \.mypy_cache |
+    \.tox |
+    \.venv |
+    _build |
+    buck-out |
+    build |
+    dist
+)/
+'''
diff --git a/python/dask_cudf/setup.py b/python/dask_cudf/setup.py
index d4809ff8f34..f735d895095 100644
--- a/python/dask_cudf/setup.py
+++ b/python/dask_cudf/setup.py
@@ -1,9 +1,77 @@
-# Copyright (c) 2019, NVIDIA CORPORATION.
+# Copyright (c) 2019-2021, NVIDIA CORPORATION.
+
+import os
+import re
+import shutil
+
 from setuptools import find_packages, setup
 
 import versioneer
 
-install_requires = ["cudf", "dask", "distributed"]
+install_requires = [
+    "cudf",
+    "dask==2021.4.0",
+    "distributed>=2.22.0,<=2021.4.0",
+    "fsspec>=0.6.0",
+    "numpy",
+    "pandas>=1.0,<1.3.0dev0",
+]
+
+extras_require = {
+    "test": [
+        "numpy",
+        "pandas>=1.0,<1.3.0dev0",
+        "pytest",
+        "numba>=0.49.0,!=0.51.0",
+        "dask==2021.4.0",
+        "distributed>=2.22.0,<=2021.4.0",
+    ]
+}
+
+
+def get_cuda_version_from_header(cuda_include_dir, delimiter=""):
+    """Return CUDA major and minor from cuda.h, e.g. "112" for 11.2."""
+    cuda_version = None
+
+    with open(
+        os.path.join(cuda_include_dir, "cuda.h"), "r", encoding="utf-8"
+    ) as f:
+        for line in f:
+            if re.search(r"#define CUDA_VERSION ", line) is not None:
+                cuda_version = line
+                break
+
+    if cuda_version is None:
+        raise TypeError("CUDA_VERSION not found in cuda.h")
+    cuda_version = int(cuda_version.split()[2])
+    return "%d%s%d" % (
+        cuda_version // 1000,
+        delimiter,
+        (cuda_version % 1000) // 10,
+    )
+
+
+CUDA_HOME = os.environ.get("CUDA_HOME", False)
+if not CUDA_HOME:
+    path_to_cuda_gdb = shutil.which("cuda-gdb")
+    if path_to_cuda_gdb is None:
+        raise OSError(
+            "Could not locate CUDA. "
+            "Please set the environment variable "
+            "CUDA_HOME to the path to the CUDA installation "
+            "and try again."
+        )
+    CUDA_HOME = os.path.dirname(os.path.dirname(path_to_cuda_gdb))
+
+if not os.path.isdir(CUDA_HOME):
+    raise OSError(f"Invalid CUDA_HOME: directory does not exist: {CUDA_HOME}")
+
+cuda_include_dir = os.path.join(CUDA_HOME, "include")
+cupy_package_name = "cupy-cuda" + get_cuda_version_from_header(
+    cuda_include_dir
+)
+install_requires.append(cupy_package_name)
+
 
 setup(
     name="dask-cudf",
@@ -18,10 +86,11 @@
         "Topic :: Scientific/Engineering",
         "License :: OSI Approved :: Apache Software License",
         "Programming Language :: Python",
-        "Programming Language :: Python :: 3.6",
         "Programming Language :: Python :: 3.7",
+        "Programming Language :: Python :: 3.8",
     ],
     packages=find_packages(exclude=["tests", "tests.*"]),
     cmdclass=versioneer.get_cmdclass(),
     install_requires=install_requires,
+    extras_require=extras_require,
 )