From 7f2ca040b8b8f3fb67d0a849efc027837ec8548a Mon Sep 17 00:00:00 2001
From: Wei Kang
Date: Fri, 15 Apr 2022 06:52:41 +0800
Subject: [PATCH] update v2.0-pre (#953)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* Update doc URL. (#821)
* Support indexing 2-axes RaggedTensor, Support slicing for RaggedTensor (#825)
* Support index 2-axes RaggedTensor, Support slicing for RaggedTensor
* Fix compiling errors
* Fix unit test
* Change RaggedTensor.data to RaggedTensor.values
* Fix style
* Add docs
* Run nightly-cpu when pushing code to nightly-cpu branch
* Prune with max_arcs in IntersectDense (#820)
* Add checking for array constructor
* Prune with max arcs
* Minor fix
* Fix typo
* Fix review comments
* Fix typo
* Release v1.8
* Create a ragged tensor from a regular tensor. (#827)
* Create a ragged tensor from a regular tensor.
* Add tests for creating ragged tensors from regular tensors.
* Add more tests.
* Print ragged tensors in a way like what PyTorch is doing.
* Fix test cases.
* Trigger GitHub actions manually. (#829)
* Run GitHub actions on merging. (#830)
* Support printing ragged tensors in a more compact way. (#831)
* Support printing ragged tensors in a more compact way.
* Disable support for torch 1.3.1
* Fix test failures.
* Add levenshtein alignment (#828)
* Add levenshtein graph
* Construct k2.RaggedTensor in python part
* Fix review comments, return aux_labels in ctc_graph
* Fix tests
* Fix bug of accessing symbols
* Fix bug of accessing symbols
* Change argument name, add levenshtein_distance interface
* Fix test error, add tests for levenshtein_distance
* Fix review comments and add unit test for c++ side
* Update the interface of levenshtein alignment
* Fix review comments
* Release v1.9
* Support a[b[i]] where both a and b are ragged tensors. (#833)
* Display import error solution message on MacOS (#837)
* Fix installation doc. (#841)
* Fix installation doc. Remove Windows support. Will fix it later.
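The levenshtein alignment items above (#828) refer to k2's FSA-based implementation; as a rough illustration of what `levenshtein_distance` computes, here is a plain-Python dynamic-programming sketch. This is illustrative only — k2 computes the distance via graph intersection, not this loop:

```python
def levenshtein_distance(a, b):
    # Classic DP: d[i][j] = edit distance between a[:i] and b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # i deletions to turn a[:i] into ""
    for j in range(len(b) + 1):
        d[0][j] = j  # j insertions to turn "" into b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match / substitution
    return d[len(a)][len(b)]
```

The FSA formulation produces the same distances but additionally yields the alignment (which symbols were inserted, deleted, or substituted) as a path through the intersected graph.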
* Fix style issues.
* fix typos in the install instructions (#844)
* make cmake adhere to the modernized way of finding packages outside default dirs (#845)
* import torch first in the smoke tests to prevent SEGFAULT (#846)
* Add doc about how to install a CPU version of k2. (#850)
* Add doc about how to install a CPU version of k2.
* Remove property setter of Fsa.labels
* Update Ubuntu version in GitHub CI since 16.04 reaches end-of-life.
* Support PyTorch 1.10. (#851)
* Fix test cases for k2.union() (#853)
* Fix out-of-boundary access (read). (#859)
* Update all the example codes in the docs (#861)
  I have run all the modified codes with the newest version of k2.
* do some changes
* Fix compilation errors with CUB 1.15. (#865)
* Update README. (#873)
* Fix typos.
* Fix ctc graph (make aux_labels of final arcs -1) (#877)
* Fix LICENSE location to k2 folder (#880)
* Release v1.11. (#881)
  It contains bugfixes.
* Update documentation for hash.h (#887)
* Typo fix
* Wrap MonotonicLowerBound (#883)
* Add unit tests
* Support int64; update documents
* Remove extra commas after 'TOPSORTED' property and fix RaggedTensor constructor parameter 'byte_offset' out-of-range bug. (#892)
  Co-authored-by: gzchenduisheng
* Fix small typos (#896)
* Fix k2.ragged.create_ragged_shape2 (#901)
  Before the fix, we have to specify both `row_splits` and `row_ids` while calling `k2.create_ragged_shape2` even if one of them is `None`. After this fix, we only need to specify one of them.
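The `create_ragged_shape2` fix above works because either of `row_splits` or `row_ids` fully determines the other (given the number of rows and elements), so requiring both was redundant. A minimal pure-Python sketch of that equivalence — illustrative only, not k2 code:

```python
def row_splits_to_row_ids(row_splits):
    # row_splits[r]..row_splits[r+1] delimit row r, so element i
    # belongs to the row whose span contains i.
    row_ids = []
    for row in range(len(row_splits) - 1):
        row_ids.extend([row] * (row_splits[row + 1] - row_splits[row]))
    return row_ids

def row_ids_to_row_splits(row_ids, num_rows):
    # Count elements per row, then take an exclusive prefix sum.
    splits = [0] * (num_rows + 1)
    for r in row_ids:
        splits[r + 1] += 1
    for i in range(num_rows):
        splits[i + 1] += splits[i]
    return splits
```

For example, `row_splits = [0, 2, 3, 3, 6]` describes rows of sizes 2, 1, 0, 3, and converting it to `row_ids` and back is lossless — which is why specifying one of them is enough.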
* Add rnnt loss (#891)
* Add cpp code of mutual information
* mutual information working
* Add rnnt loss
* Add pruned rnnt loss
* Minor Fixes
* Minor fixes & fix code style
* Fix cpp style
* Fix code style
* Fix s_begin values in padding positions
* Fix bugs related to boundary; Fix s_begin padding value; Add more tests
* Minor fixes
* Fix comments
* Add boundary to pruned loss tests
* Use a more efficient way to fix boundaries (#906)
* Release v1.12 (#907)
* Change the sign of the rnnt_loss and add reduction argument (#911)
* Add right boundary constraints for s_begin
* Minor fixes to the interface of rnnt_loss to make it return positive value
* Fix comments
* Release a new version
* Minor fixes
* Minor fixes to the docs
* Fix building doc. (#908)
* Fix building doc.
* Minor fixes.
* Minor fixes.
* Fix building doc (#912)
* Fix building doc
* Fix flake8
* Support torch 1.10.x (#914)
* Support torch 1.10.x
* Fix installing PyTorch.
* Update INSTALL.rst (#915)
  Setting a few additional env variables to enable compilation from source *with CUDA GPU computation support enabled*
* Fix torch/cuda/python versions in the doc. (#918)
* Fix torch/cuda/python versions in the doc.
* Minor fixes.
* Fix building for CUDA 11.6 (#917)
* Fix building for CUDA 11.6
* Minor fixes.
* Implement Unstack (#920)
* Implement unstack
* Remove code that does not relate to this PR
* Remove for loop on output dim; add Unstack ragged
* Add more docs
* Fix comments
* Fix docs & unit tests
* SubsetRagged & PruneRagged (#919)
* Extend interface of SubsampleRagged.
* Add interface for pruning ragged tensor.
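The `reduction` argument added to `rnnt_loss` in #911 follows the usual PyTorch loss convention: per-utterance losses are returned as-is, summed, or averaged over the batch. A hypothetical sketch of that convention (not the actual k2 implementation, which operates on tensors):

```python
def apply_reduction(per_utt_losses, reduction="mean"):
    # Mirrors the common PyTorch convention: 'none' | 'sum' | 'mean'.
    if reduction == "none":
        return per_utt_losses          # one loss value per utterance
    total = sum(per_utt_losses)
    if reduction == "sum":
        return total                   # summed over the batch
    if reduction == "mean":
        return total / len(per_utt_losses)  # averaged over the batch
    raise ValueError(f"unsupported reduction: {reduction!r}")
```

Together with the sign change in the same PR, this lets the loss be returned as a positive value ready to minimize directly.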
* Draft of new RNN-T decoding method
* Implements SubsampleRaggedShape
* Implements PruneRagged
* Rename subsample -> subset
* Minor fixes
* Fix comments
  Co-authored-by: Daniel Povey
* Add Hash64 (#895)
* Add hash64
* Fix tests
* Resize hash64
* Fix comments
* fix typo
* Modified rnnt (#902)
* Add modified mutual_information_recursion
* Add modified rnnt loss
* Use a more efficient way to fix boundaries
* Fix modified pruned rnnt loss
* Fix the s_begin constraints of pruned loss for the modified version of the transducer
* Fix Stack (#925)
* return the correct layer
* unskip the test
* Fix 'TypeError' of rnnt_loss_pruned function. (#924)
* Fix 'TypeError' of rnnt_loss_simple function.
  Fix the 'TypeError' exception raised when calling rnnt_loss_simple(..., return_grad=False) at validation steps.
* Fix 'MutualInformationRecursionFunction.forward()' return type check error for pytorch < 1.10.x
* Modify return type.
* Add documents about class MutualInformationRecursionFunction.
* Formatted code style.
* Fix rnnt_loss_smoothed return type.
  Co-authored-by: gzchenduisheng
* Support torch 1.11.0 and CUDA 11.5 (#931)
* Support torch 1.11.0 and CUDA 11.5
* Implement Rnnt decoding (#926)
* first working draft of rnnt decoding
* FormatOutput works...
* Different num frames for FormatOutput works
* Update docs
* Fix comments, break advance into several stages, add more docs
* Add python wrapper
* Add more docs
* Minor fixes
* Fix comments
* fix building docs (#933)
* Release v1.14
* Remove unused DiscountedCumSum. (#936)
* Fix compiler warnings. (#937)
* Fix compiler warnings.
* Minor fixes for RNN-T decoding. (#938)
* Minor fixes for RNN-T decoding.
* Removes arcs with label 0 from the TrivialGraph. (#939)
* Implement linear_fsa_with_self_loops. (#940)
* Implement linear_fsa_with_self_loops.
* Fix the pruning with max-states (#941)
* Rnnt allow different encoder/decoder dims (#945)
* Allow different encoder and decoder dim in rnnt_pruning
* Bug fixes
* Support building k2 on Windows (#946)
* Fix nightly windows CPU build (#948)
* Fix nightly building k2 for windows.
* Run nightly build only if there are new commits.
* Check the versions of PyTorch and CUDA at the import time. (#949)
* Check the versions of PyTorch and CUDA at the import time.
* More straightforward message when CUDA support is missing (#950)
* Implement ArrayOfRagged (#927)
* Implement ArrayOfRagged
* Fix issues and pass tests
* fix style
* change a few statements of functions and move the definition of template Array1OfRagged to a header file
* add offsets test code
* Fix precision (#951)
* Fix precision
* Use different pow versions for windows and *nix
* Use int64_t pow
* Minor fixes

Co-authored-by: Fangjun Kuang
Co-authored-by: Piotr Żelasko
Co-authored-by: Jan "yenda" Trmal
Co-authored-by: Mingshuang Luo <37799481+luomingshuang@users.noreply.github.com>
Co-authored-by: Ludwig Kürzinger
Co-authored-by: Daniel Povey
Co-authored-by: drawfish
Co-authored-by: gzchenduisheng
Co-authored-by: alexei-v-ivanov
Co-authored-by: Wang, Guanbo
Co-authored-by: Nickolay V. Shmyrev
Co-authored-by: LvHang
---
 .flake8                                       |   1 +
 .github/workflows/build-conda-cpu-macos.yml   | 117 ++++++++
 ...nda_cpu.yml => build-conda-cpu-ubuntu.yml} |  89 ++----
 ...-conda.yml => build-conda-cpu-windows.yml} |  42 +--
 .github/workflows/build-conda-cuda-ubuntu.yml | 127 ++++++++
 .github/workflows/build-cpu-macos.yml         | 123 ++++++++
 .../{build-cpu.yml => build-cpu-ubuntu.yml}   | 100 +++----
 .../{windows.yml => build-cpu-windows.yml}    |  69 +++--
 .github/workflows/build-cuda-ubuntu.yml       | 147 ++++++++++
 .github/workflows/build.yml                   | 275 ------------------
 .github/workflows/build_conda.yml             | 259 -----------------
 .github/workflows/nightly-cpu-macos.yml       | 129 ++++++++
 ...nightly-cpu.yml => nightly-cpu-ubuntu.yml} | 110 +++----
 ...ly-windows.yml => nightly-cpu-windows.yml} | 115 +++++---
 .../{nightly.yml => nightly-cuda-ubuntu.yml}  |  27 +-
 .github/workflows/run-tests.yml               |   8 +-
 ...eel-cpu-stable.yml => wheel-cpu-macos.yml} |   8 +-
 .../{wheel-cpu.yml => wheel-cpu-windows.yml}  |  31 +-
 ...wheel-stable.yml => wheel-cuda-ubuntu.yml} |   9 +-
 .gitignore                                    |   1 +
 CMakeLists.txt                                | 100 +++++--
 cmake/moderngpu.cmake                         |   6 +-
 docs/source/installation/conda.rst            |   2 +-
 docs/source/installation/for_developers.rst   |   4 +
 docs/source/installation/from_source.rst      |   6 +-
 docs/source/installation/images/README.md     |   2 +-
 .../images/torch_ge_1.6.0-green.svg           |   1 +
 docs/source/installation/index.rst            |   6 +-
 docs/source/installation/pip.rst              |   2 +-
 k2/csrc/CMakeLists.txt                        |  37 ++-
 k2/csrc/array_of_ragged.cu                    | 104 ++++++-
 k2/csrc/array_of_ragged.h                     | 119 +++++---
 k2/csrc/array_of_ragged_test.cu               |  34 +++
 k2/csrc/benchmark/CMakeLists.txt              |   1 +
 k2/csrc/fsa.h                                 |   2 +-
 k2/csrc/host/CMakeLists.txt                   |  19 +-
 k2/csrc/log.h                                 |  45 +--
 k2/csrc/log_test.cu                           |   4 +
 k2/csrc/macros_test.cu                        |   4 +-
 k2/csrc/math.h                                |  24 +-
 k2/csrc/ragged_ops.cu                         |  58 +++-
 k2/csrc/ragged_test.cu                        |  11 -
 k2/csrc/rand_test.cu                          |   2 +-
 k2/csrc/rm_epsilon.cu                         |   8 +-
 k2/csrc/rnnt_decode.cu                        |   6 +-
 k2/csrc/tensor_ops.cu                         |  57 ++--
 k2/csrc/tensor_ops_test.cu                    |   6 +-
 k2/csrc/test_utils.h                          |   7 +-
 k2/csrc/version.h.in                          |   6 +-
 k2/python/csrc/CMakeLists.txt                 |  13 +-
 k2/python/csrc/torch.h                        |  32 --
 k2/python/csrc/torch/fsa.cu                   |   4 +-
 k2/python/csrc/torch/fsa_algo.cu              |  85 ++----
 k2/python/csrc/torch/ragged_ops.cu            |   7 +-
 k2/python/csrc/torch/v2/any.cu                |  64 ++--
 k2/python/csrc/torch/v2/ragged_shape.cu       |  10 +-
 k2/python/host/k2host/fsa.py                  |   4 +-
 k2/python/k2/__init__.py                      |  21 ++
 k2/python/k2/rnnt_decode.py                   |   2 +-
 k2/python/k2/rnnt_loss.py                     |  30 +-
 k2/python/k2/torch_version.py.in              |  17 ++
 .../tests/linear_fsa_with_self_loops_test.py  |   2 +-
 k2/python/tests/mutual_information_test.py    |   6 +-
 .../github_actions/generate_build_matrix.py   | 111 +++++++
 scripts/github_actions/run-nightly-build.py   |  35 +++
 65 files changed, 1706 insertions(+), 1207 deletions(-)
 create mode 100644 .github/workflows/build-conda-cpu-macos.yml
 rename .github/workflows/{build_conda_cpu.yml => build-conda-cpu-ubuntu.yml} (52%)
 rename .github/workflows/{windows-conda.yml => build-conda-cpu-windows.yml} (77%)
 create mode 100644 .github/workflows/build-conda-cuda-ubuntu.yml
 create mode 100644 .github/workflows/build-cpu-macos.yml
 rename .github/workflows/{build-cpu.yml => build-cpu-ubuntu.yml} (55%)
 rename .github/workflows/{windows.yml => build-cpu-windows.yml} (72%)
 create mode 100644 .github/workflows/build-cuda-ubuntu.yml
 delete mode 100644 .github/workflows/build.yml
 delete mode 100644 .github/workflows/build_conda.yml
 create mode 100644 .github/workflows/nightly-cpu-macos.yml
 rename .github/workflows/{nightly-cpu.yml => nightly-cpu-ubuntu.yml} (54%)
 rename .github/workflows/{nightly-windows.yml => nightly-cpu-windows.yml} (61%)
 rename .github/workflows/{nightly.yml => nightly-cuda-ubuntu.yml} (80%)
 rename .github/workflows/{wheel-cpu-stable.yml => wheel-cpu-macos.yml} (93%)
 rename .github/workflows/{wheel-cpu.yml => wheel-cpu-windows.yml} (63%)
 rename .github/workflows/{wheel-stable.yml => wheel-cuda-ubuntu.yml} (95%)
 create mode 100644 docs/source/installation/images/torch_ge_1.6.0-green.svg
 create mode 100644 k2/python/k2/torch_version.py.in
 create mode 100755 scripts/github_actions/generate_build_matrix.py
 create mode 100755 scripts/github_actions/run-nightly-build.py

diff --git a/.flake8 b/.flake8
index 71dca8579..bba1867c0 100644
--- a/.flake8
+++ b/.flake8
@@ -14,6 +14,7 @@ exclude =
   get_version.py
   build,
   k2/python/host,
+  k2/python/k2/__init__.py,
   k2/python/k2/ctc_loss.py,
   docs
diff --git a/.github/workflows/build-conda-cpu-macos.yml b/.github/workflows/build-conda-cpu-macos.yml
new file mode 100644
index 000000000..623d3a472
--- /dev/null
+++ b/.github/workflows/build-conda-cpu-macos.yml
@@ -0,0 +1,117 @@
+# Copyright 2021 Xiaomi Corp. (author: Fangjun Kuang)
+
+# See ../../LICENSE for clarification regarding multiple authors
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# refer to https://github.com/actions/starter-workflows/pull/47/files
+
+
+# Note, we have to set
+#
+# export DYLD_LIBRARY_PATH=$CONDA_PREFIX/lib/python3.8/site-packages:$DYLD_LIBRARY_PATH
+#
+# before running `python3 -m k2.version`
+#
+# See https://github.com/openPMD/openPMD-api/issues/593#issuecomment-552690470
+
+
+name: build_conda_cpu_macos
+
+on:
+  push:
+    tags:
+      - '*'
+
+env:
+  K2_BUILD_TYPE: Release
+
+jobs:
+  generate_build_matrix:
+    # see https://github.com/pytorch/pytorch/pull/50633
+    runs-on: ubuntu-latest
+    outputs:
+      matrix: ${{ steps.set-matrix.outputs.matrix }}
+    steps:
+      - uses: actions/checkout@v2
+        with:
+          fetch-depth: 0
+      - name: Generating build matrix
+        id: set-matrix
+        run: |
+          # outputting for debugging purposes
+          python scripts/github_actions/generate_build_matrix.py
+          MATRIX=$(python scripts/github_actions/generate_build_matrix.py)
+          echo "::set-output name=matrix::${MATRIX}"
+
+  build_conda_cpu_macos:
+    needs: generate_build_matrix
+    runs-on: macos-10.15
+    strategy:
+      fail-fast: false
+      matrix:
+        ${{ fromJson(needs.generate_build_matrix.outputs.matrix) }}
+
+    steps:
+      # refer to https://github.com/actions/checkout
+      - uses: actions/checkout@v2
+        with:
+          fetch-depth: 0
+
+      - uses: conda-incubator/setup-miniconda@v2
+        with:
+          auto-update-conda: true
+          python-version: ${{ matrix.python-version }}
+          activate-environment: k2
+
+      - name: Display Python version
+        shell: bash -l {0}
+        run: |
+          python3 -c "import sys; print(sys.version)"
+          which python3
+
+      - name: Install conda dependencies
+        shell: bash -l {0}
+        run: |
+          conda install -y -q anaconda-client
+          conda install -y -q conda-build
+          conda install -y -q -c pytorch pytorch=${{ matrix.torch }} cpuonly
+
+      - name: Display conda info
+        shell: bash -l {0}
+        run: |
+          which conda
+          conda env list
+          conda info
+
+      - name: Build k2
+        shell: bash -l {0}
+        env:
+          K2_PYTHON_VERSION: ${{ matrix.python-version}}
+          K2_TORCH_VERSION: ${{ matrix.torch }}
+          K2_CONDA_TOKEN: ${{ secrets.K2_CONDA_TOKEN}}
+          K2_IS_GITHUB_ACTIONS: 1
+          K2_IS_FOR_CONDA: 1
+        run: |
+          export K2_BUILD_TYPE=$K2_BUILD_TYPE
+          ./scripts/build_conda_cpu.sh
+
+      - name: Display generated files
+        run: |
+          ls -lh /usr/local/miniconda/envs/k2/conda-bld/osx-64
+
+      - name: Upload generated files
+        uses: actions/upload-artifact@v2
+        with:
+          name: torch-${{ matrix.torch }}-python-${{ matrix.python-version }}-${{ matrix.os }}
+          path: /usr/local/miniconda/envs/k2/conda-bld/osx-64/*.tar.bz2
diff --git a/.github/workflows/build_conda_cpu.yml b/.github/workflows/build-conda-cpu-ubuntu.yml
similarity index 52%
rename from .github/workflows/build_conda_cpu.yml
rename to .github/workflows/build-conda-cpu-ubuntu.yml
index fe3e552ab..72cf5b412 100644
--- a/.github/workflows/build_conda_cpu.yml
+++ b/.github/workflows/build-conda-cpu-ubuntu.yml
@@ -26,68 +26,41 @@
 # See https://github.com/openPMD/openPMD-api/issues/593#issuecomment-552690470
 
-name: build_conda_cpu
+name: build_conda_cpu_ubuntu
 
 on:
   push:
-    branches:
-      - conda-cpu
+    tags:
+      - '*'
 
 env:
   K2_BUILD_TYPE: Release
 
 jobs:
-  build_conda_cpu:
-    runs-on: ${{ matrix.os }}
+  generate_build_matrix:
+    # see https://github.com/pytorch/pytorch/pull/50633
+    runs-on: ubuntu-latest
+    outputs:
+      matrix: ${{ steps.set-matrix.outputs.matrix }}
+    steps:
+      - uses: actions/checkout@v2
+        with:
+          fetch-depth: 0
+      - name: Generating build matrix
+        id: set-matrix
+        run: |
+          # outputting for debugging purposes
+          python scripts/github_actions/generate_build_matrix.py
+          MATRIX=$(python scripts/github_actions/generate_build_matrix.py)
+          echo "::set-output name=matrix::${MATRIX}"
+
+  build_conda_cpu_ubuntu:
+    needs: generate_build_matrix
+    runs-on: ubuntu-18.04
     strategy:
       fail-fast: false
       matrix:
-        os: [ubuntu-18.04, macos-10.15]
-        python-version: ["3.6", "3.7", "3.8", "3.9", "3.10"]
-        # from https://download.pytorch.org/whl/torch_stable.html
-        #
-        # PyTorch 1.11.x supports 3.7, 3.8, 3.9, 3.10
-        # PyTorch 1.10, 1.9.x, 1.8.x, and 1.7.1 support 3.6, 3.7, 3.8, 3.9
-        # PyTorch 1.7.0, 1.6.0, and 1.5.x support 3.6, 3.7, 3.8
-        #
-        # Other PyTorch versions are not tested
-        #
-        torch: ["1.5.0", "1.5.1", "1.6.0", "1.7.0", "1.7.1", "1.8.0", "1.8.1", "1.9.0", "1.9.1", "1.10.0", "1.10.1", "1.10.2", "1.11.0"]
-        exclude:
-          - python-version: "3.9" # exclude Python 3.9 for [1.5.0, 1.5.1, 1.6.0, 1.7.0]
-            torch: "1.5.0"
-          - python-version: "3.9"
-            torch: "1.5.1"
-          - python-version: "3.9"
-            torch: "1.6.0"
-          - python-version: "3.9"
-            torch: "1.7.0"
-          - python-version: "3.10" # exclude Python 3.10 for [1.5.0, 1.5.1, 1.6.0, 1.7.0, 1.7.1, 1.8.0, 1.8.1, 1.9.0, 1.9.1, 1.10.0, 1.10.1, 1.10.2]
-            torch: "1.5.0"
-          - python-version: "3.10"
-            torch: "1.5.1"
-          - python-version: "3.10"
-            torch: "1.6.0"
-          - python-version: "3.10"
-            torch: "1.7.0"
-          - python-version: "3.10"
-            torch: "1.7.1"
-          - python-version: "3.10"
-            torch: "1.8.0"
-          - python-version: "3.10"
-            torch: "1.8.1"
-          - python-version: "3.10"
-            torch: "1.9.0"
-          - python-version: "3.10"
-            torch: "1.9.1"
-          - python-version: "3.10"
-            torch: "1.10.0"
-          - python-version: "3.10"
-            torch: "1.10.1"
-          - python-version: "3.10"
-            torch: "1.10.2"
-          - python-version: "3.6" # exclude Python 3.6 for [1.11.0]
-            torch: "1.11.0"
+        ${{ fromJson(needs.generate_build_matrix.outputs.matrix) }}
 
     steps:
       # refer to https://github.com/actions/checkout
@@ -134,25 +107,11 @@ jobs:
           ./scripts/build_conda_cpu.sh
 
       - name: Display generated files
-        if: startsWith(matrix.os, 'ubuntu')
         run: |
           ls -lh /usr/share/miniconda/envs/k2/conda-bld/linux-64
 
-      - name: Display generated files
-        if: startsWith(matrix.os, 'macos')
-        run: |
-          ls -lh /usr/local/miniconda/envs/k2/conda-bld/osx-64
-
       - name: Upload generated files
-        if: startsWith(matrix.os, 'ubuntu')
         uses: actions/upload-artifact@v2
         with:
           name: torch-${{ matrix.torch }}-python-${{ matrix.python-version }}-${{ matrix.os }}
           path: /usr/share/miniconda/envs/k2/conda-bld/linux-64/*.tar.bz2
-
-      - name: Upload generated files
-        if: startsWith(matrix.os, 'macos')
-        uses: actions/upload-artifact@v2
-        with:
-          name: torch-${{ matrix.torch }}-python-${{ matrix.python-version }}-${{ matrix.os }}
-          path: /usr/local/miniconda/envs/k2/conda-bld/osx-64/*.tar.bz2
diff --git a/.github/workflows/windows-conda.yml b/.github/workflows/build-conda-cpu-windows.yml
similarity index 77%
rename from .github/workflows/windows-conda.yml
rename to .github/workflows/build-conda-cpu-windows.yml
index 00bab9adc..551c13ce3 100644
--- a/.github/workflows/windows-conda.yml
+++ b/.github/workflows/build-conda-cpu-windows.yml
@@ -15,36 +15,42 @@
 # limitations under the License.
 
-name: build-windows-conda
+name: build_conda_cpu_windows
 
 on:
   push:
-    branches:
-      - conda-win
+    tags:
+      - '*'
 
 env:
   BUILD_TYPE: Release
 
 jobs:
-  build-windows-conda:
+  generate_build_matrix:
+    # see https://github.com/pytorch/pytorch/pull/50633
+    runs-on: ubuntu-latest
+    outputs:
+      matrix: ${{ steps.set-matrix.outputs.matrix }}
+    steps:
+      - uses: actions/checkout@v2
+        with:
+          fetch-depth: 0
+      - name: Generating build matrix
+        id: set-matrix
+        run: |
+          # outputting for debugging purposes
+          python scripts/github_actions/generate_build_matrix.py
+          MATRIX=$(python scripts/github_actions/generate_build_matrix.py)
+          echo "::set-output name=matrix::${MATRIX}"
+
+  build_conda_cpu_windows:
     # see https://github.com/actions/virtual-environments/blob/win19/20210525.0/images/win/Windows2019-Readme.md
-    runs-on: ${{ matrix.os }}
+    needs: generate_build_matrix
+    runs-on: windows-2019
     strategy:
       fail-fast: false
       matrix:
-        os: [windows-2019]
-        # Python 3.9 is for PyTorch 1.7.1, 1.8.x, 1.9.0,
-        python-version: [3.6, 3.7, 3.8, 3.9]
-        torch: ["1.5.0", "1.5.1", "1.6.0", "1.7.0", "1.7.1", "1.8.0", "1.8.1", "1.9.0"]
-        exclude:
-          - python-version: 3.9 # exclude Python 3.9 for [1.5.0, 1.5.1, 1.6.0, 1.7.0]
-            torch: "1.5.0"
-          - python-version: 3.9
-            torch: "1.5.1"
-          - python-version: 3.9
-            torch: "1.6.0"
-          - python-version: 3.9
-            torch: "1.7.0"
+        ${{ fromJson(needs.generate_build_matrix.outputs.matrix) }}
 
     steps:
       - uses: actions/checkout@v2
diff --git a/.github/workflows/build-conda-cuda-ubuntu.yml b/.github/workflows/build-conda-cuda-ubuntu.yml
new file mode 100644
index 000000000..fa6dca28e
--- /dev/null
+++ b/.github/workflows/build-conda-cuda-ubuntu.yml
@@ -0,0 +1,127 @@
+# Copyright 2021 Xiaomi Corp. (author: Fangjun Kuang)
+
+# See ../../LICENSE for clarification regarding multiple authors
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# refer to https://github.com/actions/starter-workflows/pull/47/files
+
+name: build_conda_cuda_ubuntu
+
+on:
+  push:
+    tags:
+      - '*'
+
+env:
+  K2_BUILD_TYPE: Release
+
+jobs:
+  generate_build_matrix:
+    # see https://github.com/pytorch/pytorch/pull/50633
+    runs-on: ubuntu-latest
+    outputs:
+      matrix: ${{ steps.set-matrix.outputs.matrix }}
+    steps:
+      - uses: actions/checkout@v2
+        with:
+          fetch-depth: 0
+      - name: Generating build matrix
+        id: set-matrix
+        run: |
+          # outputting for debugging purposes
+          python scripts/github_actions/generate_build_matrix.py
+          MATRIX=$(python scripts/github_actions/generate_build_matrix.py --enable-cuda)
+          echo "::set-output name=matrix::${MATRIX}"
+
+  build_conda_cuda_ubuntu:
+    needs: generate_build_matrix
+    runs-on: ubuntu-18.04
+    strategy:
+      fail-fast: false
+      matrix:
+        ${{ fromJson(needs.generate_build_matrix.outputs.matrix) }}
+
+    steps:
+      # refer to https://github.com/actions/checkout
+      - uses: actions/checkout@v2
+        with:
+          fetch-depth: 0
+
+      - name: Install CUDA Toolkit ${{ matrix.cuda }}
+        shell: bash -l {0}
+        env:
+          cuda: ${{ matrix.cuda }}
+        run: |
+          source ./scripts/github_actions/install_cuda.sh
+          echo "CUDA_HOME=${CUDA_HOME}" >> $GITHUB_ENV
+          echo "${CUDA_HOME}/bin" >> $GITHUB_PATH
+          echo "LD_LIBRARY_PATH=${CUDA_HOME}/lib:${CUDA_HOME}/lib64:${LD_LIBRARY_PATH}" >> $GITHUB_ENV
+
+      - name: Display NVCC version
+        shell: bash -l {0}
+        run: |
+          which nvcc
+          nvcc --version
+
+      - uses: conda-incubator/setup-miniconda@v2
+        with:
+          auto-update-conda: true
+          python-version: ${{ matrix.python-version }}
+          activate-environment: k2
+
+      - name: Display Python version
+        shell: bash -l {0}
+        run: |
+          python3 -c "import sys; print(sys.version)"
+          which python3
+
+      - name: Install conda dependencies
+        shell: bash -l {0}
+        run: |
+          conda install -y -q anaconda-client
+          conda install -y -q conda-build
+          conda install -y -q bs4 requests tqdm
+          conda install -y -q -c pytorch -c conda-forge pytorch=${{ matrix.torch }} cudatoolkit=${{ matrix.cuda }}
+
+      - name: Display conda info
+        shell: bash -l {0}
+        run: |
+          which conda
+          conda env list
+          conda info
+          nproc
+
+      - name: Install git lfs
+        run: |
+          sudo apt-get install -y git-lfs
+
+      - name: Download cudnn 8.0
+        shell: bash -l {0}
+        env:
+          cuda: ${{ matrix.cuda }}
+        run: |
+          ./scripts/github_actions/install_cudnn.sh
+
+      - name: Build k2
+        shell: bash -l {0}
+        env:
+          K2_CUDA_VERSION: ${{ matrix.cuda }}
+          K2_PYTHON_VERSION: ${{ matrix.python-version}}
+          K2_TORCH_VERSION: ${{ matrix.torch }}
+          K2_CONDA_TOKEN: ${{ secrets.K2_CONDA_TOKEN}}
+          K2_IS_GITHUB_ACTIONS: 1
+          K2_IS_FOR_CONDA: 1
+        run: |
+          export K2_BUILD_TYPE=$K2_BUILD_TYPE
+          ./scripts/build_conda.sh
diff --git a/.github/workflows/build-cpu-macos.yml b/.github/workflows/build-cpu-macos.yml
new file mode 100644
index 000000000..392683ba8
--- /dev/null
+++ b/.github/workflows/build-cpu-macos.yml
@@ -0,0 +1,123 @@
+# Copyright 2021 Fangjun Kuang (csukuangfj@gmail.com)
+
+# See ../../LICENSE for clarification regarding multiple authors
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# refer to https://github.com/actions/starter-workflows/pull/47/files
+
+name: build-cpu-macos
+
+on:
+  push:
+    branches:
+      - master
+  pull_request:
+    types: [labeled]
+
+env:
+  BUILD_TYPE: Release
+
+jobs:
+  generate_build_matrix:
+    # see https://github.com/pytorch/pytorch/pull/50633
+    runs-on: macos-latest
+    outputs:
+      matrix: ${{ steps.set-matrix.outputs.matrix }}
+    steps:
+      - uses: actions/checkout@v2
+        with:
+          fetch-depth: 0
+      - name: Generating build matrix
+        id: set-matrix
+        run: |
+          # outputting for debugging purposes
+          python scripts/github_actions/generate_build_matrix.py
+          MATRIX=$(python scripts/github_actions/generate_build_matrix.py)
+          echo "::set-output name=matrix::${MATRIX}"
+
+  build-cpu-macos:
+    if: github.event.label.name == 'ready' || github.event_name == 'push'
+    needs: generate_build_matrix
+    runs-on: macos-10.15
+    strategy:
+      fail-fast: false
+      matrix:
+        ${{ fromJson(needs.generate_build_matrix.outputs.matrix) }}
+
+    steps:
+      # refer to https://github.com/actions/checkout
+      - uses: actions/checkout@v2
+        with:
+          fetch-depth: 0
+
+      - uses: szenius/set-timezone@v1.0
+        with:
+          timezoneLinux: "Asia/Shanghai"
+
+      - name: Display date and time
+        run: date
+
+      - name: Display clang version
+        run: |
+          clang --version
+
+      - name: Setup Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v2
+        with:
+          python-version: ${{ matrix.python-version }}
+
+      - name: Display Python version
+        run: python -c "import sys; print(sys.version)"
+
+      - name: Install PyTorch ${{ matrix.torch }}
+        shell: bash
+        run: |
+          python3 -m pip install -qq --upgrade pip
+          python3 -m pip install -qq wheel twine dataclasses
+          python3 -m pip install -qq torch==${{ matrix.torch }}
+
+          python3 -c "import torch; print('torch version:', torch.__version__)"
+
+      - name: Build k2
+        shell: bash
+        run: |
+          pwd
+          mkdir build
+          cd build
+          cmake -DCMAKE_BUILD_TYPE=$BUILD_TYPE -DK2_WITH_CUDA=OFF ..
+          cat k2/csrc/version.h
+          cat CMakeCache.txt
+
+          make VERBOSE=1 -j2
+
+      - name: Run tests
+        shell: bash
+        run: |
+          cd build
+          ctest --output-on-failure
+
+      - name: Build wheel
+        shell: bash
+        run: |
+          export K2_CMAKE_ARGS="-DCMAKE_BUILD_TYPE=$BUILD_TYPE -DK2_WITH_CUDA=OFF"
+          export K2_MAKE_ARGS="-j2"
+          python3 setup.py bdist_wheel
+          ls -lh dist/
+          ls -lh build/*
+
+      - name: Upload Wheel
+        uses: actions/upload-artifact@v2
+        with:
+          name: torch-${{ matrix.torch }}-python-${{ matrix.python-version }}-macos-10.15-cpu
+          path: dist/*.whl
diff --git a/.github/workflows/build-cpu.yml b/.github/workflows/build-cpu-ubuntu.yml
similarity index 55%
rename from .github/workflows/build-cpu.yml
rename to .github/workflows/build-cpu-ubuntu.yml
index 2bb539ff9..3cd7ec443 100644
--- a/.github/workflows/build-cpu.yml
+++ b/.github/workflows/build-cpu-ubuntu.yml
@@ -16,13 +16,12 @@
 
 # refer to https://github.com/actions/starter-workflows/pull/47/files
 
-name: build-cpu
+name: build-cpu-ubuntu
 
 on:
   push:
     branches:
       - master
-      - v2.0-pre
   pull_request:
     types: [labeled]
 
@@ -30,52 +29,31 @@ env:
   BUILD_TYPE: Release
 
 jobs:
-  build-cpu:
+  generate_build_matrix:
+    # see https://github.com/pytorch/pytorch/pull/50633
+    runs-on: ubuntu-latest
+    outputs:
+      matrix: ${{ steps.set-matrix.outputs.matrix }}
+    steps:
+      - uses: actions/checkout@v2
+        with:
+          fetch-depth: 0
+      - name: Generating build matrix
+        id: set-matrix
+        run: |
+          # outputting for debugging purposes
+          python scripts/github_actions/generate_build_matrix.py
+          MATRIX=$(python scripts/github_actions/generate_build_matrix.py)
+          echo "::set-output name=matrix::${MATRIX}"
+
+  build-cpu-ubuntu:
     if: github.event.label.name == 'ready' || github.event_name == 'push'
-    runs-on: ${{ matrix.os }}
+    needs: generate_build_matrix
+    runs-on: ubuntu-18.04
     strategy:
       fail-fast: false
       matrix:
-        os: [ubuntu-18.04, macos-10.15]
-        torch: ["1.5.0", "1.5.1", "1.6.0", "1.7.0", "1.7.1", "1.8.0", "1.8.1", "1.9.0", "1.9.1", "1.10.0", "1.10.1", "1.10.2", "1.11.0"]
-        # Python 3.9 is for PyTorch 1.7.1, 1.8.x, 1.9.x, 1.10.x, 1.11.x
-        # Python 3.10 is for PyTorch 1.11.x
-        python-version: ["3.6", "3.7", "3.8", "3.9", "3.10"]
-        exclude:
-          - python-version: "3.10" # exclude Python 3.9 for [1.5.0, 1.5.1, 1.6.0, 1.7.0, 1.7.1, 1.8.0, 1.8.1, 1.9.0, 1.9.1, 1.10.0, 1.10.1, 1.10.2]
-            torch: "1.5.0"
-          - python-version: "3.10"
-            torch: "1.5.1"
-          - python-version: "3.10"
-            torch: "1.6.0"
-          - python-version: "3.10"
-            torch: "1.7.0"
-          - python-version: "3.10"
-            torch: "1.7.1"
-          - python-version: "3.10"
-            torch: "1.8.0"
-          - python-version: "3.10"
-            torch: "1.8.1"
-          - python-version: "3.10"
-            torch: "1.9.0"
-          - python-version: "3.10"
-            torch: "1.9.1"
-          - python-version: "3.10"
-            torch: "1.10.0"
-          - python-version: "3.10"
-            torch: "1.10.1"
-          - python-version: "3.10"
-            torch: "1.10.2"
-          - python-version: 3.9 # exclude Python 3.9 for [1.5.0, 1.5.1, 1.6.0, 1.7.0]
-            torch: "1.5.0"
-          - python-version: 3.9
-            torch: "1.5.1"
-          - python-version: 3.9
-            torch: "1.6.0"
-          - python-version: 3.9
-            torch: "1.7.0"
-          - python-version: 3.6 # exclude Python 3.6 for [1.11.0]
-            torch: "1.11.0"
+        ${{ fromJson(needs.generate_build_matrix.outputs.matrix) }}
 
     steps:
      # refer to https://github.com/actions/checkout
@@ -91,17 +69,11 @@ jobs:
         run: date
 
       - name: Install GCC 7
-        if: startsWith(matrix.os, 'ubuntu')
         run: |
           sudo apt-get install -y gcc-7 g++-7
           echo "CC=/usr/bin/gcc-7" >> $GITHUB_ENV
           echo "CXX=/usr/bin/g++-7" >> $GITHUB_ENV
 
-      - name: Display clang version
-        if: startsWith(matrix.os, 'macos')
-        run: |
-          clang --version
-
       - name: Setup Python ${{ matrix.python-version }}
         uses: actions/setup-python@v2
         with:
@@ -111,26 +83,15 @@ jobs:
         run: python -c "import sys; print(sys.version)"
 
       - name: Install PyTorch ${{ matrix.torch }}
-        if: startsWith(matrix.os, 'ubuntu')
         shell: bash
         run: |
           python3 -m pip install -qq --upgrade pip
-          python3 -m pip install -qq wheel twine typing_extensions
+          python3 -m pip install -qq wheel twine typing_extensions dataclasses
python3 -m pip install -qq torch==${{ matrix.torch }}+cpu -f https://download.pytorch.org/whl/torch_stable.html python3 -c "import torch; print('torch version:', torch.__version__)" - - name: Install PyTorch ${{ matrix.torch }} - if: startsWith(matrix.os, 'macos') - shell: bash - run: | - python3 -m pip install -qq --upgrade pip - python3 -m pip install -qq wheel twine - python3 -m pip install -qq torch==${{ matrix.torch }} - - python3 -c "import torch; print('torch version:', torch.__version__)" - - - name: Configure CMake + - name: Build k2 shell: bash run: | pwd @@ -138,8 +99,17 @@ jobs: cd build cmake -DCMAKE_BUILD_TYPE=$BUILD_TYPE -DK2_WITH_CUDA=OFF .. cat k2/csrc/version.h + cat CMakeCache.txt - - name: Build k2 + make VERBOSE=1 -j2 + + - name: Run tests + shell: bash + run: | + cd build + ctest --output-on-failure + + - name: Build wheel shell: bash run: | export K2_CMAKE_ARGS="-DCMAKE_BUILD_TYPE=$BUILD_TYPE -DK2_WITH_CUDA=OFF" @@ -151,5 +121,5 @@ jobs: - name: Upload Wheel uses: actions/upload-artifact@v2 with: - name: torch-${{ matrix.torch }}-python-${{ matrix.python-version }}-${{ matrix.os }}-cpu + name: torch-${{ matrix.torch }}-python-${{ matrix.python-version }}-ubuntu-18.04-cpu path: dist/*.whl diff --git a/.github/workflows/windows.yml b/.github/workflows/build-cpu-windows.yml similarity index 72% rename from .github/workflows/windows.yml rename to .github/workflows/build-cpu-windows.yml index 7890fb805..e622e9a37 100644 --- a/.github/workflows/windows.yml +++ b/.github/workflows/build-cpu-windows.yml @@ -15,7 +15,7 @@ # limitations under the License. 
-name: build-windows +name: build-cpu-windows on: push: @@ -28,26 +28,32 @@ env: BUILD_TYPE: Release jobs: - build-windows: + generate_build_matrix: + # see https://github.com/pytorch/pytorch/pull/50633 + runs-on: ubuntu-latest + outputs: + matrix: ${{ steps.set-matrix.outputs.matrix }} + steps: + - uses: actions/checkout@v2 + with: + fetch-depth: 0 + - name: Generating build matrix + id: set-matrix + run: | + # outputting for debugging purposes + python scripts/github_actions/generate_build_matrix.py + MATRIX=$(python scripts/github_actions/generate_build_matrix.py) + echo "::set-output name=matrix::${MATRIX}" + + build-cpu-windows: # see https://github.com/actions/virtual-environments/blob/win19/20210525.0/images/win/Windows2019-Readme.md if: github.event.label.name == 'ready' || github.event_name == 'push' - runs-on: ${{ matrix.os }} + needs: generate_build_matrix + runs-on: windows-2019 strategy: fail-fast: false matrix: - os: [windows-2019] - # Python 3.9 is for PyTorch 1.7.1, 1.8.x, 1.9.0 - python-version: [3.6, 3.7, 3.8, 3.9] - torch: ["1.5.0", "1.5.1", "1.6.0", "1.7.0", "1.7.1", "1.8.0", "1.8.1", "1.9.0"] - exclude: - - python-version: 3.9 # exclude Python 3.9 for [1.5.0, 1.5.1, 1.6.0, 1.7.0] - torch: "1.5.0" - - python-version: 3.9 - torch: "1.5.1" - - python-version: 3.9 - torch: "1.6.0" - - python-version: 3.9 - torch: "1.7.0" + ${{ fromJson(needs.generate_build_matrix.outputs.matrix) }} steps: - uses: actions/checkout@v2 @@ -68,8 +74,8 @@ jobs: - name: Install PyTorch ${{ matrix.torch }} run: | - pip3 install -qq torch==${{ matrix.torch }}+cpu -f https://download.pytorch.org/whl/torch_stable.html - pip3 install -qq wheel twine dataclasses numpy typing_extensions + pip3 install -qq torch==${{ matrix.torch }}+cpu -f https://download.pytorch.org/whl/torch_stable.html numpy + pip3 install -qq wheel twine dataclasses typing_extensions python3 -m torch.utils.collect_env @@ -85,18 +91,17 @@ jobs: cd build_release cmake 
-DCMAKE_BUILD_TYPE=$BUILD_TYPE -DK2_WITH_CUDA=OFF .. ls -lh + cat k2/csrc/version.h + cat CMakeCache.txt - name: Build k2 - run: | - cd build_release - cmake --build . --target _k2 --config Release - - - name: Display generated files shell: bash run: | cd build_release - ls -lh bin/*/* + cmake --build . --target _k2 --config Release -- -m + cmake --build . --target ALL_BUILD --config Release ls -lh lib/*/* + ls -lh bin/*/* - name: Build wheel shell: bash @@ -106,15 +111,15 @@ jobs: ls -lh dist/ pip install ./dist/*.whl - - name: Upload Wheel - uses: actions/upload-artifact@v2 - with: - name: torch-${{ matrix.torch }}-python-${{ matrix.python-version }}-${{ matrix.os }}-cpu - path: dist/*.whl - - - name: Run C++ tests + - name: Run tests + shell: bash run: | cd build_release - cmake --build . --target ALL_BUILD --config Release # disable python tests for k2host ctest -C Release --output-on-failure -E host + + - name: Upload Wheel + uses: actions/upload-artifact@v2 + with: + name: torch-${{ matrix.torch }}-python-${{ matrix.python-version }}-windows-2019-cpu + path: dist/*.whl diff --git a/.github/workflows/build-cuda-ubuntu.yml b/.github/workflows/build-cuda-ubuntu.yml new file mode 100644 index 000000000..adddf92e1 --- /dev/null +++ b/.github/workflows/build-cuda-ubuntu.yml @@ -0,0 +1,147 @@ +# Copyright 2020 Fangjun Kuang (csukuangfj@gmail.com) + +# See ../../LICENSE for clarification regarding multiple authors +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +# refer to https://github.com/actions/starter-workflows/pull/47/files + +name: build-cuda-ubuntu + +on: + push: + branches: + - master + pull_request: + types: [labeled] + +env: + BUILD_TYPE: Release + +jobs: + generate_build_matrix: + # see https://github.com/pytorch/pytorch/pull/50633 + runs-on: ubuntu-latest + outputs: + matrix: ${{ steps.set-matrix.outputs.matrix }} + steps: + - uses: actions/checkout@v2 + with: + fetch-depth: 0 + - name: Generating build matrix + id: set-matrix + run: | + # outputting for debugging purposes + python scripts/github_actions/generate_build_matrix.py --enable-cuda + MATRIX=$(python scripts/github_actions/generate_build_matrix.py --enable-cuda --test-only-latest-torch) + echo "::set-output name=matrix::${MATRIX}" + + build-cuda-ubuntu: + if: github.event.label.name == 'ready' || github.event_name == 'push' + needs: generate_build_matrix + runs-on: ubuntu-18.04 + strategy: + fail-fast: false + matrix: + ${{ fromJson(needs.generate_build_matrix.outputs.matrix) }} + + steps: + # refer to https://github.com/actions/checkout + - uses: actions/checkout@v2 + with: + fetch-depth: 0 + + - uses: szenius/set-timezone@v1.0 + with: + timezoneLinux: "Asia/Shanghai" + + - name: Display date and time + run: date + + - name: Install CUDA Toolkit ${{ matrix.cuda }} + env: + cuda: ${{ matrix.cuda }} + run: | + source ./scripts/github_actions/install_cuda.sh + echo "CUDA_HOME=${CUDA_HOME}" >> $GITHUB_ENV + echo "${CUDA_HOME}/bin" >> $GITHUB_PATH + echo "LD_LIBRARY_PATH=${CUDA_HOME}/lib:${CUDA_HOME}/lib64:${LD_LIBRARY_PATH}" >> $GITHUB_ENV + shell: bash + + - name: Display NVCC version + run: | + which nvcc + nvcc --version + + - name: Install GCC 7 + run: | + sudo apt-get install -y gcc-7 g++-7 + echo "CC=/usr/bin/gcc-7" >> $GITHUB_ENV + echo "CXX=/usr/bin/g++-7" >> $GITHUB_ENV + echo "CUDAHOSTCXX=/usr/bin/g++-7" >> $GITHUB_ENV + + - name: Install git lfs + run: | + sudo apt-get install -y git-lfs + + - name: Setup Python 
${{ matrix.python-version }} + uses: actions/setup-python@v2 + with: + python-version: ${{ matrix.python-version }} + + - name: Display Python version + run: python -c "import sys; print(sys.version)" + + - name: Install PyTorch ${{ matrix.torch }} + env: + cuda: ${{ matrix.cuda }} + torch: ${{ matrix.torch }} + shell: bash + run: | + python3 -m pip install -q --upgrade pip + python3 -m pip install -q wheel twine typing_extensions + python3 -m pip install -q bs4 requests tqdm + + ./scripts/github_actions/install_torch.sh + python3 -c "import torch; print('torch version:', torch.__version__)" + + - name: Download cudnn 8.0 + env: + cuda: ${{ matrix.cuda }} + run: | + ./scripts/github_actions/install_cudnn.sh + + - name: Configure CMake + shell: bash + run: | + pwd + mkdir build + cd build + cmake -DCMAKE_BUILD_TYPE=$BUILD_TYPE .. + cat k2/csrc/version.h + cat CMakeCache.txt + + - name: Build k2 + shell: bash + run: | + export K2_CMAKE_ARGS="-DCMAKE_BUILD_TYPE=$BUILD_TYPE" + export K2_MAKE_ARGS="-j2" + python3 setup.py bdist_wheel + ls -lh dist/ + ls -lh build/* + + - name: Upload Wheel + uses: actions/upload-artifact@v2 + with: + name: gcc-7-cuda-${{ matrix.cuda }}-torch-${{ matrix.torch }}-python-${{ matrix.python-version }} + path: dist/*.whl diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml deleted file mode 100644 index b0240c887..000000000 --- a/.github/workflows/build.yml +++ /dev/null @@ -1,275 +0,0 @@ -# Copyright 2020 Fangjun Kuang (csukuangfj@gmail.com) - -# See ../../LICENSE for clarification regarding multiple authors -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -# refer to https://github.com/actions/starter-workflows/pull/47/files - -name: build - -on: - push: - branches: - - master - - v2.0-pre - pull_request: - types: [labeled] - -env: - BUILD_TYPE: Release - -jobs: - build: - if: github.event.label.name == 'ready' || github.event_name == 'push' - runs-on: ${{ matrix.os }} - strategy: - fail-fast: false - matrix: - os: [ubuntu-18.04] - # from https://download.pytorch.org/whl/torch_stable.html - # Note: There are no torch versions for CUDA 11.2 - # - # 1.11.x supports: cuda10.2 (default), 11.3, 11.5 - # 1.10.x supports: cuda10.2 (default), 11.1, 11.3 - # 1.9.x supports: cuda10.2 (default), 11.1 - # PyTorch 1.8.x supports: cuda 10.1, 10.2 (default), 11.1 - # PyTorch 1.7.x supports: cuda 10.1, 10.2 (default), 11.0 - # PyTorch 1.6.0 supports: cuda 10.1, 10.2 (default) - # PyTorch 1.5.x supports: cuda 10.1, 10.2 (default) - # Other PyTorch versions are not tested - # CUDA 10.1 is for 1.5.x, 1.6.0, 1.7.x, 1.8.x - # CUDA 11.1 is for torch 1.8.x, 1.9.x, 1.10.x - # CUDA 11.3 is for torch 1.10.x, 1.11.x - # CUDA 11.5 is for torch 1.11.x - cuda: ["10.1", "10.2", "11.0", "11.1", "11.3", "11.5"] - gcc: ["7"] - torch: ["1.5.0", "1.5.1", "1.6.0", "1.7.0", "1.7.1", "1.8.0", "1.8.1", "1.9.0", "1.9.1", "1.10.0", "1.10.1", "1.10.2", "1.11.0"] - # - # torch 1.11.x does not support Python 3.6 - # From torch 1.11.x on, Python 3.10 is supported - # Python 3.9 is for PyTorch 1.7.1, 1.8.0, 1.8.1, 1.9.x, 1.10.x, 1.11.x - python-version: ["3.6", "3.7", "3.8", "3.9", "3.10"] - exclude: - - cuda: "11.5" # exclude 11.5 for [1.5.0, 1.5.1,
1.6.0, 1.7.0, 1.7.1, 1.8.0, 1.8.1, 1.9.0, 1.9.1, 1.10.0, 1.10.1, 1.10.2] - torch: "1.5.0" - - cuda: "11.5" - torch: "1.5.1" - - cuda: "11.5" - torch: "1.6.0" - - cuda: "11.5" - torch: "1.7.0" - - cuda: "11.5" - torch: "1.7.1" - - cuda: "11.5" - torch: "1.8.0" - - cuda: "11.5" - torch: "1.8.1" - - cuda: "11.5" - torch: "1.9.0" - - cuda: "11.5" - torch: "1.9.1" - - cuda: "11.5" - torch: "1.10.0" - - cuda: "11.5" - torch: "1.10.1" - - cuda: "11.5" - torch: "1.10.2" - - cuda: "11.3" # exclude 11.3 for [1.5.0, 1.5.1, 1.6.0, 1.7.0, 1.7.1, 1.8.0, 1.8.1, 1.9.0, 1.9.1] - torch: "1.5.0" - - cuda: "11.3" - torch: "1.5.1" - - cuda: "11.3" - torch: "1.6.0" - - cuda: "11.3" - torch: "1.7.0" - - cuda: "11.3" - torch: "1.7.1" - - cuda: "11.3" - torch: "1.8.0" - - cuda: "11.3" - torch: "1.8.1" - - cuda: "11.3" - torch: "1.9.0" - - cuda: "11.3" - torch: "1.9.1" - - cuda: "11.0" # exclude 11.0 for [1.5.0, 1.5.1, 1.6.0, 1.8.0, 1.8.1, 1.9.0, 1.9.1, 1.10.0, 1.10.1, 1.10.2, 1.11.0] - torch: "1.5.0" - - cuda: "11.0" - torch: "1.5.1" - - cuda: "11.0" - torch: "1.6.0" - - cuda: "11.0" - torch: "1.8.0" - - cuda: "11.0" - torch: "1.8.1" - - cuda: "11.0" - torch: "1.9.0" - - cuda: "11.0" - torch: "1.9.1" - - cuda: "11.0" - torch: "1.10.0" - - cuda: "11.0" - torch: "1.10.1" - - cuda: "11.0" - torch: "1.10.2" - - cuda: "11.0" - torch: "1.11.0" - - cuda: "11.1" # exclude 11.1 for [1.5.0, 1.5.1, 1.6.0, 1.7.0, 1.7.1, 1.11.0] - torch: "1.5.0" - - cuda: "11.1" - torch: "1.5.1" - - cuda: "11.1" - torch: "1.6.0" - - cuda: "11.1" - torch: "1.7.0" - - cuda: "11.1" - torch: "1.7.1" - - cuda: "11.1" - torch: "1.11.0" - - cuda: "10.1" # exclude CUDA 10.1 for [1.9.0, 1.9.1, 1.10.0, 1.10.1, 1.10.2, 1.11.0] - torch: "1.9.0" - - cuda: "10.1" - torch: "1.9.1" - - cuda: "10.1" - torch: "1.10.0" - - cuda: "10.1" - torch: "1.10.1" - - cuda: "10.1" - torch: "1.10.2" - - cuda: "10.1" - torch: "1.11.0" - - python-version: 3.9 # exclude Python 3.9 for [1.5.0, 1.5.1, 1.6.0, 1.7.0] - torch: "1.5.0" - - python-version: 3.9 - 
torch: "1.5.1" - - python-version: 3.9 - torch: "1.6.0" - - python-version: 3.9 - torch: "1.7.0" - - python-version: "3.10" # exclude Python 3.10 for [1.5.0, 1.5.1, 1.6.0, 1.7.0, 1.7.1, 1.8.0, 1.8.1, 1.9.0, 1.9.1, 1.10.0, 1.10.1, 1.10.2] - torch: "1.5.0" - - python-version: "3.10" - torch: "1.5.1" - - python-version: "3.10" - torch: "1.6.0" - - python-version: "3.10" - torch: "1.7.0" - - python-version: "3.10" - torch: "1.7.1" - - python-version: "3.10" - torch: "1.8.0" - - python-version: "3.10" - torch: "1.8.1" - - python-version: "3.10" - torch: "1.9.0" - - python-version: "3.10" - torch: "1.9.1" - - python-version: "3.10" - torch: "1.10.0" - - python-version: "3.10" - torch: "1.10.1" - - python-version: "3.10" - torch: "1.10.2" - - python-version: "3.6" # exclude Python 3.6 for [1.11.0] - torch: "1.11.0" - - steps: - # refer to https://github.com/actions/checkout - - uses: actions/checkout@v2 - with: - fetch-depth: 0 - - - uses: szenius/set-timezone@v1.0 - with: - timezoneLinux: "Asia/Shanghai" - - - name: Display date and time - run: date - - - name: Install CUDA Toolkit ${{ matrix.cuda }} - env: - cuda: ${{ matrix.cuda }} - run: | - source ./scripts/github_actions/install_cuda.sh - echo "CUDA_HOME=${CUDA_HOME}" >> $GITHUB_ENV - echo "${CUDA_HOME}/bin" >> $GITHUB_PATH - echo "LD_LIBRARY_PATH=${CUDA_HOME}/lib:${CUDA_HOME}/lib64:${LD_LIBRARY_PATH}" >> $GITHUB_ENV - shell: bash - - - name: Display NVCC version - run: | - which nvcc - nvcc --version - - - name: Install GCC ${{ matrix.gcc }} - run: | - sudo apt-get install -y gcc-${{ matrix.gcc }} g++-${{ matrix.gcc }} - echo "CC=/usr/bin/gcc-${{ matrix.gcc }}" >> $GITHUB_ENV - echo "CXX=/usr/bin/g++-${{ matrix.gcc }}" >> $GITHUB_ENV - echo "CUDAHOSTCXX=/usr/bin/g++-${{ matrix.gcc }}" >> $GITHUB_ENV - - - name: Install git lfs - run: | - sudo apt-get install -y git-lfs - - - name: Setup Python ${{ matrix.python-version }} - uses: actions/setup-python@v2 - with: - python-version: ${{ matrix.python-version }}
- - - name: Display Python version - run: python -c "import sys; print(sys.version)" - - - name: Install PyTorch ${{ matrix.torch }} - env: - cuda: ${{ matrix.cuda }} - torch: ${{ matrix.torch }} - shell: bash - run: | - python3 -m pip install -q --upgrade pip - python3 -m pip install -q wheel twine typing_extensions - python3 -m pip install -q bs4 requests tqdm - - ./scripts/github_actions/install_torch.sh - python3 -c "import torch; print('torch version:', torch.__version__)" - - - name: Download cudnn 8.0 - env: - cuda: ${{ matrix.cuda }} - run: | - ./scripts/github_actions/install_cudnn.sh - - - name: Configure CMake - shell: bash - run: | - pwd - mkdir build - cd build - cmake -DCMAKE_BUILD_TYPE=$BUILD_TYPE .. - cat k2/csrc/version.h - - - name: Build k2 - shell: bash - run: | - export K2_CMAKE_ARGS="-DCMAKE_BUILD_TYPE=$BUILD_TYPE" - export K2_MAKE_ARGS="-j2" - python3 setup.py bdist_wheel - ls -lh dist/ - ls -lh build/* - - - name: Upload Wheel - uses: actions/upload-artifact@v2 - with: - name: gcc-${{ matrix.gcc }}-cuda-${{ matrix.cuda }}-torch-${{ matrix.torch }}-python-${{ matrix.python-version }}-${{ matrix.os }} - path: dist/*.whl diff --git a/.github/workflows/build_conda.yml b/.github/workflows/build_conda.yml deleted file mode 100644 index b8107d0c6..000000000 --- a/.github/workflows/build_conda.yml +++ /dev/null @@ -1,259 +0,0 @@ -# Copyright 2021 Xiaomi Corp. (author: Fangjun Kuang) - -# See ../../LICENSE for clarification regarding multiple authors -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
-# See the License for the specific language governing permissions and -# limitations under the License. - -# refer to https://github.com/actions/starter-workflows/pull/47/files - -name: build_conda_cuda - -on: - push: - branches: - - conda-cuda - -env: - K2_BUILD_TYPE: Release - -jobs: - build_conda_cuda: - runs-on: ${{ matrix.os }} - strategy: - fail-fast: false - matrix: - os: [ubuntu-18.04] - python-version: ["3.6", "3.7", "3.8", "3.9", "3.10"] - cuda: ["10.1", "10.2", "11.0", "11.1", "11.3", "11.5"] - # from https://download.pytorch.org/whl/torch_stable.html - # Note: There are no torch versions for CUDA 11.2 - # - # 1.11.x supports: cuda10.2 (default), 11.3, 11.5 - # PyTorch 1.10.x supports: 10.2 (default), 11.1, 11.3 - # PyTorch 1.9.x supports: 10.2 (default), 11.1 - # PyTorch 1.8.1 supports: cuda 10.1, 10.2 (default), 11.1 - # PyTorch 1.8.0 supports: cuda 10.1, 10.2 (default), 11.1 - # PyTorch 1.7.x supports: cuda 10.1, 10.2 (default), 11.0, 9.2 (not included in this setup) - # PyTorch 1.6.0 supports: cuda 10.1, 10.2 (default), 9.2 (not included in this setup) - # PyTorch 1.5.x supports: cuda 10.1, 10.2 (default), 9.2 (not included in this setup) - # - # PyTorch 1.11.x supports Python 3.10 - # PyTorch 1.7.1, 1.8.x, 1.9.x, 1.10.x, and 1.11.x support 3.6, 3.7, 3.8, 3.9 - # PyTorch 1.7.0, 1.6.0, and 1.5.x support 3.6, 3.7, 3.8 - # - # Other PyTorch versions are not tested - # - # torch: ["1.5.0", "1.5.1", "1.6.0", "1.7.0", "1.7.1", "1.8.0", "1.8.1"] - # 1.5.x is removed because there are compilation errors. 
- # See - # https://github.com/csukuangfj/k2/runs/2533830771?check_suite_focus=true - # and - # https://github.com/NVIDIA/apex/issues/805 - torch: ["1.6.0", "1.7.0", "1.7.1", "1.8.0", "1.8.1", "1.9.0", "1.9.1", "1.10.0", "1.10.1", "1.10.2", "1.11.0"] - exclude: - - cuda: "11.5" # exclude cuda 11.5 for [1.5.0, 1.5.1, 1.6.0, 1.7.0, 1.7.1, 1.8.0, 1.8.1, 1.9.0, 1.9.1, 1.10.0, 1.10.1, 1.10.2] - torch: "1.5.0" - - cuda: "11.5" - torch: "1.5.1" - - cuda: "11.5" - torch: "1.6.0" - - cuda: "11.5" - torch: "1.7.0" - - cuda: "11.5" - torch: "1.7.1" - - cuda: "11.5" - torch: "1.8.0" - - cuda: "11.5" - torch: "1.8.1" - - cuda: "11.5" - torch: "1.9.0" - - cuda: "11.5" - torch: "1.9.1" - - cuda: "11.5" - torch: "1.10.0" - - cuda: "11.5" - torch: "1.10.1" - - cuda: "11.5" - torch: "1.10.2" - - cuda: "11.3" # exclude cuda 11.3 for [1.5.0, 1.5.1, 1.6.0, 1.7.0, 1.7.1, 1.8.0, 1.8.1, 1.9.0, 1.9.1] - torch: "1.5.0" - - cuda: "11.3" - torch: "1.5.1" - - cuda: "11.3" - torch: "1.6.0" - - cuda: "11.3" - torch: "1.7.0" - - cuda: "11.3" - torch: "1.7.1" - - cuda: "11.3" - torch: "1.8.0" - - cuda: "11.3" - torch: "1.8.1" - - cuda: "11.3" - torch: "1.9.0" - - cuda: "11.3" - torch: "1.9.1" - # - cuda: "11.0" # exclude 11.0 for [1.5.0, 1.5.1, 1.6.0, 1.8.0, 1.8.1, 1.9.0, 1.9.1, 1.10.0, 1.10.1, 1.10.2, 1.11.0] - # torch: "1.5.0" - # - cuda: "11.0" - # torch: "1.5.1" - - cuda: "11.0" - torch: "1.6.0" - - cuda: "11.0" - torch: "1.8.0" - - cuda: "11.0" - torch: "1.8.1" - - cuda: "11.0" - torch: "1.9.0" - - cuda: "11.0" - torch: "1.9.1" - - cuda: "11.0" - torch: "1.10.0" - - cuda: "11.0" - torch: "1.10.1" - - cuda: "11.0" - torch: "1.10.2" - - cuda: "11.0" - torch: "1.11.0" - # - cuda: "11.1" # exclude 11.1 for [1.5.0, 1.5.1, 1.6.0, 1.7.0, 1.7.1, 1.11.0] - # torch: "1.5.0" - # - cuda: "11.1" - # torch: "1.5.1" - - cuda: "11.1" - torch: "1.6.0" - - cuda: "11.1" - torch: "1.7.0" - - cuda: "11.1" - torch: "1.7.1" - - cuda: "11.1" - torch: "1.11.0" - - cuda: "10.1" # exclude 10.1 for 
[1.9.0, 1.9.1, 1.10.0, 1.10.1, 1.10.2, 1.11.0] - torch: "1.9.0" - - cuda: "10.1" - torch: "1.9.1" - - cuda: "10.1" - torch: "1.10.0" - - cuda: "10.1" - torch: "1.10.1" - - cuda: "10.1" - torch: "1.10.2" - - cuda: "10.1" - torch: "1.11.0" - - python-version: "3.9" # exclude Python 3.9 for [1.5.0, 1.5.1, 1.6.0, 1.7.0] - torch: "1.5.0" - - python-version: "3.9" - torch: "1.5.1" - - python-version: "3.9" - torch: "1.6.0" - - python-version: "3.9" - torch: "1.7.0" - - python-version: "3.10" # exclude Python 3.10 for [1.5.0, 1.5.1, 1.6.0, 1.7.0, 1.7.1, 1.8.0, 1.8.1, 1.9.0, 1.9.1, 1.10.0, 1.10.1, 1.10.2] - torch: "1.5.0" - - python-version: "3.10" - torch: "1.5.1" - - python-version: "3.10" - torch: "1.6.0" - - python-version: "3.10" - torch: "1.7.0" - - python-version: "3.10" - torch: "1.7.1" - - python-version: "3.10" - torch: "1.8.0" - - python-version: "3.10" - torch: "1.8.1" - - python-version: "3.10" - torch: "1.9.0" - - python-version: "3.10" - torch: "1.9.1" - - python-version: "3.10" - torch: "1.10.0" - - python-version: "3.10" - torch: "1.10.1" - - python-version: "3.10" - torch: "1.10.2" - - python-version: "3.6" # exclude Python 3.6 for [1.11.0] - torch: "1.11.0" - - steps: - # refer to https://github.com/actions/checkout - - uses: actions/checkout@v2 - with: - fetch-depth: 0 - - - name: Install CUDA Toolkit ${{ matrix.cuda }} - shell: bash -l {0} - env: - cuda: ${{ matrix.cuda }} - run: | - source ./scripts/github_actions/install_cuda.sh - echo "CUDA_HOME=${CUDA_HOME}" >> $GITHUB_ENV - echo "${CUDA_HOME}/bin" >> $GITHUB_PATH - echo "LD_LIBRARY_PATH=${CUDA_HOME}/lib:${CUDA_HOME}/lib64:${LD_LIBRARY_PATH}" >> $GITHUB_ENV - - - name: Display NVCC version - shell: bash -l {0} - run: | - which nvcc - nvcc --version - - - uses: conda-incubator/setup-miniconda@v2 - with: - auto-update-conda: true - python-version: ${{ matrix.python-version }} - activate-environment: k2 - - - name: Display Python version - shell: bash -l {0} - run: | - python3 -c "import sys; 
print(sys.version)" - which python3 - - - name: Install conda dependencies - shell: bash -l {0} - run: | - conda install -y -q anaconda-client - conda install -y -q conda-build - conda install -y -q bs4 requests tqdm - conda install -y -q -c pytorch -c conda-forge pytorch=${{ matrix.torch }} cudatoolkit=${{ matrix.cuda }} - - - name: Display conda info - shell: bash -l {0} - run: | - which conda - conda env list - conda info - nproc - - - name: Install git lfs - run: | - sudo apt-get install -y git-lfs - - - name: Download cudnn 8.0 - shell: bash -l {0} - env: - cuda: ${{ matrix.cuda }} - run: | - ./scripts/github_actions/install_cudnn.sh - - - name: Build k2 - shell: bash -l {0} - env: - K2_CUDA_VERSION: ${{ matrix.cuda }} - K2_PYTHON_VERSION: ${{ matrix.python-version}} - K2_TORCH_VERSION: ${{ matrix.torch }} - K2_CONDA_TOKEN: ${{ secrets.K2_CONDA_TOKEN}} - K2_IS_GITHUB_ACTIONS: 1 - K2_IS_FOR_CONDA: 1 - run: | - export K2_BUILD_TYPE=$K2_BUILD_TYPE - ./scripts/build_conda.sh diff --git a/.github/workflows/nightly-cpu-macos.yml b/.github/workflows/nightly-cpu-macos.yml new file mode 100644 index 000000000..0ce5354cb --- /dev/null +++ b/.github/workflows/nightly-cpu-macos.yml @@ -0,0 +1,129 @@ +# Copyright 2021 Fangjun Kuang (csukuangfj@gmail.com) + +# See ../../LICENSE for clarification regarding multiple authors +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +name: nightly_cpu_macos + +on: + schedule: + # minute (0-59) + # hour (0-23) + # day of the month (1-31) + # month (1-12) + # day of the week (0-6) + # nightly build at 23:50 UTC time every day + - cron: "50 23 * * *" + +env: + BUILD_TYPE: Release + +jobs: + enable_nightly_build: + runs-on: ubuntu-latest + outputs: + enabled: ${{ steps.set-enabled.outputs.enabled }} + steps: + - uses: actions/checkout@v2 + with: + fetch-depth: 0 + - name: Set enabled + id: set-enabled + run: | + enabled=$(python scripts/github_actions/run-nightly-build.py) + echo "enabled: $enabled" + echo "::set-output name=enabled::${enabled}" + + generate_build_matrix: + needs: enable_nightly_build + if: needs.enable_nightly_build.outputs.enabled == 'true' + # see https://github.com/pytorch/pytorch/pull/50633 + runs-on: ubuntu-latest + outputs: + matrix: ${{ steps.set-matrix.outputs.matrix }} + steps: + - uses: actions/checkout@v2 + with: + fetch-depth: 0 + - name: Generating build matrix + id: set-matrix + run: | + # outputting for debugging purposes + python scripts/github_actions/generate_build_matrix.py + MATRIX=$(python scripts/github_actions/generate_build_matrix.py) + echo "::set-output name=matrix::${MATRIX}" + + nightly_cpu_macos: + needs: generate_build_matrix + runs-on: macos-10.15 + strategy: + fail-fast: false + matrix: + ${{ fromJson(needs.generate_build_matrix.outputs.matrix) }} + + steps: + - uses: actions/checkout@v2 + with: + fetch-depth: 0 + + - name: Display date and time + run: date + + - name: Setup Python ${{ matrix.python-version }} + uses: actions/setup-python@v2 + with: + python-version: ${{ matrix.python-version }} + + - name: Display Python version + run: python -c "import sys; print(sys.version)" + + - name: Display GCC version + run: | + gcc --version + + - name: Display clang version + run: | + clang --version + + - name: Install PyTorch ${{ matrix.torch }} + shell: bash + run: | + python3 -m pip install -qq --upgrade pip + python3 -m pip install -qq 
wheel twine + python3 -m pip install -qq torch==${{ matrix.torch }} + python3 -m pip install --upgrade numpy + + - name: Build pip packages + shell: bash + run: | + export K2_CMAKE_ARGS="-DCMAKE_BUILD_TYPE=$BUILD_TYPE -DK2_WITH_CUDA=OFF" + export K2_MAKE_ARGS="-j2" + python3 setup.py bdist_wheel + ls -lh dist/ + + - name: Upload Wheel + uses: actions/upload-artifact@v2 + with: + name: torch-${{ matrix.torch }}-python-${{ matrix.python-version }}-macos-10.15 + path: dist/*.whl + + - name: Copy wheels to k2-fsa.org + if: ${{ github.repository_owner == 'k2-fsa' }} + run: | + user=${{ secrets.K2_USERNAME }} + server=${{ secrets.K2_HOST }} + port=${{ secrets.K2_PORT }} + echo "${{ secrets.K2_KEY }}" > id_rsa && chmod 600 id_rsa + scp -P $port -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i id_rsa dist/*.whl $user@$server:~/nightly/whl + rm id_rsa diff --git a/.github/workflows/nightly-cpu.yml b/.github/workflows/nightly-cpu-ubuntu.yml similarity index 54% rename from .github/workflows/nightly-cpu.yml rename to .github/workflows/nightly-cpu-ubuntu.yml index 8fdc6d0a6..b47074272 100644 --- a/.github/workflows/nightly-cpu.yml +++ b/.github/workflows/nightly-cpu-ubuntu.yml @@ -14,73 +14,63 @@ # See the License for the specific language governing permissions and # limitations under the License. 
-name: nightly-cpu +name: nightly_cpu_ubuntu on: - push: - branches: - - nightly-cpu schedule: # minute (0-59) # hour (0-23) # day of the month (1-31) # month (1-12) # day of the week (0-6) - # nightly build at 14:00 UTC time every day - - cron: "0 14 * * *" + # nightly build at 23:50 UTC time every day + - cron: "50 23 * * *" env: BUILD_TYPE: Release jobs: - nightly-cpu: - runs-on: ${{ matrix.os }} + enable_nightly_build: + runs-on: ubuntu-latest + outputs: + enabled: ${{ steps.set-enabled.outputs.enabled }} + steps: + - uses: actions/checkout@v2 + with: + fetch-depth: 0 + - name: Set enabled + id: set-enabled + run: | + enabled=$(python scripts/github_actions/run-nightly-build.py) + echo "enabled: $enabled" + echo "::set-output name=enabled::${enabled}" + + generate_build_matrix: + needs: enable_nightly_build + if: needs.enable_nightly_build.outputs.enabled == 'true' + # see https://github.com/pytorch/pytorch/pull/50633 + runs-on: ubuntu-latest + outputs: + matrix: ${{ steps.set-matrix.outputs.matrix }} + steps: + - uses: actions/checkout@v2 + with: + fetch-depth: 0 + - name: Generating build matrix + id: set-matrix + run: | + # outputting for debugging purposes + python scripts/github_actions/generate_build_matrix.py + MATRIX=$(python scripts/github_actions/generate_build_matrix.py) + echo "::set-output name=matrix::${MATRIX}" + + nightly_cpu_ubuntu: + needs: generate_build_matrix + runs-on: ubuntu-18.04 strategy: fail-fast: false matrix: - os: [ubuntu-18.04, macos-10.15] - # Python 3.9 is for PyTorch 1.7.1, 1.8.x, 1.9.x, 1.10.x, 1.11.x - python-version: ["3.6", "3.7", "3.8", "3.9", "3.10"] - torch: ["1.4.0", "1.5.0", "1.5.1", "1.6.0", "1.7.0", "1.7.1", "1.8.0", "1.8.1", "1.9.0", "1.9.1", "1.10.0", "1.10.1", "1.10.2", "1.11.0"] - exclude: - - python-version: "3.9" # exclude Python 3.9 for [1.4.0, 1.5.0, 1.5.1, 1.6.0, 1.7.0] - torch: "1.4.0" - - python-version: "3.9" - torch: "1.5.0" - - python-version: "3.9" - torch: "1.5.1" - - python-version: "3.9" - 
torch: "1.6.0" - - python-version: "3.9" - torch: "1.7.0" - - python-version: "3.10" # exclude Python 3.10 for [1.4.0, 1.5.0, 1.5.1, 1.6.0, 1.7.0, 1.7.1, 1.8.0, 1.8.1, 1.9.0, 1.9.1, 1.10.0, 1.10.1, 1.10.2] - torch: "1.4.0" - - python-version: "3.10" - torch: "1.5.0" - - python-version: "3.10" - torch: "1.5.1" - - python-version: "3.10" - torch: "1.6.0" - - python-version: "3.10" - torch: "1.7.0" - - python-version: "3.10" - torch: "1.7.1" - - python-version: "3.10" - torch: "1.8.0" - - python-version: "3.10" - torch: "1.8.1" - - python-version: "3.10" - torch: "1.9.0" - - python-version: "3.10" - torch: "1.9.1" - - python-version: "3.10" - torch: "1.10.0" - - python-version: "3.10" - torch: "1.10.1" - - python-version: "3.10" - torch: "1.10.2" - - python-version: "3.6" # exclude Python 3.6 for [1.11.0] - torch: "1.11.0" + ${{ fromJson(needs.generate_build_matrix.outputs.matrix) }} steps: - uses: actions/checkout@v2 @@ -109,13 +99,7 @@ jobs: run: | gcc --version - - name: Display clang version - if: startsWith(matrix.os, 'macos') - run: | - clang --version - - name: Install PyTorch ${{ matrix.torch }} - if: startsWith(matrix.os, 'ubuntu') shell: bash run: | python3 -m pip install -qq --upgrade pip @@ -125,15 +109,6 @@ jobs: python3 -c "import torch; print('torch version:', torch.__version__)" - - name: Install PyTorch ${{ matrix.torch }} - if: startsWith(matrix.os, 'macos') - shell: bash - run: | - python3 -m pip install -qq --upgrade pip - python3 -m pip install -qq wheel twine - python3 -m pip install -qq torch==${{ matrix.torch }} - python3 -m pip install --upgrade numpy - - name: Build pip packages shell: bash run: | @@ -145,10 +120,11 @@ jobs: - name: Upload Wheel uses: actions/upload-artifact@v2 with: - name: torch-${{ matrix.torch }}-python-${{ matrix.python-version }}-${{ matrix.os }} + name: torch-${{ matrix.torch }}-python-${{ matrix.python-version }}-ubuntu-18.04 path: dist/*.whl - name: Copy wheels to k2-fsa.org + if: ${{ github.repository_owner == 
'k2-fsa' }} run: | user=${{ secrets.K2_USERNAME }} server=${{ secrets.K2_HOST }} diff --git a/.github/workflows/nightly-windows.yml b/.github/workflows/nightly-cpu-windows.yml similarity index 61% rename from .github/workflows/nightly-windows.yml rename to .github/workflows/nightly-cpu-windows.yml index 42fa8b7bf..7074494d2 100644 --- a/.github/workflows/nightly-windows.yml +++ b/.github/workflows/nightly-cpu-windows.yml @@ -14,7 +14,7 @@ # See the License for the specific language governing permissions and # limitations under the License. -name: nightly-windows +name: nightly_cpu_windows on: schedule: @@ -23,31 +23,53 @@ on: # day of the month (1-31) # month (1-12) # day of the week (0-6) - # nightly build at 14:00 UTC time every day - - cron: "0 14 * * *" + # nightly build at 23:50 UTC time every day + - cron: "50 23 * * *" env: BUILD_TYPE: Release jobs: - nightly-windows: - runs-on: ${{ matrix.os }} + enable_nightly_build: + runs-on: ubuntu-latest + outputs: + enabled: ${{ steps.set-enabled.outputs.enabled }} + steps: + - uses: actions/checkout@v2 + with: + fetch-depth: 0 + - name: Set enabled + id: set-enabled + run: | + enabled=$(python scripts/github_actions/run-nightly-build.py) + echo "enabled: $enabled" + echo "::set-output name=enabled::${enabled}" + + generate_build_matrix: + needs: enable_nightly_build + if: needs.enable_nightly_build.outputs.enabled == 'true' + runs-on: ubuntu-latest + outputs: + matrix: ${{ steps.set-matrix.outputs.matrix }} + steps: + - uses: actions/checkout@v2 + with: + fetch-depth: 0 + - name: Generating build matrix + id: set-matrix + run: | + # outputting for debugging purposes + python scripts/github_actions/generate_build_matrix.py + MATRIX=$(python scripts/github_actions/generate_build_matrix.py) + echo "::set-output name=matrix::${MATRIX}" + + nightly_cpu_windows: + needs: generate_build_matrix + runs-on: windows-2019 strategy: fail-fast: false matrix: - os: [windows-2019] - # Python 3.9 is for PyTorch 1.7.1, 1.8.x, 1.9.0 - 
python-version: [3.6, 3.7, 3.8, 3.9] - torch: ["1.5.0", "1.5.1", "1.6.0", "1.7.0", "1.7.1", "1.8.0", "1.8.1", "1.9.0"] - exclude: - - python-version: 3.9 # exclude Python 3.9 for [1.5.0, 1.5.1, 1.6.0, 1.7.0] - torch: "1.5.0" - - python-version: 3.9 - torch: "1.5.1" - - python-version: 3.9 - torch: "1.6.0" - - python-version: 3.9 - torch: "1.7.0" + ${{ fromJson(needs.generate_build_matrix.outputs.matrix) }} steps: - uses: actions/checkout@v2 @@ -68,7 +90,7 @@ jobs: - name: Install PyTorch ${{ matrix.torch }} run: | - pip3 install -qq torch==${{ matrix.torch }}+cpu -f https://download.pytorch.org/whl/torch_stable.html + pip3 install -qq torch==${{ matrix.torch }}+cpu -f https://download.pytorch.org/whl/torch_stable.html numpy pip3 install -qq wheel twine dataclasses typing_extensions python3 -m torch.utils.collect_env @@ -78,40 +100,22 @@ jobs: cmake --version cmake --help - - name: Configure CMake - shell: bash - run: | - mkdir build_release - cd build_release - cmake -DCMAKE_BUILD_TYPE=$BUILD_TYPE -DK2_WITH_CUDA=OFF .. - ls -lh - - - name: Build k2 - run: | - cd build_release - cmake --build . 
--target _k2 --config Release - - - name: Display generated files - shell: bash - run: | - cd build_release - ls -lh bin/*/* - ls -lh lib/*/* - - name: Build wheel shell: bash run: | export K2_CMAKE_ARGS="-DK2_WITH_CUDA=OFF -DCMAKE_BUILD_TYPE=Release" python3 setup.py bdist_wheel ls -lh dist/ + pip install ./dist/*.whl - name: Upload Wheel uses: actions/upload-artifact@v2 with: - name: torch-${{ matrix.torch }}-python-${{ matrix.python-version }}-${{ matrix.os }}-cpu + name: torch-${{ matrix.torch }}-python-${{ matrix.python-version }}-windows-cpu path: dist/*.whl - name: Copy wheels to k2-fsa.org + if: ${{ github.repository_owner == 'k2-fsa' }} shell: bash run: | user=${{ secrets.K2_USERNAME }} @@ -120,3 +124,34 @@ jobs: echo "${{ secrets.K2_KEY }}" > id_rsa && chmod 600 id_rsa scp -P $port -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i id_rsa dist/*.whl $user@$server:~/nightly/whl rm id_rsa + + - name: Configure CMake + shell: bash + run: | + mkdir build_release + cd build_release + cmake -DCMAKE_BUILD_TYPE=$BUILD_TYPE -DK2_WITH_CUDA=OFF .. + ls -lh + cat k2/csrc/version.h + cat CMakeCache.txt + + - name: Build k2 + shell: bash + run: | + cd build_release + cmake --build . --target _k2 --config Release -- -m + cmake --build . 
--target ALL_BUILD --config Release + + - name: Display generated files + shell: bash + run: | + cd build_release + ls -lh lib/*/* + ls -lh bin/*/* + + - name: Run tests + shell: bash + run: | + cd build_release + # disable python tests for k2host + ctest -C Release --output-on-failure -E host diff --git a/.github/workflows/nightly.yml b/.github/workflows/nightly-cuda-ubuntu.yml similarity index 80% rename from .github/workflows/nightly.yml rename to .github/workflows/nightly-cuda-ubuntu.yml index 0cc6cb1be..ac52040fa 100644 --- a/.github/workflows/nightly.yml +++ b/.github/workflows/nightly-cuda-ubuntu.yml @@ -1,4 +1,4 @@ -name: nightly +name: nightly-cuda-ubuntu on: push: @@ -10,19 +10,35 @@ on: # day of the month (1-31) # month (1-12) # day of the week (0-6) - # nightly build at 14:00 UTC time every day - - cron: "0 14 * * *" + # nightly build at 23:50 UTC time every day + - cron: "50 23 * * *" env: BUILD_TYPE: Release jobs: + enable_nightly_build: + runs-on: ubuntu-latest + outputs: + enabled: ${{ steps.set-enabled.outputs.enabled }} + steps: + - uses: actions/checkout@v2 + with: + fetch-depth: 0 + - name: Set enabled + id: set-enabled + run: | + enabled=$(python scripts/github_actions/run-nightly-build.py) + echo "enabled: $enabled" + echo "::set-output name=enabled::${enabled}" + nightly: - runs-on: ${{ matrix.os }} + needs: enable_nightly_build + if: needs.enable_nightly_build.outputs.enabled == 'true' + runs-on: ubuntu-18.04 strategy: fail-fast: false matrix: - os: [ubuntu-18.04] cuda: ["10.1", "10.2", "11.0"] gcc: ["7"] torch: ["1.7.1"] @@ -102,6 +118,7 @@ jobs: ls -lh dist/ - name: Copy wheels to k2-fsa.org + if: ${{ github.repository_owner == 'k2-fsa' }} uses: horochx/deploy-via-scp@v1.0.1 with: host: ${{ secrets.K2_HOST }} diff --git a/.github/workflows/run-tests.yml b/.github/workflows/run-tests.yml index 6c1877586..88d1089d8 100644 --- a/.github/workflows/run-tests.yml +++ b/.github/workflows/run-tests.yml @@ -34,10 +34,10 @@ jobs: fail-fast: false 
matrix: os: [ubuntu-18.04] - cuda: ["11.1"] - gcc: ["5"] - torch: ["1.9.0"] - python-version: [3.9] + cuda: ["10.2"] + gcc: ["7"] + torch: ["1.11.0"] + python-version: ["3.10"] build_type: ["Release", "Debug"] steps: diff --git a/.github/workflows/wheel-cpu-stable.yml b/.github/workflows/wheel-cpu-macos.yml similarity index 93% rename from .github/workflows/wheel-cpu-stable.yml rename to .github/workflows/wheel-cpu-macos.yml index a87ee808b..eb44821e4 100644 --- a/.github/workflows/wheel-cpu-stable.yml +++ b/.github/workflows/wheel-cpu-macos.yml @@ -1,11 +1,11 @@ # Copyright (c) 2021 Xiaomi Corporation (authors: Fangjun Kuang) -name: Publish to PyPI macOS - stable +name: Publish to PyPI - macOS CPU on: push: - branches: - - wheel-stable + tags: + - '*' env: BUILD_TYPE: Release @@ -34,7 +34,6 @@ jobs: run: python -c "import sys; print(sys.version)" - name: Install PyTorch ${{ matrix.torch }} - if: startsWith(matrix.os, 'macos') shell: bash run: | python3 -m pip install -qq --upgrade pip @@ -61,6 +60,7 @@ jobs: path: dist/*.whl - name: Publish wheels to PyPI + if: ${{ github.repository_owner == 'k2-fsa' }} env: TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }} TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }} diff --git a/.github/workflows/wheel-cpu.yml b/.github/workflows/wheel-cpu-windows.yml similarity index 63% rename from .github/workflows/wheel-cpu.yml rename to .github/workflows/wheel-cpu-windows.yml index 007688c8a..9d84b51cc 100644 --- a/.github/workflows/wheel-cpu.yml +++ b/.github/workflows/wheel-cpu-windows.yml @@ -1,22 +1,23 @@ # Copyright (c) 2021 Xiaomi Corporation (authors: Fangjun Kuang) -name: Publish to PyPI macOS +name: Publish to PyPI - Windows CPU on: push: - branches: - - wheel + tags: + - '*' env: BUILD_TYPE: Release jobs: - PyPI-macos-cpu: + PyPI-windows-cpu: + if: ${{ false }} # Disable it at present. 
Users can install it from https://k2-fsa.org/nightly/index.html runs-on: ${{ matrix.os }} strategy: fail-fast: false matrix: - os: [macos-10.15] + os: [windows-2019] torch: ["1.7.1"] python-version: [3.6, 3.7, 3.8] @@ -25,6 +26,10 @@ jobs: with: fetch-depth: 0 + # see https://github.com/microsoft/setup-msbuild + - name: Add msbuild to PATH + uses: microsoft/setup-msbuild@v1.0.2 + - name: Setup Python ${{ matrix.python-version }} uses: actions/setup-python@v2 with: @@ -34,32 +39,32 @@ jobs: run: python -c "import sys; print(sys.version)" - name: Install PyTorch ${{ matrix.torch }} - if: startsWith(matrix.os, 'macos') - shell: bash run: | - python3 -m pip install -qq --upgrade pip - python3 -m pip install -q wheel twine typing_extensions - python3 -m pip install -qq torch==${{ matrix.torch }} + pip3 install -qq torch==${{ matrix.torch }}+cpu -f https://download.pytorch.org/whl/torch_stable.html numpy + pip3 install -qq wheel twine dataclasses typing_extensions + + python3 -m torch.utils.collect_env - name: Build pip packages shell: bash env: K2_IS_FOR_PYPI: 1 + K2_IS_STABLE: 1 run: | tag=$(python3 -c "import sys; print(''.join(sys.version[:3].split('.')))") - export K2_CMAKE_ARGS="-DCMAKE_BUILD_TYPE=$BUILD_TYPE" - export K2_MAKE_ARGS="-j2" + export K2_CMAKE_ARGS="-DK2_WITH_CUDA=OFF -DCMAKE_BUILD_TYPE=$BUILD_TYPE" python3 setup.py bdist_wheel --python-tag=py${tag} ls -lh dist/ - name: Upload Wheel uses: actions/upload-artifact@v2 with: - name: torch-${{ matrix.torch }}-python-${{ matrix.python-version }}-${{ matrix.os }}-cpu + name: torch-${{ matrix.torch }}-python-${{ matrix.python-version }}-windows-cpu path: dist/*.whl - name: Publish wheels to PyPI + if: ${{ github.repository_owner == 'k2-fsa' }} env: TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }} TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }} diff --git a/.github/workflows/wheel-stable.yml b/.github/workflows/wheel-cuda-ubuntu.yml similarity index 95% rename from .github/workflows/wheel-stable.yml rename to 
.github/workflows/wheel-cuda-ubuntu.yml index f142c2910..c4449028c 100644 --- a/.github/workflows/wheel-stable.yml +++ b/.github/workflows/wheel-cuda-ubuntu.yml @@ -1,17 +1,17 @@ # Copyright (c) 2021 Xiaomi Corporation (authors: Fangjun Kuang) -name: Publish to PyPI - stable +name: Publish to PyPI - Ubuntu CUDA on: push: - branches: - - wheel-stable + tags: + - '*' env: BUILD_TYPE: Release jobs: - PyPI: + PyPI_CUDA_Ubuntu: runs-on: ${{ matrix.os }} strategy: fail-fast: false @@ -93,6 +93,7 @@ jobs: ls -lh dist/ - name: Publish wheels to PyPI + if: ${{ github.repository_owner == 'k2-fsa' }} env: TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }} TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }} diff --git a/.gitignore b/.gitignore index 39db3723a..7a562e752 100644 --- a/.gitignore +++ b/.gitignore @@ -1,3 +1,4 @@ +k2/python/k2/torch_version.py # Build folder **/build* diff --git a/CMakeLists.txt b/CMakeLists.txt index 6f1f21611..906ebddb0 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -64,7 +64,7 @@ string(TOUPPER ${CMAKE_BUILD_TYPE} CMAKE_BUILD_TYPE_UPPERCASE) if("${CMAKE_BUILD_TYPE_UPPERCASE}" STREQUAL "DEBUG") # refer to https://docs.nvidia.com/cuda/cuda-memcheck/index.html#compilation-options # The two options are to make cuda-memcheck's stack backtrace feature more useful. 
- string(APPEND CMAKE_CUDA_FLAGS " --compiler-options -rdynamic --compiler-options -lineinfo") + set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --compiler-options -rdynamic --compiler-options -lineinfo") endif() set(CMAKE_EXPORT_COMPILE_COMMANDS ON) @@ -73,11 +73,12 @@ option(BUILD_SHARED_LIBS "Whether to build shared or static lib" ON) option(K2_USE_PYTORCH "Whether to build with PyTorch" ON) option(K2_ENABLE_BENCHMARK "Whether to enable benchmark" ON) option(K2_WITH_CUDA "Whether to build k2 with CUDA" ${_K2_WITH_CUDA}) +option(K2_ENABLE_NVTX "Whether to build k2 with the NVTX library" ON) -# If K2_WITH_CUDA is ON, then K2_ENABLE_NVTX has a default value ON -# If K2_WITH_CUDA is OFF, then K2_ENABLE_NVTX is set to OFF -include(CMakeDependentOption) -cmake_dependent_option(K2_ENABLE_NVTX "Whether to build with the NVTX library" ON K2_WITH_CUDA OFF) +if(NOT K2_WITH_CUDA) + message(STATUS "Set K2_ENABLE_NVTX to OFF since K2_WITH_CUDA is OFF") + set(K2_ENABLE_NVTX OFF CACHE BOOL "" FORCE) +endif() if(NOT K2_USE_PYTORCH) message(FATAL_ERROR "\ @@ -210,7 +211,16 @@ if(K2_WITH_CUDA) # https://www.myzhar.com/blog/tutorials/tutorial-nvidia-gpu-cuda-compute-capability/ set(K2_COMPUTE_ARCH_CANDIDATES 35 50 60 61 70 75) if(CUDA_VERSION VERSION_GREATER "11.0") - list(APPEND K2_COMPUTE_ARCH_CANDIDATES 80 86) + list(APPEND K2_COMPUTE_ARCH_CANDIDATES 80 86) + if(WIN32) + # To fix the following warning from PyTorch: + # c10/util/TypeCast.h(39): warning : calling a constexpr __host__ function from a + # __host__ __device__ function is not allowed. 
The experimental flag '--expt-relaxed-constexpr' + # can be used to allow this + string(APPEND CMAKE_CUDA_FLAGS " --expt-relaxed-constexpr ") + endif() + + string(APPEND CMAKE_CUDA_FLAGS " -Wno-deprecated-gpu-targets ") endif() message(STATUS "K2_COMPUTE_ARCH_CANDIDATES ${K2_COMPUTE_ARCH_CANDIDATES}") @@ -232,7 +242,7 @@ if(K2_WITH_CUDA) message(STATUS "K2_COMPUTE_ARCHS: ${K2_COMPUTE_ARCHS}") foreach(COMPUTE_ARCH IN LISTS K2_COMPUTE_ARCHS) - string(APPEND CMAKE_CUDA_FLAGS " --expt-extended-lambda -gencode arch=compute_${COMPUTE_ARCH},code=sm_${COMPUTE_ARCH}") + set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --expt-extended-lambda -gencode arch=compute_${COMPUTE_ARCH},code=sm_${COMPUTE_ARCH}") set(CMAKE_CUDA_ARCHITECTURES "${COMPUTE_ARCH}-real;${COMPUTE_ARCH}-virtual;${CMAKE_CUDA_ARCHITECTURES}") endforeach() # ========= Settings for CUB end========= @@ -254,12 +264,20 @@ if(K2_USE_PYTORCH) add_definitions(-DK2_USE_PYTORCH) add_definitions(-DTORCH_API_INCLUDE_EXTENSION_H) include(torch) + configure_file( + ${CMAKE_SOURCE_DIR}/k2/python/k2/torch_version.py.in + ${CMAKE_SOURCE_DIR}/k2/python/k2/torch_version.py @ONLY + ) endif() if(K2_WITH_CUDA) add_definitions(-DK2_WITH_CUDA) endif() +if(WIN32) + add_definitions(-DNOMINMAX) # Otherwise, std::max() and std::min() won't work +endif() + if(K2_WITH_CUDA AND CUDA_VERSION VERSION_LESS 11.0) # CUB is included in CUDA toolkit 11.0 and above include(cub) @@ -270,42 +288,76 @@ if(K2_WITH_CUDA) endif() include(googletest) -include(sentencepiece) -include(kaldifeat) -if(K2_WITH_CUDA) - string(APPEND CMAKE_CUDA_FLAGS " --compiler-options -Wall --compiler-options -Wno-unknown-pragmas --compiler-options -Wno-strict-overflow") +if(K2_WITH_CUDA AND NOT WIN32) + string(APPEND CMAKE_CUDA_FLAGS " --compiler-options -Wall ") + string(APPEND CMAKE_CUDA_FLAGS " --compiler-options -Wno-strict-overflow ") + string(APPEND CMAKE_CUDA_FLAGS " --compiler-options -Wno-unknown-pragmas ") message(STATUS "CMAKE_CUDA_FLAGS: ${CMAKE_CUDA_FLAGS}") endif() 
-if(NOT K2_WITH_CUDA AND NOT WIN32) - string(APPEND CMAKE_CXX_FLAGS " -Wno-unused-variable") -endif() if(NOT WIN32) - string(APPEND CMAKE_CXX_FLAGS " -Wno-strict-overflow") + string(APPEND CMAKE_CXX_FLAGS " -Wno-unused-variable ") + string(APPEND CMAKE_CXX_FLAGS " -Wno-strict-overflow ") endif() if(WIN32) # disable various warnings for MSVC # NOTE: Most of the warnings are from PyTorch C++ APIs + # 4005: macro redefinition + # 4018: signed/unsigned mismatch + # 4067: unexpected tokens following preprocessor directive # 4068: unknown pragma "unroll" - # 4996: "getenv": This function is unsafe - # 4224: conversion from 'int64_t' to 'int32_t', possible loss of data # 4099: type name first seen using 'class' now seen using 'struct' + # 4101: 'identifier' : unreferenced local variable + # 4190: 'identifier1' has C-linkage specified, but returns UDT 'identifier2' which is incompatible with C + # 4224: conversion from 'int64_t' to 'int32_t', possible loss of data + # 4244: conversion from 'const M' to 'const FloatType' + # 4251: 'type' : class 'type1' needs to have dll-interface to be used by clients of class 'type2' # 4267: conversion from 'size_t' to 'I', possible loss of data + # 4275: non - DLL-interface class 'class_1' used as base for DLL-interface class 'class_2' # 4305: truncation from 'int' to 'bool' - # 4244: conversion from 'const M' to 'const FloatType' - # 4624: destructor was implicitly defined as deleted + # 4522: 'class' : multiple assignment operators specified # 4551: function call missing argument list - # 4067: unexpected tokens following preprocessor directive - # 4819: The file contains a character that cannot be presented in the current code page. 
- # 4005: macro redefinition + # 4624: destructor was implicitly defined as deleted + # 4700: uninitialized local variable 'device' used # 4722: destructor never returns - # 4018: signed/unsigned mismatch - string(APPEND CMAKE_CXX_FLAGS " /wd4068 /wd4996 /wd4224 /wd4099 /wd4267 /wd4305 /wd4244 /wd4624 /wd4551 /wd4067 /wd4819 /wd4005 /wd4722 /wd4018") + # 4819: The file contains a character that cannot be presented in the current code page. + # 4838: conversion from 'type_1' to 'type_2' requires a narrowing conversion + # 4996: "getenv": This function is unsafe + set(disabled_warnings + /wd4005 + /wd4018 + /wd4067 + /wd4068 + /wd4099 + /wd4101 + /wd4190 + /wd4224 + /wd4251 + /wd4244 + /wd4267 + /wd4275 + /wd4305 + /wd4522 + /wd4551 + /wd4624 + /wd4700 + /wd4722 + /wd4819 + /wd4838 + /wd4996 + ) + message(STATUS "Disabled warnings: ${disabled_warnings}") + foreach(w IN LISTS disabled_warnings) + string(APPEND CMAKE_CXX_FLAGS " ${w} ") + string(APPEND CMAKE_CUDA_FLAGS " --compiler-options ${w} ") + endforeach() + string(APPEND CMAKE_CXX_FLAGS " /bigobj ") endif() message(STATUS "CMAKE_CXX_FLAGS: ${CMAKE_CXX_FLAGS}") +message(STATUS "CMAKE_CUDA_FLAGS: ${CMAKE_CUDA_FLAGS}") add_subdirectory(k2) diff --git a/cmake/moderngpu.cmake b/cmake/moderngpu.cmake index efae0a211..a7ef9b291 100644 --- a/cmake/moderngpu.cmake +++ b/cmake/moderngpu.cmake @@ -20,9 +20,9 @@ function(download_moderngpu) include(FetchContent) - # this is the latest commit of modern gpu as of 2020-09-26 - set(moderngpu_URL "https://github.com/moderngpu/moderngpu/archive/2b3985541c8e88a133769598c406c33ddde9d0a5.zip") - set(moderngpu_HASH "SHA256=191546af18cd5fb858ecb561316f3af67537ab16f610fc8f1a5febbffc27755a") + # this is the latest commit of modern gpu as of 2022-04-03 + set(moderngpu_URL "https://github.com/moderngpu/moderngpu/archive/8ec9ac0de8672de7217d014917eedec5317f75f3.zip") + set(moderngpu_HASH "SHA256=1c20ffbb81d6f7bbe6107aaa5ee6d37392677c8a5fc7894935149c3ef0a3c2fb") 
FetchContent_Declare(moderngpu URL ${moderngpu_URL} diff --git a/docs/source/installation/conda.rst b/docs/source/installation/conda.rst index d4ef221e2..bf351263a 100644 --- a/docs/source/installation/conda.rst +++ b/docs/source/installation/conda.rst @@ -63,7 +63,7 @@ Supported versions .. |conda_cuda_versions| image:: ./images/cuda_ge_10.1-orange.svg :alt: Supported cuda versions -.. |conda_pytorch_versions| image:: ./images/pytorch_ge_1.5.0-green.svg +.. |conda_pytorch_versions| image:: ./images/pytorch_ge_1.6.0-green.svg :alt: Supported pytorch versions - |conda_python_versions| diff --git a/docs/source/installation/for_developers.rst b/docs/source/installation/for_developers.rst index 5529191f8..ece6d9167 100644 --- a/docs/source/installation/for_developers.rst +++ b/docs/source/installation/for_developers.rst @@ -1,6 +1,10 @@ For developers ============== +.. hint:: + + It supports Linux (CPU + CUDA), macOS (CPU), and Windows (CPU + CUDA). + This page is for developers and advanced users. It describes how to build k2 and run tests. diff --git a/docs/source/installation/from_source.rst b/docs/source/installation/from_source.rst index 5aac6d406..97a74b793 100644 --- a/docs/source/installation/from_source.rst +++ b/docs/source/installation/from_source.rst @@ -3,6 +3,10 @@ Install from source =================== +.. hint:: + + It supports Linux (CPU + CUDA), macOS (CPU), and Windows (CPU + CUDA). + The following versions of Python, CUDA, and PyTorch are known to work. - |source_python_versions| @@ -15,7 +19,7 @@ The following versions of Python, CUDA, and PyTorch are known to work. .. |source_cuda_versions| image:: ./images/cuda_ge_10.1-orange.svg :alt: Supported cuda versions -.. |source_pytorch_versions| image:: ./images/pytorch_ge_1.5.0-green.svg +.. 
|source_pytorch_versions| image:: ./images/pytorch_ge_1.6.0-green.svg :alt: Supported pytorch versions Before compiling k2, some preparation work has to be done: diff --git a/docs/source/installation/images/README.md b/docs/source/installation/images/README.md index a63890a06..aab295a20 100644 --- a/docs/source/installation/images/README.md +++ b/docs/source/installation/images/README.md @@ -5,7 +5,7 @@ - python_ge_3.6-blue.svg - cuda_ge_10.1-orange.svg -- pytorch_ge_1.5.0-green.svg +- pytorch_ge_1.6.0-green.svg - pypi_python-3.6_3.7_3.8-blue.svg - pypi_cuda-10.1-orange.svg diff --git a/docs/source/installation/images/torch_ge_1.6.0-green.svg b/docs/source/installation/images/torch_ge_1.6.0-green.svg new file mode 100644 index 000000000..d3ece9a17 --- /dev/null +++ b/docs/source/installation/images/torch_ge_1.6.0-green.svg @@ -0,0 +1 @@ +torch: >= 1.6.0torch>= 1.6.0 \ No newline at end of file diff --git a/docs/source/installation/index.rst b/docs/source/installation/index.rst index 5b6537ba2..4049aecb0 100644 --- a/docs/source/installation/index.rst +++ b/docs/source/installation/index.rst @@ -50,7 +50,7 @@ below: .. |conda_cuda_versions| image:: ./images/cuda_ge_10.1-orange.svg :alt: Supported cuda versions -.. |conda_pytorch_versions| image:: ./images/pytorch_ge_1.5.0-green.svg +.. |conda_pytorch_versions| image:: ./images/pytorch_ge_1.6.0-green.svg :alt: Supported pytorch versions .. |pip_python_versions| image:: ./images/python_ge_3.6-blue.svg @@ -59,7 +59,7 @@ below: .. |pip_cuda_versions| image:: ./images/cuda_ge_10.1-orange.svg :alt: Supported cuda versions -.. |pip_pytorch_versions| image:: ./images/pytorch_ge_1.5.0-green.svg +.. |pip_pytorch_versions| image:: ./images/pytorch_ge_1.6.0-green.svg :alt: Supported pytorch versions .. |pypi_python_versions| image:: ./images/pypi_python-3.6_3.7_3.8-blue.svg @@ -77,7 +77,7 @@ below: .. |source_cuda_versions| image:: ./images/cuda_ge_10.1-orange.svg :alt: Supported cuda versions -.. 
|source_pytorch_versions| image:: ./images/pytorch_ge_1.5.0-green.svg +.. |source_pytorch_versions| image:: ./images/pytorch_ge_1.6.0-green.svg :alt: Supported pytorch versions Reporting issues diff --git a/docs/source/installation/pip.rst b/docs/source/installation/pip.rst index b756263b7..f145e16b6 100644 --- a/docs/source/installation/pip.rst +++ b/docs/source/installation/pip.rst @@ -7,7 +7,7 @@ Install using pip (k2-fsa.org) .. |pip_cuda_versions| image:: ./images/cuda_ge_10.1-orange.svg :alt: Supported cuda versions -.. |pip_pytorch_versions| image:: ./images/pytorch_ge_1.5.0-green.svg +.. |pip_pytorch_versions| image:: ./images/pytorch_ge_1.6.0-green.svg :alt: Supported pytorch versions You can find a list of nightly pre-built diff --git a/k2/csrc/CMakeLists.txt b/k2/csrc/CMakeLists.txt index 863d9fc58..8248be6a9 100644 --- a/k2/csrc/CMakeLists.txt +++ b/k2/csrc/CMakeLists.txt @@ -38,6 +38,12 @@ add_library(k2_nvtx INTERFACE) target_include_directories(k2_nvtx INTERFACE ${CMAKE_SOURCE_DIR}) if(K2_ENABLE_NVTX) target_compile_definitions(k2_nvtx INTERFACE K2_ENABLE_NVTX=1) + if(WIN32) + target_include_directories(k2_nvtx INTERFACE + ${CUDA_TOOLKIT_ROOT_DIR}/include/nvtx3 + "C:/Program Files/NVIDIA Corporation/NvToolsExt/include" + ) + endif() endif() add_subdirectory(host) @@ -115,8 +121,37 @@ target_link_libraries(context PUBLIC fsa) target_link_libraries(context PUBLIC k2_log) target_link_libraries(context PUBLIC k2_nvtx) if(K2_USE_PYTORCH) - target_link_libraries(context PUBLIC ${TORCH_LIBRARIES}) + if(NOT WIN32) + target_link_libraries(context PUBLIC ${TORCH_LIBRARIES}) + else() + # see https://discuss.pytorch.org/t/nvcc-fatal-a-single-input-file-is-required-for-a-non-link-phase-when-an-outputfile-is-specified/142843/6 + # Depending on ${TORCH_LIBRARIES} will introduce a compile time option "/bigobj", + # which causes the error in the above link. + # + # It would be ideal to remove /bigobj so that we can use ${TORCH_LIBRARIES}. 
+ # To make life simpler, we use the following approach. + # + message(STATUS "TORCH_DIR: ${TORCH_DIR}") # TORCH_DIR is defined in cmake/torch.cmake + # target_link_libraries(context PUBLIC D:/software/anaconda3/envs/py38/Lib/site-packages/torch/lib/*.lib) + target_link_libraries(context PUBLIC ${TORCH_DIR}/lib/*.lib) + target_include_directories(context PUBLIC ${TORCH_DIR}/include) + target_include_directories(context PUBLIC ${TORCH_DIR}/include/torch/csrc/api/include) + endif() + + if(UNIX AND NOT APPLE) + # It causes errors on macOS + target_link_libraries(context PUBLIC ${TORCH_DIR}/lib/libtorch_python.so) + # CAUTION: It is PYTHON_LIBRARY on unix + target_link_libraries(context PUBLIC ${PYTHON_LIBRARY}) + message(STATUS "PYTHON_LIBRARIES: ${PYTHON_LIBRARY}") + elseif(WIN32) + target_link_libraries(context PUBLIC ${TORCH_DIR}/lib/torch_python.lib) + # CAUTION: It is PYTHON_LIBRARIES on Windows + target_link_libraries(context PUBLIC ${PYTHON_LIBRARIES}) + message(STATUS "PYTHON_LIBRARIES: ${PYTHON_LIBRARIES}") + endif() endif() +target_include_directories(context PUBLIC ${PYTHON_INCLUDE_DIRS}) #---------------------------- Test K2 CUDA sources ---------------------------- diff --git a/k2/csrc/array_of_ragged.cu b/k2/csrc/array_of_ragged.cu index cd93434d9..11f8e8ea4 100644 --- a/k2/csrc/array_of_ragged.cu +++ b/k2/csrc/array_of_ragged.cu @@ -1,5 +1,7 @@ /** - * Copyright 2022 Xiaomi Corporation (authors: Wei Kang) + * Copyright 2022 Xiaomi Corporation (authors: Daniel Povey, Wei Kang) + * 2022 ASLP@NWPU (authors: Hang Lyu) + * * See LICENSE for clarification regarding multiple authors * @@ -20,35 +22,107 @@ namespace k2 { -Array1OfRaggedShape::Array1OfRaggedShape(RaggedShape *src, int32_t num_srcs) - : num_srcs_(num_srcs) { - K2_CHECK_GE(num_srcs, 1); - K2_CHECK(src); - num_axes_ = src[0].NumAxes(); - c_ = src[0].Context(); +Array1OfRaggedShape::Array1OfRaggedShape(RaggedShape *srcs, int32_t num_srcs) : + num_srcs_(num_srcs) { + K2_CHECK_GT(num_srcs, 0); + 
K2_CHECK(srcs); + + // Initialize context and num_axes_. + c_ = srcs[0].Context(); + num_axes_ = srcs[0].NumAxes(); + + // Check that they have the same num-axes and compatible contexts. + for (int32_t i = 1; i < num_srcs_; ++i) { + K2_CHECK_EQ(num_axes_, srcs[i].NumAxes()); + K2_CHECK(c_->IsCompatible(*(srcs[i].Context()))); + } - row_splits_ = - Array2(GetCpuContext(), num_axes_ - 1, num_srcs_); + // Initialize row_splits_, row_ids_ and tot_sizes_. + // + // Notice: since the Data() function is a __host__ function, it cannot be + // called on GPU. This limits us to working on the CPU, so the row_splits_ and + // row_ids_ are populated on CPU, although the operator() of Array2 is a + // __host__ and __device__ function. Bear in mind, we cannot access the + // GPU data on CPU. + row_splits_ = Array2(GetCpuContext(), + num_axes_ - 1, num_srcs_); row_ids_ = Array2(GetCpuContext(), num_axes_ - 1, num_srcs_); + + // Notice: no matter whether the return value of TotSize() comes from 'cached_tot_size' + // or from the Back() function (i.e. operator[]) of array1, it is a CPU value. tot_sizes_ = Array1(GetCpuContext(), num_axes_, 0); auto row_splits_acc = row_splits_.Accessor(), row_ids_acc = row_ids_.Accessor(); + // Bear in mind, when axis == 0, the TotSize() is row_splits.Dim() - 1. + // When 0 < axis < NumAxes(), the TotSize() is row_splits.Back().
int32_t *tot_sizes_data = tot_sizes_.Data(); for (int32_t i = 0; i < num_srcs_; ++i) { - K2_CHECK_EQ(src[i].NumAxes(), num_axes_); - K2_CHECK(c_->IsCompatible(*(src[i].Context()))); for (int32_t j = 1; j < num_axes_; ++j) { - row_splits_acc(j - 1, i) = src[i].RowSplits(j).Data(); - row_ids_acc(j - 1, i) = src[i].RowIds(j).Data(); - tot_sizes_data[j] += src[i].TotSize(j); + row_splits_acc(j - 1, i) = srcs[i].RowSplits(j).Data(); + row_ids_acc(j - 1, i) = srcs[i].RowIds(j).Data(); + tot_sizes_data[j] += srcs[i].TotSize(j); } - tot_sizes_data[0] += src[i].TotSize(0); + tot_sizes_data[0] += srcs[i].TotSize(0); } row_splits_ = row_splits_.To(c_); row_ids_ = row_ids_.To(c_); + tot_sizes_ = tot_sizes_.To(c_); + + + // Initialize meta_row_splits_ + // We populate this on CPU and transfer to GPU. + meta_row_splits_ = Array2(GetCpuContext(), num_axes_, num_srcs_ + 1); + offsets_ = Array2(GetCpuContext(), num_axes_ + 1, num_srcs_ + 1); + + auto meta_row_splits_acc = meta_row_splits_.Accessor(), + offsets_acc = offsets_.Accessor(); + + // Initialize the 1st row of offsets_, which contains 0,1,2,... + for (int32_t col = 0; col <= num_srcs_; ++col) { + offsets_acc(0, col) = col; + } + // Initialize the 1st col of meta_row_splits_ and offsets_ + for (int32_t row = 0; row < num_axes_; ++row) { + meta_row_splits_acc(row, 0) = 0; + offsets_acc(row + 1, 0) = 0; + } + + // The meta_row_splits_ is the cumulative sum of the tot-sizes of the + // individual arrays.
+ for (int32_t i = 0; i < num_axes_; ++i) { + for (int32_t j = 1; j <= num_srcs_; ++j) { + meta_row_splits_acc(i, j) = meta_row_splits_acc(i, j - 1) + + srcs[j - 1].TotSize(i); + offsets_acc(i + 1, j) = meta_row_splits_acc(i, j); + } + } + + // Initialize meta_row_ids_ + // Elements are in [0, NumSrcs() - 1] + meta_row_ids_.resize(num_axes_); + + for (int32_t axis = 0; axis < num_axes_; ++axis) { + // The length equals TotSize(axis) + meta_row_ids_.at(axis) = Array1( + GetCpuContext(), meta_row_splits_acc(axis, num_srcs_)); + int32_t *meta_row_ids_data = meta_row_ids_[axis].Data(); + + int32_t cur_row_start = meta_row_splits_acc(axis, 0); + for (int32_t src = 0; src < num_srcs_; ++src) { + int32_t next_row_start = meta_row_splits_acc(axis, src + 1); + for (; cur_row_start < next_row_start; ++cur_row_start) { + meta_row_ids_data[cur_row_start] = src; + } + } + meta_row_ids_[axis] = meta_row_ids_[axis].To(c_); + } + + meta_row_splits_ = meta_row_splits_.To(c_); + offsets_ = offsets_.To(c_); } + } // namespace k2 diff --git a/k2/csrc/array_of_ragged.h b/k2/csrc/array_of_ragged.h index 31349cf91..facc02dc0 100644 --- a/k2/csrc/array_of_ragged.h +++ b/k2/csrc/array_of_ragged.h @@ -1,5 +1,6 @@ /** * Copyright 2022 Xiaomi Corporation (authors: Daniel Povey, Wei Kang) + * 2022 ASLP@NWPU (authors: Hang Lyu) + * * See LICENSE for clarification regarding multiple authors * @@ -24,31 +25,48 @@ #include #include "k2/csrc/array.h" +#include "k2/csrc/array_ops.h" #include "k2/csrc/context.h" #include "k2/csrc/log.h" -#include "k2/csrc/ragged_ops.h" +#include "k2/csrc/ragged.h" namespace k2 { + +/* + Array1OfRagged is a 1-dimensional array of Ragged. + It is intended for situations where you want to do some operations on + arrays of ragged arrays, without explicitly concatenating them (e.g. to + save time). This is a fairly low-level interface, intended to + be used mostly by CUDA/C++ implementation code.
It is a convenience + wrapper that saves you the trouble of creating arrays of pointers. + */ + + /* Array1OfRaggedShape is a convenience function that gives you easy access to pointers-of-pointers for an array of ragged shapes. */ class Array1OfRaggedShape { public: + // Default constructor. + Array1OfRaggedShape() = default; + /* Constructor. Args: - srcs: pointers to the source shapes, a CPU pointer - num_srcs: the number of source shapes. All shapes must have the - same NumAxes() and must be on the same device. + srcs: pointers to the source shapes, a CPU pointer + num_srcs: the number of source shapes. All shapes must have the + same NumAxes() and must be on the same device. TODO: we'll likely, later, add optional args which dictate which of the MetaRowSplits() and MetaRowIds() are to be pre-populated; this should enable us to save kernels by combining certain operations across the axes. + */ - Array1OfRaggedShape(RaggedShape *srcs, int32_t num_srcs); - Array1OfRaggedShape() = default; + Array1OfRaggedShape(RaggedShape *srcs, + int32_t num_srcs); + int32_t NumSrcs() const { return num_srcs_; } int32_t NumAxes() const { return num_axes_; } @@ -63,23 +81,35 @@ class Array1OfRaggedShape { // Returns device-accessible vector of row-splits for a particular // axis, indexed by 0 <= src < num_srcs. const int32_t **RowSplits(int32_t axis) { - return row_splits_.Row(axis - 1).Data(); + K2_CHECK_LT(static_cast(axis), + static_cast(num_axes_)); + return row_splits_.Row(axis - 1).Data(); } // Returns device-accessible array of row-ids for the individual shapes // indexed [axis-1][src], with 0 <= src < num_srcs. The shape of this // Array2 is [NumAxes() - 1][NumSrcs()]. - const Array2 *RowIds() const { return &row_ids_; } + const Array2 *RowIds() const { return &row_ids_; } + // Returns device-accessible vector of row-ids for a particular // axis, indexed by 0 <= src < num_srcs.
- const int32_t **RowIds(int32_t axis) { return row_ids_.Row(axis - 1).Data(); } + const int32_t **RowIds(int32_t axis) { + K2_CHECK_LT(static_cast(axis), + static_cast(num_axes_)); + return row_ids_.Row(axis - 1).Data(); + } + /* Return the total size on this axis, which is the sum of the TotSize() of the individual shapes. Requires 0 <= axis < NumAxes() and for axis=0 the returned value is the same as Dim0(). */ - int32_t TotSize(int32_t axis) const { return tot_sizes_[axis]; } + int32_t TotSize(int32_t axis) const { + K2_CHECK_LT(static_cast(axis), + static_cast(num_axes_)); + return tot_sizes_[axis]; + } // equivalent to TotSize(0). int32_t Dim0() const { return TotSize(0); } @@ -88,7 +118,7 @@ class Array1OfRaggedShape { along the src axis, of the tot-sizes of the individual arrays. This Array2 is of shape [NumAxes()][NumSrcs() + 1], indexed [axis][src]; caution, the indexing is different from RowSplits(), there is no offset. - Also, the meta_row_splits0 is a thing, unlike with regular row-splits + Also, the meta_row_splits_ is a thing, unlike with regular row-splits which start from 1. Caution: the lengths of the arrays pointed to by the elements of this @@ -99,38 +129,47 @@ class Array1OfRaggedShape { to GPU, this will be faster than invoking an extra kernel in normal cases when the NumSrcs() is small. [Also: see GetRowInfoMulti()]. */ - // TODO: implement it... - Array2 MetaRowSplits(); + const Array2 &MetaRowSplits() const { return meta_row_splits_; } // could POSSIBLY add this so this code could be used in functions like // Stack(). would be like MetaRowSplits but with an extra 1st row containing // 0,1,2,... We could perhaps create it with 1 extra initial row so this is // always convenient to output. - // TODO: implement it... 
- Array2 Offsets(); + const Array2 &Offsets() const { return offsets_; } /* - Returns the meta-row-splits for a particular axis, with 0 <= axis < - NumAxes(); this is the cumulative sum of the TotSize(axis) for all of the - sources, with MetaRowSplits(axis).Dim() == NumSrcs() + 1. + Returns the meta-row-splits for a particular axis, with + 0 <= axis < NumAxes(); + this is the cumulative sum of the TotSize(axis) for all of the sources, + with MetaRowSplits(axis).Dim() == NumSrcs() + 1. - Note: in ragged_ops.cu we refer to this as composed_row_splits + Note: in ragged_ops.cu we refer to this as composed_row_splits */ - // TODO: implement it... - Array1 MetaRowSplits(int32_t axis); + Array1 MetaRowSplits(int32_t axis) { + K2_CHECK_LT(static_cast(axis), + static_cast(num_axes_)); + return meta_row_splits_.Row(axis); + } /* Return the device-accessible meta-row-ids, which are the row-ids corresponding to MetaRowSplits(); this tells us, for indexes into the - appended/concatenated array, which source array they belong to, i.e. - elements are in [0,NumSrcs()-1]. + appended/concatenated array, which source array they belong to, + i.e. elements are in [0,NumSrcs()-1]. This cannot be an Array2 because unlike the MetaRowSplits(), all the row-ids arrays are of different lengths. Note: in ragged_ops.cu we refer to this as composed_row_ids. */ - // TODO: implement it... - Array1 MetaRowIds(); + Array1 MetaRowIds() { + Array1 ans(GetCpuContext(), num_axes_); + const int32_t* *ans_data = ans.Data(); + for (int32_t i = 0; i < num_axes_; ++i) { + ans_data[i] = meta_row_ids_[i].Data(); + } + ans = ans.To(c_); + return ans; + } /* Returns the meta-row-ids for a particular axis, with 0 <= axis < NumAxes(); @@ -140,18 +179,28 @@ class Array1OfRaggedShape { would tell us which source an idx012 with value 100 into axis 2 of concatenated array would come from. */ - // TODO: implement it...
- Array1 MetaRowIds(int32_t axis); + const Array1 &MetaRowIds(int32_t axis) const { + K2_CHECK_LT(static_cast(axis), + static_cast(num_axes_)); + return meta_row_ids_[axis]; + } private: ContextPtr c_; int32_t num_srcs_; int32_t num_axes_; + Array2 row_splits_; // shape [num_axes_ - 1][num_srcs_] Array2 row_ids_; // shape [num_axes_ - 1][num_srcs_] - Array1 tot_sizes_; // dim num_axes_, this is on CPU + Array1 tot_sizes_; // dim num_axes_ + + Array2 meta_row_splits_; // shape [num_axes_][num_srcs_ + 1] + Array2 offsets_; // shape [num_axes_][num_srcs_ + 1] + std::vector > meta_row_ids_; // dim num_axes_ }; + + /* Array1OfRagged is a 1-dimensional array of Ragged. It is intended for situations where you want to do some operations on @@ -171,17 +220,14 @@ struct Array1OfRagged { int32_t NumSrcs() const { return values.Dim(); } ContextPtr &Context() { return shape.Context(); } + // Default constructor will not leave this a valid Array1OfRagged object; + // you shouldn't do anything with it. Both members will be initialized with + // default constructors. Array1OfRagged() = default; - /* - Constructor. - Args: - srcs: pointers to the source ragged tensors, a CPU pointer - num_srcs: the number of source ragged tensors. All ragged tensors must - have the same NumAxes() and must be on the same device. - */ + // All 'srcs' must have the same NumAxes() and be on the same device.
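The `Array1OfRagged` constructor that follows essentially validates the sources and then gathers one `values` pointer per source into a CPU-side array. A rough stand-alone sketch of that gathering step, using a hypothetical `ToyRagged` in place of `k2::Ragged<T>` (the real code would afterwards transfer the pointer array to the sources' device):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for Ragged<int32_t>: just an axis count and a
// flat values array.
struct ToyRagged {
  int32_t num_axes;
  std::vector<int32_t> values;
};

// Mirrors the constructor's checks: at least one source, all sources
// sharing the same number of axes; then collect one values pointer
// per source.
std::vector<int32_t *> GatherValuesPtrs(std::vector<ToyRagged> &srcs) {
  assert(!srcs.empty());
  for (auto &s : srcs) assert(s.num_axes == srcs[0].num_axes);
  std::vector<int32_t *> ptrs;
  ptrs.reserve(srcs.size());
  for (auto &s : srcs) ptrs.push_back(s.values.data());
  return ptrs;
}
```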
Array1OfRagged(Ragged *srcs, int32_t num_srcs) { - K2_CHECK_GE(num_srcs, 1); + K2_CHECK_GT(num_srcs, 0); K2_CHECK(srcs); values = Array1(GetCpuContext(), num_srcs); T **values_data = values.Data(); @@ -195,6 +241,7 @@ struct Array1OfRagged { } }; + } // namespace k2 #endif // K2_CSRC_ARRAY_OF_RAGGED_H_ diff --git a/k2/csrc/array_of_ragged_test.cu b/k2/csrc/array_of_ragged_test.cu index 69b482315..4cb48bdb6 100644 --- a/k2/csrc/array_of_ragged_test.cu +++ b/k2/csrc/array_of_ragged_test.cu @@ -43,6 +43,7 @@ void TestArray1OfRaggedConstruct() { for (int32_t j = 1; j < num_axes; ++j) { const int32_t **row_splits = array_of_ragged.shape.RowSplits(j); const int32_t **row_ids = array_of_ragged.shape.RowIds(j); + Array1 expected_row_splits(GetCpuContext(), num_srcs); Array1 expected_row_ids(GetCpuContext(), num_srcs); int32_t **expected_row_splits_data = expected_row_splits.Data(); @@ -55,6 +56,7 @@ void TestArray1OfRaggedConstruct() { expected_row_ids = expected_row_ids.To(c); expected_row_splits_data = expected_row_splits.Data(); expected_row_ids_data = expected_row_ids.Data(); + Array1 flags(c, 2, 1); int32_t *flags_data = flags.Data(); K2_EVAL( @@ -67,6 +69,38 @@ void TestArray1OfRaggedConstruct() { for (int32_t i = 0; i < num_srcs; ++i) { K2_CHECK_EQ(array_of_ragged.values[i], raggeds[i].values.Data()); } + + for (int32_t j = 0; j < num_axes; ++j) { + Array1 meta_row_splits(array_of_ragged.shape.MetaRowSplits(j)); + Array1 meta_row_ids(array_of_ragged.shape.MetaRowIds(j)); + Array1 offsets( + array_of_ragged.shape.Offsets().RowArange(j + 1, j + 2).Row(0)); + + Array1 expected_meta_row_splits(GetCpuContext(), num_srcs + 1); + int32_t *expected_meta_row_splits_data = expected_meta_row_splits.Data(); + for (int32_t i = 0; i < num_srcs; ++i) { + expected_meta_row_splits_data[i] = raggeds[i].TotSize(j); + } + ExclusiveSum(expected_meta_row_splits, &expected_meta_row_splits); + expected_meta_row_splits = expected_meta_row_splits.To(c); + Array1 expected_meta_row_ids(c, + 
array_of_ragged.shape.TotSize(j)); + RowSplitsToRowIds(expected_meta_row_splits, &expected_meta_row_ids); + + K2_CHECK(Equal(meta_row_splits, expected_meta_row_splits)); + K2_CHECK(Equal(meta_row_ids, expected_meta_row_ids)); + K2_CHECK(Equal(offsets, expected_meta_row_splits)); + } + + Array1 expected_offsets_1st_row(GetCpuContext(), num_srcs + 1); + int32_t *expected_offsets_1st_row_data = expected_offsets_1st_row.Data(); + for (int32_t i = 0; i <= num_srcs; ++i) { + expected_offsets_1st_row_data[i] = i; + } + expected_offsets_1st_row = expected_offsets_1st_row.To(c); + Array1 offsets_1st_row( + array_of_ragged.shape.Offsets().RowArange(0, 1).Row(0)); + K2_CHECK(Equal(offsets_1st_row, expected_offsets_1st_row)); } } diff --git a/k2/csrc/benchmark/CMakeLists.txt b/k2/csrc/benchmark/CMakeLists.txt index cabb612cb..57570c926 100644 --- a/k2/csrc/benchmark/CMakeLists.txt +++ b/k2/csrc/benchmark/CMakeLists.txt @@ -1,6 +1,7 @@ function(k2_add_benchmark source) get_filename_component(name ${source} NAME_WE) add_executable(${name} ${source}) + set_target_properties(${name} PROPERTIES CUDA_SEPARABLE_COMPILATION ON) target_link_libraries(${name} PRIVATE benchmark) endfunction() diff --git a/k2/csrc/fsa.h b/k2/csrc/fsa.h index c72a31d13..66ae4b626 100644 --- a/k2/csrc/fsa.h +++ b/k2/csrc/fsa.h @@ -34,7 +34,7 @@ struct Arc { int32_t label; float score; - __host__ __device__ __forceinline__ Arc() = default; + Arc() = default; __host__ __device__ __forceinline__ Arc(int32_t src_state, int32_t dest_state, int32_t label, float score) : src_state(src_state), diff --git a/k2/csrc/host/CMakeLists.txt b/k2/csrc/host/CMakeLists.txt index 208f1651c..4d183de8e 100644 --- a/k2/csrc/host/CMakeLists.txt +++ b/k2/csrc/host/CMakeLists.txt @@ -26,11 +26,20 @@ target_link_libraries(fsa PUBLIC k2_log) target_link_libraries(fsa PUBLIC k2_nvtx) target_include_directories(fsa PUBLIC ${CUDA_TOOLKIT_INCLUDE}) if(K2_ENABLE_NVTX) - target_link_libraries(fsa - PUBLIC - -L${CUDA_TOOLKIT_ROOT_DIR}/lib64 
# for /usr/local/cuda - -L${CUDA_TOOLKIT_ROOT_DIR}/lib # for conda - nvToolsExt) + if(NOT WIN32) + target_link_libraries(fsa + PUBLIC + -L${CUDA_TOOLKIT_ROOT_DIR}/lib64 # for /usr/local/cuda + -L${CUDA_TOOLKIT_ROOT_DIR}/lib # for conda + nvToolsExt) + else() + target_link_directories(fsa PUBLIC + ${CUDA_TOOLKIT_ROOT_DIR}/lib64 # for /usr/local/cuda + ${CUDA_TOOLKIT_ROOT_DIR}/lib # for conda + "C:/Program Files/NVIDIA Corporation/NvToolsExt/lib/x64/" + ) + target_link_libraries(fsa PUBLIC NvToolsExt64_1) + endif() endif() #---------------------------- Test K2 host sources ---------------------------- diff --git a/k2/csrc/log.h b/k2/csrc/log.h index a046621c5..04624d66f 100644 --- a/k2/csrc/log.h +++ b/k2/csrc/log.h @@ -34,13 +34,11 @@ #include #include #include -#include #include #include // NOLINT #include #include #include -#include #include #include "k2/csrc/macros.h" @@ -62,9 +60,6 @@ std::ostream &operator<<(std::ostream &os, const std::vector &vec); namespace internal { -// Return a string like 2021-10-12 00:19:39.265 -std::string GetTimeStamp(); - #if defined(NDEBUG) constexpr bool kDisableDebug = true; #else @@ -112,19 +107,6 @@ std::string GetStackTrace(); */ K2_CUDA_HOSTDEV LogLevel GetCurrentLogLevel(); -/// convert /path/to/k2/k2/csrc to k2/csrc -inline const char *RemovePrefix(const char *filename) { - const char *first = strstr(filename, "/k2"); - const char *second = nullptr; - - do { - second = strstr(first + 2, "/k2"); - if (second != nullptr) first = second; - } while (second); - - return first + 1; -} - class Logger { public: K2_CUDA_HOSTDEV Logger(const char *filename, const char *func_name, @@ -134,12 +116,6 @@ class Logger { line_num_(line_num), level_(level) { cur_level_ = GetCurrentLogLevel(); -#if !defined(__CUDA_ARCH__) - filename_ = RemovePrefix(filename); - if (cur_level_ <= level_) { - printf("%s ", GetTimeStamp().c_str()); - } -#endif switch (level) { case TRACE: if (cur_level_ <= TRACE) printf("[T] "); @@ -162,7 +138,7 @@ class Logger 
{ } if (cur_level_ <= level_) { - printf("%s:%u:%s ", filename_, line_num, func_name); + printf("%s:%u:%s ", filename, line_num, func_name); #if defined(__CUDA_ARCH__) printf("block:[%u,%u,%u], thread: [%u,%u,%u] ", blockIdx.x, blockIdx.y, blockIdx.z, threadIdx.x, threadIdx.y, threadIdx.z); @@ -175,13 +151,7 @@ class Logger { Some bad things happened. Please read the above error messages and stack trace. If you are using Python, the following command may be helpful: - $ gdb --args python /path/to/your/code.py - (gdb) catch throw - (gdb) b k2::SomeFunctionName - # For instance - (gdb) b k2::ShortestPath - (gdb) run - (gdb) bt + gdb --args python /path/to/your/code.py (You can use `gdb` to debug the code. Please consider compiling a debug version of k2.). @@ -196,7 +166,13 @@ class Logger { // this is usually caused by one of the K2_CHECK macros and the detailed // error messages should have already been printed by the macro, so we // use an arbitrary string here. +#ifndef _MSC_VER __assert_fail(kErrMsg, filename_, line_num_, func_name_); +#else + (void)kErrMsg; + assert(0); +#endif // _MSC_VER + #else std::string stack_trace = GetStackTrace(); if (!stack_trace.empty()) { @@ -398,8 +374,7 @@ inline K2_CUDA_HOSTDEV LogLevel GetCurrentLogLevel() { #define K2_CHECK_CUDA_ERROR(x) \ K2_CHECK_EQ(x, cudaSuccess) << " Error: " << cudaGetErrorString(x) << ". " #else -#define K2_CHECK_CUDA_ERROR(...) \ - K2_LOG(FATAL) << "Don't call me (Not compiled with CUDA ?)" +#define K2_CHECK_CUDA_ERROR(...) K2_LOG(FATAL) << "K2 compiled without CUDA support" #endif // The parameter of `K2_CUDA_SAFE_CALL` should be cuda function call or kernel @@ -421,7 +396,7 @@ inline K2_CUDA_HOSTDEV LogLevel GetCurrentLogLevel() { // Use a separate K2_CUDA_SAFE_CALL() for CPU // because the kernel invocation syntax <<< >>> // is not valid C++ -#define K2_CUDA_SAFE_CALL(...) K2_LOG(FATAL) << "Don't call me!" +#define K2_CUDA_SAFE_CALL(...) 
K2_LOG(FATAL) << "K2 compiled without CUDA support" #endif // ------------------------------------------------------------ diff --git a/k2/csrc/log_test.cu b/k2/csrc/log_test.cu index bd168bc97..8b86dafc3 100644 --- a/k2/csrc/log_test.cu +++ b/k2/csrc/log_test.cu @@ -28,7 +28,11 @@ TEST(Log, Cpu) { K2_LOG(DEBUG) << "Debug message"; K2_LOG(INFO) << "Info message"; K2_LOG(WARNING) << "Warning message"; +#ifndef _MSC_VER + // It fails on Windows with the following error: + // k2/csrc/log_test.cu(31): error : expected a ")" K2_LOG(ERROR) << "Error message"; +#endif K2_DLOG(INFO) << "This is printed only in debug mode"; diff --git a/k2/csrc/macros_test.cu b/k2/csrc/macros_test.cu index 6963cc45e..ef68e574e 100644 --- a/k2/csrc/macros_test.cu +++ b/k2/csrc/macros_test.cu @@ -27,7 +27,7 @@ namespace k2 { -static void TestEval() { +/*static*/ void TestEval() { for (auto &c : {GetCpuContext(), GetCudaContext()}) { Array1 array = Range(c, 3, 0); int32_t *array_data = array.Data(); @@ -46,7 +46,7 @@ static void TestEval() { } } -static void TestEval2() { +/*static*/ void TestEval2() { for (auto &c : {GetCpuContext(), GetCudaContext()}) { Array1 array1 = Range(c, 6, 0); Array2 array(array1, 2, 3); diff --git a/k2/csrc/math.h b/k2/csrc/math.h index 65b6f8e91..3ebc8b406 100644 --- a/k2/csrc/math.h +++ b/k2/csrc/math.h @@ -27,6 +27,20 @@ namespace k2 { +// Currently, only used in k2/csrc/rnnt_decode.cu +// See https://github.com/k2-fsa/k2/pull/951#issuecomment-1096650842 +__host__ __device__ __forceinline__ int64_t Pow(int64_t base, + int64_t exponent) { + K2_CHECK_GE(exponent, 0); + int64_t exp = 0; + int64_t result = 1; + while (exp < exponent) { + result *= base; + exp++; + } + return result; +} + /* Returns index of highest bit set, in range -1..30. HighestBitSet(0) = -1, @@ -106,29 +120,29 @@ int32_t RandIntGeometric(int32_t min, int32_t max); type, but for types float and double it "fixes" the broken behavior of the C++ standard w.r.t. 
infinity allowing infinities to be parsed. */ -template struct InputFixer { +template +struct InputFixer { T t; // cast operator operator T() const { return t; } }; - namespace internal { template Real FixedRead(std::istream &is); } template -inline std::istream &operator >>(std::istream &is, InputFixer &f) { +inline std::istream &operator>>(std::istream &is, InputFixer &f) { return is >> f.t; } template <> -inline std::istream &operator >>(std::istream &is, InputFixer &f) { +inline std::istream &operator>>(std::istream &is, InputFixer &f) { f.t = internal::FixedRead(is); return is; } template <> -inline std::istream &operator >>(std::istream &is, InputFixer &f) { +inline std::istream &operator>>(std::istream &is, InputFixer &f) { f.t = internal::FixedRead(is); return is; } diff --git a/k2/csrc/ragged_ops.cu b/k2/csrc/ragged_ops.cu index 78bfed8ab..1a919a02a 100644 --- a/k2/csrc/ragged_ops.cu +++ b/k2/csrc/ragged_ops.cu @@ -421,8 +421,12 @@ inline void GetOldAndNewOffsets(RaggedShape &src, ExclusiveSum(*new_offsets, new_offsets); } -static RaggedShape IndexAxis0(RaggedShape &src, const Array1 &new2old, - Array1 *elem_indexes /*=nullptr*/) { +// Don't make it static to fix the following error on Windows. 
+// Error : On Windows, the enclosing parent function ("IndexAxis0") for an +// extended __host__ __device__ lambda cannot have internal or no linkage +/*static*/ RaggedShape IndexAxis0(RaggedShape &src, + const Array1 &new2old, + Array1 *elem_indexes /*=nullptr*/) { NVTX_RANGE(K2_FUNC); ContextPtr &c = src.Context(); K2_CHECK(IsCompatible(src, new2old)); @@ -679,8 +683,8 @@ void GetRowInfoMulti(int32_t num_srcs, RaggedShape **src, *row_ids = row_ids_ptrs.To(ctx); } -static RaggedShape StackAxis0(int32_t num_srcs, RaggedShape **src, - Array1 *merge_map /* == nullptr*/) { +/*static*/ RaggedShape StackAxis0(int32_t num_srcs, RaggedShape **src, + Array1 *merge_map /* == nullptr*/) { NVTX_RANGE(K2_FUNC); if (num_srcs == 1) { if (merge_map) @@ -1128,7 +1132,7 @@ RaggedShape Stack(int32_t axis, int32_t num_srcs, RaggedShape **src, RaggedShape, 1,2,4 to construct the second output RaggedShape, 6 and an empty list to construct the third output RaggedShape. */ -static void SelectAxis0(RaggedShape &src, const Ragged &indexes, +/*static*/ void SelectAxis0(RaggedShape &src, const Ragged &indexes, std::vector *out, std::vector> *split_map) { NVTX_RANGE(K2_FUNC); ContextPtr &c = src.Context(); @@ -1475,8 +1479,8 @@ Ragged GetCountsPartitioned(Ragged &src, return Ragged(ans_ragged_shape, counts); } -static Array1 GetTransposeReorderingCpu(Ragged &src, - int32_t num_cols) { +/*static*/ Array1 GetTransposeReorderingCpu(Ragged &src, - int32_t num_cols) { NVTX_RANGE(K2_FUNC); std::vector> column_indexes(num_cols); // [column][row] const int32_t *values_data = src.values.Data(); @@ -1496,8 +1500,9 @@ static Array1 GetTransposeReorderingCpu(Ragged &src, return ans; } -static Array1 GetTransposeReorderingThreeAxesCuda(Ragged &src, - int32_t num_cols) { +#ifndef _MSC_VER +/*static*/ Array1 GetTransposeReorderingThreeAxesCuda( + Ragged &src, int32_t num_cols) { NVTX_RANGE(K2_FUNC); K2_CHECK_EQ(src.NumAxes(), 3); ContextPtr &context = src.Context(); @@ -1541,6 +1546,7 @@ static Array1
GetTransposeReorderingThreeAxesCuda(Ragged &src, lambda_comp, *mgpu_context)); return ans; } +#endif /* @@ -1565,6 +1571,37 @@ Array1 GetTransposeReordering(Ragged &src, int32_t num_cols) { if (device_type == kCpu) return GetTransposeReorderingCpu(src, num_cols); K2_CHECK_EQ(device_type, kCuda); + +#ifdef _MSC_VER + // See https://github.com/k2-fsa/k2/pull/753 + // and + // https://github.com/k2-fsa/k2/pull/571 + int32_t num_buckets = num_cols; + int32_t num_elements = src.values.Dim(); + int32_t log_buckets = static_cast(ceilf(log2f(num_buckets))); + + Array1 ans = Range(context, num_elements, 0); + + cudaStream_t stream = context->GetCudaStream(); + + size_t temp_storage_bytes = 0; + K2_CUDA_SAFE_CALL(cub::DeviceRadixSort::SortPairs( + nullptr, temp_storage_bytes, src.values.Data(), + static_cast(nullptr), ans.Data(), ans.Data(), num_elements, 0, + log_buckets, stream)); + + Array1 d_temp_storage( + context, temp_storage_bytes + num_elements * sizeof(int32_t)); + + K2_CUDA_SAFE_CALL(cub::DeviceRadixSort::SortPairs( + d_temp_storage.Data() + sizeof(int32_t) * num_elements, + temp_storage_bytes, src.values.Data(), + reinterpret_cast(d_temp_storage.Data()), ans.Data(), + ans.Data(), num_elements, 0, log_buckets, stream)); + + return ans; + +#else (void)GetTransposeReorderingThreeAxesCuda; // remove compiler warnings #if __CUDACC_VER_MAJOR__ > 10 || \ @@ -1599,7 +1636,7 @@ Array1 GetTransposeReordering(Ragged &src, int32_t num_cols) { // CheckGetTransposeReordering(src, ans); return ans; -#else +#else // __CUDACC_VER_MAJOR__ if (src.NumAxes() == 3) { Array1 ans = GetTransposeReorderingThreeAxesCuda(src, num_cols); // CheckGetTransposeReordering(src, ans); @@ -1638,6 +1675,7 @@ Array1 GetTransposeReordering(Ragged &src, int32_t num_cols) { // CheckGetTransposeReordering(src, ans); return ans; #endif +#endif // _MSC_VER } RaggedShape ChangeSublistSize(const RaggedShape &src, int32_t size_delta) { diff --git a/k2/csrc/ragged_test.cu 
b/k2/csrc/ragged_test.cu index dde9d9b70..e2cae8907 100644 --- a/k2/csrc/ragged_test.cu +++ b/k2/csrc/ragged_test.cu @@ -3024,17 +3024,6 @@ TEST(RaggedTest, TestPadRagged) { TestPadRagged(); } -TEST(RaggedTest, ToVecVecInt) { - for (auto &c : {GetCpuContext(), GetCudaContext()}) { - Ragged src(c, "[[1 2 3] [] [4 0 5 6]]"); - std::vector> v = src.ToVecVec(); - ASSERT_EQ(v.size(), 3u); - EXPECT_EQ(v[0], (std::vector{1, 2, 3})); - EXPECT_TRUE(v[1].empty()); - EXPECT_EQ(v[2], (std::vector{4, 0, 5, 6})); - } -} - template static void TestPruneRagged() { for (auto &c : {GetCpuContext(), GetCudaContext()}) { diff --git a/k2/csrc/rand_test.cu b/k2/csrc/rand_test.cu index 1370c8cc6..49d456dff 100644 --- a/k2/csrc/rand_test.cu +++ b/k2/csrc/rand_test.cu @@ -107,7 +107,7 @@ TEST(RandInt, CUDA) { } template -static void TestBounds(T low, T high) { +/*static*/ void TestBounds(T low, T high) { int32_t dim = 100000; ContextPtr cpu = GetCpuContext(); ContextPtr cuda = GetCudaContext(); diff --git a/k2/csrc/rm_epsilon.cu b/k2/csrc/rm_epsilon.cu index a0fffbc76..8451806ed 100644 --- a/k2/csrc/rm_epsilon.cu +++ b/k2/csrc/rm_epsilon.cu @@ -69,7 +69,7 @@ namespace k2 { @param [out] epsilon_closure_mapped_arc_map The arc map from `epsilon_closure_mapped` to `src`. */ -static void GetEpsilonClosureMapped( +/*static*/ void GetEpsilonClosureMapped( FsaVec &epsilon_fsa_closure, const Array1 &epsilon_closure_state_map, Ragged &epsilon_closure_arc_map, FsaVec &non_epsilon_fsa, @@ -139,7 +139,7 @@ static void GetEpsilonClosureMapped( foll_shape.RowSplits(1)[i] is the number of following arcs it is combined with. 
*/ -static void DecideCombineWithFollowingOrPreceding( +/*static*/ void DecideCombineWithFollowingOrPreceding( FsaVec &epsilon_closure_mapped, FsaVec &non_epsilon_fsa, Renumbering *epsilon_prec_renumbering, RaggedShape *foll_shape) { NVTX_RANGE(K2_FUNC); @@ -237,7 +237,7 @@ static void DecideCombineWithFollowingOrPreceding( @param [out] combined_foll_arc_map The arc map of `combined_foll`, from arcs idx012 in `combined_foll` to the original Fsa. */ -static void CombineWithFollowingNonEpsilonArcs( +/*static*/ void CombineWithFollowingNonEpsilonArcs( FsaVec &epsilon_closure_mapped, Ragged &epsilon_closure_mapped_arc_map, FsaVec &non_epsilon_fsa, const Array1 &non_epsilon_arc_map, RaggedShape &foll_shape, @@ -341,7 +341,7 @@ static void CombineWithFollowingNonEpsilonArcs( `epsilon_closure_prec_arc_map`, user will get the complete arc map info for `combined_prec`. */ -static void CombineWithPrecedingNonEpsilonArcs( +/*static*/ void CombineWithPrecedingNonEpsilonArcs( FsaVec &epsilon_closure_prec, Ragged &epsilon_closure_prec_arc_map, FsaVec &non_epsilon_fsa, FsaVec *combined_prec, Ragged *epsilon_closure_prec_arc_map_prec, diff --git a/k2/csrc/rnnt_decode.cu b/k2/csrc/rnnt_decode.cu index db5e732dd..e86b2f7d5 100644 --- a/k2/csrc/rnnt_decode.cu +++ b/k2/csrc/rnnt_decode.cu @@ -159,8 +159,8 @@ void RnntDecodingStreams::GetContexts(RaggedShape *shape, int64_t state_value = states_values_data[state_idx01x], context_state = state_value / num_graph_states, exp = decoder_history_len - col, - state = context_state % (int64_t)pow(vocab_size, exp); - state = state / (int64_t)pow(vocab_size, exp - 1); + state = context_state % Pow(vocab_size, exp); + state = state / Pow(vocab_size, exp - 1); contexts_acc(row, col) = state; }); } @@ -540,7 +540,7 @@ void RnntDecodingStreams::Advance(const Array2 &logprobs) { // can be done with `358 % 10^2`, then we append 6 to 58, that can be // done with `58 * 10 + 6`. 
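The base-`vocab_size` digit trick the comment above describes (keep the most recent symbols with a modulo, then append with a multiply-add) can be sketched in isolation. `UpdateContext` below is a hypothetical helper written for illustration; it mirrors the `Pow`-based expressions in the decoding code but is not the k2 API.

```cpp
#include <cassert>
#include <cstdint>

// Integer power by repeated multiplication, like the Pow() helper this
// patch adds to k2/csrc/math.h.
int64_t Pow(int64_t base, int64_t exponent) {
  int64_t result = 1;
  for (int64_t e = 0; e < exponent; ++e) result *= base;
  return result;
}

// Append `label` to a context state that stores its last `history_len`
// symbols as base-`vocab_size` digits.
int64_t UpdateContext(int64_t context_state, int64_t label,
                      int64_t vocab_size, int64_t history_len) {
  // Drop the oldest symbol, e.g. with vocab_size=10, history_len=3:
  // 358 % 10^2 == 58 ...
  int64_t kept = context_state % Pow(vocab_size, history_len - 1);
  // ... then append the new one: 58 * 10 + 6 == 586,
  // i.e. context (3,5,8) followed by 6 becomes (5,8,6).
  return kept * vocab_size + label;
}
```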
context_state = this_context_state % - (int64_t)pow(vocab_size, decoder_history_len - 1); + Pow(vocab_size, decoder_history_len - 1); context_state = context_state * vocab_size + arc.label; } diff --git a/k2/csrc/tensor_ops.cu b/k2/csrc/tensor_ops.cu index 94ab6c1c5..481107566 100644 --- a/k2/csrc/tensor_ops.cu +++ b/k2/csrc/tensor_ops.cu @@ -20,10 +20,11 @@ namespace k2 { template -static void CopyTensorElements2d(ContextPtr c, int32_t dim0, int32_t dim1, - const T *src_data, int32_t src_stride0, - int32_t src_stride1, T *dest_data, - int32_t dest_stride0, int32_t dest_stride1) { +/*static*/ void CopyTensorElements2d(ContextPtr c, int32_t dim0, int32_t dim1, + const T *src_data, int32_t src_stride0, + int32_t src_stride1, T *dest_data, + int32_t dest_stride0, + int32_t dest_stride1) { NVTX_RANGE(K2_FUNC); DeviceType d = c->GetDeviceType(); if (d == kCpu) { @@ -132,10 +133,11 @@ Tensor Cast(Tensor src, Dtype new_dtype) { // See the documentation of `Index`. template -static void Index1DImpl(ContextPtr context, const T *src_data, - int32_t src_stride, int32_t src_dim, - const int32_t *indexes_data, bool allow_minus_one, - int32_t ans_dim, T *ans_data, double default_value) { +/*static*/ void Index1DImpl(ContextPtr context, const T *src_data, + int32_t src_stride, int32_t src_dim, + const int32_t *indexes_data, bool allow_minus_one, + int32_t ans_dim, T *ans_data, + double default_value) { if (std::is_integral::value) { K2_CHECK_EQ(static_cast(default_value), default_value); } @@ -166,10 +168,11 @@ static void Index1DImpl(ContextPtr context, const T *src_data, // See the documentation of `Index`. 
template -static void Index2DImpl(ContextPtr context, const T *src_data, - int32_t src_stride, int32_t src_dim0, int32_t src_dim1, - const int32_t *indexes_data, bool allow_minus_one, - int32_t ans_dim, int32_t ans_stride, T *ans_data) { +/*static*/ void Index2DImpl(ContextPtr context, const T *src_data, + int32_t src_stride, int32_t src_dim0, + int32_t src_dim1, const int32_t *indexes_data, + bool allow_minus_one, int32_t ans_dim, + int32_t ans_stride, T *ans_data) { NVTX_RANGE(K2_FUNC); if (allow_minus_one) { if (context->GetDeviceType() == kCpu) { @@ -299,11 +302,11 @@ Tensor Index(Tensor &src, Array1 &indexes, bool allow_minus_one, } template -static void IndexAdd1DImpl(ContextPtr context, const T *src_data, - int32_t src_dim, int32_t src_stride, - const int32_t *indexes_data, bool allow_minus_one, - int32_t dest_dim, int32_t dest_stride, - T *dest_data) { +/*static*/ void IndexAdd1DImpl(ContextPtr context, const T *src_data, + int32_t src_dim, int32_t src_stride, + const int32_t *indexes_data, + bool allow_minus_one, int32_t dest_dim, + int32_t dest_stride, T *dest_data) { NVTX_RANGE(K2_FUNC); if (allow_minus_one) { K2_EVAL( @@ -330,12 +333,13 @@ static void IndexAdd1DImpl(ContextPtr context, const T *src_data, } template -static void IndexAdd2DImpl(ContextPtr context, const T *src_data, - int32_t src_dim0, int32_t src_dim1, - int32_t src_stride0, int32_t src_stride1, - const int32_t *indexes_data, bool allow_minus_one, - int32_t dest_dim, int32_t dest_stride0, - int32_t dest_stride1, T *dest_data) { +/*static*/ void IndexAdd2DImpl(ContextPtr context, const T *src_data, + int32_t src_dim0, int32_t src_dim1, + int32_t src_stride0, int32_t src_stride1, + const int32_t *indexes_data, + bool allow_minus_one, int32_t dest_dim, + int32_t dest_stride0, int32_t dest_stride1, + T *dest_data) { NVTX_RANGE(K2_FUNC); if (allow_minus_one) { K2_EVAL2( @@ -437,10 +441,9 @@ void IndexAdd(Tensor &src, Array1 &indexes, bool allow_minus_one, } template -static void 
SimpleRaggedIndexSelect1DImpl(ContextPtr context, const T *src_data, - int32_t src_stride, int32_t src_dim, - Ragged &indexes, - int32_t ans_dim, T *ans_data) { +/*static*/ void SimpleRaggedIndexSelect1DImpl( + ContextPtr context, const T *src_data, int32_t src_stride, int32_t src_dim, + Ragged &indexes, int32_t ans_dim, T *ans_data) { NVTX_RANGE(K2_FUNC); K2_CHECK_EQ(indexes.NumAxes(), 2); int32_t indexes_dim0 = indexes.Dim0(), diff --git a/k2/csrc/tensor_ops_test.cu b/k2/csrc/tensor_ops_test.cu index f57636ba8..0aa7498c4 100644 --- a/k2/csrc/tensor_ops_test.cu +++ b/k2/csrc/tensor_ops_test.cu @@ -36,7 +36,7 @@ namespace k2 { @return Returns a 1-D tensor with the given `dim` and `stride`. */ template -static Tensor GenerateRandTensor1D(ContextPtr context, int32_t dim, +/*static*/ Tensor GenerateRandTensor1D(ContextPtr context, int32_t dim, int32_t stride) { K2_CHECK_GT(stride, 0); @@ -69,7 +69,7 @@ static Tensor GenerateRandTensor1D(ContextPtr context, int32_t dim, `stride`. */ template -static Tensor GenerateRandTensor2D(ContextPtr context, int32_t num_rows, +/*static*/ Tensor GenerateRandTensor2D(ContextPtr context, int32_t num_rows, int32_t num_cols, int32_t stride) { int32_t num_tensor_elements = num_rows * num_cols; K2_CHECK_GT(num_cols, 0); @@ -301,7 +301,7 @@ TEST(IndexAdd, IndexAdd2D) { } template -static void TestSimpleRaggedIndexSelect1D() { +/*static*/ void TestSimpleRaggedIndexSelect1D() { // test with simple case should be good enough for (auto &context : {GetCpuContext(), GetCudaContext()}) { // create src diff --git a/k2/csrc/test_utils.h b/k2/csrc/test_utils.h index 27774d5e8..05a816033 100644 --- a/k2/csrc/test_utils.h +++ b/k2/csrc/test_utils.h @@ -20,15 +20,16 @@ #ifndef K2_CSRC_TEST_UTILS_H_ #define K2_CSRC_TEST_UTILS_H_ -#include #include #include #include #include +#include "gtest/gtest.h" #include "k2/csrc/array.h" #include "k2/csrc/fsa.h" +#include "k2/csrc/log.h" namespace k2 { @@ -103,9 +104,9 @@ inline void ExpectEqual(const std::vector 
&expected, // check if `array` and `target` have the same values template void CheckArrayData(const Array1 &array, const Array1 &target, - T abs_error = (T)0.001) { + T abs_error = T(0.001)) { if (array.Dim() != target.Dim()) { - K2_LOG(ERROR) << "Dims mismatch " << array.Dim() << " vs. " << target.Dim(); + K2_LOG(FATAL) << "Dims mismatch " << array.Dim() << " vs. " << target.Dim(); } int32_t dim = array.Dim(); ContextPtr cpu = GetCpuContext(); diff --git a/k2/csrc/version.h.in b/k2/csrc/version.h.in index a894f979d..cfffaccf8 100644 --- a/k2/csrc/version.h.in +++ b/k2/csrc/version.h.in @@ -46,7 +46,7 @@ static constexpr const char *kPythonVersion = "@PYTHON_VERSION_MAJOR@.@PYTHON_VE static constexpr const char *kBuildType = "@CMAKE_BUILD_TYPE@"; // The operating system that is used to build k2, e.g., Ubuntu 16.04 LTS -static constexpr const char *kOS = "@K2_OS@"; +static constexpr const char *kOS = R"os(@K2_OS@)os"; // e.g., 3.18.0 static constexpr const char *kCMakeVersion = "@CMAKE_VERSION@"; @@ -55,10 +55,10 @@ static constexpr const char *kCMakeVersion = "@CMAKE_VERSION@"; static constexpr const char *kGCCVersion = "@CMAKE_CXX_COMPILER_VERSION@"; // CUDA flags used to compile k2 -static constexpr const char *kCMakeCudaFlags = "@CMAKE_CUDA_FLAGS@"; +static constexpr const char *kCMakeCudaFlags = R"cuda_flags(@CMAKE_CUDA_FLAGS@)cuda_flags"; // CXX flags used to compile k2 -static constexpr const char *kCMakeCxxFlags = "@CMAKE_CXX_FLAGS@"; +static constexpr const char *kCMakeCxxFlags = R"cxx_flags(@CMAKE_CXX_FLAGS@)cxx_flags"; // Which PyTorch version k2 is using, e.g., 1.6.0+cu101 static constexpr const char *kTorchVersion = "@TORCH_VERSION@"; diff --git a/k2/python/csrc/CMakeLists.txt b/k2/python/csrc/CMakeLists.txt index 520fe443c..6031cb87a 100644 --- a/k2/python/csrc/CMakeLists.txt +++ b/k2/python/csrc/CMakeLists.txt @@ -19,7 +19,17 @@ if(NOT K2_WITH_CUDA) transform(OUTPUT_VARIABLE k2_srcs SRCS ${k2_srcs}) endif() -pybind11_add_module(_k2 ${k2_srcs} SHARED) 
+if(WIN32) + # It throws the following error on Windows + # nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified + # because there is an option "/bigobj" in pybind11::windows_extras that cannot be recognized by NVCC. + # + # We clear it below. + set_property(TARGET pybind11::windows_extras PROPERTY INTERFACE_COMPILE_OPTIONS "") +endif() + + +pybind11_add_module(_k2 ${k2_srcs}) target_link_libraries(_k2 PRIVATE context) target_link_libraries(_k2 PRIVATE fsa) @@ -33,3 +43,4 @@ endif() target_include_directories(_k2 PRIVATE ${CMAKE_SOURCE_DIR}) target_include_directories(_k2 PRIVATE ${CMAKE_BINARY_DIR}) +set_target_properties(_k2 PROPERTIES CUDA_SEPARABLE_COMPILATION ON) diff --git a/k2/python/csrc/torch.h b/k2/python/csrc/torch.h index a0f742196..e11827598 100644 --- a/k2/python/csrc/torch.h +++ b/k2/python/csrc/torch.h @@ -29,38 +29,6 @@ namespace pybind11 { namespace detail { -#if K2_TORCH_VERSION_MAJOR < 1 || \ - (K2_TORCH_VERSION_MAJOR == 1 && K2_TORCH_VERSION_MINOR < 9) -// Only for torch version < 1.9.0 - -// See https://github.com/pytorch/pytorch/pull/57292 - -template <> -struct type_caster { - public: - PYBIND11_TYPE_CASTER(torch::Device, _("torch::Device")); - - // PYBIND11_TYPE_CASTER defines a member field called value. Since - // torch::Device cannot be default-initialized, we provide this constructor to - // explicitly initialize that field. The value doesn't matter as it will be - // overwritten after a successful call to load.
- type_caster() : value(torch::kCPU) {} - - bool load(handle src, bool) { - PyObject *obj = src.ptr(); - if (THPDevice_Check(obj)) { - value = reinterpret_cast(obj)->device; - return true; - } - return false; - } - - static handle cast(const torch::Device &src, return_value_policy /* policy */, - handle /* parent */) { - return handle(THPDevice_New(src)); - } -}; -#endif template <> struct type_caster { diff --git a/k2/python/csrc/torch/fsa.cu b/k2/python/csrc/torch/fsa.cu index 372b853f0..7571667c5 100644 --- a/k2/python/csrc/torch/fsa.cu +++ b/k2/python/csrc/torch/fsa.cu @@ -491,7 +491,7 @@ static void PybindBackpropGetArcPost(py::module &m, const char *name) { @return It returns the gradient of scores of all arcs. */ template -static torch::Tensor GetTotScoresTropicalBackward( +/*static*/ torch::Tensor GetTotScoresTropicalBackward( FsaVec &fsas, const RaggedAny &best_path_arc_indexes, torch::Tensor tot_scores_grad) { DeviceGuard guard(fsas.Context()); @@ -542,7 +542,7 @@ static torch::Tensor GetTotScoresTropicalBackward( @return It returns the gradient of scores of all arcs. 
*/ template -static torch::Tensor GetTotScoresLogBackward(FsaVec &fsas, +/*static*/ torch::Tensor GetTotScoresLogBackward(FsaVec &fsas, torch::Tensor arc_post, torch::Tensor tot_scores_grad) { DeviceGuard guard(fsas.Context()); diff --git a/k2/python/csrc/torch/fsa_algo.cu b/k2/python/csrc/torch/fsa_algo.cu index f4016695d..aa945c70d 100644 --- a/k2/python/csrc/torch/fsa_algo.cu +++ b/k2/python/csrc/torch/fsa_algo.cu @@ -59,7 +59,7 @@ static void PybindTopSort(py::module &m) { static void PybindLinearFsa(py::module &m) { m.def( "linear_fsa", - [](RaggedAny &labels, torch::optional = {}) -> FsaVec { + [](RaggedAny &labels, py::object = py::none()) -> FsaVec { DeviceGuard guard(labels.any.Context()); return LinearFsas(labels.any.Specialize()); }, @@ -68,48 +68,26 @@ static void PybindLinearFsa(py::module &m) { m.def( "linear_fsa", [](const std::vector &labels, - torch::optional device = {}) -> Fsa { - ContextPtr context = - GetContext(device.value_or(torch::Device(torch::kCPU))); + py::object device = py::str("cpu")) -> Fsa { + std::string device_str = device.is_none() ? "cpu" : py::str(device); + ContextPtr context = GetContext(torch::Device(device_str)); DeviceGuard guard(context); Array1 array(context, labels); return LinearFsa(array); // }, - py::arg("labels"), py::arg("device") = py::none()); - - m.def( - "linear_fsa", - [](const std::vector &labels, - torch::optional device = {}) -> Fsa { - ContextPtr context = GetContext(torch::Device(device.value_or("cpu"))); - DeviceGuard guard(context); - Array1 array(context, labels); - return LinearFsa(array); // - }, - py::arg("labels"), py::arg("device") = py::none()); + py::arg("labels"), py::arg("device") = py::str("cpu")); m.def( "linear_fsa", [](const std::vector> &labels, - torch::optional device = {}) -> FsaVec { - ContextPtr context = - GetContext(device.value_or(torch::Device(torch::kCPU))); + py::object device = py::str("cpu")) -> FsaVec { + std::string device_str = device.is_none() ? 
"cpu" : py::str(device); + ContextPtr context = GetContext(torch::Device(device_str)); DeviceGuard guard(context); Ragged ragged = CreateRagged2(labels).To(context); return LinearFsas(ragged); }, - py::arg("labels"), py::arg("device") = py::none()); - - m.def( - "linear_fsa", - [](const std::vector> &labels, - torch::optional device = {}) -> FsaVec { - ContextPtr context = GetContext(torch::Device(device.value_or("cpu"))); - DeviceGuard guard(context); - Ragged ragged = CreateRagged2(labels).To(context); - return LinearFsas(ragged); - }, - py::arg("labels"), py::arg("device") = py::none()); + py::arg("labels"), py::arg("device") = py::str("cpu")); } static void PybindIntersect(py::module &m) { @@ -481,7 +459,7 @@ static void PybindRemoveEpsilonSelfLoops(py::module &m) { py::arg("src"), py::arg("need_arc_map") = true); } -static void PybindExpandArcs(py::module &m) { +/*static*/ void PybindExpandArcs(py::module &m) { // See doc-string below. m.def( "expand_arcs", @@ -718,59 +696,34 @@ static void PybindCtcGraph(py::module &m) { static void PybindCtcTopo(py::module &m) { m.def( "ctc_topo", - [](int32_t max_token, torch::optional device = {}, + [](int32_t max_token, py::object device = py::str("cpu"), bool modified = false) -> std::pair { - ContextPtr context = GetContext(device.value_or(torch::Device("cpu"))); + std::string device_str = device.is_none() ? 
"cpu" : py::str(device); + ContextPtr context = GetContext(torch::Device(device_str)); DeviceGuard guard(context); Array1 aux_labels; Fsa fsa = CtcTopo(context, max_token, modified, &aux_labels); torch::Tensor tensor = ToTorch(aux_labels); return std::make_pair(fsa, tensor); }, - py::arg("max_token"), py::arg("device") = py::none(), - py::arg("modified") = false); - - m.def( - "ctc_topo", - [](int32_t max_token, torch::optional device = {}, - bool modified = false) -> std::pair { - ContextPtr context = GetContext(torch::Device(device.value_or("cpu"))); - DeviceGuard guard(context); - Array1 aux_labels; - Fsa fsa = CtcTopo(context, max_token, modified, &aux_labels); - torch::Tensor tensor = ToTorch(aux_labels); - return std::make_pair(fsa, tensor); - }, - py::arg("max_token"), py::arg("device") = py::none(), + py::arg("max_token"), py::arg("device") = py::str("cpu"), py::arg("modified") = false); } static void PybindTrivialGraph(py::module &m) { m.def( "trivial_graph", - [](int32_t max_token, torch::optional device = {}) - -> std::pair { - ContextPtr context = GetContext(device.value_or(torch::Device("cpu"))); - DeviceGuard guard(context); - Array1 aux_labels; - Fsa fsa = TrivialGraph(context, max_token, &aux_labels); - torch::Tensor tensor = ToTorch(aux_labels); - return std::make_pair(fsa, tensor); - }, - py::arg("max_token"), py::arg("device") = py::none()); - - m.def( - "trivial_graph", - [](int32_t max_token, torch::optional device = {}) - -> std::pair { - ContextPtr context = GetContext(torch::Device(device.value_or("cpu"))); + [](int32_t max_token, + py::object device = py::str("cpu")) -> std::pair { + std::string device_str = device.is_none() ? 
"cpu" : py::str(device); + ContextPtr context = GetContext(torch::Device(device_str)); DeviceGuard guard(context); Array1 aux_labels; Fsa fsa = TrivialGraph(context, max_token, &aux_labels); torch::Tensor tensor = ToTorch(aux_labels); return std::make_pair(fsa, tensor); }, - py::arg("max_token"), py::arg("device") = py::none()); + py::arg("max_token"), py::arg("device") = py::str("cpu")); } static void PybindLevenshteinGraph(py::module &m) { diff --git a/k2/python/csrc/torch/ragged_ops.cu b/k2/python/csrc/torch/ragged_ops.cu index ec6018e13..d2d431b17 100644 --- a/k2/python/csrc/torch/ragged_ops.cu +++ b/k2/python/csrc/torch/ragged_ops.cu @@ -152,8 +152,9 @@ static void PybindNormalizePerSublist(py::module &m, const char *name) { (out.NumElements(),). */ template -static torch::Tensor NormalizePerSublistBackward(Ragged &out, bool use_log, - torch::Tensor out_grad) { +/*static*/ torch::Tensor NormalizePerSublistBackward(Ragged &out, + bool use_log, + torch::Tensor out_grad) { NVTX_RANGE(K2_FUNC); DeviceGuard guard(out.Context()); K2_CHECK_EQ(out_grad.dim(), 1) @@ -397,7 +398,7 @@ void PybindRaggedOps(py::module &m) { PybindArgMaxPerSublist(m); PybindCat(m); PybindCat(m); - PybindCat(m); + PybindCat(m); PybindCreateRagged2(m); PybindCreateRagged2(m); PybindGetLayer(m); diff --git a/k2/python/csrc/torch/v2/any.cu b/k2/python/csrc/torch/v2/any.cu index 0c9f07b4a..74f163d23 100644 --- a/k2/python/csrc/torch/v2/any.cu +++ b/k2/python/csrc/torch/v2/any.cu @@ -40,24 +40,32 @@ void PybindRaggedAny(py::module &m) { // k2.ragged.Tensor methods //-------------------------------------------------- - any.def(py::init(), py::arg("data"), - py::arg("dtype") = py::none(), - py::arg("device") = torch::Device(torch::kCPU), - kRaggedAnyInitDataDeviceDoc); + any.def(py::init([](py::list data, py::object dtype = py::none(), + py::object device = + py::str("cpu")) -> std::unique_ptr { + std::string device_str = device.is_none() ? 
"cpu" : py::str(device); + return std::make_unique(data, dtype, + torch::Device(device_str)); + }), + py::arg("data"), py::arg("dtype") = py::none(), + py::arg("device") = py::str("cpu"), kRaggedAnyInitDataDeviceDoc); any.def(py::init(), py::arg("data"), py::arg("dtype") = py::none(), py::arg("device") = "cpu", kRaggedAnyInitDataDeviceDoc); - any.def(py::init(), + any.def(py::init([](const std::string &s, py::object dtype = py::none(), + py::object device = + py::str("cpu")) -> std::unique_ptr { + std::string device_str = device.is_none() ? "cpu" : py::str(device); + return std::make_unique(s, dtype, device_str); + }), py::arg("s"), py::arg("dtype") = py::none(), - py::arg("device") = torch::Device(torch::kCPU), - kRaggedAnyInitStrDeviceDoc); + py::arg("device") = py::str("cpu"), kRaggedAnyInitStrDeviceDoc); any.def(py::init(), py::arg("s"), py::arg("dtype") = py::none(), - py::arg("device") = torch::Device(torch::kCPU), - kRaggedAnyInitStrDeviceDoc); + py::arg("device") = "cpu", kRaggedAnyInitStrDeviceDoc); any.def(py::init(), py::arg("shape"), py::arg("value"), kRaggedInitFromShapeAndTensorDoc); @@ -110,7 +118,7 @@ void PybindRaggedAny(py::module &m) { any.def( "__getitem__", [](RaggedAny &self, const py::slice &slice) -> RaggedAny { - py::ssize_t start = 0, stop = 0, step = 0, slicelength = 0; + py::size_t start = 0, stop = 0, step = 0, slicelength = 0; if (!slice.compute(self.any.Dim0(), &start, &stop, &step, &slicelength)) throw py::error_already_set(); int32_t istart = static_cast(start); @@ -168,10 +176,13 @@ void PybindRaggedAny(py::module &m) { }, py::arg("src"), py::arg("indexes"), kRaggedAnyIndexAndSumDoc); - any.def("to", - static_cast( - &RaggedAny::To), - py::arg("device"), kRaggedAnyToDeviceDoc); + any.def( + "to", + [](RaggedAny &self, py::object device) -> RaggedAny { + std::string device_str = device.is_none() ? 
"cpu" : py::str(device); + return self.To(torch::Device(device_str)); + }, + py::arg("device"), kRaggedAnyToDeviceDoc); any.def("to", static_cast( @@ -243,7 +254,8 @@ void PybindRaggedAny(py::module &m) { [](const RaggedAny &self) -> py::tuple { DeviceGuard guard(self.any.Context()); K2_CHECK(self.any.NumAxes() == 2 || self.any.NumAxes() == 3) - << "Only support Ragged with NumAxes() == 2 or 3 for now, given " + << "Only support Ragged with NumAxes() == 2 or 3 for now, " + "given " << self.any.NumAxes(); Array1 row_splits1 = self.any.RowSplits(1); Dtype t = self.any.GetDtype(); @@ -380,10 +392,8 @@ void PybindRaggedAny(py::module &m) { torch::Device device(device_type, self.any.Context()->GetDeviceId()); - PyObject *ptr = THPDevice_New(device); - - // takes ownership - return py::reinterpret_steal(ptr); + auto torch_device = py::module::import("torch").attr("device"); + return torch_device(device.str()); }, kRaggedAnyDeviceDoc); @@ -443,12 +453,12 @@ void PybindRaggedAny(py::module &m) { m.def( "create_ragged_tensor", [](py::list data, py::object dtype = py::none(), - torch::Device device = torch::kCPU) -> RaggedAny { - return RaggedAny(data, dtype, device); + py::object device = py::str("cpu")) -> RaggedAny { + std::string device_str = device.is_none() ? "cpu" : py::str(device); + return RaggedAny(data, dtype, torch::Device(device_str)); }, py::arg("data"), py::arg("dtype") = py::none(), - py::arg("device") = torch::Device(torch::kCPU), - kCreateRaggedTensorDataDoc); + py::arg("device") = py::str("cpu"), kCreateRaggedTensorDataDoc); m.def( "create_ragged_tensor", @@ -462,12 +472,12 @@ void PybindRaggedAny(py::module &m) { m.def( "create_ragged_tensor", [](const std::string &s, py::object dtype = py::none(), - torch::Device device = torch::kCPU) -> RaggedAny { - return RaggedAny(s, dtype, device); + py::object device = py::str("cpu")) -> RaggedAny { + std::string device_str = device.is_none() ? 
"cpu" : py::str(device); + return RaggedAny(s, dtype, torch::Device(device_str)); }, py::arg("s"), py::arg("dtype") = py::none(), - py::arg("device") = torch::Device(torch::kCPU), - kCreateRaggedTensorStrDoc); + py::arg("device") = py::str("cpu"), kCreateRaggedTensorStrDoc); m.def( "create_ragged_tensor", diff --git a/k2/python/csrc/torch/v2/ragged_shape.cu b/k2/python/csrc/torch/v2/ragged_shape.cu index cb3bc8c13..f989800de 100644 --- a/k2/python/csrc/torch/v2/ragged_shape.cu +++ b/k2/python/csrc/torch/v2/ragged_shape.cu @@ -66,7 +66,9 @@ void PybindRaggedShape(py::module &m) { shape.def( "to", - [](const RaggedShape &self, torch::Device device) -> RaggedShape { + [](const RaggedShape &self, py::object _device) -> RaggedShape { + std::string device_str = _device.is_none() ? "cpu" : py::str(_device); + torch::Device device = torch::Device(device_str); DeviceGuard guard(self.Context()); if (device.type() == torch::kCPU) return self.To(GetCpuContext()); @@ -166,10 +168,8 @@ void PybindRaggedShape(py::module &m) { torch::Device device(device_type, self.Context()->GetDeviceId()); - PyObject *ptr = THPDevice_New(device); - - // takes ownership - return py::reinterpret_steal(ptr); + auto torch_device = py::module::import("torch").attr("device"); + return torch_device(device.str()); }, kRaggedShapeDeviceDoc); diff --git a/k2/python/host/k2host/fsa.py b/k2/python/host/k2host/fsa.py index 59196a422..ecb8a5b0e 100644 --- a/k2/python/host/k2host/fsa.py +++ b/k2/python/host/k2host/fsa.py @@ -30,9 +30,9 @@ def __init__(self, src_state: int, dest_state: int, label: int, super().__init__(src_state, dest_state, label, weight) def to_tensor(self): - # TODO(fangjun): weight will be truncted to an int. + # TODO(fangjun): weight will be truncated to an int. 
return torch.tensor( - [self.src_state, self.dest_state, self.label, self.weight], + [self.src_state, self.dest_state, self.label, int(self.weight)], dtype=torch.int32) @staticmethod diff --git a/k2/python/k2/__init__.py b/k2/python/k2/__init__.py index 930affb18..54102705b 100644 --- a/k2/python/k2/__init__.py +++ b/k2/python/k2/__init__.py @@ -1,9 +1,29 @@ import torch # noqa +from .torch_version import k2_torch_cuda_version +from .torch_version import k2_torch_version + +if torch.__version__.split("+")[0] != k2_torch_version.split("+")[0]: + raise ImportError( + f"k2 was built using PyTorch {k2_torch_version}\n" + f"But you are using PyTorch {torch.__version__} to run it" + ) + +if ( + k2_torch_cuda_version != "" + and torch.version.cuda is not None + and torch.version.cuda != k2_torch_cuda_version +): + raise ImportError( + f"k2 was built using CUDA {k2_torch_cuda_version}\n" + f"But you are using CUDA {torch.version.cuda} to run it." + ) + try: from _k2 import DeterminizeWeightPushingType from _k2 import simple_ragged_index_select except ImportError as e: import sys + major_v, minor_v = sys.version_info[:2] raise ImportError( str(e) + "\nNote: If you're using anaconda and importing k2 on MacOS," @@ -18,6 +38,7 @@ from . import dense_fsa_vec from . import fsa from . 
import utils + # from .autograd import intersect_dense from .autograd import intersect_dense_pruned diff --git a/k2/python/k2/rnnt_decode.py b/k2/python/k2/rnnt_decode.py index 85d56cd5d..7e43d9f82 100644 --- a/k2/python/k2/rnnt_decode.py +++ b/k2/python/k2/rnnt_decode.py @@ -179,7 +179,7 @@ def format_output(self, num_frames: List[int]) -> Fsa: src = self.src_streams[i].fsa for name, value in src.named_tensor_attr(include_scores=False): if name not in tensor_attr_info: - filler = 0.0 + filler = 0 if isinstance(value, Tensor): filler = float(src.get_filler(name)) dtype = value.dtype diff --git a/k2/python/k2/rnnt_loss.py b/k2/python/k2/rnnt_loss.py index 5918d7b9e..67ad28a57 100644 --- a/k2/python/k2/rnnt_loss.py +++ b/k2/python/k2/rnnt_loss.py @@ -471,7 +471,7 @@ def _adjust_pruning_lower_bound( # make the transformed tensor to be non-decreasing s_begin = k2.monotonic_lower_bound(s_begin) # make start symbol to be zero. - s_begin = torch.where(s_begin < 0, 0, s_begin) + s_begin = torch.clamp(s_begin, min=0) # do the magic transformation again to recover s_begin s_begin = -( s_begin - (s_range - 1) * torch.arange(0, T, device=s_begin.device) @@ -568,7 +568,7 @@ def get_rnnt_prune_ranges( s_begin_padding = boundary[:, 2].reshape(B, 1) - s_range + 1 # handle the cases when `len(symbols) < s_range` - s_begin_padding = torch.where(s_begin_padding >= 0, s_begin_padding, 0) + s_begin_padding = torch.clamp(s_begin_padding, min=0) s_begin = torch.where(mask, s_begin, s_begin_padding) @@ -592,9 +592,9 @@ def do_rnnt_pruning( Args: am: - The encoder output, with shape (B, T, C) + The encoder output, with shape (B, T, encoder_dim) lm: - The prediction network output, with shape (B, S + 1, C) + The prediction network output, with shape (B, S + 1, decoder_dim) ranges: A tensor containing the symbol indexes for each frame that we want to keep. 
Its shape is (B, T, s_range), see the docs in
@@ -603,26 +603,28 @@ def do_rnnt_pruning(
     Returns:
       Return the pruned am and lm with shape (B, T, s_range, C)
     """
-    # am (B, T, C)
-    # lm (B, S + 1, C)
+    # am (B, T, encoder_dim)
+    # lm (B, S + 1, decoder_dim)
     # ranges (B, T, s_range)
     assert ranges.shape[0] == am.shape[0]
     assert ranges.shape[0] == lm.shape[0]
     assert am.shape[1] == ranges.shape[1]
     (B, T, s_range) = ranges.shape
-    (B, S1, C) = lm.shape
+    (B, S1, decoder_dim) = lm.shape
+    encoder_dim = am.shape[-1]
+    assert am.shape == (B, T, encoder_dim)
     S = S1 - 1
-    # (B, T, s_range, C)
-    am_pruning = am.unsqueeze(2).expand((B, T, s_range, C))
+    # (B, T, s_range, encoder_dim)
+    am_pruned = am.unsqueeze(2).expand((B, T, s_range, encoder_dim))
-    # (B, T, s_range, C)
-    lm_pruning = torch.gather(
-        lm.unsqueeze(1).expand((B, T, S + 1, C)),
+    # (B, T, s_range, decoder_dim)
+    lm_pruned = torch.gather(
+        lm.unsqueeze(1).expand((B, T, S + 1, decoder_dim)),
         dim=2,
-        index=ranges.reshape((B, T, s_range, 1)).expand((B, T, s_range, C)),
+        index=ranges.reshape((B, T, s_range, 1)).expand((B, T, s_range, decoder_dim)),
     )
-    return am_pruning, lm_pruning
+    return am_pruned, lm_pruned


 def _roll_by_shifts(src: torch.Tensor, shifts: torch.LongTensor):
diff --git a/k2/python/k2/torch_version.py.in b/k2/python/k2/torch_version.py.in
new file mode 100644
index 000000000..30e83abc0
--- /dev/null
+++ b/k2/python/k2/torch_version.py.in
@@ -0,0 +1,17 @@
+# Auto generated by the toplevel CMakeLists.txt.
+#
+# DO NOT EDIT.
+
+# The torch version used to build k2. We will check it against the torch version
+# that is used to run k2. If they are not the same, `import k2` will throw.
+#
+# Some example values are:
+# - 1.10.0+cu102
+# - 1.5.0+cpu
+k2_torch_version = "@TORCH_VERSION@"
+
+# The CUDA version used to build k2.
+# Note: It is an empty string if you used a CPU version of PyTorch to build k2
+#
+# An example value is "10.2".
+k2_torch_cuda_version = "@TORCH_CUDA_VERSION@" diff --git a/k2/python/tests/linear_fsa_with_self_loops_test.py b/k2/python/tests/linear_fsa_with_self_loops_test.py index 1e331bbbc..ec3654cb1 100644 --- a/k2/python/tests/linear_fsa_with_self_loops_test.py +++ b/k2/python/tests/linear_fsa_with_self_loops_test.py @@ -55,7 +55,7 @@ def test_multiple_fsa(self): expected_labels0 = [0, 2, 0, 5, 0, 8, 0, -1] expected_labels1 = [0, 1, 0, 2, 0, -1] expected_labels2 = [0, 3, 0, 2, 0, -1] - expected_labels = expected_labels0 + expected_labels1 + expected_labels2 + expected_labels = expected_labels0 + expected_labels1 + expected_labels2 # noqa assert dst.labels.tolist() == expected_labels diff --git a/k2/python/tests/mutual_information_test.py b/k2/python/tests/mutual_information_test.py index 11917f18f..cddd817e9 100644 --- a/k2/python/tests/mutual_information_test.py +++ b/k2/python/tests/mutual_information_test.py @@ -286,12 +286,12 @@ def get_boundary_row(): observed_delta = (delta_m * m_grad).sum().to("cpu") predicted_delta = (delta_px * px.grad).sum().to("cpu") - atol = 1.0e-02 if dtype == torch.float32 else 1.0e-04 - rtol = 1.0e-02 if dtype == torch.float32 else 1.0e-04 + atol = 1.0e-01 + rtol = atol assert torch.allclose( observed_delta, predicted_delta, atol=atol, rtol=rtol - ) + ), (observed_delta, predicted_delta) delta_py = delta * torch.randn_like(py) m2 = k2.mutual_information_recursion( diff --git a/scripts/github_actions/generate_build_matrix.py b/scripts/github_actions/generate_build_matrix.py new file mode 100755 index 000000000..5899c19b0 --- /dev/null +++ b/scripts/github_actions/generate_build_matrix.py @@ -0,0 +1,111 @@ +#!/usr/bin/env python3 +# Copyright 2022 Xiaomi Corp. 
(authors: Fangjun Kuang)
+
+import argparse
+import json
+
+
+def get_args():
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--enable-cuda",
+        action="store_true",
+        default=False,
+        help="True to enable CUDA",
+    )
+
+    parser.add_argument(
+        "--test-only-latest-torch",
+        action="store_true",
+        default=False,
+        help="""If True, we test only the latest PyTorch
+        to reduce CI running time.""",
+    )
+    return parser.parse_args()
+
+
+def generate_build_matrix(enable_cuda, test_only_latest_torch):
+    matrix = {
+        # there are issues in serializing ragged tensors in 1.5.0 and 1.5.1
+        # "1.5.0": {
+        #     "python-version": ["3.6", "3.7", "3.8"],
+        #     "cuda": ["10.1", "10.2"],
+        # },
+        # "1.5.1": {
+        #     "python-version": ["3.6", "3.7", "3.8"],
+        #     "cuda": ["10.1", "10.2"],
+        # },
+        "1.6.0": {
+            "python-version": ["3.6", "3.7", "3.8"],
+            "cuda": ["10.1", "10.2"],
+        },
+        "1.7.0": {
+            "python-version": ["3.6", "3.7", "3.8"],
+            "cuda": ["10.1", "10.2", "11.0"],
+        },
+        "1.7.1": {
+            "python-version": ["3.6", "3.7", "3.8", "3.9"],
+            "cuda": ["10.1", "10.2", "11.0"],
+        },
+        "1.8.0": {
+            "python-version": ["3.6", "3.7", "3.8", "3.9"],
+            "cuda": ["10.1", "10.2", "11.1"],
+        },
+        "1.8.1": {
+            "python-version": ["3.6", "3.7", "3.8", "3.9"],
+            "cuda": ["10.1", "10.2", "11.1"],
+        },
+        "1.9.0": {
+            "python-version": ["3.6", "3.7", "3.8", "3.9"],
+            "cuda": ["10.2", "11.1"],
+        },
+        "1.9.1": {
+            "python-version": ["3.6", "3.7", "3.8", "3.9"],
+            "cuda": ["10.2", "11.1"],
+        },
+        "1.10.0": {
+            "python-version": ["3.6", "3.7", "3.8", "3.9"],
+            "cuda": ["10.2", "11.1", "11.3"],
+        },
+        "1.10.1": {
+            "python-version": ["3.6", "3.7", "3.8", "3.9"],
+            "cuda": ["10.2", "11.1", "11.3"],
+        },
+        "1.10.2": {
+            "python-version": ["3.6", "3.7", "3.8", "3.9"],
+            "cuda": ["10.2", "11.1", "11.3"],
+        },
+        "1.11.0": {
+            "python-version": ["3.7", "3.8", "3.9", "3.10"],
+            "cuda": ["10.2", "11.3", "11.5"],
+        },
+    }
+    if test_only_latest_torch:
+        latest = "1.11.0"
+        matrix = {latest: matrix[latest]}
+
+    ans =
[] + for torch, python_cuda in matrix.items(): + python_versions = python_cuda["python-version"] + cuda_versions = python_cuda["cuda"] + if enable_cuda: + for p in python_versions: + for c in cuda_versions: + ans.append({"torch": torch, "python-version": p, "cuda": c}) + else: + for p in python_versions: + ans.append({"torch": torch, "python-version": p}) + + print(json.dumps({"include": ans})) + + +def main(): + args = get_args() + generate_build_matrix( + enable_cuda=args.enable_cuda, + test_only_latest_torch=args.test_only_latest_torch, + ) + + +if __name__ == "__main__": + main() diff --git a/scripts/github_actions/run-nightly-build.py b/scripts/github_actions/run-nightly-build.py new file mode 100755 index 000000000..1e002fba3 --- /dev/null +++ b/scripts/github_actions/run-nightly-build.py @@ -0,0 +1,35 @@ +#!/usr/bin/env python3 +# Copyright 2022 Xiaomi Corp. (authors: Fangjun Kuang) + +import subprocess +from datetime import datetime, timedelta + + +def get_last_commit_date() -> datetime: + date = ( + subprocess.check_output( + [ + "git", + "log", + "-1", + "--format=%ad", + "--date=unix", + ] + ) + .decode("ascii") + .strip() + ) + return datetime.utcfromtimestamp(int(date)) + + +def main(): + last_commit_date_utc = get_last_commit_date() + now_utc = datetime.utcnow() + if last_commit_date_utc + timedelta(days=1) > now_utc: + print("true") + else: + print("false") + + +if __name__ == "__main__": + main()
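The nested loops at the end of `generate_build_matrix` expand the version table into the `include` list consumed by a GitHub Actions `strategy.matrix`. A trimmed, self-contained sketch of that expansion (the one-entry `matrix` here is illustrative, not the full table from the patch):

```python
import json


def expand(matrix, enable_cuda):
    # Same shape as the loop in generate_build_matrix.py: one entry per
    # (torch, python) pair, additionally crossed with the CUDA versions
    # when CUDA builds are enabled.
    ans = []
    for torch, python_cuda in matrix.items():
        for p in python_cuda["python-version"]:
            if enable_cuda:
                for c in python_cuda["cuda"]:
                    ans.append({"torch": torch, "python-version": p, "cuda": c})
            else:
                ans.append({"torch": torch, "python-version": p})
    return {"include": ans}


matrix = {"1.11.0": {"python-version": ["3.9", "3.10"], "cuda": ["11.3"]}}
# CPU-only expansion: one entry per Python version, no "cuda" key.
print(json.dumps(expand(matrix, enable_cuda=False)))
```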
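The import-time guard added to `k2/python/k2/__init__.py` compares only the release part of the PyTorch version string, ignoring the local build suffix (e.g. `+cu102`, `+cpu`). A minimal standalone sketch of that comparison (the function name here is ours, not part of the patch):

```python
def versions_match(built_with: str, running_with: str) -> bool:
    # Mirrors `torch.__version__.split("+")[0] != k2_torch_version.split("+")[0]`
    # from the patch: the part after "+" identifies the build variant
    # (CUDA toolkit or CPU), not the PyTorch release, so it is ignored.
    return built_with.split("+")[0] == running_with.split("+")[0]


# "1.10.0+cu102" and "1.10.0+cpu" are the same PyTorch release.
assert versions_match("1.10.0+cu102", "1.10.0+cpu")
# A release mismatch is what triggers the ImportError in the real check.
assert not versions_match("1.10.0+cu102", "1.11.0")
```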