Releases: MooreThreads/torch_musa
torch_musa Release v1.3.0
Highlights
We are excited to release torch_musa v1.3.0 based on PyTorch v2.2.0. In this release, we support FSDP (Fully Sharded Data Parallel) for large model training and improve the stability and efficiency of many operators. In general, we add more operators and support more Tensor dtypes for many operators on our MUSA backend.
With torch_musa v1.3.0, users can utilize most features released in PyTorch v2.2.0 on MUSA GPU, and gain more stable training and inference for many kinds of models in various fields, including the recently popular large language models.
The number of supported operators and models is increasing rapidly. With torch_musa, users can easily accelerate AI applications on Moore Threads graphics cards.
This release is due to the efforts of engineers in Moore Threads AI Team and other departments. We sincerely hope that everyone can continue to pay attention to our work and participate in it, and witness the fast iteration of torch_musa and Moore Threads graphics cards together.
Enhancements
FSDP
We recommend that users refer to the official FSDP documentation for usage details; torch_musa provides the same experience as the original module. A minimal training sketch is shown below.
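The following is a minimal FSDP training sketch. It assumes a torchrun launch, that torch.musa mirrors the torch.cuda device API, and that the process-group backend registered by torch_musa is named "mccl" (the backend name is an assumption, not confirmed by this release note).

import os

import torch
import torch.distributed as dist
import torch.nn as nn
import torch_musa  # registers the "musa" device with PyTorch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # Backend name is assumed; use whichever backend torch_musa registers.
    dist.init_process_group(backend="mccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.musa.set_device(local_rank)  # assumes torch.musa mirrors torch.cuda

    model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).to("musa")
    model = FSDP(model)  # parameters, gradients and optimizer state are sharded across ranks

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    inputs = torch.randn(8, 1024, device="musa")

    loss = model(inputs).sum()
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launch it with torchrun --nproc_per_node=<num_gpus> train_fsdp.py.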
Operators support
1. Support operators including torch.conv_transpose_3d, torch.fmod, torch.fmax and torch.fmin, etc.
2. Support more dtypes for torch.sort, torch.unique, etc.
Documentation
We provide developer documentation that describes development environment preparation and the main development steps in detail.
Dockers
We provide a release Docker image and a development Docker image.
torch_musa Release v1.2.1
Highlights
We are excited to release torch_musa v1.2.1 based on PyTorch v2.0.0. In this release, we support some basic and important features, including the torch_musa profiler, musa_extension, musa_converter, codegen and compare_tool. In addition, we have now adapted more than 600 operators. With these basic features and operators, torch_musa can support a large number of models in various fields, including the recently popular large language models. The number of supported operators and models is increasing rapidly. With torch_musa, users can easily accelerate AI applications on Moore Threads graphics cards.
This release is due to the efforts of engineers in Moore Threads AI Team and other departments. We sincerely hope that everyone can continue to pay attention to our work and participate in it, and witness the fast iteration of torch_musa and Moore Threads graphics cards together.
New Features
torch_musa profiler
We have adapted PyTorch's official performance analysis tool, torch.profiler. Users can use this adapted tool to analyze the performance details of PyTorch model training or inference tasks running on the MUSA platform. It can capture information about operators called at the host level and kernels executed on the GPU device.
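Below is a minimal usage sketch based on the standard torch.profiler API that this release adapts. It relies on the default profiler activities; the exact enum for MUSA device activity is not shown because its name is not specified here.

import torch
import torch.nn as nn
import torch_musa  # registers the "musa" device

model = nn.Linear(512, 512).to("musa")
x = torch.randn(64, 512, device="musa")

# Default activities capture host-level operator calls; the adapted profiler
# is also expected to record kernels executed on the MUSA device.
with torch.profiler.profile(record_shapes=True) as prof:
    for _ in range(5):
        model(x)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))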
musa_extension
We have implemented the MUSAExtension interface, which is consistent with CUDAExtension. It can be used to build custom operators on the MUSA platform, making full use of GPU resources to accelerate computation. Many third-party PyTorch ecosystem libraries that rely on CUDAExtension can also be easily ported to the MUSA platform.
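The following setup.py is a minimal sketch that mirrors the usual CUDAExtension workflow. The import path of MUSAExtension and BuildExtension is an assumption; check torch_musa/utils and the developer documentation for the exact location and any extra arguments.

# setup.py -- minimal sketch; the import path and the .mu source suffix are assumptions.
from setuptools import setup
from torch_musa.utils.musa_extension import MUSAExtension, BuildExtension

setup(
    name="my_musa_ops",
    ext_modules=[
        MUSAExtension(
            name="my_musa_ops",
            sources=["my_ops.cpp", "my_kernel.mu"],  # MUSA kernel source file
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)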
musa_converter
We have developed a conversion tool named musa_converter that translates PyTorch-CUDA related strings and APIs in PyTorch scripts into torch_musa compatible code, which improves the efficiency of migrating models from the CUDA platform to the MUSA platform. Users can run musa_converter -h to see its usage.
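To illustrate the kind of substitution musa_converter performs, here is a hypothetical before/after pair; it is not the tool's literal output.

# Before conversion: a typical CUDA-flavoured script.
import torch

x = torch.randn(4, 4, device="cuda")
if torch.cuda.is_available():
    x = x * 2

# After conversion: the CUDA-specific strings and APIs are rewritten for MUSA.
import torch
import torch_musa

x = torch.randn(4, 4, device="musa")
if torch.musa.is_available():
    x = x * 2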
codegen
We introduce the codegen module to handle automatic binding and registration of custom MUSA kernels. It extends torchgen, follows the format of the native_functions.yaml file, and supports different customization strategies, which significantly reduces developer workload.
compare_tool
This tool is designed to enhance the debugging and validation process of PyTorch models by offering capabilities for comparing tensor operations across devices, tracking module hierarchies, and detecting the presence of NaN/Inf values. It is aimed at ensuring the correctness and stability of models through various stages of development and testing.
operator_benchmark
We followed the PyTorch operator_benchmark suite and adapted it for torch_musa. Developers can use it in the same way as in PyTorch. It generates a fully characterized performance profile of an operator, which developers can compare against results from CUDA or other accelerator backends to continuously improve the performance of torch_musa. A minimal benchmark definition is sketched below.
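The sketch below follows the standard PyTorch operator_benchmark pattern. It assumes the suite is on the Python path and that the adapted version accepts device="musa"; both are assumptions rather than confirmed details.

import operator_benchmark as op_bench
import torch
import torch_musa  # registers the "musa" device

add_configs = op_bench.config_list(
    attr_names=["M", "N"],
    attrs=[[64, 64], [256, 256]],
    cross_product_configs={"device": ["musa"]},
    tags=["short"],
)

class AddBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, M, N, device):
        self.inputs = {
            "a": torch.rand(M, N, device=device),
            "b": torch.rand(M, N, device=device),
        }

    def forward(self, a, b):
        return torch.add(a, b)

op_bench.generate_pt_test(add_configs, AddBenchmark)

if __name__ == "__main__":
    op_bench.benchmark_runner.main()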
Enhancements
Operators support
1. Support operators including torch.mode, torch.count_nonzero, torch.sort(stable=True), torch.upsample2d/3d, torch.logical_or/and, etc.
2. Support more dtypes for torch.scatter, torch.eq, torch.histc, torch.logsumexp, etc.
Operator and module optimizations
1. Optimize and accelerate operators such as indexing kernels, embedding kernels, torch.nonzero, torch.unique, torch.clamp, etc.
2. Enable manual seed setting for the dropout layer.
3. Support SDPA (scaled dot-product attention) with GQA (grouped-query attention) and causal mask.
4. AMP usage is now aligned with CUDA: torch.autocast automatically enables torch_musa AMP (see the sketch below).
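A minimal sketch of the aligned AMP usage mentioned in item 4; it assumes torch.autocast accepts device_type="musa" once torch_musa is imported.

import torch
import torch_musa  # registers the "musa" device

model = torch.nn.Linear(128, 128).to("musa")
x = torch.randn(32, 128, device="musa")

# The stock torch.autocast context is expected to route to torch_musa AMP.
with torch.autocast(device_type="musa", dtype=torch.float16):
    y = model(x)

print(y.dtype)  # expected: torch.float16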
Documentation
We provide developer documentation that describes development environment preparation and the main development steps in detail.
Dockers
We provide a release Docker image and a development Docker image.
torch_musa Release v1.1.0
torch_musa Release Notes
- Highlights
- New Features
- AMP mixed precision training
- MUSAExtension
- Pinned memory
- TensorCore computation
- CompareTool [Experimental]
- Supported Operators
- Documentation
- Dockers
Highlights
We are excited to release torch_musa v1.1.0 based on PyTorch v2.0.0. In this release, we support more important features, including AMP mixed precision training, MUSAExtension, TensorCore computation, pinned memory and CompareTool. In addition, we have adapted more than 470 operators, improved the DDP module and implemented more quantization operators. With torch_musa, users can easily accelerate AI applications on Moore Threads graphics cards.
This release is due to the efforts of engineers in Moore Threads AI Team and other departments. We sincerely hope that everyone can continue to pay attention to our work and participate in it, and witness the fast iteration of torch_musa and Moore Threads graphics cards together.
New Features
AMP mixed precision training
We now support mixed precision training with BF16 and FP16. Note that S80 and S3000 support only FP16, while S4000 supports both FP16 and BF16; the interface is fully consistent with PyTorch. Users can use AMP as in the following code:
import torch
import torch.nn as nn
import torch_musa  # registers the "musa" device

DEVICE = "musa"

# SimpleModel and set_seed are user-defined helpers; the definitions below are
# illustrative and match the tensor shapes used in train_in_amp.
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(5, 3)

    def forward(self, x):
        return self.fc(x)

def set_seed(seed=0):
    torch.manual_seed(seed)

# low_dtype can be torch.float16 or torch.bfloat16
def train_in_amp(low_dtype=torch.float16):
    set_seed()
    model = SimpleModel().to(DEVICE)
    criterion = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    # create the scaler object
    scaler = torch.musa.amp.GradScaler()
    inputs = torch.randn(6, 5).to(DEVICE)   # move data to the GPU
    targets = torch.randn(6, 3).to(DEVICE)
    for step in range(20):
        optimizer.zero_grad()
        # create autocast environment
        with torch.musa.amp.autocast(dtype=low_dtype):
            outputs = model(inputs)
            assert outputs.dtype == low_dtype
            loss = criterion(outputs, targets)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    return loss
MUSAExtension
MUSAExtension and CUDAExtension are basically the same, except that MUSAExtension currently requires manually adding a dynamic library to the dynamic library search path. For detailed usage, please refer to torch_musa/torch_musa/utils/README.md and the developer documentation. This issue will be resolved in the next version.
Pinned memory
Pinned memory is now supported by torch_musa; the following code shows how to use it.
cpu_tensor = torch.rand(shape, dtype=torch.float32).pin_memory("musa")
gpu_tensor = cpu_tensor.to("musa", non_blocking=True)
TensorCore computation
The S4000 has TensorCores and therefore supports TF32 computation. Users can enable TF32 for acceleration with the following code:
with torch.backends.mudnn.flags(allow_tf32=True):
    # your train code.
CompareTool [Experimental]
CompareTool is an experimental tool that automatically compares computation results between MUSA and CPU, facilitating the debugging process. For detailed usage, please refer to torch_musa/utils/README.md.
Supported Operators
More than 470 operators are supported in torch_musa.
Documentation
We provide a developer guide that describes development environment preparation and the main development steps in detail.
Dockers
A release Docker image and a development Docker image are available now.
[NOTE]: If you want to compile torch_musa without using the provided Docker image, please download the rc2.0.0 Intel CPU_Ubuntu underlying software stack from https://developer.mthreads.com/sdk/download/musa?equipment=&os=&driverVersion=&version=
[NOTE]:
- When installing the released whl packages below, please remove the device name from the filename first. For example,
- pip install torch-2.0.0-cp310-cp310-linux_x86_64.whl
torch_musa Release v1.0.0
torch_musa Release Notes
- Highlights
- New Features
- CUDA Kernels Porting
- Caching Allocator
- Device Management
- Distributed Data Parallel Training [Experimental]
- FP16 Inference [Experimental]
- Supported Operators
- Supported Models
- Documentation
- Dockers
Highlights
We are excited to release torch_musa v1.0.0 based on PyTorch v2.0.0. In this release, we support some basic and important features, including CUDA kernels porting, device management, a memory allocator, distributed data parallel training (experimental) and FP16 inference (experimental). In addition, we have adapted more than 300 operators. With these basic features and operators, torch_musa can support a large number of models in various fields, including the recently popular large language models. The number of supported operators and models is increasing rapidly. With torch_musa, users can easily accelerate AI applications on Moore Threads graphics cards.
This release is due to the efforts of engineers in Moore Threads AI Team and other departments. We sincerely hope that everyone can continue to pay attention to our work and participate in it, and witness the fast iteration of torch_musa and Moore Threads graphics cards together.
New Features
CUDA Kernels Porting
Thanks to the CUDA-compatible capabilities of our MUSA software stack, torch_musa can easily support CUDA-compatible modules. This effectively enables developers to reuse CUDA kernels with a small amount of effort, which greatly speeds up operator adaptation.
Caching Allocator
The amount of required memory changes constantly during program execution. Frequent invocations of memory allocation and deallocation (through musaMalloc and musaFree) usually lead to high execution cost. To alleviate this issue, we implemented a caching allocator that requests memory blocks from MUSA and strategically splits and reuses these blocks without returning them to MUSA, which results in a significant performance gain. A simplified illustration of the idea is sketched below.
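The sketch below is a deliberately simplified, illustrative version of the split-and-reuse idea; the real torch_musa allocator is implemented in C++ with far more policy (block splitting, size buckets, stream awareness).

class CachingAllocator:
    def __init__(self, backend_malloc, backend_free):
        self._malloc = backend_malloc   # e.g. a thin wrapper around musaMalloc
        self._free = backend_free       # e.g. a thin wrapper around musaFree
        self._free_blocks = {}          # rounded size -> list of cached blocks

    @staticmethod
    def _round(nbytes, granularity=512):
        return ((nbytes + granularity - 1) // granularity) * granularity

    def allocate(self, nbytes):
        size = self._round(nbytes)
        blocks = self._free_blocks.get(size)
        if blocks:
            return blocks.pop()         # fast path: reuse a cached block, no driver call
        return self._malloc(size)       # slow path: request a new block from the device

    def release(self, block, nbytes):
        # Keep the block cached for later reuse instead of returning it to the driver.
        self._free_blocks.setdefault(self._round(nbytes), []).append(block)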
Device Management
To manage devices, three components are implemented in torch_musa: device streams, device events and device generators. Device streams are used to manage and synchronize launched kernels. Device events are an important component related to streams; an event records a specific point in the execution of a stream. Device generators are used to generate random numbers. Devices are initialized lazily, which improves startup time, especially on multi-GPU systems.
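The example below assumes the torch.musa namespace mirrors the torch.cuda device-management API (Stream, Event, synchronize, manual_seed); the exact names are assumptions based on that convention.

import torch
import torch_musa  # exposes torch.musa (assumed to mirror torch.cuda)

torch.musa.manual_seed(42)   # seed the device random number generator

stream = torch.musa.Stream()
start = torch.musa.Event(enable_timing=True)
end = torch.musa.Event(enable_timing=True)

a = torch.randn(1024, 1024, device="musa")
with torch.musa.stream(stream):   # queue work on a non-default stream
    start.record()
    b = a @ a
    end.record()

torch.musa.synchronize()          # wait for all queued kernels to finish
print("matmul took", start.elapsed_time(end), "ms")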
Distributed Data Parallel Training [Experimental]
As the number of model parameters increases, especially for large language models, distributed data parallel training becomes increasingly important. torch_musa has already started supporting distributed data parallel training. Some important communication primitives are already supported, including send, recv, broadcast, all_reduce, reduce, all_gather, gather, scatter, reduce_scatter and barrier. The torch.nn.parallel.DistributedDataParallel interface is also supported. This module is under rapid development. A minimal all_reduce sketch is shown below.
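The sketch uses one of the primitives listed above. The process-group backend name "mccl" is an assumption; use whichever backend torch_musa registers.

import os

import torch
import torch.distributed as dist
import torch_musa  # registers the "musa" device

# Launch with torchrun; the backend name is an assumption.
dist.init_process_group(backend="mccl")
rank = int(os.environ["LOCAL_RANK"])
torch.musa.set_device(rank)  # assumes torch.musa mirrors torch.cuda

# Each rank contributes a tensor; after all_reduce every rank holds the sum.
t = torch.ones(4, device="musa") * (rank + 1)
dist.all_reduce(t, op=dist.ReduceOp.SUM)
print(f"rank {rank}: {t}")

dist.destroy_process_group()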
FP16 Inference [Experimental]
To speed up model inference, we currently support a series of FP16 operators, including linear, matmul, unary ops, binary ops, layernorm and most ported kernels. With this set of operators, we are able to run FP16 inference on a number of models. Please note that this feature is still experimental, so model support may be limited.
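A minimal FP16 inference sketch using the linear and layernorm operators mentioned above:

import torch
import torch_musa  # registers the "musa" device

model = torch.nn.Sequential(
    torch.nn.Linear(768, 768),
    torch.nn.LayerNorm(768),
).eval()

# Cast weights and inputs to FP16 and run inference on the MUSA device.
model = model.half().to("musa")
x = torch.randn(16, 768, device="musa", dtype=torch.float16)

with torch.no_grad():
    y = model(x)

print(y.dtype)  # torch.float16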
Supported Operators
More than 300 operators are supported in torch_musa.
Supported Models
Many classic and popular models are already supported, including Stable Diffusion, ChatGLM, Conformer, Bert, YOLOV5, ResNet50, Swin-Transformer, MobileNetv3, EfficientNet, HRNet, TSM, FastSpeech2, UNet, T5, HifiGan, Real-EsrGan, OpenPose, many GPT variants and so on.
Documentation
We provide a developer guide that describes development environment preparation and the main development steps in detail.
Dockers
A release Docker image and a development Docker image are available now.
[NOTE]: If you want to compile torch_musa without using the provided Docker image, please contact us by email at [email protected] to get the necessary dependencies.