v0.13.0
Release Log
Major Features
Asynchronous distributed training support.
Distributed training with ParallelExecutor (a minimal usage sketch follows this list).
Distributed ring-based training with NCCL2.
Support saving checkpoints on the trainer and storing them on both the trainer and the parameter server.
Graceful shutdown of parameter server.
Publish the high-level inference library API and its implementation.
Assign a role to each operator.
Publish the C++ training API so that Fluid can be embedded into other C++ systems.
Support uint8_t data files and data exchange.
C++ reader supports customized data augmentation.
Improved operator and interface support for speech models.
New random_crop op.
New shape op to get the tensor's shape.
New resize_bilinear interface.
New dice_loss layer.
Enhanced reduce_op to support reducing over multiple dimensions at once (see the layers sketch after this list).
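
Below is a minimal sketch of multi-GPU training with ParallelExecutor, based on the paddle.fluid Python API of this release; the network, variable names, and hyperparameters are illustrative assumptions rather than part of the release notes.

```python
import paddle.fluid as fluid

# A tiny illustrative network; layer sizes and names are assumptions.
image = fluid.layers.data(name='image', shape=[784], dtype='float32')
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
prediction = fluid.layers.fc(input=image, size=10, act='softmax')
loss = fluid.layers.cross_entropy(input=prediction, label=label)
avg_loss = fluid.layers.mean(loss)
fluid.optimizer.SGD(learning_rate=0.01).minimize(avg_loss)

# Initialize parameters once with a plain Executor.
place = fluid.CUDAPlace(0)
fluid.Executor(place).run(fluid.default_startup_program())

# ParallelExecutor replicates the program across the visible GPUs
# and aggregates gradients across them on each run() call.
pe = fluid.ParallelExecutor(use_cuda=True, loss_name=avg_loss.name)
# Training then loops over pe.run(...), fetching avg_loss.name and
# feeding 'image'/'label' batches in the usual executor style.
```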
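
And a sketch of the new operators and layers listed above, as exposed through paddle.fluid.layers; the argument names and shapes are assumptions based on the fluid API of this era and may differ slightly from the shipped signatures.

```python
import paddle.fluid as fluid

x = fluid.layers.data(name='x', shape=[3, 224, 224], dtype='float32')

# random_crop: take a random crop of the given shape from the input.
cropped = fluid.layers.random_crop(x, shape=[3, 192, 192])

# shape: return the input tensor's shape as a 1-D tensor.
x_shape = fluid.layers.shape(x)

# resize_bilinear: bilinear interpolation to a target spatial size.
resized = fluid.layers.resize_bilinear(x, out_shape=[112, 112])

# reduce over multiple dimensions at once (dim now accepts a list).
summed = fluid.layers.reduce_sum(x, dim=[1, 2])

# dice_loss: class probabilities vs. integer labels, e.g. for segmentation.
probs = fluid.layers.data(name='probs', shape=[16384, 2], dtype='float32')
seg_label = fluid.layers.data(name='seg_label', shape=[16384, 1], dtype='int64')
seg_loss = fluid.layers.dice_loss(input=probs, label=seg_label)
```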
Performance Improvements
For the ResNet-50 model on P40 GPUs, single-GPU throughput improves by 23.8% (from 105 images/sec to 130 images/sec); the speedup ratio reaches 6x on 8 GPUs and 17.4x on 32 GPUs.
Overlap send/recv op with other operators.
Multi-thread server-side request handling.
Weight decay and clipping moved from trainer to parameter server for performance and correctness.
Improved C++ reader.
Major Bug Fixes
Fix accuracy loss when both ParallelExecutor and memory optimizer are used.
Fix a ParallelExecutor hang when multiple inputs are duplicated.
Fix a memory leak caused by cloning a Program.
Fix the GRU unit's bias having no effect and its incorrect activation.
Fix ROI Pooling GPU computation issues.
Fix fill_constant_batch_size_like when the input is a sequence.
Fix reshape op.