v0.13.0


Release Log

Major Features

  • Asynchronous distributed training support (see the transpiler sketch after this list).
  • Distributed training with ParallelExecutor.
  • Distributed ring-based training with NCCL2.
  • Support saving checkpoints on the trainer and storing them on both the trainer and the parameter server.
  • Graceful shutdown of parameter server.
  • Publish the high-level inference library API and its implementation.
  • Assign roles to each op.
  • Publish the C++ training API, allowing Fluid to be embedded into other C++ systems.
  • Support uint8_t data files and data exchange.
  • C++ reader supports customized data augmentation.
  • Improved operator and interface support for speech models.
  • New random_crop op.
  • New shape op to get the tensor's shape.
  • New resize_bilinear interface.
  • New dice_loss layer.
  • Enhanced reduce_op to support reducing over multiple dimensions (the new layers are sketched below).
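
For illustration, here is a minimal graph-construction sketch of the new layers. It assumes the Python wrappers under `fluid.layers` match those in later Fluid releases; the tensor shapes are arbitrary examples.

```python
import paddle.fluid as fluid

# Input feature map: [channels, height, width] (batch dim is implicit).
image = fluid.layers.data(name='image', shape=[3, 224, 224], dtype='float32')

# New random_crop op: take a random [3, 192, 192] crop of each instance.
cropped = fluid.layers.random_crop(image, shape=[3, 192, 192])

# New shape op: get the tensor's shape as a 1-D integer tensor.
crop_shape = fluid.layers.shape(cropped)

# New resize_bilinear interface: bilinearly resize feature maps to 224x224.
resized = fluid.layers.resize_bilinear(cropped, out_shape=[224, 224])

# Enhanced reduce_op family: reduce over several dimensions at once.
spatial_sum = fluid.layers.reduce_sum(resized, dim=[2, 3])

# New dice_loss layer: input holds per-class probabilities; the label has
# the same rank with a trailing dimension of 1 holding class indices.
probs = fluid.layers.data(name='probs', shape=[128, 2], dtype='float32')
seg_label = fluid.layers.data(name='seg_label', shape=[128, 1], dtype='int64')
loss = fluid.layers.dice_loss(input=probs, label=seg_label)
```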

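A similarly hedged sketch of launching distributed training through the transpiler. The `DistributeTranspiler` entry points shown here exist in Fluid, but the `sync_mode` flag (`False` selecting the new asynchronous mode) and the `PADDLE_*` environment variable names are assumptions based on later documentation.

```python
import os
import paddle.fluid as fluid

# Build a trivial training program (loss + optimizer) as usual.
x = fluid.layers.data(name='x', shape=[13], dtype='float32')
y = fluid.layers.data(name='y', shape=[1], dtype='float32')
pred = fluid.layers.fc(input=x, size=1)
loss = fluid.layers.mean(fluid.layers.square_error_cost(input=pred, label=y))
fluid.optimizer.SGD(learning_rate=0.01).minimize(loss)

# Rewrite the program for distributed execution.
t = fluid.DistributeTranspiler()
t.transpile(
    trainer_id=int(os.getenv('PADDLE_TRAINER_ID', '0')),
    pservers=os.getenv('PADDLE_PSERVER_ENDPOINTS', '127.0.0.1:6174'),
    trainers=int(os.getenv('PADDLE_TRAINERS', '1')),
    sync_mode=False)  # assumption: False selects asynchronous training

if os.getenv('PADDLE_TRAINING_ROLE') == 'PSERVER':
    # Parameter server: run the transpiled pserver program.
    endpoint = os.getenv('PADDLE_CURRENT_ENDPOINT', '127.0.0.1:6174')
    pserver_prog = t.get_pserver_program(endpoint)
    startup = t.get_startup_program(endpoint, pserver_prog)
    exe = fluid.Executor(fluid.CPUPlace())
    exe.run(startup)
    exe.run(pserver_prog)
else:
    # Trainer: run the transpiled trainer program.
    trainer_prog = t.get_trainer_program()
    exe = fluid.Executor(fluid.CPUPlace())
    exe.run(fluid.default_startup_program())
    # ... feed data and run trainer_prog in a loop ...
```
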
Performance Improvements

For the ResNet-50 model on P40 GPUs, single-GPU throughput improves by 23.8% (from 105 to 130 images/sec); the speedup ratio reaches 6 on 8 GPUs and 17.4 on 32 GPUs.

  • Overlap send/recv op with other operators.
  • Multi-thread server-side request handling.
  • Moved weight decay and clipping from the trainer to the parameter server for performance and correctness.
  • Improved C++ reader.

Major Bug Fixes

  • Fix accuracy loss when ParallelExecutor and the memory optimizer are used together.
  • Fix a ParallelExecutor hang when multiple inputs are duplicated.
  • Fix a memory leak caused by cloning a Program.
  • Fix the GRU unit's ineffective bias and incorrect activation.
  • Fix ROI Pooling GPU computation issues.
  • Fix fill_constant_batch_size_like when the input is a sequence.
  • Fix the reshape op.