Add design doc for lookup remote table in Fluid #9068

Merged: 12 commits, Jul 5, 2018
48 changes: 48 additions & 0 deletions doc/fluid/design/dist_train/prefetch_parameter.md
@@ -0,0 +1,48 @@
# Design Doc: Prefetching Parameter From Parameter Server

## Abstract

We propose an approach to prefetch the parameters from a Parameter
Server during distributed training so that Fluid is able to train a model
with a large number of parameters that cannot be stored in one
trainer's memory.

## Background

For an embedding layer, the trainable parameter may be too large to be
stored in one trainer's memory. In Fluid distributed training, the
[Distributed Transpiler](./parameter_server.md#distributed-transpiler) splits every large parameter into a number of smaller
parameters that are stored on the Parameter Servers, so we can prefetch the needed parameter rows
from the corresponding Parameter Server according to the input `Ids`.
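
As a rough illustrative estimate (not taken from the original document), an embedding table with 100 million rows and an embedding size of 512 stored in `float32` takes about 100,000,000 × 512 × 4 bytes ≈ 200 GB, far beyond a single trainer's memory.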

## Design

This is a feature of Fluid distributed training; you may want to read
[Distributed Architecture](./distributed_architecture.md) and
[Parameter Server](./parameter_server.md) before reading the following content.

### Partitioned Parameter

<img src="src/split_parameter.png" width="400" />

- **Distributed Transpiler** splits the large parameter
(weight) into several partitioned parameters (weight_0, weight_1, weight_2), as shown in the
figure above.
- We can use `round-robin` to distribute the partitioned parameters among the Parameter Servers (see the sketch after this list).
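
The sketch below is only an illustration of this idea, not the actual transpiler code: a large weight is split row-wise into `weight_0`, `weight_1`, `weight_2` and assigned to Parameter Server endpoints in a round-robin fashion. The helper `split_rows` and the endpoint addresses are hypothetical.

```python
import numpy as np

def split_rows(weight, num_blocks):
    """Split a (vocab_size, emb_size) weight into `num_blocks` row blocks."""
    return np.array_split(weight, num_blocks, axis=0)

weight = np.random.rand(10, 4)       # the large parameter (toy size)
blocks = split_rows(weight, 3)       # weight_0, weight_1, weight_2

# Round-robin assignment of the partitioned parameters to Parameter Servers.
pserver_endpoints = ["127.0.0.1:6170", "127.0.0.1:6171"]
placement = {
    "weight_%d" % i: pserver_endpoints[i % len(pserver_endpoints)]
    for i in range(len(blocks))
}
print(placement)  # weight_0 -> :6170, weight_1 -> :6171, weight_2 -> :6170
```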

### Prefetching Parameter

<img src="src/prefetch_parameters.png" width="400" />

- The `prefetch_rpc` operator prefetches the parameter rows from the corresponding Parameter
Servers according to the input `Ids`; we use [SelectedRows](../../../design/selected_rows.md)
as the type of the received variable.
- The `merge_selected_rows` operator merges the received parameters into one
`SelectedRows` variable (a conceptual sketch of these two steps follows below).
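
The following is a minimal, self-contained sketch of what `prefetch_rpc` and `merge_selected_rows` do conceptually, assuming row-wise round-robin sharding (a global id `i` lives on shard `i % num_shards` at local row `i // num_shards`). The RPC is replaced by an in-process dictionary, and the `(rows, tensor)` pairs only stand in for real `SelectedRows` variables.

```python
import numpy as np

num_shards = 3
# weight_i held by Parameter Server i; 4 local rows, embedding size 4 (toy sizes).
shards = {i: np.random.rand(4, 4) for i in range(num_shards)}

def prefetch(ids):
    """Group the input ids by shard and fetch the matching rows from each shard."""
    fetched = []
    for shard_id in range(num_shards):
        local = [(i, i // num_shards) for i in ids if i % num_shards == shard_id]
        if local:
            rows = [global_id for global_id, _ in local]
            tensor = shards[shard_id][[local_row for _, local_row in local]]
            fetched.append((rows, tensor))  # one SelectedRows-like result per server
    return fetched

def merge_selected_rows(fetched):
    """Concatenate the per-server results into a single SelectedRows-like pair."""
    rows = [r for part_rows, _ in fetched for r in part_rows]
    tensor = np.concatenate([t for _, t in fetched], axis=0)
    return rows, tensor

rows, tensor = merge_selected_rows(prefetch([2, 5, 7]))
print(rows, tensor.shape)  # e.g. [7, 2, 5] (3, 4)
```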

## TODO

- `prefetch_rpc` operator to send row indices and receive `SelectedRows` variables.
- `lookup_table` needs to support the `SelectedRows` variable type as the input `Weight`.
- Async update: to avoid slow nodes, asynchronous update is important for distributed training;
we need a design doc and an implementation in the future.