
design doc for parallel_do.md #8425

Merged: 5 commits into PaddlePaddle:develop on Feb 14, 2018

Conversation

tonyyang-svail

No description provided.

```
fc_grad, allreduce(places, scopes, w1_grad),
fc_grad, allreduce(places, scopes, w2_grad)
}
block3 {
```
Collaborator

I think it's better to indicate each block's parent.

Author

done.
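
For readers of this thread, here is a minimal, hypothetical sketch (not the real ProgramDesc layout) of what "indicating each block's parent" could look like as an explicit parent index in the block table:

```python
# Hypothetical sketch: each block entry carries an explicit parent index, so
# the nesting of sub-blocks (e.g. the forward and backward sub-blocks of
# parallel_do under the root block) is visible at a glance.
blocks = [
    {"idx": 0, "parent": -1},  # root block
    {"idx": 1, "parent": 0},   # forward sub-block of parallel_do
    {"idx": 2, "parent": 0},   # backward sub-block (fc_grad + allreduce)
    {"idx": 3, "parent": 0},   # another sub-block, parent shown explicitly
]

def parent_of(idx):
    """Look up a block's parent index in the block table."""
    return blocks[idx]["parent"]

print(parent_of(2))  # -> 0
```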

```
.AsDuplicable();
AddInput(kPlaces, "Devices used for parallel processing");
AddOutput(kOutputs, "Outputs needed to be merged from different devices").AsDuplicable();
AddOutput(kParallelScopes,
```
Contributor

kParallelScopes seems to indicate that there are multiple scopes, but the description says Container, which is a single container:

  1. does container mean scope?
  2. is there a single scope or multiple scopes?

Author

  1. yes
  2. one scope for each device

Contributor

Maybe change "container" to "scope" and make "one scope for each device" clear? :)
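
To illustrate the "one scope for each device" answer, here is a rough Python sketch (hypothetical `Scope` helper, not the actual C++ Scope API) of how kParallelScopes could hold one child scope per place:

```python
# Hypothetical sketch: parallel_do keeps one child scope per device/place,
# all hanging off the scope the op itself runs in.
class Scope:
    def __init__(self, parent=None):
        self.parent = parent
        self.vars = {}

    def new_child(self):
        return Scope(parent=self)

def make_parallel_scopes(main_scope, places):
    # kParallelScopes would then refer to this list: one scope per place.
    return [main_scope.new_child() for _ in places]

gpu_places = ["gpu:0", "gpu:1"]            # placeholder place names
parallel_scopes = make_parallel_scopes(Scope(), gpu_places)
print(len(parallel_scopes))                # -> 2, i.e. one scope per device
```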

```
In the forward pass
| Split input onto different devices
| Copy parameter to onto different devices
```
Contributor

It seems that "Copy parameter to onto different devices" is only done the first time the parallel_do OP runs. Maybe we need to make that clear.

Author

The current version does this at every iteration.

```
| Merge output from different devices

In the backward pass
| Split output@grad onto different devices
```
Contributor

Is it split or duplicate?

Author

split.
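
A small numpy sketch of the "split" answer, assuming the batch dimension was split across devices in the forward pass:

```python
import numpy as np

# Assume output@grad has shape [batch, ...] and the batch was split evenly
# across devices in the forward pass.
output_grad = np.random.rand(8, 16)        # whole-batch gradient
num_devices = 2

# Split (not duplicate): each device only receives the slice of output@grad
# that corresponds to the input slice it processed in the forward pass.
per_device_grads = np.split(output_grad, num_devices, axis=0)
assert per_device_grads[0].shape == (4, 16)
```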

```
| Split output@grad onto different devices
|||| Compute backward pass in parallel
| accumulate param@grad from different devices to the first device
| Merge input@grad from different devices
```
Contributor

Is it input@grad or param@grad?

Author

Another step, "Copy param@grad to the place of parallel_do_op", should be added here.
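
Putting the diagram and this reply together, here is a hedged, self-contained numpy stand-in for the backward sequence, with the extra "copy param@grad to the place of parallel_do_op" step included (shapes and helpers are illustrative only):

```python
import numpy as np

# Stand-in shapes: x is (batch, 16), W is (16, 8), so output@grad dy is (batch, 8).
W = np.random.rand(16, 8)
output_grad = np.random.rand(8, 8)
num_devices = 2

# 1. Split output@grad onto the different devices (batch dimension).
grads = np.split(output_grad, num_devices, axis=0)

# 2. Compute the backward pass per device (serially here, for illustration):
#    dx = dy @ W.T; the per-device dW is just a placeholder of the right shape.
input_grads = [g @ W.T for g in grads]
param_grads = [np.ones_like(W) for _ in grads]

# 3. Accumulate param@grad from the different devices to the first device.
param_grad = sum(param_grads)

# 4. Extra step from the reply: copy param@grad back to the place of
#    parallel_do_op (a plain copy in this single-process sketch).
param_grad = param_grad.copy()

# 5. Merge input@grad from the different devices.
input_grad = np.concatenate(input_grads, axis=0)
```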

```python
# get embedding feature on CPU
feature = some_cpu_only_op(data)

gpu_places = get_place(use_gpu=True)
```
Contributor

Can the Python API specify 5 parallel CPU threads when there is no GPU?

Author

Yes
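
A hedged illustration of that "Yes": the doc's example calls get_place(use_gpu=True), so a CPU-only run would presumably look something like the snippet below; the exact argument for requesting five parallel CPU places is an assumption, not the confirmed API:

```python
# Assumed variant of the doc's get_place() call for a machine without GPUs.
# The argument name for the number of parallel CPU places is hypothetical.
cpu_places = get_place(use_gpu=False, device_count=5)

pd = ParallelDo(cpu_places)
with pd.do():
    read_input(feature)
    prediction = my_net(feature)
    write_output(prediction)
```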

```python
with pd.do():
    read_input(feature)
    prediction = my_net(feature)
    write_output(activation)
```
Contributor

write_output(activation) or write_output(prediction)?

Author

typo. Thanks for pointing it out.

```python
    read_input(feature)
    prediction = my_net(feature)
    write_output(activation)
prediction = pd()
```
Contributor

Does the Python API support multiple outputs? If so can you provide an example?

Author

Yes. Sure.
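
A hedged example of the multiple-output case, in the same style as the doc's snippets; whether pd() returns the merged outputs as a tuple in exactly this way is an assumption:

```python
pd = ParallelDo(gpu_places)
with pd.do():
    read_input(feature)
    prediction, loss = my_net(feature)   # the sub-block produces two outputs
    write_output(prediction)
    write_output(loss)

# Assumption: pd() returns the merged outputs in the order they were written.
prediction, loss = pd()
```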

```python
pd = ParallelDo(gpu_places)
with pd.do():
    feature = pre_fetch(gpu_places)
```
Contributor

Sorry, I don't understand how pre_fetch will work here. Since pre_fetch is inside the child block of parallel_do, it will not run until parallel_do runs. Isn't that too late for prefetching?

Author

This op hasn't been implemented yet. But there should be a background thread adding the data to the fetching queue before this OP is called.
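
A minimal, generic sketch of that background-thread idea using plain Python threading and a queue (this is not the actual pre_fetch implementation, which does not exist yet):

```python
import queue
import threading

# Generic sketch: a producer thread keeps the fetch queue filled, so by the
# time the parallel_do sub-block calls pre_fetch, a batch is already waiting.
fetch_queue = queue.Queue(maxsize=8)

def producer(batches):
    for batch in batches:
        fetch_queue.put(batch)          # blocks when the queue is full

threading.Thread(target=producer, args=([[1, 2], [3, 4]],), daemon=True).start()

def pre_fetch():
    # Inside the sub-block this just pops an already-loaded batch.
    return fetch_queue.get()

print(pre_fetch())   # -> [1, 2]
```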

```python
    write_output(activation)
```

### forward: Copy parameter to onto different devices
Contributor

Is "Copy parameter to onto different devices" a performance improvement? I agree that this is a more graceful approach, but isn't "Copy parameter to onto different devices" will only run once, so maybe the performance cost is negligible?

Looks that in the body of this section there are other optimizations besides "Copy parameter to onto different devices", maybe need a better title?

Maybe I have this question because I did not fully understand it.

Author

The current implementation of backward only supports updating the gradient at one place, so we need to copy the updated parameters at every iteration.
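
A rough numpy stand-in for what this reply describes: the parameter is updated at one place only, so the forward pass re-copies it onto every device at each iteration:

```python
import numpy as np

# Illustrative stand-in: the parameter is updated on one place only, so the
# forward pass re-copies (broadcasts) it onto every device each iteration.
param = np.zeros((4, 4))          # lives on the place of parallel_do_op
num_devices, lr = 2, 0.1

for step in range(3):
    # Forward: copy the current parameter onto every device.
    per_device_params = [param.copy() for _ in range(num_devices)]

    # Backward: per-device gradients (dummy ones here), accumulated to one place.
    per_device_grads = [np.ones_like(p) for p in per_device_params]
    param_grad = sum(per_device_grads)

    # Update happens at that single place; the next iteration re-copies the result.
    param -= lr * param_grad
```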

Contributor
@helinwang left a comment

LGTM!

@helinwang merged commit 118d950 into PaddlePaddle:develop on Feb 14, 2018
```
}
```

## Proformance Imporvement
Contributor

Just a minor typo here. Proformance -> Performance

```
In the forward pass
| Split input onto different devices
| Copy parameter to onto different devices
```
Contributor

You can drop the "to" -> "Copy parameter onto different devices".
