design doc for parallel_do.md #8425
Conversation
fc_grad, allreduce(places, scopes, w1_grad),
fc_grad, allreduce(places, scopes, w2_grad)
}
block3 {
I think it's better to indicate each block's parent.
done.
.AsDuplicable();
AddInput(kPlaces, "Devices used for parallel processing");
AddOutput(kOutputs, "Outputs needed to be merged from different devices").AsDuplicable();
AddOutput(kParallelScopes,
`kParallelScopes` seems to indicate that there are multiple scopes, but the description says `Container`, which is a single container:
- does container mean scope?
- is there a single scope or multiple scopes?
- yes
- one scope for each device
Maybe change "container" to "scope" and make "one scope for each device" clear? :)
```
In the forward pass
| Split input onto different devices
| Copy parameter to onto different devices
It seems that "Copy parameter to onto different devices" is only done the first time the parallel_do OP runs. Maybe we need to make that clear.
The current version does this at every iteration
| Merge output from different devices

In the backward pass
| Split output@grad onto different devices
Is it split or duplicate?
split.
| Split output@grad onto different devices
|||| Compute backward pass in parallel
| accumulate param@grad from different devices to the first device
| Merge input@grad from different devices
Is it `input@grad` or `param@grad`?
Another step, "Copy param@grad to the place of parallel_do_op", should be added here.
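If that step is added, the backward-pass list might read as follows (a sketch in the doc's own notation; placing the copy right after the accumulation is an assumption):

```
In the backward pass
| Split output@grad onto different devices
|||| Compute backward pass in parallel
| accumulate param@grad from different devices to the first device
| Copy param@grad to the place of parallel_do_op
| Merge input@grad from different devices
```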
# get embedding feature on CPU
feature = some_cpu_only_op(data)

gpu_places = get_place(use_gpu=True)
Can the Python API specify 5 parallel CPU threads when there is no GPU?
Yes
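A minimal sketch of what that might look like, reusing the `get_place`/`ParallelDo` style shown elsewhere in this doc; the keyword arguments `use_gpu=False` and `device_count=5` are assumptions, not the confirmed API:

```python
# Hypothetical sketch: request 5 CPU places instead of GPU places.
# The get_place signature used here is an assumption, not the real API.
cpu_places = get_place(use_gpu=False, device_count=5)

pd = ParallelDo(cpu_places)
with pd.do():
    read_input(feature)
    prediction = my_net(feature)
    write_output(prediction)
prediction = pd()
```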
doc/design/parallel_do.md
Outdated
with pd.do():
    read_input(feature)
    prediction = my_net(feature)
    write_output(activation)
`write_output(activation)` or `write_output(prediction)`?
typo. Thanks for pointing it out.
doc/design/parallel_do.md
Outdated
    read_input(feature)
    prediction = my_net(feature)
    write_output(activation)
prediction = pd()
Does the Python API support multiple outputs? If so, can you provide an example?
Yes. Sure.
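A sketch of what such an example could look like, reusing the API already shown in this doc; that `pd()` returns the merged outputs as a tuple, in `write_output` order, is an assumption:

```python
# Hypothetical sketch: two outputs produced inside the parallel block.
pd = ParallelDo(gpu_places)
with pd.do():
    read_input(feature)
    prediction, loss = my_net(feature)   # my_net assumed to return both
    write_output(prediction)
    write_output(loss)
# Assumed: pd() merges and returns the outputs in write_output order.
prediction, loss = pd()
```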
doc/design/parallel_do.md
Outdated
```python
pd = ParallelDo(gpu_places)
with pd.do():
    feature = pre_fetch(gpu_places)
Sorry, I don't understand how `pre_fetch` will work here. Since `pre_fetch` is inside the child block of parallel_do, it will not run until parallel_do runs. Isn't that too late for prefetching?
This op hasn't been implemented yet. But there should be a background thread adding the data to the fetching queue before this OP is called.
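A toy sketch of that idea (plain Python, illustrative names only, not the real ops): a background thread keeps a queue filled, so `pre_fetch` inside the parallel block only dequeues data that is already waiting.

```python
import queue
import threading

# Illustrative stand-ins; none of these names are the real framework ops.
fetch_queue = queue.Queue(maxsize=8)

def dummy_reader():
    # Stand-in for the real data source.
    for i in range(100):
        yield {"feature": i}

def background_loader():
    # Runs independently of parallel_do and fills the queue ahead of time.
    for batch in dummy_reader():
        fetch_queue.put(batch)   # blocks when the queue is full

threading.Thread(target=background_loader, daemon=True).start()

def pre_fetch(places):
    # Called inside the parallel block; the next batch is already queued,
    # so this is just a dequeue rather than a blocking read from the source.
    return fetch_queue.get()
```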
    write_output(activation)
```

### forward: Copy parameter to onto different devices
Is "Copy parameter to onto different devices" a performance improvement? I agree that this is a more graceful approach, but isn't "Copy parameter to onto different devices" will only run once, so maybe the performance cost is negligible?
Looks that in the body of this section there are other optimizations besides "Copy parameter to onto different devices", maybe need a better title?
Maybe I have this question because I did not fully understand it.
The current implementation of backward only supports updating the gradient at one place, so we need to copy the updated parameters at every iteration.
LGTM!
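To illustrate the point about copying parameters at every iteration, a toy sketch (plain Python with made-up numbers, not framework code) of the flow it implies: the parameter is broadcast to every device for the forward pass, gradients are accumulated back to one place, and the update happens there, so the copy must be repeated each iteration.

```python
# Toy illustration of the per-iteration flow described above.
num_devices, lr = 4, 0.1
W_main = 1.0                                  # parameter kept at one place

for step in range(3):
    W_per_device = [W_main] * num_devices     # forward: copy W to every device
    grads = [0.5 * w for w in W_per_device]   # backward in parallel (toy gradient)
    g = sum(grads)                            # accumulate param@grad to the first device
    W_main -= lr * g                          # update happens at one place only
```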
}
```

## Proformance Imporvement
Just a minor typo here. Proformance -> Performance
```
In the forward pass
| Split input onto different devices
| Copy parameter to onto different devices
You can drop the "to" -> "Copy parameter onto different devices".