
Use all to one op to do DtoD between remote and merge #817

Closed
842974287 wants to merge 1 commit into pytorch/FBGEMM from 842974287:export-D33065710

Conversation

842974287 (Contributor)

Summary:
Previously we were simply calling `Tensor.to` to launch the DtoD copies. Since PyTorch performs a two-way barrier for each DtoD copy, all the DtoD copies are serialized even though they are launched from different devices.

See the blue DtoD copies in the graph below.
{F686842812}

At first I went for merge_pooled_embedding directly, but I forgot that MRS models also have sequence embeddings, so covering pooled embeddings alone is not enough in this case.

This diff introduces a function that takes a tuple of IValues, moves the underlying tensors to a given target device, and outputs a vector of IValues whose underlying tensors are all on that device.

For each source device, we synchronize its current stream and launch all the copies for the tensors on that device. Then we synchronize the current stream on the target device to wait for all the copies.

Now the copies from different devices can run in parallel.
{F686843333}

Differential Revision: D33065710
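
To make the scheme concrete, here is a minimal Python sketch of the idea using PyTorch's public stream and event APIs. The function name `all_to_one_device` and its list-of-tensors interface are illustrative only; the op added by this diff is implemented in C++ and operates on IValues, and a production version also has to handle caching-allocator stream semantics (e.g. `Tensor.record_stream`) that this sketch glosses over.

```python
import torch


def all_to_one_device(tensors, target_device):
    """Illustrative sketch: copy CUDA tensors that live on several GPUs onto
    target_device, overlapping the copies from different source devices
    instead of serializing them behind per-copy two-way barriers."""
    target_stream = torch.cuda.current_stream(target_device)
    outputs = list(tensors)

    # Group tensor indices by source device so each device's copies are
    # launched together.
    by_device = {}
    for i, t in enumerate(tensors):
        if t.is_cuda and t.device != target_device:
            by_device.setdefault(t.device, []).append(i)

    copy_done = []
    for src_device, indices in by_device.items():
        # "Synchronize its current stream": the copies must run after the
        # producers of these tensors on the source device.
        src_stream = torch.cuda.current_stream(src_device)
        copy_stream = torch.cuda.Stream(device=src_device)
        copy_stream.wait_stream(src_stream)
        # Launch all copies for this source device; copies for different
        # source devices live on different streams and can overlap.
        with torch.cuda.stream(copy_stream):
            for i in indices:
                outputs[i] = tensors[i].to(target_device, non_blocking=True)
        done = torch.cuda.Event()
        done.record(copy_stream)
        copy_done.append(done)

    # "Synchronize the current stream on the target device": downstream work
    # (e.g. the merge) only starts after every copy has finished.
    for done in copy_done:
        target_stream.wait_event(done)
    return outputs
```

In the sketch, grouping by source device and giving each group its own stream is what lets the copies overlap; the target device's current stream only waits on the recorded events rather than on each individual copy.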

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D33065710

842974287 added a commit to 842974287/FBGEMM that referenced this pull request Dec 17, 2021

842974287 added a commit to 842974287/FBGEMM that referenced this pull request Dec 28, 2021

Reviewed By: yinghai, jianyuh, houseroad

Differential Revision: D33065710

fbshipit-source-id: c02373985e8e92c35fca8520a0292dc84adc3fd2
842974287 force-pushed the export-D33065710 branch 2 times, most recently from a56c7cd to 02ccbbf on December 28, 2021 at 20:22
