add docs
minhthuc2502 committed Mar 1, 2024
1 parent 4b78fb6 commit 28473ae
Showing 2 changed files with 37 additions and 1 deletion.
1 change: 1 addition & 0 deletions README.md
@@ -34,6 +34,7 @@ The project is production-oriented and comes with [backward compatibility guaran
* **Lightweight on disk**<br/>Quantization can make the models 4 times smaller on disk with minimal accuracy loss.
* **Simple integration**<br/>The project has few dependencies and exposes simple APIs in [Python](https://opennmt.net/CTranslate2/python/overview.html) and C++ to cover most integration needs.
* **Configurable and interactive decoding**<br/>[Advanced decoding features](https://opennmt.net/CTranslate2/decoding.html) allow autocompleting a partial sequence and returning alternatives at a specific location in the sequence.
* **Tensor parallelism for distributed inference**<br/>Large models can be split across multiple GPUs when they are too big to be loaded on a single one.

Some of these features are difficult to achieve with standard deep learning frameworks and are the motivation for this project.

37 changes: 36 additions & 1 deletion docs/parallel.md
@@ -42,8 +42,43 @@ Parallelization with multiple Python threads is possible because all computation
```

## Model and tensor parallelism
Models such as the [`Translator`](python/ctranslate2.Translator.rst) and the [`Generator`](python/ctranslate2.Generator.rst) can be split across multiple GPUs.
This is very helpful when the model is too big to be loaded on a single GPU.

-These types of parallelism are not yet implemented in CTranslate2.
To enable tensor parallelism, load the model with the `tensor_parallel` flag:

```python
translator = ctranslate2.Translator(model_path, device="cuda", tensor_parallel=True)
```
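
The same flag applies to a `Generator`. A minimal sketch, assuming `model_path` points to a converted generation model and that `Generator` accepts `tensor_parallel` like `Translator` does:

```python
import ctranslate2

# The model weights are sharded across the available GPUs instead of
# being replicated on each of them.
generator = ctranslate2.Generator(model_path, device="cuda", tensor_parallel=True)
```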

Set up the environment:
* Install [open-mpi](https://www.open-mpi.org/)
* Configure open-mpi by creating a configuration file such as ``hostfile``:
```bash
[ip address or dns] slots=nbGPU1
[other ip address or dns] slots=nbGPU2
```
* Run the application with multiple processes to use tensor parallelism (a minimal example script follows this list):
```bash
mpirun -np nbGPUExpected -hostfile hostfile python3 script
```
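
For illustration, here is a minimal sketch of the script launched by `mpirun` above. The file name `script.py`, the model path, and the input tokens are placeholders, and it assumes a converted CTranslate2 model with SentencePiece-style tokens:

```python
# script.py -- a minimal sketch, launched once per process by mpirun.
import os

import ctranslate2

# "ct2_model" is a placeholder for a converted CTranslate2 model directory.
translator = ctranslate2.Translator(
    "ct2_model",
    device="cuda",
    tensor_parallel=True,  # shard the model across the GPUs assigned by MPI
)

# Placeholder input: one sentence as SentencePiece-style tokens.
results = translator.translate_batch([["▁Hello", "▁world", "!"]])

# Open MPI sets OMPI_COMM_WORLD_RANK for each process; printing only from
# rank 0 avoids duplicated output when several processes run this script.
if os.environ.get("OMPI_COMM_WORLD_RANK", "0") == "0":
    print(results[0].hypotheses[0])
```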

If you are trying to run tensor parallelism across multiple machines, some additional configuration is needed:
* Make sure the master and slave nodes can connect to each other as a pair over SSH with public-key authentication
* Export all necessary environment variables from the master to the slaves, as in the example below:
```bash
mpirun -x VIRTUAL_ENV_PROMPT -x PATH -x VIRTUAL_ENV -x _ -x LD_LIBRARY_PATH -np nbGPUExpected -hostfile hostfile python3 script
```

See the [open-mpi docs](https://www.open-mpi.org/doc/) for more information.

```{note}
Running a model in tensor parallel mode on a single machine can boost performance, but running a model shared between multiple
machines can be slower because of the latency of the connection between them.
```

```{note}
In tensor parallel mode, `inter_threads` is still supported to run multiple workers. However, `device_index` no longer has any effect,
because tensor parallel mode only considers the GPUs available on the system and the number of GPUs you want to use.
```
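
For example, a minimal sketch combining tensor parallelism with multiple workers (the value of `inter_threads` is illustrative):

```python
import ctranslate2

# Two workers serve requests concurrently, while each request is still
# computed on the model sharded across the GPUs assigned by MPI.
translator = ctranslate2.Translator(
    model_path,
    device="cuda",
    tensor_parallel=True,
    inter_threads=2,
)
```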

## Asynchronous execution

