This repository has been archived by the owner on Oct 19, 2024. It is now read-only.
PipeshardParallel + GPT2 example fails with compile error and segmentation fault #863
Comments
I tried the HEAD commit (20debbe) and now the attribute error & segfault are gone. Just the identical I notice that the result of compilation ( Full error output (tqdm disabled)
If you want to use advanced parallelization options, please refer to this OPT example https://github.com/alpa-projects/alpa/tree/main/examples/opt_finetune and this branch #858.
Please describe the bug
I'm trying to use PipeshardParallel for the GPT2 example in examples/gpt2 (20debbe) with Alpa v0.2.2 inside a Docker container. I'm on an RHEL node with four NVIDIA A40 GPUs.

check failed: strategies->is_tuple || !strategies->leaf_vector.empty() %pad.38 = f16[8,512,2304]{2,1,0} pad(f16[8,512,768]{2,1,0} %reshape.1367, f16[] %constant.1168), padding=0_0x0_0x1536_0, metadata={op_name="parallelize(stage_0_1)/jit(main)/jit(merged)/jit(stage_0_1_compute2)/transpose(jvp(FlaxGPT2LMHeadModule))/transformer/h/11/attn/pad[padding_config=((0, 0, 0), (0, 0, 0), (1536, 0, 0))]" source_file="/opt/conda/envs/alpa/lib/python3.8/site-packages/transformers/models/gpt2/modeling_flax_gpt2.py" source_line=211} does not have any valid strategies.

AttributeError: module 'jaxlib.xla_extension' has no attribute 'nccl_create_communicators_no_stream'
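For readers decoding the `check failed` message: the failing instruction is an XLA `pad` whose `padding=0_0x0_0x1536_0` string lists `low_high` (optionally `low_high_interior`) per dimension, separated by `x`. The short sketch below recomputes the output shape from the input shape and that string. The helper name `padded_shape` is hypothetical and only illustrates the arithmetic; it is not part of Alpa or XLA.

```python
def padded_shape(shape, padding):
    """Compute the output shape of an XLA pad op.

    `padding` uses the HLO text form "low_high[_interior]" per dimension,
    joined by "x", e.g. "0_0x0_0x1536_0".
    """
    out = []
    for size, spec in zip(shape, padding.split("x")):
        parts = [int(p) for p in spec.split("_")]
        low, high = parts[0], parts[1]
        interior = parts[2] if len(parts) > 2 else 0
        # Interior padding is inserted between neighbouring elements.
        out.append(low + size + high + interior * max(size - 1, 0))
    return tuple(out)


# Values taken from the error message: f16[8,512,768] padded to f16[8,512,2304].
print(padded_shape((8, 512, 768), "0_0x0_0x1536_0"))
```

The last dimension grows from 768 to 2304 (three times 768) through 1536 elements of low padding, which matches the `f16[8,512,2304]` result type of `%pad.38` in the message.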
Please describe the expected behavior
System information and environment
To Reproduce
Steps to reproduce the behavior:
Use docker/coreweave/run_alpa_infiniband.Dockerfile. All following commands are run inside the container.
git clone --recursive https://github.com/alpa-projects/alpa.git
cd alpa/examples/gpt2
Modify run_clm_flax.py so that it uses PipeshardParallel instead of Zero2Parallel.
pip install transformers datasets (transformers 4.25.1, datasets 2.8.0)
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
pip install tensorflow
mkdir norwegian-gpt2 && python train_tokenizer.py && python create_config.py
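The exact change made to run_clm_flax.py is not shown in this report. As a rough sketch only, swapping Zero2Parallel for PipeshardParallel might look like the following. The helper `make_pipeshard_method` and the values `num_micro_batches=16` and `layer_num=2` are assumptions chosen for illustration, not the reporter's settings. The helper takes the `alpa` module as an argument so the snippet can be read and run without Alpa installed.

```python
def make_pipeshard_method(alpa, num_micro_batches=16, layer_num=2):
    """Build a PipeshardParallel method (hypothetical helper).

    `alpa` is the imported alpa module. The option values are illustrative
    assumptions, not the configuration used in this issue.
    """
    return alpa.PipeshardParallel(
        num_micro_batches=num_micro_batches,
        # Let Alpa cluster the model's operators into pipeline layers.
        layer_option=alpa.AutoLayerOption(layer_num=layer_num),
        # Let Alpa assign layers and devices to stages automatically.
        stage_option="auto",
    )


# Intended use inside run_clm_flax.py (assumes a running Ray cluster):
#
#   import alpa
#   alpa.init(cluster="ray")
#   method = make_pipeshard_method(alpa)   # previously: alpa.Zero2Parallel()
#   p_train_step = alpa.parallelize(train_step, method=method,
#                                   donate_argnums=(0,))
```

Whether these options reproduce the failure reported here is untested; they only show where the parallel method is selected.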
Full error output (tqdm disabled)
As a side note, it would be great if there were a single Dockerfile to compile and run the Alpa HEAD commit.