This repository has been archived by the owner on Oct 19, 2024. It is now read-only.

Alpa Pipeshard Compile Errors #838

Open
EntilZha opened this issue Dec 28, 2022 · 0 comments

Please describe the bug
I'm using the same code as in issue #813, but I'm now getting compile errors that suggest a few things are not implemented.

I am setting this in my parallelize call:

        import alpa
        import ray

        ray.init()
        alpa.init(cluster="ray")
        p_train_step = alpa.parallelize(
            partial_train_step,
            donate_argnums=(0,),
            method=alpa.PipeshardParallel(
                layer_option=alpa.AutoLayerOption(layer_num=8),
                stage_option='auto',
                num_micro_batches=num_micro_batches,
            ),
        )

I get two different errors depending on whether num_micro_batches=1 or num_micro_batches>1. If it is set to one, I get this: https://gist.github.com/EntilZha/5e5a3c04446404bd8da4673accc24a36. After debugging a bit, it seems that the code at https://github.com/alpa-projects/alpa/blob/main/alpa/pipeline_parallel/computation.py#L397 expects the first expression to be a start, but for some reason it is not, so the first assert fails.

If num_micro_batches is greater than one, I get https://gist.github.com/EntilZha/8010d86ea767f1730244903f0639c049, where the error suggests that some part of the pipeline isn't implemented yet:

not supported yet.        

I'm unfamiliar with Alpa internals, but I'd be happy to provide any other information that would help with debugging.

Please describe the expected behavior
The Alpa program compiles and starts running.

System information and environment

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04, docker): Ubuntu
  • Python version: 3.9.15
  • CUDA version: 11.4
  • NCCL version:
  • cupy version: 11.4
  • GPU model and memory: V100 32GB
  • Alpa version: Commit 5660516
  • TensorFlow version: 2.11.0
  • JAX version: 0.3.22
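
In case it helps anyone reproducing this, the version fields above can be gathered with a short script. This is only a minimal sketch using the standard library: the helper name `collect_versions` is hypothetical, and the distribution names in the list are assumptions (CuPy in particular is often installed under a CUDA-specific name), so adjust them to whatever `pip list` shows.

```python
import platform
from importlib import metadata


def collect_versions(packages):
    """Return {distribution name: installed version, or 'not installed'}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed under this exact name.
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    # Assumed distribution names; change them to match your environment.
    names = ["alpa", "jax", "jaxlib", "tensorflow", "cupy", "ray"]
    print("python:", platform.python_version())
    for name, version in collect_versions(names).items():
        print(f"{name}: {version}")
```

This reports only Python package versions. The CUDA and NCCL versions and the GPU model would still need to be checked separately.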