Fix GetJobLogs and e2e-neuron binary not exits issue. #465

Merged (3 commits) on Aug 2, 2024

@weicongw (Contributor) commented Aug 1, 2024

Issue #, if available:

Description of changes:

  • GetJobLogs used the same context as the main test context, so when a timeout occurred the tests could not print the job logs. GetJobLogs now uses context.Background() so that the logs are always printed.
  • Add the e2e-neuron test binary to the kubetest2 Dockerfile.
  • Set the default disk storage size to 100 GB.
  • Enforce that the node group is in a single AZ.

Test:

Tested that the logs are always printed, even when the timeout is reached:

go test -timeout 60m -v . -args -nvidiaTestImage public.ecr.aws/o5d5x8n6/weicongw:nvidia
2024/08/01 07:34:11 No node type specified. Using the node type p3.2xlarge in the node groups.
=== RUN   TestMPIJobPytorchTraining
=== RUN   TestMPIJobPytorchTraining/single-node
=== RUN   TestMPIJobPytorchTraining/single-node/MPIJob_succeeds
    mpi_test.go:60: context deadline exceeded
=== NAME  TestMPIJobPytorchTraining/single-node
    mpi_test.go:71: Test log for pytorch-training-single-node:
    mpi_test.go:72: Cloning into '/pytorch-examples'...
        Note: switching to '0f0c9131ca5c79d1332dce1f4c06fe942fbdc665'.
        
        You are in 'detached HEAD' state. You can look around, make experimental
        changes and commit them, and you can discard any commits you make in this
        state without impacting any branches by switching back to a branch.
        
        If you want to create a new branch to retain commits you create, you may
        do so (now or later) by using -c with the switch command. Example:
        
          git switch -c <new-branch-name>
        
        Or undo this operation with:
        
          git switch -
        
        Turn off this advice by setting config variable advice.detachedHead to false
        
        HEAD is now at 0f0c913 Use regular dropout rather than dropout2d
        Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
        Failed to download (trying next):
        HTTP Error 403: Forbidden
        
        Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz
        Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to ../data/MNIST/raw/train-images-idx3-ubyte.gz
100%|██████████| 9912422/9912422 [00:00<00:00, 123835640.95it/s]
        Extracting ../data/MNIST/raw/train-images-idx3-ubyte.gz to ../data/MNIST/raw
        
        Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
        Failed to download (trying next):
        HTTP Error 403: Forbidden
        
        Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz
        Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz to ../data/MNIST/raw/train-labels-idx1-ubyte.gz
100%|██████████| 28881/28881 [00:00<00:00, 27751590.80it/s]
        Extracting ../data/MNIST/raw/train-labels-idx1-ubyte.gz to ../data/MNIST/raw
        
        Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
        Failed to download (trying next):
        HTTP Error 403: Forbidden
        
        Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz
        Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw/t10k-images-idx3-ubyte.gz
100%|██████████| 1648877/1648877 [00:00<00:00, 105775069.92it/s]
        Extracting ../data/MNIST/raw/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw
        
        Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
        Failed to download (trying next):
        HTTP Error 403: Forbidden
        
        Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz
        Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz
100%|██████████| 4542/4542 [00:00<00:00, 3642548.52it/s]
        Extracting ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw
        
        Train Epoch: 1 [0/60000 (0%)]   Loss: 2.305400
        Train Epoch: 1 [640/60000 (1%)] Loss: 1.359780
        Train Epoch: 1 [1280/60000 (2%)]        Loss: 0.830670
        Train Epoch: 1 [1920/60000 (3%)]        Loss: 0.605961
        Train Epoch: 1 [2560/60000 (4%)]        Loss: 0.345934
        Train Epoch: 1 [3200/60000 (5%)]        Loss: 0.446331
        Train Epoch: 1 [3840/60000 (6%)]        Loss: 0.306768
        Train Epoch: 1 [4480/60000 (7%)]        Loss: 0.279325
        Train Epoch: 1 [5120/60000 (9%)]        Loss: 0.555025
        Train Epoch: 1 [5760/60000 (10%)]       Loss: 0.208878
        Train Epoch: 1 [6400/60000 (11%)]       Loss: 0.279527
        Train Epoch: 1 [7040/60000 (12%)]       Loss: 0.327207
        Train Epoch: 1 [7680/60000 (13%)]       Loss: 0.204888
        Train Epoch: 1 [8320/60000 (14%)]       Loss: 0.220855
        Train Epoch: 1 [8960/60000 (15%)]       Loss: 0.273643
        Train Epoch: 1 [9600/60000 (16%)]       Loss: 0.097318
        Train Epoch: 1 [10240/60000 (17%)]      Loss: 0.248318
        Train Epoch: 1 [10880/60000 (18%)]      Loss: 0.112893
        Train Epoch: 1 [11520/60000 (19%)]      Loss: 0.439383
        Train Epoch: 1 [12160/60000 (20%)]      Loss: 0.244582
        Train Epoch: 1 [12800/60000 (21%)]      Loss: 0.245529
        Train Epoch: 1 [13440/60000 (22%)]      Loss: 0.221483
        Train Epoch: 1 [14080/60000 (23%)]      Loss: 0.157298
        Train Epoch: 1 [14720/60000 (25%)]      Loss: 0.418896
        Train Epoch: 1 [15360/60000 (26%)]      Loss: 0.168725
        Train Epoch: 1 [16000/60000 (27%)]      Loss: 0.110782
        
=== RUN   TestMPIJobPytorchTraining/multi-node
W0801 07:34:58.583447   14852 warnings.go:70] unknown field "spec.mpiReplicaSpecs.Launcher.template.metadata.creationTimestamp"
W0801 07:34:58.583460   14852 warnings.go:70] unknown field "spec.mpiReplicaSpecs.Worker.template.metadata.creationTimestamp"
=== RUN   TestMPIJobPytorchTraining/multi-node/MPIJob_succeeds
    mpi_test.go:123: context deadline exceeded
=== NAME  TestMPIJobPytorchTraining/multi-node
    mpi_test.go:132: no pods found for job multi-node-nccl-test
--- FAIL: TestMPIJobPytorchTraining (47.59s)
    --- FAIL: TestMPIJobPytorchTraining/single-node (47.33s)
        --- FAIL: TestMPIJobPytorchTraining/single-node/MPIJob_succeeds (46.81s)
    --- FAIL: TestMPIJobPytorchTraining/multi-node (0.26s)
        --- FAIL: TestMPIJobPytorchTraining/multi-node/MPIJob_succeeds (0.00s)
=== RUN   TestSingleNodeUnitTest
=== RUN   TestSingleNodeUnitTest/unit-test
=== RUN   TestSingleNodeUnitTest/unit-test/Unit_test_Job_succeeds
    unit_test.go:56: context deadline exceeded
=== NAME  TestSingleNodeUnitTest/unit-test
    unit_test.go:65: container "unit-test-container" in pod "unit-test-job-v8wnm" is waiting to start: ContainerCreating
--- FAIL: TestSingleNodeUnitTest (0.32s)
    --- FAIL: TestSingleNodeUnitTest/unit-test (0.32s)
        --- FAIL: TestSingleNodeUnitTest/unit-test/Unit_test_Job_succeeds (0.00s)
FAIL
FAIL    github.com/aws/aws-k8s-tester/e2e2/test/cases/nvidia    62.397s
FAIL

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@weicongw weicongw marked this pull request as ready for review August 1, 2024 18:02
@weicongw weicongw force-pushed the main branch 2 times, most recently from 6cbf22f to cca1835 Compare August 2, 2024 17:32
@Issacwww Issacwww merged commit 1b67363 into aws:main Aug 2, 2024
3 of 5 checks passed