Add GPU support to compose service #750

Merged
merged 0 commits into from
Mar 18, 2019

efekarakus
Contributor

Issue #, if available: 729

Description of changes: Add GPU support to compose service.

Testing: All unit tests pass, see the comment section for manual testing.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@efekarakus
Contributor Author

Manual Testing

New Functionality

Configuration

Parameters
# docker-compose.yml
version: '3'
services:
  wordpress:
    image: wordpress
    ports:
      - "80:80"
    links:
      - mysql
    logging:
      driver: awslogs
      options: 
        awslogs-group: tutorial-wordpress
        awslogs-region: us-east-1
        awslogs-stream-prefix: wordpress
  mysql:
    image: mysql:5.7
    environment:
      MYSQL_ROOT_PASSWORD: password
    logging:
      driver: awslogs
      options: 
        awslogs-group: tutorial-mysql
        awslogs-region: us-east-1
        awslogs-stream-prefix: mysql
# ecs-params.yml
version: 1
task_definition:
  services:
    wordpress:
      cpu_shares: 100
      mem_limit: 524288000
      gpu: "1"
    mysql:
      cpu_shares: 100
      mem_limit: 524288000
Cluster

The cluster gpu-support consists of 4 p2.xlarge instances and 1 t2.micro instance. Each p2.xlarge provides one GPU; the t2.micro has none.

Tests

Create and run a task

ecs-cli compose up --create-log-groups --cluster-config gpu-support --region us-east-2

INFO[0000] Using ECS task definition                     TaskDefinition="ecscli-gpu-support:1"
WARN[0001] Failed to create log group tutorial-wordpress in us-east-1: The specified log group already exists 
WARN[0001] Failed to create log group tutorial-mysql in us-east-1: The specified log group already exists 
WARN[0001] No log groups to create; no containers use 'awslogs' 
INFO[0001] Starting container...                         container=faacc82d-7f59-4d5c-bec4-da83f51e7070/wordpress
INFO[0001] Starting container...                         container=faacc82d-7f59-4d5c-bec4-da83f51e7070/mysql
INFO[0001] Describe ECS container status                 container=faacc82d-7f59-4d5c-bec4-da83f51e7070/mysql desiredStatus=RUNNING lastStatus=PENDING taskDefinition="ecscli-gpu-support:1"
INFO[0001] Describe ECS container status                 container=faacc82d-7f59-4d5c-bec4-da83f51e7070/wordpress desiredStatus=RUNNING lastStatus=PENDING taskDefinition="ecscli-gpu-support:1"
INFO[0014] Describe ECS container status                 container=faacc82d-7f59-4d5c-bec4-da83f51e7070/mysql desiredStatus=RUNNING lastStatus=PENDING taskDefinition="ecscli-gpu-support:1"
INFO[0014] Describe ECS container status                 container=faacc82d-7f59-4d5c-bec4-da83f51e7070/wordpress desiredStatus=RUNNING lastStatus=PENDING taskDefinition="ecscli-gpu-support:1"
INFO[0026] Started container...                          container=faacc82d-7f59-4d5c-bec4-da83f51e7070/mysql desiredStatus=RUNNING lastStatus=RUNNING taskDefinition="ecscli-gpu-support:1"
INFO[0026] Started container...                          container=faacc82d-7f59-4d5c-bec4-da83f51e7070/wordpress desiredStatus=RUNNING lastStatus=RUNNING taskDefinition="ecscli-gpu-support:1"
Create and run a service

ecs-cli compose --cluster-config gpu-support --region us-east-2 service up

INFO[0000] Using ECS task definition                     TaskDefinition="ecscli-gpu-support:1"
INFO[0000] Created an ECS service                        service=ecscli-gpu-support taskDefinition="ecscli-gpu-support:1"
INFO[0001] Updated ECS service successfully              desiredCount=1 force-deployment=false service=ecscli-gpu-support
INFO[0016] (service ecscli-gpu-support) has started 1 tasks: (task c40c2067-9806-4145-8861-1ba9a3cd62cf).  timestamp="2019-03-13 20:55:41 +0000 UTC"
INFO[0031] Service status                                desiredCount=1 runningCount=1 serviceName=ecscli-gpu-support
INFO[0031] ECS Service has reached a stable state        desiredCount=1 runningCount=1 serviceName=ecscli-gpu-support
Increase the number of tasks in the service

ecs-cli compose --cluster-config gpu-support --region us-east-2 service scale 3

INFO[0000] Updated ECS service successfully              desiredCount=3 force-deployment=false service=ecscli-gpu-support
INFO[0000] Service status                                desiredCount=3 runningCount=1 serviceName=ecscli-gpu-support
INFO[0015] (service ecscli-gpu-support) has started 2 tasks: (task 22f26c9b-0bbc-4547-9f7e-41cfbe25e1b7) (task c14690c0-aa9f-46d6-990a-d099ac72279d).  timestamp="2019-03-13 20:58:19 +0000 UTC"
INFO[0031] Service status                                desiredCount=3 runningCount=3 serviceName=ecscli-gpu-support
INFO[0031] ECS Service has reached a stable state        desiredCount=3 runningCount=3 serviceName=ecscli-gpu-support
No more instances available to run tasks

ecs-cli compose --cluster-config no-gpu-support --region us-east-2 service scale 4

INFO[0000] Updated ECS service successfully              desiredCount=4 force-deployment=false service=ecscli-gpu-support
INFO[0000] Service status                                desiredCount=4 runningCount=3 serviceName=ecscli-gpu-support
INFO[0015] (service ecscli-gpu-support) was unable to place a task because no container instance met all of its requirements. The closest matching (container-instance abac8633-d532-4539-82eb-d5a06ff51d54) has insufficient GPU resource available. For more information, see the Troubleshooting section of the Amazon ECS Developer Guide.  timestamp="2019-03-13 20:59:09 +0000 UTC"

INFO[0077] (service ecscli-gpu-support) was unable to place a task because no container instance met all of its requirements. The closest matching (container-instance abac8633-d532-4539-82eb-d5a06ff51d54) doesn't have the agent connected. For more information, see the Troubleshooting section of the Amazon ECS Developer Guide.  timestamp="2019-03-13 21:00:11 +0000 UTC"
INFO[0107] (service ecscli-gpu-support) was unable to place a task because no container instance met all of its requirements. The closest matching (container-instance abac8633-d532-4539-82eb-d5a06ff51d54) has insufficient GPU resource available. For more information, see the Troubleshooting section of the Amazon ECS Developer Guide.  timestamp="2019-03-13 21:00:44 +0000 UTC"
FATA[0306] Deployment has not completed: Running count has not changed for 5.00 minutes 

@@ -123,6 +123,15 @@ func reconcileContainerDef(inputCfg *adapter.ContainerConfig, ecsConDef *Contain
if err != nil {
return nil, err
}

if ecsConDef.Gpu != "" {
Contributor

Will ecsConDef.Gpu ever be nil?

Contributor Author

@efekarakus Mar 15, 2019

From my understanding, ecsConDef.Gpu matches the gpu field in ecs-params.yml if the field exists. If the field is not in the container definition, it defaults to its zero value, which is the empty string "". (I tested this.)

So I don't think there can be a scenario where it's nil.
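
A minimal sketch of the zero-value behavior described above. It uses stand-in types rather than the real ecs-cli or AWS SDK ones: `containerDef`, `resourceRequirement`, and `gpuRequirements` are hypothetical names for illustration, and only the `Gpu` field and the `!= ""` style guard come from the diff.

```go
package main

import "fmt"

// containerDef is a hypothetical stand-in for the ContainerDef struct in the diff.
// A Go string field can never be nil; when the YAML key is absent the field
// keeps its zero value, the empty string.
type containerDef struct {
	Gpu string
}

// resourceRequirement is a hypothetical stand-in for a type/value pair
// such as the one the ECS API accepts for GPUs.
type resourceRequirement struct {
	Type  string
	Value string
}

// gpuRequirements returns a GPU requirement only when the field was set.
func gpuRequirements(def containerDef) []resourceRequirement {
	if def.Gpu == "" {
		return nil
	}
	return []resourceRequirement{{Type: "GPU", Value: def.Gpu}}
}

func main() {
	var unset containerDef // Gpu is "" here, not nil
	fmt.Println(len(gpuRequirements(unset)))
	fmt.Println(gpuRequirements(containerDef{Gpu: "1"}))
}
```

Because the guard compares against the empty string, a container without a gpu entry produces no requirement at all, which is why a nil check is unnecessary.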

@@ -66,6 +66,7 @@ type ContainerDef struct {
MemoryReservation libYaml.MemStringorInt `yaml:"mem_reservation"`
HealthCheck *HealthCheck `yaml:"healthcheck"`
Secrets []Secret `yaml:"secrets"`
Gpu string `yaml:"gpu"`
Contributor

I realize that the docs specify this field as a string, but can we verify whether there's any good reason it can't be a number? We can always convert to a string in our own code (though there is the annoying issue that the zero value of an int in Go is 0, so we would have to make sure that's handled correctly).

Contributor

I think it's best to make it a string since that's what the API takes, and for the default value reason which you mentioned.

Contributor

Also nit: This should be "GPU" to fit golang initialism practices

Contributor Author

@SoManyHs I went with string to keep it consistent with the API, as @PettitWesley pointed out. I'm wary of using a different type than the backend: if they later allow values like "two", our solution would be more limiting.
@PettitWesley fixed the variable name. I wrote it that way because I saw Cpu a few lines above it 😜
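
The trade-off discussed in this thread can be sketched as follows. This is an illustration only, not code from the PR: `intDef`, `stringDef`, and `gpuCount` are hypothetical names, and the PR itself passes the string through without parsing it.

```go
package main

import (
	"fmt"
	"strconv"
)

// intDef shows the ambiguity raised in the review: with a plain int,
// an omitted field and an explicit 0 are indistinguishable.
type intDef struct {
	GPU int
}

// stringDef keeps the field as a string, so "" unambiguously means "not set".
type stringDef struct {
	GPU string
}

// gpuCount is a hypothetical helper that parses the string only when a number
// is needed. The bool result reports whether the field was set at all.
func gpuCount(def stringDef) (int, bool, error) {
	if def.GPU == "" {
		return 0, false, nil
	}
	n, err := strconv.Atoi(def.GPU)
	if err != nil {
		return 0, true, fmt.Errorf("gpu value %q is not an integer: %w", def.GPU, err)
	}
	return n, true, nil
}

func main() {
	// Both print 0: an omitted int field looks identical to an explicit zero.
	fmt.Println(intDef{}.GPU, intDef{GPU: 0}.GPU)

	for _, v := range []string{"", "2", "two"} {
		n, set, err := gpuCount(stringDef{GPU: v})
		fmt.Println(n, set, err)
	}
}
```

Keeping the string and deferring any parsing means a value such as "two" is rejected only by code that actually needs a number, so the struct itself stays as permissive as the backend.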

@@ -66,6 +66,7 @@ type ContainerDef struct {
MemoryReservation libYaml.MemStringorInt `yaml:"mem_reservation"`
HealthCheck *HealthCheck `yaml:"healthcheck"`
Secrets []Secret `yaml:"secrets"`
GPU string `yaml:"gpu"`
Contributor

In the container definition, GPU resources are specified within a list of 'resourceRequirement' objects, vs. as a flat value. If new resourceRequirement types are added in the future, how do you propose we add them? Flattened as well? What if multiple requirements of the same type are allowed? Wondering what your ideas are here re: keeping this extensible.

One alternative would be to take in a list of these, like we do for secrets:

secrets:
  - value_from: string
    name: string
...
resource_requirements:
  - type: GPU
    value: 3
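
To make the extensibility question concrete, here is a rough sketch of how a flat gpu field and a list of requirements could be folded into one list. This is not what the PR implements (it adopted the flat field); `containerDef`, `resourceRequirement`, and `normalize` are hypothetical stand-ins, and the assumption that both shapes could coexist is mine.

```go
package main

import "fmt"

// resourceRequirement is a hypothetical stand-in mirroring the type/value
// pairs in the proposed resource_requirements list.
type resourceRequirement struct {
	Type  string
	Value string
}

// containerDef is a hypothetical struct carrying both shapes under discussion:
// the flat gpu field and the list form.
type containerDef struct {
	GPU                  string
	ResourceRequirements []resourceRequirement
}

// normalize folds the flat field into the list so downstream code
// only ever handles one shape.
func normalize(def containerDef) []resourceRequirement {
	// Copy so the caller's slice is never modified.
	reqs := append([]resourceRequirement(nil), def.ResourceRequirements...)
	if def.GPU != "" {
		reqs = append(reqs, resourceRequirement{Type: "GPU", Value: def.GPU})
	}
	return reqs
}

func main() {
	fmt.Println(normalize(containerDef{GPU: "3"}))
	fmt.Println(normalize(containerDef{
		ResourceRequirements: []resourceRequirement{{Type: "GPU", Value: "3"}},
	}))
}
```

Under this sketch a future requirement type would only need a new list entry, while the flat field stays a convenience for the common single-GPU case.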

Contributor

@PettitWesley Mar 15, 2019

Ask @efekarakus to explain this offline, he has an explanation for why he chose this approach. I had the same question.

Contributor Author

Discussed offline 😝

Contributor

Looks like this was resolved, but Efe was following a design choice I made in a previous design doc.

@efekarakus efekarakus merged commit d696be9 into aws:gpu Mar 18, 2019
@psharkey psharkey mentioned this pull request May 1, 2019
3 tasks
@efekarakus efekarakus mentioned this pull request May 15, 2019
5 tasks
@SoManyHs SoManyHs mentioned this pull request May 17, 2019
4 tasks
@SoManyHs SoManyHs mentioned this pull request Jun 10, 2019
5 tasks
@otterley otterley mentioned this pull request Jun 16, 2020
2 tasks
@bvtujo bvtujo mentioned this pull request Oct 1, 2020
3 tasks
@allisaurus allisaurus mentioned this pull request Dec 18, 2020
3 tasks
@testwill testwill mentioned this pull request Sep 4, 2023
7 tasks