Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Needs test: Context deadline during Pull causes an already exists error from govmomi on a subsequent pull #1409

Open
matthewavery opened this issue Jul 7, 2016 · 18 comments
Labels
component/imagec component/portlayer/storage component/test Tests not covered by a more specific component label kind/debt Problems that increase the cost of other work severity/3-moderate Medium usability or functional impact. Potentially has an inconvenient workaround.

Comments

@matthewavery
Copy link
Contributor

matthewavery commented Jul 7, 2016

During a pull busybox it looks like a context deadline occurred somewhere on the portlayer side during a write image causing an inconsistent state. On repeat of the pull the layer is found to already exists after trying a new right. On a subsequent create the busybox image is still not found and an attempt to pull it is made. The appropriate response should have been that the layer already exists on the second pull.

 % sudo ./vic-machine-linux create --target skinner.eng.vmware.com --user root --image-datastore Datastore --password ca\$hc0w --bridge-network vch-test --name mavery-vch-1 --compute-resource /Datacenter/host/Austin/Resources/mavery --volume-store=test-store:Datastore/testStore --generate-cert=false
INFO[2016-07-07T14:26:22-05:00] ### Installing VCH ####                      
WARN[2016-07-07T14:26:22-05:00] Configuring without TLS - to enable use -generate-cert or -key/-cert parameters 
INFO[2016-07-07T14:26:22-05:00] Validating supplied configuration            
INFO[2016-07-07T14:26:22-05:00] Firewall status: DISABLED on /Datacenter/host/Austin/patty.eng.vmware.com 
INFO[2016-07-07T14:26:22-05:00] Firewall configuration OK on hosts:          
INFO[2016-07-07T14:26:22-05:00]   /Datacenter/host/Austin/patty.eng.vmware.com 
INFO[2016-07-07T14:26:23-05:00] License check OK on hosts:                   
INFO[2016-07-07T14:26:23-05:00]   /Datacenter/host/Austin/patty.eng.vmware.com 
INFO[2016-07-07T14:26:23-05:00] DRS check OK on:                             
INFO[2016-07-07T14:26:23-05:00]   /Datacenter/host/Austin/Resources/mavery   
INFO[2016-07-07T14:26:27-05:00] Creating Resource Pool mavery-vch-1          
INFO[2016-07-07T14:26:27-05:00] Creating directory [Datastore] testStore     
INFO[2016-07-07T14:26:27-05:00] Datastore path is [Datastore] testStore      
INFO[2016-07-07T14:26:27-05:00] Creating appliance on target                 
INFO[2016-07-07T14:26:27-05:00] Network role client is sharing NIC with external 
INFO[2016-07-07T14:26:27-05:00] Network role management is sharing NIC with external 
INFO[2016-07-07T14:26:30-05:00] Uploading images for container               
INFO[2016-07-07T14:26:30-05:00]     bootstrap.iso                               
INFO[2016-07-07T14:26:30-05:00]     appliance.iso                               
INFO[2016-07-07T14:26:42-05:00] Registering VCH as a vSphere extension       
INFO[2016-07-07T14:26:51-05:00] Waiting for IP information                   
INFO[2016-07-07T14:28:51-05:00] Waiting for major appliance components to launch 
INFO[2016-07-07T14:28:51-05:00] Initialization of appliance successful       
INFO[2016-07-07T14:28:51-05:00]                                              
INFO[2016-07-07T14:28:51-05:00] SSH to appliance (default=root:password)     
INFO[2016-07-07T14:28:51-05:00] ssh [email protected]                        
INFO[2016-07-07T14:28:51-05:00]                                              
INFO[2016-07-07T14:28:51-05:00] Log server:                                  
INFO[2016-07-07T14:28:51-05:00] http://10.17.109.94:2378                     
INFO[2016-07-07T14:28:51-05:00]                                              
INFO[2016-07-07T14:28:51-05:00] DOCKER_HOST=10.17.109.94:2375                
INFO[2016-07-07T14:28:51-05:00] DOCKER_OPTS="-H 10.17.109.94:2375"           
INFO[2016-07-07T14:28:51-05:00]                                              
INFO[2016-07-07T14:28:51-05:00] Connect to docker:                           
INFO[2016-07-07T14:28:51-05:00] docker -H 10.17.109.94:2375 info             
INFO[2016-07-07T14:28:51-05:00] Installer completed successfully             
matthewavery@ubuntu ~/go/src/github.com/vmware/vic/bin
 % docker -H 10.17.109.94:2375 pull busybox
Using default tag: latest
latest: Pulling from library/busybox

a3ed95caeb02: Extracting [==================================================>]     32 B/32 B
8ddc19f16526: Pull complete 
Failed to write to image store: context deadline exceeded
matthewavery@ubuntu ~/go/src/github.com/vmware/vic/bin
 % docker -H 10.17.109.94:2375 pull busybox
Using default tag: latest
latest: Pulling from library/busybox

8ddc19f16526: Already exists 
a3ed95caeb02: Extracting [==================================================>]     32 B/32 B
Failed to write to image store: [POST /storage/{store_name}][500] WriteImage default  &{Code:0xc82039f368 Message:Cannot complete the operation because the file or folder [Datastore] VIC/4223eb68-0def-b8f7-2f5c-4d5da6568e9d/images/b05baf071fd542c3146f08e5f20ad63e76fa4b4bd91f274c4838ddc41f3409f8/b05baf071fd542c3146f08e5f20ad63e76fa4b4bd91f274c4838ddc41f3409f8.vmdk already exists}
matthewavery@ubuntu ~/go/src/github.com/vmware/vic/bin
 % 
matthewavery@ubuntu ~/go/src/github.com/vmware/vic/bin
 % docker -H 10.17.109.94:2375 create -v /testVol busybox /bin/sh
Unable to find image 'busybox:latest' locally
latest: Pulling from library/busybox

8ddc19f16526: Already exists 
a3ed95caeb02: Already exists 
Digest: sha256:65ce39ce3eb0997074a460adfb568d0b9f0f6a4392d97b6035630c9d7bf92402
Status: Image is up to date for library/busybox:latest
08ac166ac3736a481bc32627e465a8140ce126617edb6d43185b05b83b4a9c3a
matthewavery@ubuntu ~/go/src/github.com/vmware/vic/bin
 % docker -H 10.17.109.94:2375 create -v /testVol busybox /bin/sh
8baba67575c9c26d4b90eb83e61f465629a498214494092730a34354ab3f7a7d
matthewavery@ubuntu ~/go/src/github.com/vmware/vic/bin
 % docker -H 10.17.109.94:2375 pull busybox                      
Using default tag: latest
latest: Pulling from library/busybox

8ddc19f16526: Already exists 
a3ed95caeb02: Already exists 
Digest: sha256:65ce39ce3eb0997074a460adfb568d0b9f0f6a4392d97b6035630c9d7bf92402
Status: Image is up to date for library/busybox:latest
matthewavery@ubuntu ~/go/src/github.com/vmware/vic/bin

@matthewavery matthewavery added kind/defect Behavior that is inconsistent with what's intended kind/investigation A scoped effort to learn the answers to a set of questions which may include prototyping labels Jul 7, 2016
@mdubya66
Copy link
Contributor

assigning to @hickeng for triage

@hickeng
Copy link
Member

hickeng commented Jul 26, 2016

Assigning @jzt to look into the context timeout - pulls could take a very long time with a slow network connection; this is not an operation that we should timeout (unless docker has a timeout here, in which case it should be on the personality side). Putting in initial estimate for investigation and fix of timeout.

#933 is the issue that will address the already exist/re-pull behaviour.

@hickeng hickeng assigned jzt and unassigned hickeng Jul 26, 2016
@mhagen-vmware mhagen-vmware added the impact/doc/note Requires creation of or changes to an official release note label Aug 8, 2016
@jzt
Copy link
Contributor

jzt commented Sep 14, 2016

There are currently no context timeouts from the persona to the portlayer, and imagec uses a 3600 second timeout for pulls, so I don't think that's it. There is one in the disk package (5 seconds) that may be where this came from:

ctx, _ = context.WithTimeout(ctx, timeout)

This is exercised during the WriteImage path. If you are still seeing these timeouts, this is probably the culprit.

Tagging @fdawg4l in case he has something to add.

@fdawg4l
Copy link
Contributor

fdawg4l commented Sep 14, 2016

Sounds reasonable. We can modify the timeout in the VCH; the time it takes to create the vmdk and attach it to the VCH is completely up to vsphere. I don't know what the upper bound of that operation is on a busy system. @hickeng?

@hickeng
Copy link
Member

hickeng commented Sep 16, 2016

@fdawg4l Almost unbounded I think. We're generally moving to a cancellation based approach in the PL, and the caller (in this case the personality) can furnish timeouts if that's required semantics.

@fdawg4l
Copy link
Contributor

fdawg4l commented Sep 16, 2016

Sounds good. We'll move to cancellation and nuke the timeout in the disk pkg.

As far as this issue goes, we need 2 things
1 - We need to prevent more than 1 image being pulled at the same time
2 - We need to cleanup when a pull fails.

2 is done. We merged this a little over a month ago. We just need 1 now and to remove the timeout. That is not to say we aren't going to see a timeout somewhere else in the stack- does the docker client have a timeout? Do the handlers in the docker daemon have a default timeout per handler?

We can solve the things mentioned, but largely close the issue since subsquent pulls already work. Timeouts / cancellation in general need their own epics.

@fdawg4l
Copy link
Contributor

fdawg4l commented Sep 19, 2016

We can solve the things mentioned, but largely close the issue since subsquent pulls already work. Timeouts / cancellation in general need their own epics.

#2385

@mdubya66
Copy link
Contributor

Workaround: Issue the command again and it should work.

@matthewavery
Copy link
Contributor Author

This may involve a fair amount of code change, and since there is a workaround this may not be a big issue for 0.8.0. It also occurs rarely, so it might be best if we table this one for now. @mdubya66 @hickeng

@mdubya66 mdubya66 modified the milestones: VIC GA Release, v1.0 Nov 2, 2016
@matthewavery matthewavery reopened this Nov 2, 2016
@fdawg4l
Copy link
Contributor

fdawg4l commented Nov 2, 2016

@matthewavery, @jzt's recent imagec changes prevent multiple pulls of the same layer id from happening. It's likely this is now fixed.

@matthewavery matthewavery removed the impact/doc/note Requires creation of or changes to an official release note label Nov 7, 2016
@stuclem
Copy link
Contributor

stuclem commented Nov 7, 2016

Already in the release notes, so removing the kind/note tag.

1 similar comment
@stuclem
Copy link
Contributor

stuclem commented Nov 7, 2016

Already in the release notes, so removing the kind/note tag.

@mhagen-vmware mhagen-vmware added priority/p4 component/test Tests not covered by a more specific component label labels Feb 1, 2017
@mhagen-vmware
Copy link
Contributor

Marking this a low priority est 2 issue now, with the assumption this is fixed already and now making this a test item to make sure we have a test that covers this scenario.

@hickeng
Copy link
Member

hickeng commented Apr 13, 2017

@jzt can you confirm that this is fixed?

@corrieb
Copy link
Contributor

corrieb commented Jul 19, 2017

@jzt ^^^ let's close if this is fixed

@jzt
Copy link
Contributor

jzt commented Jul 19, 2017

@corrieb @mhagen-vmware I'm down to close. I haven't seen/heard of this happening for quite a long time. @mhagen-vmware do you agree?

@hickeng
Copy link
Member

hickeng commented Jul 19, 2017

#5372 is another instance of this - as per #5372 (comment) we should be deleting incomplete layers so they can be re-pulled.
We shouldn't close this without a test that confirms correctness.

@mdubya66 mdubya66 added priority/p2 and removed priority/p4 kind/defect Behavior that is inconsistent with what's intended kind/investigation A scoped effort to learn the answers to a set of questions which may include prototyping labels Oct 31, 2017
@mdubya66
Copy link
Contributor

This looks to be resolved in 5372. Moving from bug to test issue. A test needs to be written to cover this situation.

@zjs zjs added severity/3-moderate Medium usability or functional impact. Potentially has an inconvenient workaround. kind/debt Problems that increase the cost of other work and removed priority/p2 team/container labels Mar 6, 2019
@zjs zjs changed the title Context deadline during Pull causes an already exists error from govmomi on a subsequent pull Needs test: Context deadline during Pull causes an already exists error from govmomi on a subsequent pull Mar 6, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/imagec component/portlayer/storage component/test Tests not covered by a more specific component label kind/debt Problems that increase the cost of other work severity/3-moderate Medium usability or functional impact. Potentially has an inconvenient workaround.
Projects
None yet
Development

No branches or pull requests

10 participants