Registry degrades and reports corruption #540
A little more from the log:
After a few more pulls it stops responding and nothing is written to the container log. Client:
This tells me you are having issues reaching your S3 bucket. @shin- what do you think?
If it's S3 related I will deploy a boto.cfg with the region set and debug enabled. Initially a reboot of the registry container helped, so I did not blame S3 connectivity, but I'm not entirely sure any longer. It's currently not possible to set the S3 region in the registry config, correct?
Right - it's not possible if you are running it inside a docker container (possibly a gevent version issue).
I've hard-wired boto against eu-west-1 but I am still seeing the same issues. I noticed in the log today that worker threads reported problems inside the container:
If it's S3 related, any ideas on how to proceed with debugging? I've seen quite a few 404s in the logs, but they are always for an _inprogress file; is that expected? Another error with current debug: Client:
Server:
@Henkis the 404s are irrelevant. What matters here is the timeouts reaching your bucket. Unfortunately, I have little help to offer as far as debugging is concerned. I would start with a python script written from scratch using boto S3 (https://github.com/boto/boto) and query my bucket repeatedly until I trigger the issue, then dig into boto. Keep me posted on any progress on this.
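For anyone who wants to try this, a minimal sketch of such a probe script using boto 2.x; the bucket name and key path are placeholders, not values from this thread:

```python
# Repeatedly download one layer key from the registry bucket and time it,
# to see whether S3 reads intermittently stall or time out.
import time
from boto.s3.connection import S3Connection, OrdinaryCallingFormat

conn = S3Connection(calling_format=OrdinaryCallingFormat())  # credentials come from the env or boto.cfg
bucket = conn.get_bucket('my-registry-bucket')               # placeholder bucket name
key = bucket.get_key('registry/images/<layer-id>/layer')     # placeholder key path (must exist)

for attempt in range(50):
    start = time.time()
    try:
        key.get_contents_to_filename('/tmp/layer.test')
        print('attempt %d: ok in %.1fs' % (attempt, time.time() - start))
    except Exception as exc:  # boto surfaces socket timeouts / S3 errors here
        print('attempt %d: failed after %.1fs: %r' % (attempt, time.time() - start, exc))
```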
@Henkis can you try bumping your boto timeout values:
inside boto.cfg ... and report here if that helps? Thanks.
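For reference, boto reads these from an INI-style boto.cfg; a sketch of the kind of settings being discussed (the numbers are illustrative, and the [s3] host entry is my assumption for pinning the eu-west-1 endpoint, not something quoted above):

```ini
[Boto]
debug = 2
num_retries = 10
http_socket_timeout = 60

[s3]
host = s3-eu-west-1.amazonaws.com
```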
I have tried adding the socket timeout to my boto.cfg; it doesn't help. Same problems, but maybe some more debug output will help:
Ok, the greenlet is actually timing out. That might be a gunicorn bug you are hitting. By any chance, would you be able to run off master? (I know there are quite a few changes...) Or better, force gunicorn to 19.1 instead of 18.0 (inside requirements/main.txt). Thanks!
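Assuming requirements/main.txt pins gunicorn with an exact version (as the 18.0 reference suggests), the change amounts to editing that one line:

```
gunicorn==19.1    # previously gunicorn==18.0
```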
Hi, I have tried your solution by forcing gunicorn to 19.1, and I still have the same problem but a different error. The first error was just like Henkis's: the greenlet was timing out. Now I get this:
or this:
At least the greenlet is no longer going off into la-la land. So:
This smells to me like a known regression where cancelled requests wouldn't be handled correctly. Assuming you are indeed behind nginx, can you try killing keep-alive? (e.g.:
Right now there is no nginx; we access the registry directly on the local network on port 5000. After these errors the docker pull command just fails, and that's it; after multiple attempts it works...
@mcadam that's likely this: benoitc/gunicorn#818
I've been experiencing exactly the same symptoms and it appears the problem, at least in my case, is in fact downloading from S3. I created a tiny boto script to download a 69M layer:
and found it took 4s to run the first time, then 15s, 23s, 9s and finally after 2m3s:
Presumably this kind of unreliability with S3 isn't normal, so I'm investigating further...
Same issue here:
Restarting does not solve the problem. |
We're having similar problems with 0.9.0. Everything seems to work fine when just one or two machines are pulling at the same time, but when we let our frontend array (30+ machines) pull at the same time, all hell breaks loose. Our workaround has been to put haproxy in front, limited so that it allows only as many connections to the registry as there are worker threads running, and to spread the image pulls over a larger time period. We're also running the registry on three different machines, all using the same S3 backend, so that haproxy has more backends to spread the requests across.
@garo and others
Thanks! The storage_redirect seems to fix all our issues. To others: just add "storage_redirect: true" to the registry.yaml. You can verify that it works by using tcpdump: "tcpdump -l -A -s 1500 src port 5000 |grep Location:" and you should get nice headers like: "Location: https://yourbucket.s3.amazonaws.com/..."
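A sketch of where that setting lives in the YAML; the flavor name and the storage line are placeholders for whatever your config already contains, and storage_redirect is the only addition:

```yaml
prod:
  storage: s3
  storage_redirect: true
```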
To add to @garo's comment (and to save you some digging) you can add
We've got this issue too on our registries. We're running them inside an AWS VPC, with one registry container per host and 6 instances currently. Those instances are fronted by an elastic load balancer whose timeout is set to 5 minutes. We're using S3 as a storage backend. After finding this thread, we set the registry with the -e STORAGE_REDIRECT=true option to delegate image downloads directly to S3. That has helped tremendously with the myriad of errors we were getting, such as EOF. However, we're still getting EOF errors on some registry calls that aren't actual image layer retrievals. For example, I saw an EOF error on an /ancestry call. I don't see any errors in the Docker Registry logs when these types of issues happen, so I'm sort of at a loss. These errors seem to happen when we have a heavier load on the registries, such as 5-10 images pulling from the registries at the same time. However, that doesn't seem like a heavy load, especially with storage redirect and 6 instances of the registry behind a load balancer.
@ALL we recently removed the ParallelKey fetch from S3 from the registry - it was triggering timeouts for large objects, and cluttering disk space with orphaned temporary files.
Hmm, that might help. Is there any documentation on how to disable that feature on our registries? I'm using the container, so I've usually just set environment variables that map to what's in the YAML config file, but I don't see the option you mentioned in that YAML config file schema.
It's a code removal (#961).
Ok, I'm trying the fix out now; I'll have to wait a few days to validate whether our percentage of deploy failures when pulling from the registries goes down. If anyone else is interested in trying this fix quickly, I have a container built on https://registry.hub.docker.com/u/dsw88/registry/. The tag is 0.9.1-parallelkey-fix. I built the container from master, which appears safe at this point since there haven't been any major code changes since the 0.9.1 stable release on Jan. 8: https://github.com/docker/docker-registry/commits/master
Ok, it looks like I'm still seeing the EOF errors after applying the ParallelKey fix. I don't know yet whether their occurrence has been reduced, since I'll need to wait a few days to have enough real-world data from deploys. Regardless, it appears that the ParallelKey removal isn't the ultimate fix for this issue. Any other suggestions? Do you think it would be good to set up a Redis LRU cache? I'm wondering if having that cache would decrease the number of errors, since the registry would need to make fewer round-trips to S3.
I would definitely recommend using the Redis LRU cache.
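For reference, a sketch of the Redis cache settings in docker-registry's config.yml; the hosts, ports and db numbers are placeholders, and config_sample.yml in the repo lists the full set of keys:

```yaml
common:
  cache:          # Redis cache
    host: redis.internal
    port: 6379
    db: 0
  cache_lru:      # Redis LRU cache
    host: redis.internal
    port: 6379
    db: 1
```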
Ok, so I'm now using the LRU cache and doing the S3 storage redirect. We continue to get EOF errors, but we'll see if this at least cuts them down. I'm still concerned about the underlying issue here, however, as adding a cache is at best masking the problem. I'll try to do some debugging in the registry to find out what's causing this error, but I'm not super familiar with the codebase or technology stack so it'll be slow going. One of my problems is that I can't see any errors in the registry logs when this problem occurs. @dmp42 do you have any suggestions about how I could go about getting enough information about the error to start debugging? The Docker client isn't much help when pulling images because it just says "unexpected EOF".
It appears that the small files like /ancestry and others are still failing periodically for us:
When I look in the registry logs, it shows that it got that request and even returned a 200, so presumably it thought it returned the image layer correctly:
So are the threads in the webapp dying or timing out while streaming the response, or something?
@dsw88 I now think you are not talking about the same issue that was initially described here (the recommended workarounds were/are meant to address registry-side EOF issues when communicating with S3). Do you use an HTTP proxy or a reverse proxy in front of your registry (nginx, HAProxy)? Also, you are running quite an old version of docker (1.1.2). I would strongly suggest upgrading if you can...
@dmp42 Sorry about that, I'll open a new issue for the EOF errors we're seeing even though we've already implemented the S3 storage redirect. I'll post the details of my setup there.
I'm seeing errors similar to the ones mentioned in this comment:
I am running docker-registry (
However, the only reason that I'm running the
The error I see in
This seems to come from one of these lines in docker's source code:
The reason I think this is an SSL verification issue is because we tested adding
To see what was happening with the requests, I manually tried running:
When I manually followed the redirect by running
So it's a catch-22. I need HTTPS security so I can run docker-registry with the
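To illustrate the kind of manual check described above, a sketch using Python requests (the registry URL and layer id are placeholders, and auth headers are omitted): the point is that the S3 redirect target for a bucket name containing dots cannot match the *.s3.amazonaws.com wildcard certificate, so verification fails.

```python
# Fetch a layer from the registry without following redirects, then follow
# the S3 redirect by hand with certificate verification enabled.
import requests

resp = requests.get('https://registry.example.com/v1/images/<layer-id>/layer',
                    allow_redirects=False, verify=True)
print(resp.status_code, resp.headers.get('Location'))

# With storage_redirect enabled this is a 302 to the bucket. If the bucket name
# contains dots (my.bucket.s3.amazonaws.com), the wildcard cert does not match
# and this raises requests.exceptions.SSLError.
requests.get(resp.headers['Location'], verify=True)
```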
I've tracked my SSL issue down to this problem with S3 bucket names. This seems to have solved the issue with using
To fix my issue I had to:
Now I no longer see the
Oops... I spoke too soon 😭 Correction... I still get the
The failing docker pull command was:
The error it gave was:
The docker daemon log
The single line with a timestamp is:
The
I believe this is still an issue somewhere in the
The setup is (all IPs, layer ids, and containers anonymized to protect the innocent):
Just noticed in my previous logs that the docker daemon (the client accessing docker-registry to pull images) shows the timestamp of the error as:
I searched for that time in the logs (
I checked the
So both hosts are synchronized to the correct time, but they are in different time zones. One is UTC, one is UTC-6. So we just add 6 hours to the docker daemon host's time:
Searching
Searching
So nginx is seeing the request from the docker daemon for
And nginx shows HTTP code
Timeline
Making the assumption that nginx is telling us the truth: the client closed its connection before it could hear a response. Here is what I think is happening:
So the
So from the time that the docker daemon makes the request
Potential Related Gunicorn Issue
Given the symptoms and the proposed chain of events... this could potentially be related to benoitc/gunicorn#414
Container Debug Info
The
And it has these python packages installed:
I'm able to reproduce this error easily by running the
Watch the logs by opening 3 terminals:
I then simply access the "Images" page through the
Here are all 3 anonymized logs
The other way to reproduce it, although perhaps not as easy, is to do lots of
Got another instance of this bug, with a different request for
Here are the logs. There are a couple of requests in the logs... the first couple are from the initial
This bug looks a lot like gevent/gevent#445
There are other similar bugs on
The ones that most notably seem related:
After enabling the
After disabling both of these options, these stability issues went away and errors became much less frequent (although they still occurred occasionally). Didn't have enough time to investigate further, but turning off search and disabling the
Since this project is deprecated, hopefully this helps anyone still stuck on Registry V1.
We are using a privately hosted registry on Amazon which seems to degrade after we have pulled larger images (1GB+) from it a few times; eventually it stops responding. A restart of the registry container seems to fix the problem for a while:
And again:
We often see these messages in the log but not always:
We currently use the latest registry (registry 0.8.0 cd3581c06bdc) but have had the same problems earlier.
Registry config:
Dockerfile:
config.yml:
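The reporter's actual config.yml is elided above; for context, a typical S3 flavor section for docker-registry of that era looked roughly like the sketch below (all values are placeholders, not the reporter's file, and note that an s3_region key was not available at 0.8, per the discussion earlier in this thread):

```yaml
prod:
  storage: s3
  s3_access_key: REDACTED
  s3_secret_key: REDACTED
  s3_bucket: my-registry-bucket
  boto_bucket: my-registry-bucket
  storage_path: /registry
  s3_secure: true
  s3_encrypt: false
```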