Weird Couchbase Server problems running the Gateway #95
Comments
It's definitely past time to attack this problem head-on. Today I'm going to do a bunch of replications using CBS as a back-end and see if I can reproduce anything like this. Nico, one thing you could do is check whether this is due to something messed up in your Couchbase configuration. You can do that by stopping the server, then moving aside (not deleting!)
I actually haven't tried Couchbase Server on OS X, but on Ubuntu I did try to uninstall and purge the package (and setup) and then redownload and reinstall the .deb from the site. It didn't change anything in my specific case.
Can you describe that in more detail? (I'm not that good with Linux so I don't know what that status line (from top?) means.)
From the stack dumps you posted in the earlier reports, this looks like CBS isn't sending responses to view queries: the gateway is blocked way down in Go net/http code, waiting for response data on the socket.
The stack dump I referred to above is here -- look at goroutine 37, which is stuck in

Here's another stack dump showing a similar hang on launch -- this time it's in response to the server startup code configuring the users in the database.

In both cases the gateway is loading a user-info document, which has an out-of-date role list, which triggers querying a view to get the current set of roles for that user, and that query never completes. I wonder if for some reason the view indexing is just taking a very, very, very long time. It shouldn't be, but if it were, these are the symptoms you'd get. I'll ask someone with CBS expertise what to do to troubleshoot this.
(For my own reference: Aaron pointed me to an internal wiki page on debugging view issues, and Damien says Filipe is the in-house expert on CBS views.)
Yes, it was a line from top, showing that Couchbase Server's beam.smp process goes from nearly zero to 125% CPU usage as soon as I start the gateway, and it stays there.
You can
Okay, so I reset Couchbase Server entirely and create an empty bucket for sync_gateway. I start up the gateway, and it connects to Couchbase Server on localhost:8091. I replicate my CouchDB base db (500 docs, 150 MB total) to the gateway. It writes stuff to Couchbase Server and beam.smp spikes the CPU for a while. A minute after the replication is over (it's a continuous replication, but all docs have been added, and by watching the gateway logs I can tell no more docs are being added), :8092/_active_tasks gives:
~5-10 minutes later, CPU is still being consumed by beam.smp at 125%, and Couchbase's _active_tasks gives:
I've let it sit for over an hour before. This is a multicore machine with 4 GB RAM, 3 of which are allocated to the bucket. Couchbase Server setup is the default setup. 500 documents doesn't seem so crazy, but I'll let it sit overnight then report back. Also, this didn't happen a month ago, so I might need to downgrade everything (Couchbase and sync_gateway) commit by commit to find out where the faulty code lies.
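The check described above (looking at :8092/_active_tasks repeatedly to see whether indexing ever finishes) could be scripted roughly as follows. This is only a sketch: the function names are made up, and the `type`, `progress`, and `"indexer"` names assume a CouchDB-style task list rather than anything confirmed in this thread.

```python
import json
import urllib.request


def fetch_active_tasks(base_url="http://localhost:8092", timeout=5.0):
    """Fetch and decode the task list from <base_url>/_active_tasks."""
    url = f"{base_url}/_active_tasks"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def summarize_tasks(tasks):
    """Return one (type, progress) pair per task; missing fields become None.

    Assumption: each task is a JSON object with optional "type" and
    "progress" keys, as in CouchDB-style _active_tasks output.
    """
    return [(t.get("type"), t.get("progress")) for t in tasks]


def indexing_finished(tasks):
    """Treat indexing as done once no task of type "indexer" is listed."""
    return not any(t.get("type") == "indexer" for t in tasks)
```

Calling `summarize_tasks(fetch_active_tasks())` every minute or so and logging the result would show whether the reported progress moves at all while beam.smp is busy.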
Thanks for the info. I've emailed Filipe, our view-engine expert, asking for advice. Your documents are ~300 KB in size on average; that's pretty large. Is this due to attachments or is there a whole lot of JSON in each doc? The latter might slow down indexing because of all the JSON parsing.
The size is due to attachments (PNG images). There's only a handful of JSON properties.
For reference, eight hours later, these are the _active_tasks - and Couchbase Server is still going at 110-130% CPU.
FYI, this is now being tracked as a Couchbase Server issue.
Is this the sort of thing using CBFS for attachments would help with?
No, what Nico is running into seems to be a bug in Couchbase Server.
I tried running CBGB instead, but the sync_gateway gets into an infinite loop and the same thing happens: 100% CPU use until the end of time. This time, the logs show more stuff: https://gist.github.com/nl/6027619
Hey Nico, the Couchbase devs would like some more info from your server to help diagnose the problem. You can email the logs to me directly. Thanks!
I looked all over /opt/couchbase/var/lib/couchbase/logs, but couldn't find a couchbase.log. I'm on Ubuntu; where should I be looking for this file?
IIRC there is a tool called
The documentation says it's located at /opt/couchbase/bin/cbcollect_info on Linux, but it's not there. I installed from the pkg.
There hasn't been any update on the CBS Jira ticket in months. @nl, are you still having this problem, or did it get resolved, or did you just give up?
It seems to work fine with 2.1.1 now.
Hooray! Nightmare over.
Actually, the nightmare is still here. I forgot to test this: when relaunching sync_gateway after it has already run with Couchbase Server (i.e., the bucket is not empty), it doesn't start at all. The logs are identical to before: [...] The first run (empty bucket) works fine though... back to Walrus for me!
Well, it works. So never mind, kinda. The one thing that has me worried is I'm now seeing a ton of:
in the logs, in between actual CRUD/Access logging. Sometimes 5-10 in a row. There's also a bunch of:
What gives?
I'm commenting here as my issue seems related to the last comment above.

go-couchbase: TAP connection lost; reconnecting to bucket "mydb" in 1s

My setup is CBL on iOS 7, sync_gateway, and CBS 2.2.0-821-rel on Mac OS X 10 with 8 GB of memory. Sometimes there are just a few messages; at other times no reconnection is made. I think this is impacting my testing in the following scenario:

1. Launch the app on device 1 and replicate to sync_gateway. On sync_gateway, the TAP connection messages usually start after the initial push sync from the client.
2. Launch the app on device 2 and pull files from sync_gateway. (This always works.)
3. On device 1, generate new files and replicate to sync_gateway.

If the TAP connection messages are active, device 2 does not pull the new files. I guess this makes sense: no TAP connection, so no dynamic updates on the channel?

I have created a Gist from a typical scenario, which shows the sync_gateway log up to just after device 1 completes its first push sync. (Note: there are two vBuckets in this example.)
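To illustrate the "connection lost; reconnecting in 1s" behavior in general terms, here is a minimal reconnect loop. It is not go-couchbase's actual code: `open_feed`, `handle_event`, and the attempt limit are hypothetical names chosen for this sketch, and a dropped connection is modeled as a `ConnectionError`.

```python
import time


def run_feed_with_reconnect(open_feed, handle_event, max_attempts=5,
                            delay_s=1.0, sleep=time.sleep):
    """Consume events from a feed, reopening it after each connection loss.

    open_feed is a hypothetical callable returning an iterable of events;
    iterating it raises ConnectionError when the connection drops.
    Returns the number of reconnects that were needed.
    """
    reconnects = 0
    for attempt in range(max_attempts):
        try:
            for event in open_feed():
                handle_event(event)
            return reconnects  # the feed ended cleanly
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up so the caller knows updates have stopped
            reconnects += 1
            sleep(delay_s)  # wait before reopening the feed
    return reconnects
```

In a loop like this, whether changes made while the feed is down are ever delivered depends on whether the reopened feed replays them, which would be consistent with device 2 missing new files while the messages are appearing.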
Is Couchbase Server logging anything suspicious at the same time you get those errors?
The logs appear to just have start/stop status, so I generated a diagnostic dump and extracted the entries that appear to match the sync_gateway test times (this a.m.). The server seems to mirror the client with issues around TAP connections. Gist here
Trying to diagnose some recent “TAP connection lost” errors reported in #95.
Could you pull the latest sync_gateway and try again? I added some more logging to go-couchbase that will include the actual error that caused the TAP feed to fail. Hopefully with that info things will become clearer.
Yep, this looks more useful.
So this is very, very close to the size of one of the attachments that should be pulled from that DB; in CBS it shows as 1398100 bytes.
Yeah, this is an error from // The maximum reasonable body length to expect.
Can I work around this for now by just increasing the value and rebuilding? I only need to increase the ceiling to that example size.
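For anyone following along, the kind of ceiling check being discussed looks roughly like this. It is an illustrative sketch only: the function and exception names are hypothetical, and the ceiling values in the usage note are placeholders, not the library's actual constant.

```python
class BodyTooLargeError(ValueError):
    """Raised when a declared body length exceeds the configured ceiling."""


def check_body_len(declared_len, max_body_len):
    """Validate the body length declared in a response header.

    Returns declared_len when it is acceptable. Raises BodyTooLargeError
    when it exceeds max_body_len, so an oversized (or corrupt) length is
    rejected before any attempt to read that many bytes.
    """
    if declared_len < 0:
        raise ValueError(f"negative body length: {declared_len}")
    if declared_len > max_body_len:
        raise BodyTooLargeError(
            f"body of {declared_len} bytes exceeds ceiling of {max_body_len}"
        )
    return declared_len
```

With the 1398100-byte attachment from this thread, `check_body_len(1398100, 1_000_000)` raises, while `check_body_len(1398100, 2_000_000)` passes. That is the effect of raising the ceiling and rebuilding, at the cost of accepting larger bodies in general.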
I’ve moved the current issue to a new bug report to make things less confusing.
For about a month, I haven't been able to get sync_gateway to play nice with Couchbase Server.
I'm on Ubuntu 12.04 LTS 64-bit, 4 GB RAM, 4 cores.
Couchbase 2.0.1 is installed and running perfectly, with a 1000-document bucket of about 150 MB.
8602 couchbas 20 0 1401m 400m 38m S 9 10.1 0:46.12 beam.smp
I start the gateway:
04:59:06.919845 Enabling logging: [CRUD REST REST+]
04:59:06.920239 Enabling logging: [Access]
04:59:06.920590 Opening Couchbase database sync_gateway on http://localhost:8091
04:59:07.013524 Connected to http://localhost:8091, pool default, bucket sync_gateway
At this time, Couchbase Server goes insane and basically takes my server down.
8602 couchbas 20 0 2037m 494m 38m S 125 12.5 1:34.99 beam.smp
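For readers unfamiliar with the status lines above: they are process rows from top. A minimal sketch for pulling the CPU and memory columns out of such a row follows, assuming top's default column order (PID, USER, PR, NI, VIRT, RES, SHR, S, %CPU, %MEM, TIME+, COMMAND). The function name is hypothetical.

```python
TOP_COLUMNS = ["pid", "user", "pr", "ni", "virt", "res", "shr",
               "state", "cpu_pct", "mem_pct", "time", "command"]


def parse_top_line(line):
    """Split one process row from top into a dict of named fields.

    Assumes the default column layout listed in TOP_COLUMNS; the command
    is the last field and may itself contain spaces.
    """
    parts = line.split(None, len(TOP_COLUMNS) - 1)
    if len(parts) != len(TOP_COLUMNS):
        raise ValueError(f"unexpected top row: {line!r}")
    row = dict(zip(TOP_COLUMNS, parts))
    row["cpu_pct"] = float(row["cpu_pct"])  # percent of a single core
    row["mem_pct"] = float(row["mem_pct"])  # percent of physical RAM
    return row


row = parse_top_line("8602 couchbas 20 0 2037m 494m 38m S 125 12.5 1:34.99 beam.smp")
print(row["command"], row["cpu_pct"], row["mem_pct"])
```

In top's default mode, %CPU is relative to one core, so 125% on this 4-core machine means beam.smp is keeping more than one core fully busy.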
Also, the gateway doesn't work at this time: it doesn't open ports 4984 and 4985, and therefore doesn't respond to push/pull requests. The Couchbase Server HTTP UI still works, and I can query views on buckets and see documents.
I've let it sit for over an hour with no changes (> 100% CPU consumed). Ctrl-\ gives this: https://gist.github.com/nl/5864897
This is a big one, since Walrus doesn't seem to be able to handle that 150 MB database either (if the persistent option is enabled), and there are no other options for persistence that I know of.