Weird Couchbase Server problems running the Gateway #95

Closed
nl opened this issue Jun 26, 2013 · 33 comments

@nl

nl commented Jun 26, 2013

For about a month, I haven't been able to get sync_gateway to play nice with Couchbase Server.

I'm on Ubuntu 12.04 LTS, 64-bit, with 4 GB RAM and 4 cores.
Couchbase 2.0.1 is installed and running perfectly, with a 1000-document bucket totaling about 150 MB.
8602 couchbas 20 0 1401m 400m 38m S 9 10.1 0:46.12 beam.smp

I start the gateway:
04:59:06.919845 Enabling logging: [CRUD REST REST+]
04:59:06.920239 Enabling logging: [Access]
04:59:06.920590 Opening Couchbase database sync_gateway on http://localhost:8091
04:59:07.013524 Connected to http://localhost:8091, pool default, bucket sync_gateway

At this time, Couchbase Server goes insane and basically takes my server down.
8602 couchbas 20 0 2037m 494m 38m S 125 12.5 1:34.99 beam.smp

Also, the gateway doesn't work at this point: it doesn't open ports 4984 and 4985, and therefore doesn't respond to push/pull requests. The Couchbase Server HTTP UI still works, and I can query views on buckets and see documents.

I've let it sit for over an hour with no change (>100% CPU consumed). ctrl-\ gives this: https://gist.github.com/nl/5864897

This is a big one, since Walrus doesn't seem to be able to handle that 150 MB database either (if the persistent option is enabled), and there are no other options for persistence that I know of.

@snej
Contributor

snej commented Jun 26, 2013

It's definitely past time to attack this problem head-on. Today I'm going to do a bunch of replications using CBS as a back-end and see if I can reproduce anything like this.

Nico, one thing you could do is check whether this is due to something messed-up in your Couchbase configuration. You can do that by stopping the server, then moving aside (not deleting!) ~/Library/Application Support/Couchbase/, then starting the server again. You'll need to recreate any buckets and other settings. Then see if the gateway problem goes away.
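For example, one way to move the config aside (assuming the default Mac install location; stop the server first):

mv ~/Library/Application\ Support/Couchbase ~/Library/Application\ Support/Couchbase.aside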

@nl
Author

nl commented Jun 26, 2013

I actually haven't tried Couchbase Server on OS X, but on Ubuntu I did uninstall and purge the package (and its setup), then re-download and reinstall the .deb from the site. It didn't change anything in my case.
I might try downgrading to 2.0.0 and/or trying the OS X build to see if anything changes. Also, I'm happy to share my sample database (CouchDB or Couchbase format) if you can't reproduce the issue.

@snej
Contributor

snej commented Jun 26, 2013

At this time, Couchbase Server goes insane and basically takes my server down.
8602 couchbas 20 0 2037m 494m 38m S 125 12.5 1:34.99 beam.smp

Can you describe that in more detail? (I'm not that good with Linux so I don't know what that status line (from top?) means.)

Also, the gateway doesn't work at this point: it doesn't open ports 4984 and 4985, and therefore doesn't respond to push/pull requests.

From the stack dumps you posted in the earlier reports, this looks like CBS isn't sending responses to view queries — the gateway is blocked way down in Go net/http code, waiting for response data on the socket.

@snej
Contributor

snej commented Jun 26, 2013

The stack dump I referred to above is here -- look at goroutine 37, which is stuck in net/http.(*persistConn).roundTrip as a result of DatabaseContext.ComputeRolesForUser making a view query. This one isn't happening on launch, though; it's in response to an incoming HTTP request.

Here's another stack dump showing a similar hang on launch -- this time it's in response to the server startup code configuring the users in the database.

In both cases the gateway is loading a user-info document, which has an out-of-date role list, which triggers querying a view to get the current set of roles for that user, and that query never completes.

I wonder if for some reason the view indexing is just taking a very, very, very long time. It shouldn't be, but if it were, these are the symptoms you'd get. I'll ask someone with CBS expertise what to do to troubleshoot this.

@snej
Contributor

snej commented Jun 26, 2013

(For my own reference: Aaron pointed me to an internal wiki page on debugging view issues, and Damien says Filipe is the in-house expert on CBS views.)

@nl
Author

nl commented Jun 26, 2013

Yes, it was a line from top, showing that Couchbase Server's beam.smp process goes from nearly zero to 125% CPU usage as soon as I start the gateway, and stays there.

@snej
Contributor

snej commented Jun 26, 2013

You can GET :8092/_active_tasks to see what the view engine is doing. The output is basically the same as CouchDB's. The "type" property in each entry of the array is the task type, like indexer or blocked_indexer (though apparently indexers should only block if more than 4 are trying to run at once).
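For example (assuming the view API is on the default port 8092 of localhost), a quick way to poll it:

curl http://localhost:8092/_active_tasks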

@nl
Author

nl commented Jun 26, 2013

Okay, so I reset Couchbase Server entirely and create an empty bucket for sync_gateway. I start up the gateway; it connects to Couchbase Server on localhost:8091. I replicate my CouchDB base db (500 docs, 150 MB total) to the gateway. It writes everything to Couchbase Server, and beam.smp spikes the CPU for a while.

A minute after the replication is over (it's a continuous replication, but all docs have been added, and from the gateway logs I can tell no more are being added), :8092/_active_tasks gives:

[{"pid":"<0.12533.0>","limit":4,"running":1,"started_on":1372276223,"type":"couch_main_index_barrier","updated_on":1372276580,"waiting":0},
{"pid":"<0.12534.0>","limit":2,"running":0,"started_on":1372276223,"type":"couch_replica_index_barrier","updated_on":1372276223,"waiting":0},
{"pid":"<0.24455.0>","changes_done":0,"design_documents":["_design/sync_gateway"],"indexer_type":"main","initial_build":false,"progress":0,"set":"sync_gateway","signature":"a2f825658fd047b24dd9a72ccebe4d3f","started_on":1372276580,"total_changes":4,"type":"indexer","updated_on":1372276580}]

~5-10 minutes later, CPU is still being consumed by beam.smp at 125%, and Couchbase's _active_tasks gives:

[{"pid":"<0.12533.0>","limit":4,"running":1,"started_on":1372276223,"type":"couch_main_index_barrier","updated_on":1372276885,"waiting":0},
{"pid":"<0.12534.0>","limit":2,"running":0,"started_on":1372276223,"type":"couch_replica_index_barrier","updated_on":1372276223,"waiting":0},
{"pid":"<0.32477.0>","changes_done":0,"design_documents":["_design/sync_gateway"],"indexer_type":"main","initial_build":false,"progress":0,"set":"sync_gateway","signature":"a2f825658fd047b24dd9a72ccebe4d3f","started_on":1372276885,"total_changes":4,"type":"indexer","updated_on":1372276885}]

I've let it sit for over an hour before. This is a multicore machine with 4 GB RAM, 3 GB of which are allocated to the bucket. The Couchbase Server setup is the default. 500 documents doesn't seem so crazy, but I'll let it sit overnight and report back.

Also, this didn't happen a month ago, so I might need to downgrade everything (Couchbase and sync_gateway) commit by commit to find out where the faulty code lies.

@snej
Contributor

snej commented Jun 26, 2013

Thanks for the info — I've emailed Filipe, our view-engine expert, asking for advice.

Your documents are ~300 KB in size on average; that's pretty large. Is this due to attachments, or is there a whole lot of JSON in each doc? The latter might slow down indexing because of all the JSON parsing.

@nl
Author

nl commented Jun 27, 2013

The size is due to attachments (PNG images). There are only a handful of JSON properties.

@nl
Author

nl commented Jun 27, 2013

For reference, eight hours later this is the _active_tasks output - and Couchbase Server is still going at 110-130% CPU.

[{"pid":"<0.12533.0>","limit":4,"running":1,"started_on":1372276223,"type":"couch_main_index_barrier","updated_on":1372306292,"waiting":0},
{"pid":"<0.12534.0>","limit":2,"running":0,"started_on":1372276223,"type":"couch_replica_index_barrier","updated_on":1372276223,"waiting":0},
{"pid":"<0.14982.1>","design_documents":["_design/sync_gateway"],"indexer_type":"main","set":"sync_gateway","signature":"a2f825658fd047b24dd9a72ccebe4d3f","started_on":1372277578,"type":"blocked_indexer","updated_on":1372277578},
{"pid":"<0.10366.2>","design_documents":["_design/sync_gateway"],"indexer_type":"main","set":"sync_gateway","signature":"a2f825658fd047b24dd9a72ccebe4d3f","started_on":1372278868,"type":"blocked_indexer","updated_on":1372278868},
{"pid":"<0.27240.2>","design_documents":["_design/sync_gateway"],"indexer_type":"main","set":"sync_gateway","signature":"a2f825658fd047b24dd9a72ccebe4d3f","started_on":1372279618,"type":"blocked_indexer","updated_on":1372279618},
{"pid":"<0.15395.3>","design_documents":["_design/sync_gateway"],"indexer_type":"main","set":"sync_gateway","signature":"a2f825658fd047b24dd9a72ccebe4d3f","started_on":1372280548,"type":"blocked_indexer","updated_on":1372280548},
{"pid":"<0.8831.4>","design_documents":["_design/sync_gateway"],"indexer_type":"main","set":"sync_gateway","signature":"a2f825658fd047b24dd9a72ccebe4d3f","started_on":1372281718,"type":"blocked_indexer","updated_on":1372281718},
{"pid":"<0.4997.5>","design_documents":["_design/sync_gateway"],"indexer_type":"main","set":"sync_gateway","signature":"a2f825658fd047b24dd9a72ccebe4d3f","started_on":1372283008,"type":"blocked_indexer","updated_on":1372283008},
{"pid":"<0.7031.6>","design_documents":["_design/sync_gateway"],"indexer_type":"main","set":"sync_gateway","signature":"a2f825658fd047b24dd9a72ccebe4d3f","started_on":1372284538,"type":"blocked_indexer","updated_on":1372284538},
{"pid":"<0.11523.21>","design_documents":["_design/sync_gateway"],"indexer_type":"main","set":"sync_gateway","signature":"a2f825658fd047b24dd9a72ccebe4d3f","started_on":1372297322,"type":"blocked_indexer","updated_on":1372297322},
{"pid":"<0.21987.24>","design_documents":["_design/sync_gateway"],"indexer_type":"main","set":"sync_gateway","signature":"a2f825658fd047b24dd9a72ccebe4d3f","started_on":1372302183,"type":"blocked_indexer","updated_on":1372302183},
{"pid":"<0.61.25>","design_documents":["_design/sync_gateway"],"indexer_type":"main","set":"sync_gateway","signature":"a2f825658fd047b24dd9a72ccebe4d3f","started_on":1372302663,"type":"blocked_indexer","updated_on":1372302663},
{"pid":"<0.16107.25>","design_documents":["_design/sync_gateway"],"indexer_type":"main","set":"sync_gateway","signature":"a2f825658fd047b24dd9a72ccebe4d3f","started_on":1372303383,"type":"blocked_indexer","updated_on":1372303383},
{"pid":"<0.26133.33>","changes_done":0,"design_documents":["_design/sync_gateway"],"indexer_type":"main","initial_build":false,"progress":0,"set":"sync_gateway","signature":"a2f825658fd047b24dd9a72ccebe4d3f","started_on":1372306292,"total_changes":4,"type":"indexer","updated_on":1372306292}]

@snej
Contributor

snej commented Jun 27, 2013

FYI, this is now being tracked as a Couchbase Server issue.

@jchris
Contributor

jchris commented Jun 29, 2013

Is this the sort of thing using CBFS for attachments would help with?

@snej
Contributor

snej commented Jul 2, 2013

No, what Nico is running into seems to be a bug in Couchbase Server.

@nl
Author

nl commented Jul 18, 2013

I tried running CBGB instead, but sync_gateway gets into an infinite loop and the same thing happens: 100% CPU use until the end of time. This time the logs show more detail: https://gist.github.com/nl/6027619

@snej
Contributor

snej commented Aug 1, 2013

Hey Nico, the Couchbase devs would like some more info from your server to help diagnose the problem. You can email the logs to me directly. Thanks!

Aliaksey Artamonau added a comment - 31/Jul/13 4:32 PM
ns_server never kills any of the processes using SIGKILL or actually any other signal. This might be an indication of OOM killer. Or someone actually killed memcached manually. To verify whether it's OOM killer or not we need dmesg output. Otherwise, I don't think there's much to investigate.

Jens Alfke added a comment - 31/Jul/13 10:00 PM
Thanks, Aliaksey. What is dmesg and how should we ask the developer to collect its output?

Abhinav Dangeti added a comment - 01/Aug/13 10:58 AM
couchbase.log from cbcollect_info should contain the (kernel log buffer's) dmesg output.

@nl
Author

nl commented Aug 1, 2013

I looked all over /opt/couchbase/var/lib/couchbase/logs but couldn't find a couchbase.log. I'm on Ubuntu; where should I be looking for this file?

@snej
Contributor

snej commented Aug 1, 2013

IIRC there is a tool called cbcollect_info that you run, and one of its options is to dump various logging data.

@nl
Author

nl commented Aug 19, 2013

The documentation says it's located at /opt/couchbase/bin/cbcollect_info on Linux, but it's not there. I installed from the package.

@snej
Contributor

snej commented Oct 9, 2013

There hasn't been any update on the CBS Jira ticket in months. @nl, are you still having this problem, did it get resolved, or did you just give up?
If necessary I can go talk to the person the Jira ticket is assigned to, if you need him to take action on something. (Last I saw, he was asking you for more logs.)

@nl
Author

nl commented Oct 18, 2013

It seems to work fine with 2.1.1 now.

@nl nl closed this as completed Oct 18, 2013
@snej
Contributor

snej commented Oct 18, 2013

Hooray! Nightmare over.

@nl
Author

nl commented Oct 20, 2013

Actually, the nightmare is still here. I forgot to test this: when relaunching sync_gateway after it has already run against Couchbase Server (i.e. the bucket is not empty), it doesn't start at all. The logs are identical to before:

[...]
13:31:49.855071 Access: Computed roles for "": []
13:31:49.855903 Reset guest user to config
13:31:49.855993 Starting admin server on :4985
13:31:49.863742 Starting server on :4984 ...

The first run (empty bucket) works fine, though... back to Walrus for me!
As for the Jira folks, I can't find the cbcollect tool anywhere, sorry.

@nl nl reopened this Oct 20, 2013
@nl
Author

nl commented Oct 20, 2013

Well, it works. So never mind, kind of. The one thing that has me worried is that I'm now seeing a ton of:

2013/10/20 13:46:21 go-couchbase: TAP connection lost; reconnecting to bucket "sync_gateway" in 1s

in the logs, in between actual CRUD/Access logging. Sometimes 5-10 in a row. There's also a bunch of:

13:47:01.855394 WARNING: TruncateEncodedChangeLog: Couldn't find a safe place to truncate -- channels.TruncateEncodedChangeLog() at change_log_encoder.go:126

What gives?

@ghost

ghost commented Nov 6, 2013

I'm commenting here as my issue seems related to the last comment above.

go-couchbase: TAP connection lost; reconnecting to bucket "mydb" in 1s

My setup is CBL on iOS 7, with sync_gateway and CBS 2.2.0-821-rel on Mac OS X with 8 GB of memory.

Sometimes there are just a few messages; at other times no reconnection is made at all.

I think this is impacting my testing, in the following scenario:

Launch the app on device 1 and replicate to sync_gateway.

On sync_gateway, the TAP connection messages usually start after the initial push sync from the client.

Launch the app on device 2 and pull files from sync_gateway (this always works).

On device 1, generate new files and replicate to sync_gateway.

If the TAP connection messages are active, device 2 does not pull-sync the new files.

I guess this makes sense: no TAP connection, so no dynamic updates on the channel?

I have created a Gist from a typical scenario, which shows the sync_gateway log up to just after device 1 completes its first push sync. (Note: there are two vBuckets in this example.)

@snej
Contributor

snej commented Nov 6, 2013

Is Couchbase Server logging anything suspicious at the same time you get those errors?

@ghost

ghost commented Nov 6, 2013

The logs appear to contain just start/stop status, so I generated a diagnostic dump and extracted the entries that appear to match the sync_gateway test times (this a.m.). The server seems to mirror the client, with issues around TAP connections. Gist Here

snej added a commit that referenced this issue Nov 6, 2013
Trying to diagnose some recent “TAP connection lost” errors reported in #95.
@snej
Contributor

snej commented Nov 6, 2013

Could you pull the latest sync_gateway and try again? I added some more logging to go-couchbase that will include the actual error that caused the tap feed to fail. Hopefully with that info things will become clearer.

@ghost

ghost commented Nov 6, 2013

Yep, this looks more useful.

2013/11/06 21:40:37 go-couchbase: Tap feed from 127.0.0.1:11210 failed: 1398109 is too big (max %!s(int=1000000))
2013/11/06 21:40:37 go-couchbase: TAP connection lost; reconnecting to bucket "mydb1" in 1s
21:40:37.446641 Changes: MultiChangesFeed: channels expand to [email protected]:1 ...
2013/11/06 21:40:38 go-couchbase: Tap feed from 127.0.0.1:11210 failed: 1398109 is too big (max %!s(int=1000000))
2013/11/06 21:40:38 go-couchbase: TAP connection lost; reconnecting to bucket "mydb1" in 1s
2013/11/06 21:40:39 go-couchbase: Tap feed from 127.0.0.1:11210 failed: 1398109 is too big (max %!s(int=1000000))
2013/11/06 21:40:39 go-couchbase: TAP connection lost; reconnecting to bucket "mydb1" in 1s
2013/11/06 21:40:40 go-couchbase: Tap feed from 127.0.0.1:11210 failed: 1398109 is too big (max %!s(int=1000000))
2013/11/06 21:40:40 go-couchbase: TAP connection lost; reconnecting to bucket "mydb1" in 1s
21:40:40.831682 HTTP:  #027: PUT /mydb2/_local/bd19343214b4c39a906acd7fe9c65850cf24233e  (as [email protected])
21:40:40.873809 HTTP:  #028: PUT /mydb1/_local/919e429509611235676a2bf9147ed8bceda0e000  (as [email protected])
2013/11/06 21:40:41 go-couchbase: Tap feed from 127.0.0.1:11210 failed: 1398109 is too big (max %!s(int=1000000))
2013/11/06 21:40:41 go-couchbase: TAP connection lost; reconnecting to bucket "mydb1" in 1s
21:40:42.059519 HTTP:  #029: PUT /mydb2/_local/8bcf5682e2a8cfe39c40dc2e24b35743e36b8eee  (as [email protected])
2013/11/06 21:40:42 go-couchbase: Tap feed from 127.0.0.1:11210 failed: 1398109 is too big (max %!s(int=1000000))
2013/11/06 21:40:42 go-couchbase: TAP connection lost; reconnecting to bucket "mydb1" in 1s
2013/11/06 21:40:43 go-couchbase: Tap feed from 127.0.0.1:11210 failed: 1398109 is too big (max %!s(int=1000000))
2013/11/06 21:40:43 go-couchbase: TAP connection lost; reconnecting to bucket "mydb1" in 1s
2013/11/06 21:40:44 go-couchbase: Tap feed from 127.0.0.1:11210 failed: 1398109 is too big (max %!s(int=1000000))
2013/11/06 21:40:44 go-couchbase: TAP connection lost; reconnecting to bucket "mydb1" in 1s
2013/11/06 21:40:45 go-couchbase: Tap feed from 127.0.0.1:11210 failed: 1398109 is too big (max %!s(int=1000000))
2013/11/06 21:40:45 go-couchbase: TAP connection lost; reconnecting to bucket "mydb1" in 1s
2013/11/06 21:40:46 go-couchbase: Tap feed from 127.0.0.1:11210 failed: 1398109 is too big (max %!s(int=1000000))
2013/11/06 21:40:46 go-couchbase: TAP connection lost; reconnecting to bucket "mydb1" in 1s
2013/11/06 21:40:47 go-couchbase: Tap feed from 127.0.0.1:11210 failed: 1398109 is too big (max %!s(int=1000000))
2013/11/06 21:40:47 go-couchbase: TAP connection lost; reconnecting to bucket "mydb1" in 1s
2013/11/06 21:40:48 go-couchbase: Tap feed from 127.0.0.1:11210 failed: 1398109 is too big (max %!s(int=1000000))
2013/11/06 21:40:48 go-couchbase: TAP connection lost; reconnecting to bucket "mydb1" in 1s

@ghost

ghost commented Nov 6, 2013

So this is very, very close to the size of one of the attachments that should be pulled from that DB; in CBS it shows as 1398100 bytes.

@snej
Contributor

snej commented Nov 6, 2013

Yeah, this is an error from mc_req.go in the gomemcached package. The incoming message from the server is longer than it’s hardcoded to accept:

// The maximum reasonable body length to expect.
// Anything larger than this will result in an error.
var MaxBodyLen = int(1e6)
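(As an aside, the %!s(int=1000000) in the log output is what Go's fmt package prints when an int is passed to a %s verb, so the error's format string has a small bug of its own.)

Since MaxBodyLen is an exported package-level variable, a rebuilt binary could raise the ceiling before any tap feed is opened. A minimal sketch, not a supported fix; the import path assumes the dustin/gomemcached package that go-couchbase pulls in, and 20 MB is an arbitrary example value:

package main

import (
	"fmt"

	"github.com/dustin/gomemcached"
)

func main() {
	// Raise the hardcoded message-body ceiling before opening any TAP feeds.
	// 20 MB here is an arbitrary example, not a tested recommendation.
	gomemcached.MaxBodyLen = int(20e6)
	fmt.Println("MaxBodyLen is now", gomemcached.MaxBodyLen)
}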

@ghost

ghost commented Nov 6, 2013

Can I work around this for now by just increasing the value and rebuilding? I only need to increase the ceiling to that example size.

@snej
Contributor

snej commented Nov 6, 2013

I’ve moved the current issue to a new bug report to make things less confusing.

@snej snej closed this as completed Nov 6, 2013