Problem with many "failed to invoke InternalPushTxn" errors when using Serializable isolation type #877
Those are warnings, not errors (we will probably suppress them because […]).
The CockroachDB server seems to spam them continuously, and my app eventually times out. I could increase the retry backoff timer and see how that goes.
FYI, the way the app is structured right now, after N retries and failures, it will do a rollback. At the moment, the app is pretty much "failing" from a client / usability perspective. I had the same app running previously on FoundationDB, and it would sail through this section without any hiccup. I didn't even realize I had a write contention issue here until now. I will tweak the retry / failure algorithm and try again with Serializable.
Could you share the app / test harness you're using? Or describe the basic […]? We've so far only been concerned with correctness and haven't gotten to the […].
I'm not sure how FoundationDB was structured, but if they used locking, […]
My app is a service that I am building which was originally running on Riak; I then switched it over to FoundationDB, and since they dropped off the radar after being purchased by Apple, I've started switching it over to CockroachDB. I wrote my own Java client for CockroachDB (I plan on putting the code up on GitHub once I am done) and have started testing the Java client by integrating it into my service, i.e. essentially swapping out FoundationDB for CockroachDB.

The key space I am updating isn't very large, and I don't believe there is a significant number of updates to it, but the keys can get fairly "busy" over a short period of time (mostly bursts, not constant). The exponential backoff and retry mechanism is one I built into the Java client, and I can tweak that. My app essentially does a rollback if the Put() fails (and it seems to always fail after N retries in the Java client). It could be that the retries are too frequent; I can tweak the backoff algorithm to see if that improves things. The retries will only occur if the error from Cockroach says it's retryable and it's not ABORT.

With regard to FoundationDB, here are some details on how they handled transactions: […]
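For illustration, here is a minimal Go sketch of the kind of client-side retry loop described above: exponential backoff with jitter, retrying only while the error is marked retryable and the transaction hasn't been aborted. The type and field names are assumptions made for the sketch, not the actual client API.

```go
package retrydemo

import (
	"math/rand"
	"time"
)

// txnError is a stand-in for the error information returned by the server;
// the fields mirror the retryable / transaction_restart values discussed
// in this thread.
type txnError struct {
	Retryable bool
	Restart   string // "ABORT", "BACKOFF", or "IMMEDIATE"
}

// withBackoff retries op with exponential backoff plus jitter, giving up
// after maxAttempts, or immediately when the error is not retryable or the
// transaction was aborted.
func withBackoff(maxAttempts int, op func() *txnError) *txnError {
	backoff := 50 * time.Millisecond
	var lastErr *txnError
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if lastErr = op(); lastErr == nil {
			return nil
		}
		if !lastErr.Retryable || lastErr.Restart == "ABORT" {
			return lastErr // caller rolls back / restarts the whole txn
		}
		// Sleep with jitter so concurrent writers don't retry in lockstep.
		time.Sleep(backoff + time.Duration(rand.Int63n(int64(backoff))))
		if backoff < 2*time.Second {
			backoff *= 2
		}
	}
	return lastErr
}
```

The jitter is the part that matters most under contention: without it, concurrent writers tend to retry at the same instants and collide again.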
Thanks for the details. Will look into it. You're already providing a huge […]. With those log statements, we can reconstruct the traffic and work on […]. Thanks again!
FYI, I double checked and it seems my client is only set up to use 6 threads, so there should only be 6 concurrent writers to the same set of keys. This doesn't seem like a lot, but there could be something else going on. It could be that my retry mechanism is exacerbating the issue with the number of retries and the specific intervals I am using. I was planning on adding logging to the client, so this is probably a good time to do that.

FYI, I also had my service running on Couchbase, but I didn't stick with it for very long since the "Views" feature of Couchbase is very slow and pretty inconsistent (and there is no other way to perform range searches on Couchbase). Having run the exact same code on a number of different DBs does help set certain expectations and understand the boundaries of the various DBs. CockroachDB seems very promising thus far, so I am eager to see it get to production status, and am happy to help debug these types of problems :-)
Well, we really appreciate you pushing on it. We're literally rounding the […].
So I have been digging a little into the problem. The exact transaction that is causing the problem is one that does the following: starts a txn, does an initial Put, then a Delete and another Put, and ends the transaction. Initially I had the last 2 calls, i.e. the Delete and Put, in a Batch, but I think the Batch exacerbated the problem, so I disabled the batch for now and just had them be standard calls. Most of the errors / problems seem to be related to only this set of calls.

I am confused about one thing... can someone explain what the difference is between retryable and transaction restart? The proto for the error says the following:

// BACKOFF is for errors that can be retried by restarting the transaction […]
// If transaction_restart is not ABORT, the error condition may be handled by […]
// If retryable is true, the error condition may be transient and the failed […]

So from the proto, it sounds like retryable references the operation within the transaction, i.e. whether that operation can be retried or not. Does transaction_restart reference the entire transaction, i.e. in this case all the operations within the transaction? For example, on some of my calls for the initial Put, I am getting a response which says "retryable=false" and "transaction_restart=BACKOFF". I am not sure how to interpret that. Here is the full response I am getting back: […]
If you get a transaction_restart other than ABORT, you need to retry the whole transaction, reusing the Transaction from the previous response header.
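As a sketch of that decision, reusing the illustrative txnError type from the earlier snippet (the helper functions here are placeholders, not real client calls):

```go
// Hypothetical helpers a client would implement.
func retryOp()    {} // retry just the failed call, within the same txn
func restartTxn() {} // re-run the whole txn, carrying the Transaction forward
func rollback()   {} // give up on this txn entirely

// handleResponse shows how the two fields combine.
func handleResponse(err *txnError) {
	switch {
	case err == nil:
		// Success, nothing to do.
	case err.Retryable:
		// The individual operation may be transient; retry it with backoff.
		retryOp()
	case err.Restart != "ABORT":
		// BACKOFF / IMMEDIATE: restart the entire transaction from the top.
		restartTxn()
	default:
		// ABORT: roll back; a brand-new transaction is needed to try again.
		rollback()
	}
}
```

Under this reading, a response with "retryable=false" and "transaction_restart=BACKOFF" falls into the restart-the-whole-transaction branch.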
So if I understand you correctly... I need to basically issue the Put, Delete, Put and EndTransaction again, but this time with the same Transaction from the previous header (that said transaction_restart)?
Yes. In Go, the interface looks like this:
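Roughly, the shape of that interface is a retryable closure; here is a minimal stand-in sketch (type and method names are placeholders, not the exact client API of the time):

```go
// A minimal stand-in for the interface: the closure passed to
// RunTransaction is re-run from the beginning on a restart error, the
// runner threads the Transaction record across attempts, and it issues
// EndTransaction (commit) when the closure returns nil.
type Txn interface {
	Get(key string) ([]byte, error)
	Put(key string, value []byte) error
	Delete(key string) error
}

type DB interface {
	RunTransaction(fn func(txn Txn) error) error
}

// The Put / Delete / Put from earlier in the thread, expressed against it.
func updateEmail(db DB, userKey, oldIndexKey, newIndexKey string, rec []byte) error {
	return db.RunTransaction(func(txn Txn) error {
		if err := txn.Put(userKey, rec); err != nil {
			return err
		}
		if err := txn.Delete(oldIndexKey); err != nil {
			return err
		}
		return txn.Put(newIndexKey, []byte(userKey))
	})
}
```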
The function that is passed in gets re-run from the beginning when there is a restart error (and the RunTransaction method is responsible for passing the Transaction from one request to the next); you'll need to have a similar interface in your Java client.
Ok, thanks Ben. Let me tweak the code around this and see how that goes.
So I have a test scenario which can trigger this problem. Actually, what happens is that once triggered, the CockroachDB instance just constantly spits out the "failed to invoke InternalPushTxn" errors and then eventually exits and dies with the following error (this happens a little bit after my client has completely exited and shut down):

[…] ts=1430778789.357931852,1 orig=1430778789.351674080,0 max=1430778789.601674318,0}

Now it's possible that this problem is somehow related to a bug in my Java client, but I suspect that even if that were true, the DB shouldn't just die. It would be interesting to see if you can trigger the same problem with a Go or Python client.

My test scenario is as follows... imagine you have a user record in the DB, i.e. userId = {user info}. You also create an index for the email address. Now in this test scenario, there are multiple clients attempting to change the user's email address. This requires updating the user record, as well as updating the "index", i.e. deleting the old index entry and creating a new one. So to create a test scenario, create a single record, e.g. 1234 = blah. In your thread class, do the following in a transaction (passing in the key, i.e. 1234): read the record, delete the old email index entry, put the new email index entry, and put the updated record.
Now create and run 6 or more threads and pass them the same key (so they will invoke the same set of transactions concurrently on the same record). An interesting note... when I ran this test with 2 threads, it worked fine. More than 3 threads triggers the problem. 6 threads definitely causes the DB to go into a loop and then die. FYI, I am using the SERIALIZABLE isolation level. Let me know if you trigger the same error. If not, then the problem is unique to my Java client and I will have to do more digging.
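A rough Go version of that scenario, written against the same stand-in Txn/DB interface sketched above (the key layout and the emailOf/withEmail helpers are made up for the sketch; assumes the usual sync, fmt, and log imports):

```go
// runContentionTest starts n workers that all rewrite the email of the same
// user record (key "1234") and its email index entry, concurrently.
func runContentionTest(db DB, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(worker int) {
			defer wg.Done()
			newEmail := fmt.Sprintf("user%d@example.com", worker)
			err := db.RunTransaction(func(txn Txn) error {
				rec, err := txn.Get("user/1234")
				if err != nil {
					return err
				}
				// Delete the old index entry, write the new one, then write
				// the updated record back. emailOf / withEmail stand in for
				// whatever decoding/encoding the record uses.
				if err := txn.Delete("email/" + emailOf(rec)); err != nil {
					return err
				}
				if err := txn.Put("email/"+newEmail, []byte("1234")); err != nil {
					return err
				}
				return txn.Put("user/1234", withEmail(rec, newEmail))
			})
			if err != nil {
				log.Printf("worker %d: txn failed: %v", worker, err)
			}
		}(i)
	}
	wg.Wait()
}
```

Every worker reads and rewrites the same two keys, which is exactly the contention pattern reported here.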
Could you send more of the log? I'll need to look earlier to see why a txn […]. The log indicates the server was shut down, not that it died. Did you send it a signal?
Are you sure you're not hitting CTRL-C or something like that around […]?
Nope... I did not hit Ctrl-C or send any signal to the server. In fact, I just ran the test again and I completely exited my test client, and the server is still just running constantly and dumping the "failed to push" messages. It's been going like that for about 3-5 minutes now with no client connected. Seems like it's stuck in some kind of loop. It's not exiting / dying this time, but it isn't stopping either. Let me see if I can get more details on what's going on.
Something is signaling your process--it's the only way you'd see that log message. It's not so surprising to see that retry loop if what's going on is what I suspect. What I really need is to set up your same client locally so I can debug […]. Thanks!
@spencerkimball I think this is actually reproduced nicely by […]. Here's roughly what it looks like:
Server log floods: https://gist.github.com/tschottdorf/b5dbebe1233c7f782581
The test details are in […].
The test cases which are commented out in that file are also still pathological (write/write). I haven't checked them all, but they fill the log quickly and intents seem to be involved. The other cases go through fine.
Yeah, it looks like a case of resolving intents gone wild. We've got tests that try this sort of behaviour, but in our tests we limit retries - which means the resolves fail after, say, 2 attempts. With our (as of yet untuned) standard settings, there is no limit on retries and so it goes wild at the level of the distributed sender. We'll figure something out.
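To make the difference concrete, a generic sketch (not the actual retry options in the codebase): the tests cap the number of attempts, while the untuned default is effectively unbounded. Assumes the time import.

```go
// resolveWithCap retries an intent resolution up to maxAttempts times.
// With maxAttempts <= 0 the loop never gives up, which is what lets the
// distributed sender spin indefinitely on an abandoned transaction.
func resolveWithCap(maxAttempts int, resolve func() error) error {
	var err error
	for attempt := 1; ; attempt++ {
		if err = resolve(); err == nil {
			return nil
		}
		if maxAttempts > 0 && attempt >= maxAttempts {
			return err // e.g. the tests give up after 2 attempts
		}
		time.Sleep(100 * time.Millisecond) // flat backoff, just for the sketch
	}
}
```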
Great to hear that you guys have potentially narrowed down the problem! I thought it was related to something specific in my code and / or my misuse of the DB, so I basically began working around the issue by removing some of the write contention from my app / service. I still see occasional flurries of this error in the DB log, but have managed to reduce the number of occurrences considerably with some of the changes I have made.
This can be reproduced with examples/bank/ with as few as two concurrent writers. Fairly quickly, all nodes start spinning on: […]
When that happens, the client freezes for a while. Even once the client is stopped, it can take minutes for all nodes to quiet down again.
I think TestRetryNonTxn in […]. Edit: We should rewrite TestRetryNonTxn in a way that allows porting it to other platforms. It currently relies on setting system-internal RetryOptions in the test. See cockroachdb/sqlalchemy-cockroachdb#3.
^- edit: that's not an issue, I muddled up resolve and push. The resolve will simply be a no-op in that case and never errors out retriably.
Basically there's a transaction which is PENDING but its last heartbeat was five minutes ago. Maybe other checks return from the function and fail the push (though it should succeed if a transaction record goes stale without committing).
Maybe related to #1082: if the client doesn't retry on this error, it will leave the transaction open. So either […]
store.go goes into a busy retry loop trying to resolve the intents, but keeps the timestamp in the request header unchanged. Since that is what's used to validate whether the pushee txn is expired, we compare the same two numbers over and over again and control never goes back to the client. I've hacked the code locally to use the wall time in InternalPushTxn. That correctly causes the Push to go through after it's been retried often enough to let 10s pass. I'll look next at how to avoid that busy loop and whether the txn coordinator has a mechanism to stop heartbeating when the client disappears (edit: it does, currently set to 10s as well). With the coordinator heartbeating for 10s, and the txn timing out 10s after that, we're looking at 20s during which anything that has to do with the key remains locked.
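Roughly, the fix amounts to evaluating the pushee's expiration against a clock reading that actually advances between retries, rather than the fixed timestamp carried in the retried request. An illustrative sketch (names and types are not the actual store code; assumes the time import):

```go
// txnExpired reports whether the pushee transaction has gone stale, i.e.
// it hasn't heartbeat within the timeout.
func txnExpired(lastHeartbeat, now time.Time, timeout time.Duration) bool {
	return now.Sub(lastHeartbeat) > timeout
}

// Busy-loop pattern: the retry re-evaluates with the original request
// timestamp, so the same two numbers are compared forever:
//
//	txnExpired(lastHeartbeat, requestTimestamp, 10*time.Second)
//
// Fixed pattern: read the node's wall clock on every attempt, so the push
// goes through once ~10s pass without a heartbeat from the coordinator:
//
//	txnExpired(lastHeartbeat, clock.Now(), 10*time.Second)
```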
The real problem, though, is that with two concurrent writers on two keys, they can block each other: […]
Here's the dist sender side of this: […]
It looks like the pusher (triggered by a transactional Get call) isn't transactional here (not shown in log). That's a bug, I'm going to track it down.
PR coming up. It's going to be embarrassingly small.
Previously args.Timestamp was used, but there are situations in which it never changes, meaning that infinite retry loops on an abandoned Txn descriptor would ensue in storage/store, for instance in cockroachdb#877 and the bank example.
Are you sure it's two transactions blocking each other, and not one transaction blocking two others? I don't see how two transactions could block each other; the one with the higher priority would succeed in pushing.
In the bugs that I've been seeing, the errors are infinite retries at the store level, where nothing really changes between attempts. The bug was that the Push would actually forget the Txn. I'm going to play with this more.
I'm closing this for now; we have the basic bugs nailed down and will have plenty of testing for the txn pipeline under contention in #628.
I am testing out an app which updates the same set of keys many times. The problem is, it throws a ton of the following errors:
I0501 03:24:08.647541 21931 retry.go:101] store: Get failed; retrying in 0
W0501 03:24:08.647817 21931 dist_sender.go:536] failed to invoke InternalPushTxn: failed to push "8e3eca7d-9d5b-4960-8fb1-900385256b5c" {id=5a9c2442-a739-456f-9d61-e87ce7d15791 pri=1873872797, iso=SERIALIZABLE, stat=PENDING, epo=0, ts=1430450628.695363455,1 orig=1430450628.691514813,0 max=1430450628.941514813,0}
I0501 03:24:08.648106 21931 kv.go:119] failed InternalPushTxn: failed to push "8e3eca7d-9d5b-4960-8fb1-900385256b5c" {id=5a9c2442-a739-456f-9d61-e87ce7d15791 pri=1873872797, iso=SERIALIZABLE, stat=PENDING, epo=0, ts=1430450628.695363455,1 orig=1430450628.691514813,0 max=1430450628.941514813,0}
I changed the isolation type to Snapshot, and the app seems to work better, except for the occasional stalls.
With Serializable Isolation type, it fails miserably.
Is there some kind of queue internal to cockroachdb that might be filling up with these serializable transactions, and it isn't processing them fast enough?