
Set an upper limit of how large we will let txn-queues grow. #463

Open
wants to merge 5 commits into base: v2
Conversation

@jameinel (Contributor) commented Jul 4, 2017

When we have broken transaction data in the database (such as from
mongo getting OOM killed), it can cause a cascade failure, where that
document ends up with too many transactions queued up against it.

This can also happen if you have nothing but assert-only transactions against a
single document.

If we have lots of transactions queued, it becomes harder and harder to add
new entries, and clearing out a large queue is O(N^2), which makes capping it
worthwhile. (It also makes the document grow until it hits max-doc-size.)

The upper bound is still quite large, so it should not be triggered when
everything is operating normally.

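For context, the cap amounts to a bounds check on a document's txn-queue before a newly enqueued transaction is allowed to stay there. The following is a minimal, compile-only sketch of that idea; the names (token, docInfo, checkQueueLimit) are illustrative assumptions and not the actual code in txn/flusher.go shown in the diff below.

```go
// Illustrative sketch only: names are assumptions, not the PR's actual code.
package txncap

import "fmt"

// maxTxnQueueLength mirrors the cap introduced by this PR.
const maxTxnQueueLength = 1000

// token stands in for a transaction token queued against a document.
type token string

// docInfo stands in for the per-document state the flusher reads back,
// including the txn-queue of tokens queued against the document.
type docInfo struct {
	Queue []token
}

// checkQueueLimit aborts (by returning an error) when the txn-queue has grown
// past the cap; the real change also pulls the token it just added back out
// of the queue before aborting.
func checkQueueLimit(info docInfo) error {
	if len(info.Queue) > maxTxnQueueLength {
		return fmt.Errorf("txn-queue is too long (%d entries, limit %d)",
			len(info.Queue), maxTxnQueueLength)
	}
	return nil
}
```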
txn/flusher.go Outdated
@@ -244,6 +246,16 @@ NextDoc:
change.Upsert = false
chaos("")
if _, err := cquery.Apply(change, &info); err == nil {
if len(info.Queue) > maxTxnQueueLength {
// abort with TXN Queue too long, but remove the entry we just added
s/TXN Queue/txn-queue/ ?

txn/flusher.go Outdated
@@ -212,6 +212,8 @@ var txnFields = bson.D{{"txn-queue", 1}, {"txn-revno", 1}, {"txn-remove", 1}, {"

var errPreReqs = fmt.Errorf("transaction has pre-requisites and force is false")

const maxTxnQueueLength = 1000
This seems like a sensible default, but it might be good to add support for overriding it on a per-session basis in case some application requires it.

Still defaults to 1000 without any other configuration, but
allows callers to choose a stricter or looser limit, as sketched below.
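A rough sketch of what that caller-facing knob could look like; the Runner field and SetMaxTxnQueueLength method below are hypothetical names for illustration, not necessarily the API this commit adds.

```go
// Hypothetical illustration of a per-Runner override; names are assumptions.
package txncap

const defaultMaxTxnQueueLength = 1000

// Runner mirrors the idea of a txn.Runner that carries its own queue limit.
type Runner struct {
	maxTxnQueueLength int
}

// NewRunner returns a Runner using the default limit of 1000.
func NewRunner() *Runner {
	return &Runner{maxTxnQueueLength: defaultMaxTxnQueueLength}
}

// SetMaxTxnQueueLength lets a caller pick a stricter or looser limit;
// non-positive values fall back to the default.
func (r *Runner) SetMaxTxnQueueLength(n int) {
	if n <= 0 {
		n = defaultMaxTxnQueueLength
	}
	r.maxTxnQueueLength = n
}
```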
@fmpwizard (Contributor):

If you close and reopen the PR, Travis will rerun this build, and thanks to #462 your PR should pass all tests on all MongoDB versions.

@jameinel closed this Jul 21, 2017
@jameinel reopened this Jul 21, 2017
@rogpeppe (Contributor):

I don't understand this fully, but my immediate suspicion is that the code looks racy. I'd at least expect a doc comment explaining how/why it's not.

This still aborts the transaction, but then returns a custom error
instead of ErrAborted. That allows us to still log the details, which
lets us know which document is broken, while also avoiding a huge
queue of pending transactions waiting to be applied.
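A sketch of how such an error might be shaped so callers can distinguish it from ErrAborted while the message still carries the document details; the type name and fields here are assumptions, not necessarily what the commit defines.

```go
// Hypothetical shape for the "queue too long" error; names are assumptions.
package txncap

import "fmt"

// queueTooLongError reports which document's txn-queue hit the cap, so the
// caller can log it without treating it as an ordinary ErrAborted.
type queueTooLongError struct {
	Collection string      // collection holding the affected document
	DocID      interface{} // _id of the affected document
	QueueLen   int         // observed txn-queue length
}

func (e *queueTooLongError) Error() string {
	return fmt.Sprintf("txn-queue for %v in %q has too many transactions (%d)",
		e.DocID, e.Collection, e.QueueLen)
}
```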
jujubot added a commit to juju/juju that referenced this pull request Nov 15, 2017
Add another patch for mgo.

## Description of change

This one changes mgo so that if we find a document whose txn-queue is
growing too large, we abort the transaction rather than letting the queue
grow unbounded. This allows us to recover from a bad transaction in a much
more reasonable manner.

## QA steps

- Bootstrap a controller from the source created by "releasetests/make-release-tarball.sh".
- You can use enable-ha if you so choose.
- Go to machine-0 and inject an invalid transaction into a document (I like to use lease documents, because we know we try to touch them every 30s).
- Watch the transaction queue grow, but end up capped around 1000.
- Remove the bad transaction from the queue.
- See that everything actually recovers gracefully.


```
$ rm -rf tmp.*
$ ./releasetests/make-release-tarball.bash df86098 .
$ cd tmp.*/RELEASE
$ export GOPATH=$PWD
# I had to hack github.com/juju/juju/version/version.go so that bootstrap would pick this jujud
# and not download it from streams
$ time go install -v github.com/juju/juju/...
$ ./bin/juju bootstrap lxd --debug
$ ./bin/juju enable-ha
$ juju deploy -B cs:~jameinel/ubuntu-lite --to 0 -m controller
$ juju switch controller
# wait for "juju status" to stabilize
$ juju ssh -m controller 0
$$ dialmgo() {
    agent=$(cd /var/lib/juju/agents; echo machine-*)
    pw=$(sudo grep statepassword /var/lib/juju/agents/${agent}/agent.conf | cut '-d ' -sf2)
    /usr/lib/juju/mongo3.2/bin/mongo --ssl -u ${agent} -p $pw --authenticationDatabase admin --sslAllowInvalidHostnames --sslAllowInvalidCertificates localhost:37017/juju
}
$$ dialmgo
> db.leases.find().pretty()
...
{
        "_id" : "9b89c6e0-a321-43c2-894f-7c8eb87a92b1:application-leadership#ubuntu-lite#",
        "namespace" : "application-leadership",
        "name" : "ubuntu-lite",
        "holder" : "ubuntu-lite/0",
        "start" : NumberLong("254150808548"),
        "duration" : NumberLong("60000000000"),
        "writer" : "machine-0",
        "model-uuid" : "9b89c6e0-a321-43c2-894f-7c8eb87a92b1",
        "txn-revno" : NumberLong(8),
        "txn-queue" : [
                "5a0aa29741639d0e1a4e0f44_61a9f62b"
        ]
}
> db.leases.update({name: "ubuntu-lite"}, {$push: {"txn-queue": "5a0aa29741639d0edeadbeef_decafbad"}})
# Watch juju debug-log in another window, see the leadership manager start dying
# and resumer fail to resume transactions
> db.leases.aggregate([{$match: {name: "ubuntu-lite"}}, {$project: {_id: 1, num: {$size: "$txn-queue"}}}])
{ "_id" : "9b89c6e0-a321-43c2-894f-7c8eb87a92b1:application-leadership#ubuntu-lite#", "num" : 490 }
# Watch how many transactions are created
> db.txns.aggregate([{$group: {_id: "$s", c: {$sum: 1}}}, {$sort: {c: -1}}])
{ "_id" : 5, "c" : 2774 }
{ "_id" : 6, "c" : 1816 }
{ "_id" : 2, "c" : 698 }

# Wait for it to hit 1000, and notice that it creates more aborted transactions, and the queue doesn't grow very much:
> db.txns.aggregate([{$group: {_id: "$s", c: {$sum: 1}}}, {$sort: {c: -1}}])
{ "_id" : 5, "c" : 3922 }
{ "_id" : 6, "c" : 1840 }
{ "_id" : 2, "c" : 990 }
> db.leases.aggregate([{$match: {name: "ubuntu-lite"}}, {$project: {_id: 1, num: {$size: "$txn-queue"}}}])
{ "_id" : "9b89c6e0-a321-43c2-894f-7c8eb87a92b1:application-leadership#ubuntu-lite#", "num" : 1000 }

# Now remove the transaction, and see the count start to fall
> db.leases.update({name: "ubuntu-lite"}, {$pull: {"txn-queue": "5a0aa29741639d0edeadbeef_decafbad"}})
```

## Documentation changes

None.

## Bug reference

go-mgo/mgo#463
This is more about keeping problems from getting as far out of hand when they occur than about fixing the underlying problem.