add InternalSplit command & tests #64

Merged · 1 commit merged into cockroachdb:master on Sep 24, 2014
Conversation

@tbg (Member) commented Sep 16, 2014

  • Added a method CopyInto to the response cache, required for initializing a new range during a split (rough sketch below).
  • Added an InternalSplit command. Making the split happen without a transitory period in which things can go wrong is subtle, so this needs a good review and likely some changes.
  • Added some basic tests, but the transition should eventually be bulletproofed with more tests.
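As a rough illustration of the CopyInto bullet above, here is a minimal, self-contained sketch of copying one range's response cache into the cache of a freshly created range, so that command replays are still detected after the split. The types, the map-based storage, and the key layout are simplified stand-ins for illustration, not the actual cockroach APIs.

    // Hypothetical sketch only: a simplified response cache whose contents are
    // copied into the cache of a freshly created range during a split.
    package main

    import "fmt"

    // clientCmdID is a stand-in for the ID that keys cached responses.
    type clientCmdID struct {
    	WallTime int64
    	Random   int64
    }

    // responseCache is a toy, in-memory stand-in for the per-range response cache.
    type responseCache struct {
    	rangeID int64
    	replies map[clientCmdID][]byte
    }

    func newResponseCache(rangeID int64) *responseCache {
    	return &responseCache{rangeID: rangeID, replies: map[clientCmdID][]byte{}}
    }

    // CopyInto copies every cached reply into dst. In the real system the entries
    // presumably live in the storage engine keyed by range ID, so the copy would
    // re-key them under dst.rangeID; here a map copy stands in for that.
    func (rc *responseCache) CopyInto(dst *responseCache) error {
    	if dst == nil {
    		return fmt.Errorf("no destination cache")
    	}
    	for id, reply := range rc.replies {
    		dst.replies[id] = reply
    	}
    	return nil
    }

    func main() {
    	old := newResponseCache(1)
    	old.replies[clientCmdID{WallTime: 1, Random: 42}] = []byte("put ok")

    	// A split would create range 2 and seed its cache from range 1.
    	fresh := newResponseCache(2)
    	if err := old.CopyInto(fresh); err != nil {
    		panic(err)
    	}
    	fmt.Printf("copied %d cached replies\n", len(fresh.replies))
    }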

@@ -846,3 +850,65 @@ func (r *Range) InternalSnapshotCopy(args *proto.InternalSnapshotCopyRequest, re
reply.SnapshotId = args.SnapshotId
reply.SetGoError(err)
}

// InternalSplit splits the range at key args.SplitKey, creating and
// initializing a new range containing the remainder in the process.
Member

Be a little more explicit here, saying the original range is truncated to have StartKey=desc.StartKey and EndKey=args.SplitKey, and the newly created range has StartKey=args.SplitKey and EndKey=desc.EndKey.
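In code terms, the wording suggested above corresponds to roughly the following descriptor adjustment; rangeDescriptor here is a simplified local type used only for illustration, not the proto definition.

    // Sketch of the key-range arithmetic described above; simplified types.
    package main

    import "fmt"

    type rangeDescriptor struct {
    	StartKey, EndKey string
    }

    func main() {
    	desc := rangeDescriptor{StartKey: "a", EndKey: "z"}
    	splitKey := "m"

    	// The new range takes [splitKey, desc.EndKey); the original range is
    	// truncated to [desc.StartKey, splitKey).
    	newDesc := rangeDescriptor{StartKey: splitKey, EndKey: desc.EndKey}
    	desc.EndKey = splitKey

    	fmt.Println("old:", desc, "new:", newDesc)
    }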

Member Author

done.

@spencerkimball (Member)

LGTM. Coming along!

}

// Lock the range for the duration of the split.
r.Lock()
Contributor

This will block the entire range from accepting any new commands, which is not what we want. In particular, we have to wait for respCache.CopyInto to finish, which can take a long time.

We need to optimize this part, e.g. by not creating the actual range until the copy is done.

Member Author

Yep, makes sense to optimize the downtime. I changed two things:

  • enforcing args.Key = splitKey and args.EndKey = range.(...).EndKey.
    That will make sure that anything operating on the left side of the split (=the old range) keeps running until we actually lock.
  • changed store.CreateRange to not automatically Start() the range.
    This is important since now we can create the range, copy the respCache and only then lock the old range.

In effect the part of the keys that goes into the "new" range will be locked until the split is completed; the rest will be available until immediately before the new range comes online.
The most time-consuming part in that latter process is likely the transactional update of the Meta1/2 values.
To get that done without any downtime, my idea would be to

  • do everything that sets up the new range. Fire it up (it is technically overlapping with the first range, but doesn't do anything, so no conflict [needs careful checking]).
    Once its Raft is on (fail on timeout):
  • transactional update of Meta1/Meta2. (extra subtle if we're splitting one of the meta ranges itself, which we are locking one half of...)
  • lock and resize the original range, unlock.
    The old range will then proceed, returning errors (wrong range) for all the reads/writes which were waiting to access the key range that has now split off.
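Spelled out as a hypothetical, heavily simplified sketch, the ordering described above — create the new range without starting it, copy the response cache while the old range keeps serving, and only then briefly lock the old range to shrink it and bring the new range online — might look like this; the Store/Range types and method names are placeholders, not the real APIs.

    // Hypothetical sketch of the reduced-downtime split ordering discussed above.
    // All types and methods are simplified placeholders.
    package main

    import (
    	"fmt"
    	"sync"
    )

    type rangeDescriptor struct{ StartKey, EndKey string }

    type responseCache struct{ replies map[string][]byte }

    func (rc *responseCache) CopyInto(dst *responseCache) {
    	for k, v := range rc.replies {
    		dst.replies[k] = v
    	}
    }

    type rng struct {
    	mu        sync.Mutex
    	desc      rangeDescriptor
    	respCache *responseCache
    	started   bool
    }

    // Start brings the range online; in the sketch it just flips a flag.
    func (r *rng) Start() { r.started = true }

    // splitRange creates the new right-hand range, seeds its response cache while
    // the old range keeps serving, and only then locks the old range to truncate
    // its descriptor and start the new range.
    func splitRange(old *rng, splitKey string) *rng {
    	// 1. Create the new range but do not Start() it yet.
    	newRng := &rng{
    		desc:      rangeDescriptor{StartKey: splitKey, EndKey: old.desc.EndKey},
    		respCache: &responseCache{replies: map[string][]byte{}},
    	}

    	// 2. Copy the response cache while the old range is still serving traffic.
    	// (In the real system, entries cached after this point would still need
    	// handling; that is part of the "needs careful checking" above.)
    	old.respCache.CopyInto(newRng.respCache)

    	// 3. Briefly lock the old range to shrink it and bring the new range online.
    	old.mu.Lock()
    	old.desc.EndKey = splitKey
    	newRng.Start()
    	old.mu.Unlock()

    	return newRng
    }

    func main() {
    	old := &rng{
    		desc:      rangeDescriptor{StartKey: "a", EndKey: "z"},
    		respCache: &responseCache{replies: map[string][]byte{"cmd-1": []byte("ok")}},
    	}
    	newRng := splitRange(old, "m")
    	fmt.Println(old.desc, newRng.desc, newRng.started)
    }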

Member

Split is actually an exceedingly subtle distributed algorithm. We have to be very careful to get it right. What needs to be balanced here is the correct separation between a distributed transaction conducted by the range leader and the identical Raft commands which must be replayed against the state machines of each range replica.

Let's not worry about this being slow yet. Let's worry about getting it absolutely correct. The two are going to be very difficult to solve simultaneously, although I'm not super worried about performance. This is why we're keeping our ranges in the 32-64M range!

Here's my pseudo code for the actual split algorithm:

    On range leader, determine periodically whether a split is
    necessary by consulting the estimated size of the range. If
    in excess of zone.RangeMaxBytes, initiate split.

    Split algorithm from perspective of range leader (this is not
    a Raft command, but is done outside of Raft):

    // First, find the split key using a snapshot.
    snap := GetSnapshot()
    splitKey := FindSplitKey(snap)

    // Next, start a distributed transaction to effect split
    // and also update addressing records.
    tx := BeginTransaction()

    // Determine new range ID.
    newRangeID := allocateRangeID()

    // Call split range such that it's submitted as a Raft command;
    // each replica of the range executes it identically.
    reply := db.InternalSplitRange(tx, rangeID, newRangeID, splitKey)

    // Update addressing records for old range and new range.
    updateRangeAddressingRecords(tx, rangeID, reply.rangeDesc)
    updateRangeAddressingRecords(tx, newRangeID, reply.newRangeDesc)

    EndTransaction(tx, true /* Commit */)

    Raft command for InternalSplitRange; this is done identically on
    each range replica:

    copyResponseCache()
    createNewRange()
    newRangeDesc.StartKey = splitKey
    newRangeDesc.EndKey = rangeDesc.EndKey
    rangeDesc.EndKey = splitKey
    mvcc.Put(tx, KeyLocalRangeMetadataPrefix + rangeID, rangeDesc)
    mvcc.Put(tx, KeyLocalRangeMetadataPrefix + newRangeID, newRangeDesc)

Keep in mind this is just pseudo-code! In any case, the idea here is
that this piece runs as part of the distributed transaction and it
runs on every replica, setting the stage for the split, but not quite
pulling the trigger. The split must be part of the same two-phase
commit as everything else, or we will have problems. Notice that we're
moving the local RangeMetadata value to use MVCC! It currently writes
directly to the engine.
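For concreteness, a hedged Go rendering of the Raft-side portion of that pseudo-code could look like the following; the MVCC interface, the key construction, and the descriptor type are simplified stand-ins rather than the actual storage code.

    // Hypothetical rendering of the InternalSplitRange Raft command sketched
    // above; every type and function here is a simplified placeholder.
    package main

    import "fmt"

    type rangeDescriptor struct{ StartKey, EndKey string }

    type txn struct{ ID string }

    // mvccStore stands in for an MVCC-capable engine; Put writes a provisional
    // value (an intent) under the given transaction.
    type mvccStore struct{ intents map[string]rangeDescriptor }

    func (m *mvccStore) Put(t *txn, key string, desc rangeDescriptor) {
    	m.intents[key] = desc // in reality this writes an intent owned by t
    }

    const keyLocalRangeMetadataPrefix = "\x00range-meta-"

    func rangeMetadataKey(rangeID int64) string {
    	return fmt.Sprintf("%s%d", keyLocalRangeMetadataPrefix, rangeID)
    }

    // internalSplitRange mirrors the pseudo-code: copy the response cache, create
    // the new range, adjust both descriptors, and write both range metadata
    // records through MVCC as part of the supplied transaction.
    func internalSplitRange(m *mvccStore, t *txn, rangeID, newRangeID int64,
    	desc *rangeDescriptor, splitKey string) rangeDescriptor {

    	// copyResponseCache() and createNewRange() are elided in this sketch.

    	newDesc := rangeDescriptor{StartKey: splitKey, EndKey: desc.EndKey}
    	desc.EndKey = splitKey

    	m.Put(t, rangeMetadataKey(rangeID), *desc)
    	m.Put(t, rangeMetadataKey(newRangeID), newDesc)
    	return newDesc
    }

    func main() {
    	m := &mvccStore{intents: map[string]rangeDescriptor{}}
    	desc := rangeDescriptor{StartKey: "a", EndKey: "z"}
    	newDesc := internalSplitRange(m, &txn{ID: "split-txn"}, 1, 2, &desc, "m")
    	fmt.Println(desc, newDesc)
    }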

So the question is: how do the KeyLocalRangeMetadataPrefix
"intents" get committed when the transaction commits? They
don't get automatically resolved by the transaction coordinator,
because their keys are local. However, every Raft command
directed at a range "reads" the key at
KeyLocalRangeMetadataPrefix. If it's an intent, we push the txn
as per usual, which either succeeds or fails. Oftentimes, it
will succeed with already-committed, in which case we resolve the
intent. If we find out the txn is aborted, then the range will clean
up the provisional new range by calling RangeManager.DropRange.
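That check could be sketched roughly as follows; pushTxn, resolveIntent, and dropProvisionalRange are placeholder names for the mechanisms described above, not real signatures.

    // Hypothetical sketch of the check described above: before executing a Raft
    // command, read the local range metadata key; if it carries an intent, push
    // the transaction and either resolve the intent or drop the provisional range.
    package main

    import "fmt"

    type txnStatus int

    const (
    	txnPending txnStatus = iota
    	txnCommitted
    	txnAborted
    )

    type intent struct {
    	txnID string
    }

    // pushTxn stands in for the usual push machinery; it returns the pushed
    // transaction's status (hard-coded here for the example).
    func pushTxn(txnID string) txnStatus { return txnCommitted }

    func resolveIntent(key string)              { fmt.Println("resolved intent at", key) }
    func dropProvisionalRange(newRangeID int64) { fmt.Println("dropped range", newRangeID) }

    // checkRangeMetadata runs before each command: if the local range metadata
    // key has an intent, decide its fate by pushing the owning transaction.
    func checkRangeMetadata(key string, in *intent, newRangeID int64) error {
    	if in == nil {
    		return nil // no split in progress
    	}
    	switch pushTxn(in.txnID) {
    	case txnCommitted:
    		resolveIntent(key) // split committed: make the new descriptor visible
    	case txnAborted:
    		dropProvisionalRange(newRangeID) // split aborted: clean up the new range
    	default:
    		return fmt.Errorf("split transaction %s still pending", in.txnID)
    	}
    	return nil
    }

    func main() {
    	_ = checkRangeMetadata("\x00range-meta-1", &intent{txnID: "split-txn"}, 2)
    }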

@spencerkimball (Member)

I know the full split algorithm is hard to grok -- the interaction between Raft and distributed transactions is tricky. Don't let that part hold you up for this PR. You're still making plenty of progress on the pieces which we need regardless. I would say next thing would be the code in range leader to decide when to split. I'm getting really close to having the code in place for doing the distributed transaction...! End of this week hopefully.

@tbg (Member Author) commented Sep 22, 2014

sorry for the delay. Rebasing this right now as the merge is fairly nasty. @spencerkimball I'll add TODOs in the appropriate places (to get all the transactional stuff right), merge it and then work on the split trigger. Sound good?

@spencerkimball (Member)

Yep sounds like a plan!

@tbg (Member Author) commented Sep 22, 2014

rebased & added comments. Merging tmrw after work.

@spencerkimball (Member)

Should I re-review?

@tbg (Member Author) commented Sep 22, 2014

No intentional changes. Maybe just a quick glance to make sure it doesn't show that it's already past bedtime here.

tbg added a commit that referenced this pull request Sep 24, 2014
add range splitting groundwork
@tbg tbg merged commit ec81081 into cockroachdb:master Sep 24, 2014
@cockroach-team

Nice.

@tbg tbg deleted the split branch February 22, 2015 17:14
tbg referenced this pull request in tbg/cockroach Aug 31, 2016
We have been seeing long startup times which disappear spontaneously. During a
restart of the beta cluster, the following goroutine was observed, which suggests
that we were spending a lot of time GCing replicas on startup.

    engine              ??:0                     _Cfunc_DBIterNext(#324, #323, 0, 0, 0, 0, 0, 0, 0)
    engine              rocksdb.go:1135          (*rocksDBIterator).Next(#235)
    storage             replica_data_iter.go:104 (*ReplicaDataIterator).Next(#316)
    storage             store.go:1748            (*Store).destroyReplicaData(#109, #317, 0, 0)
    storage             store.go:841             (*Store).Start.func2(0x101b, #300, 0x36, 0x40, #301, 0x36, 0x40, #315, 0x3, 0x4, ...)
    storage             store.go:734             IterateRangeDescriptors.func1(#306, 0x40, 0x41, #307, 0x92, 0x92, #341, 0, 0x186c, 0x4000, ...)
    engine              mvcc.go:1593             MVCCIterate(#329, #68, #47, #81, #232, 0x9, 0x10, #233, 0xb, 0x10, ...)
    storage             store.go:738             IterateRangeDescriptors(#330, #196, #47, #81, #195, #179, #110)
    storage             store.go:867             (*Store).Start(#109, #330, #196, #179, #185, 0x1)
    server              node.go:405              (*Node).initStores(#78, #330, #196, #98, 0x1, 0x1, #179, 0, #55)
    server              node.go:330              (*Node).start(#78, #330, #196, #42, #129, #98, 0x1, 0x1, 0, 0, ...)
    server              server.go:431            (*Server).Start(#5, #330, #196, #95, 0x1)
    cli                 start.go:368             runStart(#34, #178, 0, 0x9, 0, 0)
    cobra               command.go:599           (*Command).execute(#34, #177, 0x9, 0x9, #34, #177)
    cobra               command.go:689           (*Command).ExecuteC(#33, #70, #343, #72)
    cobra               command.go:648           (*Command).Execute(#33, #71, #343)
    cli                 cli.go:96                Run(#64, 0xa, 0xa, 0, 0)
    main                main.go:37               main()
soniabhishek pushed a commit to soniabhishek/cockroach that referenced this pull request Feb 15, 2017
…s_changes

changing access of methods in plog