Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcd-tester: pause before compaction, fix races #5914

Closed
wants to merge 1 commit into from

Conversation

gyuho
Copy link
Contributor

@gyuho gyuho commented Jul 11, 2016

  • increase default qps
  • fix race condition between stresser cancel, start
  • pause before doing compaction
  • remove duplicate cleanup calls
  • add additional atomic canceled check

For example,

2016-07-13 18:23:53.576937 I | etcd-tester: [round#2 case#38] starting the stressers...
2016-07-13 18:23:53.576954 I | etcd-tester: [round#2 case#38] started stressers
2016-07-13 18:23:53.576965 I | etcd-tester: [round#2 case#38] succeed!
2016-07-13 18:23:53.576980 I | etcd-tester: [round#2] compacting 642280 entries (timeout 22s)
2016-07-13 18:23:53.576991 I | etcd-tester: [round#2] canceling the stressers...
2016-07-13 18:23:53.577003 I | etcd-tester: "10.240.0.19:2379" canceled
2016-07-13 18:23:53.577011 I | etcd-tester: "10.240.0.20:2379" canceled
2016-07-13 18:23:53.577021 I | etcd-tester: "10.240.0.22:2379" canceled
2016-07-13 18:23:53.577031 I | etcd-tester: [round#2] canceled stressers
2016-07-13 18:23:53.577041 I | etcd-tester: [round#2] compacting storage (current revision 1939051, compact revision 1929051)
2016-07-13 18:23:53.577078 I | etcd-tester: [compact kv #0] starting (endpoint http://10.240.0.19:2379)
2016-07-13 18:23:53.577117 I | etcd-tester: stresser to "10.240.0.20:2379" started
2016-07-13 18:23:53.577134 I | etcd-tester: stresser to "10.240.0.19:2379" started
2016-07-13 18:23:53.577228 I | etcd-tester: stresser to "10.240.0.22:2379" started

@gyuho gyuho changed the title etcd-tester: pause stresser for compaction [WIP] etcd-tester: pause stresser for compaction Jul 11, 2016
@gyuho gyuho changed the title [WIP] etcd-tester: pause stresser for compaction [DO NOT MERGE] etcd-tester: pause stresser for compaction Jul 11, 2016
@gyuho gyuho force-pushed the pause-before-compaction branch 8 times, most recently from 2b18785 to 16fa9a9 Compare July 11, 2016 22:11
@gyuho gyuho changed the title [DO NOT MERGE] etcd-tester: pause stresser for compaction etcd-tester: pause stresser for compaction Jul 12, 2016
@gyuho gyuho force-pushed the pause-before-compaction branch 2 times, most recently from 16fb7ca to 4652155 Compare July 12, 2016 20:18
@gyuho gyuho changed the title etcd-tester: pause stresser for compaction etcd-tester: pause stresser for compaction, fix race condition in cancel Jul 12, 2016
@gyuho gyuho force-pushed the pause-before-compaction branch 7 times, most recently from ea569cf to 5daa121 Compare July 12, 2016 22:15
@gyuho gyuho changed the title etcd-tester: pause stresser for compaction, fix race condition in cancel etcd-tester: pause stresser for compaction Jul 12, 2016
@gyuho gyuho force-pushed the pause-before-compaction branch 2 times, most recently from 566619b to 25abbb5 Compare July 13, 2016 00:03
@gyuho gyuho added the WIP label Jul 13, 2016
@gyuho gyuho force-pushed the pause-before-compaction branch 5 times, most recently from 0aed9eb to 3314318 Compare July 13, 2016 17:31
@gyuho gyuho changed the title etcd-tester: pause stresser for compaction etcd-tester: pause before compaction, fix races Jul 13, 2016
@gyuho gyuho force-pushed the pause-before-compaction branch 2 times, most recently from 6ebae67 to 02fc2e4 Compare July 13, 2016 18:55
@gyuho gyuho removed the WIP label Jul 13, 2016
@gyuho gyuho force-pushed the pause-before-compaction branch 7 times, most recently from 60b8dd0 to 12e4001 Compare July 13, 2016 23:34
@gyuho
Copy link
Contributor Author

gyuho commented Jul 13, 2016

@xiang90 @heyitsanthony Can you check this change? I am trying to find where I am leaking stresser goroutines... This change keeps giving me these errors.

tester.txt

2016-07-13 23:37:11.776900 I | etcd-tester: [round#1 case#17] injected failure
2016-07-13 23:37:11.776944 I | etcd-tester: [round#1 case#17] recovering failure "failpoint github.com/coreos/etcd/etcdserver/raftAfterSave panic majority"
2016-07-13 23:37:11.780387 I | etcd-tester: [round#1 case#17] recovered failure
2016-07-13 23:37:11.780448 I | etcd-tester: [round#1 case#17] canceling the stressers...
2016-07-13 23:37:11.780932 I | etcd-tester: stresser "10.240.0.19:2379" is canceled
2016-07-13 23:37:11.781476 I | etcd-tester: stresser "10.240.0.20:2379" is canceled
2016-07-13 23:37:11.782052 I | etcd-tester: stresser "10.240.0.22:2379" is canceled
2016-07-13 23:37:11.782064 I | etcd-tester: [round#1 case#17] canceled stressers
2016-07-13 23:37:11.782071 I | etcd-tester: [round#1 case#17] updating current revisions...
2016-07-13 23:37:12.592128 I | etcd-tester: stresser "10.240.0.19:2379" is already canceled
2016-07-13 23:37:12.592518 I | etcd-tester: stresser "10.240.0.20:2379" is already canceled
2016-07-13 23:37:12.592536 I | etcd-tester: stresser "10.240.0.22:2379" is already canceled
2016-07-13 23:37:13.378665 I | etcd-tester: stresser "10.240.0.19:2379" is already canceled
2016-07-13 23:37:13.592102 I | etcd-tester: stresser "10.240.0.19:2379" is already canceled
2016-07-13 23:37:13.592491 I | etcd-tester: stresser "10.240.0.22:2379" is already canceled
2016-07-13 23:37:13.592536 I | etcd-tester: stresser "10.240.0.20:2379" is already canceled
2016-07-13 23:37:13.835134 I | etcd-tester: stresser "10.240.0.20:2379" is already canceled
2016-07-13 23:37:13.835205 I | etcd-tester: stresser "10.240.0.20:2379" is already canceled
2016-07-13 23:37:14.102734 I | etcd-tester: stresser "10.240.0.22:2379" is already canceled
2016-07-13 23:37:14.102956 I | etcd-tester: stresser "10.240.0.22:2379" is already canceled
2016-07-13 23:37:14.102971 I | etcd-tester: stresser "10.240.0.22:2379" is already canceled
2016-07-13 23:37:14.103217 I | etcd-tester: stresser "10.240.0.22:2379" is already canceled
2016-07-13 23:37:14.378713 I | etcd-tester: stresser "10.240.0.19:2379" is already canceled
2016-07-13 23:37:14.433072 I | etcd-tester: stresser "10.240.0.19:2379" is already canceled
2016-07-13 23:37:14.433133 I | etcd-tester: stresser "10.240.0.19:2379" is already canceled
2016-07-13 23:37:14.433143 I | etcd-tester: stresser "10.240.0.19:2379" is already canceled
2016-07-13 23:37:14.592185 I | etcd-tester: stresser "10.240.0.19:2379" is already canceled
2016-07-13 23:37:14.592534 I | etcd-tester: stresser "10.240.0.22:2379" is already canceled
2016-07-13 23:37:14.592585 I | etcd-tester: stresser "10.240.0.20:2379" is already canceled
2016-07-13 23:37:14.765747 I | etcd-tester: [round#1 case#17] updated current revisions with 965651
2016-07-13 23:37:14.765811 I | etcd-tester: [round#1 case#17] all members are consistent with current revisions [revisions: map[http://10.240.0.19:2379:965651 http://10.240.0.20:2379:965651 http://10.240.0.22:2379:965651]]
2016-07-13 23:37:14.765820 I | etcd-tester: [round#1 case#17] checking current storage hashes...
2016-07-13 23:37:14.765834 I | etcd-tester: [round#1 case#17] checking current storage hashes failed [hashes: map[http://10.240.0.19:2379:4077193248 http://10.240.0.20:2379:506525265 http://10.240.0.22:2379:506525265]]
2016-07-13 23:37:14.765842 I | etcd-tester: [round#1 case#17] starting the stressers...
2016-07-13 23:37:14.765852 I | etcd-tester: [round#1 case#17] started stressers
2016-07-13 23:37:14.765872 I | etcd-tester: [round#1] cleaning up...

So the cancel waits for all goroutines to exit counting the waitgroup, but the log shows there are still ongoing stresser.run goroutines, which are causing inconsistent revisions.

@heyitsanthony
Copy link
Contributor

@gyuho in stresser.Cancel() I think it should be wg := &s.wg

@gyuho
Copy link
Contributor Author

gyuho commented Jul 14, 2016

@heyitsanthony We already define in pointer

type stresser struct {
    Endpoint string

    KeySize        int
    KeySuffixRange int

    qps int
    N   int

    mu          sync.Mutex
    wg          *sync.WaitGroup
    conn        *grpc.ClientConn
    rateLimiter *rate.Limiter
    cancel      func()
    canceled    bool
    success     int
}

@gyuho gyuho force-pushed the pause-before-compaction branch from 12e4001 to 7d2b39c Compare July 14, 2016 17:02
@xiang90
Copy link
Contributor

xiang90 commented Jul 14, 2016

After this changes, everything works fine?

@gyuho
Copy link
Contributor Author

gyuho commented Jul 14, 2016

@xiang90 I will try again today. I was running this patch yesterday but kept getting

2016-07-13 23:37:14.765834 I | etcd-tester: [round#1 case#17] checking current storage hashes failed [hashes: map[http://10.240.0.19:2379:4077193248 http://10.240.0.20:2379:506525265 http://10.240.0.22:2379:506525265]]

I think I am still doing cancellation wrong.

Thanks!

@gyuho gyuho added the WIP label Jul 14, 2016
- increase default qps
- fix race condition between stresser cancel, start
- pause before doing compaction
- remove duplicate cleanup calls
- add additional atomic canceled check
@gyuho gyuho force-pushed the pause-before-compaction branch from 7d2b39c to 9453d6b Compare July 14, 2016 17:43
@heyitsanthony
Copy link
Contributor

@gyuho yes, you're right about the wg pointer. What about:

func (tt *tester) startStressers() {
    plog.Printf("%s starting the stressers...", tt.logPrefix())
    for _, s := range tt.cluster.Stressers {
        go s.Stress()
    }

Should this be synchronous instead of launched in a new goroutine? Otherwise the stress cancel could happen before the stress start completes.

@heyitsanthony
Copy link
Contributor

Looking at the code closer, Stress() would have to be reworked a little bit to make it only synchronously start the stressers (e.g., remove the <-ctx.Done()) instead of waiting for it to complete, but I think it'll be easier to reason about overall. The mutex interleaving between Cancel() and Stress() looks tricky and I don't think it buys much in terms of what the code is actually doing.

@gyuho
Copy link
Contributor Author

gyuho commented Jul 14, 2016

@heyitsanthony I will try to make it more synchronous.

I tried to hold the lock to block concurrent cancellation with additional canceled flag

but I still get this error...

2016-07-14 18:10:18.469707 I | etcd-tester: stresser "10.240.0.19:2379" is canceled
2016-07-14 18:10:18.470212 I | etcd-tester: stresser "10.240.0.20:2379" is canceled
2016-07-14 18:10:18.470639 I | etcd-tester: stresser "10.240.0.22:2379" is canceled
2016-07-14 18:10:18.470656 I | etcd-tester: [round#6 case#17] canceled stressers
2016-07-14 18:10:18.470666 I | etcd-tester: [round#6 case#17] updating current revisions...
2016-07-14 18:10:18.982076 I | etcd-tester: stresser "10.240.0.22:2379" is already canceled
2016-07-14 18:10:19.289690 I | etcd-tester: stresser "10.240.0.22:2379" is already canceled
2016-07-14 18:10:21.190023 I | etcd-tester: stresser "10.240.0.22:2379" is already canceled
2016-07-14 18:10:21.190199 I | etcd-tester: stresser "10.240.0.22:2379" is already canceled
2016-07-14 18:10:21.190215 I | etcd-tester: stresser "10.240.0.22:2379" is already canceled
2016-07-14 18:10:21.527262 I | etcd-tester: [round#6 case#17] updated current revisions with 4261260
2016-07-14 18:10:21.527303 I | etcd-tester: [round#6 case#17] all members are consistent with current revisions [revisions: map[http://10.240.0.19:2379:4261260 http://10.240.0.20:2379:4261260 http://10.240.0.22:2379:4261260]]
2016-07-14 18:10:21.527311 I | etcd-tester: [round#6 case#17] checking current storage hashes...
2016-07-14 18:10:21.527322 I | etcd-tester: [round#6 case#17] checking current storage hashes failed [hashes: map[http://10.240.0.19:2379:3309157168 http://10.240.0.20:2379:176671884 http://10.240.0.22:2379:176671884]]

Thanks!

@heyitsanthony
Copy link
Contributor

@gyuho I don't really trust that canceled flag since the cancelation status is already built into the ctx; the way it is now, if it's racing on context cancelation it'll race on the canceled flag too.

@gyuho
Copy link
Contributor Author

gyuho commented Jul 14, 2016

@heyitsanthony Yeah it got too messy. I will try to make startStresser synchronous removing <-ctx.Done inside, as you suggest.

Thanks.

@gyuho gyuho removed the WIP label Jul 14, 2016
@gyuho gyuho closed this Jul 14, 2016
@gyuho gyuho deleted the pause-before-compaction branch July 14, 2016 20:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants