Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Discuss] Don't panic if server fails to create scheduler #1443

Closed
nanne007 opened this issue Feb 26, 2019 · 8 comments
Closed

[Discuss] Don't panic if server fails to create scheduler #1443

nanne007 opened this issue Feb 26, 2019 · 8 comments
Assignees
Labels
status/discussion-wanted The issue needs to be discussed.

Comments

@nanne007
Copy link
Contributor

nanne007 commented Feb 26, 2019

Please answer these questions before submitting your issue. Thanks!

  1. What did you do?
    If possible, provide a recipe for reproducing the error.

From this PR, If server cannot create scheduler, it just panic. However, the reasons of failing to create scheduler is not just come from wrong configurations.
In real case, maybe schedulers itself raise some error, or maybe pd-ctl send an invalid config due to some kind of bugs. The online server cannot work normally again in these cases, because there is no way to delete the wrong config in the etcd kv before PD crashs.

Also, Should we check the disabled scheduler, if it's configured incorrectly, and delete it too?

	for _, schedulerCfg := range scheduleCfg.Schedulers {
		if schedulerCfg.Disable {
			scheduleCfg.Schedulers[k] = schedulerCfg
			k++
			log.Info("skip create scheduler", zap.String("scheduler-type", schedulerCfg.Type))
			continue
		}
		s, err := schedule.CreateScheduler(schedulerCfg.Type, c.opController, schedulerCfg.Args...)
		if err != nil {
			log.Fatal("can not create scheduler", zap.String("scheduler-type", schedulerCfg.Type), zap.Error(err))
		}
		log.Info("create scheduler", zap.String("scheduler-name", s.GetName()))
		if err = c.addScheduler(s, schedulerCfg.Args...); err != nil {
			log.Error("can not add scheduler", zap.String("scheduler-name", s.GetName()), zap.Error(err))
		}

		// Only records the valid scheduler config.
		if err == nil {
			scheduleCfg.Schedulers[k] = schedulerCfg
			k++
		}
	}
  1. What did you expect to see?

  2. What did you see instead?

  3. What version of PD are you using (pd-server -V)?

@rleungx
Copy link
Member

rleungx commented Feb 26, 2019

@Connor1996 PTAL.

@Connor1996
Copy link
Member

seems that even though pd-ctl sends an invalid config, PD won't persist the config.

// AddScheduler adds a scheduler.
func (h *Handler) AddScheduler(name string, args ...string) error {
	c, err := h.getCoordinator()
	if err != nil {
		return err
	}
	s, err := schedule.CreateScheduler(name, c.limiter, args...)
	if err != nil {
		return err
	}
	log.Infof("create scheduler %s", s.GetName())
	if err = c.addScheduler(s, args...); err != nil {
		log.Errorf("can not add scheduler %v: %v", s.GetName(), err)
	} else if err = h.opt.persist(c.cluster.kv); err != nil {
		log.Errorf("can not persist scheduler config: %v", err)
	}
	return err
}

But for the disabled schedulers, seems they should be checked. /cc @nolouch

@nanne007
Copy link
Contributor Author

nanne007 commented Mar 1, 2019

In fact, I have a running cluster whose schedule config is a total mess.
It is the result of many times of pd restart and leader switch due to bugs or upgrading.
Because I have done some patches personally, It's hard to distinguish which part make the damage.

What I'm trying to point out is that we need a way to recover from/discard invalid config, instead of panic assuming the underlining etcd data always work perfectly.

@Connor1996
Copy link
Member

The panic is to notify users that there may be some typo of the config. If we just discard the invalid config, the PD's behavior may be not as same as what users expect.
I prefer not discarding config here. Do you have any better idea?

@nolouch
Copy link
Contributor

nolouch commented Mar 1, 2019

@lerencao We really need a recover way. @Connor1996 How about adding an option let PD in recovery mode and ignore some panics that similar this situation?

@nanne007
Copy link
Contributor Author

nanne007 commented Mar 1, 2019

Errorred spell from config file may be checked before ahead. And in real case, I believe most people just use default schedulers, and use pd-ctl to change schedulers dynamically.

@rleungx rleungx added the status/discussion-wanted The issue needs to be discussed. label Mar 1, 2019
@Connor1996
Copy link
Member

@lerencao check before ahead seems reasonable and neat

@Connor1996
Copy link
Member

Do you want to have a try? @lerencao

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status/discussion-wanted The issue needs to be discussed.
Projects
None yet
Development

No branches or pull requests

4 participants