
Oathkeeper (v0.14.2_oryOS.10) returning empty reply on slow/long distance database calls #178

Closed
dietb opened this issue Apr 25, 2019 · 8 comments

@dietb

dietb commented Apr 25, 2019

Describe the bug
I have Oathkeeper running in an East US Kubernetes cluster, and the DB currently lives in West US. There's about 70ms of latency between the two sites.

When running the following curl in the same k8s cluster as Oathkeeper, I'm seeing this:

curl:
root@shellpod-bd-55c5b5cc74-nlmrh:/# curl "http://oathkeeper-api.hydra.svc.cluster.local:4456/rules?limit=50000&offset=0"
curl: (52) Empty reply from server

[oathkeeper-api-7f4cc5cb9f-dbmr6] time="2019-04-24T18:50:58Z" level=info msg="started handling request" method=GET remote="10.39.0.47:60820" request="/rules?limit=50000&offset=0"
[oathkeeper-api-7f4cc5cb9f-dbmr6] time="2019-04-24T18:51:24Z" level=info msg="completed handling request" measure#oathkeeper-api.latency=26073727174 method=GET remote="10.39.0.47:60820" request="/rules?limit=50000&offset=0" status=200 text_status=OK took=26.073727174s

Running what I believe to be the same query (taken from the Postgres logs and from the code: https://github.com/ory/oathkeeper/blob/master/rule/manager_sql.go#L79) from the same pod that issues the curl requests returns within a few ms.

There are currently around 50 rules configured. When I poll with a limit below 20 (specifically, 19) it works. Once I try 20, it times out at just over 10 seconds.

To Reproduce
Steps to reproduce the behavior:

  1. Have an oathkeeper instance with a 70ms lag between the instance and the database
  2. Either manually poll oathkeeper rules or set up oathkeeper-proxy

Expected behavior
For things to not time out :-)

Version:

  • Environment: Kubernetes 1.12.7
  • Version: oryd/oathkeeper:v0.14.2_oryOS.10

Additional context
I can reproduce this pretty easily if needed.

@aeneasr
Member

aeneasr commented Apr 27, 2019

Could you show the (truncated) db connection string?

@aeneasr
Member

aeneasr commented Apr 27, 2019

Also, we might not fix this because there are some changes incoming with the next release: #177

@dietb
Author

dietb commented Apr 27, 2019

Nothing fancy.

postgres://:@haproxy.default.svc.cluster.local:5432/auth?sslmode=verify-ca

I tried to up the max_conns but didn't see any positive results. I dug through the code and it doesn't look like that'd help a ton anyway.
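
For reference, that was set via the DSN query string, roughly like this (the value is just illustrative):

postgres://:@haproxy.default.svc.cluster.local:5432/auth?sslmode=verify-ca&max_conns=50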

Once work starts on the issue referenced, I may be able to dedicate some time to helping out with it.

@aeneasr
Member

aeneasr commented Apr 27, 2019

Ok, damn - that was what I thought could work. Maybe there is a timeout based on the transaction size? I don't really know...

What do you think of the proposal laid out in #177? Would it suit your use case?

@dietb
Author

dietb commented Apr 27, 2019

I can keep poking around and add some logging when I get a chance. I assume part of it is that it's making multiple queries, and 70ms × N adds up. Writing those as nested/joined single queries would probably be faster overall, but.....
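
A rough back-of-the-envelope sketch of how that adds up, assuming the 1 + 4*N query pattern from manager_sql.go and ~70ms per round trip (network latency only, ignoring query execution and per-query overhead):

package main

import (
	"fmt"
	"time"
)

func main() {
	const rtt = 70 * time.Millisecond
	for _, n := range []int{19, 20, 50} {
		// one query for the ID page, then 1 rule query + 3 handler queries per rule
		queries := 1 + 4*n
		fmt.Printf("%d rules -> %d round trips -> at least %v of network latency\n",
			n, queries, time.Duration(queries)*rtt)
	}
}

For 50 rules that's about 200 round trips, i.e. ~14s of pure network time, which is at least in the same ballpark as the 26s in the log above once query execution is added on top.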

#177 is definitely a big step in the right direction. Localized reads stored in memory will be a lot faster than constant DB reads. Things would get a little trickier with hot reloads because you'd have to read repeatedly, but the odds of a KV store being faster than DB lookups are pretty strong.

I have some other thoughts about #177 that probably aren't appropriate for here. If I get time, I'll hop on discord to rant.

@aeneasr
Member

aeneasr commented Apr 27, 2019

Please do! Also feel free to post them in #177!

@ridhamtarpara

I was having the same problem with 15 rules, due to the same cross-region issue (europe-east and asia-south), and traced it down. In my case, it was happening because of the sequential queries in the ListRules logic:

func (s *SQLManager) ListRules(limit, offset int) ([]Rule, error) {
	var ids []string
	if err := s.db.Select(&ids, s.db.Rebind("SELECT surrogate_id FROM oathkeeper_rule ORDER BY id LIMIT ? OFFSET ?"), limit, offset); err == sql.ErrNoRows {
		return []Rule{}, nil
	} else if err != nil {
		return nil, errors.WithStack(err)
	}

	rules := make([]Rule, len(ids))
	for k, id := range ids {
		d, err := s.GetRule(id)
		if err != nil {
			return nil, errors.WithStack(err)
		}
		rules[k] = *d
	}

	return rules, nil
}

func (s *SQLManager) GetRule(id string) (*Rule, error) {
	d := sqlRule{
		Authenticators:     []sqlRuleHandler{},
		Authorizers:        []sqlRuleHandler{},
		CredentialsIssuers: []sqlRuleHandler{},
	}
	if err := s.db.Get(&d, s.db.Rebind("SELECT * FROM oathkeeper_rule WHERE surrogate_id=?"), id); err == sql.ErrNoRows {
		return nil, errors.WithStack(helper.ErrResourceNotFound)
	} else if err != nil {
		return nil, errors.WithStack(err)
	}

	if err := s.getHandler("oathkeeper_rule_authorizer", id, &d.Authorizers); err != nil {
		return nil, errors.WithStack(err)
	}
	if err := s.getHandler("oathkeeper_rule_authenticator", id, &d.Authenticators); err != nil {
		return nil, errors.WithStack(err)
	}
	if err := s.getHandler("oathkeeper_rule_credentials_issuer", id, &d.CredentialsIssuers); err != nil {
		return nil, errors.WithStack(err)
	}

	return d.toRule()
}

func (s *SQLManager) getHandler(table, id string, rh *sqlRuleHandlers) (err error) {
	if err := s.db.Select(rh, s.db.Rebind(fmt.Sprintf("SELECT * FROM %s WHERE rule_id=?", table)), id); err != nil {
		return errors.WithStack(err)
	}
	return
}

So I changed the query to use SQL joins, which worked for me. It was necessary because we are running Oathkeeper in production, so it may be a workaround for you too; after #177 this won't be an issue anymore.
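
A minimal sketch of that kind of batched lookup, assuming the schema implied by the queries above; the handlerRow columns are assumptions, and the actual change may have used a single joined query rather than one IN query per handler table:

import (
	"encoding/json"
	"fmt"

	"github.com/jmoiron/sqlx"
	"github.com/pkg/errors"
)

// handlerRow is a hypothetical scan target; the real sqlRuleHandler type and
// column names in manager_sql.go may differ.
type handlerRow struct {
	RuleID  string          `db:"rule_id"`
	Handler string          `db:"handler"`
	Config  json.RawMessage `db:"config"`
}

// listRuleHandlers replaces the per-rule getHandler calls with a single query
// per handler table for a whole page of rule IDs, grouping the rows by rule_id
// in memory.
func listRuleHandlers(db *sqlx.DB, table string, ids []string) (map[string][]handlerRow, error) {
	query, args, err := sqlx.In(
		fmt.Sprintf("SELECT rule_id, handler, config FROM %s WHERE rule_id IN (?)", table), ids)
	if err != nil {
		return nil, errors.WithStack(err)
	}

	var rows []handlerRow
	if err := db.Select(&rows, db.Rebind(query), args...); err != nil {
		return nil, errors.WithStack(err)
	}

	byRule := map[string][]handlerRow{}
	for _, r := range rows {
		byRule[r.RuleID] = append(byRule[r.RuleID], r)
	}
	return byRule, nil
}

Fetching the full rule rows with the existing LIMIT/OFFSET query (instead of only the IDs) and then calling this once per handler table means a page of N rules costs four round trips instead of 1 + 4*N, which is what makes the 70ms link tolerable.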

@dietb
Author

dietb commented Apr 30, 2019

@ridhamtarpara Would you mind posting the query? I'd like to take a look at it.
