error fetching the log events from Auth0 #110
It's odd. Do you mind listing the grants you have set? Were there any additional --flags or values passed to the exporter? Did you manually make a request to the exporter, or was it Prometheus? You deployed the exporter and it errored on the first request, is that correct?
Grants:
Flags:
I have tried manually making a request, but only …
So I looked a bit closer at the config, and it looks like setting … resolved the first error. Now I am running into a different error that appears to be an actual bug:
Would you be able to change the permission to add: …
That did not seem to have any effect. I kicked the pod to be sure.
The error seems to come from the auth0 go-client: https://github.com/auth0/go-auth0/blob/main/management/log.go#L160.
I'll see what I can do, but so far, looking through logs, nothing is sticking out other than some logs not containing a value for user_id.
Thank you! In the Auth0 log dashboard for your tenant, you should be able to try this Lucene query: …
Unfortunately not, the call to the Auth0 SDK seems to fail before returning any logs. I could add a log statement and maybe capture the logs fetched just before the problematic one.
By any chance, are you using a custom Auth0 connection for your users? https://auth0.com/docs/manage-users/user-accounts/identify-users
The queries returned no results, but we are using a custom database connection for users, yes.
I've tried to replicate the issue by adding a custom connection to my Auth0 tenant and setting the user_id to an integer, but the system automatically adds the "auth0|" prefix, so it becomes "auth0|1". It doesn't seem possible for a user_id to be an int. I'm going to open an issue in the auth0 go client.
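To make the suspected failure mode concrete, here is a minimal, self-contained sketch. It is not the exporter's or go-auth0's actual code; the logEntry struct below is a simplified stand-in for the SDK's Log type, which models user_id as a string:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// logEntry is a simplified stand-in for the go-auth0 Log type, which keeps
// user_id as a string field.
type logEntry struct {
	Type   *string `json:"type,omitempty"`
	UserID *string `json:"user_id,omitempty"`
}

func main() {
	// A log record whose user_id arrives as a JSON number rather than a string.
	raw := []byte(`{"type":"s","user_id":12345}`)

	var entry logEntry
	if err := json.Unmarshal(raw, &entry); err != nil {
		// Prints an error along the lines of:
		// json: cannot unmarshal number into Go struct field logEntry.user_id of type string
		fmt.Println("unmarshal error:", err)
		return
	}
	fmt.Println("user_id:", *entry.UserID)
}
```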
Under …:
I don't know off the top of my head if it's common for a user_id to look like this, but it is still a string.
@tfadeyi Is it possible to make modifications to log out what value was encountered, or is this all wrapped up in the go-auth0 client?
Hey @hobbsh, thank you for reporting that instance. I think, as you said, it is still a string, so it should still be able to be unmarshalled correctly.
Unfortunately, it is all abstracted by the go-auth0 client. I think an alternative would be to generate a client for Auth0 separate from go-auth0, but it would require quite a bit of effort.
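For illustration, a hand-rolled decode layer could tolerate either shape of user_id by normalising numbers to strings. This is only a sketch of the idea against a simplified log structure; it is not go-auth0's API or the exporter's code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// flexibleID accepts either a JSON string or a JSON number and normalises
// both to a string.
type flexibleID string

func (f *flexibleID) UnmarshalJSON(data []byte) error {
	var s string
	if err := json.Unmarshal(data, &s); err == nil {
		*f = flexibleID(s)
		return nil
	}
	var n json.Number
	if err := json.Unmarshal(data, &n); err != nil {
		return err
	}
	*f = flexibleID(n.String())
	return nil
}

type logEntry struct {
	UserID flexibleID `json:"user_id"`
	Type   string     `json:"type"`
}

func main() {
	// Both a prefixed string ID and a bare numeric ID decode cleanly.
	for _, raw := range []string{`{"type":"s","user_id":"auth0|1"}`, `{"type":"s","user_id":12345}`} {
		var e logEntry
		if err := json.Unmarshal([]byte(raw), &e); err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("type=%s user_id=%s\n", e.Type, e.UserID)
	}
}
```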
@tfadeyi I don't know why it took me this long to do it, but I built and ran it locally with the same parameters as in Kubernetes and it works no problem. What could the discrepancy be?
So when you ran it locally you didn't encounter the unmarshalling error? I did make some changes to the request timeouts, which are available in the v0.2.2 release, but I'm not sure if those solve the issue.
The problem seemed to be not including the token and client credentials.
On a side note, I am still seeing the unmarshalling error.
It also looks like Prometheus is hitting its scrape timeout.
I see, so you still encountered the unmarshalling issue when …?
Yeah, it simplified the Helm chart; it leaves it to the exporter code to decide which credentials to use.
I thought the changes from 0.2.2 might help in this scenario, but it seems like it takes too much time to fetch the tenant users. What are your current scraping period and timeout?
I set …
Correct, and it's also set in our Kubernetes deployment.
So rolling this out to our upper environments, I am seeing timeouts again. I wonder if it would make sense to either make the exporter not look back for logs at all and start from the time it starts up, or add a way to only fetch logs for specific clients.
I should have a couple of hours to look into these. I'll update the logs with the number of logs fetched before erroring. What value did you assign to auth0.from? I think adding the client filter would be a good idea; I'll look at how to implement it. I'm still trying to check why it worked when you set the token and client credentials. If you don't mind, could I see your redacted exporter configuration?
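As a rough illustration of what such a client filter could look like, the exporter could accept an allow-list of client IDs and drop any fetched log that doesn't match. The flag name and the logEntry type below are hypothetical, not the exporter's actual flags or types:

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// logEntry is a minimal illustrative log record, not the go-auth0 type.
type logEntry struct {
	ClientID string
	Type     string
}

func main() {
	// Hypothetical flag: comma-separated client IDs to export metrics for.
	clients := flag.String("auth0.clients", "", "comma-separated client IDs to export metrics for (empty = all)")
	flag.Parse()

	allow := map[string]bool{}
	for _, id := range strings.Split(*clients, ",") {
		if id = strings.TrimSpace(id); id != "" {
			allow[id] = true
		}
	}

	logs := []logEntry{{ClientID: "abc123", Type: "s"}, {ClientID: "def456", Type: "f"}}
	for _, l := range logs {
		// Keep every log when no allow-list is given, otherwise only matches.
		if len(allow) == 0 || allow[l.ClientID] {
			fmt.Printf("exporting log: client=%s type=%s\n", l.ClientID, l.Type)
		}
	}
}
```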
Last night I set it to 11/17, which it had just turned over to in UTC, so the lookback should have been minimal. However, after running OK for a couple of hours, the timeouts came back. As for the config:
And setting …
Would you be able to give this a try when you get a chance? https://github.com/tfadeyi/auth0-simple-exporter/pkgs/container/auth0-simple-exporter/149141846?tag=v0.2.4
Still the same issue, unfortunately. So if the exporter relies on --auth0.from …
Do you mind posting the error log again? Currently, if --auth0.from is not set, it defaults to time.Now(), and the checkpoint is fetched by getting the last log of the previous day; if no logs are present in the previous day, it keeps going back one day at a time until it finds one, up to a maximum of 30 days back, which I think is the standard retention period for logs in Auth0.
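In Go, that lookback behaviour would look roughly like the sketch below. It's a paraphrase of the description above, not the exporter's actual implementation; findCheckpoint and lastLogOfDay are hypothetical names, with the Auth0 query stubbed out:

```go
package main

import (
	"fmt"
	"time"
)

// findCheckpoint walks back one day at a time, up to maxLookback days
// (roughly Auth0's log retention), until lastLogOfDay reports a log, and
// returns that log's ID as the starting checkpoint. lastLogOfDay is a
// hypothetical callback standing in for a query against the Auth0 logs API.
func findCheckpoint(from time.Time, maxLookback int, lastLogOfDay func(day time.Time) (logID string, ok bool)) (string, bool) {
	for i := 1; i <= maxLookback; i++ {
		day := from.AddDate(0, 0, -i)
		if id, ok := lastLogOfDay(day); ok {
			return id, true
		}
	}
	return "", false
}

func main() {
	// With no --auth0.from set, the starting point defaults to "now".
	from := time.Now()

	// Fake tenant whose most recent log is two days old.
	cutoff := from.AddDate(0, 0, -2)
	fake := func(day time.Time) (string, bool) {
		if !day.After(cutoff) {
			return "example-log-id", true
		}
		return "", false
	}

	if id, ok := findCheckpoint(from, 30, fake); ok {
		fmt.Println("checkpoint log ID:", id)
	} else {
		fmt.Println("no logs found within the lookback window")
	}
}
```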
Ah, so I didn't remove --auth0.from …
We did get an abnormal spike in activity/logs overnight, which I'm sure contributed, as collection started having issues around that time. It doesn't appear that we crossed 12k logs in the past 24 hours until a few hours ago, though.
@hobbsh Whenever you get a chance, would you mind trying this prerelease? https://github.com/tfadeyi/auth0-simple-exporter/pkgs/container/auth0-simple-exporter/149708045?tag=v0.2.5-alpha.1
It ran for ~2 hours, then the "error fetching the log events from Auth0" errors came back.
If I restart the pod, it starts running again.
Did the error only occur once? When it occurred, did the exporter stop working completely? Is it able to answer queries from Prometheus? I'm asking because the exporter should be able to recover from errors in the requests.
The error repeats at every collection interval for Auth0 and could be a red herring. The Prometheus scraping fails at this point too - the target (ServiceMonitor) is considered down as far as Prometheus is concerned because it is blowing past the configured scrape timeout.
The exporter does not seem to be resource-constrained, but I am bumping its resource requests to see if it helps.
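One generic way to keep a hung collection from holding the scrape open past the Prometheus timeout is to bound the /metrics handler itself with the standard library's http.TimeoutHandler. This is only a sketch of the pattern, not the exporter's actual server setup, and it only bounds the HTTP response; cancelling the underlying Auth0 calls would still need a deadline of its own (see the timeout sketch further down):

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Bound the /metrics handler so a hung Auth0 fetch cannot stall a scrape
	// indefinitely: after 25s the client gets a 503 instead of a hanging
	// connection. The 25s value is only an example and should sit below the
	// scrape timeout configured for this target.
	metrics := http.TimeoutHandler(promhttp.Handler(), 25*time.Second, "collection timed out")
	http.Handle("/metrics", metrics)
	_ = http.ListenAndServe(":8080", nil)
}
```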
Looking at the metrics after curl'ing locally, it looks like that one request 5x'd the memory usage on the pod.
@tfadeyi Unfortunately, after 2.5 hours (almost exactly), metrics stop flowing. With …
I'll make a new release with a fix; hopefully that resolves the issue.
Thanks! The panic stopped using the new release, but …
I'll be adding a new flag to disable the metric (https://github.com/tfadeyi/auth0-simple-exporter/pull/132/files#diff-5ba0a0c121dd1ea9d455c9163c3d6ca5d477d225c3ed873668aa7138e51e5e0a), and I'll make a release. Is the exporter still not exporting log metrics?
There should be a new release: https://github.com/tfadeyi/auth0-simple-exporter/pkgs/container/auth0-simple-exporter
So far so good, it's been running for the longest continuous stretch it ever has! 🤞
@hobbsh are things still stable with the exporter?
@tfadeyi Looking great, thanks for working through this with me!
Perfect! Then I'll make a full stable release soon!
@tfadeyi Unfortunately the exporter is still hanging (although after much longer periods of operation). It's still timing out fetching logs (this seems to correlate with spikes in activity, but not always). I'm wondering a couple of things: whether the client timeout could be made configurable, and whether the exporter could fetch logs for only a subset of Auth0 clients.
@hobbsh I think we can make the client timeout a flag. Currently the timeout is disabled, so I think the Prometheus timeouts must have ended the request. I'll add more logging around the fetch times, and I'll see if I can also add a whitelist flag for the Auth0 clients to fetch metrics from. By any chance, do you have a general idea of how much time passes before the issue occurs?
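For reference, a configurable client timeout could be wired through a context deadline around the log fetch. This is a standalone sketch of the idea, assuming a hypothetical --client.timeout style value and a stubbed fetch function; it is not the exporter's or go-auth0's actual code:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// fetchLogs stands in for the call into the Auth0 SDK; the real exporter goes
// through go-auth0's management client. The point here is the context
// deadline: with a configurable timeout, a slow tenant fetch fails fast
// instead of hanging until Prometheus gives up on the scrape.
func fetchLogs(ctx context.Context) ([]string, error) {
	select {
	case <-time.After(2 * time.Second): // simulate a slow Auth0 response
		return []string{"log-1", "log-2"}, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

func main() {
	// Hypothetical value that a --client.timeout flag could provide.
	timeout := 500 * time.Millisecond

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	logs, err := fetchLogs(ctx)
	if errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("auth0 fetch timed out after", timeout)
		return
	}
	if err != nil {
		fmt.Println("auth0 fetch failed:", err)
		return
	}
	fmt.Println("fetched", len(logs), "logs")
}
```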
Better logic around the exporter health. Related #110 Signed-off-by: oluwole fadeyi <[email protected]>
@hobbsh Sorry for the lack of updates. If you are still running the exporter, I've made a new release that includes additional health checks.
@tfadeyi I've been running it for several weeks and it looks much better. Thanks so much for the follow-up and dedication to this issue!
What happened?
Seeing this in the logs upon setup; unsure what the issue could be, as the error is not terribly descriptive.
What should have happened?
The request would ideally have succeeded.
Reproduction
Running with an internal Helm chart because the existing one has a bug, but that should not be getting in the way here.