loki.source.windowsevent - not able to handle more than 5000 events per minute "in time" and starts to lag behind #2539
Discussing issues with loki.source.windowsevent
@wildum
This page should help you: https://grafana.com/docs/alloy/latest/troubleshoot/profile/#profile-grafana-alloy-resource-consumption
Thanks for sharing, looking at the CPU usage:
Looking at the code, I see that the position is saved after every event. This seems like overkill to me. A few options would greatly improve it:
In GetFromSnapProcess, it iterates over all processes running on the Windows machine until it finds the one that matches the pid. I can think of a few options to improve this:
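One plausible option is memoizing the pid-to-name lookup, since consecutive events very often come from the same handful of processes. The sketch below is illustrative only (the real lookup is a Windows snapshot walk via CreateToolhelp32Snapshot; `lookupName`, `procCache`, and the counter are invented names):

```go
package main

import "fmt"

// snapshotWalks counts how often the expensive lookup runs.
var snapshotWalks int

// lookupName stands in for the expensive snapshot walk that scans every
// running process until the pid matches.
func lookupName(pid uint32) string {
	snapshotWalks++
	return fmt.Sprintf("process-%d.exe", pid)
}

// procCache memoizes pid -> name so repeated events from the same process
// skip the snapshot walk. Real code would also need to evict entries,
// since Windows reuses pids after a process exits.
type procCache struct {
	names map[uint32]string
}

func (c *procCache) name(pid uint32) string {
	if n, ok := c.names[pid]; ok {
		return n
	}
	n := lookupName(pid)
	c.names[pid] = n
	return n
}

func main() {
	c := &procCache{names: make(map[uint32]string)}
	pids := []uint32{4, 512, 4, 4, 512} // events often repeat the same source pid
	for _, pid := range pids {
		c.name(pid)
	}
	fmt.Println(snapshotWalks) // only 2 walks for 5 events
}
```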
Are profiles from our environment still required? Because of the note about sensitive information in the documentation, I don't want to upload dumps from our production environment here. When using a smaller event log size, the time of the last ingested log is more or less equal to the oldest log in the event log. We also seem to have a smaller number of logs ingested in Loki. So Alloy is probably not able to keep up and at some point ingests the oldest logs available while the rest gets dropped due to the file size limit. My solution is a really big event log (40 GB by now). During work hours the offset is up to 3 hours, but Alloy catches up when people go home and fewer logs are generated.
Hey, thanks for the additional data. I don't think we need an additional profile; I believe the one shared by @Nachtfalkeaw is enough to show the bottlenecks. I could start working on the options above. I can test it locally on my personal Windows machine, but I don't have a big setup. If one of you is OK with trying a custom build once I have a fix available, that would be nice. (I could provide a Docker image / Windows installer / Windows exe, or you could just build it from the branch where the fix will be, whatever works for you.)
Hey! I am OK with testing the fix, but to see if it improves performance I have to run it in our production environment, from which I cannot share dumps with a clear conscience. Our testing environment does not have enough servers and users to generate the required amount of logs. A Windows installer would be best for me.
Hello, maybe save it as a factor of the scrape time: if the poll interval for logs is 3s, then maybe write the position to disk every poll_interval x 3, but at most once per second and at least once every 10 seconds? Not sure, just an idea. Another thing I noticed is that our local antivirus has high CPU load, which seems to depend on Alloy's load, so maybe writing files costs antivirus CPU. Process names and process IDs are important for me, I think, so I would not want to miss that info; making it optional for others who do not need it would be fine. I could test it. The server is a test environment. An MSI installer would be fine. However, do I have to use another stage for that or can I keep my config?
Hey, you can keep your config. You should not see any change other than a significant CPU usage reduction. I sent you the custom installer on Slack. @l-freund I can also send it to you via Slack if you want. For now, I only worked on the position file (the part taking more than 60% of the CPU). Once this improvement proves successful, I can dig into the second one (GetFromSnapProcess).
Hello, I have good and bad news. First, the bookmark / file position change drastically reduced the load on our local antivirus: it went down from 80% to 2% (the red line in the CPU graph).

Bad news, or at least not the expected behaviour: I have another tool running which counts the "ForwardedEvents" logs, and it counts around 17,000 - 20,000 per minute at this time of the day. Before, with the old Alloy, I had around 5-7k logs per minute ingested, and now 15k-18k per minute, so from the Loki log volume perspective this is much closer to the actual rate than before. From this perspective, it looks like Alloy's higher CPU usage is due to ingesting more logs.

So here is the log volume change after changing the Alloy version, and here the CPU usage of Alloy and the antivirus from prometheus.exporter.windows, before and after. PS:
Hello, I collected the Windows "Application" event logs of Alloy and sent them to @wildum via Slack.
Hello, I just signed up for Slack and am willing to test the build. How do we get in touch?
Hey @Nachtfalkeaw, sorry I thought I replied to the comment already but I guess I somehow forgot to press enter. Thanks for the testing, I agree that the evidence that you shared indicates a good CPU performance improvement. But the panic is worrying. I checked the logs that you sent me. They are very helpful but the root cause of the panic is still unclear to me, it happens deep in Windows system calls. I have a few questions to further debug it:
Regarding the profiling not working:
@l-freund you can message me ("William Dumont") on the Slack via a private message and I will send you the build.
@wildum
For the debugging: I will give it another try tomorrow or Wednesday. I will try with a bigger log file, keep it running for a longer time, and try to collect the debug data, maybe once at the beginning and again later after it has been running for a while. PS:
Hello, I installed 1.7.0-devel with the bookmark-file optimization yesterday. On the screenshot, you can see many more logs ingested at around 13:50; this was the point where the new version was installed. At around 15:00 yesterday, Alloy had caught up, and I have had live logs in Loki since. Alloy is running stable with an event log size of 1 GB. I want to see today if it can keep up over a whole working day with all the logon/logoff events. Tomorrow I'm planning to increase the size of my forwarded events log and see if it still runs stable. Update: I just updated our second event log collector, which collects client logs. This one was untouched, so Alloy 1.5.0 was running with a 40 GB forwarded events log file. 1.7.0-devel starts and runs with the exact same setup and seems to keep up.
Another update: the second event log collector ingests logs live by now and is still running stable with a 40 GB event log size. Looks pretty good to me so far.
Thanks for all the feedback. @Nachtfalkeaw I will wait for your feedback to see whether the crash can be reproduced or not before moving forward with the improvement.
This is only the first improvement. The profiles that @l-freund sent me from the new version show that 80% of the CPU is now used to retrieve the process name. That's the next area where I'd like to dig in and see if we can cut it down.
Hello @wildum, here are the steps I already sent you, for documentation:
To be honest, I do not understand why it cannot start anymore, because the data and event logs were all deleted and there should not be any old information anywhere. After some waiting and a reboot, it looks like Alloy ingests logs and metrics; however, it is still crashing now and then. @l-freund check the "Application" log for "Source=Alloy".
Hello, I still cannot say what exactly causes the issues, but I do not think it is the size of the event log itself.
What did not fix the issue after it occurred (Alloy crashing):
Another strange thing I noticed (expected vs. malformed):
- Test #1: working
- Test #2: working
- Test #3: working
- Test #4: failed
I am not sure if this message itself causes the issues, or if the fact that some messages do not seem to have a "Description" causes them. If I remember correctly, in other GitHub issues and comments you, @wildum, mentioned that you changed the parsing of the "Description" field, or plan to. Maybe this could be a hint as to why Alloy crashes?
Hey! Alloy kept up with both our log collectors the whole day yesterday. This is a big improvement! Grafana alerts are finally usable :) One server had a 40 GB log before; I stopped Alloy, deleted the log, reduced its size, and started Alloy again. It has been running fine since. I only collect logs with Alloy, no metrics, since we already had another tool set up to do that.
As a hint for forwarded events collection: we have 8 or more Active Directory servers forwarding logs to the Windows ForwardedEvents log on the server where Alloy is running.
If I am not wrong, Loki should accept these logs even if they are out of order, as long as the current chunk is still open (a 1-2h slot). But what about the Alloy bookmark file: will it recognize the older logs? Another question: or is this scenario covered by the bookmarks file?
@wildum and @l-freund, I was not able to fix this except by reverting the VM to a snapshot.
@Nachtfalkeaw did the crashes occur with an older version of Alloy, too? At the beginning, I forwarded nearly all logs (Application, System, Security) from all servers to the event log collectors. I started to narrow down the forwarded events because Alloy was not able to keep up with the event log. The only crashes I experienced occurred with errors in Alloy's configuration. I have set up several subscriptions with event ID filters for account, security, etc. on the collectors (source initiated), and a group policy to configure servers and clients to push their logs to the collectors.
Hello, an update: changing the subscription does not seem to be the root cause. I changed the subscription back with event filters, waited 20 minutes, cleared the Windows forwarded events log file, and then restarted Alloy. Alloy started now and can ingest metrics and logs. After further tests in the meantime, it is pretty certain this is related to the messages being forwarded. So the problem is with some of the logs, probably a formatting error. "Task category" in my logs now shows something like (134712) and not a string; it has brackets and an integer.
Nice, if you can pinpoint the exact events that are causing the crash, that would be a huge step forward. As @l-freund hinted, I would be curious to know if the official version of Alloy is also crashing with the problematic events. I have been going through my changes many times, but I can't find what could cause such a thing.
Both versions 1.5.1 and 1.7 are affected.
Ok, that's good to know; at least we can treat the improvement and this bug separately.
I am not too familiar with these subscriptions and Windows Server. At least I would agree:
So you may proceed with your improvements. I sent you some pprof files yesterday, and you can probably identify what is using most of the CPU time. I will try to find the events which may cause the issues. However, this is not that easy because I have to stop the subscription, change it, restart it, and then wait until I receive new logs. Maybe someone knows how to exclude logs in the filter; then I could add the event IDs I know work and rule out the logs which do not. -- edit --
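On excluding logs in the filter: Windows event subscription queries support a `Suppress` clause alongside `Select`, which lets you keep a broad subscription and carve out only the suspect IDs. A sketch (the channel and event IDs here are just examples, adjust to your subscription):

```xml
<QueryList>
  <Query Id="0" Path="ForwardedEvents">
    <!-- Keep everything from the channel... -->
    <Select Path="ForwardedEvents">*</Select>
    <!-- ...except the event IDs under suspicion (illustrative values) -->
    <Suppress Path="ForwardedEvents">*[System[(EventID=4634 or EventID=4624)]]</Suppress>
  </Query>
</QueryList>
```

This way the suspect IDs can be toggled in one place instead of rebuilding the whole subscription for each test.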
This reminds me of the problems I had with the Loki agent at the beginning: it ingested a few logs and then crashed. It did that only with forwarded events, not with the local System, Security, or other logs. I tried out Alloy shortly after its release, and it solved this issue for me.
The issue is related at least to Windows event ID 4634. In parallel, I started collecting all logs except 4634, but it crashed again. I assume multiple logs have the same parsing error.
In our setup we had nearly 50,000 4634 events logged in the last hour alone without any crashes. Logon/logoff events are generated on the client and on the domain controller; we only collect those from the DCs, except for local accounts on the clients. From the start we collected both client- and server-generated events, but reduced it to DC-only due to Alloy not keeping up with the logs. A $ at the end of a computer object is also not a problem here. I already gave you my config but want to link it here so @wildum can have a look.
Hello, 4634 in my local Security log works; however, 4634 events from forwarded events do not. Maybe it is due to an additional processing step that formats the logs. I will try to check this.
I created a new ticket to further improve the performance of this component: #2615. I will try to work on it soon and contact you if I find some ways to improve it.
Hello, I can now ingest 4634 logs from forwarded_events. The problem seems to be in my loki.process pipeline. --- edit ---
Reference to new issue:
I updated my last comment. It is my processing stage, no matter whether the source is the local Security channel or ForwardedEvents, with event ID 4634.
Should we move the conversation there and close this issue?
@wildum
What's wrong?
We are running Windows Server 2022, and on this server, Alloy 1.5.1.
This Windows server receives Windows event logs via event log forwarding. These logs are stored in the "ForwardedEvents" event log.
We receive between 4,000 and 8,000 events per minute. Grafana Alloy 1.5.1 with loki.source.windowsevent is not able to collect these logs in time. If the amount of logs is lower, I can see the logs in Loki only a few seconds later. If the number of logs per minute increases, Alloy falls behind; it could be 5 minutes, or it could be 20 minutes.
I checked the Windows Server resources, but they are all fine. I checked the Alloy process CPU usage via prometheus.exporter.windows, and the CPU usage is always around 75%, so it is not saturating one CPU completely and it is not using the other CPUs which would be available.
I have another tool on this Windows server which reads the same ForwardedEvents logs; it uses around 20% CPU and has all the logs within seconds.
So there is maybe something in the loki.source.windowsevent code which could be optimized; maybe newer versions of Windows allow other/faster APIs than what is used in Alloy.
I also tried to reduce the "ForwardedEvents" file size from 8 GB to 1 GB, 512 MB, and 256 MB. It looks like the bigger the file, the harder it is for Alloy to keep up.
In addition, if I scrape these ForwardedEvents, the Alloy /-/reload endpoint stops working: the browser cannot reload the config, and I have to do a restart via Windows Services.
Steps to reproduce
System information
Windows Server 2022
Software version
Grafana Alloy 1.5.1
Configuration
Logs