
loki.source.windowsevent - not able to handle more than 5000 events per minute "in time" and starts to lag behind #2539

Closed
Nachtfalkeaw opened this issue Jan 25, 2025 · 39 comments
Labels
bug Something isn't working

Comments

@Nachtfalkeaw

What's wrong?

We are running a Windows Server 2022, and on this server Alloy 1.5.1.
This Windows server receives Windows event logs via eventlog forwarding. These logs are stored in the "ForwardedEvents" event log.

We receive between 4,000 and 8,000 events per minute. Grafana Alloy 1.5.1 loki.source.windowsevent is not able to collect these logs in time. If the amount of logs is lower, I see the logs in Loki only a few seconds later. If the amount of logs per minute increases, Alloy falls behind; it could be 5 minutes and could be 20 minutes.

I checked the Windows server resources and they are all fine. I checked the Alloy process CPU usage via prometheus.exporter.windows and it is always around 75%, so it is not saturating one CPU completely and it is not using the other CPUs which would be available.

I have another tool on this Windows server which reads the same ForwardedEvents log; it uses around 20% CPU and has all the logs within seconds.

So there is maybe something in the loki.source.windowsevent code which should be optimized. Maybe newer versions of Windows allow other/faster APIs than what is used in Alloy.

I also tried to reduce the "ForwardedEvents" file size from 8GB to 1GB, 512MB and 256MB. It looks like the bigger the file, the harder it is for Alloy to keep up.

In addition, while scraping these ForwardedEvents, the Alloy /-/reload endpoint stops working. The browser cannot reload the config; I have to do a restart via Windows Services.

[Screenshots attached]

Steps to reproduce

  • Try to collect ForwardedEvents with more than 5k events per minute (or more).
  • Try to reload the config via the /-/reload endpoint.
  • Try to get logs into Loki in "real time".
  • Check Alloy CPU usage.

System information

Windows Server 2022

Software version

Grafana Alloy 1.5.1

Configuration

Alloy config:

loki.source.windowsevent "forwardedevents"  {
    eventlog_name = "ForwardedEvents"
    use_incoming_timestamp = true
    locale = 1033
    poll_interval = "250ms"

    labels = {
      "service_name"  = "windows_forwardedevents",
    }

    forward_to = [loki.relabel.windows_event_level.receiver]
}

Logs

N/A
@Nachtfalkeaw Nachtfalkeaw added the bug Something isn't working label Jan 25, 2025
@Nachtfalkeaw
Author

Related issues discussing loki.source.windowsevent:

#2310
grafana/loki#12492

@Nachtfalkeaw
Author

@wildum
How do I enable the Go profiling/debug data you need?

@wildum
Contributor

wildum commented Jan 27, 2025

@Nachtfalkeaw
Author

Nachtfalkeaw commented Jan 27, 2025

@wildum
at the moment the system is 3-4 minutes behind with the logs.

Debugs deleted for safety.

@l-freund
maybe you can add your pprof files for further analysis.

@wildum
Contributor

wildum commented Jan 28, 2025

Thanks for sharing, looking at the CPU usage:

  • 63.7% is spent in atomic.WriteFile to save the position in the event log
  • 25.6% is spent in GetFromSnapProcess which is only used to retrieve the process name associated with the event

Looking at the code, I see that the position is saved after every event. This seems excessive to me. A few options would greatly improve it (see the sketch after this list):

  • save the position in a goroutine every x seconds (same approach as loki.source.file)
  • save the position after every batch of events (and possibly every x events if the batches are really big)
    On exit the position can be saved one last time so that it is in sync when restarting. On hard stops, when the position cannot be saved, there might be some duplication, but that should not be a major concern.
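
A minimal sketch of the first option, assuming the bookmark is already serialized to a string by the caller; names like positionSaver are illustrative, not Alloy's actual code:

package bookmark

import (
    "os"
    "sync"
    "time"
)

// positionSaver keeps the latest bookmark in memory and flushes it to disk
// periodically and once on shutdown, instead of after every single event.
type positionSaver struct {
    mu   sync.Mutex
    pos  string // most recent bookmark XML, updated after every event
    path string
    done chan struct{}
}

func newPositionSaver(path string, interval time.Duration) *positionSaver {
    s := &positionSaver{path: path, done: make(chan struct{})}
    go func() {
        ticker := time.NewTicker(interval)
        defer ticker.Stop()
        for {
            select {
            case <-ticker.C:
                s.flush()
            case <-s.done:
                s.flush() // final save on exit so the restart stays in sync
                return
            }
        }
    }()
    return s
}

// Update is cheap: it only stores the position in memory.
func (s *positionSaver) Update(pos string) {
    s.mu.Lock()
    s.pos = pos
    s.mu.Unlock()
}

func (s *positionSaver) flush() {
    s.mu.Lock()
    pos := s.pos
    s.mu.Unlock()
    if pos != "" {
        _ = os.WriteFile(s.path, []byte(pos), 0o600) // real code would write atomically (temp file + rename)
    }
}

// Close stops the background goroutine after one last flush.
func (s *positionSaver) Close() { close(s.done) }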

In GetFromSnapProcess, it iterates over all processes running on the Windows machine until it finds the one that matches the PID. I can think of a few options to improve this:

  • if the process name is not very important info, we could make it optional
  • I am not a Windows expert, but we only need the process name; maybe there is a better way to retrieve it from the PID than iterating over all processes
  • else we could do some caching (sketched below)
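
A hedged sketch of the caching option: memoize PID-to-name lookups so the expensive snapshot walk runs at most once per PID. The lookup callback is an assumed stand-in for whatever wraps GetFromSnapProcess today:

package procname

import "sync"

// Cache memoizes process names by PID.
type Cache struct {
    mu     sync.Mutex
    names  map[uint32]string
    lookup func(pid uint32) (string, error) // assumption: wraps the snapshot walk
}

func New(lookup func(uint32) (string, error)) *Cache {
    return &Cache{names: make(map[uint32]string), lookup: lookup}
}

// Name returns the cached name, falling back to the slow lookup on a miss.
// Caveat: Windows reuses PIDs, so a production cache would need a TTL or
// eviction keyed on process start time.
func (c *Cache) Name(pid uint32) (string, error) {
    c.mu.Lock()
    if n, ok := c.names[pid]; ok {
        c.mu.Unlock()
        return n, nil
    }
    c.mu.Unlock()

    n, err := c.lookup(pid) // expensive: iterates over all processes
    if err != nil {
        return "", err
    }

    c.mu.Lock()
    c.names[pid] = n
    c.mu.Unlock()
    return n, nil
}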

@l-freund

l-freund commented Jan 28, 2025

Are profiles from our environment still required? Because of the note about sensitive information in the documentation, I don't want to upload dumps from our production environment here.

When using a smaller eventlog size, the timestamp of the last ingested log is more or less equal to that of the oldest log in the eventlog. We also seem to have a smaller number of logs ingested in Loki. So probably Alloy is not able to keep up and at some point ingests the oldest logs available while the rest gets dropped due to the file size limit. My workaround is a really big eventlog (40 GB at the moment). During work hours the offset grows to up to 3 hours, but Alloy catches up when people go home and fewer logs are generated.
Since we want to use Loki for alerting, this is not a good solution at the moment, and we are thinking about installing Alloy directly on our DCs, as they generate the majority of logs, and letting only the rest of the servers forward to our eventlog server. At the moment all servers forward to the same eventlog server. Alloy with better performance would be very much appreciated.

@wildum
Contributor

wildum commented Jan 28, 2025

Hey, thanks for the additional data. I don't think we need an additional profile; I believe the one shared by @Nachtfalkeaw is enough to show the bottlenecks. I could start working on the options above. I can test it locally on my personal Windows machine, but I don't have a big setup. If one of you is OK with trying a custom build once I have a fix available, that would be nice. (I could provide a Docker image / Windows installer / Windows exe, or you could just build it from the branch where the fix will be, whatever works for you.)

@l-freund

l-freund commented Jan 28, 2025

Hey! I am OK with testing the fix, but to see if it improves performance I have to run it in our production environment, from which I cannot share dumps with a clear conscience. Our testing environment does not have enough servers and users to generate the required amount of logs. A Windows installer would be best for me.

@Nachtfalkeaw
Author

Hello,
I can confirm that the bookmark.xml is rewritten within milliseconds. If I have the folder open, the file is created and deleted so fast that it is hard to open it. So keeping this information in memory would help.

Maybe as a factor of the scrape time: if the poll interval for logs is 3s, then maybe write the position to disk every poll_interval × 3, clamped to at least every 1s and at most every 10s (sketched below)? Not sure, just an idea.
Not sure if the batch size is a good indicator, because if you have many logs the batch is probably often full.
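
Expressed as code, the suggested heuristic would look roughly like this; the 3× factor and the 1s/10s bounds are taken from the idea above, nothing more:

package bookmark

import "time"

// flushInterval derives the bookmark flush interval from the poll interval,
// clamped to [1s, 10s] as suggested above.
func flushInterval(pollInterval time.Duration) time.Duration {
    iv := 3 * pollInterval
    if iv < time.Second {
        iv = time.Second
    }
    if iv > 10*time.Second {
        iv = 10 * time.Second
    }
    return iv
}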

Another thing I noticed is that our local antivirus had increased CPU load, which seems to correlate with Alloy's load. So maybe writing files costs antivirus CPU.

Process names and process IDs are important for me, I think, so I would not want to miss that info. Making it optional for others who do not need it would be OK.

I could test it. The server is a test environment. An MSI installer would be fine.

However, do I have to use another stage for that or can I keep my config?

@wildum
Contributor

wildum commented Jan 29, 2025

hey, you can keep your config. You should not notice any change apart from a significant CPU usage reduction. I sent you the custom installer on Slack. @l-freund I can also send it to you via Slack if you want.

For now, I only worked on the position file (the part taking more than 60% of the CPU).

Once this improvement is successful, I can dig into the second one (the GetFromSnapProcess)

@Nachtfalkeaw
Author

Hello,

I have good and bad news. First, the bookmark/file-position change drastically reduced the load on our local antivirus. It went down from 80% to 2%. You can see this in the CPU graph (the red line).

Bad news, or at least not the expected behaviour:
Alloy is now using more CPU than before. But on the other hand it looks like it can process 2-3x the amount of logs it did before. So it looks like I was losing logs before.

I have another tool running which counts the "ForwardedEvents" logs, and it counts around 17,000 - 20,000 per minute at this time of the day. The Loki log volume is now much closer to that than before.

Before, with the old Alloy, I had around 5-7k logs per minute, and now 15k-18k per minute. From this perspective it looks like Alloy's higher CPU usage comes from ingesting more logs.

So here is the log volume change after switching the Alloy version:
Image

And here the CPU usage of Alloy and the antivirus, from prometheus.exporter.windows, before and after:
Image

PS:
I can not create the debug profiles. It tells me "unknown profile" when I run it in the browser:
https://grafana.com/docs/alloy/latest/troubleshoot/profile/#memory-consumption
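
For reference, a minimal sketch of pulling a 30-second CPU profile from a running Alloy over HTTP, assuming the default listen address 127.0.0.1:12345 from the linked doc; "unknown profile" is the response Go's pprof handler gives when the /debug/pprof/<name> path does not name a registered profile, so it usually points to a typo in the URL:

package main

import (
    "io"
    "net/http"
    "os"
)

func main() {
    // CPU profile for 30 seconds; other profiles live under /debug/pprof/.
    resp, err := http.Get("http://127.0.0.1:12345/debug/pprof/profile?seconds=30")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    out, err := os.Create("cpu.pprof")
    if err != nil {
        panic(err)
    }
    defer out.Close()

    if _, err := io.Copy(out, resp.Body); err != nil {
        panic(err)
    }
}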

@Nachtfalkeaw
Author

Nachtfalkeaw commented Jan 29, 2025

Hello,
in an additional test I increased the "ForwardedEvents" log size from 512MB to 8GB. I stopped Alloy, waited 5-10 minutes until the ForwardedEvents log reached ~6GB of 8GB, and then tried to start Alloy again. However, it was not able to start; it kept crashing.

I collected Alloy's entries from the Windows "Application" event log and sent them to @wildum via Slack.

@l-freund

l-freund commented Feb 3, 2025

Hello,

I just signed up for Slack and am willing to test the build. How do we get in touch?

@wildum
Contributor

wildum commented Feb 3, 2025

Hey @Nachtfalkeaw, sorry, I thought I had replied to the comment already, but I guess I somehow forgot to press enter. Thanks for the testing; I agree that the evidence you shared indicates a good CPU performance improvement.

But the panic is worrying. I checked the logs that you sent me. They are very helpful, but the root cause of the panic is still unclear to me; it happens deep in Windows system calls.

I have a few questions to further debug it:

  • is it always crashing when you use the big log file?
  • is it always working when you use the small log file?
  • if you run the official Alloy version, does it also panic with the big file?

Regarding the profiling not working:

@wildum
Contributor

wildum commented Feb 3, 2025

@l-freund you can message me ("William Dumont") on Slack via a private message and I will send you the build.

@Nachtfalkeaw
Author

@wildum
no worries.

  • if you run the official Alloy version, does it also panic with the big file?
    No. Both 256MB and 8GB work, no crash.

  • is it always working when you use the small log file?
    It worked for the time I was testing, maybe ~1 hr.

  • is it always crashing when you use the big log file?
    It crashed with the big log file. I think I stopped the Alloy service, increased the file to 8GB and then let Windows collect logs; then I started Alloy, if I remember correctly. It could be that I played with the "poll_interval", but at some point I had the crashes.
    I am not sure what I did. Maybe I deleted the data folder of forwarded_events in Alloy. Maybe it was somehow related to an existing bookmark file which contained a bookmark that was not in the event logs anymore.
    However, I think it is not always crashing with the big logfile, but maybe under some specific circumstances I do not know right now.

For the debugging:
For whatever reason I could not get it to work on this server. I tried at home and it worked. Unfortunately I had already reverted the VM, so I did not test again.

I will give it another try tomorrow or Wednesday. I will try with a bigger log file, keep it running for a longer time and try to collect the debug profiles, maybe once at the beginning and again later after some time running.

PS:
It still looks like it does not scale that well. Maybe @l-freund can share some experience - maybe the log volume screenshot, which should not contain sensitive information.

@l-freund

l-freund commented Feb 4, 2025

Hello,

I installed 1.7.0-devel with the bookmark-file optimization yesterday. On the screenshot, you can see many more logs ingested at around 13:50. This was the point where the new version was installed. At around 15:00 yesterday Alloy had caught up, and I have had live logs in Loki since. Alloy is running stable with an eventlog size of 1 GB. I want to see today if it can keep up for a whole working day with all the logon/logoff events. Tomorrow I'm planning to increase the size of my ForwardedEvents log and see if it still runs stable.

Image

Update: I just updated our second eventlog collector, which collects client logs. This was untouched, so Alloy 1.5.0 was running with a 40 GB ForwardedEvents logfile. 1.7.0-devel starts and runs with the exact same setup and seems to keep up.

@l-freund

l-freund commented Feb 4, 2025

Another update: the second eventlog collector now ingests logs live and is still running stable with a 40 GB eventlog size. Looks pretty good to me so far.

@wildum
Contributor

wildum commented Feb 4, 2025

Thanks for all the feedback. @Nachtfalkeaw I will wait for your feedback on whether the crash can be reproduced or not before moving forward with the improvement.

It still looks like it does not scale that well.

This is only the first improvement. The profiles that @l-freund sent me from the new version show that 80% of the CPU is now used to retrieve the process name. That's the next area where I'd like to dig in and see if we can cut it down.

@Nachtfalkeaw
Author

Nachtfalkeaw commented Feb 4, 2025

Hello @wildum
I sent you the information and the debug profiles via Slack. Unfortunately I still have crash loops. They started after I stopped Alloy a few times, waited a few minutes and then restarted Alloy.

Here are the steps I already sent you, for documentation:

I think I started at around 21:00 and it was running OK. The Windows event log had a maximum size of 8 GB but only ~1GB was filled.
22:05
everything worked.
22:05
I stopped the Alloy service a few minutes later and started it again; it was still working.
22:06
however, before 21:35 I stopped it again and waited some more time. The eventlog file size on disk was around 6GB by now.
Alloy was only crash-looping: starting, crashing, printing errors to the event log.

  • I deleted the "data" folder, or at least the Loki contents.
  • I stopped the service and tried again; still crashing.
  • I removed other collectors like windows_exporter, with the same result.
  • I removed all logging config, kept only metrics; same result.
  • I enabled all .alloy components, cleared the Windows log file, deleted Alloy's data folder, started the service; still crashing.
  • I rebooted the server; still crashing.

To be honest, I do not understand why it can not start anymore, because data, eventlogs, everything was deleted and there should not be any old information anywhere.

After some time waiting, and after the reboot, it looks like Alloy ingests logs and metrics; however, it is still crashing now and then.

@l-freund
can you check what happens if you stop the service in Services, wait 5 minutes and then restart Alloy? If it works for a few minutes, then maybe do the same again: stop the service, wait a few minutes, start it again.

Check the "Application" log for "Source=Alloy".

@Nachtfalkeaw
Author

Nachtfalkeaw commented Feb 5, 2025

Hello,

I still can not say what exactly causes the issues, but I do not think it is the size of the eventlog itself.
From some further tests it looks like any change to these parameters can cause the issue:

  • changing the size of ForwardedEvents, e.g. from 512MB to 1GB
  • changing the Windows subscription or the type of events that are forwarded into the ForwardedEvents log

What could not fix the issue after it occurred (crashing Alloy):

  • reverting the eventlog size to the previous size (clear eventlog, resize from 1GB to 512MB)
  • restarting the Windows Event Collector and Windows Event Forwarder services, restarting Alloy (stop Alloy first, restart the rest, delete the "data" folder, restart Alloy)

Another strange thing I noticed:
in these crash situations my Prometheus metrics from the windows exporter are malformed. E.g. the correct instance label is this:

expected:
instance = hostname

malformed:
instance = stname

Test #1: working
At the moment I ran a very simple test:

  • I just moved the new alloy_170 to the server and ran the silent install option, that's it. No service stop, nothing; just install.
  • This has been working for several minutes now.
  • I will keep it running for some more time.

Test #2: working

  • based on the previous test and environment, I increased the Windows ForwardedEvents log size from 512000 KB to 1024000 KB.
  • No restart of anything, just a change of this parameter in Windows.
  • Logs arrived, metrics arrived, no interruption.

Test #3: working

  • based on Test 2, I just restarted the Windows 2022 server. I did not stop services or anything else.
  • after the reboot of the Windows server, Alloy started correctly and ingested metrics.
  • the Windows event collector/forwarder service probably needs some time to tell the "sources" to send events (or vice versa).
    • server reboot 10:13
    • server up, Alloy up 10:17
    • first new events arrived in ForwardedEvents 10 or 15 minutes later. The sources sent the events which had not been sent while the server rebooted, so no events were lost.
  • Alloy is still ingesting metrics and logs.

Test #4: failed

  • change something in the Windows event subscription to retrieve other logs in ForwardedEvents.
  • changing something in the subscription means that the sources should forward other events. Before, I was forwarding only "Security" and "System" with specific event IDs. Then I removed the ID-based event filter and added "Application", "Setup" and "Forwarded Events".
    • I changed the subscription at 10:58. This stopped any forwarded events from being received.
    • Alloy was working and ingesting metrics (and other logs which are not forwarded events), which is correct, as there were no recent/new ForwardedEvents logs.
    • ForwardedEvents started to list new logs at 11:05.
    • Alloy crashed at 11:09:02.
    • the eventlog message was kerberos_service_ticket_operations, and maybe that's the relevant part:
      "The description of event id xxxx from source yyyy cannot be found. Either the component that raises the event is not installed on your local computer or the installation is corrupted. You can install or repair the component on your computer."

I am not sure if this message itself causes the issues, or if the fact that some messages do not seem to have a "Description" causes issues. If I remember correctly, in other GitHub issues and comments you, @wildum, mentioned that you changed (or want to change) the parsing of the "Description" field. Maybe this could be a hint as to why Alloy crashes?

@l-freund

l-freund commented Feb 5, 2025

Hey!

Alloy kept up with both our log collectors the whole day yesterday. This is a big improvement! Grafana alerts are finally usable :)
I have now changed the eventlog size of both servers to 10 GB (10485760 KB).

One server had 40 GB before. I stopped Alloy, deleted the log, reduced its size and started Alloy again. It has been running fine since.
The other server had 10 GB before. I just increased the log size without a restart of the service.

I only collect logs with Alloy, no metrics, since we already have another tool set up for that.

@Nachtfalkeaw
Author

As a hint for forwarded-events collection:

We have 8 or more Active Directory servers forwarding logs to the Windows ForwardedEvents log where Alloy is running.
After a reboot of the Alloy server it may take some time until all AD servers start to send their logs again. The result may be that:

  • server A starts forwarding logs immediately, from 10:00 - 10:20
  • server B starts forwarding logs with a 5 min delay at 10:05 and forwards its logs with timestamps 10:00 - 10:20

If I am not wrong, Loki should accept these logs even though they are out of order, as long as the current chunk is still open (1-2h slot).

But what about the Alloy bookmark file: will it recognize the older logs?

Another question:
imagine there is a large log file which contains logs for several days. As a Loki/Alloy admin I know that I allow ingestion of logs only for the last 24 hrs. Would it make sense to offer a parameter in the Alloy config which restricts the XML query to only the last x hours (see the sketch below)? If not set, query everything; if set, only ask the Windows API for the logs of the last 24 hrs. Maybe this could reduce load on the API?

Or is this scenario covered by the bookmark file?
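
To illustrate the idea: such a parameter does not exist in Alloy today; this only shows the query shape that the Windows event XPath filter accepts, where timediff() gives the age of an event in milliseconds:

package main

import "fmt"

// lastHoursQuery builds a Windows event XPath filter selecting only events
// from the last N hours (86400000 ms = 24 h). A hypothetical config option
// could pass such a query to the Windows event API instead of "*".
func lastHoursQuery(hours int) string {
    ms := hours * 60 * 60 * 1000
    return fmt.Sprintf("*[System[TimeCreated[timediff(@SystemTime) <= %d]]]", ms)
}

func main() {
    fmt.Println(lastHoursQuery(24))
}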

@Nachtfalkeaw
Author

Nachtfalkeaw commented Feb 5, 2025

@wildum and @l-freund
I found that Test 4 (#2539 (comment)) failed: changing an existing subscription which logs into ForwardedEvents.
In theory it could be that a specific event type which is now forwarded leads to the crash.

I was not able to fix this except by reverting the VM to a snapshot.

@l-freund

l-freund commented Feb 5, 2025

@Nachtfalkeaw did the crashes occur with an older version of Alloy, too?

At the beginning I forwarded nearly all logs (Application, System, Security) from all servers to the eventlog collectors. I started to narrow down the forwarded events because Alloy was not able to keep up with the eventlog. The only crashes I experienced occurred with errors in Alloy's configuration.

I have set up several subscriptions with event ID filters for account, security, etc. on the collectors (source-initiated), and a group policy to configure servers and clients to push their logs to the collectors.

@Nachtfalkeaw
Author

Hello,

update: changing the subscription does not seem to be the root cause. I changed the subscription back, with event filters, waited 20 minutes, and cleared the Windows forwarded-events log file.

And then I restarted Alloy. Alloy started now and can ingest metrics and logs.

Further tests in the meantime:

It is pretty surely related to the messages being forwarded.
I changed the subscription back to send all logs from "Security" and "System" without an event filter.
Then I cleared all forwarded logs and restarted Alloy. Alloy started.
After some minutes the first logs arrived in the forwarded logs and Alloy crashed.
I stopped Alloy and the subscription, collected the ForwardedEvents logs and exported them.
Then I cleared ForwardedEvents, changed the subscription back and restarted Alloy. Alloy started fine.
Some minutes later logs arrived and Alloy could ingest them.

So the problem is with some of the logs, probably a formatting error. "Task category" in my logs now contains something like (134712), not a string: brackets and an integer.

@wildum
Contributor

wildum commented Feb 5, 2025

Nice, if you can pinpoint the exact events that are causing the crash, that would be a huge step forward. As @l-freund hinted, I would be curious to know if the official version of Alloy also crashes with the problematic events. I have been going through my changes many times, but I can't find what could cause such a thing.

@Nachtfalkeaw
Author

Nachtfalkeaw commented Feb 5, 2025

Both versions, 1.5.1 and 1.7, are affected.

@wildum
Contributor

wildum commented Feb 5, 2025

Ok, that's good to know; at least we can treat the improvement and this bug separately.
The next step is to find the event that causes the crash so that I can try to reproduce it on my machine. If you ingest only one event with the task category set to a number, does it crash?

@Nachtfalkeaw
Author

Nachtfalkeaw commented Feb 5, 2025

I am not too familiar with these subscriptions and Windows Server.

At least I would agree:

  1. the speed improvements with the bookmarks seem to work: more logs get ingested, with lower CPU use per log.
  2. it is stable across reboots, subscription changes, eventlog size changes and clearing the eventlog.
  3. sending all events to this machine is a good way to test how many logs Alloy can ingest while keeping up.

So you may proceed with your improvements. I sent you some pprof files yesterday, and you can probably identify what is using most of the CPU time.

I will try to find the events which may cause the issues. However, this is not that easy, because I have to stop the subscription, change it, restart it and then wait until I receive new logs.

Maybe someone knows how to exclude logs in the filter; then I could separate the event IDs I know work from the other logs which do not work.

-- edit --
found it:
add a "-" in front of the event ID in the filter; this excludes it, so I can filter out the events I know work.

@l-freund

l-freund commented Feb 5, 2025

It is pretty surely related to the messages being forwarded. I changed the subscription back to send all logs from "Security" and "System" without an event filter. Then I cleared all forwarded logs and restarted Alloy. Alloy started. After some minutes the first logs arrived in the forwarded logs and Alloy crashed. I stopped Alloy and the subscription, collected the ForwardedEvents logs and exported them. Then I cleared ForwardedEvents, changed the subscription back and restarted Alloy. Alloy started fine. Some minutes later logs arrived and Alloy could ingest them.

So the problem is with some of the logs, probably a formatting error. "Task category" in my logs now contains something like (134712), not a string: brackets and an integer.

This reminds me of the problems I had with the Loki agent at the beginning. It ingested a few logs and then crashed. It did that only with forwarded events, not with the local System, Security or other logs. I tried out Alloy shortly after its release and it solved this issue for me.

@Nachtfalkeaw
Author

The issue is related at least to Windows event ID 4634.
I attached one log example and anonymized it.

In parallel I started collecting all logs except 4634, but it crashed again. I assume multiple event types have the same parsing error.
Hopefully you can find something which just needs to be escaped better, e.g. the $ at the end of a hostname.

4634.txt

@l-freund

l-freund commented Feb 5, 2025

In our setup we had nearly 50,000 4634 events logged in the last hour alone, without any crashes. Logon/logoff events are generated on the client and on the domain controller. We only collect those from the DCs, except for local accounts on the clients. From the start we collected both client- and server-generated events, but reduced it to DC-only due to Alloy not keeping up with the logs.

A $ at the end of a computer object is also not a problem here. I already gave you my config but want to link it here so @wildum can have a look.
This is the whole config; as said before, we use Alloy only to forward logs.
grafana/loki#12492 (comment)

@Nachtfalkeaw
Author

Hello,

4634 in my local Security log works; however, 4634 events from ForwardedEvents do not.
I don't think it is an issue with this event type itself, because I had the crash with other events too (did not find the ID). I think there is something in the message, maybe specific to our environment, that causes the crash.

Maybe it is in a further processing step that formats the logs. I will try to check this.

@wildum
Contributor

wildum commented Feb 5, 2025

I created a new ticket to further improve the performance of this component: #2615. I will try to work on it soon and contact you if I find some ways to improve it

@Nachtfalkeaw
Author

Nachtfalkeaw commented Feb 5, 2025

Hello,

I can now ingest 4634 logs from ForwardedEvents. The problem seems to be in my loki.process pipeline.
ForwardedEvents and the local Windows events are processed by this loki.process stage, and events with the specific EventID 4634 cause regular crashes. I have now removed this stage and I get the logs.

--- edit ---
It is the same for the local Security log channel and forwarded events. However, because I was working on the machine, I probably never noticed these 4634 (Logoff) events locally and only noticed them in ForwardedEvents ...

//
//=======================================================================================
//


            // We parse the log as JSON to get all fields.
            // Each JSON field name is taken over as a label or as structured_metadata. Exception: "execution" and "security", where we take everything. See the stage.json below.
loki.process "windows_eventlog" {

  stage.json {
      expressions = {
        source            = "",                              // if the right-hand side of "=" is empty, it looks up the field with the same name as on the left
        channel           = "",                              // otherwise it looks up the field named on the right of "=" and writes it into the name on the left.
        computer          = "",                              // so if the name does not change, it can be left empty to show it is unchanged
        event_id          = "",
        levelText         = "",
        level             = "",
        opCodeText        = "",
        keywords          = "",
        timeCreated       = "",
        eventRecordID     = "",
        event_data        = "",
        user_data         = "",
        message           = "",
        task              = "",
        taskText          = "",
        version           = "",
        opCode            = "",
        execution         = "",
        processId         = "execution.\"processId\"",         // JMESPath JSON subquery of execution; nested JSON is extracted
        threadId          = "execution.\"threadId\"",          // JMESPath JSON subquery of execution; nested JSON is extracted
        processName       = "execution.\"processName\"",       // JMESPath JSON subquery of execution; nested JSON is extracted
        security          = "",
        userId            = "security.\"userId\"",             // JMESPath JSON subquery of security; nested JSON is extracted
        userName          = "security.\"userName\"",           // JMESPath JSON subquery of security; nested JSON is extracted

      }
  }



            // The content of the "level" label is adjusted in case it is only a numeric value or the placeholder "tmp_level".
            // [MS level code, MS level text, Loki mapping]
            // [0, LogAlways, debug], [1, Critical, critical], [2, Error, error], [3, Warning, warn], [4, Informational, info], [5, Verbose, trace]
            // {{- and -}} strip leading and trailing whitespace; "nil" means "no content", i.e. empty ""
            // "ToLower" converts to lowercase, "TrimSpace" also removes whitespace
  stage.template {
      source   = "level"
      template = `{{- $level := .Value -}}
                  {{- if eq $level "0" -}}debug
                  {{- else if eq $level "1" -}}critical
                  {{- else if eq $level "2" -}}error
                  {{- else if eq $level "3" -}}warn
                  {{- else if eq $level "4" -}}info
                  {{- else if eq $level "5" -}}trace
                  {{- else if eq $level "tmp_level" -}}{{- .levelText -}}
                  {{- else if eq .levelText "Information" -}}info
                  {{- else if eq .levelText "Informationen" -}}info
                  {{- else if eq .levelText "Warning" -}}warn
                  {{- else if eq .levelText "Warnung" -}}warn
                  {{- else if eq .levelText "Fehler" -}}error
                  {{- else if eq .levelText "Kritisch" -}}critical
                  {{- else if eq .levelText nil -}}unknown
                  {{- else -}}{{- .levelText -}}{{- end -}}`
  }

            // The values we want as labels are written here.
  stage.labels {
    values = {
      level       = "",
      channel     = "",
    }
  }



            // Everything that is not a label becomes structured_metadata
  stage.structured_metadata {
    values = {
        source            = "",                 //
        //channel         = "",                 // we want this as a label, easier to filter on
        computer          = "",                 //
        event_id          = "",                 //
        //level           = "",                 // we want this as a label, easier to filter on
        levelText         = "",                 //
        opCodeText        = "",                 //
        keywords          = "",                 //
        timeCreated       = "",                 //
        eventRecordID     = "",                 //
        event_data        = "",                 // additional XML-formatted information
        user_data         = "",                 // additional XML-formatted information
        // message        = "",                 // already in the log line, no need to repeat it as metadata
        task              = "",                 //
        taskText          = "",                 //
        // execution      = "",                 // not needed, content extracted into "processId" + "threadId" + "processName"
        // security       = "",                 // not needed, content extracted into "userId" + "userName"
        processId         = "",                 //
        threadId          = "",                 //
        processName       = "",                 //
        userId            = "",                 //
        userName          = "",                 //
        version           = "",                 //
        opCode            = "",                 //
    }
  }



            // We drop Alloy's own logs from the Windows event log because they are hard to parse, and we also have a dedicated logging .alloy file that parses them differently.
  stage.drop {
      source = "source"                               // extracted data from the Windows event log with the field "source" and the value "Alloy"
      value  = "Alloy"
      drop_counter_reason = "windows_eventlog_alloy"  // metric: loki_process_dropped_lines_total   a drop reason for the labels, so you can see how many logs are dropped because of this
  }


            // Loki offers a predefined parser for the "message" part of Windows event logs. We select the "message" field previously extracted via JSON as the "source"
            // and let it be parsed. If the message field contains information that already existed, it is overwritten rather than creating a separate label.
  stage.eventlogmessage {
      source = "message"
      overwrite_existing = true
  }


            // As the log line we export only the "message" part. All other info ends up as structured_metadata in the log details.
  stage.output {
      source = "message"
  }


            // Important: without a timestamp stage the "ingested" time is used, which can differ from the actual log time.
            // Fixed number of digits (7); "0" is shown both leading and trailing:    2024-11-30T22:04:42.5148570Z  and   2024-11-30T22:04:00.0038331Z
  stage.timestamp {
      source      = "timeCreated"
      format      = "2006-01-02T15:04:05.0000000Z"
//      fallback_formats = [
//        "2006-01-02T15:04:05.000000000Z",           // example only
//      ]
//      location    = "Europe/Berlin"                 // CAUTION: only set this if the timestamp does not carry a time zone. If it carries zone Z (UTC) and you set Europe/Berlin, you may get problems and possibly no logs at all.
  }


forward_to = [loki.relabel.hostname.receiver]

}
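
Side note on the stage.timestamp format above: it is a Go reference-time layout, and it can be sanity-checked against the sample timestamps from the comments:

package main

import (
    "fmt"
    "time"
)

func main() {
    const layout = "2006-01-02T15:04:05.0000000Z" // 7 fixed fractional digits
    for _, s := range []string{"2024-11-30T22:04:42.5148570Z", "2024-11-30T22:04:00.0038331Z"} {
        t, err := time.Parse(layout, s)
        fmt.Println(t.UTC(), err)
    }
}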

Reference to new issue:
#2616

@Nachtfalkeaw
Author

Updated my last comment. It is my processing stage, regardless of whether the source is the local Security channel or ForwardedEvents, with EventID 4634.
Not sure if my stage is wrong or if the messages from the stage are not correctly handled by e.g. stage.template or others.

@wildum
Contributor

wildum commented Feb 5, 2025

Should we move the conversation there and close this issue?

@Nachtfalkeaw
Author

@wildum
we can close this here.

@wildum wildum closed this as completed Feb 5, 2025