Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

loki.process of logs from loki.source.windowsevent with event ID 4364 lead to crash #2616

Open
Nachtfalkeaw opened this issue Feb 5, 2025 · 7 comments
Labels
bug Something isn't working

Comments

@Nachtfalkeaw
Copy link

Nachtfalkeaw commented Feb 5, 2025

What's wrong?

Hello,

using alloy 1.5.1 and 1.7.0 preview
both versions lead to crashes in my environment onwindows server 2022.

I receive forwrdedEvents on this server and collect these with alloys loki.source.windowsevent.
I collect the local logs, too.

There ist at least one processing problem with events of ID 4364.
These events leads to a crash of alloy - no matter if in local "Security" or in "ForwardedEvenets".

Here is my loki.process

//
//=======================================================================================
//


            // Wir parsen das Log im json Format um alle Felder zu bekommen.
            // Feldname im json wir als Label oder structured_metadata übernommen. Ausnahme "execution" und "security", dort übernehmen wir alles. Siehe nächste stage.json.
loki.process "windows_eventlog" {

  stage.json {
      expressions = {
        source            = "",                              // wenn rechts vom "=" leer ist, dann sucht er nach dem gleichnamigen Feld was links steht
        channel           = "",                              // ansonsten sucht er nach dem Feld was rechts vom "=" steht und schreibt es nach links.
        computer          = "",                              // ändert sich also keine Name, kann man es leer lassen und erkennt, dass es unverändert ist
        event_id          = "",
        levelText         = "",
        level             = "",
        opCodeText        = "",
        keywords          = "",
        timeCreated       = "",
        eventRecordID     = "",
        event_data        = "",
        user_data         = "",
        message           = "",
        task              = "",
        taskText          = "",
        version           = "",
        opCode            = "",
        execution         = "",
        processId         = "execution.\"processId\"",         // JMESHpath json subquerry von execution. verschachtelter json wird extrahiert
        threadId          = "execution.\"threadId\"",          // JMESHpath json subquerry von execution. verschachtelter json wird extrahiert
        processName       = "execution.\"processName\"",       // JMESHpath json subquerry von execution. verschachtelter json wird extrahiert
        security          = "",
        userId            = "security.\"userId\"",             // JMESHpath json subquerry von security. verschachtelter json wird extrahiert
        userName          = "security.\"userName\"",           // JMESHpath json subquerry von security. verschachtelter json wird extrahiert

      }
  }



            // Der Inhalt des labels "level wird angepasst, sollte es nur ein Zahlenwert sein oder der Platzhalter "tmp_levelt"
            // [MS Level Code, MS Level Text, Loki Mapping]
            // [0, LogAlways, debug], [1, Critical, critical], [2, Error, error], [3, Warning, warn], [4, Informational, info], [5, Verbose, trace]
            // {{- und -}} entfernt führend und nachfolgende Leerzeichen und "nil" bedeutet "ohne Inhalt", also "leer" also  ""
            // "ToLower" macht es in Kleinbuchstaben, "TrimSpace" entfernt ebenfalls Leerzeichen
  stage.template {
      source   = "level"
      template = `{{- $level := .Value -}}
                  {{- if eq $level "0" -}}debug
                  {{- else if eq $level "1" -}}critical
                  {{- else if eq $level "2" -}}error
                  {{- else if eq $level "3" -}}warn
                  {{- else if eq $level "4" -}}info
                  {{- else if eq $level "5" -}}trace
                  {{- else if eq $level "tmp_level" -}}{{- .levelText -}}
                  {{- else if eq .levelText "Information" -}}info
                  {{- else if eq .levelText "Informationen" -}}info
                  {{- else if eq .levelText "Warning" -}}warn
                  {{- else if eq .levelText "Warnung" -}}warn
                  {{- else if eq .levelText "Fehler" -}}error
                  {{- else if eq .levelText "Kritisch" -}}critical
                  {{- else if eq .levelText nil -}}unknown
                  {{- else -}}{{- .levelText -}}{{- end -}}`
  }

            // die Werte die wir als label wollen schreiben wir hier.
  stage.labels {
    values = {
      level       = "",
      channel     = "",
    }
  }



            // Alles was kein label ist wird structured_metadata
  stage.structured_metadata {
    values = {
        source            = "",                 //
        //channel         = "",                 // wollen wir als label, geht besser filtern
        computer          = "",                 //
        event_id          = "",                 //
        //level           = "",                 // wollen wir als label, geht besser filtern
        levelText         = "",                 //
        opCodeText        = "",                 //
        keywords          = "",                 //
        timeCreated       = "",                 //
        eventRecordID     = "",                 //
        event_data        = "",                 // zusätzliche xml formatierten Informationen
        user_data         = "",                 // zusätzliche xml formatierten Informationen
        // message        = "",                 // Haben wir in der Logzeile, brauchen wir nicht als metadata nochmal
        task              = "",                 //
        taskText          = "",                 //
        // execution      = "",                 // Nicht nötig, da Inhalt extrahiert nach "processId" + "threadId" + "processName"
        // security       = "",                 // Nicht nötig, da Inhalt extrahiert nach "userId" + "userName"
        processId         = "",                 //
        threadId          = "",                 //
        processName       = "",                 //
        userId            = "",                 //
        userName          = "",                 //
        version           = "",                 // 
        opCode            = "",                 // 
    }
  }



            // Wir droppen Alloy logs aus Windows event, weil sie sich schlecht parsen lassen und wir zudem eine eigene logging .alloy file haben und diese anders parsen.
  stage.drop {
      source = "source"                               // extrahierte Daten aus dem Windows Event Log mit dem Feld "source" und dem Wert "Alloy"
      value  = "Alloy"
      drop_counter_reason = "windows_eventlog_alloy"  // Metrik: loki_process_dropped_lines_total   eine drop Reason für Labels, damit man sieht, wieviele Logs deswegen gelöscht werden.
  }


            // Loki bietet einen vordefinierten parser für den "messages" Bereich von Windows Eventlogs. Wir wählen daszuvor per json extrahierte "message" Feld als "source" aus
            // Und lassen es umwandeln. Sollten im Message Feld Informationen stehen, die es zuvor schon gaben, werden diese "overwritten" und nicht ein separates Label erstellt.
  stage.eventlogmessage {
      source = "message"
      overwrite_existing = true
  }


            // Als Logzeile exportieren wir nur den "message" Teil. Alle anderen Infos als structured_metadata in den Log Details.
  stage.output {
      source = "message"
  }


            // Wichtig, ohne timestamp stage wird die "ingested" time genommen, die vom tatsächlichen Log abweichen kann.
            // Feste Anzahl Stellen (7) und "0" wird sowohl vorne als auch hinten angezeigt.    2024-11-30T22:04:42.5148570Z  und   2024-11-30T22:04:00.0038331Z
  stage.timestamp {
      source      = "timeCreated"
      format      = "2006-01-02T15:04:05.0000000Z"
//      fallback_formats = [
//        "2006-01-02T15:04:05.000000000Z",           // Nur Beispiel.
//      ]
//      location    = "Europe/Berlin"                 // ACHTUNG: Nur setzen, wenn der timestamp keine Zeitzone mit liefert. Liefert er Zeitzone Z (UTC) und man setzt Europe/Berlin, gibts ggf Probleme und gar keine Logs mehr.
  }


forward_to = [loki.relabel.hostname.receiver]

}

and as an attachment an anonymized event log:

4634.txt

Steps to reproduce

try to process these typ of messages using my loki.process stage with FORWARDED EVENTS Logs

System information

Windows Server 2022

Software version

alloy 1.5.1 and alloy 1.7.0-preview

Configuration

above

Logs

above
@Nachtfalkeaw
Copy link
Author

based on this investigation:
#2539

@Nachtfalkeaw Nachtfalkeaw changed the title loki.process of logs from loki.source.windowsevent with event ID 4364 loki.process of logs from loki.source.windowsevent with event ID 4364 lead to crash if received by ForwardedEvents Feb 5, 2025
@Nachtfalkeaw Nachtfalkeaw changed the title loki.process of logs from loki.source.windowsevent with event ID 4364 lead to crash if received by ForwardedEvents loki.process of logs from loki.source.windowsevent with event ID 4364 lead to crash Feb 5, 2025
@wildum
Copy link
Contributor

wildum commented Feb 5, 2025

do you confirm that without the loki.process component (with the loki.process.windowsevent directly connected to the loki.relabel component) there is no crash?

@Nachtfalkeaw
Copy link
Author

Nachtfalkeaw commented Feb 5, 2025

Hello @wildum ,

I now removed all metrics collection because always when alloy crashed this caused malformed labels and metrics.
In addition I removed all of my processing steps for logs.

This is what I use now:

To collect windows event log:

//=======================================================================================

loki.source.windowsevent "application"  {
    eventlog_name = "Application"
    use_incoming_timestamp = true

    labels = {
      "service_name"  = "windows_eventlog",
      "channel"       = "Application",
    }

    forward_to = [loki.write.loki.receiver]
}


//
//=======================================================================================
//

loki.source.windowsevent "security"  {
    eventlog_name = "Security"
    use_incoming_timestamp = true

    labels = {
      "service_name"  = "windows_eventlog",
      "channel"       = "Security",
    }

    forward_to = [loki.write.loki.receiver]
}


//
//=======================================================================================
//


loki.source.windowsevent "setup"  {
    eventlog_name = "Setup"
    use_incoming_timestamp = true

    labels = {
      "service_name"  = "windows_eventlog",
      "channel"       = "Setup",
    }

    forward_to = [loki.write.loki.receiver]
}

//
//=======================================================================================
//


loki.source.windowsevent "system"  {
    eventlog_name = "System"
    use_incoming_timestamp = true

    labels = {
      "service_name"  = "windows_eventlog",
      "channel"       = "System",
    }

    forward_to = [loki.write.loki.receiver]
}


//
//=======================================================================================
//

And this here for windows_event forwarded logs:

//=======================================================================================

loki.source.windowsevent "forwardedevents"  {
    eventlog_name = "ForwardedEvents"
    use_incoming_timestamp = true
    locale = 1033
    poll_interval = "250ms"

    labels = {
      "service_name"  = "windows_forwardedevents",
    }

    forward_to = [loki.write.loki.receiver]
}

//
//=======================================================================================
//

The last step is the loki.write.loki

//=======================================================================================

loki.write "loki" {
  endpoint {
    url = "https://loki.sub.domain.com:3100/loki/api/v1/push"
    max_backoff_retries     = 25
    retry_on_http_429       = true
    max_backoff_period      = "10m"

    queue_config {
      capacity           = "50MiB"
      drain_timeout      = "15s"
    }

    tls_config {
      insecure_skip_verify = true
    }

    tenant_id = "tenant_02"

    headers = {
      "Authorization" = "Basic AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=",
    }
  }
}
//=======================================================================================

I STILL have these crashes. We have several computers (AD Servers) which forward their events to my server. it could be that some logs are out of order. Thats why I asked you in another post how alloy handles this. Loki itself should ingest it if out of order and ingest old logs is enabled.

Image

As a test I restored my default config, with all metrics, logs and relabel BUT I removed all other servers from my subscription except one. If there is only one the logs should arrive in the correct order and this is to test if the unordered logs could have caused these problems.

@l-freund
Do you have one or more sources in your event forwarded logs as source? You mentioned you have "multiple" servers and multilple alloys but do you receive logs from multiple servers with multiple timestamps on the same server where alloy ingests logs?

---- edit 2025-02-05 22:56 -----

  • my default config with all relabeling, alle events enabled and only one AD server (source) which forwards logs and the same problem.
  • I now disabled all metrics and logs, I disabled local windows event logs and only kept forwarded logs and I kept only one AD server as source. Further I narrowed down to 4364 as only event type.

--- edit 2025-02-05 23:47 ---------

  • error still exists with only forwarded event logs and event ID 4364 and only one AD server forwarding logs. noth else in alloy enabled, no metrics,, no local windows logs.

@l-freund
Copy link

l-freund commented Feb 6, 2025

Hello!

My Environment looks as follows:

  • two Windows eventlog collectors, each has 16 source initiated subscriptions. The subscriptions cover different event IDs, we used parts of this repository: https://github.com/palantir/windows-event-forwarding/tree/master/wef-subscriptions
  • These subscriptions are all configured for the same source machines. There is only one exception: Authentication logs are only collected from our two domain controllers because these entries are generated on clients and DCs which is a lot of overhead. We defined one subscription, that covers local logins on the clients, though.
  • We do not use custom eventlog channels as suggested in palantir's repository but collect all logs in the default forwarded events log. Alloy ingests only forwarded events, no local logs.
  • One eventlog collector is published to our Windows servers by group policy. About 100 servers are pushing their logs.
  • The other collector is published to our Windows clients. About 900 clients are pushing their logs.
  • The push interval in the policy is set to 30 seconds

Especially during setup, timestamps where mixed up because the machines didn't apply the policy all at the same time and started one by one to push logs. Also, windows clients are offline sometimes (e.g. notebooks out of office) and push their logs as soon, as they have connectivity to our network again. In my experience, alloy does not get confused by this but ingests the logs when they arrive at the collector. When use_incoming_timestamp is set, the timestamp is taken from the log entry. This leads to loki providing the logs in chronological order, not in the order they came in.

@wildum
Copy link
Contributor

wildum commented Feb 6, 2025

Hi, I found something interesting: influxdata/telegraf#12328

That's a telegraph issue but in Alloy we use code from Promtail which forked the win_eventlog code for Telegraph.

The crash described in the issue looks a lot like what you encountered and this seems to be solved via these changes: influxdata/telegraf#12375

"The root cause is that Windows' EvtFormatMessage syscall is expecting a handle to the publisher (i.e. the machine that sent the event) which is becoming invalid if that publisher is down. As a consequence Windows throws an exception (read Golang panic) instead of returning a simple error." correlates with what I saw in the crash logs that you sent me

The code between Telegraph and Alloy is now quite different but I will try to incorporate the same changes and build a version that you can test. (I don't have the setup to replicate it though)

@Nachtfalkeaw
Copy link
Author

Hello,
running the new Version with this fix in my Environment since 2025-02-07 23:00.

12hrs later No Crash of alloy. I am forwarding all Log Channels and additional Channels to my ForwardedEvents logs. The amount of logs is probably > 20 Millionen per 12h.

Will Monitor some more time and collect pprof profiles from time to time and share.

PS:
There is a new WARN Message of alloy, Something Like.

eventmessage=

could be from my loki.process step.

Will let you know later.

@Nachtfalkeaw
Copy link
Author

Hello,

Version was working for more than 24hrs. No Crash.
Provided pprof and Screenshots via slack.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants