Initialize filtering functionality #10985

kvch · 2019-02-28T11:13:38Z

This PR initializes support for the similar filters provided by the community Journalbeat. It does not yet have the complete functionality as wildcards for unit names are missing.
But I am opening this PR to discuss the interface of include_matches.

Previously include_matches expected a list of filter expressions. Now more advanced filtering is added, but it required a different way of configuration.
Also, the way matches are applied to a systemd journal is a bit counterintuitive. If matcher expressions are applied to the same field, they are connected with OR. In case of differen field names, the expressions are connected with AND. Thus, it does not let you define full locigal expressions as our processor filtering does.

The following configuration matches for entries where process.name is either chromium or dockerd:

include_matches.equals:
  - _COMM=chromium
  - _COMM=dockerd

If we would like to get entries of chromium, dockerd and audit messages the following configuration is required.

include_matches.or:
  - [
     _COMM=chromium,
     _COMM=dockerd
  ]
  - [
    _TRANSPORT=audit
  ]

What I would like to discuss if the configuration above is ok. Another alternative which came to my mind is this:

include_matches.or:
  - conditions:
    - _COMM=chromium
    - _COMM=dockers
  - conditions:
    - _TRANSPORT=audit

But in my opinion the one I have implemented is simpler. WDYT?

Note that most of the filtering use cases are covered by unit, kernel and identifiers.

TODO

add changelog entry

Depends on #10982

houndci-bot · 2019-02-28T11:13:47Z

journalbeat/reader/matcher.go

+	"github.com/coreos/go-systemd/sdjournal"
+)
+
+type MatcherConfig struct {


exported type MatcherConfig should have comment or be unexported

urso · 2019-04-15T22:34:34Z

Is there a notable performance benefit in building journald filters instead of matching and filtering out the events/logs ourselves?

Anyways, OR-combining same fields makes sense. One can always apply some post filtering via processors. One could also put it in config like this:

matchers:
- _COMM: [chromium, docker]
- _TRANSPORT: [audit]

I think this syntax makes somewhat clear thatchromium, docker are combined into an OR, while _TRANSPORT and _COMM are combined into and AND. We just do not allow a field being referenced multiple times (or make this optional).

I'm not sure I like the extra settings like kernel,unit,identifier. These can be expressed using the matchers as well (are even compiled into matchers), yet I wonder if its fully clear how these settings affect the actual configured matchers.

urso · 2019-04-15T22:35:42Z

Can we split the backoff fix into another PR? So the fix doesn't get lost.

kvch · 2019-04-17T14:36:41Z

Separate PR for backoff: #11861

opsnull · 2019-07-03T08:55:47Z

any update?

andrewkroh · 2020-05-11T20:57:51Z

Is there a notable performance benefit in building journald filters instead of matching and filtering out the events/logs ourselves?

There's a big difference from what I observed so I think the feature is necessary. I was working on deploying Journalbeat to collected a few select sources (two systemd units and kernel logs). The initial deployment was taking a very long time and consuming plenty of CPU since it was going through the backlog of all logs in the entire journal.

Another option I considered was to create one input for each source so that I could use include_matches. But that's not possible because each of my inputs uses the default path: LOCAL_SYSTEM_JOURNAL so if I restart it loses its position. The registry only stores one cursor instead of three unique cursors. Perhaps we can add a unique name/id for each input so that we can store the cursors independently. @urso Will the new Filebeat registry allow this?

On Filtering

It looks like we can construct filters with AND's and OR's with a limitation on the number of levels. From: https://www.freedesktop.org/software/systemd/man/sd_journal_add_conjunction.html

It is hence possible to build an expression of AND terms, consisting of OR terms, consisting of AND terms, consisting of OR terms of matches (the latter OR expression is implicitly created for matches with the same field name, see above).

There is an example in the test cases at https://github.com/systemd/systemd/blob/e7d5fe17db9d046b364f933c4dcdd145f47b024a/src/journal/test-journal-match.c#L59. I think that example equates to something like this in our typical YAML conditional format:

- and:
    - or:
        - and:
            - or:
                - QUUX: mmmm
                - QUUX: xxxxx
                - QUUX: yyyyy
            - or:
                - HALLO: ""
                - HALLO: WALDO
            - PIFF: paff

        - and:
            - or:
                - ONE: one
                - ONE: two
            - TWO: two

    - or:
        - and:
            - or:
                - L4_1: "yes"
                - L4_1: ok
            - or:
                - L4_2: "yes"
                - L4_2: ok

        - and:
            - or:
                - L3: "yes"
                - L3: ok

But I'm not immediately sure how to convert that back into a series of sd_journal_add_match, sd_journal_add_disjunction, sd_journal_add_conjunction calls.

urso · 2020-05-12T14:12:09Z

Is there a notable performance benefit in building journald filters instead of matching and filtering out the events/logs ourselves?

There's a big difference from what I observed so I think the feature is necessary. I was working on deploying Journalbeat to collected a few select sources (two systemd units and kernel logs). The initial deployment was taking a very long time and consuming plenty of CPU since it was going through the backlog of all logs in the entire journal.

Did you use include_matches, or did filter via Beats conditionals? Filtering via Beats conditionals after the event is created is indeed quite 'expensive'. I was wondering if we can filter on the journald entry directly before we create the MapStr. If this would be "cheap" (no/minimal amount of allocations) this might gives us much more flexibility.

It looks like we can construct filters with AND's and OR's with a limitation on the number of levels.
...

Yep, this was my concern/question. The API is not easy to use. And it looks like the "language" only covers a subset of the logical expressions. The man page even states:

The combination of sd_journal_add_match(), sd_journal_add_disjunction() and sd_journal_add_conjunction() may be used to build complex search terms, even though full logical expressions are not available

Also the kind of filters one can apply are limited. E.g. one can not use regexes.

Another option I considered was to create one input for each source so that I could use include_matches. But that's not possible because each of my inputs uses the default path: LOCAL_SYSTEM_JOURNAL so if I restart it loses its position. The registry only stores one cursor instead of three unique cursors. Perhaps we can add a unique name/id for each input so that we can store the cursors independently. @urso Will the new Filebeat registry allow this?

Yes, the POC already supports this for journald and windows event logs by configuring id. Each input will track it's state under the key <input_type>::<source name>::<id> if id has been configured. E.g. configuring journald input for service X with id: X will use the registry key journald::LOCAL_SYSTEM_JOURNAL::X.

andrewkroh · 2020-05-12T18:33:55Z

Did you use include_matches, or did filter via Beats conditionals?

I started with a Beats drop_event processor with a conditional. This wasn't acceptable from a performance perspective (high latency and lots of CPU usage) It probably would be more performant if we could move our filtering earlier, but using the built-in journal match is the optimal way IMO. After rolling Journalbeat out to several machines for several services at different times I think the following features combo would be ideal:

Allow multiple independent inputs to operate on the same journal. (By adding an id option to the input in journalbeat.)
Enhance the "include_match" configuration option to allow the first level of or statements. This almost becomes unnecessary as soon as you allow for multiple inputs on the same journal since each input can have its include_match, but there are probably some cases where it is useful to be able use an or (e.g. CONTAINER_TAG=mysql or _COMM=mysql) to process logs through a single pipeline. (low priority)
Add support for a reloadable 'inputs.d' directory. (Nice to have, but we can skip this since Filebeat has the feature.)

Is OK if I open an issues for (1) and (2) to further describe the feature and use cases? (edit: I opened a PR for (1) because I personally could use this now.)

mergify · 2021-04-07T16:32:49Z

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b feature-journalbeat-advanced-filtering upstream/feature-journalbeat-advanced-filtering
git merge upstream/master
git push upstream feature-journalbeat-advanced-filtering

ruflin · 2021-08-17T12:58:02Z

@kvch What should we do with this PR? It was stale for quite some time.

kvch · 2021-08-17T16:18:46Z

This PR should be adapted in the journalD input. This filtering mode is a must-have, mostly mimics the community Journalbeat and also adds more options for users to define their own journal filters. Systemd can do some prefiltering for us, so Filebeat/Journalbeat has to process less events.

I suggest to add this to the Filebeat input before we enable it. The main problem with this PR was that it was a breaking change in Journalbeat.

VannTen · 2021-08-25T16:22:25Z

It looks like we can construct filters with AND's and OR's with a limitation on the number of levels. From: https://www.freedesktop.org/software/systemd/man/sd_journal_add_conjunction.html

Are you sure the number of level is limited ? I don't get that from the man page

But I'm not immediately sure how to convert that back into a series of sd_journal_add_match, sd_journal_add_disjunction, sd_journal_add_conjunction calls.

I think the translation would be to intercalate a sd_journal_add_disjonction between each member of and or list, and add_conjunction for an or list :

- and: # Process each item of the list recursively, call sd_journal_add_conjuntion between each 
    - or: # Process each item of the list recursively, call sd_add_conjuntion between each
        - and:
            - or:
                - QUUX: mmmm # sd_journal_add_match, (same for each leaf match)
                     # sd_journal_add_disjunction
                - QUUX: xxxxx  
                    # sd_journal_add_disjunction
                - QUUX: yyyyy
             # sd_journal_add_conjunction
            - or:
                - HALLO: ""
                 # sd_journal_add_disjunction
                - HALLO: WALDO
            # sd_journal_add_conjunction
            - PIFF: paff
     # sd_journal_add_disjunction
        - and:
            - or:
                - ONE: one 
                 # sd_journal_add_disjunction
                - ONE: two
           # sd_journal_add_conjunction
            - TWO: two
 ...

And I think matches in a leaf or don't need a disjunction if they have the same key (the latter OR expression is implicitly created for matches with the same field name).

That syntax could remain backward compatible, I think. If we considers an implicit and at the top level (which is the current behaviour, following the sd_journal_add_match api) :

include_matches:
 - unit=a # sd_journal_add_match
 - unit=b # sd_journal_add_match
 - unit=c # sd_journal_add_match
 - severity=1 # sd_journal_add_match
 - severity=2 # sd_journal_add_match
 - or : # sd_journal_add_conjunction
   - transport=stuff # sd_journal_add_match
     # sd_journal_add_disjunction
   - something=the_thing # sd_journal_add_match

The problem is when there is match with the same key inside an AND. I guess for those there should be some pre-processing to "merge" matches on the same field.

VannTen · 2021-08-25T16:24:18Z

journalbeat/reader/config.go

+	Matches MatcherConfig
+	// Units stores the units to monitor
+	Units []string
+	// Kernel stores whether kernel messages should be monitored
+	Kernel bool
+	// Identifiers stores the syslog identifiers to watch
+	Identifiers []string


Should the mentioned syntax be used, the go code will not need to know about units, kernels, or identifiers.

mergify · 2021-09-22T13:06:29Z

This pull request does not have a backport label. Could you fix it @kvch? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-v./d./d./d is the label to automatically backport to the 7./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

kvch · 2021-12-06T15:42:33Z

Closing in favour of #29294

kvch added discuss Issue needs further discussion. review Journalbeat breaking change labels Feb 28, 2019

kvch requested a review from a team as a code owner February 28, 2019 11:13

houndci-bot reviewed Feb 28, 2019

View reviewed changes

kvch force-pushed the feature-journalbeat-advanced-filtering branch 2 times, most recently from c8df57b to 2977a57 Compare March 5, 2019 13:20

kvch mentioned this pull request Mar 14, 2019

Journalbeat #8323

Closed

23 tasks

kvch changed the title ~~Initialize the filtering functionality && backoff fix~~ Initialize the filtering functionality Apr 17, 2019

kvch changed the title ~~Initialize the filtering functionality~~ Initialize filtering functionality Apr 17, 2019

initialize advanced filtering

6a65514

kvch force-pushed the feature-journalbeat-advanced-filtering branch from 2977a57 to 6a65514 Compare April 17, 2019 16:33

urso added the Team:Beats label Nov 14, 2019

urso added the [zube]: In Review label Feb 4, 2020

andresrc added Team:Integrations Label for the Integrations team and removed Team:Beats labels Mar 6, 2020

This was referenced May 12, 2020

[Journalbeat] Add ID config option input #18467

Merged

Cherry-pick #18467 to 7.x: [Journalbeat] Add ID config option input #18590

Merged

ruflin closed this Aug 17, 2021

zube bot removed the [zube]: In Review label Aug 17, 2021

zube bot added the [zube]: Done label Aug 17, 2021

kvch reopened this Aug 17, 2021

zube bot added [zube]: In Progress and removed [zube]: Done labels Aug 17, 2021

VannTen reviewed Aug 25, 2021

View reviewed changes

mergify bot added the backport-skip Skip notification from the automated backport with mergify label Sep 22, 2021

jlind23 added the 8.1-candidate label Nov 24, 2021

jlind23 added v8.1.0 and removed 8.1-candidate labels Dec 3, 2021

kvch mentioned this pull request Dec 6, 2021

Add more filtering options to journald input #29294

Merged

6 tasks

kvch closed this Dec 6, 2021

zube bot added [zube]: Done and removed [zube]: In Progress labels Dec 6, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Initialize filtering functionality #10985

Initialize filtering functionality #10985

kvch commented Feb 28, 2019 •

edited by urso

Loading

houndci-bot Feb 28, 2019

urso commented Apr 15, 2019

urso commented Apr 15, 2019

kvch commented Apr 17, 2019 •

edited

Loading

opsnull commented Jul 3, 2019

andrewkroh commented May 11, 2020

urso commented May 12, 2020

andrewkroh commented May 12, 2020 •

edited

Loading

mergify bot commented Apr 7, 2021

ruflin commented Aug 17, 2021

kvch commented Aug 17, 2021 •

edited

Loading

VannTen commented Aug 25, 2021

VannTen Aug 25, 2021

mergify bot commented Sep 22, 2021

kvch commented Dec 6, 2021

Initialize filtering functionality #10985

Initialize filtering functionality #10985

Conversation

kvch commented Feb 28, 2019 • edited by urso Loading

houndci-bot Feb 28, 2019

Choose a reason for hiding this comment

urso commented Apr 15, 2019

urso commented Apr 15, 2019

kvch commented Apr 17, 2019 • edited Loading

opsnull commented Jul 3, 2019

andrewkroh commented May 11, 2020

On Filtering

urso commented May 12, 2020

andrewkroh commented May 12, 2020 • edited Loading

mergify bot commented Apr 7, 2021

ruflin commented Aug 17, 2021

kvch commented Aug 17, 2021 • edited Loading

VannTen commented Aug 25, 2021

VannTen Aug 25, 2021

Choose a reason for hiding this comment

mergify bot commented Sep 22, 2021

kvch commented Dec 6, 2021

kvch commented Feb 28, 2019 •

edited by urso

Loading

kvch commented Apr 17, 2019 •

edited

Loading

andrewkroh commented May 12, 2020 •

edited

Loading

kvch commented Aug 17, 2021 •

edited

Loading