Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initialize filtering functionality #10985

Closed

Conversation

kvch
Copy link
Contributor

@kvch kvch commented Feb 28, 2019

This PR initializes support for the similar filters provided by the community Journalbeat. It does not yet have the complete functionality as wildcards for unit names are missing.
But I am opening this PR to discuss the interface of include_matches.

Previously include_matches expected a list of filter expressions. Now more advanced filtering is added, but it required a different way of configuration.
Also, the way matches are applied to a systemd journal is a bit counterintuitive. If matcher expressions are applied to the same field, they are connected with OR. In case of differen field names, the expressions are connected with AND. Thus, it does not let you define full locigal expressions as our processor filtering does.

The following configuration matches for entries where process.name is either chromium or dockerd:

include_matches.equals:
  - _COMM=chromium
  - _COMM=dockerd

If we would like to get entries of chromium, dockerd and audit messages the following configuration is required.

include_matches.or:
  - [
     _COMM=chromium,
     _COMM=dockerd
  ]
  - [
    _TRANSPORT=audit
  ]

What I would like to discuss if the configuration above is ok. Another alternative which came to my mind is this:

include_matches.or:
  - conditions:
    - _COMM=chromium
    - _COMM=dockers
  - conditions:
    - _TRANSPORT=audit

But in my opinion the one I have implemented is simpler. WDYT?

Note that most of the filtering use cases are covered by unit, kernel and identifiers.

TODO

  • add changelog entry

Depends on #10982

@kvch kvch added discuss Issue needs further discussion. review Journalbeat breaking change labels Feb 28, 2019
@kvch kvch requested a review from a team as a code owner February 28, 2019 11:13
"github.com/coreos/go-systemd/sdjournal"
)

type MatcherConfig struct {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exported type MatcherConfig should have comment or be unexported

@kvch kvch force-pushed the feature-journalbeat-advanced-filtering branch 2 times, most recently from c8df57b to 2977a57 Compare March 5, 2019 13:20
@kvch kvch mentioned this pull request Mar 14, 2019
23 tasks
@urso
Copy link

urso commented Apr 15, 2019

Is there a notable performance benefit in building journald filters instead of matching and filtering out the events/logs ourselves?

Anyways, OR-combining same fields makes sense. One can always apply some post filtering via processors. One could also put it in config like this:

matchers:
- _COMM: [chromium, docker]
- _TRANSPORT: [audit]

I think this syntax makes somewhat clear thatchromium, docker are combined into an OR, while _TRANSPORT and _COMM are combined into and AND. We just do not allow a field being referenced multiple times (or make this optional).

I'm not sure I like the extra settings like kernel,unit,identifier. These can be expressed using the matchers as well (are even compiled into matchers), yet I wonder if its fully clear how these settings affect the actual configured matchers.

@urso
Copy link

urso commented Apr 15, 2019

Can we split the backoff fix into another PR? So the fix doesn't get lost.

@kvch
Copy link
Contributor Author

kvch commented Apr 17, 2019

Separate PR for backoff: #11861

@kvch kvch changed the title Initialize the filtering functionality && backoff fix Initialize the filtering functionality Apr 17, 2019
@kvch kvch changed the title Initialize the filtering functionality Initialize filtering functionality Apr 17, 2019
@kvch kvch force-pushed the feature-journalbeat-advanced-filtering branch from 2977a57 to 6a65514 Compare April 17, 2019 16:33
@opsnull
Copy link

opsnull commented Jul 3, 2019

any update?

@andresrc andresrc added Team:Integrations Label for the Integrations team and removed Team:Beats labels Mar 6, 2020
@andrewkroh
Copy link
Member

Is there a notable performance benefit in building journald filters instead of matching and filtering out the events/logs ourselves?

There's a big difference from what I observed so I think the feature is necessary. I was working on deploying Journalbeat to collected a few select sources (two systemd units and kernel logs). The initial deployment was taking a very long time and consuming plenty of CPU since it was going through the backlog of all logs in the entire journal.

Another option I considered was to create one input for each source so that I could use include_matches. But that's not possible because each of my inputs uses the default path: LOCAL_SYSTEM_JOURNAL so if I restart it loses its position. The registry only stores one cursor instead of three unique cursors. Perhaps we can add a unique name/id for each input so that we can store the cursors independently. @urso Will the new Filebeat registry allow this?

On Filtering

It looks like we can construct filters with AND's and OR's with a limitation on the number of levels. From: https://www.freedesktop.org/software/systemd/man/sd_journal_add_conjunction.html

It is hence possible to build an expression of AND terms, consisting of OR terms, consisting of AND terms, consisting of OR terms of matches (the latter OR expression is implicitly created for matches with the same field name, see above).

There is an example in the test cases at https://github.com/systemd/systemd/blob/e7d5fe17db9d046b364f933c4dcdd145f47b024a/src/journal/test-journal-match.c#L59. I think that example equates to something like this in our typical YAML conditional format:

- and:
    - or:
        - and:
            - or:
                - QUUX: mmmm
                - QUUX: xxxxx
                - QUUX: yyyyy
            - or:
                - HALLO: ""
                - HALLO: WALDO
            - PIFF: paff

        - and:
            - or:
                - ONE: one
                - ONE: two
            - TWO: two

    - or:
        - and:
            - or:
                - L4_1: "yes"
                - L4_1: ok
            - or:
                - L4_2: "yes"
                - L4_2: ok

        - and:
            - or:
                - L3: "yes"
                - L3: ok

But I'm not immediately sure how to convert that back into a series of sd_journal_add_match, sd_journal_add_disjunction, sd_journal_add_conjunction calls.

@urso
Copy link

urso commented May 12, 2020

Is there a notable performance benefit in building journald filters instead of matching and filtering out the events/logs ourselves?

There's a big difference from what I observed so I think the feature is necessary. I was working on deploying Journalbeat to collected a few select sources (two systemd units and kernel logs). The initial deployment was taking a very long time and consuming plenty of CPU since it was going through the backlog of all logs in the entire journal.

Did you use include_matches, or did filter via Beats conditionals? Filtering via Beats conditionals after the event is created is indeed quite 'expensive'. I was wondering if we can filter on the journald entry directly before we create the MapStr. If this would be "cheap" (no/minimal amount of allocations) this might gives us much more flexibility.

It looks like we can construct filters with AND's and OR's with a limitation on the number of levels.
...

Yep, this was my concern/question. The API is not easy to use. And it looks like the "language" only covers a subset of the logical expressions. The man page even states:

The combination of sd_journal_add_match(), sd_journal_add_disjunction() and sd_journal_add_conjunction() may be used to build complex search terms, even though full logical expressions are not available

Also the kind of filters one can apply are limited. E.g. one can not use regexes.

Another option I considered was to create one input for each source so that I could use include_matches. But that's not possible because each of my inputs uses the default path: LOCAL_SYSTEM_JOURNAL so if I restart it loses its position. The registry only stores one cursor instead of three unique cursors. Perhaps we can add a unique name/id for each input so that we can store the cursors independently. @urso Will the new Filebeat registry allow this?

Yes, the POC already supports this for journald and windows event logs by configuring id. Each input will track it's state under the key <input_type>::<source name>::<id> if id has been configured. E.g. configuring journald input for service X with id: X will use the registry key journald::LOCAL_SYSTEM_JOURNAL::X.

@andrewkroh
Copy link
Member

andrewkroh commented May 12, 2020

Did you use include_matches, or did filter via Beats conditionals?

I started with a Beats drop_event processor with a conditional. This wasn't acceptable from a performance perspective (high latency and lots of CPU usage) It probably would be more performant if we could move our filtering earlier, but using the built-in journal match is the optimal way IMO. After rolling Journalbeat out to several machines for several services at different times I think the following features combo would be ideal:

  1. Allow multiple independent inputs to operate on the same journal. (By adding an id option to the input in journalbeat.)
  2. Enhance the "include_match" configuration option to allow the first level of or statements. This almost becomes unnecessary as soon as you allow for multiple inputs on the same journal since each input can have its include_match, but there are probably some cases where it is useful to be able use an or (e.g. CONTAINER_TAG=mysql or _COMM=mysql) to process logs through a single pipeline. (low priority)
  3. Add support for a reloadable 'inputs.d' directory. (Nice to have, but we can skip this since Filebeat has the feature.)

Is OK if I open an issues for (1) and (2) to further describe the feature and use cases? (edit: I opened a PR for (1) because I personally could use this now.)

@mergify
Copy link
Contributor

mergify bot commented Apr 7, 2021

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b feature-journalbeat-advanced-filtering upstream/feature-journalbeat-advanced-filtering
git merge upstream/master
git push upstream feature-journalbeat-advanced-filtering

@ruflin
Copy link
Collaborator

ruflin commented Aug 17, 2021

@kvch What should we do with this PR? It was stale for quite some time.

@ruflin ruflin closed this Aug 17, 2021
@zube zube bot removed the [zube]: In Review label Aug 17, 2021
@zube zube bot added the [zube]: Done label Aug 17, 2021
@kvch
Copy link
Contributor Author

kvch commented Aug 17, 2021

This PR should be adapted in the journalD input. This filtering mode is a must-have, mostly mimics the community Journalbeat and also adds more options for users to define their own journal filters. Systemd can do some prefiltering for us, so Filebeat/Journalbeat has to process less events.

I suggest to add this to the Filebeat input before we enable it. The main problem with this PR was that it was a breaking change in Journalbeat.

@kvch kvch reopened this Aug 17, 2021
@VannTen
Copy link

VannTen commented Aug 25, 2021

It looks like we can construct filters with AND's and OR's with a limitation on the number of levels. From: https://www.freedesktop.org/software/systemd/man/sd_journal_add_conjunction.html

Are you sure the number of level is limited ? I don't get that from the man page

But I'm not immediately sure how to convert that back into a series of sd_journal_add_match, sd_journal_add_disjunction, sd_journal_add_conjunction calls.

I think the translation would be to intercalate a sd_journal_add_disjonction between each member of and or list, and add_conjunction for an or list :

- and: # Process each item of the list recursively, call sd_journal_add_conjuntion between each 
    - or: # Process each item of the list recursively, call sd_add_conjuntion between each
        - and:
            - or:
                - QUUX: mmmm # sd_journal_add_match, (same for each leaf match)
                     # sd_journal_add_disjunction
                - QUUX: xxxxx  
                    # sd_journal_add_disjunction
                - QUUX: yyyyy
             # sd_journal_add_conjunction
            - or:
                - HALLO: ""
                 # sd_journal_add_disjunction
                - HALLO: WALDO
            # sd_journal_add_conjunction
            - PIFF: paff
     # sd_journal_add_disjunction
        - and:
            - or:
                - ONE: one 
                 # sd_journal_add_disjunction
                - ONE: two
           # sd_journal_add_conjunction
            - TWO: two
 ...

And I think matches in a leaf or don't need a disjunction if they have the same key (the latter OR expression is implicitly created for matches with the same field name).

That syntax could remain backward compatible, I think. If we considers an implicit and at the top level (which is the current behaviour, following the sd_journal_add_match api) :

include_matches:
 - unit=a # sd_journal_add_match
 - unit=b # sd_journal_add_match
 - unit=c # sd_journal_add_match
 - severity=1 # sd_journal_add_match
 - severity=2 # sd_journal_add_match
 - or : # sd_journal_add_conjunction
   - transport=stuff # sd_journal_add_match
     # sd_journal_add_disjunction
   - something=the_thing # sd_journal_add_match

The problem is when there is match with the same key inside an AND. I guess for those there should be some pre-processing to "merge" matches on the same field.

Comment on lines +42 to +48
Matches MatcherConfig
// Units stores the units to monitor
Units []string
// Kernel stores whether kernel messages should be monitored
Kernel bool
// Identifiers stores the syslog identifiers to watch
Identifiers []string
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the mentioned syntax be used, the go code will not need to know about units, kernels, or identifiers.

@mergify
Copy link
Contributor

mergify bot commented Sep 22, 2021

This pull request does not have a backport label. Could you fix it @kvch? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v./d./d./d is the label to automatically backport to the 7./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

@mergify mergify bot added the backport-skip Skip notification from the automated backport with mergify label Sep 22, 2021
@kvch
Copy link
Contributor Author

kvch commented Dec 6, 2021

Closing in favour of #29294

@kvch kvch closed this Dec 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-skip Skip notification from the automated backport with mergify breaking change discuss Issue needs further discussion. Journalbeat review Team:Integrations Label for the Integrations team v8.1.0 [zube]: Done
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants