File input label regex #376

Merged
merged 12 commits into from
Aug 21, 2021
Member

@jsirianni jsirianni commented Aug 11, 2021

Description of Changes

Context

File input should have the ability to parse a log file's headers and attach them as labels to each entry's $label map. This needs to work under the following situations:

  • Start at end
  • Start at beginning
  • Start at end + starting with an offset other than 0 (halfway through a file)

Goals

This solution should not decrease the performance of the file input operator.

Changes

  • Added LabelRegex, an optional parameter for providing a regex that will be used for parsing headers
    • This regex should contain two capture groups: key and value
    • Example: label_regex: '^#(?P<key>.*?): (?P<value>.*)'
  • Added Labels map[string]string to the Fingerprint type. This map will store the labels derived from the headers
    • When NewFingerprint() is called, the map is initialized with fp.Labels = make(map[string]string)
  • Added ReadHeaders() method, which is called by ReadToEnd before a file's entries are read
    • This method reads from the beginning of a file until the regex stops matching, then returns.
  • Updated ReadToEnd() to attach the Labels on the fingerprint to each entry

Use Cases

The initial use case for this change is to allow a W3C plugin to be built, where we need to detect the field names before parsing the entry. A future PR will enable the CSV parser to take an optional FieldLabel parameter, so the user will not need to define the fields in their stanza config. For example:

File input will read a file that starts with this:

#Version: 1.0
#Fields: s-dns	date	time	x-duration	c-ip	c-port	c-vx-zone	c-vx-gloc	x-tls-version	x-ciphersuite	x-tls-session	cs-method	cs-uri	cs-version	cs(User-Agent)	cs(Referer)	cs(Cookie)	cs(Range)	sc-status	s-cachestatus	sc-bytes	sc-stream-bytes	sc-dscp	x-connection-age	x-connection-reqs	s-ip	s-port	s-vx-rate	s-vx-rate-status	x-vx-serial	rs-stream-bytes	rs-bytes	cs-vx-token	sc-vx-download-rate	x-protohash	sc(Content-Range)	s-request-id	s-flags	rs-media-bitrate	rs-media-format	x-tus-proc	x-tus-switch	x-tus-vxpl	x-tus-auth	x-tus-wait	x-tus-flookup	x-tus-vxicp	x-tus-usel	x-tus-uconn	x-tus-ureq	x-tus-slookup	x-tus-send	x-tus-dthrot	x-tus-cread	x-tus-uread	x-tus-reqwait	x-failure-reason
#Software: redacted
#Start-Date: 2021-07-21 14:35:00

and then attach those headers as labels to each entry:

{
  "timestamp": "2021-08-11T14:59:18.36875-04:00",
  "severity": 0,
  "labels": {
    "Fields": "rs-stream-bytes\trs-bytes\tcs-vx-token\tsc-vx-download-rate....",
    "Software": "redacted",
    "Start-Date": "2021-07-21 14:35:00",
    "Version": "1.0",
    "file_name": "w3c.log.orig"
  },
  "record": "some raw unparsed w3c/csv record"
}
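
Attaching the header labels to an entry amounts to a map copy. A minimal sketch, assuming a pared-down `Entry` type and a hypothetical `attachLabels` helper (neither is stanza's actual API):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Entry is a pared-down, hypothetical stand-in for a log entry.
type Entry struct {
	Labels map[string]string `json:"labels"`
	Record interface{}       `json:"record"`
}

// attachLabels copies the header labels onto the entry. Each entry gets its
// own copy of the key/value pairs, so a later operator that mutates one
// entry's labels does not affect the shared header map.
func attachLabels(e *Entry, headerLabels map[string]string) {
	if e.Labels == nil {
		e.Labels = make(map[string]string, len(headerLabels))
	}
	for k, v := range headerLabels {
		e.Labels[k] = v
	}
}

func main() {
	headerLabels := map[string]string{
		"Version":  "1.0",
		"Software": "redacted",
	}

	e := &Entry{
		Labels: map[string]string{"file_name": "w3c.log.orig"},
		Record: "some raw unparsed w3c/csv record",
	}
	attachLabels(e, headerLabels)

	out, err := json.MarshalIndent(e, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

In this sketch a header label overwrites an existing label with the same key; whether that is the desired precedence is an assumption.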

Once the CSV parser is updated, we can do this (some fields are removed from this output to keep it small):

{
  "timestamp": "2021-08-11T15:09:39.090829-04:00",
  "severity": 0,
  "labels": {
    "Fields": "rs-stream-bytes\trs-bytes\tcs-vx-token\tsc-vx-download-rate....",
    "Software": "redacted",
    "Start-Date": "2021-07-21 14:35:00",
    "Version": "1.0",
    "file_name": "w3c.log.orig"
  },
  "record": {
    "c-ip": "redacted",
    "c-port": "45674",
    "c-vx-gloc": "g.ca.bc",
    ...
    "x-tus-vxpl": "-",
    "x-tus-wait": "-",
    "x-vx-serial": "138367"
  }
}

The pipeline config to handle this:

pipeline:
- type: file_input
  include:
  - ./w3c.log
  start_at: beginning
  label_regex: '^#(?P<key>.*?): (?P<value>.*)'

# ignore header lines that may exist in the file periodically
- type: filter
  expr: '$record matches "^#"'

# parse tab delimited records using dynamic header
- type: csv_parser
  delimiter: "\t"
  header_delimiter: "\t"
  header_label: Fields

- type: stdout
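
For illustration, the dynamic-header step in this pipeline could behave roughly like the sketch below. The CSV parser change is a future PR, so this is only an assumption about its behavior: `parseWithHeaderLabel` is a hypothetical helper, and plain `strings.Split` is used instead of a real CSV reader, so quoted fields are not handled.

```go
package main

import (
	"fmt"
	"strings"
)

// parseWithHeaderLabel looks up the field names in the label named by
// headerLabel, splits them on headerDelimiter, splits the record on
// delimiter, and pairs names with values by position.
func parseWithHeaderLabel(labels map[string]string, headerLabel, headerDelimiter, delimiter, record string) (map[string]string, error) {
	header, ok := labels[headerLabel]
	if !ok {
		return nil, fmt.Errorf("label %q not found on entry", headerLabel)
	}
	names := strings.Split(header, headerDelimiter)
	values := strings.Split(record, delimiter)
	if len(names) != len(values) {
		return nil, fmt.Errorf("expected %d fields, got %d", len(names), len(values))
	}
	parsed := make(map[string]string, len(names))
	for i, name := range names {
		parsed[name] = values[i]
	}
	return parsed, nil
}

func main() {
	labels := map[string]string{"Fields": "c-ip\tc-port\tc-vx-gloc"}
	records := []string{
		"#Version: 1.0", // repeated header line, dropped like the filter step
		"redacted\t45674\tg.ca.bc",
	}

	for _, record := range records {
		if strings.HasPrefix(record, "#") {
			continue
		}
		parsed, err := parseWithHeaderLabel(labels, "Fields", "\t", "\t", record)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Println(parsed)
	}
}
```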

Please check that the PR fulfills these requirements

  • Tests for the changes have been added (for bug fixes / features)
  • Docs have been added / updated (for bug fixes / features)
  • Add a changelog entry (for non-trivial bug fixes / features)
  • CI passes

@djaglowski
Member

| Log Files | Logs / Second | CPU Avg (%) | CPU Avg Δ (%) | Memory Avg (MB) | Memory Avg Δ (MB) |
|---|---|---|---|---|---|
| 1 | 1000 | 1.4655279 | -0.08626604 | 129.21713 | -0.13644409 |
| 1 | 5000 | 5.103597 | -0.017116547 | 136.4177 | -1.4542084 |
| 1 | 10000 | 10.103542 | +0 | 146.5567 | +0 |
| 1 | 50000 | 55.829834 | +0 | 176.13254 | +0 |
| 1 | 100000 | 97.82954 | -9.550255 | 227.09967 | -16.58702 |
| 10 | 100 | 1.9656245 | +0.08629382 | 133.51886 | -0.6135559 |
| 10 | 500 | 6.224127 | -0.15523577 | 139.21013 | -2.9342651 |
| 10 | 1000 | 11.9141655 | +0 | 145.78004 | +0 |
| 10 | 5000 | 56.914448 | +3.5710373 | 178.59174 | -1.2013702 |
| 10 | 10000 | 112.31294 | +0 | 226.0765 | +0 |

@codecov

codecov bot commented Aug 11, 2021

Codecov Report

Merging #376 (13a6470) into master (4a234f9) will decrease coverage by 0.14%.
The diff coverage is 40.00%.

@@            Coverage Diff             @@
##           master     #376      +/-   ##
==========================================
- Coverage   73.21%   73.07%   -0.14%     
==========================================
  Files         124      124              
  Lines        8011     8055      +44     
==========================================
+ Hits         5865     5886      +21     
- Misses       1647     1666      +19     
- Partials      499      503       +4     
| Impacted Files | Coverage Δ |
|---|---|
| operator/builtin/input/file/fingerprint.go | 90.48% <ø> (ø) |
| operator/builtin/input/file/reader.go | 60.87% <11.54%> (-10.18%) ⬇️ |
| operator/builtin/input/file/file.go | 78.57% <44.44%> (-0.49%) ⬇️ |
| operator/builtin/input/file/config.go | 81.05% <86.67%> (+1.05%) ⬆️ |
| operator/builtin/output/newrelic/newrelic.go | 73.55% <0.00%> (-0.83%) ⬇️ |
| operator/builtin/output/otlp/otlp.go | 68.13% <0.00%> (+3.30%) ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Member

@djaglowski djaglowski left a comment

A couple issues. See comments.

operator/builtin/input/file/config.go (outdated, resolved)
operator/builtin/input/file/config.go (outdated, resolved)
operator/builtin/input/file/reader.go (outdated, resolved)
operator/builtin/input/file/reader.go (outdated, resolved)
@djaglowski
Member

| Log Files | Logs / Second | CPU Avg (%) | CPU Avg Δ (%) | Memory Avg (MB) | Memory Avg Δ (MB) |
|---|---|---|---|---|---|
| 1 | 1000 | 1.5 | -0.051793933 | 128.7174 | -0.63616943 |
| 1 | 5000 | 5.258783 | +0.13806915 | 140.79903 | +2.927124 |
| 1 | 10000 | 10.534573 | +0 | 144.88577 | +0 |
| 1 | 50000 | 50.828434 | +0 | 176.49757 | +0 |
| 1 | 100000 | 100.030624 | -7.349167 | 232.47871 | -11.207977 |
| 10 | 100 | 2.0518632 | +0.17253256 | 134.5043 | +0.3718872 |
| 10 | 500 | 6.1380486 | -0.24131393 | 137.89319 | -4.2512054 |
| 10 | 1000 | 11.827702 | +0 | 147.5815 | +0 |
| 10 | 5000 | 57.347363 | +4.003952 | 182.11166 | +2.3185577 |
| 10 | 10000 | 108.77695 | +0 | 225.02277 | +0 |

@djaglowski
Member

| Log Files | Logs / Second | CPU Avg (%) | CPU Avg Δ (%) | Memory Avg (MB) | Memory Avg Δ (MB) |
|---|---|---|---|---|---|
| 1 | 1000 | 1.5517483 | +0.017143726 | 130.5745 | -3.4251099 |
| 1 | 1000 | 1.4138529 | -0.12075162 | 127.20798 | -6.791626 |
| 1 | 5000 | 6.103733 | +1.1037092 | 137.4468 | -0.8407898 |
| 1 | 5000 | 5.2069902 | +0.2069664 | 138.53569 | +0.24810791 |
| 1 | 10000 | 10.224202 | -0.2759943 | 143.02492 | -2.3033447 |
| 1 | 10000 | 10.172573 | -0.32762337 | 143.11638 | -2.2118835 |
| 1 | 50000 | 50.328503 | +0.10372543 | 172.67148 | -6.1954346 |
| 1 | 50000 | 55.51832 | +5.293541 | 175.5505 | -3.3164062 |
| 1 | 100000 | 97.932106 | +3.6193619 | 226.57866 | -4.29216 |
| 1 | 100000 | 99.17467 | +4.861923 | 223.61853 | -7.252289 |
| 10 | 100 | 2.2759836 | +0.34492123 | 132.31721 | -0.20231628 |
| 10 | 100 | 1.8448126 | -0.08624971 | 133.82288 | +1.3033447 |
| 10 | 500 | 6.22416 | +0.24128151 | 140.00835 | -1.9190521 |
| 10 | 500 | 5.7930274 | -0.18985128 | 139.89749 | -2.0299072 |
| 10 | 1000 | 11.4484415 | +0.62081814 | 145.50215 | -1.0521393 |
| 10 | 1000 | 13.31055 | +2.4829264 | 144.74811 | -1.8061829 |
| 10 | 5000 | 55.104153 | +0 | 177.53152 | +0 |
| 10 | 5000 | 55.208244 | +0 | 180.25499 | +0 |
| 10 | 10000 | 110.98502 | +1.0256729 | 230.75876 | -0.92388916 |
| 10 | 10000 | 106.18756 | -3.7717896 | 228.08069 | -3.6019592 |

@jsirianni jsirianni requested a review from djaglowski August 20, 2021 19:59
Comment on lines +317 to +319
/*if err := newReader.readHeaders(ctx); err != nil {
f.Errorf("error while reading file headers: %s", err)
}*/

We should remove this

djaglowski previously approved these changes Aug 20, 2021
@jsirianni jsirianni marked this pull request as ready for review August 20, 2021 20:06
@djaglowski
Member

| Log Files | Logs / Second | CPU Avg (%) | CPU Avg Δ (%) | Memory Avg (MB) | Memory Avg Δ (MB) |
|---|---|---|---|---|---|
| 1 | 1000 | 1.3965669 | -0.13803768 | 130.43939 | -3.5602112 |
| 1 | 5000 | 5.3276806 | +0.32765675 | 138.18831 | -0.09927368 |
| 1 | 10000 | 10.551935 | +0.05173874 | 143.96889 | -1.359375 |
| 1 | 50000 | 48.96703 | -1.2577477 | 170.05873 | -8.808182 |
| 1 | 100000 | 101.017494 | +6.70475 | 241.28691 | +10.416092 |
| 10 | 100 | 1.948277 | +0.017214656 | 133.91931 | +1.3997803 |
| 10 | 500 | 6.3967204 | +0.41384172 | 142.56049 | +0.63308716 |
| 10 | 1000 | 11.91315 | +1.0855265 | 147.8152 | +1.26091 |
| 10 | 5000 | 56.382088 | +3.5026054 | 183.31183 | +3.443161 |
| 10 | 10000 | 107.50693 | -2.452423 | 244.50742 | +12.824768 |

@djaglowski
Member

| Log Files | Logs / Second | CPU Avg (%) | CPU Avg Δ (%) | Memory Avg (MB) | Memory Avg Δ (MB) |
|---|---|---|---|---|---|
| 1 | 1000 | 1.4655358 | +0.034465313 | 125.27842 | -2.4197235 |
| 1 | 5000 | 5.172553 | +0.17239428 | 137.86961 | +2.9621582 |
| 1 | 10000 | 10.206806 | -0.6211052 | 146.43939 | +2.704071 |
| 1 | 50000 | 51.413994 | +2.27536 | 177.8253 | +4.458252 |
| 1 | 100000 | 96.24193 | -1.0513458 | 223.6149 | -12.847931 |
| 10 | 100 | 1.9311762 | +0.01736331 | 132.67538 | -2.186142 |
| 10 | 500 | 6.120885 | -0.5000076 | 140.72629 | -1.1846771 |
| 10 | 1000 | 12.120882 | +0.896719 | 147.15517 | -1.6155701 |
| 10 | 5000 | 56.25758 | -0.5536461 | 181.3661 | -0.3713684 |
| 10 | 10000 | 109.802055 | +4.633072 | 225.25095 | +12.287186 |

@djaglowski
Member

| Log Files | Logs / Second | CPU Avg (%) | CPU Avg Δ (%) | Memory Avg (MB) | Memory Avg Δ (MB) |
|---|---|---|---|---|---|
| 1 | 1000 | 1.4828056 | +0.051735163 | 129.05617 | +1.3580246 |
| 1 | 5000 | 5.172513 | +0.17235422 | 135.98653 | +1.079071 |
| 1 | 10000 | 10.379475 | -0.44843674 | 144.3521 | +0.6167755 |
| 1 | 50000 | 47.47703 | -1.661602 | 176.70352 | +3.3364716 |
| 1 | 100000 | 96.268074 | -1.0251999 | 232.20609 | -4.2567444 |
| 10 | 100 | 2.2587035 | +0.3448906 | 134.34712 | -0.5144043 |
| 10 | 500 | 6.103602 | -0.5172906 | 139.92606 | -1.984909 |
| 10 | 1000 | 11.948519 | +0.7243557 | 144.52048 | -4.2502594 |
| 10 | 5000 | 62.47068 | +5.6594543 | 186.81546 | +5.0779877 |
| 10 | 10000 | 107.49849 | +2.329506 | 210.05563 | -2.9081268 |

@jsirianni jsirianni merged commit 8484510 into master Aug 21, 2021
@jsirianni jsirianni deleted the file-input-label-regex branch August 21, 2021 15:09
This was referenced Aug 21, 2021