Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filebeat S3 input plugin cannot parse jsonl file with content-type set as application/json #19902

Closed
lag13 opened this issue Jul 14, 2020 · 6 comments · Fixed by #19962
Closed
Assignees
Labels
Team:Platforms Label for the Integrations - Platforms team

Comments

@lag13
Copy link

lag13 commented Jul 14, 2020

Hey! I had created this topic two weeks ago https://discuss.elastic.co/t/filebeat-s3-cannot-parse-jsonl-file-whos-content-type-is-set-to-application-json/239374 but no one has responded and I am convinced that this is a bug or something that should be fixed so I thought I'd try making an issue. I appreciate any feedback and sorry if this is just extra noise for y'all.

The problem is that the filebeat S3 input plugin cannot process an s3 object who's content-type is application/json AND the object content is a separate json object per line (i.e. jsonl). Processing such an object used to be possible until v7.7.0 when the S3 input plugin started enforcing json parsing if it saw a content-type of application/json: https://github.com/elastic/beats/blob/v7.7.0/x-pack/filebeat/input/s3/input.go#L432
Before that, json processing was ONLY controlled by the expand_event_list_from_field configuration like you can see here: https://github.com/elastic/beats/blob/v7.6.2/x-pack/filebeat/input/s3/input.go#L431

Now, it's probably the case that content-type on the s3 object should NOT be application/json in the first place but sadly I do not have control over that 😦. I'm essentially dealing with the same problem as #18696 but cloudflare is the entity pushing the logs to s3 (where for him it is AWS GuardDuty) and I don't have control over how cloudflare sets content-type.

I had reproduced this issue by compiling filebeat off of master two weeks ago (here was the filbeat version output filebeat version 8.0.0 (amd64), libbeat 8.0.0 [3341c1bca5626d1ee90af692617f10f58695ed1c built 2020-06-30 20:15:28 +0000 UTC]) and, looking at the source code today (2297636), it seems this problem is still there. Here were the steps I followed when I reproduced this issue:

$ cat s3filebeat.log 
{"id": "0001", "hey": "there", "how": {"are": "you"}}
{"id": "0002", "hope": "you", "are": {"doing": "well"}}
{"id": "0003", "I": "am", "doing": {"O": "K"}}

$ gzip s3filebeat.log 

$ aws --profile PROFILE s3api put-object --body ./s3filebeat.log.gz --bucket lucas-test-filebeat-s3 --content-encoding gzip --content-type application/json --key s3filebeat.log.gz
{
    "ETag": "\"955ed9f01b6ee38dbba167daab9ebbbb\""
}

$ cat filebeat.yml 
filebeat.inputs:
- type: s3
  queue_url: "https://sqs.us-east-1.amazonaws.com/ACCTNUM/cloudflare_logs_dev"
  role_arn: "arn:aws:iam::ACCTNUM:role/cloudflare_logs_filebeat_access_s3_sqs_dev"

output.console:
  pretty: true

$ ./filebeat -e
2020-06-30T15:39:04.180-0500	INFO	instance/beat.go:628	Home path: [/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat] Config path: [/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat] Data path: [/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat/data] Logs path: [/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat/logs]
2020-06-30T15:39:04.180-0500	INFO	instance/beat.go:636	Beat ID: 2ef98d6a-ef7c-4885-a258-b4341a63b43b
2020-06-30T15:39:04.181-0500	INFO	[seccomp]	seccomp/seccomp.go:124	Syscall filter successfully installed
2020-06-30T15:39:04.181-0500	INFO	[beat]	instance/beat.go:964	Beat info	{"system_info": {"beat": {"path": {"config": "/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat", "data": "/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat/data", "home": "/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat", "logs": "/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat/logs"}, "type": "filebeat", "uuid": "2ef98d6a-ef7c-4885-a258-b4341a63b43b"}}}
2020-06-30T15:39:04.181-0500	INFO	[beat]	instance/beat.go:973	Build info	{"system_info": {"build": {"commit": "3341c1bca5626d1ee90af692617f10f58695ed1c", "libbeat": "8.0.0", "time": "2020-06-30T20:15:28.000Z", "version": "8.0.0"}}}
2020-06-30T15:39:04.181-0500	INFO	[beat]	instance/beat.go:976	Go runtime info	{"system_info": {"go": {"os":"linux","arch":"amd64","max_procs":12,"version":"go1.13.3"}}}
2020-06-30T15:39:04.182-0500	INFO	[beat]	instance/beat.go:980	Host info	{"system_info": {"host": {"architecture":"x86_64","boot_time":"2020-06-22T10:53:41-05:00","containerized":false,"name":"lgroenendaal-XPS-15-9570","ip":["127.0.0.1/8","::1/128","192.168.86.24/24","fe80::c38c:a654:ca3e:d0dd/64","10.20.211.109/32","fe80::d2ee:dd77:2943:aa45/64","172.17.0.1/16"],"kernel_version":"5.3.0-59-generic","mac":["9c:b6:d0:c6:01:39","02:42:e2:23:9b:84"],"os":{"family":"debian","platform":"ubuntu","name":"Ubuntu","version":"18.04.2 LTS (Bionic Beaver)","major":18,"minor":4,"patch":2,"codename":"bionic"},"timezone":"CDT","timezone_offset_sec":-18000,"id":"5a8843fa712d481595ebd41926cda45f"}}}
2020-06-30T15:39:04.183-0500	INFO	[beat]	instance/beat.go:1009	Process info	{"system_info": {"process": {"capabilities": {"inheritable":null,"permitted":null,"effective":null,"bounding":["chown","dac_override","dac_read_search","fowner","fsetid","kill","setgid","setuid","setpcap","linux_immutable","net_bind_service","net_broadcast","net_admin","net_raw","ipc_lock","ipc_owner","sys_module","sys_rawio","sys_chroot","sys_ptrace","sys_pacct","sys_admin","sys_boot","sys_nice","sys_resource","sys_time","sys_tty_config","mknod","lease","audit_write","audit_control","setfcap","mac_override","mac_admin","syslog","wake_alarm","block_suspend","audit_read"],"ambient":null}, "cwd": "/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat", "exe": "/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat/filebeat", "name": "filebeat", "pid": 5578, "ppid": 4156, "seccomp": {"mode":"filter","no_new_privs":true}, "start_time": "2020-06-30T15:39:03.640-0500"}}}
2020-06-30T15:39:04.183-0500	INFO	instance/beat.go:298	Setup Beat: filebeat; Version: 8.0.0
2020-06-30T15:39:04.183-0500	INFO	[publisher]	pipeline/module.go:113	Beat name: lgroenendaal-XPS-15-9570
2020-06-30T15:39:04.184-0500	WARN	beater/filebeat.go:151	Filebeat is unable to load the Ingest Node pipelines for the configured modules because the Elasticsearch output is not configured/enabled. If you have already loaded the Ingest Node pipelines or are using Logstash pipelines, you can ignore this warning.
2020-06-30T15:39:04.184-0500	INFO	[monitoring]	log/log.go:118	Starting metrics logging every 30s
2020-06-30T15:39:04.184-0500	INFO	instance/beat.go:449	filebeat start running.
2020-06-30T15:39:04.184-0500	WARN	beater/filebeat.go:251	Filebeat is unable to load the Ingest Node pipelines for the configured modules because the Elasticsearch output is not configured/enabled. If you have already loaded the Ingest Node pipelines or are using Logstash pipelines, you can ignore this warning.
2020-06-30T15:39:04.184-0500	INFO	registrar/registrar.go:145	Loading registrar data from /home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat/data/registry/filebeat/data.json
2020-06-30T15:39:04.184-0500	INFO	registrar/registrar.go:152	States Loaded from registrar: 0
2020-06-30T15:39:04.184-0500	INFO	[crawler]	beater/crawler.go:71	Loading Inputs: 1
2020-06-30T15:39:04.185-0500	WARN	[cfgwarn]	s3/input.go:131	BETA: s3 input type is used
2020-06-30T15:39:04.185-0500	INFO	[crawler]	beater/crawler.go:141	Starting input (ID: 18222034013403473169)
2020-06-30T15:39:04.185-0500	INFO	[crawler]	beater/crawler.go:108	Loading and starting Inputs completed. Enabled inputs: 1
2020-06-30T15:39:04.186-0500	INFO	[s3]	s3/input.go:173	visibility timeout is set to 300 seconds
2020-06-30T15:39:04.186-0500	INFO	[s3]	s3/input.go:174	aws api timeout is set to 2m0s
2020-06-30T15:39:04.186-0500	INFO	[s3]	s3/input.go:196	s3 input worker has started. with queueURL: https://sqs.us-east-1.amazonaws.com/374144443638/cloudflare_logs_dev
2020-06-30T15:39:14.633-0500	ERROR	[s3]	s3/input.go:458	expand_event_list_from_field parameter is missing in config for application/json content-type file
2020-06-30T15:39:14.633-0500	ERROR	[s3]	s3/input.go:393	createEventsFromS3Info failed processing file from s3 bucket "lucas-test-filebeat-s3" with name "s3filebeat.log.gz": expand_event_list_from_field parameter is missing in config for application/json content-type file

The imporant error is the last line which tells me that my object cannot be processed even though I think it should be processable: s3/input.go:393 createEventsFromS3Info failed processing file from s3 bucket "lucas-test-filebeat-s3" with name "s3filebeat.log.gz": expand_event_list_from_field parameter is missing in config for application/json content-type file.

If I try, just for fun, to include the expand_event_list_from_field config it will, understandably, fail to parse and we'll get the WARN log: s3/input.go:542 decode json failed for 's3filebeat.log.gz' from S3 bucket 'lucas-test-filebeat-s3', skipping this file: json: cannot unmarshal string into Go value of type []interface {}.

Any advice on how to proceed here? Personally I think the appropriate change would be to ignore content-type: application/json and have expand_event_list_from_field be the ONLY thing controlling whether or not to parse the object content as JSON (which is how the logic used to be) but I defer to you maintainers because you have the vision for how you want the code to behave.

I really appreciate any who took the time to read this and I really apprecate the creation of this feature, it's a nifty one. Cheers.

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Jul 14, 2020
@andresrc andresrc added the Team:Platforms Label for the Integrations - Platforms team label Jul 15, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/integrations-platforms (Team:Platforms)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Jul 15, 2020
@kaiyan-sheng kaiyan-sheng self-assigned this Jul 15, 2020
@kaiyan-sheng
Copy link
Contributor

Hi @lag13 sorry we missed the discuss issue! This does look like a bug to me so thank you for creating this issue. One question: What does AWS Cloudflare set for the content-type?

@lag13
Copy link
Author

lag13 commented Jul 15, 2020

That's fine! I'm glad you saw this one :). Cloudflare (and it's just cloudflare, not "AWS cloudflare" https://www.cloudflare.com/) sets the content-type to application/json for the S3 objects it creates. Not sure if should be setting it to that (since the file is a separate json object on each line) but I have no control over it unfortunately and I feel like, regardless, the s3 input plugin should be at least capable of processing such an entity.

I'm curious as to why the code was changed in 7.7.0 to force json parsing if the content-type was application/json (https://github.com/elastic/beats/blob/v7.7.0/x-pack/filebeat/input/s3/input.go#L432) where before it was only controlled by the presence of the expand_event_list_from_field config: https://github.com/elastic/beats/blob/v7.6.2/x-pack/filebeat/input/s3/input.go#L431. Seems like that change is strictly worse since all it does is limit behavior but I defer to y'all on what you think is best!

Currently I'm using version 7.5.2 of filebeat to process things and that's working just fine. Thanks for getting back to me on this. Let me know if there's anything I can do to help. I'd be happy to attempt a contribution if you are swamped with other things.

@kaiyan-sheng
Copy link
Contributor

@lag13 I just created an initial PR for fixing this issue. Do you mind sending me a real log file from Cloudflare for testing? Please feel free to review/test it. Thanks!

@lag13
Copy link
Author

lag13 commented Jul 16, 2020

@kaiyan-sheng Thanks so much!!! I think what you have in your PR works just fine. The important bit is those separate json objects per line. That being said, the cloudflare logs looks like this:

{"CacheCacheStatus":"unknown","CacheResponseBytes":1963,"CacheResponseStatus":200,"CacheTieredFill":false,"ClientASN":14618,"ClientCountry":"us","ClientDeviceType":"desktop","ClientIP":"1.2.3.4","ClientIPClass":"noRecord","ClientRequestBytes":2400,"ClientRequestHost":"test.com","ClientRequestMethod":"POST","ClientRequestPath":"/hey/there","ClientRequestProtocol":"HTTP/1.1","ClientRequestReferer":"","ClientRequestURI":"/hey/there","ClientRequestUserAgent":"Apache-HttpClient/4.5.8 (Java/1.8.0_201)","ClientSSLCipher":"something-cool","ClientSSLProtocol":"TLSv1.2","ClientSrcPort":47463,"ClientXRequestedWith":"","EdgeColoCode":"IAD","EdgeColoID":16,"EdgeEndTimestamp":"2020-06-25T01:25:12Z","EdgePathingOp":"wl","EdgePathingSrc":"macro","EdgePathingStatus":"nr","EdgeRateLimitAction":"","EdgeRateLimitID":0,"EdgeRequestHost":"test.com","EdgeResponseBytes":972,"EdgeResponseCompressionRatio":0,"EdgeResponseContentType":"application/json","EdgeResponseStatus":200,"EdgeServerIP":"5.6.7.8","EdgeStartTimestamp":"2020-06-25T01:25:12Z","FirewallMatchesActions":[],"FirewallMatchesRuleIDs":[],"FirewallMatchesSources":[],"OriginIP":"1.2.3.4","OriginResponseBytes":0,"OriginResponseHTTPExpires":"","OriginResponseHTTPLastModified":"","OriginResponseStatus":200,"OriginResponseTime":18000000,"OriginSSLProtocol":"TLSv1.2","ParentRayID":"00","RayID":"5a8ae4af0e520d86","SecurityLevel":"high","WAFAction":"unknown","WAFFlags":"0","WAFMatchedVar":"","WAFProfile":"unknown","WAFRuleID":"","WAFRuleMessage":"","WorkerCPUTime":0,"WorkerStatus":"unknown","WorkerSubrequest":false,"WorkerSubrequestCount":0,"ZoneID":143880799}
{"CacheCacheStatus":"unknown","CacheResponseBytes":1963,"CacheResponseStatus":200,"CacheTieredFill":false,"ClientASN":14618,"ClientCountry":"us","ClientDeviceType":"desktop","ClientIP":"1.2.3.4","ClientIPClass":"noRecord","ClientRequestBytes":2400,"ClientRequestHost":"test.com","ClientRequestMethod":"POST","ClientRequestPath":"/hey/buddy","ClientRequestProtocol":"HTTP/1.1","ClientRequestReferer":"","ClientRequestURI":"/hey/buddy","ClientRequestUserAgent":"Apache-HttpClient/4.5.8 (Java/1.8.0_201)","ClientSSLCipher":"something-cool","ClientSSLProtocol":"TLSv1.2","ClientSrcPort":47463,"ClientXRequestedWith":"","EdgeColoCode":"IAD","EdgeColoID":16,"EdgeEndTimestamp":"2020-06-25T01:25:12Z","EdgePathingOp":"wl","EdgePathingSrc":"macro","EdgePathingStatus":"nr","EdgeRateLimitAction":"","EdgeRateLimitID":0,"EdgeRequestHost":"test.com","EdgeResponseBytes":972,"EdgeResponseCompressionRatio":0,"EdgeResponseContentType":"application/json","EdgeResponseStatus":200,"EdgeServerIP":"5.6.7.8","EdgeStartTimestamp":"2020-06-25T01:25:12Z","FirewallMatchesActions":[],"FirewallMatchesRuleIDs":[],"FirewallMatchesSources":[],"OriginIP":"1.2.3.4","OriginResponseBytes":0,"OriginResponseHTTPExpires":"","OriginResponseHTTPLastModified":"","OriginResponseStatus":200,"OriginResponseTime":18000000,"OriginSSLProtocol":"TLSv1.2","ParentRayID":"00","RayID":"5a8ae4af0e520d86","SecurityLevel":"high","WAFAction":"unknown","WAFFlags":"0","WAFMatchedVar":"","WAFProfile":"unknown","WAFRuleID":"","WAFRuleMessage":"","WorkerCPUTime":0,"WorkerStatus":"unknown","WorkerSubrequest":false,"WorkerSubrequestCount":0,"ZoneID":143880799}

.... etc....

@lag13
Copy link
Author

lag13 commented Jul 16, 2020

I took a quick look at your PR. I'll try to test it out myself before tomorrow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Team:Platforms Label for the Integrations - Platforms team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants