SIGSEGV with input forward plugin #7240

Closed
tugtugtug opened this issue Apr 21, 2023 · 9 comments · Fixed by aws/aws-for-fluent-bit#642

Comments

@tugtugtug

tugtugtug commented Apr 21, 2023

Bug Report

Describe the bug
When fluent-bit 2.1.0 (out_forward) sends to fluent-bit 1.9.10 (in_forward), the 1.9.10 side crashes constantly.

To Reproduce

  • Have the output side running fluent-bit 2.1.0 with forward plugin
  • Have the input side running fluent-bit 1.9.10 with forward plugin
    stacktrace of the input side:
#0  0x00007fe556dec676 in __strlen_sse2 () from target:/lib64/libc.so.6
#1  0x00007fe556da6e47 in vfprintf () from target:/lib64/libc.so.6
#2  0x00007fe556dccfd0 in vsnprintf () from target:/lib64/libc.so.6
#3  0x00000000004d497c in flb_log_print (type=2, file=0x0, line=0, fmt=0xa32e38 "unknown time format %s") at /tmp/fluent-bit-1.9.10/src/flb_log.c:411
#4  0x00000000004fdf41 in flb_time_pop_from_mpack (time=0x7fe4cbbf34c0, reader=0x7fe4cbbf3020) at /tmp/fluent-bit-1.9.10/src/flb_time.c:349
        tag = {type = mpack_type_array, exttype = 0 '\000', v = {u = 2, i = 2, d = 9.8813129168249309e-324, f = 2.80259693e-45, b = 2, l = 2, n = 2}}
        d = 6.4030907701025552e-321
        f = 4.58785117e-41
        i = 140620280571024
        tmp = 32740
        extbuf = "\344\177\000\000\300$ߵ"
        ext_len = 140620280565792
#5  0x000000000060e810 in cb_lua_filter_mpack (data=0x7fe4ae4e9040, bytes=1223, tag=0x7fe4b5a45d70 "payload.json", tag_len=12, out_buf=0x7fe4cbbf35a8, out_bytes=0x7fe4cbbf3598, f_ins=0x7fe55401fd80, i_ins=0x7fe55400a780, 
    filter_context=0x7fe4cac35760, config=0x7fe554019f80) at /tmp/fluent-bit-1.9.10/plugins/filter_lua/lua.c:183
#6  0x00000000004e0630 in flb_filter_do (ic=0x7fe4b591c200, data=0x7fe4b5aa8000, bytes=1218, tag=0x7fe4c0b61210 "payload.json", tag_len=12, config=0x7fe554019f80) at /tmp/fluent-bit-1.9.10/src/flb_filter.c:124
#7  0x000000000051d1e1 in input_chunk_append_raw (in=0x7fe55400a780, n_records=1, tag=0x7fe4c0b61210 "payload.json", tag_len=12, buf=0x7fe4b5aa8000, buf_size=1218) at /tmp/fluent-bit-1.9.10/src/flb_input_chunk.c:1585
#8  0x000000000051d4d3 in flb_input_chunk_append_raw (in=0x7fe55400a780, tag=0x7fe4c0b61210 "payload.json", tag_len=12, buf=0x7fe4b5aa8000, buf_size=1218) at /tmp/fluent-bit-1.9.10/src/flb_input_chunk.c:1699
#9  0x00000000005720e9 in fw_process_array (in=0x7fe55400a780, conn=0x7fe4b59af100, tag=0x7fe4c0b61210 "payload.json", tag_len=12, root=0x7fe4cbbf38e0, arr=0x7fe4cbbf3920, chunk_id=-1)
    at /tmp/fluent-bit-1.9.10/plugins/in_forward/fw_prot.c:198
#10 0x0000000000572980 in fw_prot_process (conn=0x7fe4b59af100) at /tmp/fluent-bit-1.9.10/plugins/in_forward/fw_prot.c:409
ret = 0
stag_len = 12
c = 0
chunk_id = 18446744073709551615
stag = 0x7fe4b4ec4006 "payload.json\221", <incomplete sequence \335>
out_tag = 0x7fe4aaf5d110 "payload.json"
bytes = 1318
buf_off = 1318
recv_len = 6
gz_size = 140620280568208
gz_data = 0x23aafec028
tag = {type = MSGPACK_OBJECT_STR, via = {boolean = 12, u64 = 140617229271052, i64 = 140617229271052, f64 = 6.9474142196208944e-310, array = {size = 12, ptr = 0x7fe4b4ec4006}, map = {size = 12, ptr = 0x7fe4b4ec4006}, 
	str = {size = 12, ptr = 0x7fe4b4ec4006 "payload.json\221", <incomplete sequence \335>}, bin = {size = 12, ptr = 0x7fe4b4ec4006 "payload.json\221", <incomplete sequence \335>}, ext = {type = 12 '\f', 
	  size = 32740, ptr = 0x7fe4b4ec4006 "payload.json\221", <incomplete sequence \335>}}}
entry = {type = MSGPACK_OBJECT_ARRAY, via = {boolean = true, u64 = 1, i64 = 1, f64 = 4.9406564584124654e-324, array = {size = 1, ptr = 0x7fe4b4eea038}, map = {size = 1, ptr = 0x7fe4b4eea038}, str = {size = 1, 
	  ptr = 0x7fe4b4eea038 "\006"}, bin = {size = 1, ptr = 0x7fe4b4eea038 "\006"}, ext = {type = 1 '\001', size = 0, ptr = 0x7fe4b4eea038 "\006"}}}
map = {type = 3051297120, via = {boolean = 16, u64 = 140620264125200, i64 = 140620264125200, f64 = 6.947564161338361e-310, array = {size = 3034854160, ptr = 0x7fe53ca19300}, map = {size = 3034854160, 
	  ptr = 0x7fe53ca19300}, str = {size = 3034854160, ptr = 0x7fe53ca19300 "\030"}, bin = {size = 3034854160, ptr = 0x7fe53ca19300 "\030"}, ext = {type = 16 '\020', size = 32740, ptr = 0x7fe53ca19300 "\030"}}}
root = {type = MSGPACK_OBJECT_ARRAY, via = {boolean = 2, u64 = 140617229271042, i64 = 140617229271042, f64 = 6.9474142196204004e-310, array = {size = 2, ptr = 0x7fe4b4eea008}, map = {size = 2, ptr = 0x7fe4b4eea008}, 
	str = {size = 2, ptr = 0x7fe4b4eea008 "\005"}, bin = {size = 2, ptr = 0x7fe4b4eea008 "\005"}, ext = {type = 2 '\002', size = 32740, ptr = 0x7fe4b4eea008 "\005"}}}
chunk = {type = 3034854160, via = {boolean = 37, u64 = 450982268709, i64 = 450982268709, f64 = 2.2281484585266268e-312, array = {size = 10702629, ptr = 0x600007ff99a0002}, map = {size = 10702629, 
	  ptr = 0x600007ff99a0002}, str = {size = 10702629, ptr = 0x600007ff99a0002 <error: Cannot access memory at address 0x600007ff99a0002>}, bin = {size = 10702629, 
	  ptr = 0x600007ff99a0002 <error: Cannot access memory at address 0x600007ff99a0002>}, ext = {type = 37 '%', size = 105, ptr = 0x600007ff99a0002 <error: Cannot access memory at address 0x600007ff99a0002>}}}
result = {zone = 0x7fe49fafaf68, data = {type = MSGPACK_OBJECT_ARRAY, via = {boolean = 2, u64 = 140617229271042, i64 = 140617229271042, f64 = 6.9474142196204004e-310, array = {size = 2, ptr = 0x7fe4b4eea008}, map = {
		size = 2, ptr = 0x7fe4b4eea008}, str = {size = 2, ptr = 0x7fe4b4eea008 "\005"}, bin = {size = 2, ptr = 0x7fe4b4eea008 "\005"}, ext = {type = 2 '\002', size = 32740, ptr = 0x7fe4b4eea008 "\005"}}}}
unp = 0x7fe49fbc0940
all_used = 1318
mp_sbuf = {size = 140620264125280, data = 0x7fe49faa60f8 "`3\344\264\344\177", alloc = 140620280567984}
mp_pck = {data = 0x7fe4b5df18b0, callback = 0x7fe4b4e43360}
#11 0x00000000005710f3 in fw_conn_event (data=0x7fe4b59af100) at /tmp/fluent-bit-1.9.10/plugins/in_forward/fw_conn.c:81
#12 0x00000000004f4ba5 in flb_engine_start (config=0x7fe554019f80) at /tmp/fluent-bit-1.9.10/src/flb_engine.c:854
#13 0x00000000004d3783 in flb_lib_worker (data=0x7fe554018000) at /tmp/fluent-bit-1.9.10/src/flb_lib.c:626
#14 0x00007fe558a4e44b in start_thread () from target:/lib64/libpthread.so.0
#15 0x00007fe556e4952f in clone () from target:/lib64/libc.so.6
  • Steps to reproduce the problem:

Expected behavior
Fluent-bit should not crash and should remain backward compatible unless otherwise stated.

Your Environment

  • Version used: 1.9.10 and 2.1.0
  • Configuration:
    Input:
[INPUT]
    Name    forward
    Alias   input.forward.ingress
    Listen  0.0.0.0
    Port    24225
    Buffer_Chunk_Size 1MB
    Buffer_Max_Size   64MB
    # Tag must be set by the source of the forward
    Tag     payload.json

[FILTER]
    Name nest
    Match *.json
    Operation nest
    Wildcard *
    Nest_under log

[FILTER]
    Name   lua
    Alias  filter.lua.create_event_tag
    Match  payload.*
    script event-filter.lua
    call   create_event_tag_field

Output:

[OUTPUT]
    Name   forward
    Match  *
    Host   log-collector
    Port   24225
    Storage.total_limit_size  256M
    Tag    payload.json
    Workers 2
    net.keepalive_max_recycle 30

  • Environment name and version (e.g. Kubernetes? What version?): EKS 1.23
  • Server type and version: AWS t3a.xlarge
  • Operating System and version: 5.4.238-148.346.amzn2.x86_64
  • Filters and plugins: the input side of the service uses nest and lua as filters and forward as the input plugin

Additional context
The input side of the service cannot be changed or upgraded because 1.9.10 is the latest version released by AWS.
Any workaround for configuring the output side or input side would be appreciated.
Issue was not observed with the 1.9.x releases running as the output service.

@nokute78
Collaborator

@leonardo-albertovich (Cc: @edsiper )
I think #7133 caused this issue.
Old fluent-bit can't receive the new format because it expects the first element of each entry to be event_time.

I think fluentd is also affected (I haven't tested it).

Format of v2.0.7

[event_time,{"message":"dummy"}]
First element is event_time.

{"format":"fixarray", "header":"0x92", "length":2, "raw":"0x92d700644348df229523df81a76d657373616765a564756d6d79", "value":
    [
        {"format":"event time", "header":"0xd7", "type":0, "raw":"0xd700644348df229523df", "value":"2023-04-22 11:39:27.580199391 +0900 JST"},
        {"format":"fixmap", "header":"0x81", "length":1, "raw":"0x81a76d657373616765a564756d6d79", "value":
            [
                {"key":
                    {"format":"fixstr", "header":"0xa7", "raw":"0xa76d657373616765", "value":"message"},
                 "value":
                    {"format":"fixstr", "header":"0xa5", "raw":"0xa564756d6d79", "value":"dummy"}
                }
            ]
        }
    ]
}

Format of Current master

[[event_time, {}], {"message":"dummy"}]
First element is array [event_time, {}].

{"format":"array 32", "header":"0xdd", "length":2, "raw":"0xdd00000002dd00000002d70064434944229413c88081a76d657373616765a564756d6d79", "value":
    [
        {"format":"array 32", "header":"0xdd", "length":2, "raw":"0xdd00000002d70064434944229413c880", "value":
            [
                {"format":"event time", "header":"0xd7", "type":0, "raw":"0xd70064434944229413c8", "value":"2023-04-22 11:41:08.580129736 +0900 JST"},
                {"format":"fixmap", "header":"0x80", "length":0, "raw":"0x80", "value":
                    [

                    ]
                }
            ]
        }
,
        {"format":"fixmap", "header":"0x81", "length":1, "raw":"0x81a76d657373616765a564756d6d79", "value":
            [
                {"key":
                    {"format":"fixstr", "header":"0xa7", "raw":"0xa76d657373616765", "value":"message"},
                 "value":
                    {"format":"fixstr", "header":"0xa5", "raw":"0xa564756d6d79", "value":"dummy"}
                }
            ]
        }
    ]
}
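To make the difference between the two dumps concrete, here is a minimal C sketch that classifies a single entry by its leading MessagePack bytes. It is not fluent-bit code: `entry_layout`, `classify_entry`, `decode_event_time` and the helper names are hypothetical, and it only recognizes an event_time encoded as fixext 8 (0xd7) with ext type 0, as in both dumps above. Integer timestamps and the ext 8 encoding are reported as unknown. The gdb output in the report shows `\221` and `\335` right after the tag, which are 0x91 (a fixarray of one entry) and 0xdd (array 32), so the entry there starts the same way as the master-branch dump.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for this sketch; these are not fluent-bit identifiers. */
enum entry_layout {
    ENTRY_LEGACY,     /* [event_time, record] */
    ENTRY_WITH_META,  /* [[event_time, metadata], record] */
    ENTRY_UNKNOWN     /* anything else, including integer timestamps */
};

/* Read a big-endian 32-bit integer. */
static uint32_t be32(const uint8_t *p)
{
    return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
           ((uint32_t) p[2] << 8) | (uint32_t) p[3];
}

/* Parse a msgpack array header. Returns bytes consumed, or 0 if not an array. */
static size_t array_header(const uint8_t *p, size_t len, uint32_t *count)
{
    if (len >= 1 && (p[0] & 0xf0) == 0x90) {   /* fixarray: 0x90..0x9f */
        *count = p[0] & 0x0f;
        return 1;
    }
    if (len >= 3 && p[0] == 0xdc) {            /* array 16 */
        *count = ((uint32_t) p[1] << 8) | p[2];
        return 3;
    }
    if (len >= 5 && p[0] == 0xdd) {            /* array 32 */
        *count = be32(p + 1);
        return 5;
    }
    return 0;
}

/* Decode an event_time stored as fixext 8 (0xd7) with ext type 0.
 * Returns 1 on success and 0 if the bytes do not match that encoding. */
int decode_event_time(const uint8_t *p, size_t len, uint32_t *sec, uint32_t *nsec)
{
    assert(p != NULL && sec != NULL && nsec != NULL);
    if (len < 10 || p[0] != 0xd7 || p[1] != 0x00) {
        return 0;
    }
    *sec = be32(p + 2);    /* seconds since the epoch */
    *nsec = be32(p + 6);   /* nanoseconds */
    return 1;
}

/* Decide which of the two layouts a single entry uses. */
enum entry_layout classify_entry(const uint8_t *p, size_t len)
{
    uint32_t n, sec, nsec;
    size_t off;

    assert(p != NULL);
    off = array_header(p, len, &n);
    if (off == 0 || n != 2) {
        return ENTRY_UNKNOWN;
    }
    p += off;
    len -= off;

    /* Legacy layout: the first element is the event_time itself. */
    if (decode_event_time(p, len, &sec, &nsec)) {
        return ENTRY_LEGACY;
    }

    /* New layout: the first element is a 2-element array [event_time, metadata]. */
    off = array_header(p, len, &n);
    if (off != 0 && n == 2 && decode_event_time(p + off, len - off, &sec, &nsec)) {
        return ENTRY_WITH_META;
    }
    return ENTRY_UNKNOWN;
}
```

Fed the raw bytes of the v2.0.7 dump, `classify_entry` returns `ENTRY_LEGACY`, and `decode_event_time` on the bytes after the 0x92 header yields 1682131167 seconds and 580199391 nanoseconds, which matches the timestamp shown. Fed the master dump, it returns `ENTRY_WITH_META`. A receiver that only handles the legacy layout sees an array where it expects a timestamp.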

@nokute78
Collaborator

If we sent a new format which contains metadata using forward protocol,
we need to

@leonardo-albertovich
Collaborator

It seems that I missed that in out_forward when operating in forward mode. When operating in message mode, the metadata is sent in the optional options map. When operating in compat mode, the metadata is omitted, but the new format is still used for the array entries.

I think we'll be able to ship a fix very soon.

@leonardo-albertovich
Collaborator

Turns out that I did include the flag and part of the necessary code but I made a typo in the conditional and also missed another part of the code that needed to be patched.

Once PR 7249 is merged the default behavior when operating in forward mode will be backwards compatible (metadata will be dropped).

I will write an update once the PR is verified and merged.

@leonardo-albertovich
Collaborator

@tugtugtug I just tried to reproduce the crash using the master branch and the head of the 1.9 branch but couldn't. fluent-bit 1.9 just prints:

[2023/04/24 17:21:55] [ warn] unknown time format 6
[2023/04/24 17:21:55] [ warn] unknown time format 6
[2023/04/24 17:21:55] [ warn] unknown time format 6
[2023/04/24 17:21:55] [ warn] unknown time format 6
[2023/04/24 17:21:55] [ warn] unknown time format 6
[2023/04/24 17:21:55] [ warn] unknown time format 6
[2023/04/24 17:21:55] [ warn] unknown time format 6
[2023/04/24 17:21:55] [ warn] unknown time format 6
[2023/04/24 17:21:55] [ warn] unknown time format 6

Which makes sense in a way. Would you be able to give me a hand with the reproduction? I think we should be able to capture the traffic into a pcap that I could replay locally, and then I'm sure we'd be able to fix it really quickly.

The patch that fixes the default behavior is on its way but I'd really appreciate it if you could help me reproduce this so we can patch 1.9 if needed.

@tugtugtug
Author

@leonardo-albertovich thank you for looking into this issue so quickly.
Unfortunately, we no longer have the environment to reproduce it, as we had to revert our changes. We may come back to this later and be able to provide extra details.
However, when I debugged the issue, it looked like the flb_warn call itself crashes because it attempts to format an enum as a string. I guess this may depend on how tolerant the libc implementation is. We use the AWS fluent-bit image, which may have a hardened libc.

gdb shows the tag as:

tag = {type = mpack_type_array, exttype = 0 '\000', v = {u = 2, i = 2, d = 9.8813129168249309e-324, f = 2.80259693e-45, b = 2, l = 2, n = 2}}

and this is the call that formats tag.type as a string:

flb_warn("unknown time format %s", tag.type);
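For illustration, here is a minimal C sketch of why that call can fault and what a matching format string looks like. With `%s`, the printf family reads the argument as a `char *` and walks it looking for the terminator, which fits frame #0 of the stack trace (`__strlen_sse2`). Passing an integer there is undefined behavior, so it may crash on one libc and print something on another. The helper name `format_unknown_time` is hypothetical and `snprintf` stands in for fluent-bit's logger. This is not the actual patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper for this sketch: build the warning text with %d,
 * which matches an int argument. The enum value is passed as a plain int. */
int format_unknown_time(char *buf, size_t size, int type)
{
    assert(buf != NULL && size > 0);
    /* "%s" here would make snprintf treat 'type' as a pointer to a string. */
    return snprintf(buf, size, "unknown time format %d", type);
}
```

Calling `format_unknown_time(buf, sizeof buf, (int) tag.type)` writes the numeric type into `buf` instead of dereferencing it.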

@leonardo-albertovich
Collaborator

Ok, that's much simpler than I expected. If I'm correct, in_forward did not cause the crash itself (which is what I expected). Instead, it let the malformed record pass through, and then the lua filter tried to unpack it and caused this access violation.

The good news is that this bug poses no security risk because it's treating the value of tag.type as a pointer when it's actually an integer with a maximum value of 9 (further analysis is required to determine if there could be a reliable way to use this as an info leak or not but I don't think that will be the case).

Additionally, this particular bug is in the (now legacy) mpack timestamp decoder function, which was used by the lua filter. That means it's not reachable in fluent-bit 2.1, because we deprecated that code.

Regardless, we will patch it in the currently maintained versions and let the folks at AWS know of this so they can act appropriately.

Thank you very much for reporting this issue, for providing these very important bits of insight and for the patience.

@leonardo-albertovich
Collaborator

We have already fixed the issue with PRs #7261 and #7262 and we have notified the folks from AWS about it.
Just in case I will tag @PettitWesley here to ensure there is nothing missing.

Please let me know if there is anything that needs to be done from our end.

@tugtugtug
Author

Thank you @leonardo-albertovich, again really appreciate the quick turnaround. I think this resolves my issue.
