Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

camerad: BPS config failure #34663

Open
adeebshihadeh opened this issue Feb 22, 2025 · 4 comments · May be fixed by #34665
Open

camerad: BPS config failure #34663

adeebshihadeh opened this issue Feb 22, 2025 · 4 comments · May be fixed by #34665
Labels
Milestone

Comments

@adeebshihadeh
Copy link
Contributor

adeebshihadeh commented Feb 22, 2025

Just happened on master

https://commaai.sentry.io/issues/6012903112/?query=user.id%3A21d8936b8bd24158&referrer=issue-stream&statsPeriod=14d&stream_index=0

--- stack trace ---
#0  __pthread_kill_implementation (threadid=547762122784, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
        tid = 63600
        ret = 0
        pd = 0x7f892ab020
        old_mask = {__val = {366960958643}}
        ret = <optimized out>
#1  0x0000007f88c47690 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
No locals.
#2  0x0000007f88bfcb3c in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
        ret = <optimized out>
#3  0x0000007f88be7e00 in __GI_abort () at ./stdlib/abort.c:79
        save_stage = 1
        act = {__sigaction_handler = {sa_handler = 0x20, sa_sigaction = 0x20}, sa_mask = {__val = {17055145657313161216, 549103836288, 547755088920, 4207595095826719075, 3417499243072025376, 3414961561409511779, 3419183686060171619, 549103836160, 547755487144, 367848122128, 547762124640, 547756664560, 96, 12, 13, 547756643392}}, sa_flags = -1999179240, sa_restorer = 0x7fd9239c80}
#4  0x0000007f88bf5cbc in __assert_fail_base (fmt=0x7f88d1c6d0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x557093d4b4 "ret == 0", file=file@entry=0x557093d835 "system/camerad/cameras/spectra.cc", line=line@entry=702, function=function@entry=0x557093dcb3 "void SpectraCamera::config_bps(int, int)") at ./assert/assert.c:94
        str = 0x55a574e720 "NWZ\005"
        total = 4096
#5  0x0000007f88bf5d2c in __assert_fail (assertion=0x557093d4b4 "ret == 0", file=0x557093d835 "system/camerad/cameras/spectra.cc", line=702, function=0x557093dcb3 "void SpectraCamera::config_bps(int, int)") at ./assert/assert.c:103
No locals.
#6  0x0000005570919b00 in SpectraCamera::config_bps (this=this@entry=0x55a5746fb0, idx=idx@entry=6, request_id=request_id@entry=25) at system/camerad/cameras/spectra.cc:702
        cam_packet_handle = 7864423
        pkt = std::unique_ptr<cam_packet> = {get() = 0x7f89275000}
        size = 768
        buf_desc = 0x7f89275038
        patches = std::vector of length 0, capacity 0
        io_cfg = <optimized out>
        ret = <optimized out>
#7  0x0000005570918ba4 in SpectraCamera::enqueue_buffer (this=this@entry=0x55a5746fb0, i=6, request_id=request_id@entry=25) at system/camerad/cameras/spectra.cc:979
        frame_ready = <optimized out>
        req_mgr_sched_request = {session_hdl = 8258058, link_hdl = 5243151, bubble_enable = 0, sync_mode = 0, req_id = 25}
        sync_create = {name = {78 'N', 111 'o', 100 'd', 101 'e', 79 'O', 117 'u', 116 't', 112 'p', 117 'u', 116 't', 80 'P', 111 'o', 114 'r', 116 't', 70 'F', 101 'e', 110 'n', 99 'c', 101 'e', 0 '\000', <optimized out> <repeats 44 times>}, sync_obj = <optimized out>}
        ret = <optimized out>
#8  0x000000557091a6c0 in SpectraCamera::enqueue_req_multi (this=0x55a5746fb0, start=<optimized out>, n=<optimized out>) at system/camerad/cameras/spectra.cc:312
        request_id = <optimized out>
        request_id = <optimized out>
        idx = <optimized out>
#9  SpectraCamera::handle_camera_event (this=0x55a5746fb0, event_data=0x7fd9239f28) at system/camerad/cameras/spectra.cc:1390
        buf_idx = <optimized out>
        timestamp = <optimized out>
        request_id = 19
        frame_id_raw = 9
#10 0x0000005570912c98 in camerad_thread () at system/camerad/cameras/camera_qcom2.cc:309
        cam = std::unique_ptr<CameraState> = {get() = 0x55a5746fb0}
        __range4 = std::vector of length 3, capacity 4 = {std::unique_ptr<CameraState> = {get() = 0x55a5734580}, std::unique_ptr<CameraState> = {get() = 0x55a573fbe0}, std::unique_ptr<CameraState> = {get() = 0x55a5746fb0}}
        __begin4 = std::unique_ptr<CameraState> = {get() = 0x55a5746fb0}
        __end4 = <optimized out>
        event_data = <optimized out>
        fds = {{fd = 13, events = 2, revents = 2}}
        ev = {type = 134217728, u = {vsync = {field = 10 '\n'}, ctrl = {changes = 8258058, type = 4294967274, {value = 19, value64 = 19}, flags = 9, minimum = 0, maximum = -245408007, step = 695, default_value = 5243151}, frame_sync = {frame_sequence = 8258058}, src_change = {changes = 8258058}, motion_det = {flags = 8258058, frame_sequence = 4294967274, region_mask = 19}, data = "\n\002~\000\352\377\377\377\023\000\000\000\000\000\000\000\t\000\000\000\000\000\000\000\371^_\361\267\002\000\000\017\001P\000\000\000\000\000\020\240Cc\230\377\377\377p\234\252c\230\377\377\3770\\\314=\352\377\377\377"}, pending = 0, sequence = 36, timestamp = {tv_sec = 2989, tv_nsec = 51845530}, id = 2, reserved = {0, 0, 0, 0, 0, 0, 0, 0}}
        ret = <optimized out>
        device_id = 0x7f8825a508 <glbl_a6x_ddl_table>
        props = {16585, 16586, 0}
        v = {device_id = 0x7f8825a508 <glbl_a6x_ddl_table>, ctx = 0x55a5712430, server_id = 7119150156901953228, should_exit = std::atomic<bool> = { false }, name = "camerad", listener_thread = {_M_id = {_M_thread = 547213414144}}, cur_idx = std::map with 3 elements = {[VISION_STREAM_ROAD] = std::atomic<unsigned long> = { 17 }, [VISION_STREAM_DRIVER] = std::atomic<unsigned long> = { 17 }, [VISION_STREAM_WIDE_ROAD] = std::atomic<unsigned long> = { 17 }}, buffers = std::map with 3 elements = {[VISION_STREAM_ROAD] = std::vector of length 18, capacity 32 = {0x55a573cf40, 0x55a573e6d0, 0x55a573e940, 0x55a573e9d0, 0x55a573f120, 0x55a573f3c0, 0x55a5742890, 0x55a5742b00, 0x55a5742da0, 0x55a57430d0, 0x55a5743370, 0x55a5743610, 0x55a57438b0, 0x55a5743b50, 0x55a5743df0, 0x55a5744090, 0x55a5744330, 0x55a5743040}, [VISION_STREAM_DRIVER] = std::vector of length 18, capacity 32 = {0x55a5744860, 0x55a574b180, 0x55a574b420, 0x55a574b6c0, 0x55a574b960, 0x55a574bc00, 0x55a574bea0, 0x55a574c140, 0x55a574c3e0, 0x55a574c710, 0x55a574c9b0, 0x55a574cc50, 0x55a574cef0, 0x55a574d190, 0x55a574d430, 0x55a574d6d0, 0x55a574d970, 0x55a574c680}, [VISION_STREAM_WIDE_ROAD] = std::vector of length 18, capacity 32 = {0x55a5733120, 0x55a573a200, 0x55a573a470, 0x55a573a6e0, 0x55a573a980, 0x55a573ac20, 0x55a573aec0, 0x55a573b160, 0x55a573b400, 0x55a573b730, 0x55a573b9d0, 0x55a573bc70, 0x55a573bf10, 0x55a573c1b0, 0x55a573c450, 0x55a573c6f0, 0x55a573c990, 0x55a573b6a0}}, msg_ctx = 0x55a57290d0, sockets = std::map with 3 elements = {[VISION_STREAM_ROAD] = 0x55a5732e00, [VISION_STREAM_DRIVER] = 0x55a5738bb0, [VISION_STREAM_WIDE_ROAD] = 0x55a5729070}}
        m = {video0_fd = {fd_ = 13}, cam_sync_fd = {fd_ = 14}, isp_fd = {fd_ = 15}, icp_fd = {fd_ = 16}, device_iommu = 4905, cdm_iommu = 248091, icp_device_iommu = 181244, mem_mgr = {handle_lookup = std::map with 12 elements = {[0x7f89273000] = 7995497, [0x7f89274000] = 7929960, [0x7f89275000] = 7864423, [0x7f892a6000] = 3276832, [0x7f892a7000] = 3211295, [0x7f892a8000] = 1835019, [0x7f89351000] = 1769482, [0x7f89384000] = 1441797, [0x7f89386000] = 1376260, [0x7f89387000] = 1310723, [0x7f89388000] = 1245186, [0x7f89389000] = 1179649}, size_lookup = std::map with 12 elements = {[0x7f89273000] = 16, [0x7f89274000] = 36, [0x7f89275000] = 768, [0x7f892a6000] = 528, [0x7f892a7000] = 64, [0x7f892a8000] = 324, [0x7f89351000] = 272, [0x7f89384000] = 7520, [0x7f89386000] = 88, [0x7f89387000] = 196, [0x7f89388000] = 24, [0x7f89389000] = 112}, cached_allocations = std::map with 12 elements = {[16] = std::queue wrapping: std::deque with 1 element = {0x7f89273000}, [24] = std::queue wrapping: std::deque with 1 element = {0x7f89388000}, [36] = std::queue wrapping: std::deque with 1 element = {0x7f89274000}, [64] = std::queue wrapping: std::deque with 1 element = {0x7f892a7000}, [88] = std::queue wrapping: std::deque with 1 element = {0x7f89386000}, [112] = std::queue wrapping: std::deque with 1 element = {0x7f89389000}, [196] = std::queue wrapping: std::deque with 1 element = {0x7f89387000}, [272] = std::queue wrapping: std::deque with 1 element = {0x7f89351000}, [324] = std::queue wrapping: std::deque with 1 element = {0x7f892a8000}, [528] = std::queue wrapping: std::deque with 1 element = {0x7f892a6000}, [768] = std::queue wrapping: std::deque with 0 elements, [7520] = std::queue wrapping: std::deque with 1 element = {0x7f89384000}}, video0_fd = 13}}
        cams = std::vector of length 3, capacity 4 = {std::unique_ptr<CameraState> = {get() = 0x55a5734580}, std::unique_ptr<Came

camera cloudlogs here: cam_log.txt

@adeebshihadeh adeebshihadeh added this to the 0.9.8 milestone Feb 22, 2025
@adeebshihadeh
Copy link
Contributor Author

adeebshihadeh commented Feb 22, 2025

@deanlee want to check this out? This is on latest master after all our camerad work.

@deanlee
Copy link
Contributor

deanlee commented Feb 22, 2025

It seems the crash occurred after the camera driver realigned and dropped requests:

[1.953526] system/camerad/cameras/spectra.cc:1379 - handle_camera_event: camera 2 realign
[1.953625] system/camerad/cameras/spectra.cc:1379 - handle_camera_event: camera 2 realign
[1.953763] system/camerad/cameras/spectra.cc:257 - clear_req_queue: flushed all req: 100
[1.954976] system/camerad/cameras/spectra.cc:1389 - handle_camera_event: camera 2 dropped requests 19 2
[1.955117] system/camerad/cameras/spectra.cc:1389 - handle_camera_event: camera 2 dropped requests 19 2

We should avoid requeuing requests if the camera has been realigned. In this case, the request_id is 19, and request_id_last is 2, creating a significant gap.

@deanlee
Copy link
Contributor

deanlee commented Feb 22, 2025

The sequence of events leading to the crash is as follows:

  1. A timeout in CAM_SYNC_WAIT triggers enqueue_buffer to call clear_req_queue().
  2. This call to clear_req_queue() causes handle_camera_event to detect a camera realign and invoke clear_req_queue() again.
  3. The check for dropped requests returns true and calls enqueue_req_multi(), which overwrites the buf_idx of the current request_id, creating an invalid sync object.
  4. In subsequent enqueue_buffer calls, the function waits on this invalid sync object, causing another timeout and triggering another call to clear_req_queue().
  5. The system then logs the error: "cam_context_prepare_dev_to_hw: 307 [cam-icp][1] No more request obj free."
  6. This leads to an assertion failure and a crash.

This issue occurs because re-enqueuing requests to handle dropped requests after a camera re-alignment overwrites buf_idx, request_id, and sync objects. This can lead to invalid states, such as waiting on obsolete sync objects. PR #34665 resolves this by properly managing re-enqueued requests.

Additionally, we should review the error handling for CAM_SYNC_WAIT timeouts. Calling clear_req_queue() immediately after a timeout is overly aggressive, potentially disrupting valid requests. A more cautious and reliable approach, such as retrying or selectively flushing failed requests, should be explored.

@deanlee
Copy link
Contributor

deanlee commented Feb 22, 2025

[2.286699] MAIN 0 kernel - CAM_ERR: CAM-CTXT: cam_context_prepare_dev_to_hw: 307 [cam-icp][1] No more request obj free
[2.286758] MAIN 0 kernel - CAM_ERR: CAM-ICP: __cam_icp_config_dev_in_ready: 129 Failed to prepare device
[2.286815] MAIN 0 kernel - CAM_ERR: CAM-CORE: __cam_node_handle_config_dev: 243 Config failure for node cam-icp
[2.286878] MAIN 0 kernel - CAM_ERR: CAM-CORE: cam_node_handle_ioctl: 613 config device failed(rc = -12)

config device failed(rc = -12) rc = -12 corresponds to ENOMEM (Out of memory)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants