Reference counting changes #1951

gdamore · 2024-11-29T18:12:10Z

This converts the main part of NNG to use atomic reference counts instead of the hacky lock-based approaches it used before. It should be both safer and faster!
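For readers unfamiliar with the pattern, here is a minimal sketch of an atomic reference count in C11. The names (`refcnt`, `refcnt_hold`, `refcnt_rele`, `refcnt_demo`) are hypothetical and are not taken from the NNG sources. It assumes the common convention that taking a reference can be relaxed (the caller already holds one) while dropping one needs acquire/release ordering so the finalizer sees all earlier writes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

// Hypothetical type for illustration; NNG's internal API may differ.
typedef struct refcnt {
    atomic_uint count;
    void (*fini)(void *); // runs once, when the last reference is dropped
    void *data;
} refcnt;

static void
refcnt_init(refcnt *rc, unsigned value, void *data, void (*fini)(void *))
{
    atomic_init(&rc->count, value);
    rc->fini = fini;
    rc->data = data;
}

// The caller must already hold a reference, so no ordering is required.
static void
refcnt_hold(refcnt *rc)
{
    atomic_fetch_add_explicit(&rc->count, 1, memory_order_relaxed);
}

// Acquire/release makes prior writes to the object visible to whichever
// thread ends up running the finalizer. Returns true if it ran here.
static bool
refcnt_rele(refcnt *rc)
{
    unsigned old =
        atomic_fetch_sub_explicit(&rc->count, 1, memory_order_acq_rel);
    assert(old > 0);
    if (old == 1) {
        rc->fini(rc->data);
        return true;
    }
    return false;
}

// Example finalizer: counts how many times it was invoked.
static void
count_fini(void *arg)
{
    (*(int *) arg)++;
}

// Walks one object through hold/release and reports finalizer calls.
int
refcnt_demo(void)
{
    int    fini_calls = 0;
    refcnt rc;
    refcnt_init(&rc, 1, &fini_calls, count_fini);
    refcnt_hold(&rc);         // count is now 2
    (void) refcnt_rele(&rc);  // count is now 1
    (void) refcnt_rele(&rc);  // count reaches 0, finalizer runs
    return fini_calls;
}
```

No mutex is taken on either path, which is where the expected speedup over the lock-based approach would come from.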

codecov · 2024-11-29T18:14:41Z

Codecov Report

Attention: Patch coverage is 83.24873% with 66 lines in your changes missing coverage. Please review.

Project coverage is 68.20%. Comparing base (9c0b9d6) to head (e5f6632).
Report is 119 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/core/pipe.c | 74.24% | 13 Missing and 4 partials ⚠️ |
| src/nng.c | 65.62% | 6 Missing and 5 partials ⚠️ |
| src/core/aio.c | 68.75% | 5 Missing and 5 partials ⚠️ |
| src/sp/transport/socket/sockfd.c | 71.87% | 2 Missing and 7 partials ⚠️ |
| src/core/socket.c | 90.27% | 2 Missing and 5 partials ⚠️ |
| src/core/dialer.c | 89.65% | 0 Missing and 3 partials ⚠️ |
| src/supplemental/websocket/websocket.c | 93.18% | 1 Missing and 2 partials ⚠️ |
| src/core/listener.c | 92.85% | 0 Missing and 2 partials ⚠️ |
| src/platform/posix/posix_tcpdial.c | 75.00% | 0 Missing and 1 partial ⚠️ |
| src/sp/protocol/pubsub0/sub.c | 0.00% | 0 Missing and 1 partial ⚠️ |

... and 2 more

❗ There is a different number of reports uploaded between BASE (9c0b9d6) and HEAD (e5f6632).

HEAD has 3 uploads less than BASE

| Flag | BASE (9c0b9d6) | HEAD (e5f6632) |
|---|---|---|
| | 4 | 1 |

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #1951       +/-   ##
===========================================
- Coverage   81.92%   68.20%   -13.72%     
===========================================
  Files          95       93        -2     
  Lines       24066    20454     -3612     
  Branches     3206     3047      -159     
===========================================
- Hits        19715    13950     -5765     
+ Misses       4280     3965      -315     
- Partials       71     2539     +2468


Operations that might be performed during teardown, such as reaping, waiting, closing, freeing, should only be done if the aio has properly been initialized. This is important for certain simple cases where inline aio objects are used, and initialization of an outer object can fail before the enclosed aio is initialized.
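A minimal sketch of that guard, using a hypothetical `my_aio` stand-in rather than NNG's real `nni_aio`: the outer object is zero-allocated, so an aio that was never initialized has its `init` flag clear and every teardown step becomes a no-op. The field and function names here are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdlib.h>

// Hypothetical stand-in for an aio; not NNG's actual layout.
typedef struct my_aio {
    bool init;  // set only by my_aio_init
    int  stops; // counts stop operations that actually ran
} my_aio;

// Outer object with the aio embedded inline.
typedef struct outer {
    int    id;
    my_aio aio;
} outer;

void
my_aio_init(my_aio *a)
{
    a->stops = 0;
    a->init  = true;
}

// Teardown steps do nothing unless the aio was initialized.
void
my_aio_stop(my_aio *a)
{
    if (!a->init) {
        return;
    }
    a->stops++;
}

void
my_aio_fini(my_aio *a)
{
    if (!a->init) {
        return;
    }
    a->init = false;
}

// Simulates outer-object setup that may fail before the aio is set up,
// then runs the same teardown path in both cases. Returns how many
// stop operations really executed, or -1 on allocation failure.
int
teardown_demo(bool fail_early)
{
    outer *o = calloc(1, sizeof(*o)); // zeroed, so o->aio.init is false
    int    stops;
    if (o == NULL) {
        return -1;
    }
    if (!fail_early) {
        my_aio_init(&o->aio);
    }
    my_aio_stop(&o->aio);
    stops = o->aio.stops;
    my_aio_fini(&o->aio);
    free(o);
    return stops;
}
```

The point of the design is that error paths can call one common teardown routine without tracking how far initialization got.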

Once a context has started closing, further attempts to close it will return NNG_ECLOSED. What was I thinking to ever do anything else?
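One way to get that behavior, sketched with hypothetical names (`my_ctx`, `my_ctx_close`) and placeholder error values rather than NNG's real ones: an atomic flag is swapped to true on close, and only the caller that saw the old value `false` proceeds with teardown.

```c
#include <stdatomic.h>
#include <stdbool.h>

// Placeholder error codes for this sketch; not NNG's numeric values.
enum { MY_OK = 0, MY_ECLOSED = 1 };

// Hypothetical context type.
typedef struct my_ctx {
    atomic_bool closed;
} my_ctx;

void
my_ctx_init(my_ctx *c)
{
    atomic_init(&c->closed, false);
}

int
my_ctx_close(my_ctx *c)
{
    // atomic_exchange returns the previous value, so exactly one
    // caller observes false and becomes responsible for teardown.
    if (atomic_exchange(&c->closed, true)) {
        return MY_ECLOSED;
    }
    // ... teardown would start here ...
    return MY_OK;
}
```

Because the flag is never reset, every later call takes the early return, including calls racing from other threads.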

For now this uses plain reference counters, which should be simpler and hopefully more reliable.

This is a major change, but it should eliminate some of the problems we have seen with use-after-free bugs in shutdown. It should also be faster as we don't need to use locks as much.

This updates the pipe to use contiguous data for the transport data as well as the pipe protocol data. It updates sockfd to use this, and eliminates the need for the sockfd transport to do its own asynchronous reaping, thereby hopefully closing a shutdown race. The other transports will shortly get the same treatment. Also fixed valgrind complaint about uninitialized data in the socket test.
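The contiguous layout can be sketched as a single allocation that holds the pipe header followed by the transport and protocol regions. Everything below (`my_pipe`, `my_pipe_alloc`, the field names) is hypothetical and only illustrates the idea; the actual structure in src/core/pipe.c may be organized differently.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

// Hypothetical pipe header; the two pointers refer into the same block.
typedef struct my_pipe {
    uint32_t id;
    void    *tran_data;  // transport-private region
    void    *proto_data; // protocol-private region
} my_pipe;

// Round n up so the next region is suitably aligned for any type.
static size_t
align_up(size_t n)
{
    size_t a = alignof(max_align_t);
    return ((n + a - 1) / a) * a;
}

// One zeroed allocation: [ my_pipe | tran_size bytes | proto_size bytes ]
my_pipe *
my_pipe_alloc(size_t tran_size, size_t proto_size)
{
    size_t   tran_off  = align_up(sizeof(my_pipe));
    size_t   proto_off = tran_off + align_up(tran_size);
    my_pipe *p         = calloc(1, proto_off + proto_size);
    if (p == NULL) {
        return NULL;
    }
    p->tran_data  = (char *) p + tran_off;
    p->proto_data = (char *) p + proto_off;
    return p;
}

// A single free releases the header and both private regions together.
void
my_pipe_free(my_pipe *p)
{
    free(p);
}
```

With one allocation there is one lifetime to manage, so a transport has no separately owned object that it must reap on its own.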

This avoids certain kinds of challenging deadlocks during finalization, but it does require users of the optimized nni_aio_init function to explicitly call nni_aio_stop before calling nni_aio_fini. As a minor benefit, this should reduce the number of mutex entry/exit blocks for very short-lived objects (such as rapidly recycling contexts).

If an error occurs, the application gets to know about it. There cannot be external factors that cause us to spin for memory, since this is not accessible via the network.

We should probably come back and make this more explicit with a separate endpoint stop() function, which can be blocking and call nni_aio_stop. For now this gets us over the hump.

The attempt to use nni_task_abort() was completely misguided. In fact this function isn't needed, and is a relic of a design that predates the nni_aio_begin / nni_aio_schedule split. Additionally, nni_aio_abort needed a fix to prevent a hang if it was called between the calls to nni_aio_prep and nni_aio_schedule. (Essentially a canceled operation should fail in scheduling.)
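The rule in the parenthetical, that an operation canceled before it is scheduled must fail at scheduling time, can be illustrated with a small single-threaded state machine. The names (`my_op`, `my_op_prep`, `my_op_abort`, `my_op_schedule`) and the error value are hypothetical, and the locking that real code would need is omitted.

```c
#include <stdbool.h>

// Placeholder error code for this sketch.
enum { MY_ECANCELED = 20 };

// Hypothetical operation state; real code would guard this with a lock.
typedef struct my_op {
    bool aborted;   // abort arrived before scheduling
    bool scheduled; // operation is pending and can be canceled in place
    int  abort_err; // error remembered from an early abort
    int  result;    // final result once the operation completes
} my_op;

void
my_op_prep(my_op *op)
{
    op->aborted   = false;
    op->scheduled = false;
    op->abort_err = 0;
    op->result    = 0;
}

void
my_op_abort(my_op *op, int err)
{
    if (op->scheduled) {
        // Already pending: complete it with the error right away.
        op->scheduled = false;
        op->result    = err;
        return;
    }
    // Not scheduled yet: remember the abort so scheduling can fail.
    op->aborted   = true;
    op->abort_err = err;
}

// Returns false if the operation was canceled before being scheduled.
bool
my_op_schedule(my_op *op)
{
    if (op->aborted) {
        op->result = op->abort_err;
        return false;
    }
    op->scheduled = true;
    return true;
}
```

If the early abort were dropped instead of remembered, the operation would be scheduled with nothing left to cancel it, which is the kind of hang described above.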

Also includes a few fixes for the sockfd transport.

gdamore · 2024-12-27T02:37:07Z

Needs to be redone. Later.

gdamore force-pushed the refcount branch 10 times, most recently from a48f777 to 1afe044 on December 7, 2024 16:15

gdamore force-pushed the refcount branch 2 times, most recently from e3a3059 to 58c9028 on December 7, 2024 21:38

gdamore added 17 commits December 7, 2024 14:00

tests: ipc test valgrind fix for uninitialized stack data

c109115

ctx: Simplify handling for closed contexts.

9d1f671

Once a context has started closing, further attempts to close it will return NNG_ECLOSED. What was I thinking to ever do anything else?

Dialer and listener reference count refactor.

f3a76d6

For now this uses plain reference counters, which should be simpler and hopefully more reliable.

performance: reference counters can use relaxed order when incrementing

78de5eb

Fix failure reference count leak for dialer/listener

f46f16a

Eliminate s_cv unused condvar.

f46f49b

Context reference counts

b5d9329

socket: convert to using reference counts for shutdown

3ccd057

This is a major change, but it should eliminate some of the problems we have seen with use-after-free bugs in shutdown. It should also be faster as we don't need to use locks as much.

nng_fini: Simpler clean up of sockets, add a test.

f9910a8

listener: fix leaking listener after nng_listener_close

27fdb7f

websocket transport: inline the aios

019f20b

websocket: more aio inlining (generic websocket layer)

d3ed1bb

socket transport: No need for a cool down for this transport.

48371f3

If an error occurs, the application gets to know about it. There cannot be external factors that cause us to spin for memory, since this is not accessible via the network.

sockfd: constify ops vectors

66e207e

gdamore added 6 commits December 7, 2024 14:00

pair1: stop the pipe sooner

d584373

pipe: simplify because we always have tran and proto data

c050e2c

dialer/listener/socket: ensure teardown order prevents use-after-free

c1cb69e

We should probably come back and make this more explicit with a separate endpoint stop() function, which can be blocking and call nni_aio_stop. For now this gets us over the hump.

pipes: make separate dialer/listener pipe allocators.

0e81279

Also includes a few fixes for the sockfd transport.

aio: wrong test for nni_aio_abort

e5f6632

gdamore force-pushed the refcount branch from 58c9028 to e5f6632 on December 7, 2024 22:01

gdamore closed this Dec 27, 2024

gdamore deleted the refcount branch December 27, 2024 02:37
