Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIXED] microservice cleanup (flapping MicroServiceStops... tests) #816

Merged
merged 14 commits into from
Nov 5, 2024

Conversation

levb
Copy link
Collaborator

@levb levb commented Nov 1, 2024

  • Removed the global hash of services, it was the most immediate source of the deadlock.
  • Added the list of microservices to natsConnection. NOTE increased memory allocation for all connections, need a better way.
  • Fixed the flow of replacing an existing endpoint in a service.
  • Got rid of ep->name since it was synonimous with ep->cfg->Name
  • Adjusted the tests, some minor fixes.

- Removed the global hash of services, it was the most immediate source of the deadlock.
- Added the list of microservices to `natsConnection`. **NOTE** increased memory allocation for all connections, need a better way.
- Fixed the flow of replacing an existing endpoint in a service.
- Got rid of ep->name since it was synonimous with ep->cfg->Name
- Adjusted the tests, some minor fixes.
Copy link

codecov bot commented Nov 1, 2024

Codecov Report

Attention: Patch coverage is 53.33333% with 98 lines in your changes missing coverage. Please review.

Project coverage is 70.50%. Comparing base (1553d4a) to head (a15d5a8).
Report is 24 commits behind head on main.

Files with missing lines Patch % Lines
src/micro.c 54.54% 41 Missing and 49 partials ⚠️
src/micro_endpoint.c 14.28% 3 Missing and 3 partials ⚠️
src/conn.c 75.00% 0 Missing and 1 partial ⚠️
src/micro_monitoring.c 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #816      +/-   ##
==========================================
+ Coverage   68.71%   70.50%   +1.79%     
==========================================
  Files          39       47       +8     
  Lines       15207    15366     +159     
  Branches     3143     3177      +34     
==========================================
+ Hits        10449    10834     +385     
+ Misses       1700     1460     -240     
- Partials     3058     3072      +14     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

src/natsp.h Show resolved Hide resolved
test/test.c Show resolved Hide resolved
@levb levb changed the title [FIXED] microservice cleanup (flapping MicroServiceStops... tests) WIP: [FIXED] microservice cleanup (flapping MicroServiceStops... tests) Nov 1, 2024
@levb levb changed the title WIP: [FIXED] microservice cleanup (flapping MicroServiceStops... tests) [FIXED] microservice cleanup (flapping MicroServiceStops... tests) Nov 4, 2024
@levb levb requested a review from kozlovic November 4, 2024 00:46
@levb levb marked this pull request as ready for review November 4, 2024 00:46
src/micro_endpoint.c Outdated Show resolved Hide resolved
src/micro.c Show resolved Hide resolved
src/micro.c Outdated Show resolved Hide resolved
src/micro.c Outdated
// Wrap the connection callbacks before we subscribe to anything.
MICRO_CALL(err, _wrap_connection_event_callbacks(m));
MICRO_CALL(err, micro_ErrorFromStatus(
natsOptions_setMicroCallbacks(m->nc->opts, _on_connection_closed, _on_error)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So for instance even if the nc connection is valid at the time we enter micro_AddService, but happen to be closed during execution of this function, it will leave things in a bad state. If service is part of a connection, I think it should be handled by the connection, even if micro_X calls are just a proxy to internal natsConn_X calls. Have you thought that way?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did, in fact I was VERY inclined to make them "1st class citizens" in natsConnection, but not for this PR... This all really came out from trying to fix a single race condition on that global services table I had before, wrong locking order in 1 place.

OT: In light of the orbit movement I've been contemplating adding simple "internal use" extensibility to both Connection and Subscription; for data (like a hashmap) and callbacks (say, with a bool lock parameter). My initial microservices implementation was naive, it is still marked experimental, and I would really love to externalize and re-factor it to orbit before it is final.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved all "attach to connection" functionality into the func, and added a natsConn_Lock around + connection validation. Thanks for pointing it out. After the service is added to connection, it should unwind "normally" if it's terminated.

src/micro.c Outdated Show resolved Hide resolved
src/micro.c Outdated Show resolved Hide resolved
src/micro.c Outdated Show resolved Hide resolved
src/micro.c Show resolved Hide resolved
numEndpoints = m->numEndpoints;
_unlock_service(m);

if ((refs == 0) && (numEndpoints == 0))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: would it be symptomatic of a bug if we have refs == 0 and numEndpoints != 0?

Copy link
Collaborator Author

@levb levb Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because of the order of execution. As I understand it (and can prove, I think) On(Connection)Closed may be invoked in the midst of Sub's threads calling the On(Sub)Complete from their own threads. greyhair ++

Similar for error callbacks, maybe it was just one of the 2, not 100% sure now.

src/micro.c Outdated Show resolved Hide resolved
@levb levb requested a review from kozlovic November 5, 2024 15:45
src/micro_endpoint.c Outdated Show resolved Hide resolved
src/micro.c Outdated
}
natsMutex_Unlock(nc->servicesMu);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lock was acquired only under if (s == NATS_OK) after setting the callbacks, but here you unlock unconditionally. You should the if statement above (where you bump service's ref count) under the if statement where you acquire the service lock. This unlock would fall under that previous if statement too.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(hit the wrong button before) thanks for catching this!

Copy link
Member

@kozlovic kozlovic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@levb levb merged commit 6c7b2cc into main Nov 5, 2024
30 checks passed
@levb levb deleted the lev-micro-stops-when-server-stops branch November 5, 2024 18:02
github-actions bot pushed a commit that referenced this pull request Nov 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants