Release all waiting processes when destroying a socket (fixes #15975) #16073

Merged

Conversation

daniels220
Contributor

Send #signalAll to all three semaphores as part of #destroy so any waiting processes are immediately unblocked. Nil the semaphores first to minimize the possibility of a race condition.

Then, improve reporting of this condition by making sure every wait method has an #ifClosed: branch--specifically #waitFor{Connection|Accept}For:, analogous to the #waitForData/#waitForSendDone families. Leave #waitForDisconnectionFor: alone as returning a boolean, since the socket being destroyed indeed means it is disconnected, so the only failure condition is a timeout. Plus some cleanup of the loops to reduce duplicated code.

Last, cleanly report attempts to retrieve the last socket error code when the socket is already destroyed.
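
For concreteness, a minimal sketch of the shape of the #destroy change (illustrative only, not the exact diff; it assumes the three semaphore instance variables are named semaphore, readSemaphore and writeSemaphore, as in the stock Socket, and elides the rest of #destroy):

Socket >> destroy
  | waiters |
  waiters := { semaphore. readSemaphore. writeSemaphore }.
  "Nil the instance variables first to shrink the window in which another
  process could start waiting on a semaphore that is being torn down."
  semaphore := readSemaphore := writeSemaphore := nil.
  "Wake anything still blocked in one of the waitFor... methods."
  waiters do: [ :each | each ifNotNil: [ :sem | sem signalAll ] ].
  "... the existing teardown (primitive destroy, unregister) continues here ..."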

Member

@guillep guillep left a comment

I like the changes.
Maybe @tesonep wants to give a quick look?

Otherwise it seems good to me.

@daniels220 could you give us, for the record, some details about how "hard" you have tested this? Or some scenario you foresee that we need to test a bit more?

@daniels220 daniels220 force-pushed the socket-destroy-release-processes branch from 2189ca6 to 1d7b1e9 on February 3, 2024 16:50
@guillep
Member

guillep commented Feb 3, 2024

There is a seemingly related issue in the CI, any ideas?

Instance of Semaphore did not understand #signalAll

Semaphore(Object)>>doesNotUnderstand: #signalAll
[ :each | each signalAll ] in Socket>>destroy in Block: [ :each | each signalAll ]
Array(SequenceableCollection)>>do:
Socket>>destroy
Socket>>closeAndDestroy:
Socket>>closeAndDestroy
ZdcSecureSocketStream(ZdcAbstractSocketStream)>>socketClose
[ socket ifNotNil: [ self socketClose. socket := nil ]. readBuffer close. writeBuffer close ] in ZdcSecureSocketStream(ZdcAbstractSocketStream)>>close in Block: [...
FullBlockClosure(BlockClosure)>>ensure:
ZdcSecureSocketStream(ZdcAbstractSocketStream)>>close
[ super close ] in ZdcSecureSocketStream>>close in Block: [ super close ]
FullBlockClosure(BlockClosure)>>ensure:
ZdcSecureSocketStream>>close
ZnClient>>close
ZnClient>>validateConnectionForHost:port:
ZnClient>>url:
ZnClient>>prepareRedirect
ZnClient>>executeWithRedirectsRemaining:
[ self executeWithRedirectsRemaining: self maxNumberOfRedirects ] in ZnClient>>executeWithRetriesRemaining: in Block: [ self executeWithRedirectsRemaining: self max[..]
FullBlockClosure(BlockClosure)>>on:do:
ZnClient>>executeWithRetriesRemaining:
[ self executeWithRetriesRemaining: self numberOfRetries ] in [ [ self executeWithRetriesRemaining: self numberOfRetries ] on: Error do: self ifFailBlock ] in ZnClient>>executeWithTimeout in Block: [ self executeWithRetriesRemaining: self numbe[..]
FullBlockClosure(BlockClosure)>>on:do:
[ [ self executeWithRetriesRemaining: self numberOfRetries ] on: Error do: self ifFailBlock ] in ZnClient>>executeWithTimeout in Block: [...
[ ^ block value ] in ZnClient>>withTimeoutDo: in Block: [ ^ block value ]
[ activeProcess psValueAt: index put: anObject. aBlock value ] in ZnConnectionTimeout(DynamicVariable)>>value:during: in Block: [ activeProcess...
FullBlockClosure(BlockClosure)>>ensure:
ZnConnectionTimeout(DynamicVariable)>>value:during:
ZnConnectionTimeout class(DynamicVariable class)>>value:during:
ZnClient>>withTimeoutDo:

@daniels220
Contributor Author

could you give us, for the record, some details about how "hard" you have tested this? Or some scenario you foresee that we need to test a bit more?

To be honest, not as hard as I/we probably should have, but at the same time I can't put my finger on why. There are just so many different ways to use sockets; this was driven by finding that a pattern in the code I work with just doesn't work, and I'm always wary when touching code this central.

(A little background: I work on a large legacy application being ported to Pharo, and it uses a long-lived socket connection, which has processes waiting on it for data to arrive. In other dialects we found that the most reliable way to close the connection and release those processes was to close the socket "out from under them", rather than explicitly terminate them—trying that would sometimes confuse the FFI layer and cause an image crash, where closing the socket would raise an "error" on the C library side which would cleanly propagate and unwind the FFI call. Jump on over to Pharo, and I find that tests are failing because the connection just never tears down properly as far as the test is concerned (it's checking both the socket and the processes, since it's important that they do wind down so they can be started back up again with a new socket later).)

I find it a bit strange TBH that closing the socket doesn't cause the semaphores to be signaled by the SocketPlugin code—surely closing a socket causes an active select() to trigger (or whatever the proper terminology for that is)? But I don't know C well enough to look at that, so I went for the solution I could implement. But that does make me a little uncomfortable, which I think is part of why I feel like yeah, more testing could be good.

Re: the CI failure—I notice that signalAll is an extension from ThreadedFFI, not part of the kernel. I'd have assumed the tests are all run with the image fully built and all packages loaded, but if not, that would explain it—and it may be bad form to introduce a dependency from Network-Kernel to ThreadedFFI if one doesn't already exist.

Now that you point it out more explicitly, the use of signalAll is actually another point of discomfort. In any safe usage, AFAICT, there should be at most one process waiting on each semaphore, because otherwise how would the waiters avoid stepping on each other's toes? So it should be sufficient to signal each semaphore once. Perhaps on balance that would be better: it opens a theoretical hole, but one that is only likely to matter in already-buggy code. And eventually all processes will be released when the Semaphore is garbage-collected—assuming nobody else refers to those processes themselves, which would in turn prevent the Semaphore from being GC'd, and that is a possibility.
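
To make that concrete, here is a tiny workspace illustration (plain Semaphore, nothing Socket-specific): a single #signal wakes exactly one waiting process, which is all that is needed if each semaphore only ever has one waiter.

| sem |
sem := Semaphore new.
[ sem wait. Transcript show: 'first waiter released'; cr ] fork.
[ sem wait. Transcript show: 'second waiter released'; cr ] fork.
Processor yield. "let both forked processes block on sem"
sem signal. "wakes exactly one of the two; the other stays blocked until a further signal"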

Lemme sleep on this a bit and see what I think about those two points, and/or whether I can think of ways to better test it. I notice SocketTest is very anemic; most of the real functional testing happens in SocketStreamTest, but that is (unsurprisingly) very coupled to the way SocketStream uses Socket, so it can't exercise the full API. I can't say I feel like writing out a comprehensive SocketTest from scratch, but maybe I can lay out a sketch of what could be done and fill in a couple of bits?

@guillep
Member

guillep commented Feb 5, 2024

Haha, actually my question was not to block the issue here but to document it. I know that it's not easy to test/measure the impact of this fix. But I'm glad you have it in mind.

Just FTR, if you think of some tests in the future, I'd like to see them :)

@guillep guillep merged commit 41d450c into pharo-project:Pharo12 Feb 5, 2024
1 of 2 checks passed
@guillep
Member

guillep commented Feb 5, 2024

Argh, I missed that this was breaking the CI because of a missing #signalAll method. I'll revert it and we'll re-issue the PR, sorry...

@guillep
Member

guillep commented Feb 5, 2024

I see, the signalAll method is a ThreadedFFI extension.

[screenshot omitted]

We should move the method to the Semaphore class directly.
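
For reference, a hedged sketch of what such a kernel-side method could look like (illustrative only, not necessarily the ThreadedFFI implementation; it relies on a Semaphore holding its waiting processes as a linked list, so isEmpty answers whether anyone is still waiting):

Semaphore >> signalAll
  "Wake every process currently waiting on the receiver. Each #signal
  resumes at most one waiter, so keep signalling until none are left."
  [ self isEmpty ] whileFalse: [ self signal ]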

@daniels220
Contributor Author

Yes, I saw that signalAll is in ThreadedFFI, but I had so much else to say it looks like I didn't emphasize that enough in my reply :). If moving it to Semaphore directly makes sense to you, I can do that as part of re-submitting. However it's worth noting that there currently are no other senders of signalAll outside of ThreadedFFI, so it may be premature to move it. Does my thinking re: not actually needing the signalAll make sense to you—that it could be regular #signal instead?

@guillep
Member

guillep commented Feb 6, 2024

Yes, I saw that signalAll is in ThreadedFFI, but I had so much else to say it looks like I didn't emphasize that enough in my reply :). If moving it to Semaphore directly makes sense to you, I can do that as part of re-submitting.

Yes, it makes sense to me.
My point here is that such a global extension produces silly issues like this one. The extension is globally available, but while the system is being built (loading packages one by one), it may not be available yet.

However it's worth noting that there currently are no other senders of signalAll outside of ThreadedFFI, so it may be premature to move it.

To me the extension here is adding missing Semaphore functionality...

  • Either it's ThreadedFFI functionality, and then it should be moved close to its user to avoid mistakes like this one.
  • Or it's available for all Semaphore users, and then we move it to the Semaphore class, with no extension.

Does my thinking re: not actually needing the signalAll make sense to you—

This is a third solution (which would be similar to pushing the signalAll method as a ThreadedFFI method).
We could re-implement as follows:

Socket >> signalAll: semaphore
  ...
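
(Purely as an illustration of that third option, one possible body for such a helper; this is guessed for the sake of discussion, not taken from the PR:)

Socket >> signalAll: aSemaphore
  "Hypothetical private helper: wake every process waiting on aSemaphore,
  without adding a #signalAll extension to Semaphore itself."
  aSemaphore ifNil: [ ^ self ].
  [ aSemaphore isEmpty ] whileFalse: [ aSemaphore signal ]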

that it could be regular #signal instead?

And yes, if we are actually using the semaphores as mutexes, maybe a single signal suffices? I agree here :)

Do you want me to revert the revert so we can continue working on this?

@guillep
Member

guillep commented Feb 14, 2024

Hi @daniels220, any news on this front? How do you want us to proceed?

@daniels220
Contributor Author

I've just been busy the past week, thanks for the reminder.

Hmm...I had some concerns about introducing signalAll into the core Semaphore API, but thinking about them some more I think it's really usage-dependent. As long as whatever mechanism calls signalAll is written with the understanding that more processes might start waiting again immediately, and this is the intended behavior, it's fine.

In this particular case, the "ideal" method would be a destroy or something that not only releases all waiting processes but "breaks" the Semaphore so further calls to wait have no effect—or you could think of it as setting excessSignals to infinity. But I also think that in practice, not only is this not a concern—the chances of it are laughably minuscule, if it's even possible—but a single signal should be sufficient unless the calling code is already buggy. So I went with that approach, but independently I have no problem with signalAll being moved to Semaphore.

So, I've resubmitted #16172, which includes the reapply as the first commit before tweaking things. I hope that's convenient for you?

@daniels220 daniels220 deleted the socket-destroy-release-processes branch February 15, 2024 05:20
@guillep
Member

guillep commented Feb 15, 2024

I've just been busy the past week, thanks for the reminder.

Oh, sorry, no intention to bother you; I just needed the info to move on.
Thanks for replying fast!

Hmm...I had some concerns about introducing signalAll into the core Semaphore API

I understand

, but thinking about them some more I think it's really usage-dependent. As long as whatever mechanism calls signalAll is written with the understanding that more processes might start waiting again immediately, and this is the intended behavior, it's fine.

I agree with both your feeling and your final analysis ^^

In this particular case, the "ideal" method would be a destroy or something that not only releases all waiting processes but "breaks" the Semaphore so further calls to wait have no effect—or you could think of it as setting excessSignals to infinity.

I like this idea! I'd make it part of another issue though.
One possibility is to break it by setting excessSignals := nil, and maybe rework the Semaphore code a bit (both the Pharo side and the VM) to be robust in this case (I don't remember off the top of my head whether it already is).

But I also think that in practice, not only is this not a concern—the chances of it are laughably minuscule, if it's even possible—but a single signal should be sufficient unless the calling code is already buggy. So I went with that approach, but independently I have no problem with signalAll being moved to Semaphore.

Yes, this is good as it allows us to move on!
Nothing prevents us from introducing a proper signalAll later if we see the use case; it's a rather easy change :)

So, I've resubmitted #16172, which includes the reapply as the first commit before tweaking things. I hope that's convenient for you?

That's great, thanks! I'll go over it between today and tomorrow!

@astares
Member

astares commented Feb 15, 2024

@daniels220

I work on a large legacy application being ported to Pharo

Can you tell us more (if possible)? Maybe a new story you could describe for https://pharo.org/success?
