Release all waiting processes when destroying a socket (fixes #15975) #16073

Merged

Conversation

daniels220
Contributor

Send #signalAll to all three semaphores as part of #destroy so any waiting processes are immediately unblocked. Nil the semaphores first to minimize the possibility of a race condition.

Then, improve reporting of this condition by making sure every wait method has an #ifClosed: branch--specifically #waitFor{Connection|Accept}For:, analogous to the #waitForData/#waitForSendDone families. Leave #waitForDisconnectionFor: alone as returning a boolean, since the socket being destroyed indeed means it is disconnected, so the only failure condition is a timeout. Plus some cleanup of the loops to reduce duplicated code.

Last, cleanly report attempts to retrieve the last socket error code when the socket is already destroyed.
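
For concreteness, a minimal sketch of the shape of the #destroy change (illustrative only, not the exact diff; it assumes the three semaphore instance variables are named semaphore, readSemaphore and writeSemaphore, as in the stock Socket, and elides the rest of #destroy):

Socket >> destroy
  | waiters |
  waiters := { semaphore. readSemaphore. writeSemaphore }.
  "Nil the instance variables first to shrink the window in which another
  process could start waiting on a semaphore that is being torn down."
  semaphore := readSemaphore := writeSemaphore := nil.
  "Wake anything still blocked in one of the waitFor... methods."
  waiters do: [ :each | each ifNotNil: [ :sem | sem signalAll ] ].
  "... the existing teardown (primitive destroy, unregister) continues here ..."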

Member

@guillep guillep left a comment

I like the changes.
Maybe @tesonep wants to give a quick look?

Otherwise it seems good to me.

@daniels220 could you give us, for the record, some details about how "hard" you have tested this? Or some scenario you foresee that we need to test a bit more?

@daniels220 daniels220 force-pushed the socket-destroy-release-processes branch from 2189ca6 to 1d7b1e9 on February 3, 2024 16:50
@guillep
Member

guillep commented Feb 3, 2024

There is a seemingly related issue in the CI, any ideas?

Instance of Semaphore did not understand #signalAll

Semaphore(Object)>>doesNotUnderstand: #signalAll
[ :each | each signalAll ] in Socket>>destroy in Block: [ :each | each signalAll ]
Array(SequenceableCollection)>>do:
Socket>>destroy
Socket>>closeAndDestroy:
Socket>>closeAndDestroy
ZdcSecureSocketStream(ZdcAbstractSocketStream)>>socketClose
[ socket ifNotNil: [ self socketClose. socket := nil ]. readBuffer close. writeBuffer close ] in ZdcSecureSocketStream(ZdcAbstractSocketStream)>>close in Block: [...
FullBlockClosure(BlockClosure)>>ensure:
ZdcSecureSocketStream(ZdcAbstractSocketStream)>>close
[ super close ] in ZdcSecureSocketStream>>close in Block: [ super close ]
FullBlockClosure(BlockClosure)>>ensure:
ZdcSecureSocketStream>>close
ZnClient>>close
ZnClient>>validateConnectionForHost:port:
ZnClient>>url:
ZnClient>>prepareRedirect
ZnClient>>executeWithRedirectsRemaining:
[ self executeWithRedirectsRemaining: self maxNumberOfRedirects ] in ZnClient>>executeWithRetriesRemaining: in Block: [ self executeWithRedirectsRemaining: self max[..]
FullBlockClosure(BlockClosure)>>on:do:
ZnClient>>executeWithRetriesRemaining:
[ self executeWithRetriesRemaining: self numberOfRetries ] in [ [ self executeWithRetriesRemaining: self numberOfRetries ] on: Error do: self ifFailBlock ] in ZnClient>>executeWithTimeout in Block: [ self executeWithRetriesRemaining: self numbe[..]
FullBlockClosure(BlockClosure)>>on:do:
[ [ self executeWithRetriesRemaining: self numberOfRetries ] on: Error do: self ifFailBlock ] in ZnClient>>executeWithTimeout in Block: [...
[ ^ block value ] in ZnClient>>withTimeoutDo: in Block: [ ^ block value ]
[ activeProcess psValueAt: index put: anObject. aBlock value ] in ZnConnectionTimeout(DynamicVariable)>>value:during: in Block: [ activeProcess...
FullBlockClosure(BlockClosure)>>ensure:
ZnConnectionTimeout(DynamicVariable)>>value:during:
ZnConnectionTimeout class(DynamicVariable class)>>value:during:
ZnClient>>withTimeoutDo:

@daniels220
Contributor Author

could you give us, for the record, some details about how "hard" you have tested this? Or some scenario you foresee that we need to test a bit more?

To be honest, not as hard as I/we probably should have, but at the same time I can't put my finger on why. There are just so many different ways to use sockets; this was driven by finding that a pattern in the code I work with just doesn't work, and I'm always wary when touching code this central.

(A little background: I work on a large legacy application being ported to Pharo, and it uses a long-lived socket connection, which has processes waiting on it for data to arrive. In other dialects we found that the most reliable way to close the connection and release those processes was to close the socket "out from under them", rather than explicitly terminate them—trying that would sometimes confuse the FFI layer and cause an image crash, where closing the socket would raise an "error" on the C library side which would cleanly propagate and unwind the FFI call. Jump on over to Pharo, and I find that tests are failing because the connection just never tears down properly as far as the test is concerned (it's checking both the socket and the processes, since it's important that they do wind down so they can be started back up again with a new socket later).)

I find it a bit strange TBH that closing the socket doesn't cause the semaphores to be signaled by the SocketPlugin code—surely closing a socket causes an active select() to trigger (or whatever the proper terminology for that is)? But I don't know C well enough to look at that, so I went for the solution I could implement. But that does make me a little uncomfortable, which I think is part of why I feel like yeah, more testing could be good.

Re: the CI failure—I notice that signalAll is an extension from ThreadedFFI, not part of the kernel. I'd have assumed the tests are all run with the image fully built and all packages loaded, but if not, that would explain it—and it may be bad form to introduce a dependency from Network-Kernel to ThreadedFFI if one doesn't already exist.

Now that you point it out more explicitly, the use of signalAll is actually another point of discomfort. In any safe usage, AFAICT, there should be at most one process waiting on each semaphore, because otherwise how would the waiters avoid stepping on each other's toes? So it should be sufficient to signal each semaphore once. Perhaps on balance that would be better: it opens a theoretical hole, but one that is only likely to matter in already-buggy code. And eventually all processes will be released when the Semaphore is garbage-collected—assuming nobody else refers to those processes themselves, which would in turn prevent the Semaphore from being GC'd, and that is a possibility.
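
To make that concrete, here is a tiny workspace illustration (plain Semaphore, nothing Socket-specific): a single #signal wakes exactly one waiting process, which is all that is needed if each semaphore only ever has one waiter.

| sem |
sem := Semaphore new.
[ sem wait. Transcript show: 'first waiter released'; cr ] fork.
[ sem wait. Transcript show: 'second waiter released'; cr ] fork.
Processor yield. "let both forked processes block on sem"
sem signal. "wakes exactly one of the two; the other stays blocked until a further signal"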

Lemme sleep on this a bit and see what I think about those two points, and/or whether I can think of ways to better test it. I notice SocketTest is very anemic; most of the real functional testing happens in SocketStreamTest, but that is (unsurprisingly) very coupled to the way SocketStream uses Socket, so it can't exercise the full API. I can't say I feel like writing out a comprehensive SocketTest from scratch, but maybe I can lay out a sketch of what could be done and fill in a couple of bits?

@guillep
Member

guillep commented Feb 5, 2024

Haha, actually my question was not to block the issue here but to document it. I know that it's not easy to test/measure the impact of this fix. But I'm glad you have it in mind.

Just FTR, if you think of some tests in the future, I'd like to see them :)

@guillep guillep merged commit 41d450c into pharo-project:Pharo12 Feb 5, 2024
1 of 2 checks passed
@guillep
Member

guillep commented Feb 5, 2024

Argh, I missed that this was breaking the CI because of a missing #signalAll method. I'll revert it and we'll re-issue the PR, sorry...

@guillep
Member

guillep commented Feb 5, 2024

I see, the signalAll method is a ThreadedFFI extension.

[screenshot omitted]

We should move the method to the Semaphore class directly.
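
For reference, a hedged sketch of what such a kernel-side method could look like (illustrative only, not necessarily the ThreadedFFI implementation; it relies on a Semaphore holding its waiting processes as a linked list, so isEmpty answers whether anyone is still waiting):

Semaphore >> signalAll
  "Wake every process currently waiting on the receiver. Each #signal
  resumes at most one waiter, so keep signalling until none are left."
  [ self isEmpty ] whileFalse: [ self signal ]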

@daniels220
Contributor Author

Yes, I saw that signalAll is in ThreadedFFI, but I had so much else to say it looks like I didn't emphasize that enough in my reply :). If moving it to Semaphore directly makes sense to you, I can do that as part of re-submitting. However it's worth noting that there currently are no other senders of signalAll outside of ThreadedFFI, so it may be premature to move it. Does my thinking re: not actually needing the signalAll make sense to you—that it could be regular #signal instead?

@guillep
Member

guillep commented Feb 6, 2024

Yes, I saw that signalAll is in ThreadedFFI, but I had so much else to say it looks like I didn't emphasize that enough in my reply :). If moving it to Semaphore directly makes sense to you, I can do that as part of re-submitting.

Yes, it makes sense to me.
My point here is that such a global extension produces silly issues like this one. The extension is globally available, but while the system is being built (loading packages one by one), it may not be available yet.

However it's worth noting that there currently are no other senders of signalAll outside of ThreadedFFI, so it may be premature to move it.

To me the extension here is adding missing Semaphore functionality...

  • Either it's ThreadedFFI functionality, and then it should be moved close to its user to avoid mistakes like this one.
  • Or it's available for all Semaphore users, and then we move it to the Semaphore class, with no extension.

Does my thinking re: not actually needing the signalAll make sense to you—

This is a third solution (which would be similar to pushing the signalAll method as a ThreadedFFI method).
We could re-implement as follows:

Socket >> signalAll: semaphore
  ...
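
(Purely as an illustration of that third option, one possible body for such a helper; this is guessed for the sake of discussion, not taken from the PR:)

Socket >> signalAll: aSemaphore
  "Hypothetical private helper: wake every process waiting on aSemaphore,
  without adding a #signalAll extension to Semaphore itself."
  aSemaphore ifNil: [ ^ self ].
  [ aSemaphore isEmpty ] whileFalse: [ aSemaphore signal ]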

that it could be regular #signal instead?

And yes, if we are actually using the semaphores as mutexes, maybe a single signal suffices? I agree here :)

Do you want me to revert the revert so we can continue working on this?

@guillep
Member

guillep commented Feb 14, 2024

Hi @daniels220, any news on this front? How do you want us to proceed?

@daniels220
Contributor Author

I've just been busy the past week, thanks for the reminder.

Hmm...I had some concerns about introducing signalAll into the core Semaphore API, but thinking about them some more I think it's really usage-dependent. As long as whatever mechanism calls signalAll is written with the understanding that more processes might start waiting again immediately, and this is the intended behavior, it's fine.

In this particular case, the "ideal" method would be a destroy or something that not only releases all waiting processes but "breaks" the Semaphore so further calls to wait have no effect—or you could think of it as setting excessSignals to infinity. But I also think that in practice, not only is this not a concern—the chances of it are laughably minuscule, if it's even possible—but a single signal should be sufficient unless the calling code is already buggy. So I went with that approach, but independently I have no problem with signalAll being moved to Semaphore.

So, I've resubmitted #16172, which includes the reapply as the first commit before tweaking things. I hope that's convenient for you?

@daniels220 daniels220 deleted the socket-destroy-release-processes branch February 15, 2024 05:20
@guillep
Member

guillep commented Feb 15, 2024

I've just been busy the past week, thanks for the reminder.

Oh, sorry, no intention to bother you; I just needed the info to move on.
Thanks for replying fast!

Hmm...I had some concerns about introducing signalAll into the core Semaphore API

I understand

, but thinking about them some more I think it's really usage-dependent. As long as whatever mechanism calls signalAll is written with the understanding that more processes might start waiting again immediately, and this is the intended behavior, it's fine.

I agree with both your feeling and your final analysis ^^

In this particular case, the "ideal" method would be a destroy or something that not only releases all waiting processes but "breaks" the Semaphore so further calls to wait have no effect—or you could think of it as setting excessSignals to infinity.

I like this idea! I'd make it part of another issue though.
One possibility is to break it by setting excessSignals := nil, and maybe rework the Semaphore code a bit (both the Pharo side and the VM) to be robust in this case (I don't remember off the top of my head whether it already is).

But I also think that in practice, not only is this not a concern—the chances of it are laughably minuscule, if it's even possible—but a single signal should be sufficient unless the calling code is already buggy. So I went with that approach, but independently I have no problem with signalAll being moved to Semaphore.

Yes, this is good as it allows us to move on!
Nothing prevents us from introducing a proper signalAll later if we see the use case; it's a rather easy change :)

So, I've resubmitted #16172, which includes the reapply as the first commit before tweaking things. I hope that's convenient for you?

That's great, thanks! I'll go over it between today and tomorrow!

@astares
Member

astares commented Feb 15, 2024

@daniels220

I work on a large legacy application being ported to Pharo

Can you tell us more (if possible)? Maybe a new story you could describe for https://pharo.org/success?
