Release waiting processes when destroying a socket (without ThreadedFFI dependency) #16172

Merged
merged 8 commits into pharo-project:Pharo12 from daniels220's socket-release-all-2 branch on Apr 25, 2024

Conversation

daniels220
Contributor

Plus a couple related refactoring opportunities I spotted at the last moment.

@guillep
Member

guillep left a comment

Looks good to me :)

@guillep
Member

guillep commented Feb 15, 2024

There is a seemingly related test failure. Ideas?

Error
Assertion failed
Output
TestFailure
Assertion failed
SocketStreamTest(TestAsserter)>>assert:description:resumable:
SocketStreamTest(TestAsserter)>>assert:description:
SocketStreamTest(Object)>>assert:
SocketStreamTest(TestAsserter)>>should:raise:
SocketStreamTest>>testFlushOtherEndClosed
SocketStreamTest(TestCase)>>performTest

@daniels220
Contributor Author

Hmm...when I ran the test in my image, it passed, but if I run it many times in a row (by script, not by clicking the button: `100 timesRepeat: [ (SocketStreamTest selector: #testFlushOtherEndClosed) debug ]`), it usually fails the second time—sometimes it takes more runs, but it does always fail eventually. So it seems that something (maybe the previous run's sockets having just recently closed?) means it hasn't "realized" the connection is closed by the time the second flush is executed. (I imagine that in the CI build, having run a bunch of other socket-related tests right before had the same effect as running the one test many times.)

Since writing this, another issue led me to realize that waitForSendDone... is a misnomer—it's not waiting for sending to complete, just for the send buffer to have room to send more. As such, it's possible to send some data and have sendDone answer true immediately, so long as what you send isn't enough to fill the send buffer. So in fact, when running this test, waitForSendDone... never actually waits, and it's just a matter of whether the internal data structures of the socket are updated in time for the socket to be considered closed by the time the second send is initiated.
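
To illustrate, at the Socket level the behavior described is roughly this (a hypothetical snippet, assuming a freshly connected socket with room in its send buffer):

socket sendData: 'hi'.
socket sendDone "answers true right away: the small payload fits comfortably
	in the send buffer, even though it may not have reached the other end yet"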

As far as making the test reliable, off the top of my head the only thing I can think of is to simply insert a wait in between the two flush calls. Even 15ms is enough for the test to pass 1000x in a row on my machine, but I see other tests using 100ms, and that might be safer on slower machines, other OSes (I'm testing this on macOS), etc. Went with that and pushed a commit.
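
The shape of that fix, as a sketch rather than the literal test code (clientStream is the test's stream as seen in the stack traces; ConnectionClosed is the exception the test expects):

clientStream nextPutAll: 'first message'; flush.
"give the socket machinery time to notice the remote end has closed"
100 milliSeconds wait.
self should: [ clientStream nextPutAll: 'second message'; flush ] raise: ConnectionClosed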

I also created #16196—I think we should rename sendDone to something that better reflects its actual meaning—but that's a separate job.

@jecisc
Member

jecisc commented Mar 25, 2024

I restarted the CI since it has been a while and this might fix some failing tests.

@jecisc
Member

jecisc commented Mar 25, 2024

There is still a failing socket test

osx-64 / Tests-osx-64 / testNextPutAllFlushOtherEndClosed – MacOSX64.Network.Tests.SocketStreamTest (6s)
Error
Assertion failed
Stacktrace
TestFailure
Assertion failed
SocketStreamTest(TestAsserter)>>assert:description:resumable:
SocketStreamTest(TestAsserter)>>assert:description:
SocketStreamTest(Object)>>assert:
SocketStreamTest(TestAsserter)>>should:raise:
SocketStreamTest>>testNextPutAllFlushOtherEndClosed
SocketStreamTest(TestCase)>>performTest

@daniels220 daniels220 force-pushed the socket-release-all-2 branch from 0b737fb to e82d63d on March 25, 2024 22:41
@daniels220
Contributor Author

Aha—it's the same issue as with #testFlushOtherEndClosed. I added the wait here as well and it should be good now.

@jecisc
Member

jecisc commented Mar 26, 2024

Now I see this one:

unix-64 / Tests-unix-64 / testFlushLargeMessageOtherEndClosed – Unix64.Network.Tests.SocketStreamTest (10s)
Error
Assertion failed
Stacktrace
TestFailure
Assertion failed
SocketStreamTest(TestAsserter)>>assert:description:resumable:
SocketStreamTest(TestAsserter)>>assert:description:
SocketStreamTest(Object)>>assert:
SocketStreamTest(TestAsserter)>>should:raise:
SocketStreamTest>>testFlushLargeMessageOtherEndClosed
SocketStreamTest(TestCase)>>performTest

@daniels220
Contributor Author

Interesting...I believe this is because of differing defaults for SO_SNDBUF on different platforms [1], and in particular Linux (at least the Debian 11 VM I tested on) has such a high default that my 256KiB test message actually fits in the buffer, invalidating the test. Seems appropriate to explicitly set SO_SNDBUF...at which point I discovered that the same Debian refuses to set any value below 416KiB, so I'm also reading the value back and making sure the message is larger. This now passes reliably on my machine.
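
In code, that approach looks roughly like this (a sketch with illustrative names, not the exact test change):

| requested actual message |
requested := 65536.
socket setOption: 'SO_SNDBUF' value: requested.
actual := (socket getOption: 'SO_SNDBUF') second.
"Linux may refuse to set a value this small, so size the message from what we actually got"
message := String new: actual * 2 withAll: $a.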

Footnotes

  1. I checked with socket getOption: 'SO_SNDBUF' (and 'SO_RCVBUF' for good measure):

    • Windows: 65KiB send, 65KiB receive
    • macOS: ~144KiB (146988) send, ~399KiB (408300) receive (interestingly these are both multiples of 1361...sounds like there's a story there)
    • Debian 11: ~2.5MiB (2565KiB exactly) send, 128KiB receive

@Rinzwind
Contributor

Rinzwind commented Apr 2, 2024

@daniels220 As far as I know GitHub does not notify a pull request’s author about this, so in case you’re not aware: GitHub now indicates ‘this branch has conflicts that must be resolved’. That seems to be due to the merge of pull request #16376.

@daniels220 daniels220 force-pushed the socket-release-all-2 branch from 175171e to a60667b on April 2, 2024 23:03
@daniels220
Contributor Author

@Rinzwind Thanks for the heads-up—easy fix.

@Rinzwind
Contributor

Rinzwind commented Apr 3, 2024

Thanks! There’s one test failure in build 6 you may still need to look at though.

@guillep
Member

guillep commented Apr 4, 2024

Thanks! There’s one test failure in build 6 you may still need to look at though.

Strangely this only breaks on Windows, with a timeout. Maybe the issue is somewhere other than the code?

TestTookTooMuchTime
Process>>pvtSignal:list:
SocketStream>>checkFlush
SocketStream>>nextPutAll:
[
			clientStream
				nextPutAll: (String new: bufferSize * 2 withAll: $a);
				flush ] in SocketStreamTest>>testFlushLargeMessageOtherEndClosed
[
	  aBlock value.
	  false ] in SocketStreamTest(TestAsserter)>>executeShould:inScopeOf:
FullBlockClosure(BlockClosure)>>on:do:
SocketStreamTest(TestAsserter)>>executeShould:inScopeOf:
SocketStreamTest(TestAsserter)>>should:raise:
SocketStreamTest>>testFlushLargeMessageOtherEndClosed

@jecisc
Member

jecisc commented Apr 5, 2024

I restarted a build to see if this is random or not.

I don't know if we should merge this one or not because of this test. It already fixes a lot of our current failing tests.

@jecisc
Member

jecisc commented Apr 7, 2024

The same failure happened again with the same stack

@daniels220
Contributor Author

I'm hoping to look at this tomorrow (Monday). I have a sneaking suspicion what the problem might be, but it's not something I was ever able to reproduce/verify—it would be great if this were a reproducible test case, but also, I could be completely off base.

@daniels220
Contributor Author

Okay wow, no, it's very much not what I suspected. So on Windows, when you try to set SO_SNDBUF to a reasonable value, it instead gets set to a distinctly unreasonable value—something absolutely massive. The test isn't really "taking too long", it's trying to allocate a multi-gigabyte array, and it's a tossup whether that times out or fails with OOM.

So I can work around that by only trying to set the buffer on not-Windows (or just giving up entirely and accepting that I'll need to send an 8MB message to make the test work on Linux)...at which point I find out that Windows will happily say, "yup, I sent all that" when you make a send() call that massively exceeds the send buffer, rather than saying "I could send 65KiB of that, try again later for the rest". So there's no opportunity to then detect "wait, the other end is closed" the next time around, because it only tries once. The only way to make the test work is to add a separate explicit send/flush...but that defeats the point of the test, which is that if you send enough data it absolutely should fail without that!

So there's an arguable VM/plugin bug in play—though I think it's more like some nasty Windows errata that the VM might want to work around. Certainly looking at the plugin code myself, I don't see anything suspicious, it must be that send() is just agreeing to something that it really shouldn't. And yet it also seems like, if the other end doesn't close, the data does all get sent with just that one send() call...what in the world is Windows doing?

So. Effectively this test is expected-failure on Windows, in that I don't know how to make the behavior it's testing work on Windows. It's not a regression from earlier Pharo versions, the test itself is new (my doing). Thoughts on how to proceed?

@Rinzwind
Contributor

Rinzwind commented Apr 9, 2024

There’s a bug when setting SO_SNDBUF to a value with more than 4 digits, try:

(9999 to: 10000) collect: [ :value |
	(Socket new
		setOption: 'SO_SNDBUF' value: value;
		getOption: 'SO_SNDBUF') second ]

I get #(9999 131072) on macOS, and #(9999 808464433) on Windows. This seems to be related to the condition on line 1539 of ‘SocketPluginImpl.c’ as of Pharo VM commit 32b2be55c1.

@daniels220
Contributor Author

@Rinzwind Ouch, yes, I actually looked at the plugin code and was suspicious of those lines, but couldn't understand them well enough to specifically point to a bug. Looking at it now, yeah, wow, that line is completely confusing "length of a string" with "byte size of a type". Most 32-bit integers are longer than four decimal digits! (Of course strtol returns long, which might be 64 bits, but the same applies to 8 decimal digits not covering the 64-bit int range—in fact 8 decimal digits doesn't even cover the 32-bit range!) As a stopgap measure, it might make sense to just remove that first condition and simply check that the string is exclusively digits, at any length. We would also technically need to check for over/underflow, since a 32-character string of digits can exceed even a 64-bit int.

Of course the right solution, as implied in the comments in the plugin code, is for the image to pass the values as their actual types, the primitive to convert them to appropriate encoding in a void* pointer, and pass that to the sqSocketSetOptions implementation which would use it pretty much as-is. Honestly this seems not especially hard? But I don't have the ability to compile the VM (I tried on a brand-new Debian 11 VM and the result segfaulted precisely when using sockets, go figure) and don't know quite enough C to go it alone.

But then this starts sounding an awful lot like what the FFI marshaling layer is already doing, which really brings me back around to a pet peeve of mine, that being, why is so much of the sockets implementation in primitives that are just thin wrappers over syscalls? I know we need a tiny bit of abstraction to handle different platforms, but that could be handled by switching to different helper classes or something in the image, with just a very very few primitives to handle the parts that absolutely cannot be done any other way (I'm thinking some of the stuff with aio in the name may qualify here, but I'd still want to keep as much as possible in the image).

In any case, we still have the weird behavior in Windows that we can have SO_SNDBUF at its default of 65536, send a multi-megabyte message, and Windows will happily accept the entire thing in one chunk (send(2) will return the full length you give it) and actually succeed in sending it, assuming the socket is not closed in the meantime. Where the hell is it putting all that data? Isn't this what the send buffer size is supposed to mean?

@Rinzwind
Contributor

Another approach to solving the bug could be to change the condition on line 1539 to opt->isint with isint being a new field that indicates whether the option expects an int or not. The table ‘Socket-Level Options’ in section ‘2.10.16 Use of Options’ in the System Interfaces volume of POSIX.1-2017 has a column ‘Parameter Type’.

As for the Windows behavior, that’s not a newly introduced problem that might require reverting recent changes as far as I understand, so you may need to make #testFlushLargeMessageOtherEndClosed signal TestSkipped for now when running on Windows or change it to test the ‘current’ rather than ‘desired’ behavior with respect to the signaling of ConnectionClosed.

@daniels220 daniels220 force-pushed the socket-release-all-2 branch from a60667b to 91485b2 on April 10, 2024 23:53
@daniels220
Contributor Author

Yes, explicitly specifying the type is clearly the correct behavior—that's more in line with what I was thinking of in terms of refactoring so the image passes the parameters as their correct type rather than strings. The declaration on the C side would still be needed for getsockopt(), in order to correctly interpret the return value, and we could use it for error-checking—if the image passes the incorrect type, fail the primitive rather than feeding garbage to setsockopt().

Makes sense re: skipping the test on Windows. Just did that.
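
For reference, the guard is roughly of this shape (illustrative; the actual commit may word it differently):

OSPlatform current isWindows ifTrue: [
	self skip: 'Windows accepts sends far larger than SO_SNDBUF in one call, so the closed connection is never detected' ].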

@Rinzwind
Contributor

The last build failed due to issue #16443. That seems to have been fixed so another build could be started but its Windows test run will probably fail now due to issue #16450.

@jecisc jecisc closed this Apr 12, 2024
@jecisc jecisc reopened this Apr 15, 2024
@jecisc jecisc added the To fix before release (Issue to fix before the upcoming Pharo release) label Apr 15, 2024
@jecisc
Member

jecisc commented Apr 15, 2024

Tests passed! Except a known random failure.

Should we integrate?

@estebanlm
Member

and now we have a conflict :P

@daniels220 daniels220 force-pushed the socket-release-all-2 branch from 91485b2 to 1ff5eb1 on April 16, 2024 22:38
This should be sufficient, as multiple processes trying to read or write at once will result in undefined behavior anyway as their data gets all mixed up together.
…through accessors

This was already sometimes the case, and makes sense--these are private implementation details that really shouldn't be exposed to the outside world at all.
Also refactor to make it clearer that #waitForData is strictly a degenerate case of #waitForDataFor:...--the loop structure is identical, just without the timeout-related bits.
…therEndClosed

Actually, extract a helper as the first part of each test is identical.
…OtherEndClosed

Linux default is enormous (~2.5MiB), while Windows is only 65KiB. Attempt to standardize on this small buffer, but make sure we exceed the *actual* buffer size, as Linux (again) refuses to set a buffer size below 416KiB.
Fails due to Windows accepting arbitrarily large messages--much larger than the TCP send buffer--in a single send(2) call, preventing #waitForSendDone... from being called during the send process. Not sure why but this is not a regression, just strange behavior revealed by the new test.
Also, no need to actually store a `sendDone` temp, if we reach the end of the method we know it is true.
@daniels220 daniels220 force-pushed the socket-release-all-2 branch from 1ff5eb1 to 82d4d7b on April 16, 2024 22:51
@daniels220
Contributor Author

Right, I remember seeing the issue whose fix created that conflict. As it happens I have some further refinement, but in any case, conflict resolved.

@jecisc jecisc closed this Apr 19, 2024
@jecisc jecisc reopened this Apr 19, 2024
@jecisc jecisc closed this Apr 20, 2024
@jecisc jecisc reopened this Apr 20, 2024
@estebanlm estebanlm closed this Apr 25, 2024
@estebanlm estebanlm reopened this Apr 25, 2024
@MarcusDenker MarcusDenker reopened this Apr 25, 2024
@MarcusDenker
Member

It's green!

@MarcusDenker MarcusDenker merged commit c754edc into pharo-project:Pharo12 Apr 25, 2024
2 checks passed
@Rinzwind
Contributor

There are no test results for Windows in the most recent build’s results though. The output of the ‘Allocate node’ step for the Windows test run in that build (build 14) shows ‘pharo-ci-jenkins2-win10’ was used (which doesn’t actually run the tests, see issue #15105):

Running on pharo-ci-jenkins2-win10 in C:\builds\workspace\est_and_branch_Pipeline_PR-16172
