Release waiting processes when destroying a socket (without ThreadedFFI dependency) #16172

Merged
merged 8 commits into pharo-project:Pharo12 from daniels220's socket-release-all-2 branch on Apr 25, 2024

Conversation

daniels220
Contributor

Plus a couple related refactoring opportunities I spotted at the last moment.

@guillep
Member

guillep left a comment

Looks good to me :)

@guillep
Member

guillep commented Feb 15, 2024

There is a seemingly related test failure. Ideas?

Error
Assertion failed
Output
TestFailure
Assertion failed
SocketStreamTest(TestAsserter)>>assert:description:resumable:
SocketStreamTest(TestAsserter)>>assert:description:
SocketStreamTest(Object)>>assert:
SocketStreamTest(TestAsserter)>>should:raise:
SocketStreamTest>>testFlushOtherEndClosed
SocketStreamTest(TestCase)>>performTest

@daniels220
Contributor Author

Hmm...when I ran the test in my image, it passed, but if I run it many times in a row (by script, not by clicking the button: `100 timesRepeat: [ (SocketStreamTest selector: #testFlushOtherEndClosed) debug ]`), it usually fails the second time—sometimes it takes more runs, but it does always fail eventually. So it seems that something (maybe the previous run's sockets having just recently closed?) means it hasn't "realized" the connection is closed by the time the second flush is executed. (I imagine that in the CI build, having run a bunch of other socket-related tests right before had the same effect as running the one test many times.)

Since writing this, another issue led me to realize that waitForSendDone... is a misnomer—it's not waiting for sending to complete, just for the send buffer to have room to send more. As such, it's possible to send some data and have sendDone answer true immediately, so long as what you send isn't enough to fill the send buffer. So in fact, when running this test, waitForSendDone... never actually waits, and it's just a matter of whether the internal data structures of the socket are updated in time for the socket to be considered closed by the time the second send is initiated.
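
To illustrate, at the Socket level the behavior described is roughly this (a hypothetical snippet, assuming a freshly connected socket with room in its send buffer):

socket sendData: 'hi'.
socket sendDone "answers true right away: the small payload fits comfortably
	in the send buffer, even though it may not have reached the other end yet"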

As far as making the test reliable, off the top of my head the only thing I can think of is to simply insert a wait in between the two flush calls. Even 15ms is enough for the test to pass 1000x in a row on my machine, but I see other tests using 100ms, and that might be safer on slower machines, other OSes (I'm testing this on macOS), etc. Went with that and pushed a commit.
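
The shape of that fix, as a sketch rather than the literal test code (clientStream is the test's stream as seen in the stack traces; ConnectionClosed is the exception the test expects):

clientStream nextPutAll: 'first message'; flush.
"give the socket machinery time to notice the remote end has closed"
100 milliSeconds wait.
self should: [ clientStream nextPutAll: 'second message'; flush ] raise: ConnectionClosed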

I also created #16196—I think we should rename sendDone to something that better reflects its actual meaning—but that's a separate job.

@jecisc
Member

jecisc commented Mar 25, 2024

I restarted the CI since it has been a while and this might fix some failing tests.

@jecisc
Member

jecisc commented Mar 25, 2024

There is still a failing socket test

osx-64 / Tests-osx-64 / testNextPutAllFlushOtherEndClosed – MacOSX64.Network.Tests.SocketStreamTest (6s)
Error
Assertion failed
Stacktrace
TestFailure
Assertion failed
SocketStreamTest(TestAsserter)>>assert:description:resumable:
SocketStreamTest(TestAsserter)>>assert:description:
SocketStreamTest(Object)>>assert:
SocketStreamTest(TestAsserter)>>should:raise:
SocketStreamTest>>testNextPutAllFlushOtherEndClosed
SocketStreamTest(TestCase)>>performTest

@daniels220 daniels220 force-pushed the socket-release-all-2 branch from 0b737fb to e82d63d on March 25, 2024 22:41
@daniels220
Contributor Author

Aha—it's the same issue as with #testFlushOtherEndClosed. I added the wait here as well and it should be good now.

@jecisc
Member

jecisc commented Mar 26, 2024

Now I see this one:

unix-64 / Tests-unix-64 / testFlushLargeMessageOtherEndClosed – Unix64.Network.Tests.SocketStreamTest (10s)
Error
Assertion failed
Stacktrace
TestFailure
Assertion failed
SocketStreamTest(TestAsserter)>>assert:description:resumable:
SocketStreamTest(TestAsserter)>>assert:description:
SocketStreamTest(Object)>>assert:
SocketStreamTest(TestAsserter)>>should:raise:
SocketStreamTest>>testFlushLargeMessageOtherEndClosed
SocketStreamTest(TestCase)>>performTest

@daniels220
Contributor Author

Interesting...I believe this is because of differing defaults for SO_SNDBUF on different platforms [1], and in particular Linux (at least the Debian 11 VM I tested on) has such a high default that my 256KiB test message actually fits in the buffer, invalidating the test. Seems appropriate to explicitly set SO_SNDBUF...at which point I discovered that the same Debian refuses to set any value below 416KiB, so I'm also reading the value back and making sure the message is larger. This now passes reliably on my machine.
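
In code, that approach looks roughly like this (a sketch with illustrative names, not the exact test change):

| requested actual message |
requested := 65536.
socket setOption: 'SO_SNDBUF' value: requested.
actual := (socket getOption: 'SO_SNDBUF') second.
"Linux may refuse to set a value this small, so size the message from what we actually got"
message := String new: actual * 2 withAll: $a.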

Footnotes

  1. I checked with socket getOption: 'SO_SNDBUF' (and 'SO_RCVBUF' for good measure):

    • Windows: 65KiB send, 65KiB receive
    • macOS: ~144KiB (146988) send, ~399KiB (408300) receive (interestingly these are both multiples of 1361...sounds like there's a story there)
    • Debian 11: ~2.5MiB (2565KiB exactly) send, 128KiB receive

@Rinzwind
Contributor

Rinzwind commented Apr 2, 2024

@daniels220 As far as I know GitHub does not notify a pull request’s author about this, so in case you’re not aware: GitHub now indicates ‘this branch has conflicts that must be resolved’. That seems to be due to the merge of pull request #16376.

@daniels220 daniels220 force-pushed the socket-release-all-2 branch from 175171e to a60667b on April 2, 2024 23:03
@daniels220
Contributor Author

@Rinzwind Thanks for the heads-up—easy fix.

@Rinzwind
Contributor

Rinzwind commented Apr 3, 2024

Thanks! There’s one test failure in build 6 you may still need to look at though.

@guillep
Member

guillep commented Apr 4, 2024

Thanks! There’s one test failure in build 6 you may still need to look at though.

Strangely this only breaks on Windows, with a timeout. Maybe the issue is somewhere other than the code?

TestTookTooMuchTime
Process>>pvtSignal:list:
SocketStream>>checkFlush
SocketStream>>nextPutAll:
[
			clientStream
				nextPutAll: (String new: bufferSize * 2 withAll: $a);
				flush ] in SocketStreamTest>>testFlushLargeMessageOtherEndClosed
[
	  aBlock value.
	  false ] in SocketStreamTest(TestAsserter)>>executeShould:inScopeOf:
FullBlockClosure(BlockClosure)>>on:do:
SocketStreamTest(TestAsserter)>>executeShould:inScopeOf:
SocketStreamTest(TestAsserter)>>should:raise:
SocketStreamTest>>testFlushLargeMessageOtherEndClosed

@jecisc
Member

jecisc commented Apr 5, 2024

I restarted a build to see if this is random or not.

I don't know if we should merge this one or not because of this test. It already fixes a lot of our current failing tests.

@jecisc
Member

jecisc commented Apr 7, 2024

The same failure happened again with the same stack

@daniels220
Contributor Author

I'm hoping to look at this tomorrow (Monday). I have a sneaking suspicion what the problem might be, but it's not something I was ever able to reproduce/verify—it would be great if this were a reproducible test case, but also, I could be completely off base.

@daniels220
Contributor Author

Okay wow, no, it's very much not what I suspected. So on Windows, when you try to set SO_SNDBUF to a reasonable value, it instead gets set to a distinctly unreasonable value—something absolutely massive. The test isn't really "taking too long", it's trying to allocate a multi-gigabyte array, and it's a tossup whether that times out or fails with OOM.

So I can work around that by only trying to set the buffer on not-Windows (or just giving up entirely and accepting that I'll need to send an 8MB message to make the test work on Linux)...at which point I find out that Windows will happily say, "yup, I sent all that" when you make a send() call that massively exceeds the send buffer, rather than saying "I could send 65KiB of that, try again later for the rest". So there's no opportunity to then detect "wait, the other end is closed" the next time around, because it only tries once. The only way to make the test work is to add a separate explicit send/flush...but that defeats the point of the test, which is that if you send enough data it absolutely should fail without that!

So there's an arguable VM/plugin bug in play—though I think it's more like some nasty Windows errata that the VM might want to work around. Certainly looking at the plugin code myself, I don't see anything suspicious, it must be that send() is just agreeing to something that it really shouldn't. And yet it also seems like, if the other end doesn't close, the data does all get sent with just that one send() call...what in the world is Windows doing?

So. Effectively this test is expected-failure on Windows, in that I don't know how to make the behavior it's testing work on Windows. It's not a regression from earlier Pharo versions, the test itself is new (my doing). Thoughts on how to proceed?

@Rinzwind
Contributor

Rinzwind commented Apr 9, 2024

There’s a bug when setting SO_SNDBUF to a value with more than 4 digits, try:

(9999 to: 10000) collect: [ :value |
	(Socket new
		setOption: 'SO_SNDBUF' value: value;
		getOption: 'SO_SNDBUF') second ]

I get #(9999 131072) on macOS, and #(9999 808464433) on Windows. This seems to be related to the condition on line 1539 of ‘SocketPluginImpl.c’ as of Pharo VM commit 32b2be55c1.

@daniels220
Contributor Author

@Rinzwind Ouch, yes, I actually looked at the plugin code and was suspicious of those lines, but couldn't understand them well enough to specifically point to a bug. Looking at it now, yeah, wow, that line is completely confusing "length of a string" with "byte size of a type". Most 32-bit integers are longer than four decimal digits! (Of course strtol returns long, which might be 64 bits, but the same applies to 8 decimal digits not covering the 64-bit int range—in fact 8 decimal digits doesn't even cover the 32-bit range!) As a stopgap measure, it might make sense to just remove that first condition and simply check that the string is exclusively digits, at any length. We would also technically need to check for over/underflow, since a 32-character string of digits can exceed even a 64-bit int.

Of course the right solution, as implied in the comments in the plugin code, is for the image to pass the values as their actual types, the primitive to convert them to appropriate encoding in a void* pointer, and pass that to the sqSocketSetOptions implementation which would use it pretty much as-is. Honestly this seems not especially hard? But I don't have the ability to compile the VM (I tried on a brand-new Debian 11 VM and the result segfaulted precisely when using sockets, go figure) and don't know quite enough C to go it alone.

But then this starts sounding an awful lot like what the FFI marshaling layer is already doing, which really brings me back around to a pet peeve of mine, that being, why is so much of the sockets implementation in primitives that are just thin wrappers over syscalls? I know we need a tiny bit of abstraction to handle different platforms, but that could be handled by switching to different helper classes or something in the image, with just a very very few primitives to handle the parts that absolutely cannot be done any other way (I'm thinking some of the stuff with aio in the name may qualify here, but I'd still want to keep as much as possible in the image).

In any case, we still have the weird behavior in Windows that we can have SO_SNDBUF at its default of 65536, send a multi-megabyte message, and Windows will happily accept the entire thing in one chunk (send(2) will return the full length you give it) and actually succeed in sending it, assuming the socket is not closed in the meantime. Where the hell is it putting all that data? Isn't this what the send buffer size is supposed to mean?

@Rinzwind
Contributor

Another approach to solving the bug could be to change the condition on line 1539 to opt->isint with isint being a new field that indicates whether the option expects an int or not. The table ‘Socket-Level Options’ in section ‘2.10.16 Use of Options’ in the System Interfaces volume of POSIX.1-2017 has a column ‘Parameter Type’.

As for the Windows behavior, that’s not a newly introduced problem that might require reverting recent changes as far as I understand, so you may need to make #testFlushLargeMessageOtherEndClosed signal TestSkipped for now when running on Windows or change it to test the ‘current’ rather than ‘desired’ behavior with respect to the signaling of ConnectionClosed.

@daniels220 daniels220 force-pushed the socket-release-all-2 branch from a60667b to 91485b2 on April 10, 2024 23:53
@daniels220
Contributor Author

Yes, explicitly specifying the type is clearly the correct behavior—that's more in line with what I was thinking of in terms of refactoring so the image passes the parameters as their correct type rather than strings. The declaration on the C side would still be needed for getsockopt(), in order to correctly interpret the return value, and we could use it for error-checking—if the image passes the incorrect type, fail the primitive rather than feeding garbage to setsockopt().

Makes sense re: skipping the test on Windows. Just did that.
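
For reference, the guard is roughly of this shape (illustrative; the actual commit may word it differently):

OSPlatform current isWindows ifTrue: [
	self skip: 'Windows accepts sends far larger than SO_SNDBUF in one call, so the closed connection is never detected' ].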

@Rinzwind
Contributor

The last build failed due to issue #16443. That seems to have been fixed so another build could be started but its Windows test run will probably fail now due to issue #16450.

@jecisc jecisc closed this Apr 12, 2024
@jecisc jecisc reopened this Apr 15, 2024
@jecisc jecisc added the To fix before release (Issue to fix before the upcoming Pharo release) label Apr 15, 2024
@jecisc
Member

jecisc commented Apr 15, 2024

Tests passed! Except a known random failure.

Should we integrate?

@estebanlm
Member

and now we have a conflict :P

@daniels220 daniels220 force-pushed the socket-release-all-2 branch from 91485b2 to 1ff5eb1 on April 16, 2024 22:38
This should be sufficient, as multiple processes trying to read or write at once will result in undefined behavior anyway as their data gets all mixed up together.
…through accessors

This was already sometimes the case, and makes sense--these are private implementation details that really shouldn't be exposed to the outside world at all.
Also refactor to make it clearer that #waitForData is strictly a degenerate case of #waitForDataFor:...--the loop structure is identical, just without the timeout-related bits.
…therEndClosed

Actually, extract a helper as the first part of each test is identical.
…OtherEndClosed

Linux default is enormous (~2.5MiB), while Windows is only 65KiB. Attempt to standardize on this small buffer, but make sure we exceed the *actual* buffer size, as Linux (again) refuses to set a buffer size below 416KiB.
Fails due to Windows accepting arbitrarily large messages--much larger than the TCP send buffer--in a single send(2) call, preventing #waitForSendDone... from being called during the send process. Not sure why but this is not a regression, just strange behavior revealed by the new test.
Also, no need to actually store a `sendDone` temp, if we reach the end of the method we know it is true.
@daniels220 daniels220 force-pushed the socket-release-all-2 branch from 1ff5eb1 to 82d4d7b on April 16, 2024 22:51
@daniels220
Contributor Author

Right, I remember seeing the issue whose fix created that conflict. As it happens I have some further refinement, but in any case, conflict resolved.

@jecisc jecisc closed this Apr 19, 2024
@jecisc jecisc reopened this Apr 19, 2024
@jecisc jecisc closed this Apr 20, 2024
@jecisc jecisc reopened this Apr 20, 2024
@estebanlm estebanlm closed this Apr 25, 2024
@estebanlm estebanlm reopened this Apr 25, 2024
@MarcusDenker MarcusDenker reopened this Apr 25, 2024
@MarcusDenker
Member

It's green!

@MarcusDenker MarcusDenker merged commit c754edc into pharo-project:Pharo12 Apr 25, 2024
2 checks passed
@Rinzwind
Contributor

There are no test results for Windows in the most recent build’s results though. The output of the ‘Allocate node’ step for the Windows test run in that build (build 14) shows ‘pharo-ci-jenkins2-win10’ was used (which doesn’t actually run the tests, see issue #15105):

Running on pharo-ci-jenkins2-win10 in C:\builds\workspace\est_and_branch_Pipeline_PR-16172
