Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

testscript: panic: fail now! in stop command on Windows #228

Closed
bep opened this issue May 12, 2023 · 22 comments · Fixed by #265
Closed

testscript: panic: fail now! in stop command on Windows #228

bep opened this issue May 12, 2023 · 22 comments · Fixed by #265

Comments

@bep
Copy link
Contributor

bep commented May 12, 2023

I'm creating this issue in the hope that someone could give me some helping hints before I give in and start up my Windows VM.

The short take is:

--- FAIL: TestCommands (2.00s)
    --- FAIL: TestCommands/server_render_to_memory (1.11s)
        testscript.go:534: # Test the hugo server command.
            # We run these tests in parallel so let Hugo decide which port to use.
 (1.102s)
            > hugo server &
            > waitServer
            > httpget $HUGOTEST_BASEURL_0 'Title: Hugo Server Test' $HUGOTEST_BASEURL_0
            > ! exists public/index.html
            > ! exists public/mystatic.txt
            > stop
            stop
            [background] hugo.exe server: exit status 1
            [stdout]
            Start building sites … 
            hugo v0.112.0-DEV windows/386 BuildDate=unknown
            
                               | EN  
            -------------------+-----
              Pages            |  2  
              Paginator pages  |  0  
              Non-page files   |  0  
              Static files     |  1  
              Processed images |  0  
              Aliases          |  0  
              Sitemaps         |  0  
              Cleaned          |  0  
            
            Built in 91 ms
            Watching for changes in $WORK\{layouts,static}
            Watching for config changes in $WORK\hugo.toml
            Running in Fast Render Mode. For full rebuilds on change: hugo server --disableFastRender
            Web Server is available at http://localhost:1313/ (bind address 127.0.0.1)
            Press Ctrl+C to stop
            FAIL: testscripts\commands\server_render_to_memory.txt:13: unexpected command failure
            
panic: fail now! [recovered]
	panic: fail now!

goroutine 45 [running]:
testing.tRunner.func1.2({0x1b175c0, 0x14847970})
	C:/hostedtoolcache/windows/go/1.19.9/x64/src/testing/testing.go:1[39](https://github.com/gohugoio/hugo/actions/runs/4961873656/jobs/8879210277?pr=10602#step:17:40)6 +0x2ab
testing.tRunner.func1()
	C:/hostedtoolcache/windows/go/1.19.9/x64/src/testing/testing.go:1399 +0x[41](https://github.com/gohugoio/hugo/actions/runs/4961873656/jobs/8879210277?pr=10602#step:17:42)f
panic({0x1b175c0, 0x14847970})
	C:/hostedtoolcache/windows/go/1.19.9/x64/src/runtime/panic.go:884 +0x1ba
github.com/rogpeppe/go-internal/testscript.(*TestScript).Fatalf(0x149e6900, {0x1d91689, 0x1a}, {0x0, 0x0, 0x0})
	C:/Users/runneradmin/go/pkg/mod/github.com/rogpeppe/[email protected][43](https://github.com/gohugoio/hugo/actions/runs/4961873656/jobs/8879210277?pr=10602#step:17:44)74f/testscript/testscript.go:1104 +0x1[49](https://github.com/gohugoio/hugo/actions/runs/4961873656/jobs/8879210277?pr=10602#step:17:50)
github.com/rogpeppe/go-internal/testscript.(*TestScript).waitBackground(0x149e[69](https://github.com/gohugoio/hugo/actions/runs/4961873656/jobs/8879210277?pr=10602#step:17:70)00, 0x1)
@mvdan
Copy link
Collaborator

mvdan commented May 12, 2023

Do they always fail like that on Windows, or just sometimes?

Do you have a minimal reproducer? Beyond making the investigation easier, it would also serve as a regression test for the eventual fix.

@mvdan
Copy link
Collaborator

mvdan commented May 12, 2023

Also, could you share an entire goroutine trace? I think the problem is that we don't set up the recover for failNow in one of the goroutines.

@bep
Copy link
Contributor Author

bep commented May 13, 2023

Do they always fail like that on Windows, or just sometimes?

I don't have a lot of data on this (I have ran it on GitHub 3-4 times, it always fails on Windows; I have since disabled these tests on Windows, and they always passes on Linux/MacOs, which is good), but I will close my eyes and boot up my Windows VM later today and see what happens.

@mvdan
Copy link
Collaborator

mvdan commented May 13, 2023

FYI I think the problem is that waitBackground uses ts.Fatalf, which is subtly different from ts.t.Fatalf. The former is our own implementation, which calls panic(failNow) to mimic the "stop now" behavior from testing.T. Within a testscript command, ts.Fatalf should be used to properly use testscript's "fail" logic. In the other setup/teardown code, ts.t.Fatalf should be used, because in many of those cases, the deferred recover to handle failNow may not exist.

waitBackground is used in both scenarios - both by the wait command, and by the teardown code when the entire script finishes. In the latter case I think that's where you run into issues, because it uses ts.Fatalf.

@myitcv
Copy link
Collaborator

myitcv commented May 13, 2023

Drive-by comment unrelated to the error you are seeing, @bep: I've not had a particularly enjoyable time trying to test server processes using a command to start the server, and then another one to close it. Instead, I've tended to use special files in the testscript itself to indicate that a server should be started for the purposes of a given test, and then tear down happens at the end. This also means that the setup command can more easily wait until the server is known to be ready before starting the script.

@bep
Copy link
Contributor Author

bep commented May 13, 2023

can more easily wait until the server is known to be ready before starting the script.

Yes, I'm actually doing that. My current problem is the stopping of the server, which seem to somehow end with ProcessState.Success() == false on Windows -- even if the test itself passes. I could possibly add a custom "non signal" way to stop the server during tests, as well, but I will try to dig a little further first.

@mvdan
Copy link
Collaborator

mvdan commented May 13, 2023

To clarify, I think Paul means starting and stopping your server program from Go code, via testscript's Setup function. Then you have full control of it: block until it's ready, capture its output, what signal to give it to stop, etc. testscript's own background process support is okay, just slightly limited depending on what you want to do.

@bep bep changed the title testscript: panic: fail now! on background processes on Windows testscript: panic: fail now! in stop command on Windows May 13, 2023
@bep
Copy link
Contributor Author

bep commented May 13, 2023

This test repo always fails on Windows:

https://github.com/bep/repro-testscript-228

Log here:

https://github.com/bep/repro-testscript-228/actions/runs/4967165173/jobs/8889003644#step:4:75

Not sure why this is falsy on Windows:

if bg.cmd.ProcessState.Success() {

@bep
Copy link
Contributor Author

bep commented May 13, 2023

OK, I guess this is a what to be expected on Windows when killing the process (as in, no SIGINT). I will find a workaround, but I'll keep this issue open if someone has a fix.

@myitcv
Copy link
Collaborator

myitcv commented May 13, 2023

I think Paul means starting and stopping your server program from Go code, via testscript's Setup function

Exactly so.

@myitcv
Copy link
Collaborator

myitcv commented May 13, 2023

OK, I guess this is a what to be expected on Windows when killing the process (as in, no SIGINT). I will find a workaround, but I'll keep this issue open if someone has a fix.

@bep I'd strongly consider doing the server setup and teardown in the Setup() function with corresponding use of Defer(). It will not only make things more robust, but it will also tidy up a number of your scripts to remove the stutter of:

hugo server &
waitServer

...

stop

You can "mark" scripts that expect a server to be running using a special file in the script archive:


-- .run_a_server --

Check for the existence of such a file in the Setup() function and run a server accordingly.

@myitcv
Copy link
Collaborator

myitcv commented May 13, 2023

For a fairly involved example I've recently written https://github.com/cue-lang/cuelang.org/blob/a0df30d1ef4858b51bdfe96b7f6c507a0f08fb81/internal/fsnotify/fsnotify_test.go#L57 (which coincidentally might be a package of interest to Hugo).

@bep
Copy link
Contributor Author

bep commented May 13, 2023

It will not only make things more robust, but it will also tidy up a number of your scripts to remove the stutter of:

@myitcv Maybe. The thing is:

  • Having test scripts that's self contained and as close to the "real thing" as possible is valuable to me, and I don't mind a little stutter if I can get that.
  • My server tests are mostly to verify that the CLI is wired up correctly, so adding variations with hugo server --renderToMemory & is incredibly powerful and self documenting.

@myitcv
Copy link
Collaborator

myitcv commented May 13, 2023

  • Having test scripts that's self contained and as close to the "real thing" as possible is valuable to me, and I don't mind a little stutter if I can get that.
  • My server tests are mostly to verify that the CLI is wired up correctly, so adding variations with hugo server --renderToMemory & is incredibly powerful and self documenting.

Just to highlight that you don't necessarily lose any capabilities with the special file approach. You could just as easily put the arguments to the server in the special file, extract them using code like this, then pass the resulting []string to your command with almost zero special handling required.

incredibly powerful and self documenting

I'm totally with you on this, just sharing my experience of having tested a number of commands where a server (or equivalent mode) became too painful.

As you will notice from the fsnotify tests I linked above, it also makes it easier to orchestrate commands that can interact more precisely with the running server. @ldemailly that partially addresses the question you asked in the linked issue.

@bep
Copy link
Contributor Author

bep commented May 13, 2023

@myitcv I'm a little confused. I see the Setup and the Defer -- but how do you pass arguments from the test script to those? Having them in some other "special file" is not what i talk about when I use the term "self contained test scripts".

@myitcv
Copy link
Collaborator

myitcv commented May 14, 2023

I'm a little confused. I see the Setup and the Defer -- but how do you pass arguments from the test script to those?

Just to clarify, I'm be talking about the example from fsnotify. The testscript scripts live within the testdata directory:

https://github.com/cue-lang/cuelang.org/tree/a0df30d1ef4858b51bdfe96b7f6c507a0f08fb81/internal/fsnotify/testdata

The Setup() function delegates to:

https://github.com/cue-lang/cuelang.org/blob/a0df30d1ef4858b51bdfe96b7f6c507a0f08fb81/internal/fsnotify/fsnotify_test.go#L57

When each script runs, before Setup() is called, the script archive is extracted to a temporary directory. The Setup() function can therefore do things based on the files within the archive, which now (at the time Setup() is called) exist on disk.

In the fsnotify example, as you can see in the Setup() function, it checks for the existence of a file named .batched in the root of the archive:

https://github.com/cue-lang/cuelang.org/blob/a0df30d1ef4858b51bdfe96b7f6c507a0f08fb81/internal/fsnotify/fsnotify_test.go#L99-L113

Note the use of (*testscript.Env).Cd for the path to the directory into which the archive has been extracted.

If the file exists, its contents are used to parse a time.Duration which is then used in the creation of a batched watcher. If the file does not exist, then we create a regular watcher.

Here is an example of one of the scripts in the testdata directory that contains one of the .batched special files:

https://github.com/cue-lang/cuelang.org/blob/a0df30d1ef4858b51bdfe96b7f6c507a0f08fb81/internal/fsnotify/testdata/batched_directory_moved_into_watch_tree.txtar#L31-L32

The file .batched is what I would refer to as a special file within a test archive. Its existence or otherwise, can be used to help signal things to the Setup() function. In this case it's not just the existence, but the contents of the file that are used.

Hence in my suggestion re the running of a server, you could use the existence of a special file within a testscript script archive to signal that Setup() should run Hugo in serve mode. Furthermore, you could use the contents of that file to drive the args to serve, in the way that I suggested in #228 (comment)

@myitcv
Copy link
Collaborator

myitcv commented May 14, 2023

Updated my latest comment with a link to one of the .batched special files to make the point slightly clearer.

@bep
Copy link
Contributor Author

bep commented May 14, 2023

@myitcv thanks for the detailed explanation. OK, I understand how to set it up. The thing is, though, when I compare the 2 scripts below:

hugo server --renderToMemory &
sleep
! stderr .
stop
# I can even start more servers here if I want to.

And:

sleep
! stderr .
-- magicserverfile --
--renderToMemory

I would much, much prefer the first variant any day of the week. And one great thing about it is that the average developer coming in to fix 1 thing in a project he/she has no experience with, can read that script and understand exactly what it does (also, it does not require 1K lines of custom test setup behind the scenes).

This particular issue (#228) is obviously a bug. It may be hard (impossible?) to fix for the testscript.RunMain commands, but perhaps the Cmds commands could have support for the & "operator" and a stop would run the Defer? Or something.

My current workaround is a custom stopServer command that just sends a HEAD request to a "stop server" endpoint.

@myitcv
Copy link
Collaborator

myitcv commented May 14, 2023

At the risk of derailing this bug report still further, just to point out a couple of things:

sleep
! stderr .

I've found sleep to in general be a very brittle way of waiting for a server to become available. You either need to be very generous on the sleep time which slows down all server tests, or risk it being too small and CI failing intermittently. Hence, when I start such a server in Setup(), I have the setup phase wait until the server is definitely ready. Then you can elide this step entirely from the script (which again is more robust).

The checking of stderr will not work for a backgrounded task FWIW. Again, you need a custom command to interact with the server process. That's why in the fsnotify tests I define a log command that directly interacts with the "server" (in that case):

https://github.com/cue-lang/cuelang.org/blob/a0df30d1ef4858b51bdfe96b7f6c507a0f08fb81/internal/fsnotify/fsnotify_test.go#L52

I take your point about it being unusual to not have a command at the start of a script, but it's a "problem" that is (again in my experience) relatively easily overcome with a script naming format like serve_*.txtar and comments that generally appear at the top of each such script explaining what is going on, what is being tested etc.

Again, none of what I am saying is a rule in any way, just sharing my experience from having written a number of such test setups.

@bep
Copy link
Contributor Author

bep commented May 14, 2023

I've found sleep to in general be a very brittle way of waiting for a server to become available

Which is why in my real tests have both a waitServer (which uses polling) and a stopServer (which uses a HTTP head request + a small timeout), which has proven to be rock stable (even on Windows).

The checking of stderr will not work for a backgrounded task FWIW.

Not that my tests rely on this, but what you say isn't obvious when reading the docs:

If the last token is '&', the program executes in the background. The standard output and standard error of the previous command is cleared, but the output of the background process is buffered — and checking of its exit status is delayed — until the next call to 'wait', 'skip', or 'stop' or the end of the test.
Apply the grep command (see above) to the standard error from the most recent exec or wait command.

I suggest, however, that we instead focus on the issue. The testscript package is incredibly useful, it would be even better if everyone didn't have to create elaborate workarounds to test servers, which is a common enough requirement.

@mvdan
Copy link
Collaborator

mvdan commented May 14, 2023

I agree that we should fix the panics in any case. Whether or not testscript's support for background commands should be redesigned (or avoided?) is a different matter.

@bep
Copy link
Contributor Author

bep commented May 14, 2023

Just to summarise my findings (so people don't have to wade through all of the above to get the gist of it:

  • I suggest leaving the panic as is -- the stack trace is more useful than a unexpected command failure message.
  • I don't think it's possible to get https://github.com/bep/repro-testscript-228 green on Windows by using any variation of Process.Signal(interrupt). There's a ton of research on this on Go's issue tracker, but no great conclusion.

mvdan added a commit to mvdan/go-internal that referenced this issue Jul 9, 2024
When an entire script runs and the end is reached, any background
processes begun with a '&' command get interrupted or killed,
depending on the platform and timeout, and we wait for them to finish.

We also checked their resulting status code and failed if they didn't
exit with a status code of 0. However, as explained in the comment,
this would always fail on Windows, given that it doesn't have interrupt
signals so we would kill directly, causing a "signal: killed" error.

Worse, any failures here caused a `panic: fail now!` as that is how
we bubble up errors when a script command is being run, but such panics
were not being recovered once we reached the end of a script.
Now that we don't check the result anymore here, the panics are gone.

Fixes rogpeppe#228.
Fixes rogpeppe#260.
mvdan added a commit to mvdan/go-internal that referenced this issue Jul 9, 2024
When an entire script runs and the end is reached, any background
processes begun with a '&' command get interrupted or killed,
depending on the platform and timeout, and we wait for them to finish.

We also checked their resulting status code and failed if they didn't
exit with a status code of 0. However, as explained in the comment,
this would always fail on Windows, given that it doesn't have interrupt
signals so we would kill directly, causing a "signal: killed" error.

Worse, any failures here caused a `panic: fail now!` as that is how
we bubble up errors when a script command is being run, but such panics
were not being recovered once we reached the end of a script.
Now that we don't check the result anymore here, the panics are gone.

Fixes rogpeppe#228.
Fixes rogpeppe#260.
@mvdan mvdan closed this as completed in #265 Jul 9, 2024
mvdan added a commit that referenced this issue Jul 9, 2024
When an entire script runs and the end is reached, any background
processes begun with a '&' command get interrupted or killed,
depending on the platform and timeout, and we wait for them to finish.

We also checked their resulting status code and failed if they didn't
exit with a status code of 0. However, as explained in the comment,
this would always fail on Windows, given that it doesn't have interrupt
signals so we would kill directly, causing a "signal: killed" error.

Worse, any failures here caused a `panic: fail now!` as that is how
we bubble up errors when a script command is being run, but such panics
were not being recovered once we reached the end of a script.
Now that we don't check the result anymore here, the panics are gone.

Fixes #228.
Fixes #260.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants