Screen reader accessibility for Ghostty #2351
mikolysz asked this question in Ideas and Issue Triage
Replies: 1 comment
-
I haven't read the full issue text yet, but I really appreciate the writeup @mikolysz and look forward to giving this a shot. I agree that as specific tasks are discovered, we can break them out. I'd prefer not to break them out until someone is actively working on that task, since this encompasses the pending work nicely, I think. I know macOS has a LOT of accessibility APIs, so I think targeting macOS as the primary accessibility target makes sense. Linux can be indefinitely delayed until it's clear what improvements we can make.
-
This is a general issue for tracking the state of Ghostty's screen-reader support. I will try to explain what needs to be done as best as I can, along with some pointers on what to do and what pitfalls to avoid. We might want to create specific issues for the items described here at some point.
Current status
At this time, accessibility support in Ghostty is basically nonexistent. This is mostly due to the fact that Ghostty uses GPU rendering, which prevents screen readers from extracting any information from the app's UI tree as they otherwise would. However, this issue can be worked around in multiple ways (see below). Here's a list of things that need to be done, roughly in this order:
- Speech output: new terminal output needs to be spoken automatically, with a proper speech queue.
- Reviewing terminal contents: the UI state has to be exposed so output can be navigated line by line, word by word, and so on.
- Cursor movement: programs that move the cursor (editors, menus, readline prompts) need the right thing announced when the cursor moves.

This list is by no means exhaustive; it's merely a baseline of features which are absolutely essential for Ghostty to be AT-friendly. Once we're done with these, we can think about nice-to-haves and things that make our lives a lot more pleasant but can be lived without.
Below are a few notes on how those features can be implemented and the tradeoffs involved.
Speech output
Whenever we type a command, we expect the results of said command to be announced by our screen reader. The obvious way to do this is to communicate with the screen readers directly; most of them expose an API for this. Windows has libraries that let us do all of this through a unified API instead of having to implement five or so separate ones. However, most of them would require us to ship (redistributable) DLLs that come from the screen reader vendors, some of which we don't have source code for, as far as I know. On the Mac side, communication with VoiceOver (basically the only screen reader that exists there) can be achieved through AppleScript. I don't know if there's a better way; perhaps we could somehow invoke the actions that apps expose to AppleScript directly, without going through the whole scripting layer? I don't really know what the situation is on the Linux side, as I barely ever use desktop Linux; we'd have to ask somebody more knowledgeable about how Linux GUI accessibility works.
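For concreteness, here's roughly what the AppleScript route looks like on macOS when driven from Swift via NSAppleScript. This is a sketch under the assumption that "Allow VoiceOver to be controlled with AppleScript" is enabled in VoiceOver Utility; the helper name is made up, and a sandboxed app would additionally need the Apple Events automation entitlement and usage description.

```swift
import Foundation

/// Ask VoiceOver to speak a string through its AppleScript interface.
/// Assumes "Allow VoiceOver to be controlled with AppleScript" is enabled
/// in VoiceOver Utility; returns false if the script fails for any reason.
func speakViaVoiceOver(_ text: String) -> Bool {
    // Escape embedded quotes so the text can be spliced into the script source.
    let escaped = text.replacingOccurrences(of: "\"", with: "\\\"")
    let source = "tell application \"VoiceOver\" to output \"\(escaped)\""
    guard let script = NSAppleScript(source: source) else { return false }
    var error: NSDictionary?
    script.executeAndReturnError(&error)
    return error == nil
}
```

As the next paragraph explains, this path is fragile, so it probably shouldn't be the only output mechanism.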
However, this method has a huge drawback, particularly on macOS. On most other platforms, screen readers have some kind of speech queue: when asking a screen reader to speak a particular string, we can usually choose whether the string should be spoken immediately, interrupting all other speech, or pushed to the end of the queue. VoiceOver on the Mac unfortunately doesn't offer this functionality; all new speech interrupts whatever is currently being spoken. A queue is very important for a terminal app, as we're often dealing with programs that produce large quantities of output, often generating new text several times a second. In such a situation, doing things the VoiceOver way would prevent us from understanding anything, as each message would instantly be interrupted by the beginning of the next one. We couldn't even manually review terminal output, as the same issue would occur. The native Terminal app has this problem, which is why the vast majority of people don't use it with VoiceOver. Maybe there would be a way to hack around this with audio trickery, figuring out whether VoiceOver is still speaking and pushing new messages once it's done, but I'm not very hopeful there and I wouldn't recommend that strategy. Another issue is that VoiceOver's AppleScript support needs to be explicitly enabled in VoiceOver Utility, and this setting is very flaky across system updates; it often shows as enabled while it actually isn't and has to be re-enabled manually.
One way to mitigate this problem is to interact with the system speech APIs directly, completely bypassing the screen reader. Those APIs do support queuing, and in the worst case we can always request raw audio buffers and play them in whatever way we see fit. This is what I'd recommend on the Mac; there's really no other workable solution here. However, this approach solves one big problem while introducing two smaller ones: voice handling and punctuation schemas.
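As a sketch of what "bypass the screen reader" could mean on macOS, here's a minimal wrapper around AVSpeechSynthesizer, which queues utterances by default. The TerminalSpeaker name is a placeholder for illustration, not existing Ghostty code.

```swift
import AVFoundation

/// Speaks terminal output through the system synthesizer, bypassing VoiceOver.
/// AVSpeechSynthesizer queues utterances by default, which is exactly the
/// behaviour VoiceOver's AppleScript interface lacks.
final class TerminalSpeaker {
    private let synthesizer = AVSpeechSynthesizer()

    /// Append a chunk of new output to the end of the speech queue.
    func enqueue(_ text: String) {
        let utterance = AVSpeechUtterance(string: text)
        utterance.rate = AVSpeechUtteranceDefaultSpeechRate
        synthesizer.speak(utterance)  // queued behind anything already speaking
    }

    /// Drop everything queued, e.g. when the user presses a "silence speech" key.
    func interrupt() {
        synthesizer.stopSpeaking(at: .immediate)
    }
}
```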
If we're using the system speech API, we have to allow the user to adjust their voice parameters, such as the actual voice to use, the speech rate, pitch, volume, language, etc. We can implement configuration for this, forcing the user to set the parameters themselves, but a more interesting approach would be to extract that information directly from the screen reader, leaving the user completely unaware that we're actually generating our speech without its help. This would be difficult on macOS 13 and would require us to parse VoiceOver's plist files, but Sonoma will expose an API for this. Perhaps we can hold off for now and just use the default system voice, then switch to that API once Sonoma is stable. To be clear, we could probably find a library that does all of this for us, including cross-platform support; I'd have to check what's good these days.
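To illustrate the configuration fallback, here's a sketch that maps hypothetical speech settings onto an utterance. None of these option names exist in Ghostty today; a real implementation might instead pull the values from VoiceOver once the Sonoma API is available.

```swift
import AVFoundation

/// Hypothetical user-facing speech settings; the field names are invented
/// purely to illustrate the shape of such a configuration.
struct SpeechConfig {
    var voiceIdentifier: String?   // an installed AVSpeechSynthesisVoice identifier
    var rate: Float = AVSpeechUtteranceDefaultSpeechRate
    var pitch: Float = 1.0
    var volume: Float = 1.0
}

func makeUtterance(_ text: String, config: SpeechConfig) -> AVSpeechUtterance {
    let utterance = AVSpeechUtterance(string: text)
    // Fall back to the default system voice when nothing is configured or the
    // configured voice is not installed on this machine.
    if let id = config.voiceIdentifier,
       let voice = AVSpeechSynthesisVoice(identifier: id) {
        utterance.voice = voice
    }
    utterance.rate = config.rate
    utterance.pitchMultiplier = config.pitch
    utterance.volume = config.volume
    return utterance
}
```

AVSpeechSynthesisVoice.speechVoices() can enumerate the installed voices if we ever want to surface a picker instead of a raw identifier string.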
Another issue is punctuation handling. Most people who use terminals interact with code, so they want punctuation characters to be announced, something the system speech API won't do by default. We'd basically have to implement punctuation handling ourselves and replace characters like ":" and "?" with words like "colon" and "question". The crude way to do this is with a static mapping; however, that doesn't let the user change the character descriptions, for example when their text is often in another language and they use a non-English voice, or when they care about efficiency and want "is" instead of "equals" or "score" instead of "underscore". Again, we can either implement configuration for this or parse the SQLite files where VoiceOver's punctuation schemas are kept.
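Here's a minimal sketch of the static-mapping approach, with the obvious caveat that the table should really come from user configuration (or VoiceOver's schema files) rather than being hard-coded like this.

```swift
/// Crude, hard-coded punctuation expansion. A real implementation would load
/// this table from configuration so descriptions can be customised
/// ("is" instead of "equals", another language, and so on).
let punctuationNames: [Character: String] = [
    ":": "colon", "?": "question", "=": "equals",
    "_": "underscore", "(": "left paren", ")": "right paren",
]

/// Replace known punctuation characters with their spoken names before the
/// text is handed to the speech synthesizer.
func expandPunctuation(_ text: String) -> String {
    var out = ""
    for ch in text {
        if let name = punctuationNames[ch] {
            out += " \(name) "
        } else {
            out.append(ch)
        }
    }
    return out
}
```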
UI state
Linear speech output isn't everything; we often also need to review the output history in detail. Complicated pieces of code can't be understood all at once, so there has to be a way for us to navigate line by line, word by word, etc., hearing the content we move past. Once again, there are two ways to go about this:
Exposing the UI state to platform accessibility APIs
As I said above, Ghostty doesn't currently expose anything to the platform APIs because of the GPU rendering. Those APIs expect a tree of native widgets that are accessible out of the box, which Ghostty doesn't use. However, it's not the first project with this problem; the same hurdle has been faced by third-party GUI frameworks like Qt or Java's Swing, as well as by browser engines. For that reason, platforms let you expose accessibility objects even for things that aren't native widgets, which is exactly what those frameworks and browser engines do. We could expose the terminal contents this way. There might be a performance impact, but there are APIs to detect whether assistive technology is actually consuming this information, so we wouldn't need to do the extra work for fully-abled users. Until recently, implementing something like this required doing the work separately for each platform, dealing with COM interfaces on Windows and the under-documented AT-SPI on Linux. However, there's now a library that abstracts those platform APIs behind a common interface. The library is quite new, and I'm not sure if it implements everything we need, but it's already used by one somewhat popular Rust UI framework. It exposes a C API but is written in Rust; I'm not sure whether that's acceptable to us and what impact it would have on our build process.
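To make the idea concrete on macOS, here's a rough sketch of exposing a GPU-rendered surface view as a read-only text area through the NSAccessibility overrides on NSView. The view class and visibleText() helper are hypothetical stand-ins for Ghostty's real renderer view, not its actual code.

```swift
import AppKit

/// Hypothetical GPU-backed terminal surface view. The pixels are drawn by the
/// renderer, but we can still describe the visible grid to the accessibility
/// tree as a read-only text area.
final class TerminalSurfaceView: NSView {
    /// Placeholder for whatever produces the currently visible screen contents.
    func visibleText() -> String { "" }

    override func isAccessibilityElement() -> Bool { true }

    override func accessibilityRole() -> NSAccessibility.Role? { .textArea }

    override func accessibilityValue() -> Any? { visibleText() }

    override func accessibilityNumberOfCharacters() -> Int {
        visibleText().count
    }

    /// Call whenever the screen contents change so assistive technology
    /// is told to re-query the value.
    func contentsDidChange() {
        NSAccessibility.post(element: self, notification: .valueChanged)
    }
}
```

macOS also offers NSWorkspace.shared.isVoiceOverEnabled, which is one way to skip this bookkeeping entirely when no assistive technology is running.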
This approach would be preferable, as it lets users use the screen reader commands they already know. It also exposes the terminal contents to non-screen-reader assistive technologies and to scripting and automation apps such as Hammerspoon on the Mac and AutoIt on Windows, and it allows reviewing output on refreshable braille displays, something that would be appreciated by users who are also hard of hearing or who simply prefer to work that way. Screen readers need to get the terminal contents from the native APIs if we want braille displays to work.
This way of doing things would also let us simplify how new output is delivered to the screen reader (at least outside of macOS). Most platforms provide a way to mark a text widget as a "live region", indicating to the screen reader that any new content that appears should be spoken automatically. VoiceOver on the Mac still has the "no speech queue" problem, so system speech seems to be the only way there. Outside of macOS, however, this approach might even let us dispose of a speak_string function completely.
Implementing a micro screen-reader
Instead of exposing content to the accessibility APIs, we can fully lean on the "speak output" function explained in the previous section. We can implement commands like "next line" and "previous character" manually, give them custom keyboard shortcuts, figure out what needs to be said (probably the line/word/character we just passed) and speak the constructed message ourselves. This is how VS Code's terminal accessibility is implemented, although in a really terrible way; the same goes for Warp, which is even worse. However, many terminal screen readers do this successfully. This approach doesn't provide braille support, doesn't work well with other AT software, and is generally clunkier and more hacky. I don't know which way would require more work: exposing terminal contents seems like a daunting task, but the micro screen-reader would require us to implement scrolling, selection, copying, announcing line indentation, skipping repeated punctuation and other such features manually, while existing screen readers give them to us for free.
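For illustration, here's what the review-cursor bookkeeping for such a micro screen-reader could look like. The type and command names are invented; the real thing would read directly from Ghostty's terminal state and feed the speech layer described earlier.

```swift
/// Minimal review cursor over a snapshot of the terminal contents, in the
/// spirit of the "micro screen-reader" approach. Everything here is
/// illustrative rather than a description of any existing implementation.
struct ReviewCursor {
    var lines: [String]   // snapshot of the visible screen plus scrollback
    var row = 0
    var column = 0

    /// "Next line" command: move down one line and return what to speak.
    mutating func nextLine() -> String {
        guard !lines.isEmpty else { return "blank" }
        row = min(row + 1, lines.count - 1)
        column = 0
        let line = lines[row]
        return line.isEmpty ? "blank" : line
    }

    /// "Previous character" command: move left one cell and return it.
    mutating func previousCharacter() -> String {
        guard !lines.isEmpty else { return "blank" }
        column = max(column - 1, 0)
        let chars = Array(lines[row])
        return column < chars.count ? String(chars[column]) : "blank"
    }
}
```

Even this toy version hints at the amount of manual work involved: selection, copying, indentation reporting and the rest would all have to be layered on top by hand.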
Cursor movement
The next problem we need to tackle is software that relies on cursor navigation, such as text editors, ncurses menus or even readline prompts. Whenever the cursor moves, we need to announce the right entity, such as the line we moved to or the character we moved past. This isn't as easy as it seems: it's not always obvious when cursor movement is due to a user action and when it's due to the program doing something on its own. Some software resets the cursor position after it's done displaying output, and we should take care not to suppress speech in that case. We can rely on arrow keys to detect when movement is user-initiated, but then software like Vim (with hjkl navigation) or even the common readline shortcuts become problematic. Most screen readers (at least NVDA and TDSR) watch the arrow keys and simply read whatever entity the cursor is on after a timeout, regardless of whether it moved or not, but this falls apart when SSH servers are involved: when the timeout is too short and the server is too slow, the wrong thing is read, but when the timeout is too long, navigating through a file feels extremely frustrating. It would be great to innovate in this area and find a better solution.
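To make the trade-off concrete, here's a sketch of the timeout heuristic those screen readers use. The delay value and names are placeholders; the point is mostly to show why a single fixed delay can't be right for both local and slow remote sessions.

```swift
import Foundation

/// Echoes the line under the cursor a short time after a navigation key,
/// roughly the strategy NVDA and TDSR use today. The fixed delay is the weak
/// point: too short and a slow SSH session reads the stale line, too long and
/// local navigation feels sluggish.
final class CursorEchoer {
    private var pending: DispatchWorkItem?
    private let delay: TimeInterval = 0.15  // arbitrary placeholder value

    /// Call when a key that is expected to move the cursor was pressed.
    /// `lineUnderCursor` should re-read the terminal state when it fires, so
    /// the line spoken reflects whatever the application had time to draw.
    func keyPressed(speak: @escaping (String) -> Void,
                    lineUnderCursor: @escaping () -> String) {
        pending?.cancel()  // coalesce rapid key presses into one announcement
        let work = DispatchWorkItem { speak(lineUnderCursor()) }
        pending = work
        DispatchQueue.main.asyncAfter(deadline: .now() + delay, execute: work)
    }
}
```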
As I said above, there's more to be done; the area of terminal accessibility is woefully underexplored, and due to the prevalence of GUI platforms, some ideas common to 80s DOS screen readers are pretty much forgotten now. However, if we can get the things described above implemented, we'd be competitive with most of the Windows solutions, better than what we have on the Mac, and still a good bit behind what you can get in a true non-GUI, kernel-based Linux TTY.