Language: Go
Package bufrr provides a buffered rune reader, with both PeekRune and UnreadRune. It takes an io.Reader providing the source, buffers it by wrapping with a bufio.Reader, and creates a new Reader implementing the bufrr.RunePeeker interface (an io.RuneScanner interface plus an additional PeekRune method).
Additionally, bufrr.Reader translates the io.EOF error into an invalid rune value of -1 (defined as bufrr.EOF).
Internally, bufrr.Reader is a bufio.Reader plus a single-rune peek buffer and a single-rune unread buffer.
import (
"github.com/SteelSeries/bufrr"
"strings"
)
func ExampleBufrr() {
// example input
in := strings.NewReader("abc")
// construct buffered rune reader
buf := bufrr.NewReader(in)
var err error
var r, p rune
// common sequence of operations when lexing an awkward grammar
r, _, err = buf.ReadRune()
// [...]
p, _, err = buf.PeekRune()
// [...]
err = buf.UnreadRune()
// [...]
}
When writing Unicode/UTF-8 parsers/lexers/tokenizers in Go, it is preferable to work with the higher-level native rune type instead of []byte.
A common sequence of operations that a tokenizer performs on its input stream is:
- next (read)
- peek (look-ahead)
- backup (unread)
Requirement: a simple API providing ReadRune(), PeekRune() and UnreadRune().
- bufio.Reader has ReadRune and UnreadRune -- but no PeekRune (it has Peek, but that returns bytes rather than a rune). Furthermore, under certain conditions, bufio.Reader seems to have some unexpected behaviour when combining peeks with unreads.
- scanner.Scanner (from text/scanner) is rune-based, with Next and Peek -- but no Unread.
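One documented interaction that may be what the first point refers to: in the standard library, calling Peek on a bufio.Reader prevents UnreadRune from succeeding until the next read. The sketch below uses only bufio and strings; the helper name unreadFailsAfterPeek is made up for illustration.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"strings"
)

// unreadFailsAfterPeek reads one rune, peeks at the next byte, and then
// tries to unread the rune. It reports whether UnreadRune was rejected
// with bufio.ErrInvalidUnreadRune.
func unreadFailsAfterPeek(input string) bool {
	br := bufio.NewReader(strings.NewReader(input))
	if _, _, err := br.ReadRune(); err != nil {
		return false
	}
	if _, err := br.Peek(1); err != nil {
		return false
	}
	// Peek is not treated as a read operation, so the reader no longer
	// remembers the rune that was just read.
	return errors.Is(br.UnreadRune(), bufio.ErrInvalidUnreadRune)
}

func main() {
	fmt.Println(unreadFailsAfterPeek("abc"))
}
```

A tokenizer that does read, peek, unread in that order therefore cannot rely on bufio.Reader alone.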
I considered adding PeekRune() to bufio.Reader, as the easiest option. But once I got halfway through the implementation I realised there were some edge cases where things became trickier than I'd expected (due to bufio.Reader's current implementation).
I considered adding Unread() to scanner.Scanner, but decided this would introduce unnecessary complexity. scanner.Scanner is also higher-level than needed and carries functionality I don't require; implementing a tokenizer on top of it would duplicate too much of what it already does.
After all this, I finally decided the easiest option was to implement a simple wrapper for bufio.Reader with the functionality I needed - it was the least amount of work I could do: my API requirement is only 3 methods.
As two of my methods are already covered by the io.RuneScanner interface, the bufrr.RunePeeker interface simply extends this with the addition of a PeekRune() method.
Why bufio.Reader? Tokenizers usually work over a buffered input stream: supporting both peek and unread implies at least a minimal amount of buffering (two runes), and buffered I/O is generally a good thing anyway.
An eventual end-of-file is an expected condition when parsing, lexing or tokenizing. Therefore, representing EOF as a token/marker in the rune stream, distinct from any error conditions encountered while reading the stream, is preferable, and leads to cleaner client code.
To this end, when bufrr.Reader reaches EOF, both ReadRune() and PeekRune() will return an invalid rune value of -1 (defined as bufrr.EOF), and will never return an io.EOF error.
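To illustrate why this leads to cleaner client code, here is a minimal sketch of the same idea built directly on bufio.Reader. It is not bufrr's implementation: the eof constant, the readRune helper and the collect function are hypothetical names, and returning a width of 0 at end of input is an assumption.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// eof mirrors the -1 sentinel that the text describes as bufrr.EOF.
const eof rune = -1

// readRune is a hypothetical helper: it reports end of input as the eof
// rune with a nil error, and passes any other error through unchanged.
func readRune(br *bufio.Reader) (rune, int, error) {
	r, w, err := br.ReadRune()
	if err == io.EOF {
		return eof, 0, nil // assumption: width 0 at end of input
	}
	return r, w, err
}

// collect reads runes until eof and returns everything it saw.
func collect(input string) ([]rune, error) {
	br := bufio.NewReader(strings.NewReader(input))
	var out []rune
	for {
		r, _, err := readRune(br)
		if err != nil {
			return out, err // a genuine read failure
		}
		if r == eof {
			return out, nil // expected end of input, not an error
		}
		out = append(out, r)
	}
}

func main() {
	runes, err := collect("héllo")
	fmt.Println(string(runes), len(runes), err)
}
```

The loop treats end of input as just another rune value, so the error branch only ever handles real failures.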
Fetch the code:
go get github.com/SteelSeries/bufrr
Import the package into your code:
import (
...
"github.com/SteelSeries/bufrr"
...
)
See autogenerated documentation at: http://godoc.org/github.com/SteelSeries/bufrr
func NewReader(rd io.Reader) *bufrr.Reader
func NewReaderSize(rd io.Reader, size int) *bufrr.Reader
bufrr.Reader implements all the methods of interface bufrr.RunePeeker, namely:
ReadRune() (r rune, w int, err error)
PeekRune() (r rune, w int, err error)
UnreadRune() error
type RunePeeker interface {
io.RuneScanner
PeekRune() (r rune, w int, err error)
}
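Because the interface only adds one method to io.RuneScanner, code can be written against it without depending on a concrete reader. The sketch below repeats the interface locally so that it runs with the standard library alone; naivePeeker, newNaivePeeker and nextIs are hypothetical names, and the adapter is deliberately simplistic rather than a description of how bufrr.Reader works.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// RunePeeker repeats the interface shape given above.
type RunePeeker interface {
	io.RuneScanner
	PeekRune() (r rune, w int, err error)
}

// naivePeeker is a hypothetical adapter that adds PeekRune to any
// io.RuneScanner by embedding it.
type naivePeeker struct {
	io.RuneScanner
}

// PeekRune reads a rune and immediately unreads it.
func (p naivePeeker) PeekRune() (rune, int, error) {
	r, w, err := p.ReadRune()
	if err != nil {
		return r, w, err
	}
	return r, w, p.UnreadRune()
}

// newNaivePeeker wraps a string in the adapter.
func newNaivePeeker(s string) RunePeeker {
	return naivePeeker{strings.NewReader(s)}
}

// nextIs reports whether the upcoming rune equals want, without consuming it.
func nextIs(rp RunePeeker, want rune) bool {
	r, _, err := rp.PeekRune()
	return err == nil && r == want
}

func main() {
	rp := newNaivePeeker("ab")
	fmt.Println(nextIs(rp, 'a'))
	r, _, _ := rp.ReadRune()
	fmt.Println(string(r), nextIs(rp, 'b'))
}
```

This adapter uses up the underlying scanner's single unread slot on every peek, so an UnreadRune issued after a PeekRune fails with strings.Reader. Keeping separate peek and unread buffers, as described at the top of this document, avoids that limitation.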
To run the tests:
cd $GOPATH/src/github.com/SteelSeries/bufrr
go test
The tests could do with improvement. They only cover the basic API functionality and do not test all of the edge cases. That said, the code is well exercised in practice by several file parsers I have written.
Bug reports and pull requests are most welcome!
This work is distributed under the MIT License - see the LICENSE file for details.