Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Idea for an efficient string representation #238

Open
guregu opened this issue Jul 23, 2022 · 11 comments
Open

Idea for an efficient string representation #238

guregu opened this issue Jul 23, 2022 · 11 comments

Comments

@guregu
Copy link
Contributor

guregu commented Jul 23, 2022

I had a crazy idea recently. Not sure if it would work, but it could maybe be a relatively low-effort way to get faster strings.

The gist is:

  • Represent strings as []rune
  • Abuse negative rune values for special values
    • e.g. rune(-1) could mean [], rune(-5) could mean variable Var("_5")

This would let us take advantage of Go's slices for a lot of common list operations.

Potential issues I could see:

  • variable IDs are uint64 so there's not enough space to represent them with int32
  • lots of special casing in all the builtin functions probably required (maybe easier now with the nice iterators?)
  • when and where should this be applied? I think it could be used for double-quoted literals at least

Maybe a slightly less crazy representation would be:

type String struct {
   data []rune
   vars []Var
}

and store the variable ID as a local index instead of the global ID, so rune(-5) would refer to vars[5] or something like that.

BTW, this project got a shoutout in @triska's latest video! https://youtu.be/CvLsVfq6cks?t=4625
Congrats 🎉

@ichiban
Copy link
Owner

ichiban commented Jul 24, 2022

An efficient string representation will be a huge improvement! Currently, we have one Term representation for each value for simplicity. By introducing specialized representations, we have plenty of room for optimization.

We might also be able to achieve an efficient string representation by introducing 2 specialized Term types, namely Char and List.

type Char rune
var _ Term = Char(0) // Behaves as if it's a single character Atom.
type List []Term
var _ Term = List(nil) // Behaves as if it's a `./2` or `Atom("[]")`.

Let's see how other implementations implement strings.

Other than strings, BigInteger for unbounded integers is along the lines of specialized representations. #191 (comment)

@ichiban
Copy link
Owner

ichiban commented Aug 2, 2022

I've been thinking about this. It could be better if we generalize compound terms, not lists.

type Compound interface {
	Term
	Arg(n int) Term
}

type Term interface {
	...
	PrincipalFunctor() Compound // e.g. '.'/2
}

That way, we should be able to make any structures, even string, a Compound.

type CharacterList string
var _ Compound = CharacterList("")

@UWN
Copy link

UWN commented Aug 2, 2022

Did you consider partial strings? Here is an explanation.

@ichiban ichiban mentioned this issue Aug 13, 2022
@ichiban
Copy link
Owner

ichiban commented Aug 28, 2022

As of v0.11.0, any Go values that implement engine.Compound are compound terms.
This allows us to introduce efficient internal representations for lists and strings.

type Compound interface {
	Functor() Atom
	Arity() int
	Arg(n int) Term
}

v0.11.0 is shipped with four efficient internal representations, namely list, charList, codeList, and partial but not limited to them. You can implement one and return it from your custom predicate.

list

list is an efficient representation for a list of arbitrary terms. It's just a Go slice.

$ $(go env GOPATH)/bin/1pl
Top level for ichiban/prolog v0.11.0
This is for testing purposes only!
See https://github.com/ichiban/prolog for more details.
Type Ctrl-C or 'halt.' to exit.
?- X = [a, b, c], go_string(X, S).
X = [a,b,c],
S = 'engine.list{"a", "b", "c"}'.
type list []Term

charList

charList is an efficient representation for a list of one-char atoms. It's just a Go string.

$ $(go env GOPATH)/bin/1pl
Top level for ichiban/prolog v0.11.0
This is for testing purposes only!
See https://github.com/ichiban/prolog for more details.
Type Ctrl-C or 'halt.' to exit.
?- X = "abc", go_string(X, S).
X = [a,b,c],
S = '"abc"'.
type charList string

codeList

codeList is an efficient representation for a list of character codes. It's just a Go string.

$ $(go env GOPATH)/bin/1pl
Top level for ichiban/prolog v0.11.0
This is for testing purposes only!
See https://github.com/ichiban/prolog for more details.
Type Ctrl-C or 'halt.' to exit.
?- set_prolog_flag(double_quotes, codes).
true.
?- X = "abc", go_string(X, S).
X = [97,98,99],
S = '"abc"'.
type codeList string

partial

partial is an efficient representation for a partial list. It's a Go struct with a base Compound and a tail Term. The base Compound must be a list without any variables in its spine.

$ $(go env GOPATH)/bin/1pl
Top level for ichiban/prolog v0.11.0
This is for testing purposes only!
See https://github.com/ichiban/prolog for more details.
Type Ctrl-C or 'halt.' to exit.
?- append("abc", X, L), go_string(L, S).
X = X,
L = [a,b,c|X],
S = 'engine.partial{Compound:"abc", tail:X}'.
type partial struct {
	Compound
	tail Term
}

@triska
Copy link

triska commented Aug 28, 2022

Wow, this sounds amazing, thank you a lot for working on this!

Two small points I noticed:

  1. Ideally, when writing X = [a, b, c], the system should detect that this can be efficiently represented as a charList, exactly as if it were written as X = "abc" (with double_quotes set to chars), and use a string instead of a slice in this case.

  2. Personally, I do not think that a dedicated specialized representation for codeList is worth supporting: Application programmers can simply use lists of characters, which yields more readable programs and answers.

@UWN
Copy link

UWN commented Aug 28, 2022

How expensive is

?- Given_long_chars = [C|Rest].

You are supporting the '\0\'-character so I wonder: Is this operation constant? Does this operation produce new auxiliary heapstructures? It seems this is necessary for you.

In Scryer, partial strings are constant for above operation and do not involve any extra space.

@ichiban
Copy link
Owner

ichiban commented Aug 29, 2022

@triska Thank you!

  1. I wonder if a Prolog programmer would write [a, b, c] to mean a string in practice. They might be emphasizing its list aspect- expecting O(1) access to the elements which we haven't optimized yet. If that's the case, charList is not an efficient representation. What do you think?

  2. The cost of supporting codeList is quite cheap and it simplifies atom_codes/2 and the parser. So, I think it's still worth supporting them even if it's not beneficial to the real users.

@ichiban
Copy link
Owner

ichiban commented Aug 29, 2022

@UWN It's constant. A Go string is a slice of bytes and the slice points to a section of the underlying array of bytes.

In Go, a string is in effect a read-only slice of bytes.
https://go.dev/blog/strings

// https://github.com/golang/go/blob/846c378b8c0cebd2d8522a5693b45ca95b018a78/src/runtime/slice.go#L15

type slice struct {
	array unsafe.Pointer
	len   int
	cap   int
}

So, Rest doesn't require a new array in the heap but just a new slice that points to the section of the same underlying array that is one character narrower than Given_long_chars.

@UWN
Copy link

UWN commented Aug 29, 2022

OK, but where does that slice struct get stored? Does it fit into a single word?

@ichiban
Copy link
Owner

ichiban commented Aug 29, 2022

It's stored in the heap in this case but it's not because it's a slice. In Go, the compiler decides whether an object gets stored in the heap or the stack and it always stores interface{} objects in the heap, which we used to abstract Compound.

The compiler will store multi-word objects like slice (3 words + padding) in the stack if the conditions are met.

@UWN
Copy link

UWN commented Aug 30, 2022

Looks good. So these details are just in the compiler and not in your code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants