proposal: regexp: add iterator forms of matching methods #61902
Comments
@rsc under "The full list is" the actual function signatures need to be fixed to match the correct names in the doc comments. |
It seems like you are trying to establish a general pattern in the stdlib that the word Which is to say, I think these names would be a lot clearer:
|
@magical I think |
This proposal has been added to the active column of the proposals project |
I have found it quite confusing in the past to figure out which of the 16 |
In addition to the correction @cespare proposes, I believe your func (re *Regexp) AllString(s string) iter.Seq[string] |
I believe it would be beneficial to document in the relevant methods that any yielded slices ( I suspect most use-cases will be unaffected by this restriction, other than to benefit from reduced allocations (which could potentially further benefit from internal use of
|
Emm, so we would have nearly 60 methods for matching? 😂 How about adding a new type |
I agree with @leaxoy. Irrespective of my above comments, I believe we'll be better served with an approach closer to |
I don't believe a separate type is a good idea. The It's unfortunate that the |
Regarding the |
With separate methods, particularly the Index variants, I am concerned that we'd be adding the methods for completeness without gaining much value. FindReaderIndex is of questionable utility, given that when you have a reader that you know nothing about (or know that you can't or don't want to reread, such as io.Stdin), it's rare that you can make meaningful use of the indices alone. There are other cases where having a separate Match type is simply more efficient: the consumer may want a string representation of some submatches while having a byte slice representation of others. With a Match type, the consumer does not need to care what the underlying input was. If a string for a particular submatch is requested, it'll slice or copy a portion of the underlying input data depending on whether or not it was a string, but will be no worse in allocation efficiency compared to what the caller needs to do today. |
@Merovius regarding deviating from the current convention being confusing, I think that's entirely manageable if all of the iter methods are internally self-consistent. Per your point, if we implement the methods as Russ initially proposed and then introduce a separate type, that certainly will be confusing. So this is really the only good time we'll get to make a clean decision. |
@extemporalgenome We seem to be talking past each other. Putting the method on a new type is the deviation. Obviously that can't be addressed by making them "internally self-consistent". And it's also not about timing - it's a deviation if we do it from the beginning just as much. If we never had the |
Is there really a need for these? Do the regexp find methods return enough elements to justify returning an iterator, or is this just to avoid that one slice allocation? |
@Merovius a crazy idea: how about introducing a regexp v2, like math/rand/v2, to simplify the API? |
@leaxoy It doesn't seem that wild to me. I think there is an argument to be made that a) we might want to wait a release or so to see how the But yeah, it's not really up to me. Personally, I think this proposal is fine as it is, but maybe the Go team can be persuaded to do a v2 for @doggedOwl I thought the same thing, TBH. Especially as the matching groups can be sliced out of the input, so just the actual result slice has to be allocated. There are still two ways in which an iterator form arguably might improve performance: 1. it might enable you to prematurely stop some matching work - though this seems to be possible in a corner case at best. And 2. it might enable you to do subgroup matching entirely without allocations. But I'm not totally sold on needing an iterator form of these either. It would be possible to feed the iterator into other iterator transformation functions, but then again, that would be just as possible by using |
Unfortunately, the proposal doesn't say what its benefit is or what problem it is trying to address. It looks more like a demo for iterator functions. The proposal will further inflate the number of methods on the Regexp type. Even now I have to consult the documentation every time I use the package. I would welcome a regexp2 package that simplifies the interface, perhaps by using byte slices and strings as type arguments and supporting only the iterator methods, since the first-match functions would no longer be required. Using a match type could also reduce the number of variants. |
It says
The benefit is that it doesn't build the slice. If you are searching large texts, you almost always want to consider the matches one at a time, and building the slice of all results is wasted memory, potentially larger than the text. |
Finishing this proposal discussion is blocked on #61405. |
This should be unblocked, right? Or is it waiting on the iterators addition to roll out? |
This is a lot of new methods, but it's also very regular and consistent with the existing API. Is there anything still blocking this proposal? |
Every time I use the regexp API I find it hard to remember what all the different unnamed numbers signify. Has anyone prototyped a version of a Match method that returns an iter.Seq of some abstract match data type that provides methods (with informative names!) that can return any information you need about a given match: its indices, its byte or string value (allocating if different from the input string type), its submatch index, and so on? (speaking for proposal committee) |
https://go.dev/cl/643896 provides a sketch of the idea above:
package regexp
// All returns the sequence of matches of the regular expression on
// the input text, which may be a string or a []byte slice.
//
// TODO(adonovan): API:
// - Should we define two variants All([]byte), AllString(string)?
// This is consistent with the String flavors of the existing API.
// - Or use generics: func All[S string|[]byte](S)?
// This means we must forego methods.
func (re *Regexp) All(text any) iter.Seq[Substring]
// A Substring represents a subsequence of an input string or []byte
// slice that matches a regular expression.
type Substring struct { ... }
// String returns the matched substring.
func (s Substring) String() string
// AppendTo appends the matched substring to the provided slice.
func (s Substring) AppendTo(slice []byte) []byte
// Start returns the matched substring's start index,
// relative to the input of [Regexp.All].
func (s Substring) Start() int
// End returns the matched substring's end index,
// relative to the input of [Regexp.All].
func (s Substring) End() int
// NumSubmatch returns the number of submatches.
func (s Substring) NumSubmatch() int
// Submatch returns the ith submatch of the current match.
func (s Substring) Submatch(i int) Substring
Feedback welcome. |
Change https://go.dev/cl/643896 mentions this issue: |
@adonovan One note is, that this API doesn't allow to inspect a match without allocation ( In regards to your question, I would add |
We propose to add methods to regexp that allow iterating over matches instead of having to accumulate all the matches into a slice.
This is one of a collection of proposals updating the standard library for the new 'range over function' feature (#61405). It would only be accepted if that proposal is accepted. See #61897 for a list of related proposals.
Regexp has a lot of methods that return slices of all matches (the “FindAll*” methods). Each should have an iterator equivalent that doesn’t build the slice. They can be named by removing the “Find” prefix. The docs would change as follows. (Plain text is unchanged; strikethrough is removed, bold is added):
Instead of enumerating all eight methods here, let’s just show one example.
FindAllString currently reads:
This would change to become a pair of methods:
The full list is:
There would also be a new SplitSeq method alongside regexp.Regexp.Split, completing the analogy with strings.Split and strings.SplitSeq.