Regex::replace* methods taking &str rather than Cow<str> harms efficiency #676

chris-morgan · 2020-05-11T08:38:33Z

The Regex::replace* methods return a Cow<str> for efficiency, but can't take a Cow<str>. This harms efficiency in cases like multiple sequential regular expressions:

|text: &str /* or Cow<str> even */| -> Cow<str> {
    let text = regex1.replace(text, replacement);
    let text = regex2.replace(text, replacement);
    text
}

Instead you’re stuck with this which necessarily wastes an allocation:

|text: &str| -> String {
    let text = regex1.replace(text, replacement);
    let text = regex2.replace(&text, replacement);
    text.into_owned()
}

The text was updated successfully, but these errors were encountered:

RReverser · 2020-05-11T10:55:45Z

Note that, as a workaround, you can write own helper to expand Cow:

fn replace_cow<'a>(cow: Cow<'a, str>, regex: &Regex, replacement: &str) -> Cow<'a, str> {
    match cow {
        Cow::Borrowed(s) => regex.replace(s, replacement),
        Cow::Owned(s) => Cow::Owned(regex.replace(&s, replacement).into_owned()),
    }
}

BurntSushi · 2020-05-11T13:46:19Z

@chris-morgan Could you please post a complete example that compiles? Along with your desired variant that doesn't compile?

It would also help if you could suggest your desired fix.

chris-morgan · 2020-05-11T16:12:53Z

@RReverser That still does an unnecessary clone in the owned case when there are no matches.

@BurntSushi Full trivial example that makes an unnecessary allocation in order to compile (otherwise beth references local variable aleph):

use std::borrow::Cow;

fn sub(input: &str) -> String {
    let regex = regex::Regex::new("f").unwrap();
    let aleph = regex.replace(input, "m");
    let beth = regex.replace(&aleph, "x");
    beth.into_owned()
}

fn main() {
    println!("{}", sub("foo"));
}

What I would like to work, and which would make no allocations in the second replace call:

use std::borrow::Cow;

fn sub(input: &str) -> Cow<str> {
    let regex = regex::Regex::new("f").unwrap();
    let aleph = regex.replace(input, "m");
    let beth = regex.replace(aleph, "x");
    beth
}

fn main() {
    println!("{}", sub("foo"));
}

The general nature of the proposed alteration would be that the type of the text parameter to the Regex::replace* methods change from &'t str to impl Into<Cow<'t, str>>. impl Trait is in Rust 1.26, and regex looks to use an MSRV of Rust 1.28, so it should be acceptable. Because there’s already one generic parameter, I don’t believe a second can be added compatibly (can’t do T: Into<Cow<'t, str>> = &'t str on fn bounds to make it optional).

I’m not confident that this would not be a breaking change, because I think it could conceivably mess with inference. If it’s deemed a breaking change, then adding new methods is all I can think of, which is eww.

The changes within the body of replacen would be straightforward: the addition of a let text = text.into();, the changing of two text to &text, and the removal of two Cow::Borrowed(…) wrappings around text.

RReverser · 2020-05-11T16:29:31Z

That still does an unnecessary clone in the owned case when there are no matches.

Do you mean when input is already a Cow::Owned? Then no, it doesn't. Note that into_owned doesn't create a clone in that case, so original data will be returned back.

chris-morgan · 2020-05-11T16:34:59Z

Yes it does: &s, you only passed an &str into replace, so it returned a Cow::Borrowed.

RReverser · 2020-05-11T16:50:33Z

Right, good point, the second case also needs to be expanded.

FWIW a very similar problem is the reason I've created cow-utils. While it's not strictly Regex-oriented, you can use its cow_replace method with a Regex on nightly thanks to the Pattern API.

If there is enough interest, it should be quite easy to add a special feature to support regex on stable Rust as well.

RReverser · 2020-05-11T16:52:15Z

One limitation is that it still intentionally accepts only &str.

The reason for that is that consuming Cow can harm efficiency of other edge cases, where right now it can reuse slice of intermediate strings instead.

BurntSushi · 2020-05-11T16:57:00Z

I see. Thanks for elaborating. That makes things a lot clearer.

My general opinion on Cow is honestly pretty low. I occasionally regret even making replace return a Cow in the first place. But it is decently motivated: the case when no replacements occur is pretty common. My distaste for Cow is mostly that it is a somewhat unergonomic structure, and has subtleties to it (like getting a &str from a Cow<'a, str> doesn't mean your &str has a lifetime of 'a). Using Cow also results in users having to pretty commonly tack on an into_owned() at the end of every replace call, which I find somewhat annoying.

I think the suggestion of impl Into<Cow> is probably not plausible. Firstly, making another part of replace generic is really not something I want to do. It adds a lot of complexity to the type signature of what really should be a simple function. Secondly, my guess is that it would be a significant breaking change. For example, this code compiles when using &'t str, but using impl<Into<Cow<'t, str>>> causes it to break. I would expect there to be more issues here, because of inference.

In terms of non-breaking changes, I think that only leaves us with adding new replacement APIs. As you hinted at, I also find this unsatisfactory. I think there are probably already too many replacement APIs. Adding more for a fairly specialized use case would be unfortunate. Personally, if I were going to add more replacement APIs, then I think I would ditch Cow altogether and go with a simpler API like the one in the aho-corasick crate:

fn replace_with(&self, haystack: &str, replacement: R, dst: &mut String)

This way, you don't need to mess with Cow at all. Unfortunately, the problem with this approach is that if there are no matches, then you still need to copy everything to dst. Which does stink.

With all of that said, I actually think your solution is much simpler: don't use regex's replace APIs. Implementing your own replace should be pretty easy. It should only be a few lines of code.

BurntSushi · 2020-06-09T17:59:03Z

I'm going to close this since it's partially not feasible because of it being a breaking change, and I also personally find the increased complexity of the API not worth it.

damooo · 2021-11-14T05:26:35Z

@chris-morgan ,
We can handle this case of sequential substitution, with-no-allocation-if-no-match using RegexSet.
We can first match with RegexSet, and if there are no matches, just return cow<str>. If there are matches, then perform sequential-replacements. In the case of match, last allocation is any-way-necessary.

xkr47 · 2022-11-28T11:03:34Z

Here is a workaround, based on rust-lang/rust#65143 (comment) — it's ugly because it relies on the implementation detail that a borrowed string currently only ever returned when nothing was replaced — in theory replacing beginning or end of a string with empty could return a borrowed substring of the original but feel pretty confident that optimizing for these cases is never going to happen in the replace* functions.

let orig = String::from("Hello");
let result = regex.replace(&orig, "foo");
let result = if let Cow::Owned(_) = result {
    result.into_owned()
} else {
    orig
};

As I see it, the if expression can't be extracted to a helper function because the borrowed orig is referenced in result and thus its ownership cannot also be passed as an argument to that helper function. But a macro could surely do the trick.

Applied for the earlier example by @chris-morgan at #676 (comment) assuming I understood the request correctly:

fn sub(input: &str) -> String {
    let regex = regex::Regex::new("f").unwrap();
    let aleph = regex.replace(input, "m").into_owned();
    let beth = regex.replace(&aleph, "x");
    if let Cow::Owned(_) = beth {
        beth.into_owned()
    } else {
        aleph
    }
}

chris-morgan · 2022-11-28T12:12:15Z

Good thinking, I like that workaround. As written, you’ve ended up with an unnecessary .into_owned() there—you already had the String by destructuring, you just discarded it instead of binding and using it. I’d also then switch to using match, which I find clearer (and shorter in conventional formatting):

let result = match result {
    Cow::Owned(result) => result,
    Cow::Borrowed(_) => orig,
};

As you say, the if expression can’t be extracted—but you can extract it along with the replace call. and so here’s an extension trait that defines new variants of replace, replace_all and replacen that take Cow<str> instead of &str:

use std::borrow::Cow;
use regex::{Regex, Replacer};

/// Extension methods for `Regex` that operate on `Cow<str>` instead of `&str`.
pub trait RegexCowExt {
    /// [`Regex::replace`], but taking text as `Cow<str>` instead of `&str`.
    fn replace_cow<'t, R: Replacer>(&self, text: Cow<'t, str>, rep: R) -> Cow<'t, str>;

    /// [`Regex::replace_all`], but taking text as `Cow<str>` instead of `&str`.
    fn replace_all_cow<'t, R: Replacer>(&self, text: Cow<'t, str>, rep: R) -> Cow<'t, str>;

    /// [`Regex::replacen`], but taking text as `Cow<str>` instead of `&str`.
    fn replacen_cow<'t, R: Replacer>(&self, text: Cow<'t, str>, limit: usize, rep: R) -> Cow<'t, str>;
}

impl RegexCowExt for Regex {
    fn replace_cow<'t, R: Replacer>(&self, text: Cow<'t, str>, rep: R) -> Cow<'t, str> {
        match self.replace(&text, rep) {
            Cow::Owned(result) => Cow::Owned(result),
            Cow::Borrowed(_) => text,
        }
    }

    fn replace_all_cow<'t, R: Replacer>(&self, text: Cow<'t, str>, rep: R) -> Cow<'t, str> {
        match self.replace_all(&text, rep) {
            Cow::Owned(result) => Cow::Owned(result),
            Cow::Borrowed(_) => text,
        }
    }

    fn replacen_cow<'t, R: Replacer>(&self, text: Cow<'t, str>, limit: usize, rep: R) -> Cow<'t, str> {
        match self.replacen(&text, limit, rep) {
            Cow::Owned(result) => Cow::Owned(result),
            Cow::Borrowed(_) => text,
        }
    }
}

And sample usage (playground) with my earlier example plus returning Cow<str> because why not:

use path::to::RegexCowExt;  // if not in the same module

fn sub(input: &str) -> Cow<str> {
    let regex = regex::Regex::new("f").unwrap();
    let aleph = regex.replace(input, "m");
    let beth = regex.replace_cow(aleph, "x");
    beth
}

fn main() {
    println!("{}", sub("foo"));
}

A few possible options here:

Land these as full regular methods. Fairly sure Andrew won’t want that.
Land this as an extension trait in the regex crate. Doubt Andrew will go for that either, but it’d be nice.
Publish the extension trait as a separate crate, regex-cow or such. Can’t be bothered, personally.
Just leave the code here so anyone that wants it can schnaffle it. (License: I reckon it’s ineligible for copyright protection, it’s glue code with no meaningful variation possible, but I’ll offer it under the same terms as regex crate if you want.)

Option four it is, so far as I’m concerned!

xkr47 · 2022-11-28T12:40:11Z

I thought the contents of the Owned and Borrowed variants are not pub so they couldn't be extracted.. but apparently it works, thanks. I love your RegexCowExt! ❤️

chris-morgan · 2022-11-28T12:47:29Z

Enum variants and their fields are always public. The closest you get to private is #[doc(hidden)]. If it was private like you can do with structs, you couldn’t even write _there.

(Also if it was private Rust wouldn’t tell you what the fields were at all, only that there were other private fields.)

ilyagr · 2024-07-12T20:35:59Z

Re @chris-morgan 's #676 (comment):

I wonder if something like this, using std::ptr:eq, is correct/practical/safer (haven't tested it yet):

    fn replace_all_cow<'t, R: Replacer>(&self, text: Cow<'t, str>, rep: R) -> Cow<'t, str> {
        match self.replace_all(&text, rep) {
            Cow::Borrowed(borrow) if std::ptr::eq(borrow, &text) => text,
            other => other
        }
    }

This could be extended to convert any function &T -> Cow<T> into Cow<T> -> Cow<T> (if it works).

BurntSushi added the question label May 11, 2020

BurntSushi closed this as completed Jun 9, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regex::replace* methods taking &str rather than Cow<str> harms efficiency #676

Regex::replace* methods taking &str rather than Cow<str> harms efficiency #676

chris-morgan commented May 11, 2020

RReverser commented May 11, 2020

BurntSushi commented May 11, 2020

chris-morgan commented May 11, 2020

RReverser commented May 11, 2020

chris-morgan commented May 11, 2020

RReverser commented May 11, 2020

RReverser commented May 11, 2020

BurntSushi commented May 11, 2020

BurntSushi commented Jun 9, 2020

damooo commented Nov 14, 2021

xkr47 commented Nov 28, 2022

chris-morgan commented Nov 28, 2022 •

edited

Loading

xkr47 commented Nov 28, 2022 •

edited

Loading

chris-morgan commented Nov 28, 2022 •

edited

Loading

ilyagr commented Jul 12, 2024 •

edited

Loading

Regex::replace* methods taking &str rather than Cow<str> harms efficiency #676

Regex::replace* methods taking &str rather than Cow<str> harms efficiency #676

Comments

chris-morgan commented May 11, 2020

RReverser commented May 11, 2020

BurntSushi commented May 11, 2020

chris-morgan commented May 11, 2020

RReverser commented May 11, 2020

chris-morgan commented May 11, 2020

RReverser commented May 11, 2020

RReverser commented May 11, 2020

BurntSushi commented May 11, 2020

BurntSushi commented Jun 9, 2020

damooo commented Nov 14, 2021

xkr47 commented Nov 28, 2022

chris-morgan commented Nov 28, 2022 • edited Loading

xkr47 commented Nov 28, 2022 • edited Loading

chris-morgan commented Nov 28, 2022 • edited Loading

ilyagr commented Jul 12, 2024 • edited Loading

chris-morgan commented Nov 28, 2022 •

edited

Loading

xkr47 commented Nov 28, 2022 •

edited

Loading

chris-morgan commented Nov 28, 2022 •

edited

Loading

ilyagr commented Jul 12, 2024 •

edited

Loading