Add 'Span::len()' and 'Span::trimmed()' proc-macro methods. #54373

SergioBenitez · 2018-09-20T01:17:11Z

Span::len() is fairly self explanatory. The latter, trimmed(), is necessary to get sub-spans of an existing Span. Rocket uses this, for instance, to point into slices of a string.

rust-highfive · 2018-09-20T01:17:14Z

r? @aturon

(rust_highfive has picked a reviewer for you, use r? to override)

SergioBenitez · 2018-09-20T01:17:26Z

r? @alexcrichton

petrochenkov · 2018-09-20T01:27:30Z

Previous attempt: #53930

alexcrichton · 2018-09-20T20:44:49Z

Thanks for the PR! My original comment on #53930 applies here as well though I think, do you have thoughts on that?

cc @dtolnay, you're likely interested in this as well

SergioBenitez · 2018-09-21T18:29:04Z

Copying over your comment from #53930:

Seems like a plausible API! I think this could definitely work for things like spans of string literals, but how does this work for spans of Group tokens? There's no way to know where subspans are within a larger span, and the integers here are byte positions in the source and the token stream doesn't know much about whitespace?

If I understood this correctly, I think you're saying that trimmed wouldn't be particularly useful for Group tokens since the proc-macro wouldn't know how the tokens map to source code, is that right? If so, I can totally see that, but I wonder if that's necessarily a problem here. I don't know why someone would want to trim such a Span, though perhaps that's your point: we're exposing an interface that's too expressive and thus easy to misuse.

An alternative is to expose len and sub_span methods from Literal and Ident instead. The downside is that libraries like syn must then re-expose these methods from their own AST types.

On another API note, does this handle pointing a byte position to the middle of a multibyte unicode character? Or would that perhaps cause an ICE later down the line?

We should definitely handle ensuring that the sub_span starts and ends at a unicode character boundary when the Span is created. If we make sub_span a method of Literal and Ident such a thing would be easy to handle.

dtolnay · 2018-09-21T18:42:27Z

The subspan method from #53930 is more intuitive to me than trimmed. If I am println and I want to error on part of a string literal,

"a={a} b={b} c={c}"
         ^^^

then I would rather express this as subspan(9..12) than trimmed(9, 7).

alexcrichton · 2018-09-21T20:30:41Z

Another aspect I'm worried about as well is that it's not clear that we could actually support this sort of span selection via indices that well. For example with strings if you wanted to have character at the letter 'a' how do you handle strings like r#"a"#, r####"a"####, "\u{61}", etc? I'm not sure we have a way to get the actual source string, just the value after parsing? In that sense how do we know where within the actual source code representation to place the carets?

alexcrichton · 2018-09-21T20:30:58Z

(I do agree with @dtolnay as well that the API to behave slice-like makes more sense to me too)

SergioBenitez · 2018-09-22T03:19:31Z

(I do agree with @dtolnay as well that the API to behave slice-like makes more sense to me too)

I do too, actually. Happy to change that.

I'm not sure we have a way to get the actual source string, just the value after parsing?

I was wondering this, actually; what is .to_string() on a Literal supposed to return? It's not well specified. I assumed it was the source string of the literal, but that doesn't appear to be the case.

I see ~three paths forward here:

Expose the source string in some way; expose sub_span(), let the user figure out how the parsed value maps back to the source string.

Expose more information in Literal about the source string and the literal itself. In particular, expose a non-exhaustive (for forwards-compatibility) enum that specifies the kind of literal with (almost) enough information to recover the source string. Such an enum might look like:

enum LiteralKind {
    Byte,
    Char,
    Integer(Prefix, Suffix),
    Float(Prefix, Suffix),
    CookedStr,
    RawStr(usize),
    ByteStr,
    RawByteStr(usize),
    #[doc(hidden)]
    __NonExhausitive
}

struct Prefix(&'static str);

impl Prefix { fn len(&self) -> usize }

struct Suffix(&'static str);

impl Suffix { fn len(&self) -> usize }

Of course, when there's any kind of escaping going on, this isn't sufficient to recover the source string.

Have sub_span() "do the right thing" automatically. In particular, it can compute a mapping from an indexed passed in, corresponding to an index in the string returned from to_string(), to an index in the source string.

So, for something like:
```
* Source String: `r##"hello"##`
* `.to_string()` -> hello
* foo.sub_span(2..4) -> `r##"hello"##`
                               ^^
```
Or, similarly:
```
* Source String: `"hel\u{0x6c}o"`
* `.to_string()` -> hello
* foo.sub_span(2..4) -> `"hel\u{0x6c}o"`
                            ^^^^^^^^^
```
This has fairly natural mappings for most things. Floats might be a bit weird, but I think we can find a natural mapping as well. The downside with this approach is the complexity of implementation.

I'm inclined to suggest 3) as the path forward, though I do want to note that 1) allows the user to implement 3) themselves, and is much easier to implement.

I'm interested in learning if you have any other ideas @dtolnay, @alexcrichton. It's really important for Rocket's diagnostics that we're able to create subspans. In Rocket's case, we only need subspans of cooked strings, so I'd be happy with a targeted solution there, though I'm sure others will need a more general solution that works for other literals or idents.

dtolnay · 2018-09-22T03:58:36Z

I like 3 a lot if the implementation works out.

What is the right behavior if a macro scrambles spans across different tokens? Suppose we have:

#[a]
#[b]
const S: &str = "hel\u{0x6c}o";

If a (or b, whichever expands first) expands to the same tokens but with the string literal having the span that was originally the span of the equal sign, then the one that expands second will observe string_literal.to_string().len() != string_literal.span().len(). Even in the case that lengths happen to match, the span you're slicing may be the span of some totally different original token (potentially with multi-byte character boundaries at different offsets, which is extra unfortunate).

We should definitely handle ensuring that the sub_span starts and ends at a unicode character boundary when the Span is created.

If possible, I would prefer that subspan allow slicing anywhere that is in bounds. Afterward whatever code needs a span aligned to character boundaries can round outward to the nearest character boundary. This way the naive cases like sp.subspan(0..1) will just do the right thing in the presence of multi-byte chars.

SergioBenitez · 2018-09-22T23:44:32Z

What is the right behavior if a macro scrambles spans across different tokens? Suppose we have:

I think we'd need to either 1) keep track of the original span and always return the Span corresponding to the literal, regardless of whether it's been changed, or 2) record that the Span has changed and return a None in that case. Option 1) sounds similar to what you implemented in #53902, if I'm not mistaken?

If possible, I would prefer that subspan allow slicing anywhere that is in bounds. Afterward whatever code needs a span aligned to character boundaries can round outward to the nearest character boundary.

Why do you prefer this behavior? I'm not necessarily opposed to it, except that it's not consistent with how how slicing on strings works today. Is there a benefit to this relaxed approach?

alexcrichton · 2018-09-24T17:12:39Z

I also really like option 3 if we can get it to work, but I wonder if we could perhaps strike a more minimal compromise for now? What if we only added Literal::subspan instead of Span::subspan? (or some method to only get a sub-span of a Literal). That may help solve the mismatched span issue more obviously (an obvious candidate to return None as it no longer has the original span?).

I think it'd also provide a clear location to document the internal parsing that rustc would be doing as we could clearly say "these indices are indexes into a string literal's actual literal value" and... do "something reasonable" if the literal isn't a string (maybe return None?)

SergioBenitez · 2018-09-24T23:24:31Z

What if we only added Literal::subspan instead of Span::subspan?

Absolutely. That's what I was proposing. :)

Sounds like we all agree that option 3 is the way to go. I'll give its implementation some thought and see what I come up with.

alexcrichton · 2018-09-24T23:27:03Z

Er sorry, then yeah that sounds great to me!

TimNN · 2018-10-16T17:38:59Z

Ping from triage @SergioBenitez: What is the status of this PR?

SergioBenitez · 2018-10-17T04:58:10Z

@TimNN I have not had a chance to implement the revised strategy.

TimNN · 2018-10-30T17:21:19Z

Ping from triage @SergioBenitez: It looks like this PR hasn't been updated in a while, so I'm closing it for now, per our guidelines. Thanks for your contributions and please feel free to re-open in the future.

Add 'Span::len()' and 'Span::trimmed()' proc-macro methods.

1760afe

rust-highfive assigned aturon Sep 20, 2018

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Sep 20, 2018

rust-highfive assigned alexcrichton and unassigned aturon Sep 20, 2018

SergioBenitez mentioned this pull request Oct 9, 2018

Tracking issue for proc_macro conversion rwf2/Rocket#693

Closed

11 tasks

TimNN added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Oct 16, 2018

TimNN closed this Oct 30, 2018

TimNN added S-inactive Status: Inactive and waiting on the author. This is often applied to closed PRs. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Oct 30, 2018

SergioBenitez mentioned this pull request Nov 20, 2018

Decouple proc_macro from the rest of the compiler. #49219

Merged

dtolnay self-assigned this Mar 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add 'Span::len()' and 'Span::trimmed()' proc-macro methods. #54373

Add 'Span::len()' and 'Span::trimmed()' proc-macro methods. #54373

SergioBenitez commented Sep 20, 2018

rust-highfive commented Sep 20, 2018

SergioBenitez commented Sep 20, 2018

petrochenkov commented Sep 20, 2018

alexcrichton commented Sep 20, 2018

SergioBenitez commented Sep 21, 2018 •

edited

Loading

dtolnay commented Sep 21, 2018

alexcrichton commented Sep 21, 2018

alexcrichton commented Sep 21, 2018

SergioBenitez commented Sep 22, 2018 •

edited

Loading

dtolnay commented Sep 22, 2018

SergioBenitez commented Sep 22, 2018

alexcrichton commented Sep 24, 2018

SergioBenitez commented Sep 24, 2018

alexcrichton commented Sep 24, 2018

TimNN commented Oct 16, 2018

SergioBenitez commented Oct 17, 2018 •

edited

Loading

TimNN commented Oct 30, 2018

Add 'Span::len()' and 'Span::trimmed()' proc-macro methods. #54373

Add 'Span::len()' and 'Span::trimmed()' proc-macro methods. #54373

Conversation

SergioBenitez commented Sep 20, 2018

rust-highfive commented Sep 20, 2018

SergioBenitez commented Sep 20, 2018

petrochenkov commented Sep 20, 2018

alexcrichton commented Sep 20, 2018

SergioBenitez commented Sep 21, 2018 • edited Loading

dtolnay commented Sep 21, 2018

alexcrichton commented Sep 21, 2018

alexcrichton commented Sep 21, 2018

SergioBenitez commented Sep 22, 2018 • edited Loading

dtolnay commented Sep 22, 2018

SergioBenitez commented Sep 22, 2018

alexcrichton commented Sep 24, 2018

SergioBenitez commented Sep 24, 2018

alexcrichton commented Sep 24, 2018

TimNN commented Oct 16, 2018

SergioBenitez commented Oct 17, 2018 • edited Loading

TimNN commented Oct 30, 2018

SergioBenitez commented Sep 21, 2018 •

edited

Loading

SergioBenitez commented Sep 22, 2018 •

edited

Loading

SergioBenitez commented Oct 17, 2018 •

edited

Loading