Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Coalescing for line doc comments #60936

Closed
petrochenkov opened this issue May 18, 2019 · 9 comments
Closed

Coalescing for line doc comments #60936

petrochenkov opened this issue May 18, 2019 · 9 comments
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one. I-compiletime Issue: Problems and improvements with respect to compile times. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@petrochenkov
Copy link
Contributor

Right now multiple line doc comments are considered separate tokens and separate attributes:

/// A
/// B
/// C
pub struct S;

is interpreted as

#[doc = "/// A"]
#[doc = "/// B"]
#[doc = "/// C"]
pub struct S;

To mitigate the perf impact from turning normal comments into doc comments (#60930) we may want to merge multi-line line comments into a single token and a single attribute:

#[doc = "/// A\n/// B\n/// C"]
pub struct S;
@Centril Centril added I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. C-enhancement Category: An issue proposing an enhancement or a PR with one. labels May 18, 2019
@jonas-schievink jonas-schievink added I-compiletime Issue: Problems and improvements with respect to compile times. and removed I-slow Issue: Problems and improvements with respect to performance of generated code. labels May 18, 2019
@nagisa
Copy link
Member

nagisa commented May 19, 2019

is interpreted as

#[doc = "/// A"]
#[doc = "/// B"]
#[doc = "/// C"]
pub struct S;

It cannot be interpreted that way because /// are not stripped by rustdoc. It probably is interpreted as

#[doc = "A"]
#[doc = "B"]
#[doc = "C"]
pub struct S;

so we don’t need the /// in the concatenated string either.

@petrochenkov
Copy link
Contributor Author

I've just checked, /// is there both in tokens and in AST.
It looks like they are occasionally (sigh) stripped, e.g. when doc comments are passed to macros.
Rustdoc strips them as well (see with_desugared_doc).

In other words, the details need to be figured out.
Ideally, tokens should be passed around losslessly/precisely without attempts to be "helpful".
Not sure how close to that we can get backward-compatibly.

@petrochenkov
Copy link
Contributor Author

Another incompatibility is with code doing something like this

// Passed to a macro
/// Comment 1
/// Comment 2

// Macro expects exactly two attributes
#[$meta:meta]
#[$meta:meta]

(Again, we'll need crater to figure out whether we need to provide a compatibility layer or not.)

@nagisa
Copy link
Member

nagisa commented May 19, 2019

I've just checked, /// is there both in tokens and in AST.

Interesting. I tried to run cargo doc on your desugared version and the markdown processor barfed at all the /// in it.

@petrochenkov
Copy link
Contributor Author

A bit on representing tokens produced by lexer losslessly (TLDR: I think literals do it right).

Tokens need to contain their kind and data - { kind: StrLiteral, data: `"foo"` } (quotes in the data).
The kind needs to be readily available because parser uses it all the time.
So it makes sense to remove the kind bits from the data, so there's no duplication and they cannot get inconsistent by definition - { kind: StrLiteral, data: `foo` } (no quotes in the data).
The original token still can be losslessly recovered from this representation - { kind: StrLiteral, data: `foo` } => `"foo"` .
(This is what literals in the compiler currently do.)

The same logic is applicable to doc comments, which are basically raw string literals with unusual quotes (/** + */ and /// + \n).

/** line 1
    line 2
    line 3 */

/// line

=>

{
    kind: OuterBlockDocComment,
    data: ` line 1
    line 2
    line 3 `
}
{
    kind: OuterLineDocComment,
    data: ` line`
}

(the original token still can be losslessly recovered).
(This is not what literals in the compiler currently do, but they probably should.)


If you look at line doc comments like that, it's kind of clear why each line is a separate token.

Keeping them separate at token level, but then concatenating later kind of defeats the purpose of optimization (it may be more work, and more interned strings).
So, the optimization requires redefining what the "closing quote" of a line doc comment is (start of the next non-line-doc-comment token?).

    /// line 1
    /// line 2
    /// line 3
    struct S;

=>

{
    kind: OuterLineDocComment,
    data: ` line 1
    /// line 2
    /// line 3
    `
}

(the original token still can be losslessly recovered).

Pretty weird.

@nnethercote
Copy link
Contributor

So, the optimization requires redefining what the "closing quote" of a line doc comment is (start of the next non-line-doc-comment token?).

    /// line 1
    /// line 2
    /// line 3
    struct S;

=>

{
    kind: OuterLineDocComment,
    data: ` line 1
    /// line 2
    /// line 3
    `
}

(the original token still can be losslessly recovered).

I would have thought it would be this:

{
    kind: OuterLineDocComment,
    data: "line 1\nline 2\nline 3"
}

Here data is the processed, useful form, and we can losslessly recover the original form by adding "/// " at the start and after each newline, and appending a newline at the end. (Well, I'm not sure about the newline at the end, but it definitely feels to me like the internal /// markers should be removed.)

@nnethercote
Copy link
Contributor

Another incompatibility is with code doing something like this

// Passed to a macro
/// Comment 1
/// Comment 2

// Macro expects exactly two attributes
#[$meta:meta]
#[$meta:meta]

That's very unfortunate, if it means that /// multi-line doc comments and /** multi-line doc comments are distinguishable at the language level. Which definitely complicates things.

It's worth noting that if we converted every multi-line /// doc comment in liballoc/libcore/libstd to a /** ... */ doc comment, we'd likely get a big chunk of the potential improvement.

@Mark-Simulacrum
Copy link
Member

It seems like one avenue is to have some crate-level flag or so (maybe just a feature) that collapses doc comments -- at least for now, it'd be basically internal-use-only but if that makes core/std smaller and faster to decode that's already a pretty good win. Eventually we could presumably edition-gate this change in some way (though the interaction with macros might make that a bit hard).

@nnethercote
Copy link
Contributor

#65750 implements the idea from #60935, which is an alternative to doc comment coalescing that doesn't have any backwards compatibility concerns. After #65750 lands, the potential wins from coalescing would be much smaller, and IMO not worth pursuing.

bors added a commit that referenced this issue Nov 7, 2019
Cheaper doc comments

This PR implements the idea from #60935: represent doc comments more cheaply, rather than converting them into `#[doc="..."]` attribute form. Unlike #60936 (which is about coalescing doc comments to reduce their number), this approach does not have any backwards compatibility concerns, and it eliminates about 80-90% of the current cost of doc comments (as estimated using the numbers in #60930, which eliminated the cost of doc comments entirely by treating them as normal comments).

r? @petrochenkov
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one. I-compiletime Issue: Problems and improvements with respect to compile times. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

6 participants