Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

An implementation for 0168-multi-line-string-literals.md #8813

Merged
merged 39 commits into from
Apr 26, 2017

Conversation

johnno1962
Copy link
Contributor

@johnno1962 johnno1962 commented Apr 17, 2017

A prototype implementation for the proposal 0168 as discussed on the swift evolution thread over the last week. It has been tested inside the Xcode source editor is completely functional as a reference implementation bar any changes that would be need to be made to other parts of the toolchain which seem to be minimal. There is a new test file.

Resolves #42792.

// work back from the end to find whitespace to strip
while (start > Bytes.begin() && isWhitespace(start[-1])) {
if (*--start == '\n' || *start == '\r') {
if (start[-1] == '\r')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not right, I think: \r\r is two classic Mac newlines, not one. Surely, what you want is this:

switch (*--start) {
case '\n':
  if (start[-1] == '\r')
    --start;
  LLVM_FALLTHROUGH;
case '\r':
  return std::string(start, end-start);
}

@@ -397,12 +402,13 @@ class Lexer {
/// If a copy needs to be made, it will be allocated out of the provided
/// Buffer.
static StringRef getEncodedStringSegment(StringRef Str,
SmallVectorImpl<char> &Buffer);
SmallVectorImpl<char> &Buffer,
unsigned Modifiers = 0, const std::string &ToReplace = "");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: 80-character lines, here and throughout.

@@ -505,6 +511,9 @@ class Lexer {
/// Try to lex conflict markers by checking for the presence of the start and
/// end of the marker in diff3 or Perforce style respectively.
bool tryLexConflictMarker();

// new for multiline string literals
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: capitalize and punctuate comments, here and throughout.

case '\n': // String literals cannot have \n or \r in them.
case '\r':
if (Modifiers & StringLiteralMultiline) // ... unless they are mutli-line
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, typo. (Also, decide if you want to write "multi-line" or "multiline" throughout, and camel-case accordingly in the code.)

@@ -1192,6 +1207,15 @@ unsigned Lexer::lexCharacter(const char *&CurPtr, char StopQuote,
case '"': ++CurPtr; return '"';
case '\'': ++CurPtr; return '\'';
case '\\': ++CurPtr; return '\\';
case '\n':
LLVM_FALLTHROUGH;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per existing code in these files, no need for LLVM_FALLTHROUGH; when your case is otherwise empty.

}

/// determine contents of literal to be normalised - either
/// to strip indenting or normalise line endings to a single \n
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: while you're capitalizing and re-wrapping these comments to 80 characters, U.S. English spellings. (Yes, I know; I'm Canadian, but them's the rules...)

}

// are there windows line endings in the source, if so return it to be replaced
const char *windowsLinesep = strnstr(Bytes.begin(), "\r\n", Bytes.end()-Bytes.begin());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nit and a more substantive comment.

  • Nit: windowsLineSeparator (capitalize).
  • More substantive comment: strnstr isn't portable, is it? Ironic that this won't work on Windows...

while ((BytesPtr = (const char *)memchr(BytesPtr, '\n', Bytes.end()-BytesPtr)) != nullptr) {
const char *NextPtr = BytesPtr + 1;
if (*NextPtr != '\n' && *NextPtr != '\r') {
if (BytesPtr[-1] == '\r')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See previous comment about how this treats \r\r.

@@ -1324,8 +1404,15 @@ void Lexer::lexStringLiteral() {
// NOTE: We only allow single-quote string literals so we can emit useful
// diagnostics about changing them to double quotes.

bool wasErroneous = false;

bool wasErroneous = false, wasWhitespace = false, allWhitespace = true;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: wasAllWhitespace, surely.

@kubamracek kubamracek self-requested a review April 20, 2017 21:01
@ematejska ematejska requested a review from milseman April 21, 2017 17:15
// CHECK: -2-
print("-2-")
// SKIP-CHECK-NEXT: <"Two Beta">
//print(delimit(""""Two Beta""""))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this commented out?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a test for two quotes in the middle of the string?

BytesPtr = NextPtr;
}
}

/// lexStringLiteral:
/// string_literal ::= ["]([^"\\\n\r]|character_escape)*["]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update this comment

// Strips any indent that corresponds to the indent
// of the multi-line string terminating line and
// normalises line endings in the source to \n.
// It also removes any intial empty line.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrap comments at 80 lines.

bool wasErroneous = false, wasWhitespace = false, allWhitespace = true;
unsigned Modifiers = 0;

// is this the start of a multiline string litersl
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo

// CHECK: -4-
print("-4-")
// CHECK-NEXT: <FourDelta>
print(delimit("""Four\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned on swift-evolution, this should be disallowed.


// CHECK: -14-
print("-14-")
// CHECK-WARNINGS: warning: invalid mix of multi-line string literal indentation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be an error instead of a warning.


// CHECK: -11-
print("-11-")
// CHECK-WARNINGS: warning: invalid mix of multi-line string literal indentation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be an error.

// CHECK: -12-
print("-12-")
// Note: The next few tests use physical tab characters, not spaces.
// CHECK-WARNINGS: warning: invalid mix of multi-line string literal indentation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be an error.


// CHECK: -13-
print("-13-")
// CHECK-WARNINGS: warning: invalid mix of multi-line string literal indentation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be an error.

@kubamracek
Copy link
Contributor

Hi, @johnno1962, thanks for sending this patch! Do you think you'll have time to address the review comments? I'd like to get this into Swift 4, and we're getting close to the deadline for it.

@johnno1962
Copy link
Contributor Author

johnno1962 commented Apr 23, 2017

Thanks @xwu and @kubamracek for the comments which I have endevoured to address with this last commit. This should be compliant with the core teams decision except that it includes a small amount of code to implement a new proposal https://github.com/johnno1962c/swift-evolution/blob/master/proposals/0173-newline-escape-in-strings.md to allow escaping of newlines in all strings. This code is around line 1212 of Lexer.cpp if you want to remove it.

There are limitations to the implementation in that while it will normalise end-of-line to \n for sources that use \n, \r\n or \r as the line separator, they can not be mixed within a literal and have this work.

@kubamracek
Copy link
Contributor

@johnno1962, could you extract the newline escaping feature into a separate pull request? Since it's a separate proposal that wasn't accepted yet, we shouldn't block merging this PR on the other feature.

Copy link
Collaborator

@xwu xwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of nits, sorry.

My main feedback would be that I'd like to see more tests. It'd be very important to guarantee that your line normalization code does not strip manually escaped \r\n (or, for that matter, \r + literal newline).

Also, what can be done about this limitation as to normalization of mixed newlines?

ERROR(lex_unicode_escape_braces,none,
"expected hexadecimal code in braces after unicode escape", ())
ERROR(lex_illegal_multiline_string_start,none,
"inavlid start of multi-line string literal", ())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo

ERROR(lex_illegal_multiline_string_start,none,
"inavlid start of multi-line string literal", ())
ERROR(lex_illegal_multiline_string_end,none,
"inavlid end of multi-line string literal", ())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo

ERROR(lex_ambiguous_string_indent,none,
"invalid mix of multi-line string literal indentation", ())
WARNING(lex_trailing_multiline_whitespace,none,
"includes trailing space characters in multi-line string literal", ())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning on trailing whitespace is not a part of the approved proposal; please also split into separate PR.

@@ -505,6 +512,9 @@ class Lexer {
/// Try to lex conflict markers by checking for the presence of the start and
/// end of the marker in diff3 or Perforce style respectively.
bool tryLexConflictMarker();

/// New for multiline string literals
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments should have punctuation; also, please describe what it does and not just label it as new.

@@ -46,6 +46,9 @@ class Token {
/// \brief Whether this token is an escaped `identifier` token.
unsigned EscapedIdentifier : 1;

/// modifiers for string literals
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: capitalize and punctuate.

Segments.push_back(
StringSegment::getLiteral(getSourceLoc(SegmentStartPtr),
Bytes.end()-SegmentStartPtr));
Bytes.end()-SegmentStartPtr, Modifiers, ToReplace));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

80-character lines please.


// ===---------- Done --------===
// CHECK-NEXT: Done.
print("Done.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add more tests to ensure that the errors and warnings you have added behave as intended?

Some other ideas:

  • Check that only one leading newline and one trailing newline is stripped.
  • Check that all other newlines are preserved, including multiple consecutive ones.
  • Check that manually escaped \t, \r, \n, \\ are all possible and correctly lexed, including at the end of a line.
  • Check that a manually escaped \r\n is not normalized to \n.
  • Check that string literal interpolation works correctly.
  • Check that invalid ragged leading indents trigger the expected error.
  • Check that escaping \""" works correctly.

Also, I would like to see, for the purposes of this particular implementation, that end-of-line newlines after \ are preserved and that trailing whitespace is correctly preserved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @xwu I’ve actioned most of your nits. “wasAllWhitespace” is there to be able to generate errors when there is non-whitespace before the closing delimiter. @kubamracek this commit should be in line with the proposal as accepted. I’ll put newline escapes back in as a separate PR if required. Looking at more tests while the toolchain builds.

Copy link
Collaborator

@xwu xwu Apr 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One typo left :)

Why can find-and-replace operations not simply replace all literal \r with \n and all literal \r\n with \n?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could do it the other way around I suppose. It tries to do all the replacing including indent stripping with a single loop due to how the code developed. Seems OK as it is apart from exotic mixes of line endings inside a single literal.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see the logic behind doing indent stripping in a single loop, because you need to get to the end of the literal before you know how much to strip. But, without having thought too deeply, it would seem that normalizing line endings can happen line-by-line as you go.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and your logic isn't stripping leading whitespace from escaped newlines, is it?

  """
  \n  \n
  """

...should give "\n \n" and not "\n\n".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indent stripping is done on program text not expanded escapes which happens afterwards

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good. I assumed so--just wanted to check.

std::string ToReplace;

static StringSegment getLiteral(SourceLoc Loc, unsigned Length,
unsigned Modifiers, const std::string &ToReplace) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

80-character lines, please.

@@ -495,7 +502,7 @@ class Lexer {
static unsigned lexUnicodeEscape(const char *&CurPtr, Lexer *Diags);

unsigned lexCharacter(const char *&CurPtr,
char StopQuote, bool EmitDiagnostics);
char StopQuote, bool EmitDiagnostics, unsigned Modifiers = 0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

80-character lines, please.

@@ -273,11 +276,17 @@ class Token {
void setText(StringRef T) { Text = T; }

/// \brief Set the token to the specified kind and source range.
void setToken(tok K, StringRef T, unsigned CommentLength = 0) {
void setToken(tok K, StringRef T, unsigned CommentLength = 0, unsigned Modifiers = 0) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

80-character lines, please.

@johnno1962
Copy link
Contributor Author

One more time 🙄

@milseman
Copy link
Member

@johnno1962

/Users/buildnode/jenkins/workspace/swift-PR-osx-smoke-test/branch-master/swift/lib/Parse/Lexer.cpp:1365:25: error: use of undeclared identifier 'getSourceLoc'; did you mean 'swift::Lexer::getSourceLoc'?
        Diags->diagnose(getSourceLoc(start),
                        ^~~~~~~~~~~~
                        swift::Lexer::getSourceLoc
/Users/buildnode/jenkins/workspace/swift-PR-osx-smoke-test/branch-master/swift/include/swift/Parse/Lexer.h:433:20: note: 'swift::Lexer::getSourceLoc' declared here
  static SourceLoc getSourceLoc(const char *Loc) {
                   ^
1 error generated.

BTW, you can get a quicker turn around time by running "ninja swift" in your local swift build directory to just build the compiler (without the standard library).

@milseman
Copy link
Member

@swift-ci please smoke test

@johnno1962
Copy link
Contributor Author

Thanks for the tip. I’ve been building toolchains. Is there an easy way to build a toolchain without building all architectures?

@milseman
Copy link
Member

Nothing supported that I know of. Of course, you can intercept the commands and hack something together for yourself, but that's pretty dirty.

If you're just iterating fast on a .cpp file, "ninja swift" will turn around in about 20 seconds or so. You can then use e.g. ./bin/swift -frontend -parse -verify foo.swift for the error message checking for a test.

@johnno1962
Copy link
Contributor Author

Toolchains weren't so bad. it’s just that if the day changes you have to rebuild llvm + clang. Looks like tests are ok. Merge away 👍

@milseman milseman merged commit 981e706 into swiftlang:master Apr 26, 2017
@milseman
Copy link
Member

🍾

@milseman
Copy link
Member

SR for diagnostics improvements: https://bugs.swift.org/browse/SR-4701

@kubamracek do you have an SR for multi-line literals inside of interpolations?

@johnno1962
Copy link
Contributor Author

johnno1962 commented Apr 26, 2017

Thanks @milseman, @kubamracek and particularly @xwu for all your help. Final toolchain:
http://johnholdsworth.com/swift-LOCAL-2017-04-27-a-osx.tar.gz

@kubamracek
Copy link
Contributor

https://bugs.swift.org/browse/SR-4708: Add support for multiline strings inside string interpolations

@johnno1962
Copy link
Contributor Author

@kubamracek I’ve just opened PR #9049 for multiline inside interpolations

@DevAndArtist
Copy link
Contributor

Was the warning about trailing whitespaces removed? 😞

let s = """
	abc                                                               
	"""

print(s.characters.count) // prints 66

@milseman
Copy link
Member

What would be the workaround if trailing whitespace is desired?

@DevAndArtist
Copy link
Contributor

DevAndArtist commented May 11, 2017

@milseman swiftlang/swift-evolution#695

Without the trailing backslash you'd need something like \("").

@DevAndArtist
Copy link
Contributor

The example from above is actually this:

let s = """
	abc                                                               \("")
	"""

print(s.characters.count) // prints 66

which has a visible indication that trailing whitespaces are involved. Ideally we still need the trailing backslash.

@xwu
Copy link
Collaborator

xwu commented May 11, 2017

As already discussed on swift-evolution, the accepted proposal does not include warnings about trailing whitespace. This PR correctly implements the proposal as accepted.

@DevAndArtist
Copy link
Contributor

@xwu I bet you've included all the points the core team mentioned in the accepted thread, which cases should be errors and which should be allowed, like for instance the blank line without any indent, but you seem exclusively pick things that you like and silently ignore things that you don't like the same way you did during the discussion thread. I'm not being offensive by any means, but I' highly critical about that.

https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20170417/035931.html

That seems like a reasonable thing to warn about. [...]

-Joe

https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20170417/035934.html

I don't want to sit around and watch how we'll introduce a half baked multi-line string literal.

@DevAndArtist
Copy link
Contributor

In case you insist for a bug issue: https://bugs.swift.org/browse/SR-4874

@johnno1962
Copy link
Contributor Author

Adrian, you have a subjective opinion that trailing whitespace is critical that didn’t win over the majority of the thread on this proposal. It seems like you're a bit confused as to the idea behind my follow up proposal about newline escapes (elided newlines) I'd not intended it to have anything to do with helping make explicit trailing whitespace.

@AnthonyLatsis AnthonyLatsis added compiler The Swift compiler itself feature A feature request or implementation literals Feature → expressions: Literals such as an integer or string literal multiline string literals Feature → expressions → literals → string literals: multiline string literals lexer Area → compiler: The legacy C++ lexer labels Mar 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler The Swift compiler itself feature A feature request or implementation lexer Area → compiler: The legacy C++ lexer literals Feature → expressions: Literals such as an integer or string literal multiline string literals Feature → expressions → literals → string literals: multiline string literals
Projects
None yet
Development

Successfully merging this pull request may close these issues.