Skip to content
This repository has been archived by the owner on Aug 8, 2023. It is now read-only.

Full support for bidirectional line breaking #7123

Merged
merged 1 commit into from
Nov 30, 2016
Merged

Conversation

ChrisLoer
Copy link
Contributor

@ChrisLoer ChrisLoer commented Nov 19, 2016

Fix for #7112

Also makes small changes to handling of leading/trailing whitespace in labels:

  • Trim whitespace before calculating line max-width for alignment/justification
  • Don't crash if line-breaking algorithm creates a line with just a single whitespace character on it

@mention-bot
Copy link

@ChrisLoer, thanks for your PR! By analyzing the history of the files in this pull request, we identified @kkaefer, @1ec5 and @ansis to be potential reviewers.

@1ec5 1ec5 added Core The cross-platform C++ core, aka mbgl text rendering labels Nov 19, 2016
@1ec5 1ec5 changed the title [core] Full support for bidirectional line breaking Full support for bidirectional line breaking Nov 19, 2016
UErrorCode errorCode = U_ZERO_ERROR;
ubidi_setLine(bidiParagraph, start, end, bidiLine, &errorCode);

// TODO: At this point we're committed to the line breaking working. As a fallback we could try to just return the equivalent substring of the original reordered paragraph, but maybe we should just throw an exception?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What kinds of errors could we expect to occur here? If it’s possible for line breaking to fail due to something in the vector tile data, then we might need a graceful fallback. Otherwise, if this is more about allocation or other catastrophic failures, throwing an exception would make more sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the assumptions are:

  • Input text is UTF-16
  • Line breaks are correctly determined (as I look more closely, I think my logic is broken right now for input that already has line breaks in it)
  • applyBidiToParagraph is called before applyBidiLineBreaking (I think I'll try to change the interface to make this a requirement)
  • No allocation errors, etc.

In other words, only programmer mistakes should lead to an error. So I'll go with an exception.


// Collapse invisible characters.
uint32_t lineEndIndex = shaping.positionedGlyphs.size() - 1;
// TODO refactor: Seems like isVisible should be called isInvisible? Also there's an edge case here with one character invisible lines
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, you’re totally right, the logic should be inverted so it actually returns whether the character is visible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even inverted, isVisible still seems a little confusing to me -- there are lots of code points that won't put pixels on the screen other than space, newline, and zero-width space. On the other hand, we don't really care about invisible code points as long as we don't have glyphs for them (e.g. U+200E LEFT-TO-RIGHT MARK) or they have a glyph but the glyph has zero-width advance. Is what we're trying to achieve here something like trimWhitespace, applied to both the beginning and end of the line?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Essentially: when the line is centered, we don’t want the trailing newline, space, or zero-width space to shift the centered line off to one side. (Here’s the GL JS equivalent.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to torture the map a little to get it to happen, but there is a crash bug here: if you have labels that include trailing line breaks (you can copy and paste them into Studio even though you can't enter them directly), you can end up with "lines" that just have a single line break on them. In that case lineEnd-- wraps around to 2^32.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Trailing line breaks can come from other sources, such as runtime styling and hand-crafted GeoJSON too.

@ChrisLoer ChrisLoer force-pushed the cloer_icu_multiline branch 2 times, most recently from 2aa65d0 to c1ee2b8 Compare November 23, 2016 06:22
@ChrisLoer
Copy link
Contributor Author

@1ec5, would you have a chance to review again? I tweaked the line breaking algorithm to bring it into conformance with the current ideographic breaking behavior, but there are still some subtle differences in the handling of whitespace (we can trim more than one character of whitespace from a line, we trim from the front as well as the back, and we calculate alignment after the whitespace trimming instead of before). I also added support for multi-paragaph text runs, although in practice we're unlikely to see those in labels. Most ICU errors now raise exceptions.

I see that one of the bitrise checks is failing, but I don't have a login with bitrise to see what the check is or why it's failing, so I'm not sure how to address that.

I'm also not sure how to merge the tests along with this PR. The tests are passing now pointing against the cloer_icu_multiline branch of mapbox-gl-test-suite, but the master branch contains tests that don't seem to pass on the master branch of mapbox-gl-native yet.

@1ec5
Copy link
Contributor

1ec5 commented Nov 23, 2016

I see that one of the bitrise checks is failing, but I don't have a login with bitrise to see what the check is or why it's failing, so I'm not sure how to address that.

I’ve added you to all the gl-native apps on Bitrise.

The tests are passing now pointing against the cloer_icu_multiline branch of mapbox-gl-test-suite, but the master branch contains tests that don't seem to pass on the master branch of mapbox-gl-native yet.

If test-suite has gained new tests since gl-native’s master branch last pinned to it, you can mark those tests as ignored with links to issues in this repository that track those bugs or features, then commit those changes directly to test-suite’s master branch. Then you can rebase test-suite’s cloer_icu_multiline branch atop that change and point this gl-native branch’s package.json to that commit.

When you’re ready to land this PR, rebase test-suite’s cloer_icu_multiline branch again and update the version pin here. Quickly merge your test-suite branch to master before any intervening changes make their way in, then merge this PR.

// totalWidth doesn't include the last character for magical tuning reasons. This makes the algorithm a little
// more agressive about trying to fit the text into fewer lines, taking advantage of the tolerance for going a little
// over maxWidth
for (uint32_t i = 0; i < logicalInput.size() - 1; i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be an iterator-style for loop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be, but then we have to add extra logic to remove the last character, and I don't think the code gets much simpler. Or am I missing something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I missed that. 👍

for (std::u16string line : lines)
{
// Collapse whitespace so it doesn't throw off justification
boost::algorithm::trim_if(line, boost::algorithm::is_any_of(u" \t\n\v\f\r"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn’t include the ideographic space (U+3000) which is used in CJK, or some of the more exotic Western spaces such as the thin space and mutton. Since we’re bringing in ICU, can we take advantage of any character class functionality in there? Or are we concerned about it being too slow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I experimented with that idea, but it pulled in a significant chunk of new code with the ICU UnicodeString class (which isn't used by the BiDi code), and also required an extra string copy just to do the trimming. Both of those could be non-issues performance-wise, but this didn't feel like an important enough use case to pull in the dependency. If our STL specialized ctype<char16_t>, boost::algorithm::trim would work without us having to define whitespace, but alas it doesn't seem to. Maybe for now I'll just add U+3000 to our whitespace list? I can also move the list back into 18n.cpp and just pass the predicate to trim_if.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, managing the invisible character class manually sounds fine to me.

Copy link
Contributor

@1ec5 1ec5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than needing to handle more invisible characters – something that can happen in a separate PR if necessary – this PR looks fine to me. Someone more familiar with the mbgl codebase should take a look, though, in case I missed anything. I’d also encourage you to run the files you touched through clang-format at some point. There are some whitespace-only lines that we try to avoid in mbgl.

Copy link
Contributor

@jfirebaugh jfirebaugh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nits only -- feel free to merge when addressed.

}

void ProcessedBiDiText::mergeParagraphLineBreaks(std::set<int32_t>& lineBreakPoints) {
for (int32_t i = 0; i < ubidi_countParagraphs(bidi.bidiText); i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can/should ubidi_countParagraphs(bidi.bidiText) be extracted from the loop condition, so it's executed only once?

ubidi_setLine(bidiText, start, end, bidiLine, &errorCode);

if (U_FAILURE(errorCode))
throw std::runtime_error("msg");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"msg" looks like a placeholder string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops!

@ChrisLoer ChrisLoer force-pushed the cloer_icu_multiline branch 2 times, most recently from d03e46f to 2335c93 Compare November 30, 2016 18:23
…di functionality.

 - Trim whitespace from labels before determining their max-width for alignment.
 - Fix crash on labels that contain lines with only a single character of whitespace.
@@ -6,9 +6,6 @@ namespace mbgl {
namespace util {
namespace i18n {

/** Returns whether a character is a visible character. */
bool isVisible(uint16_t chr);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean we're now also rendering invisible characters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was only used to remove trailing characters from a line, and only newline/space/zero-width space. I replaced it with trim because I think that's the semantics we're going for -- we don't want whitespace trailing/leading whitespace to affect layout. Zero-width space will get "rendered" now, although it won't affect layout unless the font we're using assigns it non-zero width.

"Invisible" sounds to me like a font property (if there's no glyph, the code point is effectively invisible, also the font may have a glyph that doesn't draw any pixels). As you note below Unicode also has lots of extra "whitespace" characters. This change doesn't affect our handling of them.

const uint32_t height = (line + 1) * lineHeight;
for (std::u16string line : lines) {
// Collapse whitespace so it doesn't throw off justification
boost::algorithm::trim_if(line, boost::algorithm::is_any_of(u" \t\n\v\f\r"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this with other whitespace characters that are in other Unicode blocks? (e.g. non-breaking spaces, thin spaces, em/en spaces, ideographic spaces etc.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They won't be trimmed, so they may affect layout slightly. Whether they allow word breaking is controlled by i18n::allowsWordBreaking. It looks like we don't currently break on thin/em/en spaces, either.

// algorithm a little
// more agressive about trying to fit the text into fewer lines, taking advantage of the
// tolerance for going a little
// over maxWidth
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment formatting

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack, I should have noticed this after running clang-format.

ubidi_setLine(bidiText, start, end, bidiLine, &errorCode);

if (U_FAILURE(errorCode))
throw std::runtime_error("msg");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please use braces around all control statement blocks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, is that our standard style? I'll start doing that.


BiDi::BiDi() {
bidiText = ubidi_open();
bidiLine = ubidi_open();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why two handles?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just the way the ICU interface works. ubidi_setLine wants two UBiDi* objects: one "input" representing the whole run of text, and one "output" representing bidirectional information for a single line.

// Because we set UBIDI_REMOVE_BIDI_CONTROLS, the output may be smaller than what we reserve
// Setting UBIDI_INSERT_LRM_FOR_NUMERIC would require
// ubidi_getLength(pBiDi)+2*ubidi_countRuns(pBiDi)
int32_t outputLength = ubidi_getProcessedLength(bidiLine);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const

// Setting UBIDI_INSERT_LRM_FOR_NUMERIC would require
// ubidi_getLength(pBiDi)+2*ubidi_countRuns(pBiDi)
int32_t outputLength = ubidi_getProcessedLength(bidiLine);
std::unique_ptr<UChar[]> outputText = std::make_unique<UChar[]>(outputLength);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can just use auto here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, right! I left C++ world at C++98, still getting used to all the great additions...

@kkaefer
Copy link
Member

kkaefer commented Dec 1, 2016

argh sorry; I had a few review comments, but apparently I never clicked on on submit.

@ChrisLoer
Copy link
Contributor Author

@kkaefer Porting to gl-js is exposing some further changes I'll want to make (like #7253). When I make a PR for those changes I'll incorporate your suggestions from here. I don't see anything you flagged that's broken and needs immediate fixing -- let me know if you think otherwise.

@kkaefer
Copy link
Member

kkaefer commented Dec 1, 2016

Yeah, just minor nitpicks and suggestions.

@@ -24,7 +24,7 @@
"lodash": "^4.16.4",
"mapbox-gl-shaders": "mapbox/mapbox-gl-shaders#597115a1e1bd982944b068f8accde34eada74fc2",
"mapbox-gl-style-spec": "mapbox/mapbox-gl-style-spec#7f62a4fc9f21e619824d68abbc4b03cbc1685572",
"mapbox-gl-test-suite": "mapbox/mapbox-gl-test-suite#c32d0c5ac80e3b7393bc17b8944e64fa5cffd90a",
"mapbox-gl-test-suite": "mapbox/mapbox-gl-test-suite#d85534f77e1b06fbfe7aa610c98a363be86fceb4",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mapbox/mapbox-gl-test-suite#186 was rebased onto that repository’s master branch, so mapbox/mapbox-gl-test-suite@d85534f has been orphaned. This line should have been changed to point to mapbox/mapbox-gl-test-suite@0c6f3e0 before this PR was merged.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, sorry about that, and thanks for fixing!

1ec5 added a commit that referenced this pull request Dec 2, 2016
jfirebaugh pushed a commit that referenced this pull request Dec 2, 2016
@jfirebaugh jfirebaugh deleted the cloer_icu_multiline branch December 12, 2016 18:47
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Core The cross-platform C++ core, aka mbgl text rendering
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants