-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Full support for bidirectional line breaking #7123
Conversation
@ChrisLoer, thanks for your PR! By analyzing the history of the files in this pull request, we identified @kkaefer, @1ec5 and @ansis to be potential reviewers. |
UErrorCode errorCode = U_ZERO_ERROR; | ||
ubidi_setLine(bidiParagraph, start, end, bidiLine, &errorCode); | ||
|
||
// TODO: At this point we're committed to the line breaking working. As a fallback we could try to just return the equivalent substring of the original reordered paragraph, but maybe we should just throw an exception? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What kinds of errors could we expect to occur here? If it’s possible for line breaking to fail due to something in the vector tile data, then we might need a graceful fallback. Otherwise, if this is more about allocation or other catastrophic failures, throwing an exception would make more sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the assumptions are:
- Input text is UTF-16
- Line breaks are correctly determined (as I look more closely, I think my logic is broken right now for input that already has line breaks in it)
applyBidiToParagraph
is called beforeapplyBidiLineBreaking
(I think I'll try to change the interface to make this a requirement)- No allocation errors, etc.
In other words, only programmer mistakes should lead to an error. So I'll go with an exception.
|
||
// Collapse invisible characters. | ||
uint32_t lineEndIndex = shaping.positionedGlyphs.size() - 1; | ||
// TODO refactor: Seems like isVisible should be called isInvisible? Also there's an edge case here with one character invisible lines |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, you’re totally right, the logic should be inverted so it actually returns whether the character is visible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even inverted, isVisible
still seems a little confusing to me -- there are lots of code points that won't put pixels on the screen other than space, newline, and zero-width space. On the other hand, we don't really care about invisible code points as long as we don't have glyphs for them (e.g. U+200E LEFT-TO-RIGHT MARK) or they have a glyph but the glyph has zero-width advance. Is what we're trying to achieve here something like trimWhitespace
, applied to both the beginning and end of the line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Essentially: when the line is centered, we don’t want the trailing newline, space, or zero-width space to shift the centered line off to one side. (Here’s the GL JS equivalent.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to torture the map a little to get it to happen, but there is a crash bug here: if you have labels that include trailing line breaks (you can copy and paste them into Studio even though you can't enter them directly), you can end up with "lines" that just have a single line break on them. In that case lineEnd-- wraps around to 2^32.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Trailing line breaks can come from other sources, such as runtime styling and hand-crafted GeoJSON too.
2aa65d0
to
c1ee2b8
Compare
@1ec5, would you have a chance to review again? I tweaked the line breaking algorithm to bring it into conformance with the current ideographic breaking behavior, but there are still some subtle differences in the handling of whitespace (we can trim more than one character of whitespace from a line, we trim from the front as well as the back, and we calculate alignment after the whitespace trimming instead of before). I also added support for multi-paragaph text runs, although in practice we're unlikely to see those in labels. Most ICU errors now raise exceptions. I see that one of the bitrise checks is failing, but I don't have a login with bitrise to see what the check is or why it's failing, so I'm not sure how to address that. I'm also not sure how to merge the tests along with this PR. The tests are passing now pointing against the cloer_icu_multiline branch of mapbox-gl-test-suite, but the master branch contains tests that don't seem to pass on the master branch of mapbox-gl-native yet. |
I’ve added you to all the gl-native apps on Bitrise.
If test-suite has gained new tests since gl-native’s master branch last pinned to it, you can mark those tests as ignored with links to issues in this repository that track those bugs or features, then commit those changes directly to test-suite’s master branch. Then you can rebase test-suite’s cloer_icu_multiline branch atop that change and point this gl-native branch’s package.json to that commit. When you’re ready to land this PR, rebase test-suite’s cloer_icu_multiline branch again and update the version pin here. Quickly merge your test-suite branch to master before any intervening changes make their way in, then merge this PR. |
// totalWidth doesn't include the last character for magical tuning reasons. This makes the algorithm a little | ||
// more agressive about trying to fit the text into fewer lines, taking advantage of the tolerance for going a little | ||
// over maxWidth | ||
for (uint32_t i = 0; i < logicalInput.size() - 1; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be an iterator-style for
loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be, but then we have to add extra logic to remove the last character, and I don't think the code gets much simpler. Or am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I missed that. 👍
for (std::u16string line : lines) | ||
{ | ||
// Collapse whitespace so it doesn't throw off justification | ||
boost::algorithm::trim_if(line, boost::algorithm::is_any_of(u" \t\n\v\f\r")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn’t include the ideographic space (U+3000) which is used in CJK, or some of the more exotic Western spaces such as the thin space and mutton. Since we’re bringing in ICU, can we take advantage of any character class functionality in there? Or are we concerned about it being too slow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I experimented with that idea, but it pulled in a significant chunk of new code with the ICU UnicodeString
class (which isn't used by the BiDi code), and also required an extra string copy just to do the trimming. Both of those could be non-issues performance-wise, but this didn't feel like an important enough use case to pull in the dependency. If our STL specialized ctype<char16_t>
, boost::algorithm::trim
would work without us having to define whitespace, but alas it doesn't seem to. Maybe for now I'll just add U+3000 to our whitespace list? I can also move the list back into 18n.cpp and just pass the predicate to trim_if
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, managing the invisible character class manually sounds fine to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than needing to handle more invisible characters – something that can happen in a separate PR if necessary – this PR looks fine to me. Someone more familiar with the mbgl codebase should take a look, though, in case I missed anything. I’d also encourage you to run the files you touched through clang-format at some point. There are some whitespace-only lines that we try to avoid in mbgl.
fad248b
to
24b9538
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nits only -- feel free to merge when addressed.
} | ||
|
||
void ProcessedBiDiText::mergeParagraphLineBreaks(std::set<int32_t>& lineBreakPoints) { | ||
for (int32_t i = 0; i < ubidi_countParagraphs(bidi.bidiText); i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can/should ubidi_countParagraphs(bidi.bidiText)
be extracted from the loop condition, so it's executed only once?
ubidi_setLine(bidiText, start, end, bidiLine, &errorCode); | ||
|
||
if (U_FAILURE(errorCode)) | ||
throw std::runtime_error("msg"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"msg"
looks like a placeholder string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops!
d03e46f
to
2335c93
Compare
…di functionality. - Trim whitespace from labels before determining their max-width for alignment. - Fix crash on labels that contain lines with only a single character of whitespace.
2335c93
to
d2ba42b
Compare
@@ -6,9 +6,6 @@ namespace mbgl { | |||
namespace util { | |||
namespace i18n { | |||
|
|||
/** Returns whether a character is a visible character. */ | |||
bool isVisible(uint16_t chr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean we're now also rendering invisible characters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was only used to remove trailing characters from a line, and only newline/space/zero-width space. I replaced it with trim because I think that's the semantics we're going for -- we don't want whitespace trailing/leading whitespace to affect layout. Zero-width space will get "rendered" now, although it won't affect layout unless the font we're using assigns it non-zero width.
"Invisible" sounds to me like a font property (if there's no glyph, the code point is effectively invisible, also the font may have a glyph that doesn't draw any pixels). As you note below Unicode also has lots of extra "whitespace" characters. This change doesn't affect our handling of them.
const uint32_t height = (line + 1) * lineHeight; | ||
for (std::u16string line : lines) { | ||
// Collapse whitespace so it doesn't throw off justification | ||
boost::algorithm::trim_if(line, boost::algorithm::is_any_of(u" \t\n\v\f\r")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this with other whitespace characters that are in other Unicode blocks? (e.g. non-breaking spaces, thin spaces, em/en spaces, ideographic spaces etc.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They won't be trimmed, so they may affect layout slightly. Whether they allow word breaking is controlled by i18n::allowsWordBreaking
. It looks like we don't currently break on thin/em/en spaces, either.
// algorithm a little | ||
// more agressive about trying to fit the text into fewer lines, taking advantage of the | ||
// tolerance for going a little | ||
// over maxWidth |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment formatting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack, I should have noticed this after running clang-format.
ubidi_setLine(bidiText, start, end, bidiLine, &errorCode); | ||
|
||
if (U_FAILURE(errorCode)) | ||
throw std::runtime_error("msg"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please use braces around all control statement blocks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, is that our standard style? I'll start doing that.
|
||
BiDi::BiDi() { | ||
bidiText = ubidi_open(); | ||
bidiLine = ubidi_open(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why two handles?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just the way the ICU interface works. ubidi_setLine
wants two UBiDi*
objects: one "input" representing the whole run of text, and one "output" representing bidirectional information for a single line.
// Because we set UBIDI_REMOVE_BIDI_CONTROLS, the output may be smaller than what we reserve | ||
// Setting UBIDI_INSERT_LRM_FOR_NUMERIC would require | ||
// ubidi_getLength(pBiDi)+2*ubidi_countRuns(pBiDi) | ||
int32_t outputLength = ubidi_getProcessedLength(bidiLine); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const
// Setting UBIDI_INSERT_LRM_FOR_NUMERIC would require | ||
// ubidi_getLength(pBiDi)+2*ubidi_countRuns(pBiDi) | ||
int32_t outputLength = ubidi_getProcessedLength(bidiLine); | ||
std::unique_ptr<UChar[]> outputText = std::make_unique<UChar[]>(outputLength); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can just use auto
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, right! I left C++ world at C++98, still getting used to all the great additions...
argh sorry; I had a few review comments, but apparently I never clicked on on submit. |
Yeah, just minor nitpicks and suggestions. |
@@ -24,7 +24,7 @@ | |||
"lodash": "^4.16.4", | |||
"mapbox-gl-shaders": "mapbox/mapbox-gl-shaders#597115a1e1bd982944b068f8accde34eada74fc2", | |||
"mapbox-gl-style-spec": "mapbox/mapbox-gl-style-spec#7f62a4fc9f21e619824d68abbc4b03cbc1685572", | |||
"mapbox-gl-test-suite": "mapbox/mapbox-gl-test-suite#c32d0c5ac80e3b7393bc17b8944e64fa5cffd90a", | |||
"mapbox-gl-test-suite": "mapbox/mapbox-gl-test-suite#d85534f77e1b06fbfe7aa610c98a363be86fceb4", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mapbox/mapbox-gl-test-suite#186 was rebased onto that repository’s master branch, so mapbox/mapbox-gl-test-suite@d85534f has been orphaned. This line should have been changed to point to mapbox/mapbox-gl-test-suite@0c6f3e0 before this PR was merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, sorry about that, and thanks for fixing!
Fix for #7112
Also makes small changes to handling of leading/trailing whitespace in labels: