Clean up GenerateBreakTest #975

Merged
merged 32 commits into from
Nov 28, 2024
Commits
c48a170
Eggsperiment
eggrobin Nov 14, 2024
a66f92e
Merge remote-tracking branch 'la-vache/main' into generate-old-monkeys
eggrobin Nov 16, 2024
bf59f08
meow
eggrobin Nov 19, 2024
50a3157
Use remap rules for word and sentence too
eggrobin Nov 19, 2024
0205cab
No CM1 or ZWJ_O
eggrobin Nov 19, 2024
523cd0a
Regenerate UCD
eggrobin Nov 19, 2024
1e3d34f
Merge branch 'more-remapping-less-renaming' into generate-old-monkeys
eggrobin Nov 19, 2024
3a61d91
meow
eggrobin Nov 25, 2024
4433cc2
Some segmenter changes
eggrobin Nov 25, 2024
dc33cd0
Merge remote-tracking branch 'la-vache/main' into more-remapping-less…
eggrobin Nov 25, 2024
52e9dbd
^ rather than a variable called Not, UnicodeSet unions rather than |
eggrobin Nov 25, 2024
d6f96c4
Regenerate UCD
eggrobin Nov 25, 2024
cce6869
Merge branch 'more-remapping-less-renaming' into generate-old-monkeys
eggrobin Nov 25, 2024
5f254b9
Not bad but I should do something about QUmPi_Pf
eggrobin Nov 25, 2024
f6597bb
Better.
eggrobin Nov 26, 2024
533377d
Document the thing
eggrobin Nov 26, 2024
52a89cd
Regenerate UCD
eggrobin Nov 26, 2024
6fa6120
spotless
eggrobin Nov 26, 2024
94fb39a
Spotless and remove commented-out code.
eggrobin Nov 26, 2024
c46754f
Dead code elimination
eggrobin Nov 26, 2024
d8e4c4b
Pick the sample cleverly
eggrobin Nov 26, 2024
d22dfbe
Single assignment
eggrobin Nov 26, 2024
0ab3a58
spots
eggrobin Nov 26, 2024
6bcd20f
orig
eggrobin Nov 26, 2024
4ed94be
aaaaa
eggrobin Nov 26, 2024
dc77ece
Lex separately from name resolution…
eggrobin Nov 27, 2024
e213674
Unused variable, name gcb=XX, and a comment.
eggrobin Nov 27, 2024
9d2ddb8
No codepoint left behind
eggrobin Nov 27, 2024
ab96b84
Throw on undefined variables
eggrobin Nov 27, 2024
1bcb433
Suignardian pair table in LineBreakTest.html
eggrobin Nov 27, 2024
a901062
Merge remote-tracking branch 'la-vache/main' into generate-old-monkeys
eggrobin Nov 27, 2024
29fe201
Make it compile
eggrobin Nov 27, 2024
272 changes: 147 additions & 125 deletions unicodetools/data/ucd/dev/auxiliary/GraphemeBreakTest.html
1,726 changes: 696 additions & 1,030 deletions unicodetools/data/ucd/dev/auxiliary/GraphemeBreakTest.txt
4,541 changes: 2,331 additions & 2,210 deletions unicodetools/data/ucd/dev/auxiliary/LineBreakTest.html
34,872 changes: 18,509 additions & 16,363 deletions unicodetools/data/ucd/dev/auxiliary/LineBreakTest.txt
248 changes: 132 additions & 116 deletions unicodetools/data/ucd/dev/auxiliary/SentenceBreakTest.html
700 changes: 351 additions & 349 deletions unicodetools/data/ucd/dev/auxiliary/SentenceBreakTest.txt
232 changes: 127 additions & 105 deletions unicodetools/data/ucd/dev/auxiliary/WordBreakTest.html
2,732 changes: 1,426 additions & 1,306 deletions unicodetools/data/ucd/dev/auxiliary/WordBreakTest.txt
1,627 changes: 138 additions & 1,489 deletions unicodetools/src/main/java/org/unicode/text/UCD/GenerateBreakTest.java
255 changes: 159 additions & 96 deletions unicodetools/src/main/java/org/unicode/tools/Segmenter.java

Large diffs are not rendered by default.
@@ -16,19 +16,15 @@ $V=\p{Grapheme_Cluster_Break=V}
$T=\p{Grapheme_Cluster_Break=T}
$LV=\p{Grapheme_Cluster_Break=LV}
$LVT=\p{Grapheme_Cluster_Break=LVT}
-# Note: The following may overlap with the above
-# Note: ConjunctLinkingScripts is not used anymore, instead that list exists in the derivation of Indic_Conjunct_Break.
-# It is kept here so that the diff of the generated test cases compared to the Unicode 15.1 β is minimal.
-# TODO(egg): Consider removing in Unicode 16.0.
-$ConjunctLinkingScripts=[\p{Gujr}\p{sc=Telu}\p{sc=Mlym}\p{sc=Orya}\p{sc=Beng}\p{sc=Deva}]
$ConjunctLinker=\p{Indic_Conjunct_Break=Linker}
$LinkingConsonant=\p{Indic_Conjunct_Break=Consonant}
## $E_Base=\p{Grapheme_Cluster_Break=E_Base}
## $E_Modifier=\p{Grapheme_Cluster_Break=E_Modifier}
$ExtPict=\p{Extended_Pictographic}
-$ExtCccZwj=[\p{Indic_Conjunct_Break=Linker}\p{Indic_Conjunct_Break=Extend}]
+$ConjunctExtender=[\p{Indic_Conjunct_Break=Linker}\p{Indic_Conjunct_Break=Extend}]
## $EBG=\p{Grapheme_Cluster_Break=E_Base_GAZ}
## $Glue_After_Zwj=\p{Grapheme_Cluster_Break=Glue_After_Zwj}
+$XX = \p{Grapheme_Cluster_Break=Other}

# RULES

@@ -47,7 +43,7 @@ $ExtCccZwj=[\p{Indic_Conjunct_Break=Linker}\p{Indic_Conjunct_Break=Extend}]
# Only for extended grapheme clusters: Do not break before SpacingMarks, or after Prepend characters.
9.1) × $SpacingMark
9.2) $Prepend ×
-9.3) $LinkingConsonant $ExtCccZwj* $ConjunctLinker $ExtCccZwj* × $LinkingConsonant
+9.3) $LinkingConsonant $ConjunctExtender* $ConjunctLinker $ConjunctExtender* × $LinkingConsonant
## Do not break within emoji modifier sequences or emoji zwj sequences.
## 10) $E_Base $Extend* × $E_Modifier
11) $ExtPict $Extend* $ZWJ × $ExtPict
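Reviewer note on rule 9.3: since the renamed `$ConjunctExtender` covers both Linker and Extend, the left context amounts to a linking consonant followed by a run of Linker/Extend characters that contains at least one Linker. A minimal Python sketch of that check, taken in isolation, might look like the following. The function name and the string labels standing in for Indic_Conjunct_Break values are illustrative assumptions, not anything in Segmenter.java, and the other grapheme cluster rules are ignored.

```python
# Sketch only: labels are hypothetical stand-ins for Indic_Conjunct_Break values.
CONJUNCT_EXTENDER = ("Linker", "Extend")  # mirrors $ConjunctExtender

def gb9_3_no_break(incb, i):
    """Return True if rule 9.3 alone forbids a break before incb[i]."""
    if i <= 0 or i >= len(incb) or incb[i] != "Consonant":
        return False
    seen_linker = False
    j = i - 1
    # Walk back over the run of Linker/Extend characters.
    while j >= 0 and incb[j] in CONJUNCT_EXTENDER:
        seen_linker = seen_linker or incb[j] == "Linker"
        j -= 1
    # The run must contain a Linker and be preceded by a linking consonant.
    return seen_linker and j >= 0 and incb[j] == "Consonant"
```

For example, `gb9_3_no_break(["Consonant", "Extend", "Linker", "Extend", "Consonant"], 4)` is true, while a run with only Extend between the two consonants is not covered by this rule.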
@@ -62,7 +58,7 @@ $ExtCccZwj=[\p{Indic_Conjunct_Break=Linker}\p{Indic_Conjunct_Break=Extend}]

$AI=\p{Line_Break=Ambiguous}
$AK=\p{Line_Break=Aksara}
-$AL=\p{Line_Break=Alphabetic}
+$ALorig=\p{Line_Break=Alphabetic}
$AP=\p{Line_Break=Aksara_Prebase}
$AS=\p{Line_Break=Aksara_Start}
$B2=\p{Line_Break=Break_Both}
@@ -72,7 +68,7 @@ $BK=\p{Line_Break=Mandatory_Break}
$CB=\p{Line_Break=Contingent_Break}
$CL=\p{Line_Break=Close_Punctuation}
$CP=\p{Line_Break=CP}
-$CM=\p{Line_Break=Combining_Mark}
+$CMorig=\p{Line_Break=Combining_Mark}
$CR=\p{Line_Break=Carriage_Return}
$EX=\p{Line_Break=Exclamation}
$GL=\p{Line_Break=Glue}
@@ -88,13 +84,15 @@ $JT=\p{Line_Break=JT}
$JV=\p{Line_Break=JV}
$LF=\p{Line_Break=Line_Feed}
$NL=\p{Line_Break=Next_Line}
-$NS=\p{Line_Break=Nonstarter}
+$NSorig=\p{Line_Break=Nonstarter}
$NU=\p{Line_Break=Numeric}
$OP=\p{Line_Break=Open_Punctuation}
$PO=\p{Line_Break=Postfix_Numeric}
$PR=\p{Line_Break=Prefix_Numeric}
$QU=\p{Line_Break=Quotation}
-$SA=\p{Line_Break=Complex_Context}
+$SA_Mn=[\p{Line_Break=Complex_Context}&\p{gc=Mn}]
+$SA_Mc=[\p{Line_Break=Complex_Context}&\p{gc=Mc}]
+$SAmMnmMc=[\p{Line_Break=Complex_Context}-\p{gc=Mn}-\p{gc=Mc}]
$SG=\p{Line_Break=Surrogate}
$SP=\p{Line_Break=Space}
$SY=\p{Line_Break=Break_Symbols}
@@ -109,20 +107,23 @@ $EB=\p{Line_Break=E_Base}
$EM=\p{Line_Break=E_Modifier}
$ZWJ=\p{Line_Break=ZWJ}

-$QU_Pi=[$QU & \p{gc=Pi}]
-$QU_Pf=[$QU & \p{gc=Pf}]
+$Pi = \p{gc=Pi}
+$Pf = \p{gc=Pf}
+
+$QU_Pi=[$QU & $Pi]
+$QU_Pf=[$QU & $Pf]

-$QUmPi=[$QU - \p{gc=Pi}]
-$QUmPf=[$QU - \p{gc=Pf}]
+$QUmPi=[$QU - $Pi]
+$QUmPf=[$QU - $Pf]

$EastAsian = [\p{ea=F}\p{ea=W}\p{ea=H}]
$NonEastAsianBA = [$BA & [^$EastAsian]]

$DottedCircle = [◌]
$Hyphen = [\u2010]

-$CP30=[$CP-[\p{ea=F}\p{ea=W}\p{ea=H}]]
-$OP30=[$OP-[\p{ea=F}\p{ea=W}\p{ea=H}]]
+$CPmEastAsian=[$CP-$EastAsian]
+$OPmEastAsian=[$OP-$EastAsian]

$ExtPictUnassigned=[\p{Extended_Pictographic}&\p{gc=Cn}]
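Reviewer note on the set definitions in this hunk: `&` is set intersection and `-` is set difference, with an `m` in a variable name read as "minus" (`$QUmPi`, `$OPmEastAsian`). The same partitions can be sketched with Python sets and the standard `unicodedata` module. Python exposes no Line_Break property, so the `QU_SAMPLE` and `OP_SAMPLE` sets below are small hand-picked subsets assumed for illustration, and all names are hypothetical.

```python
import unicodedata

# Assumed, hand-picked subsets of Line_Break=QU and Line_Break=OP (illustration only).
QU_SAMPLE = {'"', "'", "\u00AB", "\u00BB", "\u201C", "\u201D"}
OP_SAMPLE = {"(", "\uFF08", "\u300C"}

def with_gc(chars, gc):
    """Members of chars whose General_Category is gc (a set intersection)."""
    return {c for c in chars if unicodedata.category(c) == gc}

def east_asian(chars):
    """Members of chars with East_Asian_Width F, W or H, like $EastAsian."""
    return {c for c in chars if unicodedata.east_asian_width(c) in ("F", "W", "H")}

QU_PI = with_gc(QU_SAMPLE, "Pi")         # like $QU_Pi=[$QU & $Pi]
QU_PF = with_gc(QU_SAMPLE, "Pf")         # like $QU_Pf=[$QU & $Pf]
QU_M_PI = QU_SAMPLE - QU_PI              # like $QUmPi=[$QU - $Pi]
QU_M_PF = QU_SAMPLE - QU_PF              # like $QUmPf=[$QU - $Pf]
OP_M_EAST_ASIAN = OP_SAMPLE - east_asian(OP_SAMPLE)  # like $OPmEastAsian
```

Within this sample, only the ASCII parenthesis survives in `OP_M_EAST_ASIAN`; the fullwidth parenthesis and the CJK corner bracket are excluded by their East_Asian_Width.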

@@ -136,10 +137,11 @@ $eot=(?!.)

# LB 1 Assign a line breaking class to each code point of the input.
# Resolve AI, CB, SA, SG, and XX into other line breaking classes depending on criteria outside the scope of this algorithm.
-# NOTE: CB is ok to fall through, but must handle others here.
-## show $AL
-$AL=[$AI $AL $SG $XX $SA]
-$NS=[$NS $CJ]
+# In the absence of such criteria all characters with a specific combination of
+# original class and General_Category property value are resolved as follows:
+$AL=[$AI $ALorig $SG $XX $SAmMnmMc]
+$CM=[$CMorig $SA_Mn $SA_Mc]
+$NS=[$NSorig $CJ]

# RULES
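Reviewer note on the LB 1 fallback: the new definitions resolve a character from its original class together with its General_Category, so Complex_Context (SA) characters with gc=Mn or gc=Mc become CM and the remaining SA characters become AL. A minimal Python sketch of that mapping is below. The function name is hypothetical, and the original class is passed in by the caller as an assumption, because the standard `unicodedata` module provides General_Category but not Line_Break.

```python
import unicodedata

def resolve_lb1(original_class, ch):
    """Map an original Line_Break class to its default resolved class.

    original_class is supplied by the caller (assumed known from the UCD);
    only the General_Category lookup uses real data here.
    """
    gc = unicodedata.category(ch)
    if original_class == "SA":
        # $CM picks up $SA_Mn and $SA_Mc; $AL picks up $SAmMnmMc.
        return "CM" if gc in ("Mn", "Mc") else "AL"
    if original_class in ("AI", "SG", "XX"):
        return "AL"
    if original_class == "CJ":
        return "NS"
    return original_class
```

For example, with Thai characters (Line_Break=SA), `resolve_lb1("SA", "\u0E01")` gives "AL" for the letter KO KAI and `resolve_lb1("SA", "\u0E31")` gives "CM" for the nonspacing vowel sign MAI HAN-AKAT.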

@@ -263,8 +265,8 @@ $NS=[$NS $CJ]
# LB 29 Do not break between numeric punctuation and alphabetics (\"e.g.\").
29) $IS × ($AL | $HL)
# LB 30 Do not break between letters, numbers or ordinary symbols and opening or closing punctuation.
-30.01) ($AL | $HL | $NU) × $OP30
-30.02) $CP30 × ($AL | $HL | $NU)
+30.01) ($AL | $HL | $NU) × $OPmEastAsian
+30.02) $CPmEastAsian × ($AL | $HL | $NU)
# LB 30a Break between two Regional Indicators if and only if there is an even number of them before the point being considered.
30.11) $sot ($RI $RI)* $RI × $RI
30.12) [^$RI] ($RI $RI)* $RI × $RI
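Reviewer note on LB 30a: rules 30.11 and 30.12 together say that a break between two Regional Indicators is forbidden exactly when an odd number of them precedes the break position, counting back to the start of text or to the nearest non-RI character. A minimal Python sketch of that parity check follows. The function names are hypothetical, and combining marks and ZWJ, which the full rule set attaches to the preceding character, are ignored here.

```python
def is_ri(ch):
    """True for Regional Indicator symbols U+1F1E6..U+1F1FF."""
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF

def lb30a_no_break(text, i):
    """Return True if rules 30.11/30.12 forbid a break before text[i]."""
    if i <= 0 or i >= len(text):
        return False
    if not (is_ri(text[i]) and is_ri(text[i - 1])):
        return False
    # Count the unbroken run of Regional Indicators ending just before i.
    run = 0
    j = i - 1
    while j >= 0 and is_ri(text[j]):
        run += 1
        j -= 1
    # An odd run means text[i] completes a pair, so no break is allowed.
    return run % 2 == 1
```

On four consecutive indicators (two flags), the check forbids a break at offsets 1 and 3 and leaves offset 2, the boundary between the flags, to the later rules.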
@@ -291,6 +293,7 @@ $ATerm=\p{Sentence_Break=ATerm}
$STerm=\p{Sentence_Break=STerm}
$Close=\p{Sentence_Break=Close}
$SContinue=\p{Sentence_Break=SContinue}
+$XX=\p{Sentence_Break=Other}
$Any=.

# SPECIAL EXTENSIONS
@@ -365,6 +368,7 @@ $ExtPict=\p{Extended_Pictographic}
## $EBG=\p{Word_Break=E_Base_GAZ}
## $Glue_After_Zwj=\p{Word_Break=Glue_After_Zwj}
$WSegSpace=\p{Word_Break=WSegSpace}
+$XX=\p{Word_Break=Other}

# MACROS
