* Add micro-blog exercise

  This is an exercise requiring students to truncate Unicode strings.

  Solves #1507

* Micro-blog: Don't assume native English speaker

  Thank you @SaschaMann for the feedback and suggestion. #1509 (comment)

  > I don't like that this assumes the perspective of a native English
  > speaker. English is a foreign language to most of the world. Perhaps
  > something along the lines of "text in most of the world's languages and
  > scripts" would be a better description.

* Micro-blog: Add tests for different languages

  Feedback from @SaschaMann #1509 (comment)

  > I think it would be nice to add some test cases that aren't emoji or
  > English - perhaps cases with germanic umlauts, cyrillic and/or greek
  > letters, historic scripts etc. - because that's one of the main uses
  > and goals of unicode.

  I've added German, Bulgarian, and Greek examples. All of them have
  non-English characters. None of these characters use multiple UTF-16
  code units. As such, if you use a UTF-8 programming language you may first
  have trouble with the German example, but if you use a UTF-16 language you
  will probably first have trouble at the Emoji example.

  I chose not to add an example with historic scripts, because I'm not aware
  of any that display nicely in my terminal or text editor. Perhaps some
  could be added in future.

  I wanted another example that would be problematic in UTF-16, so I added a
  poker hand example using playing cards.

* Micro-blog: Add German truncated example

  Comically, it goes from "bear carpet" to "beards".

  @SaschaMann, thank you for finding the example for me: #1509 (comment)

* Micro-blog: Add longer maths example

  The empty set is a proper subset of the natural numbers, which is a proper
  subset of the integers, which is a proper subset of the rational numbers,
  which is a proper subset of the reals, which is a proper subset of the
  complex numbers.

  It remains true when truncated, which is quite nice.
1 parent f8aaffb, commit f928002
Showing 3 changed files with 167 additions and 0 deletions.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,125 @@ | ||
{ | ||
"exercise": "micro-blog", | ||
"version": "1.0.0", | ||
"comments": [ | ||
"This exercise is only applicable to languages that use UTF-8, UTF-16", | ||
"or other variable width Unicode compatible encoding as their internal", | ||
"string representation.", | ||
"", | ||
"This exercise is probably too easy in languages that use Unicode aware", | ||
"string slicing.", | ||
"", | ||
"When adding tests to the problem specification, keep in mind that an", | ||
"in-progress solution may fail on different cases depending on whether", | ||
"the language uses UTF-8 or UTF-16.", | ||
"", | ||
"Avoid adding tests that involve characters (graphemes) that are made up", | ||
"of multiple code points, or introduce them as a more advanced step.", | ||
"", | ||
"Consider adding a track-specific hint.md stating whether your language uses", | ||
"UTF-8, UTF-16 or another encoding for its internal string representation." | ||
], | ||
"cases": [ | ||
{ | ||
"description": "Truncate a micro blog post", | ||
"cases": [ | ||
{ | ||
"description": "English language short", | ||
"property": "truncate", | ||
"input": { | ||
"phrase": "Hi" | ||
}, | ||
"expected": "Hi" | ||
}, | ||
{ | ||
"description": "English language long", | ||
"property": "truncate", | ||
"input": { | ||
"phrase": "Hello there" | ||
}, | ||
"expected": "Hello" | ||
}, | ||
{ | ||
"description": "German language short (broth)", | ||
"property": "truncate", | ||
"input": { | ||
"phrase": "brühe" | ||
}, | ||
"expected": "brühe" | ||
}, | ||
{ | ||
"description": "German language long (bear carpet → beards)", | ||
"property": "truncate", | ||
"input": { | ||
"phrase": "Bärteppich" | ||
}, | ||
"expected": "Bärte" | ||
}, | ||
{ | ||
"description": "Bulgarian language short (good)", | ||
"property": "truncate", | ||
"input": { | ||
"phrase": "Добър" | ||
}, | ||
"expected": "Добър" | ||
}, | ||
{ | ||
"description": "Greek language short (health)", | ||
"property": "truncate", | ||
"input": { | ||
"phrase": "υγειά" | ||
}, | ||
"expected": "υγειά" | ||
}, | ||
{ | ||
"description": "Maths short", | ||
"property": "truncate", | ||
"input": { | ||
"phrase": "a=πr²" | ||
}, | ||
"expected": "a=πr²" | ||
}, | ||
{ | ||
"description": "Maths long", | ||
"property": "truncate", | ||
"input": { | ||
"phrase": "∅⊊ℕ⊊ℤ⊊ℚ⊊ℝ⊊ℂ" | ||
}, | ||
"expected": "∅⊊ℕ⊊ℤ" | ||
}, | ||
{ | ||
"description": "English and emoji short", | ||
"property": "truncate", | ||
"input": { | ||
"phrase": "Fly 🛫" | ||
}, | ||
"expected": "Fly 🛫" | ||
}, | ||
{ | ||
"description": "Emoji short", | ||
"property": "truncate", | ||
"input": { | ||
"phrase": "💇" | ||
}, | ||
"expected": "💇" | ||
}, | ||
{ | ||
"description": "Emoji long", | ||
"property": "truncate", | ||
"input": { | ||
"phrase": "❄🌡🤧🤒🏥🕰😀" | ||
}, | ||
"expected": "❄🌡🤧🤒🏥" | ||
}, | ||
{ | ||
"description": "Royal Flush?", | ||
"property": "truncate", | ||
"input": { | ||
"phrase": "🃎🂸🃅🃋🃍🃁🃊" | ||
}, | ||
"expected": "🃎🂸🃅🃋🃍" | ||
} | ||
] | ||
} | ||
] | ||
} |
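As a rough illustration of how a track might consume these cases, here is a minimal Python sketch. The names `truncate`, `CASES` and `run_cases` are hypothetical and not part of this commit; only a handful of the cases are mirrored. Python strings are indexed by code point, so this is one of the languages the comments above describe as probably too easy.

```python
def truncate(phrase: str, limit: int = 5) -> str:
    """Keep at most `limit` code points (Python's str slices by code point)."""
    return phrase[:limit]


# A few (phrase, expected) pairs mirroring the canonical cases.
CASES = [
    ("Hi", "Hi"),
    ("Hello there", "Hello"),
    ("Bärteppich", "Bärte"),
    ("∅⊊ℕ⊊ℤ⊊ℚ⊊ℝ⊊ℂ", "∅⊊ℕ⊊ℤ"),
]


def run_cases(cases=CASES):
    """Return (phrase, passed) for each case."""
    return [(phrase, truncate(phrase) == expected) for phrase, expected in cases]


if __name__ == "__main__":
    for phrase, passed in run_cases():
        print("ok  " if passed else "FAIL", repr(phrase))
```

In a language whose strings are indexed by bytes or UTF-16 code units, the same slice would need a code-point-aware API instead.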
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,39 @@ | ||
You have identified a gap in the social media market for very, very short | ||
posts. Now that Twitter allows 280 character posts, people wanting quick | ||
social media updates aren't being served. You decide to create your own | ||
social media network. | ||
|
||
To make your product noteworthy, you make it extreme and only allow posts | ||
of 5 or fewer characters. Any posts of more than 5 characters should be | ||
truncated to 5. | ||
|
||
To allow your users to express themselves fully, you allow Emoji and | ||
other Unicode. | ||
|
||
The task is to truncate input strings to 5 characters. | ||
|
||
## Text Encodings | ||
|
||
Text stored digitally has to be converted to a series of bytes. | ||
Three ways of mapping characters to bytes are in common use: | ||
* **ASCII** can encode English-language characters. All | ||
characters are precisely 1 byte long. | ||
* **UTF-8** is a Unicode text encoding. Characters take between 1 | ||
and 4 bytes. | ||
* **UTF-16** is a Unicode text encoding. Characters are either 2 or | ||
4 bytes long. | ||
|
||
UTF-8 and UTF-16 are both Unicode encodings, which means they're capable of | ||
representing a massive range of characters including: | ||
* Text in most of the world's languages and scripts | ||
* Historic text | ||
* Emoji | ||
|
||
UTF-8 and UTF-16 are both variable length encodings, which means that | ||
different characters take up different amounts of space. | ||
|
||
Consider the letter 'a' and the emoji '😛'. In UTF-16 the letter takes | ||
2 bytes but the emoji takes 4 bytes. | ||
|
||
The trick to this exercise is to use APIs designed around Unicode | ||
characters (code points) instead of Unicode code units. |
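The difference between code points and code units can be sketched in Python. This is only an illustration under stated assumptions: the helper names `encoded_sizes`, `truncate_code_points` and `truncate_utf16_units` are hypothetical, and the last one deliberately imitates what a naive code-unit-based API would do.

```python
def encoded_sizes(ch: str) -> dict:
    """Bytes needed to store `ch` in each encoding."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        # "utf-16-le" is used so the 2-byte byte-order mark is not counted.
        "utf-16": len(ch.encode("utf-16-le")),
    }


def truncate_code_points(phrase: str, limit: int = 5) -> str:
    """Correct approach: count Unicode code points."""
    return phrase[:limit]


def truncate_utf16_units(phrase: str, limit: int = 5) -> str:
    """Naive approach: cut after `limit` UTF-16 code units (2 bytes each)."""
    raw = phrase.encode("utf-16-le")[: limit * 2]
    # A cut in the middle of a surrogate pair leaves an undecodable half,
    # which errors="replace" turns into a replacement character.
    return raw.decode("utf-16-le", errors="replace")


if __name__ == "__main__":
    emoji = "\U0001F600"  # an illustrative emoji outside the Basic Multilingual Plane
    print(encoded_sizes("a"))
    print(encoded_sizes(emoji))
    print(len(truncate_code_points(emoji * 7)))
    print(truncate_utf16_units(emoji * 7) == emoji * 5)
```

Counting code units gives fewer than five emoji and may split one in half, which is the failure mode the emoji and playing-card test cases are meant to expose in UTF-16 languages.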
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
--- | ||
title: "Micro Blog" | ||
blurb: "Given an input string, truncate it to 5 characters." |