by Sean Coates
Let's face it--this sentence is much "uglier" than the one below it.
Let’s face it–this sentence is much “prettier” than the one above it.
Lexentity is a simple piece of software that takes HTML as input and outputs a context-aware, medium-neutral representation of that HTML, with apostrophes, quotes, em dashes, ellipses, accents, etc., replaced with their respective numeric XML/Unicode entities.
Context is important. It is especially important when considering a piece of HTML like this:
<p>…and here's the example code:</p>
<pre><code>echo "watermelon!\n";</code></pre>
Contextually, you'd want here's to become here’s, but you certainly don't want the code to read echo “watermelon!\n”;.
A fancy/smart/curly apostrophe is appropriate in the prose, but curly quotes in the code are likely to cause a parse error.
Lexentity understands its context and acts appropriately by means of lexical analysis, turning tokens into text, rather than through a mostly naive and overly complicated regular expression.
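To make the idea concrete, here is a minimal sketch in Python of what context-aware conversion could look like. It is not Lexentity's actual code: the names `convert`, `prettify_text`, and `VERBATIM_TAGS` are hypothetical, the set of protected tags is an assumption, and the substitution rules are deliberately simplistic. A small regular expression is used only to split tags from text; the decision about what to convert comes from tracking which element the text sits in.

```python
import re

# Elements whose contents must be left untouched (an assumption for this sketch).
VERBATIM_TAGS = {"pre", "code", "script", "style"}

# Splitting on a capturing group keeps the tags: text lands at even
# indices of the result, tags at odd indices.
TOKEN_RE = re.compile(r"(<[^>]+>)")
TAG_NAME_RE = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)")


def prettify_text(text):
    """Replace plain punctuation in prose with numeric entities."""
    text = text.replace("...", "&#8230;")  # ellipsis
    text = text.replace("--", "&#8212;")   # em dash
    text = re.sub(r"(?<=\w)'(?=\w)", "&#8217;", text)        # apostrophe inside a word
    text = re.sub(r'"([^"]*)"', r"&#8220;\1&#8221;", text)   # paired double quotes
    return text


def convert(html):
    depth = 0  # number of verbatim elements currently open
    out = []
    for i, token in enumerate(TOKEN_RE.split(html)):
        if i % 2 == 1:  # a tag token
            m = TAG_NAME_RE.match(token)
            if m and m.group(2).lower() in VERBATIM_TAGS:
                if m.group(1):          # closing tag
                    depth = max(0, depth - 1)
                else:                   # opening tag
                    depth += 1
            out.append(token)
        elif depth:     # text inside <pre>, <code>, etc.: leave alone
            out.append(token)
        else:           # ordinary prose
            out.append(prettify_text(token))
    return "".join(out)
```

Fed the example above, `convert` would rewrite the apostrophe in the paragraph and leave the straight quotes inside the `<pre><code>` block alone. A real lexer would also need to handle attributes, comments, nested quotes, and malformed markup, which this sketch ignores.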
My friend and colleague Jon Gibbins said it best in [this piece on his blog](http://dotjay.co.uk/2006/sep/named-html-entities-in-rss). In modern systems, you can't count on your HTML to always be represented as HTML. It's often (poorly) embedded in RSS or other HTML-like media, as XML.
Therefore, it is important to avoid HTML-specific entities like &rdquo; and &hellip;, and instead use their Unicode code points to form numeric entities such as &#8230;. This ensures proper display on any terminal that can properly render Unicode XML, and avoids missing-entity errors.
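As a rough illustration of that substitution (again a sketch, not how Lexentity itself does it), the Python standard library's `html.entities.name2codepoint` table can map HTML entity names to code points. The function name `to_numeric_entities` is hypothetical, and the choice to leave the five XML-predefined entities untouched is an assumption made so that escaped markup stays escaped.

```python
import re
from html.entities import name2codepoint

# The five entities predefined by XML are valid in any XML document,
# so they are kept as-is.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

NAMED_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric_entities(markup):
    def replace_named(match):
        name = match.group(1)
        # Unknown names are left alone rather than guessed at.
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return "&#%d;" % name2codepoint[name]

    markup = NAMED_ENTITY_RE.sub(replace_named, markup)
    # Any literal non-ASCII character becomes a decimal character reference.
    return markup.encode("ascii", "xmlcharrefreplace").decode("ascii")
```

With this, `&hellip;` becomes `&#8230;` and `&rdquo;` becomes `&#8221;`, while `&amp;` and `&lt;` pass through unchanged.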
Try a demo at http://files.seancoates.com/lexentity/.