Language Reference
hq
uses Beautiful Soup version 4, configured with the Python html.parser, to parse its HTML input.
hq
started out as an implementation of XPath 1.0 for HTML. Unlike XPath, however, HQuery provides no support for namespaces, since they don't exist as such in HTML. So element name tests, for example, never have a prefix.
While HQuery does support the comment()
node test (as well as node()
and text()
), it doesn't support processing-instruction()
because HTML.
HQuery introduces some deviations from standard XPath behavior in the interest of simplifying the more common use cases in hq
's wheelhouse, namely scraping data from HTML and producing programmatic output.
The string-value of Nodes
When 'hq' derives a string-value from HTML element or attribute content, it automatically normalizes space in the string (by running the text through the normalize-space()
function) and it converts non-breaking space characters to regular spaces. This is all done so that you don't have to pack your expressions with wordy normalize-space()
and translate()
calls.
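To make the cleanup concrete, here is a minimal Python sketch of the normalization described above. It is not hq's actual code; the function name `normalized_string_value` is made up for illustration, and it assumes XPath's definition of whitespace (space, tab, carriage return, line feed).

```python
import re


def normalized_string_value(raw: str) -> str:
    """Approximate the string-value cleanup described above (hypothetical helper)."""
    # Convert non-breaking spaces (U+00A0) to regular spaces first.
    text = raw.replace("\u00a0", " ")
    # normalize-space(): collapse runs of XPath whitespace, then trim the ends.
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" ")


print(normalized_string_value("  Hello\u00a0\n  world \t"))
```

The printed result is the same text with a single space between the two words and nothing at either end.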
HQuery supports all of the core node set functions except local-name()
and namespace-uri()
, which are useful only for namespace-bound XML content.
HQuery mostly supports all of the core string functions except translate()
, which seemed like a tiresome and inferior substitute for regular expression substitution. I say "mostly" because HQuery's underlying numeric support is all based on Python, so the language doesn't support positive or negative infinity or negative zero values (yes, I know that a Python float
can be negative zero; support might sneak in by accident, but I haven't done anything to cover it). The NaN-related edge cases discussed for the substring()
function are also mostly unsupported.
All of the core Boolean functions are supported except the lang()
function, which is tied to the standard xml:lang
attribute.
Support for the core number functions, like the string functions, is affected by HQuery's underlying use of Python numbers (and my disinterest in exploring the work necessary to fully support XPath numerical semantics when the Python ones are perfectly serviceable). One noteworthy result of these semantic differences is that the round()
function rounds outward from zero, rather than upward towards positive infinity, so round(-1.5)
evaluates to -2 in HQuery where it would produce -1 in XPath.
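The difference between the two rounding rules is easy to see in a small Python sketch. Both helpers below are illustrative names of my own, not functions from hq; they just spell out "ties go toward positive infinity" versus "ties go away from zero" as described above.

```python
import math


def xpath_round(x: float) -> int:
    # XPath 1.0 rule: when two integers are equally close, pick the one nearer +infinity.
    return math.floor(x + 0.5)


def outward_round(x: float) -> int:
    # The "outward from zero" rule described above: ties move away from zero.
    return int(math.copysign(math.floor(abs(x) + 0.5), x))


for value in (-1.5, -0.4, 1.5, 2.5):
    print(value, xpath_round(value), outward_round(value))
```

Only the negative tie differs: for -1.5 the first helper gives -1 and the second gives -2.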
HQuery adds some XPath-flavored stuff that isn't in any of the XPath standards.
HQuery provides a new axis and a class()
function to simplify queries based on CSS class names.
HQuery adds a new axis, class
, to the set of axes supported in standard XPath. Only a name test (i.e., a name) is legal as the node test following this axis, and the axis acts like the child
axis with a name test, except only children that have the given name in their class
attribute will be selected. Here's the equivalence to standard XPath, using the regular expression matches()
function from the XPath 3 function library standard:
class::foo --> child::*[ matches(@class, "(^| )foo($| )") ]
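Since HQuery's regular expressions are Python's, the same test can be tried out directly in Python. This is only a sketch of the equivalence shown above, with a hypothetical helper name (`has_class`); like the XPath expression, it treats only the space character as a separator between class names.

```python
import re


def has_class(class_attr: str, name: str) -> bool:
    """Return True if `name` appears as a whole word in a class attribute value."""
    # Same shape as the pattern above: start-or-space, the name, end-or-space.
    pattern = r"(^| )" + re.escape(name) + r"($| )"
    return re.search(pattern, class_attr) is not None


print(has_class("foo bar", "foo"))   # whole class name
print(has_class("foobar", "foo"))    # only a prefix of a longer name
```

The point of the anchors is the second call: a plain substring test would wrongly accept "foobar".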
boolean class(string, node-set?)
The class()
function accepts the name of a CSS class and an optional node set. If no second parameter is passed, then the function tests the current context node to see whether its class
attribute contains the given CSS class name. If the second argument is passed, the first (and only the first) node in the node set is tested instead of the context node. In either case, the function returns true
if the tested node's class
attribute contains the given class name, and false otherwise.
Standard XPath supplies only a couple of abbreviated forms for its axes, namely "@" for attribute::
, ".." for parent::node()
and "." for self::node()
. Some of those other axes are pretty wordy, so HQuery provides abbreviations for all of the other axes you're likely to use. To keep parsing simple, all of these novel abbreviations still require the trailing double-colon to be recognized as axes (unlike "@" for attribute::
):
^:: --> ancestor::
^^:: --> ancestor-or-self::
.:: --> class::
~:: --> descendant::
>>:: --> following::
>:: --> following-sibling::
<<:: --> preceding::
<:: --> preceding-sibling::
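The table above amounts to a simple lookup keyed on whatever precedes the double colon. The Python sketch below shows that idea for a single location step; `AXIS_ABBREVIATIONS` and `expand_step` are names invented for this example and say nothing about how hq's parser is actually organized.

```python
# Abbreviation -> full axis name, copied from the table above.
AXIS_ABBREVIATIONS = {
    "^": "ancestor",
    "^^": "ancestor-or-self",
    ".": "class",
    "~": "descendant",
    ">>": "following",
    ">": "following-sibling",
    "<<": "preceding",
    "<": "preceding-sibling",
}


def expand_step(step: str) -> str:
    """Expand an abbreviated axis in one location step, e.g. '^::div'."""
    axis, separator, node_test = step.partition("::")
    if not separator:
        # No double colon, so there is no explicit axis to expand.
        return step
    return AXIS_ABBREVIATIONS.get(axis, axis) + "::" + node_test


print(expand_step("^::div"))
print(expand_step(".::menu"))
```

Requiring the trailing double colon is what keeps the lookup unambiguous: "." on its own still means the context node, while ".::" names the class axis.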
HQuery provides a few functions that don't appear in any of the XPath standards.
The even()
and odd()
functions take no argument, and they both return a boolean value. They each examine the current context position (the value returned by the standard position()
function) and return true if the value is even or odd, respectively.
let $odd-rows := //table[1]//tr[ odd() ]
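One thing worth keeping in mind is that context positions start at 1, so odd() keeps the first, third, fifth items and so on. A rough Python analogue of the predicate above, using made-up row data:

```python
rows = ["row 1", "row 2", "row 3", "row 4", "row 5"]

# enumerate(..., start=1) mimics XPath's 1-based position().
odd_rows = [row for position, row in enumerate(rows, start=1) if position % 2 == 1]
even_rows = [row for position, row in enumerate(rows, start=1) if position % 2 == 0]

print(odd_rows)
print(even_rows)
```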
There are numerous useful functions in the XPath and XQuery Functions and Operators 3.0 spec, and HQuery cherry-picks a number of them. Since there are no namespaces in HQuery, the namespace prefix that these functions carry in the standard is, naturally, removed.
Here they are, categorized as they are in the standard:
Unlike the standard functions, these functions' arguments are required, not optional, and the functions will automatically convert non-strings to their string values.
The HQuery matches()
function deviates from the standard one in three ways: the regular expression syntax and flags differ as discussed under Regular Expression Support, the input object needn't be a string (the function will use the string value of whatever input it's given), and the function's second argument is optional. When called with only one argument, matches()
assumes the argument contains the search pattern and it searches the string value of the current context node. This makes for simpler predicates, though it doesn't allow you to pass any flags:
//a[matches("[Ll]og(in|out)")]
Like the matches()
function, the replace()
function deviates from the standard by being based on Python regular expressions and by accepting any object as its input (automatically operating on the string value of that object). Other than the type of the first argument, however, the signature is the same as the one in the standard.
The HQuery string-join()
function, unlike the standard one, accepts a sequence of any types of objects, and it joins the string value (the result of calling the string()
function) of each item in the sequence.
The tokenize()
function deviates from its standard counterpart in exactly the same way as the replace()
function, above, does.
HQuery borrows basic automation constructs from XQuery like iteration and branching, but it leaves behind all of the complex XML Schema integration features that don't apply to HTML. Since hq
is designed to be used in the form of expressions on the command line, HQuery doesn't allow you to define your own functions as XQuery does.
HQuery also borrows the notion of sequences from XQuery, including ranges and the use of commas to construct a sequence explicitly.
hq
implements just the for
, let
and return
clauses of the standard XQuery FLWOR expression. All FLWOR expressions have to conclude with a return
expression, and let
clauses can appear before the for
clause (where variables will apply globally across all iterations) or after the for
clause (where they will be local to a given iteration):
let $pi := 3.14159
for $el in //circle
let $C := 2 * $pi * number($el/@r)
return `Circle at (${$el/@cx}, ${$el/@cy}) circumference is $C`
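If the scoping of the two let clauses is unclear, it may help to read the query as an ordinary loop. The Python sketch below is only an analogy: the `circles` data is invented, and the number formatting is my own choice rather than what hq prints.

```python
# Stand-in for the nodes selected by //circle (made-up attribute values).
circles = [
    {"cx": "10", "cy": "20", "r": "1"},
    {"cx": "30", "cy": "40", "r": "2"},
]

pi = 3.14159  # "let" before "for": bound once, shared by every iteration

lines = []
for el in circles:  # "for $el in //circle"
    circumference = 2 * pi * float(el["r"])  # "let" after "for": rebound per item
    lines.append(
        f"Circle at ({el['cx']}, {el['cy']}) circumference is {circumference:.4f}"
    )

print("\n".join(lines))
```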
The for
clause is optional; you might use a FLWOR just because you want to declare a variable:
let $r := //rect[@id="foo"] return $r/@width * $r/@height
HQuery offers a shorthand for all of this semi-wordy for
and return
business, when you just need to iterate and don't need to declare any variables. It's called an abbreviated FLWOR:
//circle -> `Circle at (${$_/@cx}, ${$_/@cy}) circumference is ${$_/@r * 6.2832}`
The expression to the left of the arrow plays the same role as the expression that follows the keyword in
in a regular FLWOR for
clause, and the expression to the right is the return
expression. In that expression, the variable $_
is implicitly defined to contain the sequence item at the current step in the iteration.
You can use XQuery's conditional expression syntax directly:
if (//input[@type="number"]) then "Has number fields" else "No number fields"
You can use the to
keyword to produce a range of numbers, and the comma (",") operator to assemble sequences:
~$ echo '<html/>' | hq '(1 to 3) -> `And-a-$_!`'
And-a-1!
And-a-2!
And-a-3!
~$ echo '<html/>' | hq '1, "two", 3.3, true()'
1
two
3.3
true
hq
supports interpolated or "template" string literals, just like JavaScript (ES6), Ruby and other languages. Interpolated strings are surrounded by back-ticks, unlike regular string literals (which are surrounded by either single or double quotation marks).
Inside an interpolated string, a bare variable reference (a dollar sign followed by a name) will be replaced with the string value of the variable:
let $foo := "world" return `Hello, $foo!`
If you need to evaluate a more complicated expression, put it inside curly braces following a dollar sign:
let $pi := 3.14159 return `The unit circle's circumference is ${2 * $pi}`
If you need to use a dollar sign in your string without HQuery interpreting it as the beginning of an embedded expression, you can use an HTML entity:
`Price: &#36;${//td[@id="price"]}`
Complex embedded expressions in HQuery interpolated strings (the ones with curly braces, like ${2 * $pi}
), can include one or more "filters" to transform the evaluated value of the expression before injecting it into the string. These filters appear immediately after the open curly brace character, and take the form of a very short abbreviation followed by a sequence of filter arguments delimited by colons.
This first example illustrates the "join" filter, which concatenates the string values of all of the items in a sequence with an optional delimiter in between items. The "join" filter's abbreviation is j
, and it accepts the delimiter as its single, optional argument:
`${j:, : //ul[@id="widget-menu"]/li }`
   -----
     |
     Here's the filter part, in case you missed it.
Given the following HTML:
<html>
<body>
<ul id="widget-menu">
<li>Add a widget</li>
<li>Search for a widget</li>
<li>Delete some widgets</li>
</ul>
</body>
</html>
The HQuery expression would produce:
Add a widget, Search for a widget, Delete some widgets
If we decide not to provide any delimiter value, we still have to put the same number of colons ("optional" just means that you can leave the space between the colons empty):
~$ cat document.html | hq '`${j:: //ul[@id="widget-menu"]/li }`'
Add a widgetSearch for a widgetDelete some widgets
The spaces around the embedded expression, by the way, are just there to improve clarity, not because the parser needs them.
The tru
for "truncate" filter slices strings that are longer than a given maximum length at a graceful word boundary and appends an optional suffix:
~$ cat document.html | hq '`${tru:10:...://ul[@id="widget-menu"]/li[2]}`'
Search for...
Filters can be chained left-to-right, so the result of the first transform becomes the input for the next transform. Here's an example using the tru
and j
filters:
~$ cat document.html | hq '`${tru:12:...:j:, : //ul[@id="widget-menu"]/li}`'
Add a widget, Search for..., Delete some...
If you need to include a colon as part of an argument value, you can use an HTML entity:
${j: &#58; :/some/elements}
The j
filter concatenates the string values of all of the items in a sequence with an optional delimiter in between items. It accepts a single, optional parameter whose contents will be inserted between items as they are concatenated together.
${j: and :1 to 3} --> 1 and 2 and 3
${j::1 to 3} --> 123
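In Python terms the join filter behaves much like `str.join` applied to the string value of each item. A minimal sketch, with a hypothetical function name and with `str()` standing in for HQuery's string-value conversion:

```python
def join_filter(items, delimiter=""):
    """Concatenate the string form of each item, with an optional delimiter."""
    return delimiter.join(str(item) for item in items)


print(join_filter(range(1, 4), " and "))
print(join_filter(range(1, 4)))
```

Note that `range(1, 4)` is the Python spelling of "1 to 3", since Python ranges exclude their upper bound.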
The rr
filter searches for substrings matching a given regular expression pattern and replaces them with a replacement pattern. If the input is atomic, then the transform will be applied only to the string value derived from that atomic value, but if the input is a sequence, then the transform will be applied to each item in a sequence of string values derived from the input sequence, resulting in a new sequence of strings.
rr
takes three arguments: a search pattern, a replacement pattern, and flags. Only the first argument, the search pattern, is required; the other two can be empty. See Regular Expression Support, below, for details about supported regular expression syntax and flags.
${rr:(\w+)ain:\1ood:i:"The rain in SPAIN..."} --> The rood in SPood...
${rr: :::"The rain in SPAIN..."} --> TheraininSPAIN...
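The atomic-versus-sequence behavior can be sketched in a few lines of Python. This is an illustration under assumptions, not hq's code: the function name is invented, sequences are modeled as lists or tuples, and flags are passed as Python `re` constants rather than as flag letters.

```python
import re


def regex_replace(value, pattern, replacement="", flags=0):
    """Apply a regex substitution to one value, or to each item of a sequence."""
    compiled = re.compile(pattern, flags)
    if isinstance(value, (list, tuple)):
        # Sequence input: transform every item, producing a new sequence of strings.
        return [compiled.sub(replacement, str(item)) for item in value]
    # Atomic input: transform its string value only.
    return compiled.sub(replacement, str(value))


print(regex_replace("The rain in SPAIN...", r"(\w+)ain", r"\1ood", re.IGNORECASE))
print(regex_replace("The rain in SPAIN...", " "))
print(regex_replace(["a b", "c d"], " "))
```

The first two calls mirror the two filter examples above; the third shows the per-item behavior for a sequence.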
The tru
filter truncates string values at word boundaries when they exceed a given length, and it adds an optional suffix to the end of such truncated strings.
${tru:10:...:"The rain in Spain"} --> The rain...
${tru:10::"The rain in Spain"} --> The rain
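The exact rule hq uses to pick the word boundary isn't spelled out here, so the following Python sketch is just one plausible reading that happens to reproduce the two examples above. The name `truncate_at_word` is hypothetical.

```python
def truncate_at_word(text: str, max_length: int, suffix: str = "") -> str:
    """Cut `text` back to a word boundary if it is longer than `max_length`."""
    if len(text) <= max_length:
        return text  # short enough: no truncation and no suffix
    # Look one character past the limit so a cut that lands exactly
    # at the end of a word keeps that word whole.
    head = text[: max_length + 1]
    if " " in head:
        head = head.rsplit(" ", 1)[0]
    else:
        head = text[:max_length]  # a single long word: hard cut
    return head + suffix


print(truncate_at_word("The rain in Spain", 10, "..."))
print(truncate_at_word("The rain in Spain", 10))
```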
I considered the XQuery typeswitch construct as a means of dispatching transforms based on the "type" (element name) of elements in a node set, but typeswitch is awfully wordy, even when one boils the XML-Schema-based type arbitration down to just element names, and it seemed to me there were use cases where dispatching based on just element names wouldn't suffice. Unions, I felt, provided a convenient, terse and flexible means of grouping items in a node set for this purpose, so I created an iteration-like (FLWOR-like) construct with a union on the left-hand side and a thing that looks like a union on the right hand side. Consider this simple input document:
<html><body>
<h1>Overview</h1>
<p>HQuery has some novel features:</p>
<ul>
<li>String interpolation with filters</li>
<li>Union decomposition</li>
</ul>
</body></html>
What if we wanted to turn this HTML into Markdown? Here's how we'd use union decomposition:
/html/body -> ($_/h1 | $_/p | $_/ul/li) => `# $_` | `$_` | `* $_`
We start off with an abbreviated FLWOR so that we don't have to repeat the /html/body
in front of all of the clauses in the union that follows (we just use $_
). The union assembles all of the elements we're interested in into a node set, retaining their original document order. (Everything so far is just how unions work; we haven't gotten to the decomposition part yet.) We need the parentheses around the union, by the way, because the union operator has relatively low operator precedence, so the whole expression would be misinterpreted otherwise.
The =>
operator and the union-like thing that follows it are where the magic happens. Notice that it has exactly the same number of clauses as the union to the left of the operator. Notice, also, that it makes use of the implicitly-defined $_
variable. Here's what the query produces:
# Overview
HQuery has some novel features:
* String interpolation with filters
* Union decomposition
Here's what happened: Each of the elements in the node set produced by the union "remembered" which clause produced it. On the right-hand side of the union decomposition operator, the element "selected" the clause in the corresponding position and evaluated that expression, assigning itself to the $_
variable. So the elements are iterated over in document order, and a different transformation is applied to each one based on which of the clauses in the original union first produced it.
When an element is produced by more than one clause in the union, it "remembers" that it came from the first one (in left-to-right order).
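Here is the same bookkeeping acted out in Python, in case the "remembering" is easier to follow as a loop. Everything in it is a stand-in: nodes are modeled as (tag, text) pairs already in document order, the union clauses are simple predicates, and the right-hand templates are format strings.

```python
# Nodes in document order, as (tag name, text) pairs.
document = [
    ("h1", "Overview"),
    ("p", "HQuery has some novel features:"),
    ("li", "String interpolation with filters"),
    ("li", "Union decomposition"),
]

# Left-hand side: one predicate per union clause.
clauses = [
    lambda node: node[0] == "h1",
    lambda node: node[0] == "p",
    lambda node: node[0] == "li",
]

# Right-hand side: one template per clause, in the same positions.
templates = ["# {}", "{}", "* {}"]


def decompose(nodes, clauses, templates):
    output = []
    for node in nodes:  # iterate in document order
        for index, clause in enumerate(clauses):
            if clause(node):
                # The first matching clause wins, left to right.
                output.append(templates[index].format(node[1]))
                break
    return output


print("\n".join(decompose(document, clauses, templates)))
```

Running this prints the same four Markdown lines as the query above.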
HQuery supports a stripped down version of XQuery's computed constructors, in this case for HTML (really, XHTML) instead of XML. It also provides a similar facility for constructing JSON. In both cases, you could accomplish the same outcomes using concatenated or interpolated strings, but computed constructors provide a more elegant syntax for distinguishing structure from content, and they guarantee grammatical correctness.
The syntax is element <name> { <expression> }
, where <name>
is the tag name and <expression>
is any HQuery expression. If the expression produces an element node, then that node will be adopted as a child of the element being constructed. If it produces a simple value like a string or a number, then its string value will become the constructed element's text content.
Let's say we're querying the document <ul><li>one</li><li>two</li></ul>
:
element h1 { "Overview" }          --> <h1>Overview</h1>
element h1 { //li[1]/text() }      --> <h1>one</h1>
//li/text() -> element img {       --> <img href="one.jpg"/>
  attribute href {`$_.jpg`}            <img href="two.jpg"/>
}
element tr {                       --> <tr>
  //li -> element td {                   <td>one</td>
    $_/text()                            <td>two</td>
  }                                    </tr>
}
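The appeal of constructors over string concatenation is the same one you get from building a tree in any language: the serializer handles the tags, so the output is always well formed. As a loose analogy (using Python's standard library, not anything from hq), the last example above corresponds to:

```python
import xml.etree.ElementTree as ET

items = ["one", "two"]  # stand-in for the text of each <li>

row = ET.Element("tr")  # element tr { ... }
for text in items:  # //li -> element td { $_/text() }
    cell = ET.SubElement(row, "td")
    cell.text = text

markup = ET.tostring(row, encoding="unicode")
print(markup)
```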
You can also use the hash
and array
computed constructors to build JSON output. Given the input document <div id="abc">laurel</div><div id="def">hardy</div>
:
hash { id: /div[1]/@id,            --> {"id":"abc", "text":"laurel"}
       text: /div[1]/text() }
array { /div/text() }              --> ["laurel", "hardy"]
If you don't specify attribute names in a hash, HQuery will use element names and text contents as hash attribute names and values. If there are multiple elements with the same tag name, then HQuery assembles their "values" into a list. So if we're querying <body><h1>Title</h1><p>introduction</p><p>explanation</p></body>
:
hash { /body/* } --> {"h1":"Title", "p":["introduction", "explanation"]}
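The grouping rule for repeated tag names can be sketched in Python as follows. This is my own illustration of the behavior described above, with elements modeled as (tag, text) pairs and a made-up function name; it makes no claim about how hq implements it.

```python
import json


def hash_from_elements(elements):
    """Build a dict keyed by tag name, collecting repeated tags into lists."""
    result = {}
    for tag, text in elements:
        if tag not in result:
            result[tag] = text  # first occurrence: plain value
        elif isinstance(result[tag], list):
            result[tag].append(text)  # third and later occurrences
        else:
            result[tag] = [result[tag], text]  # second occurrence: promote to list
    return result


body_children = [("h1", "Title"), ("p", "introduction"), ("p", "explanation")]
print(json.dumps(hash_from_elements(body_children)))
```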
Where functions and filters in HQuery accept regular expression patterns, these patterns must be based on Python's regular expression language. There are obvious similarities between this syntax and the one specified for XPath 3.0, but hq
isn't designed as an execution target for existing XPath or XQuery code; it borrows only what makes sense from those languages for its Web-scraping-utility use cases. Python regular expressions were the most direct path to powerful regex support.
hq
does take the trouble to translate the subset of standard regular expression flags that it supports, as this was easy enough to do. Specifically, HQuery supports the "i" (ignore case), "m" (multi-line mode), "s" ("dot-all" mode) and "x" (verbose) flags. Not that you're likely to use the verbose flag.
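The four flag letters line up one-to-one with constants in Python's `re` module, so the translation can be pictured as a small lookup. The sketch below is an assumption about the general shape of that mapping, not a copy of hq's code; `FLAG_MAP`, `translate_flags` and this `matches` helper are illustrative names.

```python
import re

# XPath-style flag letters and their Python equivalents.
FLAG_MAP = {
    "i": re.IGNORECASE,  # ignore case
    "m": re.MULTILINE,   # ^ and $ match at line boundaries
    "s": re.DOTALL,      # "." also matches newlines
    "x": re.VERBOSE,     # whitespace in the pattern is ignored
}


def translate_flags(letters: str) -> int:
    """Combine flag letters into a single value for Python's re functions."""
    combined = 0
    for letter in letters:
        if letter not in FLAG_MAP:
            raise ValueError(f"unsupported flag: {letter!r}")
        combined |= FLAG_MAP[letter]
    return combined


def matches(text: str, pattern: str, flags: str = "") -> bool:
    """Search anywhere in `text`, in the spirit of the matches() function."""
    return re.search(pattern, text, translate_flags(flags)) is not None


print(matches("LOGIN", "log(in|out)"))
print(matches("LOGIN", "log(in|out)", "i"))
```

The two calls differ only in the "i" flag, which is what turns the failed match into a successful one.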