johndouthat edited this page Dec 18, 2011 · 34 revisions

A digest of most of the methods documented at nokogiri.org. rubyforge.org hosts an alternate source for the API docs, which may be easier to read.

Topics not covered: RelaxNG validation, Slop, SAX parsing, and Builder. See also: http://cheat.errtheblog.com/s/nokogiri

Strings are always stored as UTF-8 internally. Methods that return text values will always return UTF-8 encoded strings. Methods that return XML (like to_xml, to_html and inner_html) will return a string encoded like the source document.

More Resources

Creating and working with Documents

Nokogiri::HTML::Document Nokogiri::HTML::DocumentFragment Nokogiri::XML::Document Nokogiri::XML::DocumentFragment

  doc = Nokogiri(string_or_io) # Nokogiri will try to guess what type of document you are attempting to parse
  doc = Nokogiri::HTML(string_or_io) # [, url, encoding, options, &block]
  doc = Nokogiri::XML(string_or_io) # [, url, encoding, options, &block]
    # set options with block {|config| config.noblanks.noent.noerror.strict }
    # OR with a bitmask {|config| config.options = Nokogiri::XML::NOBLANKS | Nokogiri::XML::NOENT}
    # http://nokogiri.org/Nokogiri/XML/ParseOptions.html
  # doc = Nokogiri.parse(...)
  # doc = Nokogiri::XML.parse(...) #shortcut to Nokogiri::XML::Document.parse
  # doc = Nokogiri::HTML.parse(...) #shortcut to Nokogiri::HTML::Document.parse

  nodeset = Nokogiri::XML.fragment(string)
  nodeset = Nokogiri::HTML.fragment(string, encoding = nil)
  
  nc = Nokogiri::HTML::NamedCharacters # a Nokogiri::HTML::EntityLookup
  nc[key] # like nc.get(key).try(:value) # e.g. nc['gt'] (62) or nc['rsquo'] (8217)
  nc.get(key) # returns a Nokogiri::HTML::EntityDescription
    # e.g. nc.get('rsquo') #=>  #<struct Nokogiri::HTML::EntityDescription value=8217, name="rsquo", description="right single quotation mark, U+2019 ISOnum">
  
  # Adding a Processing Instruction (like <?xml-stylesheet?>)
  # Nokogiri::XML::ProcessingInstruction http://nokogiri.org/tutorials/modifying_an_html_xml_document.html
  pi = Nokogiri::XML::ProcessingInstruction.new(doc, "xml-stylesheet",'type="text/xsl" href="foo.xsl"')
  doc.root.add_previous_sibling(pi)
  
  # document namespaces
  doc.collect_namespaces
  doc.remove_namespaces!
  doc.namespaces
  
  # shortcuts for creating new nodes
  doc.create_cdata(string, &block)
  doc.create_comment(string, &block)
  doc.create_element(name, *args, &block) # Create an element with name, and optionally setting the content and attributes.
  doc.create_entity
  doc.create_text_node(string, &block)
  
  doc.root
  doc.root=node
  
  # A document is a Node, so see working_with_a_node

Working with a Nokogiri::XML::Node

  node = Nokogiri::XML::Node.new('name', document) # initialize a new node
  # or shorter: 
  node = document.create_element('name')
  
  node.document
  
  node.name # alias of node.node_name
  node.name= # alias of node.node_name=
  
  node.read_only?
  node.blank?
  
  # Type of Node
  node.type # alias of node.node_type
  node.cdata? # type == CDATA_SECTION_NODE
  node.comment? # type == COMMENT_NODE
  node.element? # type == ELEMENT_NODE alias node.elem? 
  node.fragment? # type == DOCUMENT_FRAG_NODE (Document fragment node)
  node.html? # type == HTML_DOCUMENT_NODE
  node.text? # type == TEXT_NODE
  node.xml? # type == DOCUMENT_NODE (Document node type)
  # other types not covered by a convenience method
    # ATTRIBUTE_DECL: Attribute declaration type
    # ATTRIBUTE_NODE: Attribute node type
    # DOCB_DOCUMENT_NODE: DOCB document node type
    # DOCUMENT_TYPE_NODE: Document type node type
    # DTD_NODE: DTD node type
    # ELEMENT_DECL: Element declaration type
    # ENTITY_DECL: Entity declaration type
    # ENTITY_NODE: Entity node type
    # ENTITY_REF_NODE: Entity reference node type
    # NAMESPACE_DECL: Namespace declaration type
    # NOTATION_NODE: Notation node type
    # PI_NODE: PI node type
    # XINCLUDE_END: XInclude end type
    # XINCLUDE_START: XInclude start type
  
  # Attributes, like a hash
  node['src'] # aliases: node.get_attribute, node.attr
  node['src'] = 'value' # alias node.set_attribute
  node.key?('src') # alias node.has_attribute?
  node.keys # array of strings
  node.values
  node.delete('src') # alias of node.remove_attribute
  node.each { |attr_name, attr_value| }
  # include Enumerable, which works on these attribute names and values
  
  # Attribute Nodes
  node.attribute('src') # Get the attribute node with name src
    # Nokogiri::XML::Attr < Nokogiri::XML::Node
    # can use .content= or .value= to modify
  node.attribute_nodes # returns a list containing the Node attributes.
  node.attribute_with_ns('src', 'namespace') # Get the attribute node with name and namespace
  node.attributes # Returns a hash containing the node's attributes. The key is the attribute name without any namespace, the value is a Nokogiri::XML::Attr representing the attribute. If you need to distinguish attributes with the same name, with different namespaces use #attribute_nodes instead.
  
  
  
  
  # Traversing / Modifying
  # +node_or_tags+ can be a Nokogiri::XML::Node, a ::DocumentFragment, a ::NodeSet, or a string containing markup.
  ## Self
  node.traverse {|node| } # yields self and all children to block _recursively_
  node.remove # alias of node.unlink # Unlink this node from its current context.
  node.replace(node_or_tags)
    # Replace this Node with +node_or_tags+.
    # Returns the reparented node (if +node_or_tags+ is a Node), or NodeSet (if +node_or_tags+ is a DocumentFragment, NodeSet, or string).
  node.swap(node_or_tags) # like above, but returns self to support chaining
  ## Siblings
  node.next # alias of node.next_sibling # Returns the next sibling node
  node.next=(node_or_tags) # alias of node.add_next_sibling 
    # Inserts node_or_tags after this node (as a sibling).
    # Returns the reparented node (if node_or_tags is a Node)
    #   or NodeSet if node_or_tags is a DocumentFragment, NodeSet, or string
  node.after(node_or_tags) # just like the above method, but returns self to support chaining
  node.next_element # Returns the next Nokogiri::XML::Element type sibling node.
  node.previous # alias of  node.previous_sibling # Returns the previous sibling node
  node.previous=(node_or_tags) # alias of node.add_previous_sibling
    # Inserts node_or_tags before this node (as a sibling).
    # Returns the reparented node (if node_or_tags is a Node)
    #   or NodeSet if node_or_tags is a DocumentFragment, NodeSet, or string
  node.before(node_or_tags) # just like the above method, but returns self to support chaining
  node.previous_element # Returns the previous Nokogiri::XML::Element type sibling node.
  ## Parent
  node.parent
  node.parent=(node)
  ## Children
  node.child # returns a Node
  node.children # Get the list of children for this node as a NodeSet
  node.children=(node_or_tags)
    # Set the inner html for this Node to +node_or_tags+
    # Returns the reparented node (if +node_or_tags+ is a Node), or NodeSet (if +node_or_tags+ is a DocumentFragment, NodeSet, or string).
  node.elements # alias: node.element_children # Get the list of children for this node that are elements as a NodeSet.
  node.add_child(node_or_tags)
    # Add +node_or_tags+ as a child of this Node.
    # Returns the reparented node (if +node_or_tags+ is a Node), or NodeSet (if +node_or_tags+ is a DocumentFragment, NodeSet, or string).
  node << node_or_tags # like above, but returns self to support chaining, e.g. root << child1 << child2
  node.first_element_child # Returns the first child node of this node that is an element.
  node.last_element_child # Returns the last child node of this node that is an element.
  ## Content / Children
  node.content # aliases: node.text, node.inner_text, node.to_str
  node.content=(string) # Set the Node's content to a Text node containing +string+. The string gets XML escaped, not interpreted as markup.
  node.inner_html # (*args) children.map { |x| x.to_html(*args) }.join
  node.inner_html=(node_or_tags)
    # Set the inner html for this Node to +node_or_tags+
    # Returns self.
    # Also see related method +children=+
  
  
  
  
  
  ## Searching below (see working_with_a_nodeset)
  # see docs for namespace bindings, variable bindings, custom handler class (for custom xpath functions)
  node.search(*paths) # alias: node / path # paths can be XPath or CSS
  node.at(*paths) # alias node % path # Search for the first occurrence of path. Returns nil if nothing is found, otherwise a Node. (like search(path, ns).first)
  node.xpath(*paths) # search for XPath queries
  node.at_xpath(*paths) # like xpath(*paths).first
  node.css(*rules) # search for CSS rules
  node.at_css(*rules) # like css(*rules).first
  node > selector # Search this node's immediate children using CSS selector
  
  
  # Searching above
  node.ancestors # list of ancestor nodes
  node.ancestors(selector) # ancestors must match selector
  
    
  # Where am I?
  node.path # Returns the path associated with this Node
  node.css_path # Get the path to this node as a CSS expression
  node.matches?(selector) # does this node match selector?
  node.line # line number from input
  node.pointer_id # internal pointer number
  
  # Namespaces
  node.add_namespace(prefix, href) # alias of node.add_namespace_definition
    # Adds a namespace definition with prefix using href value. The result is as
    # if parsed XML for this node had included an attribute
    # 'xmlns:prefix=value'. A default namespace for this node (“xmlns=”) can be
    # added by passing 'nil' for prefix. Namespaces added this way will not show
    # up in #attributes, but they will be included as an xmlns attribute when
    # the node is serialized to XML.
  node.default_namespace=(url)
    # Adds a default namespace supplied as a string url href, to self. The
    # consequence is as an xmlns attribute with supplied argument were present
    # in parsed XML. A default namespace set with this method will not show up
    # in #attributes, but when this node is serialized to XML an “xmlns”
    # attribute will appear. See also #namespace and #namespace=
  node.namespace #   returns the default namespace set on this node (as with an “xmlns=” attribute), as a Namespace object.
  node.namespace=(ns)
    # Set the default namespace on this node (as would be defined with an
    # “xmlns=” attribute in XML source), as a Namespace object ns . Note that a
    # Namespace added this way will NOT be serialized as an xmlns attribute for
    # this node. You probably want #default_namespace= instead, or perhaps
    # #add_namespace_definition with a nil prefix argument.
  node.namespace_definitions
    # returns namespaces defined on self element directly, as an array of
    # Namespace objects. Includes both a default namespace (as in “xmlns=”), and
    # prefixed namespaces (as in “xmlns:prefix=”).
  node.namespace_scopes
    # returns namespaces in scope for self – those defined on self element
    # directly or any ancestor node – as an array of Namespace objects. Default
    # namespaces (“xmlns=” style) for self are included in this array; Default
    # namespaces for ancestors, however, are not. See also #namespaces
  node.namespaced_key?(attribute, namespace)
    # Returns true if attribute is set with namespace
  node.namespaces # Returns a Hash of {prefix => value} for all namespaces on this node and its ancestors.
    # This method returns the same namespaces as #namespace_scopes.
    # 
    # Returns namespaces in scope for self – those defined on self element
    # directly or any ancestor node – as a Hash of attribute-name/value pairs.
    # Note that the keys in this hash are the XML attributes that would be used to
    # define this namespace, such as “xmlns:prefix”, not just the prefix.
    # Default namespace set on self will be included with key “xmlns”. However,
    # default namespaces set on ancestor will NOT be, even if self has no
    # explicit default namespace.
  # see also attribute_with_ns


  # Rubyisms
  node <=> another_node # Compare two Node objects with respect to their Document. Nodes from different documents cannot be compared.
    # uses xmlXPathCmpNodes "Compare two nodes w.r.t document order"
  node == another_node # compares pointer_id
  node.clone # alias node.dup # Copy this node. An optional depth may be passed in, but it defaults to a deep copy. 0 is a shallow copy, 1 is a deep copy.

  # Visitor pattern
  node.accept(visitor) # calls visitor.visit(self)
  
  # Write it out (sorted from most flexible/hardest to use to least flexible/easiest to use)
  node.write_to(io, *options)
    # Write Node to +io+ with +options+. +options+ modify the output of
    # this method.  Valid options are:
    #
    # * +:encoding+ for changing the encoding
    # * +:indent_text+ the indentation text, defaults to one space
    # * +:indent+ the number of +:indent_text+ to use, defaults to 2
    # * +:save_with+ a combination of SaveOptions constants.
      # SaveOptions
        # AS_BUILDER: Save builder created document
        # AS_HTML: Save as HTML
        # AS_XHTML: Save as XHTML
        # AS_XML: Save as XML
        # DEFAULT_HTML: the default for HTML document
        # DEFAULT_XHTML: the default for XHTML document
        # DEFAULT_XML: the default for XML documents
        # FORMAT: Format serialized xml
        # NO_DECLARATION: Do not include declarations
        # NO_EMPTY_TAGS: Do not include empty tags
        # NO_XHTML: Do not save XHTML
    # e.g. node.write_to(io, :encoding => 'UTF-8', :indent => 2)
  node.write_html_to(io, options={}) # uses write_to with :save_with => DEFAULT_HTML option (libxml2.6 does dump_html)
  node.write_xhtml_to(io, options={}) # uses write_to with :save_with => DEFAULT_XHTML option (libxml2.6 does dump_html)
  node.write_xml_to(io, options={}) # uses write_to with :save_with => DEFAULT_XML option
  node.serialize # Serialize Node to a string using +options+, provided as a hash or block. Uses write_to (via StringIO)
    # node.serialize(:encoding => 'UTF-8', :save_with => FORMAT | AS_XML)
    # node.serialize(:encoding => 'UTF-8') do |config|
    #   config.format.as_xml
    # end
  node.to_html(options={}) # serializes with :save_with => DEFAULT_HTML option (libxml2.6 does dump_html)
  node.to_xhtml(options={}) # serializes with :save_with => DEFAULT_XHTML option (libxml2.6 does dump_html)
  node.to_xml(options={}) # serializes with :save_with => DEFAULT_XML option
  node.to_s # document.xml? ? to_xml : to_html
  # includes PP (prettyprint), which provides
  node.inspect
  node.pretty_print(pp) # essentially protected

  # Utility
  node.encode_special_chars(str) # Encode any special characters in str
  node.fragment(tags) # Create a DocumentFragment containing tags that is relative to this context node.
  node.parse(string_or_io, options={})
    # Parse +string_or_io+ as a document fragment within the context of
    # *this* node.  Returns a XML::NodeSet containing the nodes parsed from
    # +string_or_io+.
  
  # External subsets, like DTD declarations
  node.create_external_subset(name, external_id, system_id)
  node.create_internal_subset(name, external_id, system_id)
  node.external_subset
  node.internal_subset
  
  # Other:
  node.description # Fetch the Nokogiri::HTML::ElementDescription for this node. Returns nil on XML documents and on unknown tags.
    # e.g. if node is an <img> tag: Nokogiri::HTML::ElementDescription['img'] #=> #<Nokogiri::HTML::ElementDescription: img embedded image >
  node.decorate! # Decorate this node with the decorators set up in this node's Document. Used internally to provide Slop support and Hpricot compatibility via Nokogiri::Hpricot
  node.do_xinclude # options as a block or hash
    # Do xinclude substitution on the subtree below node. If given a block, a
    # Nokogiri::XML::ParseOptions object initialized from +options+, will be
    # passed to it, allowing more convenient modification of the parser options.

Working with a Nokogiri::XML::NodeSet

  nodes = Nokogiri::XML::NodeSet.new(document, list=[])
  
  # Set operations
  nodes | other_nodeset # UNION, i.e. merging the sets, returning a new set
  nodes + other_nodeset # UNION, i.e. merging the sets, returning a new set
  nodes & other_nodeset # INTERSECTION # i.e. return a new NodeSet with the common nodes only
  nodes - other_nodeset # DIFFERENCE Returns a new NodeSet containing the nodes in this NodeSet that aren't in other_nodeset
  nodes.include?(node)
  nodes.empty?
  nodes.length # alias nodes.size
  nodes.delete(node) # Delete node from the Nodeset, if it is a member. Returns the deleted node if found, otherwise returns nil.

  # List operations (includes Enumerable)
  nodes.each {|node| }
  nodes.first
  nodes.last
  nodes.reverse # Returns a new NodeSet containing all the nodes in the NodeSet in reverse order
  nodes.index(node) # returns the numeric index or nil
  nodes[3] # element at index 3
  nodes[3,4] # return a NodeSet of size 4, starting at index 3
  nodes[3..6] # or return a NodeSet using a range of indexes
  # alias nodes.slice
  nodes.pop # Removes the last element from set and returns it, or nil if the set is empty
  nodes.push(node) # alias nodes << node # Append node to the NodeSet.
  nodes.shift # Returns the first element of the NodeSet and removes it. Returns nil if the set is empty.
  nodes.filter(expr) # Filter this list for nodes that match expr. WHAT DOES THIS RETURN? NodeSet? Array?
    # find_all { |node| node.matches?(expr) }
  
  nodes.children # Returns a new NodeSet containing all the children of all the nodes in the NodeSet
  
  # Content
  nodes.inner_html(*args) # Get the inner html of all contained Node objects
  nodes.inner_text # alias nodes.text
  
  # Convenience modifiers
  nodes.remove # alias of nodes.unlink # Unlink this NodeSet and all Node objects it contains from their current context.
  nodes.wrap("<div class='container'></div>") # wrap new xml around EACH NODE in a Nodeset
  nodes.before(datum) # Insert datum before the first Node in this NodeSet # e.g. first.before(datum)
  nodes.after(datum) # Insert datum after the last Node in this NodeSet # e.g. last.after(datum)
  nodes.attr(key, value) # set the attribute key to value on all Node objects in the NodeSet
  nodes.attr(key) { |node| 'value' } # set the attribute key to the result of the block on all Node objects in the NodeSet
    # alias nodes.attribute, nodes.set
  nodes.remove_attr(name) # removes the attribute from all nodes in the nodeset
  nodes.add_class(name) # Append the class attribute name to all Node objects in the NodeSet.
  nodes.remove_class(name = nil) # if nil, removes the class attribute from all nodes in the nodeset
  
  # Searching
  nodes.search(*paths) # alias nodes / path
  nodes.at(*paths) # alias nodes % path
  nodes.xpath(*paths)
  nodes.at_xpath(*paths)
  nodes.css(*rules)
  nodes.at_css(*rules)
  nodes > selector # Search this NodeSet's nodes' immediate children using CSS selector selector
  
  # Writing out
  nodes.to_a # alias nodes.to_ary # Return this list as an Array
  nodes.to_html(*args)
  nodes.to_s
  nodes.to_xhtml(*args)
  nodes.to_xml(*args)
  
  # Rubyisms
  nodes == nodes # Two NodeSets are equal if they contain the same number of elements and if each element is equal to the corresponding element in the other NodeSet
  nodes.dup # Duplicate this node set
  nodes.inspect

Reader parsers

Reader parsers can be used to parse very large XML documents quickly without the need to load the entire document into memory or write a SAX document parser. The reader makes each node in the XML document available exactly once, only moving forward, like a cursor.

  reader = Nokogiri::XML::Reader(string_or_io)
    # attrs
    # .encoding
    # .errors
    # .source

  # Reading
  reader.each {|node|  } # node and reader are the same object. shortcut for while(node = self.read) yield(node); end;
  reader.read # Move the Reader forward through the XML document.

  node.name
  node.local_name

  # Attributes
  node.attribute('src')
  node.attribute_at(1)
  node.attribute_count
  node.attribute_nodes
  node.attributes
  node.attributes?

  # Content
  node.empty_element?
  node.self_closing?
  node.value # Get the text value of the node if present as a utf-8 encoded string. Does NOT advance the reader.
  node.value? # Does this node have a text value?
  node.inner_xml # Read the contents of the current node, including child nodes and markup into a utf-8 encoded string. Does NOT advance the reader
  node.outer_xml # Does NOT advance the reader

  node.base_uri # Get the xml:base of the node
  node.default? # Was an attribute generated from the default value in the DTD or schema?
  node.depth

  # Namespaces and the rest
  node.namespace_uri # Get the URI defining the namespace associated with the node
  node.namespaces # Get a hash of namespaces for this Node
  node.prefix # Get the shorthand reference to the namespace associated with the node.
  node.xml_version # Get the XML version of the document being read
  node.lang # Get the xml:lang scope within which the node resides.
  node.node_type
    # one of 
    # TYPE_ATTRIBUTE
    # TYPE_CDATA
    # TYPE_COMMENT
    # TYPE_DOCUMENT
    # TYPE_DOCUMENT_FRAGMENT
    # TYPE_DOCUMENT_TYPE
    # TYPE_ELEMENT
    # TYPE_END_ELEMENT
    # TYPE_END_ENTITY
    # TYPE_ENTITY
    # TYPE_ENTITY_REFERENCE
    # TYPE_NONE
    # TYPE_NOTATION
    # TYPE_PROCESSING_INSTRUCTION
    # TYPE_SIGNIFICANT_WHITESPACE
    # TYPE_TEXT
    # TYPE_WHITESPACE
    # TYPE_XML_DECLARATION
  node.state # Get the state of the reader

XSD Validation

XSD XSD::XMLParser XSD::XMLParser::Nokogiri

  xsd = Nokogiri::XML::Schema(string_or_io_to_schema_file)
  doc = Nokogiri::XML(File.read(PO_XML_FILE))
  
  xsd.valid?(doc) # => true/false
   
  xsd.validate(doc) # returns an array of SyntaxErrors
  xsd.validate(doc).each do |syntax_error|
    syntax_error.error?
    syntax_error.fatal?
    syntax_error.none?
    syntax_error.to_s
    syntax_error.warning?
    
    # undocumented read-only attributes
    syntax_error.code
    syntax_error.column
    syntax_error.domain
    syntax_error.file
    syntax_error.int1
    syntax_error.level
    syntax_error.line
    syntax_error.str1
    syntax_error.str2
    syntax_error.str3
  end
  
  
  # http://nokogiri.org/Nokogiri/XML/Schema.html
  # http://nokogiri.org/Nokogiri/XML/AttributeDecl.html
  # http://nokogiri.org/Nokogiri/XML/DTD.html
  # http://nokogiri.org/Nokogiri/XML/ElementDecl.html
  # http://nokogiri.org/Nokogiri/XML/ElementContent.html
  # http://nokogiri.org/Nokogiri/XML/EntityDecl.html
  # http://nokogiri.org/Nokogiri/XML/EntityReference.html
  
  doc.validate # validate it against its DTD, if it has one

CSS Parsing

Nokogiri::CSS Nokogiri::CSS::Node Nokogiri::CSS::Parser Nokogiri::CSS::SyntaxError Nokogiri::CSS::Tokenizer Nokogiri::CSS::Tokenizer::ScanError

  # http://nokogiri.org/Nokogiri/CSS.html
  Nokogiri::CSS.parse('selector') # => returns an AST
  Nokogiri::CSS.xpath_for('selector', options={})
  
  # http://nokogiri.org/Nokogiri/CSS/Node.html
    # attr: type, value
    #methods
    # accept(visitor)
    # find_by_type
    # new
    # preprocess!
    # to_a
    # to_type
    # to_xpath
  # http://nokogiri.org/Nokogiri/CSS/Parser.html # a Racc generated Parser

XSLT Transformation

Nokogiri::XSLT Nokogiri::XSLT::Stylesheet

  doc   = Nokogiri::XML(File.read('some_file.xml'))
  xslt  = Nokogiri::XSLT(File.read('some_transformer.xslt'))
  puts xslt.transform(doc) # [, xslt_parameters]
  #   xslt.serialize(doc) # to an XML string
  #   xslt.apply_to(doc, params=[]) # equivalent to xslt.serialize(xslt.transform(doc, params))