Cheat sheet
A digest of most of the methods documented at nokogiri.org. rubyforge.org hosts an alternate source for the API docs, which may be easier to read.
Topics not covered: RelaxNG validation, Slop, SAX parsing, or Builder. See also: http://cheat.errtheblog.com/s/nokogiri
Strings are always stored as UTF-8 internally. Methods that return text values will always return UTF-8 encoded strings. Methods that return XML (like to_xml, to_html and inner_html) will return a string encoded like the source document.
More Resources
- sax-machine
- feedzirra
- elementor
- mechanize
- markup_validity
- XPath cheat sheet
- CSS selector cheat sheet
- XPath Reference
Nokogiri::HTML::Document Nokogiri::HTML::DocumentFragment Nokogiri::XML::Document Nokogiri::XML::DocumentFragment
doc = Nokogiri(string_or_io) # Nokogiri will try to guess what type of document you are attempting to parse
doc = Nokogiri::HTML(string_or_io) # [, url, encoding, options, &block]
doc = Nokogiri::XML(string_or_io) # [, url, encoding, options, &block]
# set options with block {|config| config.noblanks.noent.noerror.strict }
# OR with a bitmask {|config| config.options = Nokogiri::XML::NOBLANKS | Nokogiri::XML::NOENT}
# http://nokogiri.org/Nokogiri/XML/ParseOptions.html
# doc = Nokogiri.parse(...)
# doc = Nokogiri::XML.parse(...) #shortcut to Nokogiri::XML::Document.parse
# doc = Nokogiri::HTML.parse(...) #shortcut to Nokogiri::HTML::Document.parse
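fragment = Nokogiri::XML.fragment(string) # returns a Nokogiri::XML::DocumentFragment (use .children for a NodeSet)
fragment = Nokogiri::HTML.fragment(string, encoding = nil) # returns a Nokogiri::HTML::DocumentFragment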
nc = Nokogiri::HTML::NamedCharacters # a Nokogiri::HTML::EntityLookup
nc[key] # like nc.get(key).try(:value) # e.g. nc['gt'] (62) or nc['rsquo'] (8217)
nc.get(key) # returns a Nokogiri::HTML::EntityDescription
# e.g. nc.get('rsquo') #=> #<struct Nokogiri::HTML::EntityDescription value=8217, name="rsquo", description="right single quotation mark, U+2019 ISOnum">
# Adding a Processing Instruction (like <?xml-stylesheet?>)
# Nokogiri::XML::ProcessingInstruction http://nokogiri.org/tutorials/modifying_an_html_xml_document.html
pi = Nokogiri::XML::ProcessingInstruction.new(doc, "xml-stylesheet",'type="text/xsl" href="foo.xsl"')
doc.root.add_previous_sibling(pi)
# document namespaces
doc.collect_namespaces
doc.remove_namespaces!
doc.namespaces
# shortcuts for creating new nodes
doc.create_cdata(string, &block)
doc.create_comment(string, &block)
doc.create_element(name, *args, &block) # Create an element with name, and optionally setting the content and attributes.
doc.create_entity
doc.create_text_node(string, &block)
doc.root
doc.root=node
# A document is a Node, so see working_with_a_node
Working with a Nokogiri::XML::Node
node = Nokogiri::XML::Node.new('name', document) # initialize a new node
# or shorter:
node = document.create_element('name')
node.document
node.name # alias of node.node_name
node.name= # alias of node.node_name=
node.read_only?
node.blank?
# Type of Node
node.type # alias of node.node_type
node.cdata? # type == CDATA_SECTION_NODE
node.comment? # type == COMMENT_NODE
node.element? # type == ELEMENT_NODE alias node.elem?
node.fragment? # type == DOCUMENT_FRAG_NODE (Document fragment node)
node.html? # type == HTML_DOCUMENT_NODE
node.text? # type == TEXT_NODE
node.xml? # type == DOCUMENT_NODE (Document node type)
# other types not covered by a convenience method
# ATTRIBUTE_DECL: Attribute declaration type
# ATTRIBUTE_NODE: Attribute node type
# DOCB_DOCUMENT_NODE: DOCB document node type
# DOCUMENT_TYPE_NODE: Document type node type
# DTD_NODE: DTD node type
# ELEMENT_DECL: Element declaration type
# ENTITY_DECL: Entity declaration type
# ENTITY_NODE: Entity node type
# ENTITY_REF_NODE: Entity reference node type
# NAMESPACE_DECL: Namespace declaration type
# NOTATION_NODE: Notation node type
# PI_NODE: PI node type
# XINCLUDE_END: XInclude end type
# XINCLUDE_START: XInclude start type
# Attributes, like a hash
node['src'] # aliases: node.get_attribute, node.attr. similar to
node['src'] = 'value' # alias node.set_attribute
node.key?('src') # alias node.has_attribute?
node.keys # array of strings
node.values
node.delete('src') # alias of node.remove_attribute
node.each { |attr_name, attr_value| }
# include Enumerable, which works on these attribute names and values
# Attribute Nodes
node.attribute('src') # Get the attribute node with name src
# Nokogiri::XML::Attr < Nokogiri::XML::Node
# can use .content= or .value= to modify
node.attribute_nodes # returns a list containing the Node attributes.
node.attribute_with_ns('src', 'namespace') # Get the attribute node with name and namespace
node.attributes # Returns a hash containing the node's attributes. The key is the attribute name without any namespace, the value is a Nokogiri::XML::Attr representing the attribute. If you need to distinguish attributes with the same name, with different namespaces use #attribute_nodes instead.
# Traversing / Modifying
# +node_or_tags+ can be a Nokogiri::XML::Node, a ::DocumentFragment, a ::NodeSet, or a string containing markup.
## Self
node.traverse {|node| } # yields self and all children to block _recursively_
node.remove # alias of node.unlink # Unlink this node from its current context.
node.replace(node_or_tags)
# Replace this Node with +node_or_tags+.
# Returns the reparented node (if +node_or_tags+ is a Node), or NodeSet (if +node_or_tags+ is a DocumentFragment, NodeSet, or string).
node.swap(node_or_tags) # like above, but returns self to support chaining
## Siblings
node.next # alias of node.next_sibling # Returns the next sibling node
node.next=(node_or_tags) # alias of node.add_next_sibling
# Inserts node_or_tags after this node (as a sibling).
# Returns the reparented node (if node_or_tags is a Node)
# or NodeSet if node_or_tags is a DocumentFragment, NodeSet, or string
node.after(node_or_tags) # just like the above method, but returns self to support chaining
node.next_element # Returns the next Nokogiri::XML::Element type sibling node.
node.previous # alias of node.previous_sibling # Returns the previous sibling node
node.previous=(node_or_tags) # alias of node.add_previous_sibling
# Inserts node_or_tags before this node (as a sibling).
# Returns the reparented node (if node_or_tags is a Node)
# or NodeSet if node_or_tags is a DocumentFragment, NodeSet, or string
node.before(node_or_tags) # just like the above method, but returns self to support chaining
node.previous_element # Returns the previous Nokogiri::XML::Element type sibling node.
## Parent
node.parent
node.parent=(node)
## Children
node.child # returns a Node
node.children # Get the list of children for this node as a NodeSet
node.children=(node_or_tags)
# Set the inner html for this Node to +node_or_tags+
# Returns the reparented node (if +node_or_tags+ is a Node), or NodeSet (if +node_or_tags+ is a DocumentFragment, NodeSet, or string).
node.elements # alias: node.element_children # Get the list of children for this node that are elements as a NodeSet.
node.add_child(node_or_tags)
# Add +node_or_tags+ as a child of this Node.
# Returns the reparented node (if +node_or_tags+ is a Node), or NodeSet (if +node_or_tags+ is a DocumentFragment, NodeSet, or string).
node << node_or_tags # like above, but returns self to support chaining, e.g. root << child1 << child2
node.first_element_child # Returns the first child node of this node that is an element.
node.last_element_child # Returns the last child node of this node that is an element.
## Content / Children
node.content # aliases: node.text, node.inner_text, node.to_str
node.content=(string) # Set the Node's content to a Text node containing +string+. The string gets XML escaped, not interpreted as markup.
node.inner_html # (*args) children.map { |x| x.to_html(*args) }.join
node.inner_html=(node_or_tags)
# Set the inner html for this Node to +node_or_tags+
# Returns self.
# Also see related method +children=+
## Searching below (see working_with_a_nodeset)
# see docs for namespace bindings, variable bindings, custom handler class (for custom xpath functions)
node.search(*paths) # alias: node / path # paths can be XPath or CSS
node.at(*paths) # alias node % path # Search for the first occurrence of path. Returns nil if nothing is found, otherwise a Node. (like search(path, ns).first)
node.xpath(*paths) # search for XPath queries
node.at_xpath(*paths) # like xpath(*paths).first
node.css(*rules) # search for CSS rules
node.at_css(*rules) # like css(*rules).first
node > selector # Search this node's immediate children using CSS selector
# Searching above
node.ancestors # list of ancestor nodes
node.ancestors(selector) # ancestors must match selector
# Where am I?
node.path # Returns the path associated with this Node
node.css_path # Get the path to this node as a CSS expression
node.matches?(selector) # does this node match selector?
node.line # line number from input
node.pointer_id # internal pointer number
# Namespaces
node.add_namespace(prefix, href) # alias of node.add_namespace_definition
# Adds a namespace definition with prefix using href value. The result is as
# if parsed XML for this node had included an attribute
# ‘xmlns:prefix=value’. A default namespace for this node (“xmlns=”) can be
# added by passing ‘nil’ for prefix. Namespaces added this way will not show
# up in #attributes, but they will be included as an xmlns attribute when
# the node is serialized to XML.
node.default_namespace=(url)
# Adds a default namespace, supplied as a string url href, to self. The
# consequence is as if an xmlns attribute with the supplied argument were
# present in parsed XML. A default namespace set with this method will not
# show up in #attributes, but when this node is serialized to XML an “xmlns”
# attribute will appear. See also #namespace and #namespace=
node.namespace # returns the default namespace set on this node (as with an “xmlns=” attribute), as a Namespace object.
node.namespace=(ns)
# Set the default namespace on this node (as would be defined with an
# “xmlns=” attribute in XML source), as a Namespace object ns . Note that a
# Namespace added this way will NOT be serialized as an xmlns attribute for
# this node. You probably want #default_namespace= instead, or perhaps
# #add_namespace_definition with a nil prefix argument.
node.namespace_definitions
# returns namespaces defined on self element directly, as an array of
# Namespace objects. Includes both a default namespace (as in “xmlns=”), and
# prefixed namespaces (as in “xmlns:prefix=”).
node.namespace_scopes
# returns namespaces in scope for self – those defined on self element
# directly or any ancestor node – as an array of Namespace objects. Default
# namespaces (“xmlns=” style) for self are included in this array; Default
# namespaces for ancestors, however, are not. See also #namespaces
node.namespaced_key?(attribute, namespace)
# Returns true if attribute is set with namespace
node.namespaces # Returns a Hash of {prefix => value} for all namespaces on this node and its ancestors.
# This method returns the same namespaces as #namespace_scopes.
#
# Returns namespaces in scope for self – those defined on self element
# directly or any ancestor node – as a Hash of attribute-name/value pairs.
# Note that the keys in this hash are the XML attributes that would be used
# to define this namespace, such as “xmlns:prefix”, not just the prefix.
# Default namespace set on self will be included with key “xmlns”. However,
# default namespaces set on ancestor will NOT be, even if self has no
# explicit default namespace.
# see also attribute_with_ns
# Rubyisms
node <=> another_node # Compare two Node objects with respect to their Document. Nodes from different documents cannot be compared.
# uses xmlXPathCmpNodes "Compare two nodes w.r.t document order"
node == another_node # compares pointer_id
node.clone # alias node.dup # Copy this node. An optional depth may be passed in, but it defaults to a deep copy. 0 is a shallow copy, 1 is a deep copy.
# Visitor pattern
node.accept(visitor)# calls visitor.visit(self)
# Write it out (sorted from most flexible/hardest to use to least flexible/easiest to use)
node.write_to(io, *options)
# Write Node to +io+ with +options+. +options+ modify the output of
# this method. Valid options are:
#
# * +:encoding+ for changing the encoding
# * +:indent_text+ the indentation text, defaults to one space
# * +:indent+ the number of +:indent_text+ to use, defaults to 2
# * +:save_with+ a combination of SaveOptions constants.
# SaveOptions
# AS_BUILDER: Save builder created document
# AS_HTML: Save as HTML
# AS_XHTML: Save as XHTML
# AS_XML: Save as XML
# DEFAULT_HTML: the default for HTML document
# DEFAULT_XHTML: the default for XHTML document
# DEFAULT_XML: the default for XML documents
# FORMAT: Format serialized xml
# NO_DECLARATION: Do not include declarations
# NO_EMPTY_TAGS: Do not include empty tags
# NO_XHTML: Do not save XHTML
# e.g. node.write_to(io, :encoding => 'UTF-8', :indent => 2)
node.write_html_to(io, options={}) # uses write_to with :save_with => DEFAULT_HTML option (libxml2.6 does dump_html)
node.write_xhtml_to(io, options={}) # uses write_to with :save_with => DEFAULT_XHTML option (libxml2.6 does dump_html)
node.write_xml_to(io, options={}) # uses write_to with :save_with => DEFAULT_XML option
node.serialize # Serialize Node to a string using +options+, provided as a hash or block. Uses write_to (via StringIO)
# node.serialize(:encoding => 'UTF-8', :save_with => FORMAT | AS_XML)
# node.serialize(:encoding => 'UTF-8') do |config|
# config.format.as_xml
# end
node.to_html(options={}) # serializes with :save_with => DEFAULT_HTML option (libxml2.6 does dump_html)
node.to_xhtml(options={}) # serializes with :save_with => DEFAULT_XHTML option (libxml2.6 does dump_html)
node.to_xml(options={}) # serializes with :save_with => DEFAULT_XML option
node.to_s # document.xml? ? to_xml : to_html
# includes PP (prettyprint), which provides
node.inspect
node.pretty_print(pp) # essentially protected
# Utility
node.encode_special_chars(str) # Encode any special characters in str
node.fragment(tags) # Create a DocumentFragment containing tags that is relative to this context node.
node.parse(string_or_io, options={})
# Parse +string_or_io+ as a document fragment within the context of
# *this* node. Returns an XML::NodeSet containing the nodes parsed from
# +string_or_io+.
# External subsets, like DTD declarations
node.create_external_subset(name, external_id, system_id)
node.create_internal_subset(name, external_id, system_id)
node.external_subset
node.internal_subset
# Other:
node.description # Fetch the Nokogiri::HTML::ElementDescription for this node. Returns nil on XML documents and on unknown tags.
# e.g. if node is an <img> tag: Nokogiri::HTML::ElementDescription['img'] #=> #<Nokogiri::HTML::ElementDescription: img embedded image >
node.decorate! # Decorate this node with the decorators set up in this node's Document. Used internally to provide Slop support and Hpricot compatibility via Nokogiri::Hpricot
node.do_xinclude # options as a block or hash
# Do xinclude substitution on the subtree below node. If given a block, a
# Nokogiri::XML::ParseOptions object initialized from +options+, will be
# passed to it, allowing more convenient modification of the parser options.
Working with a Nokogiri::XML::NodeSet
nodes = Nokogiri::XML::NodeSet.new(document, list=[])
# Set operations
nodes | other_nodeset # UNION, i.e. merging the sets, returning a new set
nodes + other_nodeset # UNION, i.e. merging the sets, returning a new set
nodes & other_nodeset # INTERSECTION # i.e. return a new NodeSet with the common nodes only
nodes - other_nodeset # DIFFERENCE Returns a new NodeSet containing the nodes in this NodeSet that aren't in other_nodeset
nodes.include?(node)
nodes.empty?
nodes.length # alias nodes.size
nodes.delete(node) # Delete node from the Nodeset, if it is a member. Returns the deleted node if found, otherwise returns nil.
# List operations (includes Enumerable)
nodes.each {|node| }
nodes.first
nodes.last
nodes.reverse # Returns a new NodeSet containing all the nodes in the NodeSet in reverse order
nodes.index(node) # returns the numeric index or nil
nodes[3] # element at index 3
nodes[3,4] # return a NodeSet of size 4, starting at index 3
nodes[3..6] # or return a NodeSet using a range of indexes
# alias nodes.slice
nodes.pop # Removes the last element from set and returns it, or nil if the set is empty
nodes.push(node) # alias nodes << node # Append node to the NodeSet.
nodes.shift # Returns the first element of the NodeSet and removes it. Returns nil if the set is empty.
nodes.filter(expr) # Filter this list for nodes that match expr. Returns an Array (the result of Enumerable#find_all), not a NodeSet.
# find_all { |node| node.matches?(expr) }
nodes.children # Returns a new NodeSet containing all the children of all the nodes in the NodeSet
# Content
nodes.inner_html(*args) # Get the inner html of all contained Node objects
nodes.inner_text # alias nodes.text
# Convenience modifiers
nodes.remove # alias of nodes.unlink # Unlink this NodeSet and all Node objects it contains from their current context.
nodes.wrap("<div class='container'></div>") # wrap new xml around EACH NODE in a Nodeset
nodes.before(datum) # Insert datum before the first Node in this NodeSet # e.g. first.before(datum)
nodes.after(datum) # Insert datum after the last Node in this NodeSet # e.g. last.after(datum)
nodes.attr(key, value) # set the attribute key to value on all Node objects in the NodeSet
nodes.attr(key) { |node| 'value' } # set the attribute key to the result of the block on all Node objects in the NodeSet
# alias nodes.attribute, nodes.set
nodes.remove_attr(name) # removes the attribute from all nodes in the nodeset
nodes.add_class(name) # Append the class attribute name to all Node objects in the NodeSet.
nodes.remove_class(name = nil) # if nil, removes the class attribute from all nodes in the nodeset
# Searching
nodes.search(*paths) # alias nodes / path
nodes.at(*paths) # alias nodes % path
nodes.xpath(*paths)
nodes.at_xpath(*paths)
nodes.css(*rules)
nodes.at_css(*rules)
nodes > selector # Search this NodeSet's nodes' immediate children using CSS selector selector
# Writing out
nodes.to_a # alias nodes.to_ary # Return this list as an Array
nodes.to_html(*args)
nodes.to_s
nodes.to_xhtml(*args)
nodes.to_xml(*args)
# Rubyisms
nodes == nodes # Two NodeSets are equal if they contain the same number of elements and if each element is equal to the corresponding element in the other NodeSet
nodes.dup # Duplicate this node set
nodes.inspect
Reader parsers
Reader parsers can be used to parse very large XML documents quickly without the need to load the entire document into memory or write a SAX document parser. The reader makes each node in the XML document available exactly once, only moving forward, like a cursor.
reader = Nokogiri::XML::Reader(string_or_io)
# attrs
# .encoding
# .errors
# .source
# Reading
reader.each {|node| } # node and reader are the same object. shortcut for while(node = self.read) yield(node); end;
reader.read # Move the Reader forward through the XML document.
node.name
node.local_name
# Attributes
node.attribute('src')
node.attribute_at(1)
node.attribute_count
node.attribute_nodes
node.attributes
node.attributes?
# Content
node.empty_element?
node.self_closing?
node.value # Get the text value of the node if present. Returns a utf-8 encoded string.
node.value? # Does this node have a text value?
node.inner_xml
node.outer_xml
node.base_uri # Get the xml:base of the node
node.default? # Was an attribute generated from the default value in the DTD or schema?
node.depth
# Namespaces and the rest
node.namespace_uri # Get the URI defining the namespace associated with the node
node.namespaces # Get a hash of namespaces for this Node
node.prefix # Get the shorthand reference to the namespace associated with the node.
node.xml_version # Get the XML version of the document being read
node.lang # Get the xml:lang scope within which the node resides.
node.node_type
# one of
# TYPE_ATTRIBUTE
# TYPE_CDATA
# TYPE_COMMENT
# TYPE_DOCUMENT
# TYPE_DOCUMENT_FRAGMENT
# TYPE_DOCUMENT_TYPE
# TYPE_ELEMENT
# TYPE_END_ELEMENT
# TYPE_END_ENTITY
# TYPE_ENTITY
# TYPE_ENTITY_REFERENCE
# TYPE_NONE
# TYPE_NOTATION
# TYPE_PROCESSING_INSTRUCTION
# TYPE_SIGNIFICANT_WHITESPACE
# TYPE_TEXT
# TYPE_WHITESPACE
# TYPE_XML_DECLARATION
node.state # Get the state of the reader
XSD XSD::XMLParser XSD::XMLParser::Nokogiri
xsd = Nokogiri::XML::Schema(string_or_io_to_schema_file)
doc = Nokogiri::XML(File.read(PO_XML_FILE))
xsd.valid?(doc) # => true/false
xsd.validate(doc) # returns an array of SyntaxError objects
xsd.validate(doc).each do |syntax_error|
syntax_error.error?
syntax_error.fatal?
syntax_error.none?
syntax_error.to_s
syntax_error.warning?
# undocumented attributes (all read-only)
syntax_error.code
syntax_error.column
syntax_error.domain
syntax_error.file
syntax_error.int1
syntax_error.level
syntax_error.line
syntax_error.str1
syntax_error.str2
syntax_error.str3
end
# http://nokogiri.org/Nokogiri/XML/Schema.html
# http://nokogiri.org/Nokogiri/XML/AttributeDecl.html
# http://nokogiri.org/Nokogiri/XML/DTD.html
# http://nokogiri.org/Nokogiri/XML/ElementDecl.html
# http://nokogiri.org/Nokogiri/XML/ElementContent.html
# http://nokogiri.org/Nokogiri/XML/EntityDecl.html
# http://nokogiri.org/Nokogiri/XML/EntityReference.html
doc.validate # validate it against its DTD, if it has one
Nokogiri::CSS Nokogiri::CSS::Node Nokogiri::CSS::Parser Nokogiri::CSS::SyntaxError Nokogiri::CSS::Tokenizer Nokogiri::CSS::Tokenizer::ScanError
# http://nokogiri.org/Nokogiri/CSS.html
Nokogiri::CSS.parse('selector') # => returns an AST
Nokogiri::CSS.xpath_for('selector', options={})
# http://nokogiri.org/Nokogiri/CSS/Node.html
# attr: type, value
#methods
# accept(visitor)
# find_by_type
# new
# preprocess!
# to_a
# to_type
# to_xpath
# http://nokogiri.org/Nokogiri/CSS/Parser.html # a Racc generated Parser
Nokogiri::XSLT Nokogiri::XSLT::Stylesheet
doc = Nokogiri::XML(File.read('some_file.xml'))
xslt = Nokogiri::XSLT(File.read('some_transformer.xslt'))
puts xslt.transform(doc) # [, xslt_parameters]
# xslt.serialize(doc) # to an XML string
# xslt.apply_to(doc, params=[]) # equivalent to xslt.serialize(xslt.transform(doc, params))