# frozen_string_literal: true

module Nokogiri
  module XML
    module SAX
      # :markup: markdown
      #
      # The SAX::Document class is used for registering types of events you are interested in
      # handling. All of the methods on this class are available as possible events while parsing an
      # \XML document. To register for any particular event, subclass this class and implement the
      # methods you are interested in knowing about.
      #
      # To only be notified about start and end element events, write a class like this:
      #
      #     class MyHandler < Nokogiri::XML::SAX::Document
      #       def start_element name, attrs = []
      #         puts "#{name} started!"
      #       end
      #
      #       def end_element name
      #         puts "#{name} ended"
      #       end
      #     end
      #
      # You can use this event handler for any SAX-style parser included with Nokogiri.
      #
      # See also:
      #
      # - Nokogiri::XML::SAX
      # - Nokogiri::HTML4::SAX
      #
      # ### Entity Handling
      #
      # ⚠ Entity handling is complicated in a SAX parser! Please read this section carefully if
      # you're not getting the behavior you expect.
      #
      # Entities will be reported to the user via callbacks to #characters, to #reference, or
      # possibly to both. The behavior is determined by a combination of _entity type_ and the value
      # of ParserContext#replace_entities. (Recall that the default value of
      # ParserContext#replace_entities is `false`.)
      #
      # ⚠ <b>It is UNSAFE to set ParserContext#replace_entities to `true`</b> when parsing untrusted
      # documents.
      #
      # 💡 For more information on entity types, see [Wikipedia's page on
      # DTDs](https://en.wikipedia.org/wiki/Document_type_definition#Entity_declarations).
      #
      # | Entity type                          | #characters                        | #reference                          |
      # |--------------------------------------|------------------------------------|-------------------------------------|
      # | Char ref (e.g., <tt>&#8217;</tt>)    | always                             | never                               |
      # | Predefined (e.g., <tt>&amp;</tt>)    | always                             | never                               |
      # | Undeclared †                         | never                              | <tt>#replace_entities == false</tt> |
      # | Internal                             | always                             | <tt>#replace_entities == false</tt> |
      # | External †                           | <tt>#replace_entities == true</tt> | <tt>#replace_entities == false</tt> |
      #
      # † In the case where the replacement text for the entity is unknown (e.g., an undeclared entity
      # or an external entity that could not be resolved because of network issues), then the
      # replacement text will not be reported. If ParserContext#replace_entities is `true`, this
      # means the #characters callback will not be invoked. If ParserContext#replace_entities is
      # `false`, then the #reference callback will be invoked, but with `nil` for the `content`
      # argument.
      #
      class Document
        ###
        # Called when an \XML declaration is parsed.
        #
        # [Parameters]
        # - +version+ (String) the version attribute
        # - +encoding+ (String, nil) the encoding of the document if present, else +nil+
        # - +standalone+ ("yes", "no", nil) the standalone attribute if present, else +nil+
        def xmldecl(version, encoding, standalone)
        end

        ###
        # Called when the parser starts parsing the document.
        def start_document
        end

        ###
        # Called when the parser finishes parsing the document.
        def end_document
        end

        ###
        # Called at the beginning of an element.
        #
        # [Parameters]
        # - +name+ (String) the name of the element
        # - +attrs+ (Array<Array<String>>) an assoc list of namespace declarations and attributes, e.g.:
        #     [ ["xmlns:foo", "http://sample.net"], ["size", "large"] ]
        #
        # 💡 If you're dealing with XML and need to handle namespaces, use the
        # #start_element_namespace method instead.
        #
        # Note that the element namespace and any attribute namespaces are not provided, and so any
        # namespaced elements or attributes will be returned as strings including the prefix:
        #
        #     parser.parse(<<~XML)
        #       <root xmlns:foo='http://foo.example.com/' xmlns='http://example.com/'>
        #         <foo:bar foo:quux="xxx">hello world</foo:bar>
        #       </root>
        #     XML
        #
        #     assert_pattern do
        #       parser.document.start_elements => [
        #         ["root", [["xmlns:foo", "http://foo.example.com/"], ["xmlns", "http://example.com/"]]],
        #         ["foo:bar", [["foo:quux", "xxx"]]],
        #       ]
        #     end
        #
        def start_element(name, attrs = [])
        end

        ###
        # Called at the end of an element.
        #
        # [Parameters]
        # - +name+ (String) the name of the element being closed
        #
        def end_element(name)
        end

        ###
        # Called at the beginning of an element, with namespace information.
        #
        # [Parameters]
        # - +name+ (String) is the local name of the element (without any namespace prefix)
        # - +attrs+ (Array<Attribute>) is an array of structs with the following properties:
        #   - +localname+ (String) the local name of the attribute
        #   - +value+ (String) the value of the attribute
        #   - +prefix+ (String, nil) the namespace prefix of the attribute
        #   - +uri+ (String, nil) the namespace URI of the attribute
        # - +prefix+ (String, nil) is the namespace prefix for the element
        # - +uri+ (String, nil) is the associated URI for the element's namespace
        # - +ns+ (Array<Array<String, String>>) is an assoc list of namespace declarations on the element
        #
        # 💡 If you're dealing with HTML or don't care about namespaces, try #start_element instead.
        #
        # [Example]
        #   it "start_elements_namespace is called with namespaced attributes" do
        #     parser.parse(<<~XML)
        #       <root xmlns:foo='http://foo.example.com/'>
        #         <foo:a foo:bar='hello' />
        #       </root>
        #     XML
        #
        #     assert_pattern do
        #       parser.document.start_elements_namespace => [
        #         [
        #           "root",
        #           [],
        #           nil, nil,
        #           [["foo", "http://foo.example.com/"]], # namespace declarations
        #         ], [
        #           "a",
        #           [Nokogiri::XML::SAX::Parser::Attribute(localname: "bar", prefix: "foo", uri: "http://foo.example.com/", value: "hello")], # prefixed attribute
        #           "foo", "http://foo.example.com/", # prefix and uri for the "a" element
        #           [],
        #         ]
        #       ]
        #     end
        #   end
        #
        def start_element_namespace(name, attrs = [], prefix = nil, uri = nil, ns = []) # rubocop:disable Metrics/ParameterLists
          # Deal with SAX v1 interface: rebuild the prefixed element name and a flat
          # attribute list (namespace declarations first), then delegate to #start_element.
          name = [prefix, name].compact.join(":")
          attributes = ns.map do |ns_prefix, ns_uri|
            [["xmlns", ns_prefix].compact.join(":"), ns_uri]
          end + attrs.map do |attr|
            [[attr.prefix, attr.localname].compact.join(":"), attr.value]
          end
          start_element(name, attributes)
        end

        ###
        # Called at the end of an element, with namespace information.
        #
        # [Parameters]
        # - +name+ (String) is the local name of the element (without any namespace prefix)
        # - +prefix+ (String, nil) is the namespace prefix for the element
        # - +uri+ (String, nil) is the associated URI for the element's namespace
        #
        def end_element_namespace(name, prefix = nil, uri = nil)
          # Deal with SAX v1 interface: rebuild the prefixed name and delegate to #end_element.
          end_element([prefix, name].compact.join(":"))
        end

        ###
        # Called when character data is parsed, and for parsed entities when
        # ParserContext#replace_entities is +true+.
        #
        # [Parameters]
        # - +string+ (String) contains the character data or entity replacement text
        #
        # ⚠ Please see Document@Entity+Handling for important information about how entities are handled.
        #
        # ⚠ This method might be called multiple times for a contiguous string of characters.
        #
        def characters(string)
        end

        ###
        # Called when a parsed entity is referenced and not replaced.
        #
        # [Parameters]
        # - +name+ (String) is the name of the entity
        # - +content+ (String, nil) is the replacement text for the entity, if known
        #
        # ⚠ Please see Document@Entity+Handling for important information about how entities are handled.
        #
        # ⚠ An internal entity may result in a call to both #characters and #reference.
        #
        # Since v1.17.0
        #
        def reference(name, content)
        end

        ###
        # Called when a comment is encountered.
        #
        # [Parameters]
        # - +string+ (String) contains the comment data
        def comment(string)
        end

        ###
        # Called when the parser reports a warning about the document.
        #
        # [Parameters]
        # - +string+ (String) contains the warning
        def warning(string)
        end

        ###
        # Called when the parser reports an error in the document.
        #
        # [Parameters]
        # - +string+ (String) contains the error
        def error(string)
        end

        ###
        # Called when a CDATA block is found.
        #
        # [Parameters]
        # - +string+ (String) contains the CDATA content
        def cdata_block(string)
        end

        ###
        # Called when a processing instruction is found.
        #
        # [Parameters]
        # - +name+ is the target of the instruction
        # - +content+ is the value of the instruction
        def processing_instruction(name, content)
        end
      end
    end
  end
end
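
The default `start_element_namespace` and `end_element_namespace` implementations above fall back to the SAX v1 callbacks by rejoining prefixes and local names. The following is a minimal standalone sketch of that joining logic, written so it runs without Nokogiri. `Attr` and `to_sax_v1` are hypothetical names introduced only for illustration and are not part of Nokogiri's API. It also assumes that a default namespace declaration arrives in `ns` with a `nil` prefix, which is what the `compact` call in the fallback suggests.

```ruby
# Hypothetical stand-in for the attribute struct handed to
# start_element_namespace; only the fields documented above are modeled.
Attr = Struct.new(:localname, :prefix, :uri, :value, keyword_init: true)

# Mirrors the fallback in Document#start_element_namespace and returns the
# [name, attributes] pair that would be passed on to #start_element.
def to_sax_v1(name, attrs = [], prefix = nil, ns = [])
  # "a" with prefix "foo" becomes "foo:a"; a nil prefix leaves the name as-is.
  qualified_name = [prefix, name].compact.join(":")

  # Namespace declarations become xmlns attributes and come first.
  # Assumption: a nil prefix denotes the default namespace ("xmlns").
  ns_attrs = ns.map do |ns_prefix, ns_uri|
    [["xmlns", ns_prefix].compact.join(":"), ns_uri]
  end

  # Ordinary attributes are flattened to [prefixed_name, value] pairs.
  plain_attrs = attrs.map do |attr|
    [[attr.prefix, attr.localname].compact.join(":"), attr.value]
  end

  [qualified_name, ns_attrs + plain_attrs]
end

root = to_sax_v1(
  "root", [], nil,
  [["foo", "http://foo.example.com/"], [nil, "http://example.com/"]]
)
child = to_sax_v1(
  "a",
  [Attr.new(localname: "bar", prefix: "foo", uri: "http://foo.example.com/", value: "hello")],
  "foo"
)
```

Here `root` holds the unprefixed element name followed by its two `xmlns` declarations, and `child` holds `"foo:a"` with a single `["foo:bar", "hello"]` pair. The namespace URIs of the element and its attributes are dropped, which is why a handler that needs them should override `start_element_namespace` instead of `start_element`.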