getting ready for 0.7 #10

Merged
merged 4 commits into from
Sep 13, 2018
3 changes: 0 additions & 3 deletions .travis.yml
@@ -3,14 +3,11 @@ language: julia
os:
- linux
julia:
- 0.6
- 0.7
- 1.0
- nightly
matrix:
allow_failures:
- julia: 0.7
- juila: 1
- julia: nightly
notifications:
email: false
16 changes: 9 additions & 7 deletions README.md
@@ -10,7 +10,7 @@ This package depends on the [Gumbo.jl](https://github.com/porterjamesj/Gumbo.jl)

### Usage

Usage is simple. Use `Gumbo` to parse an HTML string into a document, create a `Selector` from a string, and then use `matchall` to get the nodes in the document that match the selector. Alternatively, use `sel"<selector string>"` to do the same thing as `Selector`. The `matchall` function returns an array of elements which match the selector. If no match is found, a zero element array is returned. For unique matches, the array contains one element. Thus, check the length of the array to test whether a selector matches.
Usage is simple. Use `Gumbo` to parse an HTML string into a document, create a `Selector` from a string, and then use `eachmatch` to get the nodes in the document that match the selector. Alternatively, use `sel"<selector string>"` to do the same thing as `Selector`. The `eachmatch` function returns an array of elements which match the selector. If no match is found, a zero element array is returned. For unique matches, the array contains one element. Thus, check the length of the array to test whether a selector matches.

```julia
using Cascadia
@@ -19,15 +19,17 @@ using Gumbo
n=parsehtml("<p id=\"foo\"><p id=\"bar\">")
s=Selector("#foo")
sm = sel"#foo"
matchall(s, n.root)
eachmatch(s, n.root)
# 1-element Array{Gumbo.HTMLNode,1}:
# Gumbo.HTMLElement{:p}

matchall(sm, n.root)
eachmatch(sm, n.root)
# 1-element Array{Gumbo.HTMLNode,1}:
# Gumbo.HTMLElement{:p}
```

__Note:__ The top level matching function name has changed from `matchall` in `v0.6` to `eachmatch` in `v0.7` and higher to reflect the change in Julia base.
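
To illustrate the change in Julia base that the note refers to, here is a minimal sketch using plain regular expressions rather than Cascadia selectors. The sample string and the variable names are made up for illustration.

```julia
# In Julia 0.6 this was written `matchall(r"\d+", text)`.
# From 0.7 on, `eachmatch` yields RegexMatch objects lazily,
# so the matched substrings are collected explicitly.
text = "a1b22c333"                      # illustrative input
found = [m.match for m in eachmatch(r"\d+", text)]
println(found)
```

Cascadia's own `eachmatch(::Selector, ::HTMLNode)` method, by contrast, returns an array directly, as described above.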

### Webscraping Example

The primary use case for this library is to enable webscraping -- the automatic extraction of information from HTML pages. As an example, consider the following code, which returns a list of questions that have been tagged with `julia-lang` on StackOverflow.
@@ -40,14 +42,14 @@ using Requests
r = get("http://stackoverflow.com/questions/tagged/julia-lang")
h = parsehtml(convert(String, r.data))

qs = matchall(Selector(".question-summary"),h.root)
qs = eachmatch(Selector(".question-summary"),h.root)

println("StackOverflow Julia Questions (votes answered? url)")

for q in qs
votes = nodeText(matchall(Selector(".votes .vote-count-post "), q)[1])
answered = length(matchall(Selector(".status.answered"), q)) > 0
href = matchall(Selector(".question-hyperlink"), q)[1].attributes["href"]
votes = nodeText(eachmatch(Selector(".votes .vote-count-post "), q)[1])
answered = length(eachmatch(Selector(".status.answered"), q)) > 0
href = eachmatch(Selector(".question-hyperlink"), q)[1].attributes["href"]
println("$votes $answered http://stackoverflow.com$href")
end
```
2 changes: 1 addition & 1 deletion REQUIRE
@@ -1,3 +1,3 @@
julia 0.6
julia 0.7
Gumbo
AbstractTrees
40 changes: 24 additions & 16 deletions appveyor.yml
@@ -1,9 +1,18 @@
environment:
matrix:
- JULIA_URL: "https://julialang-s3.julialang.org/bin/winnt/x86/0.4/julia-0.4-latest-win32.exe"
- JULIA_URL: "https://julialang-s3.julialang.org/bin/winnt/x64/0.4/julia-0.4-latest-win64.exe"
- JULIA_URL: "https://julialangnightlies-s3.julialang.org/bin/winnt/x86/julia-latest-win32.exe"
- JULIA_URL: "https://julialangnightlies-s3.julialang.org/bin/winnt/x64/julia-latest-win64.exe"
- julia_version: 0.7
- julia_version: 1
- julia_version: nightly

platform:
- x86 # 32-bit
- x64 # 64-bit

# # Uncomment the following lines to allow failures on nightly julia
# # (tests will run but not make your overall status red)
# matrix:
allow_failures:
- julia_version: nightly

branches:
only:
@@ -17,19 +26,18 @@ notifications:
on_build_status_changed: false

install:
- ps: "[System.Net.ServicePointManager]::SecurityProtocol = [System.Net.SecurityProtocolType]::Tls12"
# Download most recent Julia Windows binary
- ps: (new-object net.webclient).DownloadFile(
$env:JULIA_URL,
"C:\projects\julia-binary.exe")
# Run installer silently, output to C:\projects\julia
- C:\projects\julia-binary.exe /S /D=C:\projects\julia
- ps: iex ((new-object net.webclient).DownloadString("https://raw.githubusercontent.com/JuliaCI/Appveyor.jl/version-1/bin/install.ps1"))

build_script:
# Need to convert from shallow to complete for Pkg.clone to work
- IF EXIST .git\shallow (git fetch --unshallow)
- C:\projects\julia\bin\julia -e "versioninfo();
Pkg.clone(pwd(), \"Cascadia\"); Pkg.build(\"Cascadia\")"
- echo "%JL_BUILD_SCRIPT%"
- C:\julia\bin\julia -e "%JL_BUILD_SCRIPT%"

test_script:
- C:\projects\julia\bin\julia --check-bounds=yes -e "Pkg.test(\"Cascadia\")"
- echo "%JL_TEST_SCRIPT%"
- C:\julia\bin\julia -e "%JL_TEST_SCRIPT%"

# # Uncomment to support code coverage upload. Should only be enabled for packages
# # which would have coverage gaps without running on Windows
# on_success:
# - echo "%JL_CODECOV_SCRIPT%"
# - C:\julia\bin\julia -e "%JL_CODECOV_SCRIPT%"
4 changes: 2 additions & 2 deletions src/parser.jl
@@ -24,7 +24,7 @@ function parseEscape(p::Parser)
while i < p.i+6 && i <= length(p.s) && hexDigit(p.s[i])
i += 1
end
v = parse(UInt, p.s[start:i], 16)
v = parse(UInt, p.s[start:i], base=16)
if length(p.s) >= i
if p.s[i] == '\r'
i += 1
@@ -211,7 +211,7 @@ function skipWhitespace(p::Parser) #->boolean
continue
elseif p.s[i] == '/'
if startswith(p.s[p.i:end], "/*")
ends,endl = search(p.s, "*/", i+length("/*"))
ends,endl = something(findnext("*/", p.s, i+length("/*")), 0:-1)
if endl != -1
i = endl+1
continue
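
The two `src/parser.jl` changes follow the Julia 0.7 API: `parse` takes the radix as a `base` keyword, and `findnext` replaces `search`, returning `nothing` rather than `0:-1` when there is no match. A minimal sketch of both follows, assuming a hypothetical helper `comment_end` and sample strings that are not part of this PR:

```julia
# Radix is passed as a keyword argument from Julia 0.7 on.
v = parse(UInt, "1F", base=16)

# Hypothetical helper (illustration only): index of the last character
# of the next "*/" at or after `from`, or `nothing` if the comment is
# never closed.
function comment_end(s::AbstractString, from::Integer)
    r = findnext("*/", s, from)      # a range on success, `nothing` on failure
    return r === nothing ? nothing : last(r)
end

closed = comment_end("a /* note */ b", 5)
unclosed = comment_end("a /* open", 5)
println((v, closed, unclosed))
```

Checking for `nothing` explicitly is one way to handle an unterminated comment; the PR itself uses `something(..., 0:-1)` to keep the old sentinel.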
28 changes: 17 additions & 11 deletions src/selector.jl
@@ -17,7 +17,7 @@ firstChild(n::HTMLDocument) = n.root
function nextSibling(n::HTMLNode)
p=n.parent
l=length(p.children)
i=find(x->x===n, p.children)
i=findall(x->x===n, p.children)
if isempty(i) || i[1]==l
return nothing
else
@@ -28,7 +28,7 @@ end
function prevSibling(n::HTMLNode)
p=n.parent
l=length(p.children)
i=find(x->x===n, p.children)
i=findall(x->x===n, p.children)
if isempty(i) || i[1]==1
return nothing
else
@@ -85,12 +85,18 @@ end
# // MustCompile is like Compile, but panics instead of returning an error.


#// MatchAll returns a slice of the nodes that match the selector,
#// eachmatch returns a slice of the nodes that match the selector,
#// from n and its children.
function Base.matchall(s::Selector, n::HTMLNode ) #->HTMLNode[]
function Base.eachmatch(s::Selector, n::HTMLNode ) #->HTMLNode[]
return matchAllInto(s, n, HTMLNode[])
end

if VERSION >= v"0.7-" && VERSION < v"1.0-"
function Base.matchall(s::Selector, n::HTMLNode )
Base.depwarn("use eachmatch instead of matchall", :matchall)
eachmatch(s, n)
end
end


function matchAllInto(s::Selector, n::HTMLNode, storage::Array)
@@ -228,7 +234,7 @@ end
#// the attribute named key contains val.
function attributeSubstringSelector(key::AbstractString, val::AbstractString) #-> Selector
return attributeSelector(key) do s
return contains(s, val)
return occursin(val, s)
end
end

@@ -238,7 +244,7 @@ end
#// the attribute named key matches the regular expression rx
function attributeRegexSelector(key::AbstractString, rx::Regex) #->Selector
return attributeSelector(key) do s
return ismatch(rx, key)
return occursin(rx, key)
end
end

@@ -299,7 +305,7 @@ function nodeOwnText(n::HTMLNode) #->String
write(b, c.text)
end
end
return takebuf_string(b)
return String(take!(copy(b)))
end

nodeOwnText(n::HTMLText) = ""
@@ -311,7 +317,7 @@ nodeOwnText(n::HTMLText) = ""
function textSubstrSelector(val::AbstractString) #->Selector
return Selector() do n::HTMLNode
text = lowercase(nodeText(n))
return contains(text, val)
return occursin(val, text)
end
end

@@ -322,7 +328,7 @@ end
function ownTextSubstrSelector(val::AbstractString) #->Selector
return Selector() do n::HTMLNode
text = lowercase(nodeOwnText(n))
return contains(text, val)
return occursin(val, text)
end
end

@@ -332,7 +338,7 @@ end
#// the specified regular expression
function textRegexSelector(rx::Regex) #->Selector
return Selector() do n::HTMLNode
return ismatch(rx, nodeText(n))
return occursin(rx, nodeText(n))
end
end

@@ -342,7 +348,7 @@ end
#// directly matches the specified regular expression
function ownTextRegexSelector(rx::Regex) #->Selector
return Selector() do n::HTMLNode
return ismatch(rx, nodeOwnText(n))
return occursin(rx, nodeOwnText(n))
end
end
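
For reference, the renames applied throughout `src/selector.jl` can be tried in isolation. This is a minimal sketch with made-up inputs, not code from the package; note that `occursin` takes the needle (or regex) first, the reverse of the old `contains` argument order.

```julia
# Argument order flips: the needle (or regex) now comes first.
has_sub = occursin("ell", "hello")      # was contains("hello", "ell")
has_rx  = occursin(r"^h", "hello")      # was ismatch(r"^h", "hello")

# `findall` returns every matching index, as `find` did in 0.6.
idx = findall(x -> x === :b, [:a, :b, :c])

# `takebuf_string(io)` becomes `String(take!(io))`.
io = IOBuffer()
write(io, "own ")
write(io, "text")
s = String(take!(io))

println((has_sub, has_rx, idx, s))
```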

6 changes: 3 additions & 3 deletions test/runtests.jl
@@ -1,5 +1,5 @@
using Cascadia
using Base.Test
using Test
using JSON
using Gumbo

@@ -40,11 +40,11 @@ for (i, d) in enumerate(selectorTests)
c = Selector(d["Selector"])
@test typeof(c) == Selector
n=parsehtml(d["HTML"])
r=matchall(c, n.root)
r=eachmatch(c, n.root)
l=length(r)
e = length(d["Results"])
if l != e
cnt += 1
global cnt += 1
println("Test Failure (known) for $(d["Selector"]) Expected $e, got $l")
else
println("Test Success for $(d["Selector"])")
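
The `global cnt += 1` change in `test/runtests.jl` reflects the scoping rules that came with Julia 0.7/1.0: a `for` loop at the top level of a script introduces its own scope, so updating a global counter inside it needs an explicit `global`. A minimal sketch, using a hypothetical counter name and loop data rather than the package's test fixtures:

```julia
# Hypothetical counter; `global` is redundant at top level but harmless.
global failures = 0

for x in (1, 2, 3)
    if isodd(x)
        # Without `global`, a script run under Julia 1.0 would treat
        # `failures` as a new local variable inside the loop.
        global failures += 1
    end
end

println(failures)
```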