dom-text-matcher

Experiment about searching for text in the DOM, transcending element boundaries

What is this

This is an experiment about how to locate text patterns in a DOM, when the match might span multiple nodes, and we don't know where to look exactly.

The code traverses the configured part of the DOM and records which slice of the text lives in which node. It then searches for the pattern in the innerText of that part of the DOM using google-diff-match-patch (see app/lib/fancymatcher/README.txt for more info about this part).

When a match is found, it maps it back to DOM elements using the collected information, and returns info about where the text is (XPath expressions and string indices).
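
The collect-then-map-back idea can be sketched as follows. This is only an illustration on a simplified tree, not the code in domsearcher.coffee: the Node class, collect_slices and map_match are hypothetical names, and the exact str.find stands in for the fuzzy google-diff-match-patch search.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a DOM node (not the real DOM API)."""
    tag: str = "#text"
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def collect_slices(root: Node) -> Tuple[str, List[Tuple[int, int, str]]]:
    """Walk the tree in document order; record (start, end, path) per text node."""
    slices: List[Tuple[int, int, str]] = []
    parts: List[str] = []
    offset = 0

    def walk(node: Node, path: str) -> None:
        nonlocal offset
        if node.tag == "#text":
            slices.append((offset, offset + len(node.text), path))
            parts.append(node.text)
            offset += len(node.text)
            return
        counts = {}  # per-tag sibling counters for XPath-like indices
        for child in node.children:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            step = "text()" if child.tag == "#text" else child.tag
            walk(child, f"{path}/{step}[{counts[child.tag]}]")

    walk(root, f"/{root.tag}[1]")
    return "".join(parts), slices


def map_match(slices, match_start: int, match_end: int):
    """Return (path, start_in_node, end_in_node) for every node the match touches."""
    hits = []
    for start, end, path in slices:
        lo, hi = max(start, match_start), min(end, match_end)
        if lo < hi:
            hits.append((path, lo - start, hi - start))
    return hits


# <p>Hello <b>wor</b>ld</p>
root = Node("p", children=[
    Node(text="Hello "),
    Node("b", children=[Node(text="wor")]),
    Node(text="ld"),
])
text, slices = collect_slices(root)
pattern = "o world"
at = text.find(pattern)  # exact search as a stand-in for the fuzzy matcher
hits = map_match(slices, at, at + len(pattern))
```

Here the match spans three text nodes, so hits holds one (path, start, end) entry for each of them.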

Optionally, it can also highlight the match in the DOM.

All the DOM analyzing logic is in one CoffeeScript file (app/lib/domsearcher/domsearcher.coffee).

How to run

Run scripts/web-server.js
Go to http://localhost:8000/
Click the buttons, and see what happens.

Or see the live demo.

Unsolved problems:

Pattern length

Unfortunately, there is a limit on the maximum length of the pattern. The exact value varies by browser and platform, but it's typically 32 or 64, which means you can't search for patterns longer than that. (The getMaxPatternLength method returns the current limit.)

This limitation comes from the implementation of the Bitap matching algorithm; will need to look into this later. See thread here.
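
To see where such a limit can come from, here is a minimal sketch of an exact-match Bitap (shift-and) search. It is not the fuzzy implementation in google-diff-match-patch; bitap_find and WORD_BITS are hypothetical names, and the 32-bit word size is an assumption chosen to mirror the typical limit mentioned above. The point is that the algorithm keeps one bit per pattern position in a single machine word, so the pattern cannot be longer than the word is wide.

```python
WORD_BITS = 32  # assumption: the matcher state must fit into one 32-bit word


def bitap_find(text: str, pattern: str, word_bits: int = WORD_BITS) -> int:
    """Return the index of the first exact match of pattern in text, or -1."""
    m = len(pattern)
    if m == 0:
        return 0
    if m > word_bits:
        # One bit per pattern character: longer patterns do not fit the word.
        raise ValueError(f"pattern longer than {word_bits} characters")

    # For each character, a mask with bit i set where pattern[i] == character.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)

    accept = 1 << (m - 1)  # bit that signals "whole pattern matched"
    state = 0              # bit i set: pattern[0..i] matches text ending here
    for pos, ch in enumerate(text):
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            return pos - m + 1
    return -1
```

Python integers are unbounded, so the explicit length check is what models the fixed word size here.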

Currently, this is worked around by searching for the first and the last slice of the pattern, each short enough to fit the limit (32 characters, for example). If both are found and the distance between them is about right, it's accepted as a match. What goes between the start and the end slice is not analyzed, so this might lead to false matches.
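
A rough sketch of this workaround, and of the false match it can produce, is shown below. The function name find_long_pattern and the max_len and slack parameters are hypothetical, and the exact str.find again stands in for the fuzzy matcher.

```python
def find_long_pattern(text: str, pattern: str, max_len: int = 32, slack: int = 4):
    """Return (start, end) of an accepted match, or None.

    Only the first and last max_len characters of a long pattern are checked;
    whatever lies between them is never compared.
    """
    if len(pattern) <= max_len:
        start = text.find(pattern)
        return None if start < 0 else (start, start + len(pattern))

    head, tail = pattern[:max_len], pattern[-max_len:]
    start = text.find(head)
    if start < 0:
        return None

    # Where the tail should begin if the middle has the expected length.
    expected_tail = start + len(pattern) - max_len
    tail_at = text.find(tail, max(start, expected_tail - slack))
    if tail_at < 0 or abs(tail_at - expected_tail) > slack:
        return None
    return (start, tail_at + max_len)


pattern = "lorem ipsum dolor sit amet"
text = ">> lorem ipsum DOLOR sit amet <<"  # the middle differs from the pattern

match = find_long_pattern(text, pattern, max_len=8)
is_false_match = match is not None and pattern not in text
```

With max_len=8 only "lorem ip" and "sit amet" are compared, so the differing middle goes unnoticed and the match is accepted even though the pattern does not occur in the text.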

Hidden nodes

When the "display" property of a node is set to "none", it's not displayed, so its content does not get into its parent's innerText.

However, there is no easy way to detect this when one is only looking at the DOM. (Like we do.)

So, currently we are trying to detect whether a node (and its children) is hidden by searching for its innerText in the innerText of the parent. (If there is no match, it means that this is probably hidden.)

However, this approach can yield "false negatives" if a node is indeed hidden, but its content is the same as that of another, non-hidden node.

This would not break the results, but would add false parts to the selection.
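
The heuristic and its failure case can be illustrated on a toy model. Everything here is an assumption made for the sketch: the Node class with a hidden flag, all_text (the node's own text regardless of visibility), rendered_text (a stand-in for the parent's innerText, which skips hidden nodes) and looks_hidden are hypothetical, and real browser innerText behaviour is more involved.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Toy node: `hidden` models display: none (not the real DOM API)."""
    text: str = ""
    hidden: bool = False
    children: List["Node"] = field(default_factory=list)


def all_text(node: Node) -> str:
    """Text of the node and its descendants, regardless of visibility."""
    return node.text + "".join(all_text(c) for c in node.children)


def rendered_text(node: Node) -> str:
    """Stand-in for innerText: hidden subtrees contribute nothing."""
    if node.hidden:
        return ""
    return node.text + "".join(rendered_text(c) for c in node.children)


def looks_hidden(parent: Node, child: Node) -> bool:
    """Heuristic: the child is probably hidden if its text is absent from the parent's rendered text."""
    own = all_text(child)
    return own != "" and own not in rendered_text(parent)


# Case 1: the hidden node has unique content, so it is detected.
cancel = Node("Cancel", hidden=True)
toolbar = Node(children=[Node("Save"), cancel])
detected = looks_hidden(toolbar, cancel)

# Case 2: a visible sibling has the same content, so the hidden node is missed.
hidden_ok = Node("OK", hidden=True)
dialog = Node(children=[Node("OK"), hidden_ok])
missed = not looks_hidden(dialog, hidden_ok)
```

In the second case the hidden node's text is found in the parent's rendered text only because a visible sibling happens to contain the same string, which is exactly the false negative described above.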
