ossguy/Scrape (forked from psycotica0/Scrape)
This is a small tool designed to help scraping streams into lists of useful data.
Scrape

This program is used to easily parse out either single items or lists of items from input streams. It works by being given a Perl Compatible Regular Expression, presumably with captures, that represents each item in the list of things you want to scrape from a stream. It then outputs, in order, all of the captures from all occurrences of that regexp found in the stream. The captures are, by default, each wrapped in curly braces, and while a capture may contain newlines, each set of wrapped captures for an item takes up only a single line.

For example:

    echo "<div>This is text one</div><div>This is text two</div><div>This text has a newline
    </div><div>This is done</div>" | scrape '<div>([^ ]*) (.*?)</div>'

would output:

    {This} {is text one}
    {This} {is text two}
    {This} {text has a newline}
    {This} {is done}

Also fun is:

    curl -L google.com | scrape '<a.*?href="(.*?)".*?>\s*(.*?)\s*</a>'

which returns the URLs and text of all the links on Google's homepage, or any page, in a list like:

    {URL} {TEXT}
    {URL} {TEXT}
    {URL} {TEXT}
    etc.

It also has optional delimiters to help narrow down the search. For instance:

    scrape -s '<body[^>]*>' -e '</body>' '<p>(.*?)</p>'

would go through and pull all the paragraphs out of (presumably) an HTML page without even considering the content outside of the body.

To change the output, the flags "a", "b", and "d" can be used. Each takes a string argument:

    -b is the string placed before each field
    -a is the string placed after each field
    -d is the string placed between fields

So the default behaviour is equivalent to -b'{' -a'}' -d' '.

It also supports what I call a Continuation Pattern. This is useful when trying to scrape all the data from a paginated input. For example, one could pull in an HTML page with curl and pass it to scrape; scrape would also output the URL of the "next" link it found, and the loop would start over. Currently this works by outputting the first capture of the continuation pattern to stderr, so the script using scrape can capture that stderr and, if it's a URL, go back to the top of the loop (see the sketch below). Example:

    curl $URL | scrape -s '<body[^>]*>' -e '</body>' -c '<[^>]*href="([^"]*)"[^>]*>Next' '<p>(.*?)</p>' 2> /tmp/NEXT_URL

This would output all paragraphs in a page and store the URL of the next page in /tmp/NEXT_URL, allowing the script to loop on that.
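As a rough illustration, here is a minimal driver-loop sketch built around that last example. It assumes the continuation capture lands on stderr as a bare absolute URL (without the field wrappers), and the starting URL and patterns are placeholders to be adapted to the actual site:

    #!/bin/sh
    # Hypothetical pagination loop around scrape's continuation pattern.
    URL="http://example.com/page1"
    while [ -n "$URL" ]; do
        curl -s -L "$URL" \
            | scrape -s '<body[^>]*>' -e '</body>' \
                     -c '<[^>]*href="([^"]*)"[^>]*>Next' \
                     '<p>(.*?)</p>' \
            2> /tmp/NEXT_URL
        # When no "Next" link is found, /tmp/NEXT_URL ends up empty and the loop stops.
        URL=$(cat /tmp/NEXT_URL)
    done

Each iteration prints the scraped fields for one page to stdout, while the next page's URL (if any) is captured via stderr and fed back into the loop.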