Web software for financial data scraping (i.e. for the purposes of price comparison, analytics, etc.) from online shops and the likes.
This design approach allows for a broad range of processing units (pretty much anything from desktop browsers and VPS/cloud servers to mobile browsers and low-cost embedded servers) with virtually no limits to their number, concurrently scraping data for one distributed graph database.
- Front-end
- Semantic UI
- HighCharts/HighStock
- Back-end: Laravel 5
- Database: Neo4j
- Cassidy: JavaScript
- jQuery
- AngularJS
- Sundance: NodeJS (Express)
An example config entry looks like this:
ngApp.constant('CassidiBlueprints', {
'ebay.com': {
price: {
selector: '#prcIsum',
callback: function(e){ return parseFloat(e.text().replace(/[^\d.]/g, '')) }
},
name: '#itemTitle',
image: {
selector: '#icImg',
callback: function(e){ return e.attr('src') }
}
}
});
There are two options when adding a new key for parsing.
When the value is string
it's treated as a DOM selector, in which case the contained text will be returned.
That is a shorthand of the full object with selector
, and callback
keys, where the latter is a function which is called for each match on every page and receives the jQuery element as sole argument. The callback must return a non-nil value for further processing or nil in case of an error. Data accuracy depends on proper callback error reporting.
Once there is a config for a given domain you can scrape URLs in it. With AngularJS it's as simple as:
Cassidi.steal(url).then(function(swag){ if(swag) console.log(swag.price); })
The steal
method returns an Angular Promise which is later ("asynchronously") called. The closure for the promise receives an object as a sole argument which has the same keys as the parser for that domain. Each key is either a parse result value or false
. In case all keys have failed to parse boolean false
is passed instead of an object (so be sure to check for that).
However, usually you'd want to run multiple (possibly hundreds or even thousands) of scrape operations in a non-UI-blocking manner. That is the purpose of the queue
method:
Cassidi.queue([url1, url2, ...], function(swag){ console.log(swag.price); }, function(){ alert('All done!'); })
It's pretty self-explanatory, but for the sake of clarity:
- The first argument is an array of URLs.
- The second argument is an optional function which will be used as a
callback
for each processed URL. - The third argument is an optional final callback, i.e. a simple function to be run when all URLs have been processed.
Usually you'd want to fetch a few items from the Queue REST API (see below) and run them in the queue
method while fetching the next items.
Cassidy
is an AngularJS service for data scraping with a domain-based parser configuration (or Blueprints
).
Sundance
is a lightweight NodeJS script that can run scraping queues on the server.
It's designed to be stackable and scalable.
The queue has a simple workflow. Using the REST API the typical worker scenario looks like this:
GET /api/v1/queue
returns a list of product IDs not related to recent (8h) prices and not related to a recently updated:Promise
.GET /api/v1/lock/{product}
assigns the given product a new:Promise
and returns its idPOST /api/v1/lock/{promise}
adds thevalue
parameter to the parent product's prices and deletes the promise on success (otherwise touchesupdated_at
)