Aravind Reddy V edited this page Jun 24, 2017 · 177 revisions

Osmosis is a utility for easily extracting data from HTML or XML documents.

Command reference

These are all of the "commands" that are available for chaining in an Osmosis instance.

click

( selector )

Click on nodes found by selector

contains

( string )

Discard any nodes whose contents do not match string

config

( opts )
( key, val )

Set HTTP options and configure Osmosis

data

( callback( data ) )

Calls callback with the current data object

( null )

Empty the data object

( object )

Add or replace each key in the data object with the corresponding value from object

debug

( callback( msg ) )

Call callback when any debug messages are received

delay

( seconds )

Delay the start of the next promise by seconds (a float or an integer)

do

( osmosis..., osmosis... )

Call each Osmosis instance with the current context. This will always continue, even if an instance fails.

doc

Reset the current context to the Document

dom

( callback )

Create a DOM object from the current context.

The callback will be called with 3 arguments (window, data, and next). The next([context], [data]) function must be called at least once.

done

( callback )

Calls callback when parsing has completely finished

error

( callback( msg ) )

Call callback when any error messages are received

failure/fail

( selector )

Discard any nodes that match selector

filter/success

( selector )

Discard any nodes that do not match selector

find

( selector )

Find elements based on selector anywhere within the current document

follow

( [selector] )

Follow URLs found via selector. If selector isn't provided, follow will search the current element text or common URL attributes (href, src, etc).

Examples:

.follow()
.follow('@href')
.follow('a')
.follow('a@href')
.follow('span.outlink')
.follow('input.cloneURL@value')
.follow('link[type="application/rss+xml"]@href')

get / post

( url , [data] , [opts] )

Make an HTTP request

url - A string containing a URL, which can be relative to the current context.

data (optional) - An object containing GET query parameters or POST request data.

opts (optional) - An object containing HTTP request options.
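As a rough illustration of what "relative to the current context" means for url, the sketch below resolves a few relative URLs against a page URL using the standard URL class built into Node.js. The helper name resolveAgainst and the example.com addresses are made up for this illustration, and Osmosis may resolve URLs through its own code path, so treat this as a model of the expected result rather than the library's implementation.

```javascript
// Hypothetical helper: resolve `url` against the URL of the current page
// using the standard WHATWG URL rules built into Node.js.
function resolveAgainst(currentUrl, url) {
  return new URL(url, currentUrl).href;
}

// Assumed URL of the document that the command chain is currently on.
var current = 'https://example.com/list/index.html?page=1';

// A bare file name resolves within the same directory.
var sibling = resolveAgainst(current, 'detail.html');

// A leading slash resolves from the site root.
var rooted = resolveAgainst(current, '/about');

// An absolute URL is used as given.
var absolute = resolveAgainst(current, 'https://other.example/feed');

console.log(sibling);
console.log(rooted);
console.log(absolute);
```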


Note: Query parameter values will be URL-encoded by needle, so pass raw parameter values that have not already been URL-encoded.
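The effect of passing an already encoded value can be seen with a small encoding round trip. This sketch does not call needle or Osmosis; it uses the built-in encodeURIComponent as a stand-in for whatever encoder the HTTP layer applies, and toQueryString is a hypothetical helper written only for this illustration.

```javascript
// Hypothetical helper: builds a query string by percent-encoding every
// key and value exactly once, as an HTTP client typically would.
function toQueryString(params) {
  return Object.keys(params)
    .map(function (key) {
      return encodeURIComponent(key) + '=' + encodeURIComponent(params[key]);
    })
    .join('&');
}

// Raw value: the space is encoded once.
var raw = toQueryString({ q: 'node js' });

// Pre-encoded value: the '%' is encoded again, so the server would see
// the literal text 'node%20js' instead of 'node js'.
var doubled = toQueryString({ q: 'node%20js' });

console.log(raw);     // q=node%20js
console.log(doubled); // q=node%2520js
```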

log

( callback( msg ) )

Call callback when any log messages are received

login

( user , pass , [success] , [fail] )

Submit a login form.

Arguments:

user - A string containing a username, email address, etc.

pass - A password string

success (optional) - A selector string determining if the login attempt succeeded

fail (optional) - A selector string determining if the login attempt failed


How it works

login finds the first form containing input[type="password"] and uses that input as the password field. It will use the preceding <input> element as the user field.
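The field selection described above can be modelled without a real page. In this sketch each form is reduced to an array of its inputs in document order; pickLoginFields and the sample field names are hypothetical, and the code is an illustration of the documented rule, not Osmosis's actual implementation.

```javascript
// Hypothetical model: `forms` is an array of forms, each form an array of
// its <input> elements in document order, reduced to { type, name }.
function pickLoginFields(forms) {
  for (var i = 0; i < forms.length; i++) {
    var inputs = forms[i];
    for (var j = 0; j < inputs.length; j++) {
      if (inputs[j].type === 'password') {
        return {
          form: i,                                   // index of the first form with a password input
          user: j > 0 ? inputs[j - 1].name : null,   // the input just before it
          pass: inputs[j].name
        };
      }
    }
  }
  return null; // no form with a password input was found
}

// A made-up page with a search form followed by a login form.
var page = [
  [{ type: 'search', name: 'q' }],
  [
    { type: 'email', name: 'email' },
    { type: 'password', name: 'pw' },
    { type: 'submit', name: 'go' }
  ]
];

var fields = pickLoginFields(page);
console.log(fields); // { form: 1, user: 'email', pass: 'pw' }
```

Under this reading of the rule, an unrelated input placed directly before the password field (a hidden token, for example) would be picked as the user field, in which case submit with explicit data may be the better fit.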

match

( [selector], RegExp )

Discard any nodes whose contents do not match RegExp

page / paginate

( selector , [limit] )

Paginate the previous request limit times based on selector.

selector:

selector (String) - A selector string for either:

  • an element with the next page URL in its inner text or in an attribute that commonly contains a URL (href, src, etc.)
  • an element whose name and value attributes will respectively be added or replaced in the next page query.

selector (Object) - An object where each key is a query parameter name and each value is either a selector string or an increment amount (+1, -1, etc.).

limit:

limit (Number) - Total number of "next page" requests to make.

limit (String) - A selector string for an element containing the total number of requests to make.


.paginate('a.nextPage') // go to `a.nextPage` `@href`
.paginate('link[rel="next"]@href') // go to `link` `@href`
.paginate('input[name="page"]') // update `page` parameter of the next query

// adds 20 to the `startIndex` query parameter
// sets `page` query parameter to `a.nextPage` content
// stops after 15 requests are made
.paginate({ startIndex: +20,  page: 'a.nextPage' }, 15)
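To make the object form concrete, the sketch below models how the query for each following page could be derived: numeric values are added to the current parameter, string values are treated as selectors. It is a simplified model based only on the description above. nextPageQuery, planRequests, and the lookup callback (a stand-in for evaluating a selector on the fetched page) are hypothetical names, not part of Osmosis.

```javascript
// Hypothetical model of the object form of paginate: numbers are added to
// the current query value, strings are treated as selectors and resolved
// through `lookup`.
function nextPageQuery(query, spec, lookup) {
  var next = Object.assign({}, query);
  Object.keys(spec).forEach(function (name) {
    var rule = spec[name];
    if (typeof rule === 'number') {
      next[name] = Number(query[name] || 0) + rule;
    } else {
      next[name] = lookup(rule);
    }
  });
  return next;
}

// Derive the queries for `limit` "next page" requests, starting from `query`.
function planRequests(query, spec, limit, lookup) {
  var planned = [];
  for (var i = 0; i < limit; i++) {
    query = nextPageQuery(query, spec, lookup);
    planned.push(query);
  }
  return planned;
}

// Pretend that `a.nextPage` contains the next page number on every page.
var pageNo = 1;
function fakeLookup() {
  pageNo += 1;
  return String(pageNo);
}

var plan = planRequests(
  { startIndex: 0, page: '1' },
  { startIndex: +20, page: 'a.nextPage' },
  3,
  fakeLookup
);

console.log(plan[0]); // { startIndex: 20, page: '2' }
console.log(plan[2]); // { startIndex: 60, page: '4' }
```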

pause / resume / stop

Pause, resume or stop an osmosis instance.

parse

( string )

Parse an HTML or XML string

Arguments:

string - A string or buffer containing the HTML/XML data

set

( name , [selector] )

Set name to the value of selector, or to the text of the current element if selector is omitted

( object )

Set each key in object to the value of its corresponding selector.


.set('title') // set 'title' to current element text
.set('title', 'a.title') // set 'title' to text of 'a.title'
.set({
    title:       'a.title',
    description: 'p.description',
    url:         'a.permalink @href',
    images:      ['img @src'],
    comments: [
        osmosis
        .follow('a.comments')
        .find('div.comment')
        .set({
            'author':  '.author',
            'content': 'p.content',
            'date':    '.date'
        })
    ]
});

submit

( selector , [data] )

Submit a form

Arguments:

selector - A selector for the <form> element or submit button.

data (optional) - An object where each key and value represents a form input name and value

then

( callback( context, data, [next], [done] ) )

Calls callback with the context of the current element.

context:

The context argument is the current context at that point in the command chain. If the previous command was get, post, follow, or parse then the context will be a Document. If the previous command was find then the current context will be one of the Elements that was found.

data:

The data argument contains values set via osmosis.set. This object can be modified in any way.

next:

The next argument is a function that will call the next command. It takes two arguments: context and data.

done:

The done argument is a function to call when then will no longer call next. This is only required if then calls next asynchronously any number of times.

Note: If the callback accepts done as an argument, it must always call done, even if next was never called.
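One way to honor the next/done contract when the callback starts several asynchronous operations is to count completions and call done once the count reaches zero. The sketch below is a minimal illustration under that assumption; saveAll and fakeSave are hypothetical names, and the stand-ins at the bottom exist only so the snippet runs without Osmosis or a database.

```javascript
// Hypothetical helper for use inside a `then` callback: runs an async
// `save(item, cb)` for every item, calls `next` after each save, and
// calls `done` exactly once when every save has finished.
function saveAll(items, save, next, done) {
  var pending = items.length;
  if (pending === 0) {
    done(); // nothing to save, but `done` must still be called
    return;
  }
  items.forEach(function (item) {
    save(item, function () {
      next(item);
      pending -= 1;
      if (pending === 0) done();
    });
  });
}

// Stand-ins so the sketch runs on its own. A real save would normally
// complete asynchronously; this one calls back immediately.
function fakeSave(item, cb) {
  cb();
}

var forwarded = [];
var finished = 0;

saveAll(
  ['a', 'b', 'c'],
  fakeSave,
  function (item) { forwarded.push(item); },
  function () { finished += 1; }
);

console.log(forwarded); // [ 'a', 'b', 'c' ]
console.log(finished);  // 1
```

Inside a real then callback, the next argument passed to the helper would be a small wrapper such as function () { next(context, data); }, and done would be passed through unchanged. Counting completions keeps the done call independent of the order in which the saves finish.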

Functions

The callback will have these functions bound to its this value:

  • this.request(method, url, [data], callback([err], context), [opts])
  • this.log(msg)
  • this.debug(msg)
  • this.error(msg)

Examples:

Example 1: find every ul > li and pass it to the next command

osmosis
...
.then(function(context, data, next) {
    var items = context.find('ul > li');

    items.forEach(function(item) {
        next(item, data);
    })
})


Example 2: set data.url to the current page URL

osmosis
...
.then(function(context, data, next) {
    data.url = context.doc().request.url;
    next(context, data);
})


Example 3: only continue if lastname != undefined

osmosis
...
.then(function(context, data, next) {
    if (data.lastname != undefined)
        next(context, data)
})


Example 4: using the done function

osmosis
...
.then(function(context, data, next, done) {
    if (db.connected == false) {
        this.error('database disconnected');
        done();
        return;
    }

    data.someArray.forEach(function(obj, index) {
        db.save(obj, function() {
            next(context, data);

            if (index == data.someArray.length-1)
                done();
        })
    })
})
