Robbie Chipka edited this page Dec 23, 2015 · 177 revisions

Getting Started

Osmosis works by passing a context object and a data object down a command chain.

##### [context]
The context object is an XML/HTML Document or [Element](https://developer.mozilla.org/en-US/docs/Web/API/Element).

##### [data]
The data object is just a regular object that starts out empty.

Example:

```javascript
osmosis
.command1() // passes a context to command2
.command2() // receives context and inherits all data values
...

osmosis
.command3() // new instance doesn't receive context or data
```

#### Contexts that might be passed

##### New context

Some commands select new elements or request new documents. These commands will pass a new context down the chain. For example, `osmosis.find` will pass found elements to the next command.

##### Multiple contexts

A command can have more than one context (i.e. element) for the next command to process. Rather than passing an array of elements to the next command, it simply calls the next command once for each element.

##### Same context

Other commands, such as `log` and `set`, perform passive operations. They simply forward the current context.
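
The three cases above can be modelled in a few lines of plain JavaScript. This is only a toy model of the calling convention, not how Osmosis is implemented: `runChain`, `findItems`, and `setName` are hypothetical names, and a context is represented here by a plain object rather than a Document or Element.

```javascript
// Toy model of a command chain. Each command receives (context, data, next)
// and calls next(context, data) zero or more times.
function runChain(commands, context, data, onResult) {
  function step(index, ctx, d) {
    if (index === commands.length) {
      onResult(ctx, d); // reached the end of the chain
      return;
    }
    commands[index](ctx, d, function (nextCtx, nextData) {
      step(index + 1, nextCtx, nextData);
    });
  }
  step(0, context, data);
}

// "New context" and "multiple contexts": call next once per child element
// instead of passing the whole array along.
function findItems(context, data, next) {
  context.children.forEach(function (child) {
    next(child, data);
  });
}

// "Same context": record a value, then forward the context unchanged.
// Copying the data object per call is an assumption of this sketch.
function setName(context, data, next) {
  next(context, Object.assign({}, data, { name: context.name }));
}

var page = { name: 'ul', children: [{ name: 'li-1' }, { name: 'li-2' }] };
var results = [];
runChain([findItems, setName], page, {}, function (context, data) {
  results.push(data.name);
});
console.log(results); // [ 'li-1', 'li-2' ]
```

Because `findItems` calls `next` twice, every command after it runs twice, once per element.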

## Command reference

These are all of the "commands" that are available for chaining in an Osmosis instance.

## click

##### ( selector )
Click on nodes found by `selector`

## contains

##### ( string )
Discard any nodes whose contents do not match `string`

## config

##### ( opts )
##### ( key, val )
Set HTTP options and configure Osmosis

## data

##### ( callback( data ) )
Calls `callback` with the current data object

##### ( null )
Empty the data object

##### ( object )
Add or replace each key in the data object with the corresponding value from `object`

## debug

##### ( callback( msg ) )
Call `callback` when any debug messages are received

## delay

##### ( seconds )
Delay starting the next promise for `seconds` (float or int)

## do

##### ( osmosis..., osmosis... )
Call each Osmosis instance with the current context. This will always continue, even if an instance fails.

## doc

Reset the current context to the Document

## dom

##### ( callback )
Create a DOM object from the current context.

The callback will be called with 3 arguments (`window`, `data`, and `next`). The `next([context], [data])` function must be called at least once.

## done

##### ( callback )
Calls `callback` when parsing has completely finished

## error

##### ( callback( msg ) )
Call `callback` when any error messages are received

## failure / fail

##### ( selector )
Discard any nodes that match `selector`

## filter / success

##### ( selector )
Discard any nodes that do not match `selector`

## find

##### ( selector )
Find elements based on `selector` anywhere within the current document

## follow

##### ( [selector] )
Follow URLs found via `selector`. If `selector` isn't provided, `follow` will search the current element text or common URL attributes (`href`, `src`, etc.).

#### Examples:

```javascript
.follow()
.follow('@href')
.follow('a')
.follow('a@href')
.follow('span.outlink')
.follow('input.cloneURL@value')
.follow('link[type="application/rss+xml"]@href')
```

## get / post

##### ( url , [data] , [opts] )
Make an HTTP request

`url` - A string containing a URL, which can be relative to the current context.

`data` (optional) - An object containing GET query parameters or POST request data.

`opts` (optional) - An object containing HTTP request options.

Note: Query parameter values will be URL-encoded by needle, so make sure that the values you pass are not already URL-encoded.
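
To see why pre-encoded values are a problem, here is a minimal sketch of what happens when an already-encoded value is encoded a second time. `toQuery` is a hypothetical helper, and the standard `encodeURIComponent` stands in for whatever encoder the HTTP layer actually applies, so the exact output of needle may differ in detail.

```javascript
// Hypothetical helper: builds a query string the way an HTTP layer might,
// encoding every key and value exactly once.
function toQuery(params) {
  return Object.keys(params).map(function (key) {
    return encodeURIComponent(key) + '=' + encodeURIComponent(params[key]);
  }).join('&');
}

var raw = toQuery({ q: 'web scraping' });          // value passed unencoded
var preEncoded = toQuery({ q: 'web%20scraping' }); // value encoded by the caller

console.log(raw);        // q=web%20scraping
console.log(preEncoded); // q=web%2520scraping (the "%" itself was encoded again)
```

The server would decode the second form to the literal text `web%20scraping` rather than `web scraping`.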

## log

##### ( callback( msg ) )
Call `callback` when any log messages are received

## login

##### ( user , pass , [success] , [fail] )
Submit a login form.

##### Arguments:
`user` - A string containing a username, email address, etc.

`pass` - A password string

`success` (optional) - A selector string determining if the login attempt succeeded

`fail` (optional) - A selector string determining if the login attempt failed

###### How it works
`login` finds the first form containing `input[type="password"]` and uses that input as the password field. It will use the preceding `<input>` element as the user field.
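
The field-selection rule described above can be sketched over plain data. This is a toy model only: `pickLoginFields` is a hypothetical helper, and each form is represented as an array of input descriptors in document order rather than as real DOM elements.

```javascript
// Toy model: find the first form that contains a password input, use that
// input as the password field and the input just before it as the user field.
function pickLoginFields(forms) {
  for (var i = 0; i < forms.length; i++) {
    var inputs = forms[i];
    for (var j = 0; j < inputs.length; j++) {
      if (inputs[j].type === 'password') {
        return {
          user: j > 0 ? inputs[j - 1].name : null, // preceding input, if any
          pass: inputs[j].name
        };
      }
    }
  }
  return null; // no form contains a password input
}

var forms = [
  [{ type: 'search', name: 'q' }], // skipped: no password input
  [{ type: 'hidden', name: 'csrf' },
   { type: 'email', name: 'email' },
   { type: 'password', name: 'pwd' }]
];
var fields = pickLoginFields(forms);
console.log(fields); // { user: 'email', pass: 'pwd' }
```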

## match

##### ( [selector], RegExp )
Discard any nodes whose contents do not match `RegExp`

## page / paginate

##### ( selector , [limit] )
Paginate the previous request `limit` times based on `selector`.

#### selector:
`selector` (String) - A selector string for either:

  • an element with the next page URL in its inner text or in an attribute that commonly contains a URL (`href`, `src`, etc.)
  • an element whose `name` and `value` attributes will respectively be added or replaced in the next page query.

`selector` (Object) - An object where each key is a query parameter name and each value is either a selector string or an increment amount (`+1`, `-1`, etc.).

#### limit:
`limit` (Number) - Total number of "next page" requests to make.

`limit` (String) - A selector string for an element containing the total number of requests to make.

```javascript
.paginate('a.nextPage') // go to `a.nextPage` `@href`
.paginate('link[rel="next"]@href') // go to `link` `@href`
.paginate('input[name="page"]') // update `page` parameter of the next query

// adds 20 to the `startIndex` query parameter
// sets `page` query parameter to `a.nextPage` content
// stops after 15 requests are made
.paginate({ startIndex: +20,  page: 'a.nextPage' }, 15)
```
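
As a rough illustration of the object form of `selector`, the sketch below computes a next-page URL from the current one. `nextPageUrl` is a hypothetical helper, not part of Osmosis: numeric values are treated as increments, string values are used directly (in Osmosis they would be read from the page via a selector), and a missing parameter is assumed to start at 0.

```javascript
// Hypothetical helper: apply increments and replacement values to the
// query string of the current URL to obtain the next page URL.
function nextPageUrl(currentUrl, params) {
  var url = new URL(currentUrl);
  Object.keys(params).forEach(function (key) {
    var value = params[key];
    if (typeof value === 'number') {
      // increment amount: add it to the current numeric value
      var current = Number(url.searchParams.get(key)) || 0;
      url.searchParams.set(key, String(current + value));
    } else {
      // replacement value: overwrite the parameter
      url.searchParams.set(key, value);
    }
  });
  return url.toString();
}

var nextUrl = nextPageUrl('http://example.com/list?startIndex=40&page=2',
                          { startIndex: +20, page: '3' });
console.log(nextUrl); // http://example.com/list?startIndex=60&page=3
```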

## pause / resume / stop

Pause, resume or stop an osmosis instance.

## parse

##### ( string )
Parse an HTML or XML string

##### Arguments:
`string` - A string or buffer containing the HTML/XML data

## set

##### ( name , selector )
Set `name` to the value of `selector`

##### ( object )
Set each `key` to the value of each `val` selector.

```javascript
.set('title') // set 'title' to current element text
.set('title', 'a.title') // set 'title' to text of 'a.title'
.set({
    title:       'a.title',
    description: 'p.description',
    url:         'a.permalink @href',
    images:      ['img @src'],
    comments: [
        osmosis
        .follow('a.comments')
        .find('div.comment')
        .set({
            'author':  '.author',
            'content': 'p.content',
            'date':    '.date'
        })
    ]
});
```

## submit

##### ( selector , [data] )
Submit a form

##### Arguments:
`selector` - A selector for the `<form>` element

`data` (optional) - An object where each key and value represents a form input name and value

## then

##### ( callback( context, data, [next], [done] ) )
Calls `callback` with the context of the current element.

#### context:
The `context` argument is the current context at that point in the command chain. If the previous command was `get`, `post`, `follow`, or `parse`, then the context will be a Document. If the previous command was `find`, then the current context will be one of the Elements that were found.

#### data:
The `data` argument contains values set via `osmosis.set`. This object can be modified in any way.

#### next:
The `next` argument is a function that will call the next command. It takes two arguments: `context` and `data`.

#### done:
The `done` argument is a function to call when `then` will no longer call `next`. This is only required if `then` calls `next` asynchronously any number of times.

Note: If the callback accepts `done` as an argument, it must always call `done`, even if `next` was never called.

#### Functions
The callback will have these functions bound to its `this` value:

  • this.request(method, url, [data], callback([err], context), [opts])
  • this.log(msg)
  • this.debug(msg)
  • this.error(msg)

#### Examples:

**Example 1:** find every `ul > li` and pass it to the next command
```javascript
osmosis
...
.then(function(context, data, next) {
    var items = context.find('ul > li');
    items.forEach(function(item) {
        next(item, data);
    });
})
```

**Example 2:** set `data.url` to the current page URL
```javascript
osmosis
...
.then(function(context, data, next) {
    data.url = context.doc().request.url;
    next(context, data);
})
```

**Example 3:** only continue if `lastname != undefined`
```javascript
osmosis
...
.then(function(context, data, next) {
    if (data.lastname != undefined)
        next(context, data);
})
```

**Example 4:** using the `done` function
```javascript
osmosis
...
.then(function(context, data, next, done) {
    if (db.connected == false) {
        this.error('database disconnected');
        done();
        return;
    }

    var pending = data.someArray.length;

    // `done` must always be called, even when there is nothing to save
    if (pending === 0) {
        done();
        return;
    }

    data.someArray.forEach(function(obj) {
        db.save(obj, function() {
            next(context, data);
            // call `done` only after the last outstanding save has finished
            if (--pending === 0) done();
        });
    });
})
```
