Merge pull request #42 from Narzerus/refactor/better-setup-api
Refactor/better setup api
Rafael Vidaurre authored and Rafael Vidaurre committed Apr 15, 2015
2 parents 865d291 + 1585f6e commit ab5b28b
Showing 13 changed files with 220 additions and 326 deletions.
99 changes: 99 additions & 0 deletions Readme.md
@@ -330,6 +330,84 @@ To run the job simply use the job.run() method, please keep in mind jobs are not
job.run(); // Job will start running
```

Hooks
=====
Hooks run at specific moments in the life of an instanced `task` (before events are emitted to the outside), and they can modify the scraper's default behavior.
To specify a `task`'s hooks use its `setup` method.

```javascript
Yakuza.task('scraper', 'agent', 'someTask').setup(function (config) {
config.hooks = {
'onFail': function (task) {
// ... do stuff
},
'onSuccess': function (task) {
// ... do stuff
}
};
});
```

onFail
------
Runs when a task fails. `onFail` can be used, for example, to retry failed tasks right away.

The `task` object passed to the `onFail` hook has the following properties:
- runs: Number of times the task has run (starts from 1)
- params: Parameters with which the task was first instanced (doesn't change)
- rerun([params]): Re-runs the task with its original parameters (those passed by the builder). If an object is provided, it replaces the task's parameters.
- error: Error provided with the task's `fail` event (if one was passed)
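
As a rough sketch of how these properties might be combined, the hook below retries a failed task and swaps in different parameters when the error looks like a timeout. The `fallbackUrl` parameter, the `ETIMEDOUT` check and the `fakeTask` helper are assumptions made for illustration only; `fakeTask` merely imitates the shape of the object described above so the hook can be tried in isolation, and it is not part of Yakuza.

```javascript
// Hypothetical hook: give up after the third run, otherwise retry.
function onFail(task) {
  if (task.runs >= 3) {
    return; // third run already failed, stop retrying
  }

  if (task.error && task.error.code === 'ETIMEDOUT') {
    // Passing an object replaces the task's parameters for the rerun
    // (`fallbackUrl` is an assumed, illustrative parameter)
    task.rerun({url: task.params.fallbackUrl});
  } else {
    // No argument: rerun with the original parameters
    task.rerun();
  }
}

// Stand-in for the object passed to the hook (NOT Yakuza's implementation).
// It only records the arguments `rerun` was called with.
function fakeTask(runs, params, error) {
  var calls = [];

  return {
    runs: runs,
    params: params,
    error: error,
    calls: calls,
    rerun: function (newParams) {
      calls.push(newParams);
    }
  };
}

var timedOut = fakeTask(1, {url: 'a', fallbackUrl: 'b'}, {code: 'ETIMEDOUT'});
onFail(timedOut);
console.log(timedOut.calls.length); // 1, rerun was requested once
```

In a real scraper the same function would be assigned to `config.hooks.onFail` inside the task's `setup` callback.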

onSuccess
---------
Runs when a task succeeds. `onSuccess` can be used to stop the job's execution even though the task was successful. This is useful when we need to stop execution depending on the data we receive.

The `task` object passed to the `onSuccess` hook has the following properties:
- data: Data returned by the `task`'s success() method
- stopJob(): Method which, if called, stops the job's execution once the current `executionBlock` is done

Here's an example of when this could be useful:

```javascript
Yakuza.task('scraper', 'agent', 'login').setup(function (config) {
config.hooks = {
'onSuccess': function (task) {
// We stop the job if the loginStatus returns `wrongPassword`
// remember: in many cases wrongPassword might NOT be an error, identifying what's the login status
// can be part of a successful scraping process as well.

if (task.data.loginStatus === 'wrongPassword') {
task.stopJob();
}
}
};
}).main(function (task, http, params) {
var opts;

opts = {
url: 'http://someurl.com',
data: {
username: 'foo',
password: 'bar'
}
};

http.post(opts)
.then(function (res, body) {
if (body === 'wrong password') {
task.success({loginStatus: 'wrongPassword'});
} else {
task.success({loginStatus: 'authorized'});
}
})
.fail(function (error) {
task.fail(error);
})
.done();
});
```

When calling `task.stopJob()` the `task:<taskName>:success` event is, of course, still fired.

Advanced
========
@@ -484,6 +562,27 @@ Yakuza.task('scraper', 'agent', 'login').main(function (task, http, params) {

Any new task will now have its `http` object initialized with the cookies that were present at the time `saveCookies` was called. Notice that only tasks from the next **execution block** will be affected.

Retrying tasks
--------------
In many cases the websites we scrape are sloppy, poorly implemented or simply unstable. This causes our tasks to sometimes fail without warning. For this reason `Yakuza` provides a way of re-running tasks when this happens, via its `onFail` hook.

When a task is rerun, it is reset to the state it had when it was instanced, except for some properties such as `startTime`, which marks the moment when the task was first run.

```javascript
Yakuza.task('scraper', 'agent', 'login').setup(function (config) {
config.hooks = {
onFail: function (task) {
if (task.runs <= 5) {
// Will retry the task at most 5 times
task.rerun();
}
}
};
});
```
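
To illustrate how the `runs` check bounds the number of retries, here is a small stand-alone simulation. It does not use Yakuza: `countRuns` is a hypothetical helper that assumes `runs` is incremented once per run (starting from 1) and that the hook fires after every failure, as described in the Hooks section.

```javascript
// Stand-alone simulation, NOT Yakuza code: runs a task that always fails and
// returns how many times it ran in total under the given onFail hook.
function countRuns(onFail) {
  var shouldRun = true;
  var task = {
    runs: 0,
    rerun: function () {
      shouldRun = true;
    }
  };

  while (shouldRun) {
    shouldRun = false;
    task.runs += 1; // `runs` is 1 on the first run
    onFail(task); // the task "fails", so the hook fires
  }

  return task.runs;
}

// One initial run plus up to 5 retries
var totalRuns = countRuns(function (task) {
  if (task.runs <= 5) {
    task.rerun();
  }
});

console.log(totalRuns); // 6
```

Under these assumptions a task that never succeeds runs six times in total: the initial run plus five retries. Use `task.runs < 5` instead if the goal is five runs overall.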

You can find the `task` object's properties in the **Hooks** section.

Glossary
========

45 changes: 8 additions & 37 deletions agent.js
@@ -23,12 +23,6 @@ function Agent (id) {
*/
this.__applied = false;

/**
* List of functions which modify the Agent's configuration (provided by setup())
* @private
*/
this.__configCallbacks = [];

/**
* Agent's configuration object (set by running all configCallback functions)
* @private
@@ -58,17 +52,6 @@ function Agent (id) {
this.id = id;
}

/**
* Run functions passed via config(), thus applying their config logic
* @private
*/
Agent.prototype.__applyConfigCallbacks = function () {
var _this = this;
_.each(_this.__configCallbacks, function (configCallback) {
configCallback(_this.__config);
});
};

/**
* Turns every element in the execution plan into an array for type consistency
* @private
@@ -108,41 +91,29 @@ Agent.prototype.__formatPlan = function () {
this._plan = formattedPlan;
};

/**
* Applies all task definitions
* @private
*/
Agent.prototype.__applyTaskDefinitions = function () {
_.each(this._taskDefinitions, function (taskDefinition) {
taskDefinition._applySetup();
});
};

/**
* Applies all necessary processes regarding the setup stage of the agent
*/
Agent.prototype._applySetup = function () {
if (this.__applied) {
return;
}
this.__applyConfigCallbacks();
this.__applyTaskDefinitions();

this.__formatPlan();
this.__applied = true;
};

/**
* Saves a configuration function into the config callbacks array
* @param {function} cbConfig method which modifies the agent's config object (passed as argument)
* Sets the task's execution plan
* @param {Array} executionPlan array representing the execution plan for this agent
*/
Agent.prototype.setup = function (cbConfig) {
if (!_.isFunction(cbConfig)) {
throw new Error('Setup argument must be a function');
Agent.prototype.plan = function (executionPlan) {
// TODO: Validate execution plan format right away
if (!_.isArray(executionPlan)) {
throw new Error('Agent plan must be an array of task ids');
}

this.__configCallbacks.push(cbConfig);

return this;
this.__config.plan = executionPlan;
};

/**
1 change: 0 additions & 1 deletion job.js
@@ -631,7 +631,6 @@ Job.prototype.__applyComponents = function () {
return;
}

this._scraper._applySetup();
this.__agent._applySetup();

this.__componentsApplied = true;
2 changes: 1 addition & 1 deletion package.json
@@ -1,6 +1,6 @@
{
"name": "yakuza",
"version": "0.2.1",
"version": "1.0.0",
"description": "",
"main": "yakuza.js",
"repository": {
55 changes: 0 additions & 55 deletions scraper.js
@@ -16,24 +16,6 @@ Agent = require('./agent');
* @class
*/
function Scraper () {
/**
* Determines if the setup processes have been applied
* @private
*/
this.__applied = false;

/**
* Array of callbacks provided via config() which set the Scraper's configuration variables
* @private
*/
this.__configCallbacks = [];

/**
* Config object, contains configuration data and is exposed via the setup() method
* @private
*/
this.__config = {};

/**
* Object which contains scraper-wide routine definitions, routines are set via the routine()
* method
@@ -70,43 +52,6 @@ Scraper.prototype.__createAgent = function (agentId) {
return this._agents[agentId];
};

/**
* Run functions passed via config(), thus applying their config logic
* @private
*/
Scraper.prototype.__applyConfigCallbacks = function () {
var _this = this;
_.each(_this.__configCallbacks, function (configCallback) {
configCallback(_this.__config);
});
};

/**
* Applies all necessary processes regarding the setup stage of the scraper
*/
Scraper.prototype._applySetup = function () {
if (this.__applied) {
return;
}
this.__applyConfigCallbacks();
this.__applied = true;
};

/**
* Used to configure the scraper, it enqueues each configuration function meaning it
* allows a scraper to be configured in multiple different places
* @param {function} cbConfig function which will modify config parameters
*/
Scraper.prototype.setup = function (cbConfig) {
if (!_.isFunction(cbConfig)) {
throw new Error('Config argument must be a function');
}

this.__configCallbacks.push(cbConfig);

return Scraper;
};

/**
* Creates or gets an agent based on the id passed
* @param {string} agentId Id of the agent to retrieve/create
51 changes: 23 additions & 28 deletions spec/agent.spec.js
@@ -20,37 +20,36 @@ beforeEach(function () {
});

describe('Agent', function () {
describe('#setup', function () {
var error;
describe('#plan', function () {
it('should set the execution plan', function () {
var agent;

error = 'Setup argument must be a function';
agent = yakuza.agent('Scraper', 'Agent');

agent.plan([
'Task1'
]);

agent.__config.plan.should.eql(['Task1']);
});

it('should throw if argument is not an array', function () {
var error;

error = 'Agent plan must be an array of task ids';

it('should throw if argument is not a function', function () {
(function () {
yakuza.agent('Scraper', 'Agent').setup('foo');
yakuza.agent('Scraper', 'Agent').plan(123);
}).should.throw(error);

(function () {
yakuza.agent('Scraper', 'Agent').setup(['foo']);
yakuza.agent('Scraper', 'Agent').plan({foo: 'bar'});
}).should.throw(error);

(function () {
yakuza.agent('Scraper', 'Agent').setup(123);
yakuza.agent('Scraper', 'Agent').plan('foo');
}).should.throw(error);
});

it('it should add a config callback', function (done) {
yakuza.agent('Scraper', 'Agent').setup(function (config) {
config.plan = [
'Task1'
];
done();
});

yakuza.task('Scraper', 'Agent', 'Task1').main(function (task) {
task.success();
});

yakuza.ready();
});
});

describe('#task', function () {
@@ -65,13 +64,9 @@ describe('Agent', function () {
beforeEach(function () {
agent = yakuza.agent('Scraper', 'Agent');
});

it('should create an agent-level routine', function () {
agent.setup(function (config) {
config.plan = [
'Task1',
'Task2'
];
});
agent.plan(['Task1', 'Task2']);
agent.routine('OnlyOne', ['Task1']);
yakuza.job('Scraper', 'Agent').routine('OnlyOne');
});