- Node.js 8 or higher;
- NPM;
- Active internet connection.
Once you have the project, either as a .zip archive or via `git clone`, execute the following steps:
```sh
npm install

# To crawl web URLs, point to a file containing a list of links.
npm start crawler.txt

# To crawl the file system, give the path to the folder.
npm start malicious-js-folder
```
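For reference, the file passed to the crawler is simply a plain-text list of links, one per line. A minimal example, using placeholder URLs:

```
https://example.com
https://example.org/some-page.html
```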
After crawling through the data, the program saves a CSV file under the `reports` folder. If this folder does not exist, it is created automatically at runtime. The output file name is always the timestamp of the moment the report was generated, which keeps your reports ordered by creation time.
The CSV structure consists of a first column named `link`, followed by one column per function name extracted from the scraped JavaScript code.
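For illustration only, a report covering two crawled pages might look like the snippet below; the links and function names here are hypothetical.

```csv
link,eval,setTimeout,unescape
https://example.com,3,1,0
https://example.org/some-page.html,0,2,4
```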
You can customize parts of the application's behavior by editing properties in the file `crawler_options.json`. All of the editable properties and their uses are listed below.
Property | Type | Example | Description |
---|---|---|---|
`crawling_attempts` | Integer | 3 | The number of times a WebLoader instance will try to fetch a given URL. |
`function_input_filename` | String | functions.txt | The path (relative to the project's root folder) to the functions list. |
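Assuming only the two properties documented above, a `crawler_options.json` file could look like the following sketch; the values are just the examples from the table, not necessarily the project's defaults.

```json
{
  "crawling_attempts": 3,
  "function_input_filename": "functions.txt"
}
```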
A Loader is an abstract interface with a single asynchronous method: `load`. Its constructor accepts two arguments: `url` and `options`.
Parameter | Type | Description |
---|---|---|
`url` | String | Link or path to a file to be loaded. |
`options` | Object | An object containing options; this is the serialized content of `crawler_options.json`. |
How each loader handles the content to be loaded is specific to it, so each implementation of a Loader class must manage its `load` method properly. The implemented Loader classes are listed below.
Name | Description |
---|---|
WebLoader | Uses Puppeteer under the hood to fully render pages, including post-inserted scripts and data. Fetches all of the HTML and JavaScript content that is present in the given URL or loaded into it. |
FileSystemLoader | Recursively loads content from a folder, crawling for JavaScript files and reading their content. |
Every Loader's `load` method must return an array of objects, and every object must have the properties `type`, `data`, `url`, and `origin`. These are used in the processing step.
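To make the contract concrete, below is a minimal sketch of a custom Loader honoring it. The class name, the `type` and `origin` values, and the use of `fs.promises` are assumptions for illustration; they are not part of the project's actual loaders.

```js
const fs = require('fs').promises;

// Hypothetical Loader: reads a single local JavaScript file and returns
// the loaded content in the shape described above.
class SingleFileLoader {
  constructor(url, options) {
    this.url = url;         // path to a .js file
    this.options = options; // the parsed crawler_options content
  }

  async load() {
    const data = await fs.readFile(this.url, 'utf8');
    return [
      {
        type: 'js',        // assumed label for JavaScript content
        data,              // raw source code
        url: this.url,     // where the content came from
        origin: this.url,  // assumed to mirror url for local files
      },
    ];
  }
}

module.exports = SingleFileLoader;
```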
A Processor consists of an abstract interface with a single method: `process`. How each processor handles its data is specific to it, so each implementation must manage its `process` method properly. The implemented Processor classes are listed below.
Name | Description |
---|---|
HTMLProcessor | As of now, no HTML parsing is needed, since WebLoader's scraping separates every resource, and JavaScript processing is handled by the related processor. |
JSProcessor | Uses Acorn, acorn-loose and acorn-walk to parse JavaScript code and collect the functions called in the analyzed code. If the file defined in `function_input_filename` is present, it will also filter the functions, generating a smaller, targeted report. |
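As a rough illustration of the interface, a custom Processor might look like the sketch below. The class name, the exact input shape, and the return shape are assumptions; the real JSProcessor maps every called function to its call count, as described above.

```js
// Hypothetical Processor: flags how many times `eval` is invoked in a
// loaded item. Assumes `process` receives one object produced by a Loader.
class EvalCounterProcessor {
  process(item) {
    const calls = item.data.match(/\beval\s*\(/g) || [];
    return { link: item.url, eval: calls.length };
  }
}

module.exports = EvalCounterProcessor;
```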
There is also a class called `ProcessorFactory`, with a single static `create` method, which uses the Factory pattern to deliver the right processor instance for the given needs.
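A plausible sketch of such a factory is shown below, assuming `create` dispatches on the `type` property set by the loaders; the dispatch key, the module paths, and the error handling are assumptions.

```js
const HTMLProcessor = require('./html_processor'); // assumed path
const JSProcessor = require('./js_processor');     // assumed path

class ProcessorFactory {
  // Returns the Processor instance matching the loaded item's type.
  static create(type) {
    switch (type) {
      case 'html':
        return new HTMLProcessor();
      case 'js':
        return new JSProcessor();
      default:
        throw new Error(`No processor registered for type "${type}"`);
    }
  }
}

module.exports = ProcessorFactory;
```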
Aside from the classes that serve as building blocks for the application's scalability, there are also a few standalone units that are executed at specific points of the run.
- `js_parser`: exports a method called `getCalledFunctions`, which accepts a string of JavaScript code and returns a JSON object mapping each function name to the number of times it is called throughout the code (a sketch of this idea follows the list).
- `data_processing`: exports a method called `generateCsvFileContent`, which accepts an array of the processed results and returns a single string representing the structured data as CSV.
- `cli`: exports a method called `createDataLoaders`, which accepts the CLI arguments and returns an array of data loader instances, ready to be executed.
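This is not the project's actual implementation, but a hedged sketch of what `getCalledFunctions` could look like using the libraries mentioned above (acorn-loose for error-tolerant parsing, acorn-walk for traversal):

```js
const acornLoose = require('acorn-loose');
const walk = require('acorn-walk');

// Counts every call expression by callee name, tolerating broken code.
function getCalledFunctions(code) {
  const ast = acornLoose.parse(code, { ecmaVersion: 2020 });
  const counts = {};
  walk.simple(ast, {
    CallExpression(node) {
      const callee = node.callee;
      let name = null;
      if (callee.type === 'Identifier') {
        name = callee.name;          // e.g. eval(...)
      } else if (callee.type === 'MemberExpression' && callee.property.type === 'Identifier') {
        name = callee.property.name; // e.g. document.write(...)
      }
      if (name) counts[name] = (counts[name] || 0) + 1;
    },
  });
  return counts;
}

// Example: prints { alert: 2, write: 1 }
console.log(getCalledFunctions('alert(1); document.write("x"); alert(2);'));
```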
The code starts running from the file `bin.js`. From there, it executes the async `execute` method and follows a straightforward path. The flow is as follows:
1. Create the folder structure required for the project to run, if it does not exist.
2. Create the data loaders from the argument provided on the CLI.
3. Then, for each loader:
   - 3.1. Load the content;
   - 3.2. Process it;
   - 3.3. Join the processed contents and move on to the next item.
4. Generate a valid CSV string.
5. Write the report into the formatted file, with the date of execution in the filename.
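To tie the flow together, here is a hedged sketch of what the `execute` orchestration might look like. The module paths, the `reports` directory handling, and the dispatch through `ProcessorFactory` are assumptions for illustration only.

```js
const fs = require('fs');
const path = require('path');
const { createDataLoaders } = require('./src/cli');                  // assumed path
const { generateCsvFileContent } = require('./src/data_processing'); // assumed path
const ProcessorFactory = require('./src/processor_factory');         // assumed path

async function execute() {
  // 1. Create the folder structure, if not existent.
  const reportsDir = path.join(__dirname, 'reports');
  if (!fs.existsSync(reportsDir)) fs.mkdirSync(reportsDir);

  // 2. Create the data loaders from the CLI-provided argument.
  const loaders = createDataLoaders(process.argv.slice(2));

  // 3. Load, process and join the contents of each loader.
  const results = [];
  for (const loader of loaders) {
    const items = await loader.load();
    for (const item of items) {
      const processor = ProcessorFactory.create(item.type);
      results.push(processor.process(item));
    }
  }

  // 4. Generate a valid CSV string and 5. write the timestamped report.
  const csv = generateCsvFileContent(results);
  fs.writeFileSync(path.join(reportsDir, `${Date.now()}.csv`), csv);
}

execute();
```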