HTMLSpitter

Lightweight Docker image with NodeJS server to spit out HTML from loaded JS using Puppeteer and Chrome

Medium story: HTML from the Javascript world

| Image size | RAM usage |
| --- | --- |
| 558MB | 110MB+ |

The program is written in TypeScript for NodeJS and lives in the src directory.

Description

Runs a NodeJS server accepting HTTP requests with two URL parameters:

- `url`: the URL to prerender into HTML
- `wait`: the optional load event to wait for before the prerendering stops. It can be:
  - `load` (wait for the load event)
  - `domcontentloaded` (wait for the DOMContentLoaded event)
  - `networkidle0` (default, wait until there are no network connections for at least 500 ms)
  - `networkidle2` (wait until there are no more than 2 network connections for at least 500 ms)

For example:

http://localhost:8000/?url=https://github.com/qdm12/htmlspitter
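When the target URL carries its own query string, it is safer to percent-encode it before placing it in the `url` parameter. The sketch below is only an illustration: `buildRequestUrl` and `WaitUntil` are hypothetical names that are not part of HTMLSpitter, and it assumes the server decodes percent-encoded query values the way standard query parsers do.

```typescript
// The four values listed above for the `wait` parameter.
type WaitUntil = "load" | "domcontentloaded" | "networkidle0" | "networkidle2";

// Hypothetical helper (not part of HTMLSpitter): builds the request URL for
// a server listening at `server`, for example http://localhost:8000.
function buildRequestUrl(server: string, target: string, wait?: WaitUntil): string {
  // Percent-encode the target so that its own "?" and "&" are not read as
  // parameters of the HTMLSpitter request.
  let query = `url=${encodeURIComponent(target)}`;
  if (wait !== undefined) {
    query += `&wait=${wait}`;
  }
  // Drop trailing slashes from the server address before appending the query.
  return `${server.replace(/\/+$/, "")}/?${query}`;
}

const requestUrl = buildRequestUrl(
  "http://localhost:8000",
  "https://example.com/?a=1&b=2",
  "load",
);
console.log(requestUrl);
```

Omitting the third argument leaves `wait` out of the request, so the server falls back to its default, `networkidle0`.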

- The server scales up Chromium instances if needed.
- It limits the number of open pages per instance to prevent one page from crashing all the other pages.
- It has a 1 hour cache for loaded HTML.
- It has a queue system for requests once the maximum number of pages/Chromium instances is reached.
- It only runs on amd64, because it requires Chrome-Beta, which is only built for amd64 for now.
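The scaling and queueing behaviour listed above can be pictured as one decision per incoming request. The following is a rough model and not the actual scheduler: the names `Limits`, `Decision` and `decide` are made up for illustration, picking the least loaded instance is an assumption, and it models a single shared queue although `MAX_QUEUE_SIZE` is documented per instance.

```typescript
// Illustrative only: a pure model of the decision, no Chromium involved.
interface Limits {
  maxPages: number;     // open pages per instance, -1 for no max
  maxBrowsers: number;  // Chromium instances, -1 for no max
  maxQueueSize: number; // waiting requests, -1 for no max
}

type Decision =
  | { kind: "reuse"; browser: number } // open the page in an existing instance
  | { kind: "launch" }                 // start a new Chromium instance
  | { kind: "queue" }                  // wait for a free page slot
  | { kind: "reject" };                // queue is full

const unlimited = (max: number): boolean => max === -1;

// `openPages[i]` is the number of pages currently open in instance i.
function decide(openPages: number[], queued: number, limits: Limits): Decision {
  // 1. Prefer the least loaded instance that still has a free page slot.
  let best = -1;
  let bestCount = Infinity;
  openPages.forEach((count, i) => {
    const hasRoom = unlimited(limits.maxPages) || count < limits.maxPages;
    if (hasRoom && count < bestCount) {
      best = i;
      bestCount = count;
    }
  });
  if (best !== -1) {
    return { kind: "reuse", browser: best };
  }
  // 2. Otherwise scale up, as long as another instance is allowed.
  if (unlimited(limits.maxBrowsers) || openPages.length < limits.maxBrowsers) {
    return { kind: "launch" };
  }
  // 3. Otherwise queue the request, or reject it when the queue is full.
  if (unlimited(limits.maxQueueSize) || queued < limits.maxQueueSize) {
    return { kind: "queue" };
  }
  return { kind: "reject" };
}

const defaults: Limits = { maxPages: 10, maxBrowsers: 10, maxQueueSize: 100 };
console.log(decide([10, 3], 0, defaults));
```

Keeping the decision separate from the Puppeteer calls makes the limits easy to test without launching a browser.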

Usage

Run the container

docker run -it --rm --init -p 8000:8000 qmcgaw/htmlspitter

You can also use docker-compose.yml.

Environment variables

| Name | Default | Possible values | Description |
| --- | --- | --- | --- |
| `MAX_PAGES` | `10` | `-1` or integer larger than `0` | Max number of pages per Chromium instance at any time, `-1` for no max |
| `MAX_HITS` | `300` | `-1` or integer larger than `0` | Max number of pages opened per Chromium instance during its lifetime (before relaunch), `-1` for no max |
| `MAX_AGE_UNUSED` | `60` | `-1` or integer larger than `0` | Max age in seconds of inactivity before the browser is closed, `-1` for no max |
| `MAX_BROWSERS` | `10` | `-1` or integer larger than `0` | Max number of Chromium instances at any time, `-1` for no max |
| `MAX_CACHE_SIZE` | `10` | `-1` or integer larger than `0` | Max number of MB stored in the cache, `-1` for no max |
| `MAX_QUEUE_SIZE` | `100` | `-1` or integer larger than `0` | Max size of queue of pages per Chromium instance, `-1` for no max |
| `LOG` | `normal` | `normal` or `json` | Format to use to print logs |
| `TIMEOUT` | `15000` | `-1` or integer larger than `0` | Timeout in ms to load a page, `-1` for no timeout |
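As a rough illustration of the "`-1` or integer larger than `0`" rule shared by the numeric variables, the sketch below reads a limit with its default. `readLimit` is a hypothetical helper, and how HTMLSpitter itself validates these values is not described here, so treat the error handling as an assumption.

```typescript
// Hypothetical helper: reads one numeric limit following the rule of the
// table above, falling back to the default when the variable is unset.
function readLimit(
  env: Record<string, string | undefined>,
  name: string,
  fallback: number,
): number {
  const raw = env[name];
  if (raw === undefined || raw === "") {
    return fallback;
  }
  const value = Number(raw);
  // Accept -1 (no max) or a strictly positive integer, reject anything else.
  if (!Number.isInteger(value) || (value !== -1 && value <= 0)) {
    throw new Error(`${name} must be -1 or an integer larger than 0, got "${raw}"`);
  }
  return value;
}

// Example environment; in a NodeJS program this would be process.env.
const env: Record<string, string | undefined> = { MAX_PAGES: "5", TIMEOUT: "-1" };

const config = {
  maxPages: readLimit(env, "MAX_PAGES", 10),
  maxHits: readLimit(env, "MAX_HITS", 300),
  timeout: readLimit(env, "TIMEOUT", 15000),
};
console.log(config);
```

Failing fast on an invalid value is a design choice of this sketch; silently falling back to the default would be an equally plausible behaviour.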

Troubleshooting

Chrome fails to launch

If you obtain the error:

{"error":"Error: Failed to launch chrome!\nFailed to move to new namespace: PID namespaces supported, Network namespace supported, but failed: errno = Operation not permitted\n\n\nTROUBLESHOOTING: https://github.com/GoogleChrome/puppeteer/blob/master/docs/troubleshooting.md\n"}

Then you might need to run the container with the chrome.json seccomp profile from this repository:

wget https://raw.githubusercontent.com/qdm12/htmlspitter/master/chrome.json
docker run -it --rm --init --security-opt seccomp=$(pwd)/chrome.json -p 8000:8000 qmcgaw/htmlspitter

Details

Program

- A built-in local memory cache holds the HTML content obtained during the last hour and is limited in the number of characters it stores.
- A built-in pool creates and removes Chromium instances according to the server load.
- Each Chromium instance has a limited number of pages so that if one page crashes Chromium, not all page loads are lost.
- As Chromium caches content, each instance is destroyed and re-created once it reaches a certain number of page loads.
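The memory cache described at the start of this section could look roughly like the sketch below. It is not the code from the src directory: `HtmlCache` is a hypothetical class, the limit is counted in characters rather than converted from the MB value of `MAX_CACHE_SIZE`, and evicting the oldest entry first is an assumption.

```typescript
// Illustrative cache: entries expire after `ttlMs` and the total number of
// stored characters is capped by evicting the oldest entries first.
class HtmlCache {
  private entries = new Map<string, { html: string; storedAt: number }>();
  private size = 0; // total characters currently stored
  private maxChars: number;
  private ttlMs: number;

  constructor(maxChars: number, ttlMs: number = 60 * 60 * 1000) {
    this.maxChars = maxChars;
    this.ttlMs = ttlMs;
  }

  // Returns the cached HTML, or undefined when missing or expired.
  get(url: string, now: number = Date.now()): string | undefined {
    const entry = this.entries.get(url);
    if (entry === undefined) {
      return undefined;
    }
    if (now - entry.storedAt >= this.ttlMs) {
      this.remove(url);
      return undefined;
    }
    return entry.html;
  }

  set(url: string, html: string, now: number = Date.now()): void {
    this.remove(url); // replace any previous entry for this URL
    if (html.length > this.maxChars) {
      return; // too large to ever fit, so it is not cached
    }
    // A Map keeps insertion order, so its first key is the oldest entry.
    while (this.size + html.length > this.maxChars) {
      const oldest = this.entries.keys().next().value;
      if (oldest === undefined) {
        break;
      }
      this.remove(oldest);
    }
    this.entries.set(url, { html, storedAt: now });
    this.size += html.length;
  }

  private remove(url: string): void {
    const entry = this.entries.get(url);
    if (entry !== undefined) {
      this.size -= entry.html.length;
      this.entries.delete(url);
    }
  }
}

const cache = new HtmlCache(1000000);
cache.set("https://example.com", "<html></html>");
console.log(cache.get("https://example.com"));
```

Passing `now` explicitly is only there to make expiry testable; regular callers can rely on the `Date.now()` default.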

Docker

- chrome.json may be required depending on your host OS.
- The --init flag prevents zombie Chromium processes from lingering when the container stops the main NodeJS program.
- A built-in healthcheck is implemented by running node build/healthcheck.js against a running instance.

Performance considerations

- Chromium is written in C++ and is multi-threaded, so it scales well with more CPU cores.
- The NodeJS program should not be the bottleneck because all the work is done by Chromium.
- The bottleneck will be the CPU and especially the RAM used by the Chromium instance(s).
- You can scale up by running the program on multiple machines behind a load balancer.

Development

- Either use the Docker container development image with Visual Studio Code and the remote development extension
- Or install Node and NPM on your machine

# Install all dependencies
npm i
# Transpile the TypeScript code to JavaScript and run build/main.js
npm run start

Test it with, for example:

wget -qO- http://localhost:8000/?url=https://github.com/qdm12/htmlspitter

You can also:

Run tests
```
npm t
```
Run the server with hot reload (performs npm run start on each .ts change)
```
npx nodemon
```

Build Docker

docker build -t qmcgaw/htmlspitter .

You can also choose the Google Chrome branch to build with: beta (default), stable or unstable

docker build -t qmcgaw/htmlspitter --build-arg GOOGLE_CHROME_BRANCH=unstable .

There are two environment variables you might find useful:
- PORT, which sets the HTTP server listening port
- CHROME_BIN, which is either the path to the Chrome binary or the value Puppeteer-bundled
