Archiving the Intranet, thankfully, is a task made simple using the following technologies:
- Cloud Platform
- AWS S3
- AWS CloudFront
- HTTrack CLI
- NodeJS Server
Access is granted to the snapshot if you:
- Have access to the Digital VPN (Nurved), or
- Have your IP address validated and added to our allow-list, and
- Are in possession of our basic-auth credentials
Access points:
Please get in touch with the Intranet team on Slack for further information.
Access is granted if you are in possession of the basic-auth credentials.
Access point: via Cloud Platform (dev)
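In either case, access is a standard basic-auth request; purely as an illustration (the URL below is a placeholder, not the real access point, and the credentials come from the Intranet team):

```bash
# Illustrative only: substitute the real access point URL and the
# basic-auth credentials issued by the Intranet team.
curl -u "username:password" https://<snapshot-access-point>/
```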
It's important to note that creating a snapshot of the intranet from a local machine presented resource-related issues, such as VPN timeouts and rate limiting.
Requires
- Docker
Clone to your machine:

```bash
git clone https://github.com/ministryofjustice/intranet-archive.git && cd intranet-archive
```

Start docker compose:

```bash
make run
```
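Once `make run` has brought the stack up, standard Docker Compose commands can be used to check on it; a quick sketch (the service names are whatever docker-compose.yml defines):

```bash
# list the running services and their state
docker compose ps

# follow the combined container logs (Ctrl+C to stop following)
docker compose logs -f
```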
There is a script designed to help you install the Dory Proxy, if you'd like to.
If you choose to install Dory, you can access the application here:
Otherwise, access the application here:
Let's begin with the servers and how they interact...
The Archiver has an Nginx server, which serves responses from the underlying NodeJS server; Node processes incoming form requests and decides how to treat them. Essentially, if it is happy with a request, Node instructs HTTrack to perform a website copy operation, using predefined options and a custom plugin.
At the very heart of the Archiver sits HTTrack. This application is configured by Node to take a snapshot of the MoJ Intranet. Potentially, you can point the Archiver at any website address and, using the settings for the Intranet, it will attempt to create an isolated copy of it.
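The options the Archiver actually passes are defined in the NodeJS application (see `process.js`, covered below), but a hand-run equivalent would look roughly like the sketch here; the flags, filter and output path are illustrative only:

```bash
# Illustrative only: mirror the Intranet into a dated snapshot directory.
# -O sets the output path; the trailing filter keeps HTTrack on the target domain.
httrack "https://intranet.justice.gov.uk/" \
  -O "./snapshots/intranet.justice.gov.uk/$(date +%F)" \
  "+*.intranet.justice.gov.uk/*"
```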
The output of HTTrack can be seen in Docker Compose's stdout in the running terminal window; however, a more detailed, linear output stream is available in the `hts-log.txt` file, which you can find in the root of the snapshot.
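For example, while a snapshot is running you can follow that log directly; the path below matches the one used in the kubectl example further down, with `<agency>` and `<date>` as placeholders (adjust the base path if you are running locally):

```bash
# follow HTTrack's detailed, linear log for a snapshot in progress
tail -f /archiver/snapshots/intranet.justice.gov.uk/<agency>/<date>/hts-log.txt
```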
During the build of the Archiver, we came across many challenges, two of which almost prevented our proof of concept from succeeding. The first was an inability to display images. The second was an inability to download them.
1) The HTTrack `srcset` problem

In modern browsers, the `srcset` attribute is used to render a correctly sized image for the device the image is loaded on. This helps to manage bandwidth and save the user money. The trouble is that HTTrack doesn't modify the URLs in `srcset` attributes, so instead we get no images wherever the attribute is used.
Using `srcset` in the Archive bears little value, so to fix this we decided to remove `srcset` completely. To do that, we use HTTrack's `-V` option, which allows us to execute a command on every file that is downloaded. In particular, we run the following `sed` command, where `$0` is the file reference in HTTrack.
```bash
# find all occurrences of srcset in the file referenced by $0
# select and remove, including contents.
sed -i 's/srcset="[^"]*"//g' $0
```
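Wired back into HTTrack, the same command is passed via `-V`, with HTTrack substituting each downloaded file's path for `$0`; roughly (the URL and output path are illustrative, and the quoting shown is just one way of escaping it):

```bash
# Illustrative only: run the srcset-stripping sed over every file HTTrack saves.
# The escaped \$0 is expanded by HTTrack, not by the shell.
httrack "https://intranet.justice.gov.uk/" \
  -O ./snapshot \
  -V "sed -i 's/srcset=\"[^\"]*\"//g' \$0"
```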
2) The HTTrack persisted Query String problem
During normal operation of the Intranet, CDN resources are collected using a signed URL. We use AWS signatures to authorise access to S3 objects; however, we only allow 30 minutes for each request. We discovered that HTTrack would gather URLs, mark them as `.tmp` and then pop them in a queue, ready for collection at a later time.

As you can imagine, this method of operation causes problems as soon as that delay exceeds 30 minutes. In fact, a variation of this issue caused more than 17,000 forbidden errors and prevented more than 6GB of data from being collected in our attempts to take a snapshot of HMCTS.

When the issue was investigated further, we confirmed that the Archiver had been granted full access to the CDN, so there really shouldn't have been any trouble grabbing objects. However, we discovered that if the query string signature is left in the URL then, even though we are authorised, the request fails as 401 unauthorized.
Because HTTrack has very limited options around query string manipulation, we were presented with two possibilities for fixing this problem:
- Give CDN objects an extended life
- Use HTTrack's plugin system to add functionality
Yes, you guessed it! We used HTTrack's plugin system. Extending the expiry time would be nowhere near ideal, for two reasons: 1) we would never really know what length of time would be optimal, and 2) we would be modifying another system to make ours work... that should set alarm bells ringing in any software developer's head!
As defined by HTTrack, our new plugin needed to be written in the C programming language, compiled using `gcc` and converted to a `.so`, ready for HTTrack to load and launch on every URL. Simples.
The plugin can be viewed in `/conf/httrack/`.
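For reference, producing a shared object for HTTrack to load is the standard `gcc` route; this is a sketch with hypothetical file names, and the real source and build steps live in `/conf/httrack/` and the Makefile:

```bash
# Compile the plugin source into a position-independent shared object.
# File names here are hypothetical; see /conf/httrack/ for the real ones.
gcc -fPIC -shared -o plugin.so plugin.c
```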
We identified one query string parameter that needed to be removed for the request to validate correctly. Once this was coded in, all requests were successful and the 401 unauthorized errors were gone.
All processing for HTTrack is managed in the `process.js` file, located in the NodeJS application; there you will find all the options used to set HTTrack up.
To understand the build process further, please look at the Makefile.
The Archiver currently runs an unprivileged image on Cloud Platform that isn't yet configured to run CRON jobs. This means that an automated script to synchronise a snapshot with the configured S3 bucket fails to run. To get the content from our application to S3, the `s3sync` command needs to be executed on the pod that is running the snapshot process.
It may be possible to interact with running pods with help from this cheatsheet. Please be aware that with every call to the CP k8s cluster, you will need to provide the namespace, as shown below:
```bash
kubectl -n intranet-archive-dev
```
Kubernetes
```bash
# list available pods for the namespace
kubectl -n intranet-archive-dev get pods

# copy a log file from a pod to your local machine
# update pod-id, agency and date
kubectl -n intranet-archive-dev cp intranet-archive-dev-<pod-id>:/archiver/snapshots/intranet.justice.gov.uk/<agency>/<date>/hts-log.txt ~/hts-log.txt
```
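Following the same pattern, the `s3sync` step mentioned above can be run by exec-ing onto the pod; a sketch, assuming the command is available on the container's PATH (check the image and Makefile for the exact invocation):

```bash
# run the S3 synchronisation on the pod holding the snapshot
kubectl -n intranet-archive-dev exec -it intranet-archive-dev-<pod-id> -- s3sync
```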
Make