section | x-masysma-name | title | date | lang | author | keywords | x-masysma-version | x-masysma-website | x-masysma-repository | x-masysma-owned | x-masysma-copyright | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
32 |
maxbupst |
Ma_Sys.ma Bupstash Extractor |
2023/06/13 22:48:05 |
en-US |
|
|
1.0.0 |
1 |
(c) 2022, 2023 Ma_Sys.ma <[email protected]>. |
The Ma_Sys.ma Bupstash Extractor (maxbupst
) is an application that can read
and decode the data from Bupstash (https://bupstash.io) Repositories as
created with a supported Bupstash version.
In backup_tests_borg_bupstash_kopia(37) multiple modern backup programs were tested as potential replacements for jmbb(32). There, the conclusion was that Bupstash is a most viable replacement for JMBB.
One problem of the modern backup tools is that due to their advanced features like encryption and deduplication, they tend to store data in their own proprietary formats that no other tool can read. For tools using fast-paced development or new programming languages (like e.g. Rust) it can be challenging to get them to compile as often new compilers and online dependency downloads are required.
This makes the old problem of not having the restoration software available in the time of need even more critical since while the software itself may be available, some of its dependencies or an adequate compiler may not.
Additionally, given the rich feature set that supports multiple backups from different machines and potentially uses different encryption keys for various parts of the backups, the modern tools come with a high amount of inherent complexity.
JMBB, which is less modern a tool, has a design that tries to mitigate these risks by being based on formats that can be decoded by combining multiple third-party tools (aescrypt, cpio, xz) for restoring the backup contents although the restoration process may be slow and slightly off (in that restored data can contain files that were deleted in a recent backup).
For using any of the more modern alternatives like Bupstash it seems it would be best if there were multiple ways to restore a backup, too. To achieve this, this repository provides an alternative restoration implementation in a different programming language (Ada instead of Rust) and using different libraries except for the decryption where it currently uses libsodium just like Bupstash.
Just like with the JMBB “emergency” restoration using standard tools, this implementation does not aim at being as good as the original. Instead, it is intended to serve as an alternative for the sake of having such that may come at degraded performance and with a hugely limited set of features.
The Bupstash data format consists of versioned data structures. This implementation does not support all of the data structure revisions. Instead it focuses on the structures that were used by specific versions of Bupstash that were used productively by the Ma_Sys.ma i.e. some arbitrary set of versions is supported. See the table under Supported Versions for details. The idea is that if you only ever switch from one listed Bupstash versions to another one then the resulting backup data structures are restorable by the newest Maxbupstash revision.
In an ideal world, this implementation would have been created entirely indepdendently from Bupstash without looking at its implementation, because this might greatly increase resilience in that truly different implementations are unlikely to contain the same bugs, making a successful data restoration more likely. Since the existing online documentation about Bupstash's data and crypto structures are not comprehensive specifications, this was unfortunately not feasible. Instead, the implementation closely follows Bupstash's in many places. This makes it likely that structural bugs that exist in the original Bupstash are present in this implementation, too.
If you are interested in doing a proper “clean room” implementation of a tool to be compatible with the Bupstash data format, do not hesitate to contact me. I might be able to assist with the specification part.
Bupstash Versions Maxbupstash Versions
0.10.3 1.1.1
Ma_Sys.ma Bupstash Extractor
(c) 2022, 2023 Ma_Sys.ma <[email protected]>
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
Here is an overview about the Ma_Sys.ma-supplied dependencies' licenses:
Dependency License
lz4_ada(32) Expat (“MIT”) blake3_ada(32) CC0 tar_ada(32) GPL 3+
Why this chaos you might ask? The idea behind the licenses for LZ4 and Blake3 is to align them with other important projects from the respective ecosystems, e.g. the LZ4 Specification or the Blake3 reference implementation as not to needlessly restrict the license further than the original projects do.
For the Tar implementation there was no real reference implementation although
it is inspired by tar.rs
(cf. https://docs.rs/tar/latest/tar/) and
Bupstash's xtar.rs
hence the Ma_Sys.ma default “GPL v3 or later” is used. As
for maxbupst itself: Since it is not a library but rather an application
program, the GPL should be less of a problem for practical use and hence the
Ma_Sys.ma default GPL v3+ was chosen over Bupstash's more permissive Expat
license. If this is an issue for you, feel free to contact me about it and
explain whatever difficulty you see regarding the licensing.
Some of Bupstash's required dependencies were found to not have any Ada equivalent readily available. Specifically, the following programs seemed not to be available in the Ada world: LZ4, Blake3 and TAR Archive creation.
To provide these features, dedicated separate libraries were thus developed as part of the Maxbupst development. Their pages are here:
Additionally, the external dependency on libsodium is required, i.e. on
Debian systems this is package libsodium-dev
.
There are multiple ways to go about compiling this program depending on the intended mode of deployment. Please refer to the following subsections for details.
The primary intended use case is to build all of the libraries as separate
Debian packages, install them on the running Debian system and then compile
maxbupst and also install it as a Debian package. If the necessary dependencies
like ant
, gnat-12
and devscripts
are installed, this can be achieved by
running
ant package
in all of the dependencies' individual directories, then installing all of the
resulting packages like e.g. with apt install ./...deb
and then compiling
maxbupst
with the same command in the repository checkout:
ant package
The resulting package can then be installed and the maxbupst
command becomes
available.
In order to compile this package and obtain an executable that does not have any external dependencies except for the system's libsodium, use the following target:
ant build-rogue
This automatically downloads the required Ma_Sys.ma dependencies next to the
current repository checkout and statically links all of them into the single
maxbupst
binary output.
The following sequence of commands is expected to produce a YAML output equal to
the one found in testdata/small-0.10.3-expected.yml
:
root="$(pwd)"
export BUPSTASH_KEY="$root/testdata/maxbupst-testkey.key"
export BUPSTASH_REPOSITORY="$root/testdata/small-0.10.3"
ulimit -s unlimited
maxbupst -l
Compilation on Windows is a little bit involved because installing the dependencies requires a lot of manual actions. Also, in order to setup GNAT on Windows, the currently recommended course of operation is to use Alire which means that in order to make use of this installation method, Alire itself needs to be setup first. The following steps give a rough guide that worked for me:
- Install chocolatey if not already installed in an administrative powershell.
Set-ExecutionPolicy Bypass -Scope Process -Force; [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072; iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))
Details: command from https://chocolatey.org/install “individual” variant. Test by runningchoco
. It should print out a version. - In the same shell, install git
choco install git
. The tool becomes available in newly created shells afterwards. - Install ant and its dependencies:
choco install microsoft-openjdk
and thenchoco install ant
Further info: https://community.chocolatey.org/packages/ant, https://mkyong.com/ant/how-to-install-apache-ant-on-windows/ - Switch to a user shell in an empty directory
- Clone the repository:
git clone https://github.com/m7a/bo-maxbupst
- Get GNAT via Alire
https://github.com/alire-project/alire/releases/
Download e.g.
alr-1.2.2-bin-x86_64-windows.zip
. Copyalr.exe
tobo-maxbupst
. - Switch directory:
cd bo-maxbupst
- Run Alire to install the GNAT compiler. Select not to install msys2.
.\alr toolchain --select
. Chosegnat_native
and a recent gprbuild. Try outgnatmake
to check if the command is available. - Add Alire's toolchain to your PATH
${env:PATH} = "${env:PATH};${env:USERPROFILE}\.config\alire\cache\dependencies\gnat_native_12.2.1_c210a022\bin"
(adjust version to your installation) - Get Libsodium https://download.libsodium.org/libsodium/releases/
Download the file with suffix
-msvc.zip
and extract thelibsodium.dll
for your architecture. Copy it tobo-maxbupst
.
Now you should be ready to compile the “rogue” variant (others are not supported on Windows):
ant build-rogue
The files that are necessary for Maxbupst to run are then: libsodium.dll
and
maxbupst.exe
.
In order to test the functioning of the build, create a new directory test
outside the repository directory and copy the files libsodium.dll
and
maxbupst.exe
there.
Then use a regular cmd.exe
(not a powershell because that one mangles the
binary output and hence the result .tar will not be extractable) and run e.g.
the following commands:
set ROOT=%CD%\..\bo-maxbupst
set BUPSTASH_KEY=%ROOT%\testdata\maxbupst-testkey.key
set BUPSTASH_REPOSITORY=%ROOT%\testdata\small-0.10.3
maxbupst -l
maxbupst -g -i 81e33ca1db1e6d562bc7146ddd9b37ab | tar -xf -
Note that it is really tar -xf -
because tar -x
gives an error on Windows :)
maxbupst
-- Ma_Sys.ma Bupstash Extractor
maxbupst -l|list [-k KEY] [-r REPO]
maxbupst -g|get [-k KEY] [-r REPO] -i ID
Read from a Bupstash repository and either display the list of items (-l
) or
restore the data from a specific item (-g
.
Key and Repository locations can be passed either through the options
-k
and -r
or through the environment variables BUPSTASH_KEY
or
BUPSTASH_REPOSITORY
.
-l
List mode. List the repository's items as YAML-like output.
-g
Get mode. Extracts the specified item ID and prints it to stdout.
-k
Specify Bupstash private key file to use.
-r
Specify Bupstash repository root directory to use.
-i
Specify item ID to restore. Identify this item by using -l
before.
BUPSTASH_KEY
: Alternative way to specify the key file to use. Equivalent to option -k
.
BUPSTASH_REPOSITORY
: Alternative way to specify the repository location. Equivalent to option
-r
.
Any output data produced is written to stdout. For -g
it is often advisable to
pipe the output through a tar -x
command in order to extract the retrieved
data.
When single data items are extracted, their output is returned directly and may thus require different processing.
maxbupst -l -k testdata/maxbupst-testkey.key -r testdata/small-0.10.3
The order of items as retrieved by -l
is file-system dependent and does not
properly reflect the order of creation.
Testing Backup software correctly is not all that easy. Even for a mere “restore” the only real test is: Does it restore the real production data of interest. Tests with production data are always difficult to conduct for multiple reasons: Data under consideration may be large, confidential and a restore must not interfere with the productive systems' operations.
Hence the testing for maxbupst is threefold:
- A simple restore test can be found in script
test_with_testdata.sh
. It really just tries to restore known files from a tiny repository supplied as part of the maxbupst source code. After compilingmaxbupst
and having amaxbupst
binary in the repository's directory, it can be run as-is without any additional configuration being required. - A synthetic test based on Bupstash's
cli-tests
test suite. This one is nice because it makes use of Bupstash's own test cases although it cannot run all of them. It works by downloading the test definitions from the Bupstash repository, editing them to replacebupstash get
instances withmaxbupst
and then running a sensible subset of the tests. If you usedant
to compilemaxbupst
, the necessary prerequisites for the script to work may already be present. Additionally, packagebats
must be installed for the script to be able to run the Bupstash tests. The script does not take any parameters and can be run as-is then. - The test with productive data. This one cannot run without additional
configuration by the user. Also, given how long it can take to complete its
invocation is organized in multiple stages such that one can perform
one stage after each other or independent stages in parallel even. To avoid
messing with the production data, it is intended that they are copied to
a working directory for the tests. These tests are contained in
subdirectory
test_with_production_data
and implemented in a GNU Makefile. Runmake -C test_with_production_data help
to display information about the variables and targets available. The broad idea is that you provide anenv.mk
in that directory to setMAXBT_PROD_SOURCE_DATA
and your repository access data. Then you edit thes1_update_backup.txt
target to your needs and finally runmake -C test_with_production_data -j all
to perform the end-to-end test. As additional dependencies, this test requiresdocker
, GNUmake
, GNUtar
anddiff
.
-- from bupstash_key.ads
type Key is tagged limited record
ID: Bupstash_Types.XID;
Rollsum_Key: String(1 .. Random_Seed_Bytes);
Data_Hash_Key_Part_1: Bupstash_Types.Partial_Hash_Key;
Data_Hash_Key_Part_2: Bupstash_Types.Partial_Hash_Key;
Data_PK: Bupstash_Types.PK;
Data_SK: Bupstash_Types.SK;
Data_PSK: Bupstash_Types.PSK;
Idx_Hash_Key_Part_1: Bupstash_Types.Partial_Hash_Key;
Idx_Hash_Key_Part_2: Bupstash_Types.Partial_Hash_Key;
Idx_PK: Bupstash_Types.PK;
Idx_SK: Bupstash_Types.SK;
Idx_PSK: Bupstash_Types.PSK;
Metadata_PK: Bupstash_Types.PK;
Metadata_SK: Bupstash_Types.SK;
Metadata_PSK: Bupstash_Types.PSK;
end record;
ID
uniquely identifies this key, Rollsum_Key
is used for deduplication
during backup creation and not needed for data restoration. The remainder of
the structure's entries are used for restoration and explained in the following.
Note that this is my understanding as a reader of the source code rather than the inventor of Bupstash. Feel free to point out any parts where I understood the hierarchy wrongly.
Different keys are used to encrypt different parts of the repository as follows:
Metadata (Items) Index Data
+----------+ +-------------------+ +-----------------------+
| Backup 1 |------>| hello.txt size 12 |----->| Hello world.#!/bin/sh |
+----------+ | test.sh size 24 | | -eu.echo Test. |
+-------------------+ +-----------------------+
Schematic visualization of an example repository with a single item
containing two files.
Metadata Keys
: At the high-level, Bupstash repositories contain any number of items.
Metadata stores information about such an item. It contains a number
of tags, the date of backup creation and the addresses and sizes of the
associated index data (if any) and the associated data (always present).
The keys prefixed Metadata_
are used for protecting the private (“secret”)
part of the metadata.
Index Keys
: If an item contains multiple files (i.e. is not just a data stream that
was added as-is to a bupstash repository) then the index stores information
about each of the contained files. Among other file metadata this includes
file paths and file sizes. The keys prefixed Idx_
are used to protect
this data.
Data Keys
: Data is the actual backup contents. For restoration purposes, it can be
thought of as an opaque stream of bytes. If multiple files are contained
within the backup, the index is used to associate suitable chunks of the
data stream to the individual files. The keys prefixed Data_
are used
to protect the backup contents.
The use of multiple keys for the different repository contents seems sensible. It allows sub-keys to be created to e.g. only access the backup metadata without having to be able to decrypt the index and data contents. The use of separate keys for index and data ensures that adversaries cannot attack the system by exchanging index and data contents.
For each of the parts, PK, SK and PSK keys are stored.
- PK (“Public Key”)
- SK (“Secret Key”)
- PSK (“Pre-Shared Key”)
The idea behind the PSK is that it is a symmetric key that is considered to be between the public and the secret key in terms of secrecy: Unlike the public key it is not stored together with the data, but unlike the secret key it is provided in key files where SK is missing. For restoration purposes, SK and PSK can both be considered required secret key inputs.
At the low level, all data chunks are encrypted using libsodium's cryptobox
functionality. At the lowest level, the following two API calls are used for
decryption (zsodiumbinding.ads
):
crypto_box_curve25519xchacha20poly1305_beforenm
: This function computes a “shared key” from public and secret keys that is used for decryption. This is intended to be used as an input toopen_easy_afternm
afterwards (bupstash does something peculiar here, though -- read on). I subsequently call thiscryptobox-beforenm
.crypto_box_curve25519xchacha20poly1305_open_easy_afternm
: This function decrypts an encrypted message by providing the “shared key”, the ciphertext and a nonce that is typically stored together with the ciphertext. I subsequently call thiscryptobox-open
.
Raw ciphertext data in Bupstash consists of the following parts:
Ciphertext := Nonce || Cryptobox Ciphertext || PK
The decryption key (“box key”, BK) for the contained cryptobox ciphertext is computed as follows and from that, the plaintext:
BK := BLAKE3(Key=PSK, Data=cryptobox-beforenm(PK, SK))
Plaintext := cryptobox-open(Key=BK, Nonce, Data=Cryptobox Ciphertext)
This encryption is used at the level of chunks with the chunks being managed by a data and index tree for data and index contents respectively. In order to check that the content addressable storage is indeed addressed correctly, the addresses inside the trees are checked as follows using a Hash Key (HK):
HK := BLAKE3(Key Part 1 || Key Part 2)
Computed Address := BLAKE3(Key=HK, Data=Plaintext)
It is then asserted that the computed address corresponds to the address specified in the tree. The concatenation of all decrypted plaintext chunks then forms the contents of the index and data respectively.
Graphically, this scheme can be drawn as follows with HKP1 and HKP2 serving as a shorthand notation for Key Part 1 and Key Part 2 of the Hash Key:
Chunk Key
+-------+----------------------+----+ +----+-----+------+------+
| Nonce | Cryptobox Ciphertext | PK | | SK | PSK | HKP1 | HKP2 |
+-------+----------------------+----+ +----+-----+------+------+
| | | | | | |
| | v v | +- concat -+
| | +--------------+ | |
| | | cryptobox- | | v
| | | beforenm | | +--------------+
| | +--------------+ | | Blake 3 Hash |
| | | | +--------------+
| | | | |
| | | data | key |
| | v input v input |
| | +--------------------+ |
| | | Blake 3 keyed | |
| | | hash function | |
| | +--------------------+ |
| | | BK |
| | | |
| nonce | ciphertext | key |
v input v input v input |
+---------------------------------------------------+ |
| cryptobox-open | |
+---------------------------------------------------+ |
| plaintext | key
| output v input
| data input +---------------+
+---------------------------------------------->| Blake 3 keyed |
| | hash function |
| +---------------+
| |
v v
+---------------+ +---------------+
| Plaintext to | | computed |
| use | | chunk address |
+---------------+ +---------------+
This section contains some notes about what I learned from reading the Bupstash source code. It focuses on the data structures and is completed by a diagram which shows the implementation maxbupst at a high-level glance.
To associate file contents and metadata, bupstash does the following:
- It creates two separate HTrees and iterates over both of them independently: One for metadata and one for the actual file contents.
- To restore, it iterates over the metatadata tree which consists of a stream of records.
- For each record, it reads out the size of the respective item.
- If size is nonzero, it reads size bytes from the data tree and produces them as output.
- Now the data tree cursor points to data from the next entry with data such that the restore can continue by going to the next metadata item.
- Additional data fields allow for restoration of a subset of files. I did not check this in more detail because maxbupst only implements the “full” restore functionality.
From a restorer point of view, Bupstash stores its data as a “stream”. Instead of having one large file where data can be appended, a tree structure is used to represent the “stream”.
When there is only one backup containing multiple files then there is typically two streams: One for metadata and one for data (see above).
The storage only contains encrypted data and tree metadata. As a result, the HTree can be traversed without being decrypted. The actual (encrypted) data is contained in the leaf nodes whereas the other nodes contain unencrypted metadata about which other tree nodes belong to the same subtree.
Bupstash's storage is “content-addressable”, i.e. the ID of a tree node is actually the BLAKE3 hash over the concatenation of its (file) contents.
As a result, an entire stream can be identified by its ID. Such a tree can basically be traversed as follows given the root node ID:
- Read the file with file name ID
- If this is already a leaf node, emit its contents as data.
- If this is not a leaf node, decode the contained IDs in this node and continue by traversing each of the contained nodes in order
To determine which nodes are leaf nodes and which not, additional metadata is required. This metadata is stored by bupstash as part of the backup metadata and not contained in any of the two (metadata and data) trees.
Since there is no obvious relation between the tree nodes and the files in the backup, this structure is not expected to leak any sensitive information. In order to ensure that the trees are not tampered with, all content addresses must be validated by independently computing the hash over their contents and comparing the result with the ID they were found under.
In order to save memory it makes sense to not read the entire tree into RAM. Rather, the list of leaf nodes is constructed while processing such that there is always only a few nodes loaded into RAM rather say the entire the backup contents. The Maxbupst implementation simplifies this a little in that it reads all of the leaf nodes' addresses into RAM and holds them there while restoring the backup. While this is a waste of memory, it is also the easiest variant that could be implemented.
The following diagram shows the dependencies between the maxbupst Ada packages. An arrow A -> B defines a dependency of type “A knows B”. External library components like Tar, Blake3 and LZ4 are shown for completeness despite not being contained in the maxbupst source tree.
┌─────┐
│ LZ4 │
└──▲──┘
│
┌───────┴─────┐
│ Compression │
└───────▲─────┘
┌───────┐ │
┌───────────────────▶│ Serde │ ┌────────┐ │
│ ┌───────▶│ │ ╔═╡ Crypto ╞═╪══════════╗
│ │ └───▲▲──┘ ║ └────────┘ │ ║
│ │ ││ ┌────────┐ ║ │ ║
│ ┌────┐ │ ││ │ Blake3 │◀──╫──────┐ │ ║
│ ╔═╡ FS ╞══╪══════╗ ││ └───▲─▲──┘ ║ │ │ ║
│ ║ └────┘ │ ║ ││ │ │ ║ │ │ ║
│ ║ │ ║ ││ │ │ ║ │ │ ZSodium ║
│ ║ │ ║ │└───────┼┐│ ║ │ │ ▲ ║
┌─────┐ │ ║ │ ║ └───────┐│││ ║ │ │ │ ║
│ Tar │ │ ║ Index ◀──╫────────────┐││││ ║ │ │ │ ║
└──▲──┘ │ ║ ▲ ║ │││││ ║ │ │ │ ║
│ │ ║ │ ║ │││││ ║ Decryption ─────┘ ║
└─────┼─╫─────── XTar ◀──╫────────────┤││││ ║ ▲▲ ║
│ ║ ║ │││││ ║ ││ ║
│ ╚════════════════╝ │││││ ╚══════╪╞═══════════════╝
│ │││││ ││
│ │││││ ││
│ │││└┼─────────┐ ││
│ ┌──────┐ │││ └─────┐ │ ││
│ ╔═╡ Tree ╞════════════════════╪╪╪═════╗ │ │ ││
│ ║ └──────┘ │││ ║ │ │ ││
│ ║ ┌───────▶ HTree_LL ◀──┐│││ ║ │ │ ││
│ ║ │ ▲ ││││ ║ │ │ ││
│ ║ HTree_Iter │ ││││ ║ │ │ ││
│ ║ ▲ │ ││││ ║ │ │ ││
│ ║ └────────────┼───── Restorer ──╫─┼───┼───┘│
│ ║ │ ▲ ║ │ │ │
│ ║ │ │ ║ │ │ │
│ ╚═══════════════════╪══════════╪══════╝ │ │ │
│ │ │ │ │ │
└────────────────────┐│┌─────────┼────────┘ │ │
│││┌────────┼────────────┼────┘
┌────┐ ││││ │ │
╔═╡ DB ╞═══╪╞╞╞════════╪════════════╪═══════════════╗
║ └────┘ ││││ │ │ ║
║ ││││ │ │ ZBase64 ║
║ ││││ │ │ ▲ ║
║ Item │ Key │ ║
║ ▲ │ ▲ │ ║
║ └─────── Repository ────┴─────────┘ ║
║ ▲ ║
║ │ ║
║ Main ║
║ ║
╚═══════════════════════════════════════════════════╝
Subdirectory testcpp
contains a previous attempt to write such a thing in C++
(as an exercise to get to know modern C++ better). After the necessity for
learning more C++ broke away, a rewrite was performed in Ada.
- Currently, only Bupstash v0.10.3 is supported. Is intended to add support for a newer version in the future.
- The Ma_Sys.ma CI cannot build this. As a workaround one can install the library dependencies into the build container. To fix this the best solution would be to replace the Ma_Sys.ma CI by something better.