Filesystem repo spec (local repos)

Before the 1.0 release of the dbx projects, treat all documentation as likely to change without warning.

A dbx repository is a structured way to save machine learning experiment results. Within this documentation we use the words repo, dbx repo, repository, experiments repository, dbx repository and so on to mean the same thing. In this document in particular, we use the word repo to mean a local (filesystem) repository and not a server repo.

This document describes a dbx repository stored on the filesystem (also called local repos). This serves as the spec and guideline for implementing how local repos work.

A local dbx repo is any folder that has a .dbx subdirectory. The .dbx folder stores some information that is useful for synchronising repositories. dbx does not store a full history of versions like git, but stores minimal information about changes made to a repository to avoid conflicts and be able to resolve which version is the latest.

The .dbx directory may be empty at first but will typically contain the following files:

config.json to store configuration for this local repo
index an experiment index with one entry for each experiment in this repo, containing experiment id, path and hash
history a change-log of the repository that stores info useful for synchronising between repos (new experiment version created, experiment delete log)
attach/ folder is an optional folder used for content-addressable storage of RepoFiles
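
As a minimal sketch of the rule above (a folder is a local repo if it has a .dbx subdirectory), the check could look like the following. The function names are hypothetical and not part of dbx, and walking up parent folders to find the repo root is an assumption borrowed from how tools like git behave, not something this spec states.

```python
from pathlib import Path
from typing import Optional

DBX_DIR = ".dbx"


def is_local_repo(folder) -> bool:
    """Return True if `folder` has a .dbx subdirectory."""
    return (Path(folder) / DBX_DIR).is_dir()


def find_repo_root(start) -> Optional[Path]:
    """Return the nearest folder at or above `start` that is a local repo.

    Assumption: searching parent folders is not described by the spec.
    """
    current = Path(start).resolve()
    for candidate in (current, *current.parents):
        if is_local_repo(candidate):
            return candidate
    return None
```

Since the .dbx directory may be empty at first, the check deliberately does not look for config.json, index or history.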

`history`, `index` and logging experiments into a repository

When logging experiments into a local repo, the index and history files are not updated. This may change in the future, but at the moment these files are generated before and after sync operations by the dbx command line tool.

Example with a RepoFile

A dbx repository with one experiment called an_idea/test1 (name="an_idea/test1" and id="abcd") that has a RepoFile called checkpoint.pth (full name `an_idea/test1/abcd/checkpoint.pth`).

.dbx/
  -> config.json
  -> attach/
       -> SHA256-8034c7f3218acc3bd14c22389408f36686ceab61c70a91859dffb5436aa3863d
  -> index
  -> history
an_idea/
  -> test1/
     -> abcd/
        -> meta.json
        -> log.jsonl
        -> checkpoint.pth     # (symlink to [.dbx]/attach/SHA256-8034c7f3218acc3bd14c22389408f36686ceab61c70a91859dffb5436aa3863d)

Config

The repository config is stored as config.json. It is read by the loggers before they start saving data into this repository, as it contains information on how this repository stores files.

Currently no configuration is supported, but there will be a few options in the future:

configuring remotes for this repository so syncing will be simpler on the command line
configuring different storage mechanisms for RepoFiles, not only in the attach/ folder.
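
A logger reading the config before it starts saving data might do something like the sketch below. It assumes that config.json, when present, holds a single JSON object, and that a missing file means "no configuration"; neither is stated by this spec, and `load_config` is a hypothetical name.

```python
import json
from pathlib import Path


def load_config(repo_root) -> dict:
    """Read <repo_root>/.dbx/config.json, or return {} if it does not exist.

    Assumption: a missing config.json is treated as an empty configuration,
    because the .dbx directory may be empty at first.
    """
    config_path = Path(repo_root) / ".dbx" / "config.json"
    if not config_path.is_file():
        return {}
    with config_path.open(encoding="utf-8") as fh:
        return json.load(fh)
```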

Experiments

Experiments are stored in the subtree of the root folder excluding the .dbx folder. The path is given by the experiment name, which can (and is encouraged to) contain slashes (/). Slashes in experiment names allow for easy grouping of experiments by ideas, key parameters or other logical values that might help during analysis.

Similar to how names can group experiments, creating custom groups of experiments is planned for the future but not yet available. It is likely that support for groups will come server-side first and in local repositories after.

All the experiments in the repository are stored in a subfolder of the root.

For instance

<root>/
    .dbx/
        ...
    j482_fuzxv/
        meta.json
        log.jsonl
    ja78jczzss/
        meta.json
        log.jsonl

is a repository that has two experiments: j482_fuzxv and ja78jczzss.

Experiments in a repository do not have to be subfolders of root directly, but they need to be under the root in the filesystem. This is also a valid filesystem repository:

<root>
    .dbx/
        ...
    cifar10/
        simple_idea/
            fjd08ayvu/
                meta.json
                log.jsonl
            uv89_kxzf/
                meta.json
                log.jsonl
        complex_idea/
            ijy8vbc9x/
                meta.json
                log.jsonl
            mn0uhiv91/
                meta.json
                log.jsonl

Here we have 4 experiments, logically grouped in subfolders. Each experiment has a name property and an id property. The path of an experiment in a repository is:

<root>/<exp.name>/<exp.id>

or if there is no name (like in the first example above):

<root>/<exp.id>
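
The path rule can be sketched as a small helper. This assumes that an experiment without a name sits directly under the root, as in the first example above; `experiment_path` is a hypothetical name and not part of dbxlogger.

```python
from pathlib import Path
from typing import Optional


def experiment_path(root, exp_id: str, name: Optional[str] = None) -> Path:
    """Build the folder of an experiment from its name and id.

    Slashes in `name` become nested folders; an empty or missing name
    places the experiment directly under the root.
    """
    root = Path(root)
    if not name:
        return root / exp_id
    # Drop empty segments so "a//b" or a trailing slash do not matter.
    parts = [part for part in name.split("/") if part]
    return root.joinpath(*parts, exp_id)
```

For example, `experiment_path("repo", "abcd", "an_idea/test1")` gives `repo/an_idea/test1/abcd`.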

Experiment names do not need to be unique, but the IDs must be. The dbxlogger package automatically generates unique IDs using nanoid.
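
To list the experiments in a local repo, one option is to walk the tree under the root and skip .dbx. The sketch below assumes that a folder containing meta.json is an experiment folder, which matches the examples above but is not spelled out by this spec; `find_experiments` is a hypothetical name.

```python
import os
from pathlib import Path


def find_experiments(root):
    """Return sorted (name, id) pairs for every experiment under `root`.

    Assumption: an experiment folder is any folder below the root
    (outside .dbx) that contains a meta.json file.
    """
    root = Path(root)
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        current = Path(dirpath)
        if current == root:
            # Never descend into the .dbx folder.
            dirnames[:] = [d for d in dirnames if d != ".dbx"]
            continue
        if "meta.json" in filenames:
            rel = current.relative_to(root)
            name = "" if rel.parent == Path(".") else rel.parent.as_posix()
            found.append((name, rel.name))
            dirnames[:] = []  # do not look inside an experiment folder
    return sorted(found)
```

Experiments stored directly under the root are reported with an empty name.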

Files

	ExpFiles	RepoFiles	RefFiles
Size	small files	large files	any files
Stored	With experiment	With repo or configured location	externally (not managed)
On sync	always copied	can be omitted (and fetched later if needed, no need to fetch all RepoFiles)	never copied by dbx
Example	experiment code uncommitted diffs	checkpoint files	dataset archive

ExpFiles

ExpFiles are files that are stored in the same folder as the experiment, and are deleted as soon as the experiment is deleted. They are meant for files that:

are useless outside the scope of this experiment,
add information that is either easier to read or hard to otherwise save using logs or metadata,
are small in size and can be copied easily,
will not be required if the experiment is deleted.

A great example usage is uncommitted local code diffs, which are stored as ExpFiles with _patch_ in their name.

ExpFiles are managed by dbx.

ExpFiles cannot be named meta.json or files.json, and their names cannot end with log.jsonl.

To save an ExpFile, all the user needs to do is physically save the file in the experiment folder; no other operation is needed. There are helpers in the Python logger to create any files in dbx repositories. The ExpFile class may be used to log references to an ExpFile.
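
Saving an ExpFile by hand, together with the naming restriction, can be sketched as below. These functions are hypothetical illustrations and are not the helpers shipped with the Python logger.

```python
from pathlib import Path

# Names reserved by the repository layout.
RESERVED_NAMES = {"meta.json", "files.json"}
RESERVED_SUFFIX = "log.jsonl"


def is_valid_expfile_name(filename: str) -> bool:
    """Return True if `filename` is allowed as an ExpFile name."""
    return filename not in RESERVED_NAMES and not filename.endswith(RESERVED_SUFFIX)


def save_expfile(exp_dir, filename: str, data: bytes) -> Path:
    """Write `data` into the experiment folder as an ExpFile."""
    if not is_valid_expfile_name(filename):
        raise ValueError(f"{filename!r} is reserved and cannot be an ExpFile name")
    target = Path(exp_dir) / filename
    target.write_bytes(data)
    return target
```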

RepoFiles

RepoFiles are files that are stored in the repository and may or may not come from experiments. They can be referenced from and saved as part of experiments but are not physically stored in the same folder as experiments. The final storage location depends on the repository config, but by default they are stored in a content-addressable manner in .dbx/attach/ folder.
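
Content-addressable storage in .dbx/attach/ can be sketched as follows. The `SHA256-<hex digest>` file name and the symlink from the experiment folder follow the example earlier in this document; the function names, the use of a relative symlink and the copy-if-missing behaviour are assumptions, not the dbx implementation.

```python
import hashlib
import os
import shutil
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_repofile(repo_root, source, exp_dir, filename: str) -> Path:
    """Copy `source` into .dbx/attach/ and link it from the experiment folder."""
    attach = Path(repo_root) / ".dbx" / "attach"
    attach.mkdir(parents=True, exist_ok=True)
    stored = attach / f"SHA256-{sha256_of(source)}"
    if not stored.exists():  # identical content is stored only once
        shutil.copyfile(source, stored)
    link = Path(exp_dir) / filename
    link.parent.mkdir(parents=True, exist_ok=True)
    # Assumption: a relative link, so the repo can be moved as a whole.
    link.symlink_to(os.path.relpath(stored, link.parent))
    return stored
```

Because the stored name depends only on the content, two experiments that save the same checkpoint end up with two symlinks to a single file. The sketch raises `FileExistsError` if the link already exists.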

RepoFiles are managed by dbx.

Ideal for:

experiment artefacts that are not small in size, like checkpoint files
files that are used by many experiments but not necessarily an artefact of any (like a pre-trained network checkpoint),
any other files experiments will output but are not critical for quick analysis of the results

Not ideal for things like dataset archives.

RepoFiles are not deleted when an experiment that created or used them is deleted, but they may be garbage collected. As a result, files that do not belong to any experiment need to be added to the .dbx/attachments file.

`.dbx/files` file

It's a plaintext file with one entry per line, each consisting of a hash followed by a file name:

<hash> <filename>
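
Reading this format could look like the sketch below. It assumes the hash and the file name are separated by the first space on the line, that file names may themselves contain spaces, and that blank lines are ignored; the spec does not settle any of these, and `parse_files_listing` is a hypothetical name.

```python
def parse_files_listing(text: str):
    """Parse '<hash> <filename>' lines into a list of (hash, filename) pairs."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # assumption: blank lines carry no meaning
        # Split on the first space only, so file names may contain spaces.
        file_hash, _, filename = line.partition(" ")
        entries.append((file_hash, filename.strip()))
    return entries
```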

RefFiles

RefFiles are files stored outside of dbx repositories but referenced from experiments. They can be things like dataset archives or pre-trained model checkpoints, if these aren't stored as RepoFiles. The ideal use is when you need to save the hash of a file but not necessarily the file itself, because the file is stored elsewhere.
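
Recording a reference to an external file, without copying it, might look like this sketch. The record's field names (`path`, `hash`, `size`) are hypothetical and do not describe the format dbx uses; the `SHA256-` prefix mirrors the naming in the attach/ example.

```python
import hashlib
from pathlib import Path


def describe_ref_file(path) -> dict:
    """Hash an external file and return a small reference record for it.

    The file itself is left where it is; only its location, hash and
    size are captured. Field names are illustrative assumptions.
    """
    path = Path(path)
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "path": str(path),
        "hash": f"SHA256-{digest.hexdigest()}",
        "size": path.stat().st_size,
    }
```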

File storage locations

The only supported location now is the local filesystem, with more to come. The storage locations are for RepoFiles only, since ExpFiles are stored where the experiments are stored and RefFiles are not managed by dbx.

The plan:

relative path (.dbx/attach for RepoFiles) (default behaviour)
local filesystem (absolute paths)
local filesystem with named root directory (path relative to something other than .dbx/attach; this allows different computers to define different physical paths to the same named root directory)
AWS S3
Google Cloud Storage
Others?

Local fs with named root directory

If we work with multiple computers we may desire all our large files (RepoFiles) to be copied along when we copy the repository, but not in the same place as we store the dbx repository.

For example:

# machine A
      using physical "/storage/lfs/dbx_store" as "storage"
# machine B
      using physical "/home/jdoe/dbx_store" as "storage"
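
Resolving a path against a named root could be sketched as below. This feature is only planned, so everything here is an assumption: the per-machine mapping from root name to physical path, the function name and the rejection of absolute or `..` paths are all hypothetical.

```python
from pathlib import Path, PurePosixPath


def resolve_named_path(named_roots: dict, root_name: str, relative_path: str) -> Path:
    """Map (root name, relative path) to a physical path on this machine.

    `named_roots` is this machine's own mapping, for example
    {"storage": "/storage/lfs/dbx_store"} on machine A.
    """
    if root_name not in named_roots:
        raise KeyError(f"named root {root_name!r} is not defined on this machine")
    rel = PurePosixPath(relative_path)
    if rel.is_absolute() or ".." in rel.parts:
        raise ValueError("path must stay inside the named root")
    return Path(named_roots[root_name]).joinpath(*rel.parts)
```

With machine A's mapping, `resolve_named_path({"storage": "/storage/lfs/dbx_store"}, "storage", "SHA256-abc")` points into `/storage/lfs/dbx_store`, while the same call on machine B with its own mapping points into `/home/jdoe/dbx_store`.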
