Filesystem repo spec (local repos)
Before the 1.0 release of the dbx projects, treat all documentation as likely to change without warning.
A dbx repository is a structured way to save machine learning experiment results. Within this documentation we use the words repo, dbx repo, repository, experiments repository, dbx repository and so on to mean the same thing. In this document particularly, we use the word repo to mean a local or filesystem repository and not a server repo.
This document describes a dbx repository stored on the filesystem (also called local repos). This serves as the spec and guideline for implementing how local repos work.
A local dbx repo is any folder that has a .dbx subdirectory. The .dbx folder stores some information that is useful for synchronising repositories. dbx does not store a full history of versions like git, but stores minimal information about changes made to a repository to avoid conflicts and be able to resolve which version is the latest.
The .dbx directory may be empty at first but will typically contain the following files:
- config.json: stores configuration for this local repo
- index: an experiment index with one entry for each experiment in this repo, containing experimentid, path and hash
- history: a change-log of the repository that stores info useful for synchronising between repos (new experiment version created, experiment delete log)
- attach/: an optional folder used for content-addressable storage of RepoFiles
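The definition of a local repo (a folder with a .dbx subdirectory) can be checked with a few lines of Python. This is a minimal sketch, not part of dbx itself: the function names `is_local_repo` and `present_entries` are hypothetical, and only the entry names come from the list above.

```python
from pathlib import Path

# Entry names taken from the list above; the helper names are hypothetical.
TYPICAL_ENTRIES = ("config.json", "index", "history", "attach")


def is_local_repo(folder):
    """Return True if `folder` has a .dbx subdirectory."""
    return (Path(folder) / ".dbx").is_dir()


def present_entries(folder):
    """List which of the typical .dbx entries exist (all of them are optional)."""
    dbx_dir = Path(folder) / ".dbx"
    return [name for name in TYPICAL_ENTRIES if (dbx_dir / name).exists()]
```

An empty .dbx folder still counts as a repo, so `present_entries` may legitimately return an empty list.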
When logging experiments into a local repo, the index and history files are not updated. This may change in the future, but at the moment these files are generated before and after sync operations by the dbx command line tool.
Below is a dbx repository with one experiment called an_idea/test1 (name="an_idea/test1" and id="abcd") that has a RepoFile called checkpoint.pth (full name an_idea/test1/abcd/checkpoint.pth):
.dbx/
  -> config.json
  -> attach/
    -> SHA256-8034c7f3218acc3bd14c22389408f36686ceab61c70a91859dffb5436aa3863d
  -> index
  -> history
an_idea/
  -> test1/
    -> abcd/
      -> meta.json
      -> log.jsonl
      -> checkpoint.pth # (symlink to [.dbx]/attach/SHA256-8034c7f3218acc3bd14c22389408f36686ceab61c70a91859dffb5436aa3863d)
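The layout above keeps the checkpoint in .dbx/attach/ under a name derived from its hash and exposes it in the experiment folder through a symlink. The following is a rough sketch of how that could be done, not the actual dbx implementation. It assumes the stored name is `SHA256-` followed by the hex SHA-256 digest of the file contents (inferred from the example name), and the helper names `sha256_of` and `attach_repo_file` are hypothetical.

```python
import hashlib
import os
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large checkpoints are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def attach_repo_file(repo_root, exp_dir, source, name):
    """Copy `source` into .dbx/attach/ and symlink it as `name` in the experiment folder."""
    attach_dir = Path(repo_root) / ".dbx" / "attach"
    attach_dir.mkdir(parents=True, exist_ok=True)
    # Assumed naming scheme: "SHA256-" + hex digest of the file contents.
    stored = attach_dir / f"SHA256-{sha256_of(source)}"
    if not stored.exists():  # identical content is stored only once
        shutil.copyfile(source, stored)
    link = Path(exp_dir) / name
    link.parent.mkdir(parents=True, exist_ok=True)
    # A relative target keeps the link valid if the whole repo folder is moved.
    os.symlink(os.path.relpath(stored, link.parent), link)
    return stored
```

Because the stored name depends only on the content, two experiments that save the same checkpoint end up pointing at a single file in .dbx/attach/.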
The repository config is stored as config.json. It is read by the loggers before they start saving data into this repository, as it contains information on how this repository stores files.
Currently no configuration is supported, but there will be a few options in the future:
- configuring remotes for this repository so syncing will be simpler on the command line
- configuring different storage mechanisms for RepoFiles, not only in the attach/ folder.
Experiments are stored in the subtree of the root folder excluding the .dbx folder. The path is given by the experiment name, which can (and is encouraged to) contain slashes (/). Slashes in experiment names allow for easy grouping of experiments by ideas, key parameters or other logical values that might help during analysis.
Similar to how names can group experiments, creating custom groups of experiments is planned for the future but not yet available. It is likely that support for groups will come server-side first and in local repositories after.
All the experiments in the repository are stored in a subfolder of the root.
For instance
<root>/
  .dbx/
    ...
  j482_fuzxv/
    meta.json
    log.jsonl
  ja78jczzss/
    meta.json
    log.jsonl
is a repository that has two experiments: j482_fuzxv and ja78jczzss.
Experiments in a repository do not have to be subfolders of root directly, but they need to be under the root in the filesystem. This is also a valid filesystem repository:
<root>
  .dbx/
    ...
  cifar10/
    simple_idea/
      fjd08ayvu/
        meta.json
        log.jsonl
      uv89_kxzf/
        meta.json
        log.jsonl
    complex_idea/
      ijy8vbc9x/
        meta.json
        log.jsonl
      mn0uhiv91/
        meta.json
        log.jsonl
Here we have 4 experiments that are logically grouped in subfolders. Each experiment has a name property and an id property. The path of an experiment in a repository is:
<root>/<exp.name>/<exp.id>
or, if there is no name (like in the first example above):
<root>/<exp.id>
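The path rule can be written as a small function. This is an illustrative sketch only; `experiment_path` is a hypothetical helper, not a function from the dbxlogger package.

```python
from pathlib import Path


def experiment_path(root, exp_id, name=None):
    """Return <root>/<name>/<id>, or <root>/<id> when the experiment has no name."""
    root = Path(root)
    if not name:
        return root / exp_id
    # Slashes in the name become nested folders; stray leading or trailing
    # slashes are dropped so the result always stays under the root.
    return root / name.strip("/") / exp_id
```

For example, `experiment_path("repo", "abcd", "an_idea/test1")` gives `repo/an_idea/test1/abcd`, while `experiment_path("repo", "j482_fuzxv")` gives `repo/j482_fuzxv`.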
Experiment names do not need to be unique, but the IDs must be. The dbxlogger package automatically generates unique IDs using nanoid.
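Since IDs must be unique across the repository, a tool may want to scan a repo and report clashes. The sketch below shows one possible way to do that and rests on assumptions not stated in this spec: an experiment folder is recognised by the presence of meta.json, its folder name is taken as the ID, and experiments are not nested inside each other. The function names are hypothetical.

```python
import os
from collections import defaultdict
from pathlib import Path


def find_experiments(root):
    """Yield (exp_id, path) for every folder under root that contains meta.json."""
    for dirpath, dirnames, filenames in os.walk(Path(root)):
        # Never descend into the .dbx folder.
        dirnames[:] = [d for d in dirnames if d != ".dbx"]
        if "meta.json" in filenames:
            path = Path(dirpath)
            yield path.name, path
            dirnames.clear()  # assumption: experiments are not nested in each other


def duplicate_ids(root):
    """Return {exp_id: [paths]} for IDs that occur more than once."""
    seen = defaultdict(list)
    for exp_id, path in find_experiments(root):
        seen[exp_id].append(path)
    return {exp_id: paths for exp_id, paths in seen.items() if len(paths) > 1}
```

An empty result from `duplicate_ids` means every experiment folder found in the repository has a distinct ID.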
| | ExpFiles | RepoFiles | RefFiles |
|---|---|---|---|
| Size | small files | large files | any files |
| Stored | with experiment | with repo or configured location | externally (not managed) |
| On sync | always copied | can be omitted (and fetched later if needed, no need to fetch all RepoFiles) | never copied by dbx |
| Example | experiment code uncommitted diffs | checkpoint files | dataset archive |
ExpFiles are files that are stored in the same folder as the experiment, and are deleted as soon as the experiment is deleted. They are meant for files that:
- are useless outside the scope of this experiment,
- add information that is either easier to read or hard to otherwise save using logs or metadata,
- are small in size and can be copied easily,
- will not be required if the experiment is deleted.
A great example is uncommitted local code diffs, which are stored as ExpFiles with _patch_ in their name.
ExpFiles are managed by dbx.
An ExpFile cannot be named meta.json or files.json, and its name cannot end with log.jsonl.
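The naming restriction is easy to check before writing a file into an experiment folder. A minimal sketch follows; `is_valid_expfile_name` is a hypothetical helper, not part of the Python logger.

```python
# Names reserved by the spec for dbx's own files.
RESERVED_NAMES = {"meta.json", "files.json"}
RESERVED_SUFFIX = "log.jsonl"


def is_valid_expfile_name(filename):
    """Return False for names that the spec reserves for dbx's own files."""
    return filename not in RESERVED_NAMES and not filename.endswith(RESERVED_SUFFIX)
```

Note that the suffix rule also rejects names such as `train_log.jsonl`, not only `log.jsonl` itself.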
To save an ExpFile, the user only needs to save the file in the experiment folder; no other operation is required. There are helpers in the Python logger to create any files in dbx repositories. The ExpFile class may be used to log references to an ExpFile.
RepoFiles are files that are stored in the repository and may or may not come from experiments. They can be referenced from and saved as part of experiments but are not physically stored in the same folder as experiments. The final storage location depends on the repository config, but by default they are stored in a content-addressable manner in the .dbx/attach/ folder.
RepoFiles are managed by dbx.
Ideal for:
- experiment artefacts that are not small in size, like checkpoint files
- files that are used by many experiments but not necessarily an artefact of any (like a pre-trained network checkpoint),
- any other files experiments will output but are not critical for quick analysis of the results
Not ideal for things like dataset archives.
RepoFiles are not deleted when an experiment that created or used them is deleted, but they may be garbage collected. As a result, files that do not belong to any experiment need to be listed in the .dbx/attachments file.
It is a plaintext file with one entry per line:
<hash> <filename>
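Reading this file takes only a few lines. The sketch below assumes that the hash and the file name are separated by a single space, that the file name may itself contain spaces, and that blank lines can be skipped; none of these details are fixed by the spec, and `parse_attachments` is a hypothetical helper.

```python
def parse_attachments(text):
    """Parse '<hash> <filename>' lines into (hash, filename) tuples."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skipping blank lines is an assumption
        # Split on the first space only, so file names may contain spaces.
        file_hash, _, filename = line.partition(" ")
        entries.append((file_hash, filename.strip()))
    return entries
```

A garbage collector could treat every hash returned here as still in use, in addition to the hashes referenced by experiments.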
RefFiles are files stored outside of dbx repositories but referenced from experiments. They can be things like dataset archives or pre-trained model checkpoints if they aren't stored as RepoFiles. The ideal use is when you need to save a hash of a file but not necessarily the file itself, and the file is stored elsewhere.
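Recording a reference to an external file amounts to hashing it and noting where it lives. The sketch below illustrates the idea only. The record fields (`path`, `sha256`, `size`) and the choice of SHA-256 are assumptions for illustration, not the format dbx uses, and `make_file_reference` is a hypothetical helper.

```python
import hashlib
import os


def make_file_reference(path, chunk_size=1 << 20):
    """Build a small record describing an external file without copying it."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large dataset archives are not loaded into memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "path": os.path.abspath(path),  # hypothetical field names
        "sha256": digest.hexdigest(),
        "size": os.path.getsize(path),
    }
```

Such a record lets a later run verify that the dataset it reads is the same one the experiment used, even though dbx never stored the file.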
The only supported location now is the local filesystem, with more to come. The storage locations are for RepoFiles only, since ExpFiles are stored where the experiments are stored and RefFiles are not managed by dbx.
The plan:
- relative path (.dbx/attach for RepoFiles) (default behaviour)
- local filesystem (absolute paths)
- local filesystem with named root directory (path relative to something other than .dbx/attach; allows different computers to define different physical paths to the same named root directory)
- AWS S3
- Google Cloud Storage
- Others?
If we work with multiple computers we may desire all our large files (RepoFiles) to be copied along when we copy the repository, but not in the same place as we store the dbx repository.
For example:
# machine A
using physical "/storage/lfs/dbx_store" as "storage"
# machine B
using physical "/home/jdoe/dbx_store" as "storage"
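The named root idea can be sketched as a per-machine lookup table. Named roots are only planned, so everything here is hypothetical: the dictionaries, the `resolve` function, and the way a location is expressed as a root name plus a relative path.

```python
from pathlib import Path

# Hypothetical per-machine mappings of named roots to physical directories,
# mirroring the two machines in the example above.
MACHINE_A = {"storage": "/storage/lfs/dbx_store"}
MACHINE_B = {"storage": "/home/jdoe/dbx_store"}


def resolve(named_roots, root_name, relative_path):
    """Turn (root name, relative path) into a physical path on this machine."""
    if root_name not in named_roots:
        raise KeyError(f"named root {root_name!r} is not defined on this machine")
    return Path(named_roots[root_name]) / relative_path
```

With this scheme a repository would only record the pair `("storage", "SHA256-...")`, and each machine would map it to its own physical location.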