-
#64 upgraded the Schema dependency to 0.7.3.
-
#109:
-
Upgrades
psycopg2-binary
andpytest
to get tests passing again with modern Python. -
Modifies
docker build
commands to get the built image ID via--iidfile
vs parsing the shell command output. -
adds
--platform linux/amd64
todocker run
anddocker build
commands, until we get support forarm64
sorted out. -
fixes failure when GCP credentials aren't configured running in local mode.
-
Small release to archive for JOSS acceptance.
- Move
cloud_sql_proxy
installation before code copy (#87)
The biggest feature in this new release is native support for logging to an MLFlow tracking server using the UV Metrics project. (#35) This feature is in alpha; expect documentation soon.
- minor bugfixes for GKE (#85)
- additional tests for gke.{types, util} (#84)
- re-order custom apt packages before pip requirements (#82)
- modify base image to our more general cloudbuild naming scheme (#80)
- updated
google-auth
dependency version to1.19.0
(#79) - add clearer contribution info (#76)
- Update uv-metrics tutorial (#74, #72)
- add support for running an embedded cloudsql_proxy (#60)
- bugfix for #65: do not add resource maxima when quota is < 1 (#67)
- Updated accelerator regions (and globally availabe AI Platform regions to match the current state here): https://cloud.google.com/ai-platform/training/docs/regions
- @ramasesh Added a fix that prevented
pip
git dependencies from working incaliban shell
mode (#55) This adds a small update to the base image, so be sure to run
docker pull gcr.io/blueshift-playground/blueshift:cpu
docker pull gcr.io/blueshift-playground/blueshift:gpu
to get access to this fix.
-
Thanks to @eschnett,
--docker_run-args
can now deal with arbitrary whitespace in the list of arguments, instead of single spaces only. (#46) -
Caliban now authenticates AI Platform job submissions using the authentication provided by
gcloud auth login
, rather than requiring a service account key. This significantly simplifies the setup required for a first time user. -
caliban cloud
now checks if the image exists remotely before issuing adocker push
command on the newly built image (#36) -
Big internal refactor to make it easier to work on code, increase test coverage, add new backends (#32)
-
add
schema
validation for.calibanconfig.json
. This makes it much easier to add configuration knobs: #37 -
Custom base image support (#39), thanks to #20 from @sagravat.
.calibanconfig.json
now supports a"base_image"
key. For the value, can supply:- a Docker base image of your own
- a dict of the form
{"cpu": "base_image", "gpu": "base_image"}
with both entries optional, of course.
Two more cool features.
First, if you use a format string, like
"my_image-{}:latest"
, the format block{}
will be filled in with eithercpu
orgpu
, depending on the mode Caliban is using.Second, we now have native support for Google's Deep Learning VMs as base images. The actual VM containers live here. If you provide any of the following strings, Caliban will expand them out to the actual base image location:
dlvm:pytorch-cpu
dlvm:pytorch-cpu-1.0
dlvm:pytorch-cpu-1.1
dlvm:pytorch-cpu-1.2
dlvm:pytorch-cpu-1.3
dlvm:pytorch-cpu-1.4
dlvm:pytorch-gpu
dlvm:pytorch-gpu-1.0
dlvm:pytorch-gpu-1.1
dlvm:pytorch-gpu-1.2
dlvm:pytorch-gpu-1.3
dlvm:pytorch-gpu-1.4
dlvm:tf-cpu
dlvm:tf-cpu-1.0
dlvm:tf-cpu-1.13
dlvm:tf-cpu-1.14
dlvm:tf-cpu-1.15
dlvm:tf-gpu
dlvm:tf-gpu-1.0
dlvm:tf-gpu-1.13
dlvm:tf-gpu-1.14
dlvm:tf-gpu-1.15
dlvm:tf2-cpu
dlvm:tf2-cpu-2.0
dlvm:tf2-cpu-2.1
dlvm:tf2-cpu-2.2
dlvm:tf2-gpu
dlvm:tf2-gpu-2.0
dlvm:tf2-gpu-2.1
dlvm:tf2-gpu-2.2
Format strings work here as well! So, "dlvm:pytorch-{}-1.4"
is a totally valid
base image.
- Prepared for a variety of base images by setting up a cloud build matrix: #25
- Added better documentation for
gcloud auth configure-docker
#26 - Added
close()
toTqdmFile
, preventing an error when pipingstdout
: #30 tqdm
progress bars and other interactive outputs now display correctly incaliban run
outputs.stdout
flushes properly! Before these changes,stderr
would appear before anystdout
, making it difficult to store the logs in a text file. Now, by default, python processes launched bycaliban run
won't buffer. #31
- fixes the python binary that caliban notebook points to (now that we use conda)
- adds DEBIAN_FRONTEND=noninteractive to the apt-get command, so that packages like texlive won't freeze and wait for you to specify a timezone.
This makes it easy to add, for example, npm and latex support to your caliban notebook invocations.
- fixes a bug with
parse_region
not handling a lack of default. - converts the build to Github Actions.
- Rolls Caliban back to requiring only python 3.6 support.
- Removes some unused imports from a few files.
- Added fix for an issue where large user IDs would crash Docker during the build phase. #8
- Fix for bug with requirements.txt files.
- Added support for Conda dependencies
(#5). If you include
environment.yml
in your project's folder Caliban will attempt to install all dependencies. - Caliban now uses a slimmer base image for GPU mode.
- Base images for CPU and GPU modes now use Conda to manage the container's
virtual environment instead of
virtualenv
.
- Caliban now caches the service account key and ADC file; you should see faster builds, BUT you might run into trouble if you try to run multiple Caliban commands in the same directory in different processes, due to a race condition with a temp file dropped in the directory. If you see a failure, try again.
- Private Cloud Source Repositories are now supported as requirements in the
requirements.txt
andsetup.py
of projects executed using Caliban. caliban notebook
now installsjupyter
instead ofjupyterlab
.caliban notebook --lab
of course still usesjupyterlab
.- Caliban now works with Python 3.5.
- The default release channel for gke clusters in caliban is now 'regular', as node autoprovisioning now works with preemptible instances in the regular channel.
- If you provide
--cloud_key
Caliban will now properly use the supplied cloud key to submit jobs to AI platform. Previously, Caliban would rely on the system's auth method, which made it impossible to point to a different project if your current service account key wasn't the owner. - Changed gke job submission to accept min cpu and min mem arguments instead of an explicit machine-type. This allows gke to efficiently schedule jobs and prevents issues where jobs can be oversubscribed on compute nodes.
- added support for
.calibanconfig.json
. You can now add this file with an"apt_packages"
entry to specify aptitude packages to install inside the container. The value under this key can be either a list, or a dictionary with"gpu"
and"cpu"'
keys. For example, any of the following are valid:
# This is a list by itself. Comments are fine, by the way.
{
"apt_packages": ["libsm6", "libxext6", "libxrender-dev"]
}
This works too:
# You can also include a dictionary with different deps
# for gpu and cpu modes. It's fine to leave either of these blank,
# or not include it.
{
"apt_packages": {
"gpu": ["libsm6", "libxext6", "libxrender-dev"],
"cpu": ["some_other_package"]
}
}
-
caliban notebook
now attempts to search for the first free port instead of failing due to an already-occupied port. -
pip
is now called with--no-cache-dir
inside the container; this should shrink container sizes with no impact on performance. -
All commands have a new
--no-cache
option; when supplied, Docker will skip using its build cache. This is helpful to use if you want to, say, force new dependencies to get installed without bumping their versions explicitly.
-
JSON experiment configuration files can now handle arguments which are varied together, by supplying a compound key, of the form e.g.
[arg1,arg2]
. -
better error messages print when a docker command fails.
-
Caliban can now handle pushing containers to "domain scoped projects": https://cloud.google.com/container-registry/docs/overview#domain-scoped_projects The colon in the project name separating domain and project ID is handled properly.
-
'caliban run' and 'caliban shell' now take an --image_id argument; if provided, these commands will skip their 'docker build' phase and use the image ID directly.
-
AI Platform labels now swap periods for underscores (thanks to vinay@!); this means that floating point numbers will no longer have pre- and post- decimal components concatenated.
-
A new
expansion
script will expand experiment configs into all of the individual experiments they'd generate. This command can acceptstdin
, just like the--experiment_config
argument. Options include--pprint
and--print_flags
. The output of this script can be piped directly intocaliban cloud --experiment_config stdin
. -
caliban shell
will now default to bash if you're using a shell that's notbash
orzsh
(fish shell, for example) instead of erroring out. -
caliban shell
has a new--shell
argument that you can use to override the container's default shell.
-
consolidated gke tpu/gpu spec parsing with cloud types
-
modified all commands to accept as the module argument paths to arbitrary shell scripts. Any argument of the format "trainer.train" will execute using "python -m trainer.train", just as before. If instead you pass a python script as a file, like "trainer/train.py", caliban will execute this file inside the container using "python trainer/train.py". Any other argument, if it exists in the local directory, will be executed as a bash script.
This allows users to run commands like "caliban cloud my_script.sh" and have it all work.
-
"caliban run" now supports --experiment_config and --dry_run. These work just like they do for "caliban cloud"; the experiment config will expand out and execute N jobs on your local machine.
-
moved some methods from cluster/cluster.py to gke/utils.py
-
added unit tests for some gke/utils.py methods
-
Support for ADC credentials! if application_default_credentials.json is present on the user's machine, they now get copied into the container.
-
if ADC credentials are NOT present but a service account key is we write a placeholder. this is required to get ctpu working inside containers.
- added tpu driver specification for gke jobs
- added query for getting available tpu drivers for cluster/project
- set host_ipc=True for cluster jobs
- moved cluster constants to separate file
- moved cluster gpu validation to separate file
- added test for gpu limits validation
- TPU and GPU spec now accept validate_count arg to disable count validation.
- Fixed a bug where the label for the job name wasn't getting properly sanitized - this meant that if you provided an upper-cased job name job submission would fail.
- Fixed a bug that prevented parsing experiment config values that were floats.
- experiment config parsing now performs the full expansion at CLI-parse-time and validates every expanded config.
-
--docker_run_args
allows you to pass a string of arguments directly through todocker run
. This command works forcaliban run
,caliban notebook
andcaliban shell
. -
docker.py
reorganized, now takes explicitJobMode
instances throughout. -
new
--cloud_key
argument for all commands. If specified this lets you override theGOOGLE_APPLICATION_CREDENTIALS
value that's usually inspected for the service account key. -
fixed a bug in
caliban.util.TempCopy
where aNone
-valued path would fail. This affected environments whereGOOGLE_APPLICATION_CREDENTIALS
wasn't set.
-
--experiment_config
can now take experiment configs via stdin (pipes, yay!); specify--experiment_config stdin
, or any-cased version of that, and the script will wait to accept your input.As an example, this command pipes in a config and also passes
--dry_run
to show the series of jobs that WILL be submitted when the--dry_run
flag is removed:cat experiment.json | caliban cloud -e gpu --experiment_config stdin --dry_run trainer.train
You could pipe the output of a nontrivial python script that generates a JSON list of dicts.
-
--image_tag
argument tocaliban cloud
; if you supply this it will bypass the Docker build and push steps and use this image directly. This is useful if you want to submit a job quickly without going through a no-op build and push, OR if you want to broadcast an experiment to some existing container. -
if you supply a
--tpu_spec
and DON'T supply an explicit--gpu_spec
, caliban will default to CPU mode.--gpu_spec
and--nogpu
are still incompatible. You can use a GPU and TPU spec together without problems. -
added support for
--tpu_spec
incaliban cloud
mode. This validates in a similar style to--gpu_spec
; any invalid combination of count, region and TPU type will fail.(Unlike
--gpu_spec
this mode IS compatible with--nogpu
. In fact, many demos seem to use the non-GPU version of tensorflow here.) -
added a
caliban build
mode that simply builds the image and returns the image ID. Useful for checking if your image can build at all with the current settings.
- the CLI will now error if you pass any caliban keyword arguments AFTER the
python module name, but before
--
. In previous versions, if you did something like
caliban cloud trainer.train --nogpu
That final --nogpu
would get passed on directly to your script, vs getting
absorbed by Caliban.
- If you pass
--nogpu
mode and have a setup.py file, caliban will automatically pass--extras cpu
and attempt to install an extras dependency calledcpu
. Same goes forgpu
if you DON'T pass--nogpu
. So, if you have a setup.py file, the default pip installation will now look like one of these:
pip install .[gpu]
pip install .[cpu]
Instead of
pip install .
- Minor bugfix; I was calling "len" on an iterator, not a list.
This version:
- Moves from batch submission to submitting 1 request at a time, so Cloud can handle our rate limiting
- adds support for lists of experiment configs
- adds terminal highlighting
- adds a progress bar for Cloud submissions
- cleans up the terminal output for each job to show the interesting bits, not the entire training spec.
- moves the job index to the end of the jobId, so searching is easier
If you like you can set -v 1
to see the full spec output.
caliban.cloud.types
has lots of enums and types that make it easier to code well against AI Platform.caliban cloud
has more validations, now.caliban cloud
now supports:--gpu_spec
, which you can use to configure the GPU count and type for your job.--machine_type
allows you to specify the machine type for all jobs that run.--experiment_config
lets you submit batches of jobs at once.--force
skips all validations and forces a submission with the specified config.--dry_run
will generate logs showing what WOULD happen if you submit a batch of jobs.
- The
caliban cloud --stream_logs
argument is now gone; the command prints, so this is easy enough to run without special help, and the argument made batch job submission difficult. - all local Docker commands now run
--ipc host
, which givesdocker run
access to all of the host's memory. - Base images now contain
gsutil
.docker.py
automatically configuresgcloud
andgsutil
with shared credentials when they're present on the machine running caliban commands.