Complete image-building automation docs
Signed-off-by: Chris Evich <[email protected]>
cevich committed Sep 16, 2020
1 parent b9207cc commit 5b1d619
Showing 1 changed file (README.md) with 187 additions and 29 deletions.


## The last part first (overview step 3)

a.k.a. ***Cache Images***

These are the VM Images actually used by other repositories for automated
testing. So, assuming you just need to update packages or tweak the list,
start here. Though be aware, this repository does not yet perform any testing
of the images. That's your secondary responsibility, see step 5 below.

Notes:

* ***Warning:*** Before you go deleting seemingly "unnecessary" packages and
"extra" code, remember these VM images are shared by automation in multiple
repositories.

* VM configuration starts with one of the `cache_images/*_setup.sh` scripts.
Normally you probably won't need/want to mess with these.

* The bulk of the packaging work occurs next, from the `cache_images/*_packaging.sh`
scripts. This is most likely what you want to modify (see the sketch just
after these notes).

* Some non-packaged/source-based tooling is installed using the
`cache_images/podman_tooling.sh` script. These are slightly fragile, as
they always come from upstream (master) podman. Avoid adding/changing
anything here if alternatives exist.
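
For illustration only, a package-list tweak in one of the
`cache_images/*_packaging.sh` scripts usually amounts to a one-line change.
The variable and package names below are hypothetical:

```
# Hypothetical sketch -- the real *_packaging.sh scripts may organize their
# package lists differently; the point is that additions are a small edit.
PACKAGES=(
    existing-package-one
    existing-package-two
    some-new-package   # <-- newly added package
)
sudo dnf install -y "${PACKAGES[@]}"
```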

Process:

1. After you make your changes, push to a PR. Shell-script changes will be
validated, and production VM-image builds will begin automatically.

2. Assuming a successful image build, the names of all output images will share
a common suffix. To discover this suffix, find and click one of the
`View more details on Cirrus CI` links (bottom of the *Checks* tab in GitHub).
Any **Cirrus-CI** task will do; it doesn't matter which you pick.

3. Toward the top of the page is a button labeled *VIEW ALL TASKS*.
Click this button.

4. Look at the URL in your browser; it will be of the form
`https://cirrus-ci.com/build/<big number>`. Copy-paste (or otherwise
record in stone) the **big number**, you'll need it for the next step.

5. Go over to whatever other containers/repository needed the image update.
Open the `.cirrus.yml` file, and find the 'env' line referencing the image
suffix. It will likely be named `_BUILT_IMAGE_SUFFIX:` or something similar.

6. Paste in the **big number** *prefixed by the letter 'c'*. The *"c"* indicates
the images are *cache images*. For example, if the url was `http://.../12345`,
you would paste in `c12345` as the value for `_BUILT_IMAGE_SUFFIX:`
(see the example after this list).

7. Open up a PR with this change, and push it. Once all tests pass and you're
satisfied with the image changes, ask somebody to review/approve both
PRs for merging. If you're feeling generous, perhaps provide cross-links
between the two PRs in comments, for future reference.

8. After all the PRs are merged, you're done.
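
For reference, the relevant stanza in the consuming repository's `.cirrus.yml`
might look roughly like the fragment below. Only `_BUILT_IMAGE_SUFFIX:` and the
`c`-prefixed value come from the steps above; the surrounding variable names
are hypothetical:

```
# Illustrative .cirrus.yml fragment, not copied from any particular repository
env:
    # Suffix copied from the Cirrus-CI build number, prefixed with 'c'
    _BUILT_IMAGE_SUFFIX: "c12345"
    FEDORA_CACHE_IMAGE_NAME: "fedora-${_BUILT_IMAGE_SUFFIX}"
```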




## The image-builder image (overview step 1)

Google Compute Engine (GCE) does not provide a wide selection of ready-made
VM images. Instead, a lengthy and sophisticated process is required to
prepare, properly format, and import external VM images for use. To perform
these steps within automation, a dedicated VM image is needed which itself
has been prepared with the necessary incantations, packages, configuration,
and magic license keys.

For normal day-to-day use, this process should not need to be modified or
maintained much. However, on the off-chance that's ever not true, here is
an overview of the process followed **by automation** to produce the
*image-building VM image*:

1. Build the container defined in the `ci` subdirectory's `ContainerFile`.
Start this container with a copy of the current repository code provided
in the `$CIRRUS_WORKING_DIR` directory. For example on a Fedora host:

```
podman build -t ci_image -f ci/ContainerFile .  # illustrative tag/path
podman run -it --rm --security-opt label=disable -v $PWD:$PWD -w $PWD ci_image
```

2. From within the `ci` container (above), in the repository root volume, execute
the `make image_builder` target.

3. The `image_builder/gce.yml` file is converted into JSON format for
consumption by the [Hashicorp *packer* utility](https://www.packer.io/).
This generated file may be ignored; *make* will regenerate it upon
any changes to the YAML file.

4. Packer spins up a GCE VM based on CentOS, installs the necessary packages,
and attaches a [nested-virtualization "license" to the
VM](https://cloud.google.com/compute/docs/instances/enable-nested-virtualization-vm-instances#enablenestedvirt). Be patient until this process completes.

5. Near the end of the process, the GCE VM deletes temporary files, ssh host keys,
etc., and then shuts down.

6. Packer should then automatically call into the Google Cloud APIs and
coordinate conversion of the VM disk into a bootable image. Again,
please be patient; this process may take several minutes to complete.

7. When finished, packer writes the freshly created image name and other
metadata details into the local `image_builder/manifest.json` file for
reference and future use. The details may be useful for debugging, or
the file can be ignored (an illustrative example appears after this list).

8. Automation scoops up the `manifest.json` file and archives it along with
the build logs.
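
Though the exact contents depend on the packer configuration, a `manifest.json`
emitted by packer's manifest post-processor generally looks something like the
following (all values below are made up):

```
{
  "builds": [
    {
      "name": "image_builder",
      "builder_type": "googlecompute",
      "build_time": 1600000000,
      "artifact_id": "image-builder-c12345",
      "packer_run_uuid": "00000000-0000-0000-0000-000000000000",
      "custom_data": null
    }
  ],
  "last_run_uuid": "00000000-0000-0000-0000-000000000000"
}
```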


## The Base Images (overview step 2)

VM Images in GCE depend upon certain Google-specific systemd services to be
running on boot. Additionally, in order to import external OS images,
Google requires a specific partition and archive file layout. Lastly,
importing images must be done indirectly, through [Google Cloud
Storage (GCS)](https://cloud.google.com/storage/docs/introduction). As with
the image-builder image, this process is mainly orchestrated by Packer:

1. A GCE VM is booted from the image-builder image, produced in *overview step 1*.

2. On the image-builder VM, the (upstream) generic-cloud images for each
distribution are downloaded and verified. *This is very networking-intense.*

3. The image-builder VM then boots (nested) KVM VMs for the downloaded
images. These local VMs are then updated and prepared with the
necessary packages and services as described above. *This
is very disk and CPU intense*.

4. All the automation-deities pray with us that the nested VMs set up
correctly and completely. Debugging them can be incredibly difficult
and painful.

5. Packer (running on the image-builder VM) shuts down the nested VMs
and performs the import/conversion process: creating compressed tarballs,
uploading them to GCS, then importing them as GCE VM images
(a conceptual sketch follows this list).

6. Packer deletes the VM, and writes the freshly created image names and other
metadata details into the `base_images/manifest.json` file for reference.

7. Automation scoops up the `manifest.json` file and archives it along with
the build logs.
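
Conceptually, the import performed in step 5 boils down to something like the
commands below. The bucket and image names are placeholders, and in practice
packer drives this process rather than anyone running it by hand:

```
# Placeholder names throughout; shown only to illustrate the GCS -> GCE flow.
tar --format=oldgnu -Sczf fedora.tar.gz disk.raw   # GCE expects a disk.raw inside a GNU-format tarball
gsutil cp fedora.tar.gz gs://my-import-bucket/fedora.tar.gz
gcloud compute images create fedora-base-c12345 \
    --source-uri gs://my-import-bucket/fedora.tar.gz
```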


## Container images (Also overview step 2)

In parallel with other tasks, several instances of the image-builder VM are
used to create container images. In particular, Fedora and Ubuntu
Cache-images. They are then automatically pushed to:

The meaning of *prior* (and not) is defined by the contents of the `*_release`
files within the `podman` subdirectory.


# Debugging / Locally driving VM Image production

Because the entire automated build process is containerized, it may easily be
performed locally on your laptop/workstation. However, this process will
still involve interfacing with GCP and GCS. Therefore, you must be in possession
of a *Google Application Credentials* (GAC) JSON file.

The GAC JSON file should represent a service account (contrasted to a user account,
which always uses OAuth2). The name of the service account doesn't matter,
but it must have the following roles granted to it:

* Compute Instance Admin (v1)
* Compute OS Admin Login
* Service Account User
* Storage Admin
* Storage Object Admin
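
For whoever administers the Google project, creating such an account and its
key file might look roughly like the commands below. The project and account
names are placeholders, and one `add-iam-policy-binding` call is needed per
role listed above:

```
# Placeholder project/account names; repeat the role binding for each role.
gcloud iam service-accounts create image-building --project my-project
gcloud projects add-iam-policy-binding my-project \
    --member "serviceAccount:image-building@my-project.iam.gserviceaccount.com" \
    --role "roles/compute.instanceAdmin.v1"
gcloud iam service-accounts keys create "$HOME/gac.json" \
    --iam-account "image-building@my-project.iam.gserviceaccount.com"
```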

Somebody familiar with Google IAM will need to provide you with the GAC JSON
file and ensure correct service account configuration. With this file
stored *in your home directory* on your laptop/workstation, producing images
proceeds as follows (a condensed example transcript appears after the list):

1. Invent some unique identity suffix for your images. It may contain (***only***)
lowercase letters, numbers and dashes; nothing else. Useful values would be
your name and today's date. If you manage to screw this up somehow, stern
errors will be presented without causing any real harm.

2. Ensure you have podman installed, and lots of available network and CPU
resources (i.e. turn off YouTube, shut down background VMs and other hungry
tasks). Build the image-builder container image, by executing
``make image_builder_debug GAC_FILEPATH=</home/path/to/gac.json> IMG_SFX=<UUID chosen in step 1>``

3. You will be dropped into a debugging container, inside a volume-mount of
the repository root. This container is practically identical to the VM
produced and used in *overview step 1*. If changes are made, the container
image should be re-built to reflect them.

4. Still within the container, again ensure you have plenty of network and CPU
resources available. Build the VM Base images by executing the command
``make base_images``. This is equivalent to the operation documented in
*overview step 2*. ***N/B:*** The GCS -> GCE image conversion can take
some time; be patient. Packer may not produce any output for several minutes
while the conversion is happening.

5. When successful, the names of the produced images will all be referenced
in the `base_images/manifest.json` file. If there are problems, fix them
and remove the `manifest.json` file. Then re-run the same *make* command
as before; packer will force-overwrite any broken/partially created
images automatically.

6. Produce the GCE VM Cache Images, equivalent to the operations outlined
in *overview step 3*. Execute the following command (still within the
debug image-builder container): ``make cache_images``.

7. Again when successful, you will find the image names are written into
the `cache_images/manifest.json` file. If there is a problem, remove
this file, fix the problem, and re-run the `make` command. No cleanup
is necessary, leftover/disused images will be automatically cleaned up
eventually.
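
Putting the above together, a complete local run might look roughly like this
(the GAC path and `IMG_SFX` value are examples only):

```
# 1. Launch the debug container (drops you into a shell inside it)
make image_builder_debug GAC_FILEPATH=$HOME/gac.json IMG_SFX=myname-20200916

# 2. From the shell inside the container, at the repository root:
make base_images                  # overview step 2; slow, network/CPU heavy
make cache_images                 # overview step 3
cat base_images/manifest.json     # names of the produced base images
cat cache_images/manifest.json    # names of the produced cache images
```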
