
Volume preparation using Salt formulas #1513

Merged

Conversation

Ebaneck
Contributor

@Ebaneck Ebaneck commented Aug 9, 2019

Component:

'salt', 'storage', 'kubernetes'

Context:

see: #1474

Summary:

A combination of Salt states and custom Salt modules should be able to take zero or one device as argument and prepare the specified volume; if no argument is passed, we should prepare all volumes from the pillar that are not yet prepared (see the sketch below).
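
To make the intended behaviour concrete, here is a minimal, self-contained Python sketch of that dispatch logic. All names are hypothetical and the pillar is passed explicitly so the example runs on its own; the real MetalK8s module lives under salt/_modules and uses Salt's injected __pillar__/__salt__ instead.

    # Illustrative sketch only, not the code from this PR.
    def _prepare_one(name, spec):
        """Placeholder for the real preparation logic (loop device setup, mkfs, ...)."""
        print("preparing volume {!r} of type {}".format(name, spec.get("type")))

    def prepare(pillar_volumes, volume=None):
        """Prepare one named volume, or every not-yet-prepared volume from the pillar."""
        if volume is not None:
            targets = {volume: pillar_volumes[volume]}
        else:
            targets = {
                name: spec for name, spec in pillar_volumes.items()
                if not spec.get("prepared", False)
            }
        for name, spec in targets.items():
            _prepare_one(name, spec)

    if __name__ == "__main__":
        volumes = {
            "vol-sparse": {"type": "sparseLoopDevice", "prepared": False},
            "vol-block": {"type": "rawBlockDevice", "prepared": True},
        }
        prepare(volumes)                # prepares only vol-sparse
        prepare(volumes, "vol-block")   # prepares vol-block explicitly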

Acceptance criteria:

  • Be able to successfully prepare loop devices and block devices as metalk8s volumes.
  • Ensure that prepared volumes are always available even after a restart.
  • Ensure that prepared volumes have a valid file system
  • Be able to write persistent data on all prepared volumes

Closes: #1474

@bert-e
Contributor

bert-e commented Aug 9, 2019

Hello ebaneck,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Status report is not available.

@bert-e
Contributor

bert-e commented Aug 9, 2019

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • one peer

Peer approvals must include at least 1 approval from the following list:

Contributor

@jbertran jbertran left a comment

Some formatting issues, some stray stuff. Please mark comments about stuff remaining to implement as TODO so we can clearly differentiate which comments are indicative and which are for ongoing items.

Review threads (resolved) on: salt/metalk8s/volume/init.sls, salt/metalk8s/volume/files/loopdevice.service.j2, salt/_states/metalk8s_volume.py, salt/metalk8s/volume/prepare/init.sls, salt/_modules/metalk8s_volume.py
@Ebaneck Ebaneck changed the title WIP: Volume preparation using salt formulars WIP: Volume preparation using salt formulas Aug 9, 2019
@Ebaneck Ebaneck force-pushed the feature/1474-prepare-volume-with-Salt-formulars branch from 497b71e to b784072 Compare August 12, 2019 07:52
@slaperche-scality slaperche-scality force-pushed the feature/1474-prepare-volume-with-Salt-formulars branch from b784072 to c783531 Compare August 13, 2019 09:48
@slaperche-scality
Contributor

slaperche-scality commented Aug 13, 2019

Current state:

  • Prepare a sparseLoopDevice
  • Prepare a rawBlockDevice
  • Calling the state with no arguments prepares all the volumes of the node
  • Handle reboot for sparseLoopDevice

Contributor

@NicolasT NicolasT left a comment

Good stuff, some comments.

Review threads (resolved) on: salt/_modules/metalk8s_volumes.py, salt/metalk8s/salt/minion/files/minion-99-metalk8s.conf.j2, salt/metalk8s/volumes/prepared/init.sls, salt/metalk8s/volumes/prepared/installed.sls
@slaperche-scality slaperche-scality force-pushed the feature/1474-prepare-volume-with-Salt-formulars branch from c53cc81 to 3dd1c8a Compare August 13, 2019 15:33
@slaperche-scality
Contributor

We now handle the reboot for sparseLoopDevice, so the PR is no longer a draft.

This new patchset started to address some of the comments, but a new patchset is coming soon to address the remaining ones: no need to re-review now (you can wait for the next patchset).

@slaperche-scality slaperche-scality marked this pull request as ready for review August 13, 2019 15:35
@slaperche-scality slaperche-scality force-pushed the feature/1474-prepare-volume-with-Salt-formulars branch from 3dd1c8a to e0bedd3 Compare August 13, 2019 22:53
@NicolasT NicolasT changed the title WIP: Volume preparation using salt formulas Volume preparation using Salt formulas Aug 14, 2019
Review threads (resolved) on: salt/_modules/metalk8s_volumes.py, salt/_states/metalk8s_volumes.py, salt/metalk8s/volumes/prepared/init.sls, storage-operator/pkg/salt/client.go
@slaperche-scality slaperche-scality force-pushed the feature/1474-prepare-volume-with-Salt-formulars branch from e0bedd3 to 7c61b1b Compare August 16, 2019 07:53
@slaperche-scality
Contributor

Ok, new version ready for review.

This time everything was addressed.

There are two new commits that may require your attention:

  1. ba91f00: this one handles the issue of "we can't run two states in parallel"
  2. be6d91d: as discussed on Slack, this one tries to call into libblkid directly instead of calling the shell command

About ①, I'm wondering if we could use parallel states, but to do so we would need to move the installation of the packages out of this state (where could we put them?).

@slaperche-scality slaperche-scality force-pushed the feature/1474-prepare-volume-with-Salt-formulars branch from 232cb52 to 42410e6 Compare August 19, 2019 09:29
Ebaneck and others added 14 commits August 19, 2019 13:29
This commit adds a minion startup state to ensure that
all loop devices are always created when a minion reboots.

Note that we now always need to specify a saltenv, so we update the
bootstrap accordingly.

Closes: #1474
Since the implemented behavior differs from the usual one, this may be
surprising to people reading the code.
Let's use an explicit method instead.

Refs: #1474
Signed-off-by: Sylvain Laperche <[email protected]>
We don't need `--show` because we aren't using the output and we don't
need `--partscan` because we aren't expecting (and don't want to handle)
partitions here.

Refs: #1474
Signed-off-by: Sylvain Laperche <[email protected]>
Formatting being a destructive operation which can brick an OS and/or
cause data loss, we want to be really careful here.

This commit adds a check to prevent formatting an already formatted
device; that way, if someone misspells `/dev/sdb` as `/dev/sda`, they
won't destroy their node :]

Refs: #1474
Signed-off-by: Sylvain Laperche <[email protected]>
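
As an illustration of that safeguard (hypothetical names, not the actual MetalK8s code), one can simply refuse to format anything that already carries a signature; the `blkid` return-code convention used here is the one discussed a few commits below.

    # Illustrative sketch of the safeguard, assuming util-linux `blkid` is installed.
    import subprocess

    def has_existing_signature(device):
        """Return True if low-level probing finds any filesystem/partition signature."""
        result = subprocess.run(
            ["blkid", "--probe", device],
            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        )
        # 0 => a signature was found; 2 is commonly "nothing found" (see the
        # libblkid commit below for a stricter check).
        return result.returncode == 0

    def safe_format(device, fs_type="ext4"):
        if has_existing_signature(device):
            raise RuntimeError(
                "{} already contains data, refusing to format it".format(device)
            )
        subprocess.run(["mkfs." + fs_type, device], check=True)
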
- don't use `mkfs` (it's deprecated), use the filesystem-specific utilities
  (such as `mkfs.ext4`, `mkfs.xfs`, …) instead
- as a consequence, check that the given FS type is supported
- for `sparseLoopDevice`, format the sparse file instead of the
  associated loop device (less racy: we're sure we are formatting
  what we want to) => replace the `block_device` property with `path`.

Refs: #1474
Signed-off-by: Sylvain Laperche <[email protected]>
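
A hedged sketch of how such an FS-type check could look (hypothetical helper and whitelist, not the PR's implementation):

    import shutil

    SUPPORTED_FS = ("ext4", "xfs")   # assumed whitelist for this sketch

    def mkfs_command(fs_type, path):
        """Build the filesystem-specific mkfs command line, or raise."""
        if fs_type not in SUPPORTED_FS:
            raise ValueError("unsupported filesystem type: {}".format(fs_type))
        executable = "mkfs." + fs_type
        if shutil.which(executable) is None:
            raise RuntimeError("{} is not installed".format(executable))
        return [executable, path]

    # e.g. mkfs_command("xfs", "/dev/sdb") -> ["mkfs.xfs", "/dev/sdb"]
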
When creating a sparse loop device, the formatting needs to happen
**before** associating the sparse file to a loop device (otherwise `udev`
won't create the symlink under `/dev/disk/by-uuid`).

Since this step is now the last one, let's rename `initialization` to
`finalization`.

Refs: #1474
Signed-off-by: Sylvain Laperche <[email protected]>
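
The ordering described above, as a minimal sketch (illustrative shell-outs from Python; the path reuses the example from these commits and the size is a placeholder):

    import subprocess

    sparse_file = "/var/lib/metalk8s/storage/sparse/example"   # placeholder

    # 1. create the sparse file
    subprocess.run(["truncate", "--size", "10G", sparse_file], check=True)
    # 2. format the *file* first, so that when the loop device is attached
    #    udev sees the filesystem and creates /dev/disk/by-uuid/<uuid>
    #    (-F: proceed even though the target is a regular file)
    subprocess.run(["mkfs.ext4", "-F", sparse_file], check=True)
    # 3. only then associate it with a loop device (no --show, no --partscan)
    subprocess.run(["losetup", "--find", sparse_file], check=True)
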
This command is not reliable: it cannot get the FS type from a sparse
file (fair enough), but in such a case it doesn't even fail reliably, so
you can't even know that something went wrong…

Calling `salt-call disk.fstype /var/lib/metalk8s/storage/sparse/example`
returns "/dev/sda1      ext4  41152736 7754388  31284864  20% /"…
(because it falls back on `df` when `lsblk` fails and that's what `df`
returns…)

Even on an actual block device, it returns garbage if it's not
formatted:
`salt-call disk.fstype /dev/sdb`
returns "devtmpfs       devtmpfs   1932084     0   1932084   0% /dev"

Let's use `blkid` instead…

Refs: #1474
Signed-off-by: Sylvain Laperche <[email protected]>
Since Salt states have a mandatory name parameter, let's use it to pass
the volume name instead of adding an extra parameter.

Refs: #1474
Signed-off-by: Sylvain Laperche <[email protected]>
If we want to be really sure that a device has been formatted by us, we
shouldn't check the FS type but the UUID.

In fact, relying on the FS type for `is_formatted` can be wrong:

- a Volume V is created with the StorageClass SC, which has `fsType` set
  to `xfs`
- we format the device with XFS
- later, the StorageClass SC is deleted and someone reuses the name to
  create a new StorageClass SC which has `fsType` set to `ext4`
- the server reboots and we reprovision all the volumes declared on the
  node; when the pillar is updated, it stores the volume V and embeds the
  content of the (new) StorageClass SC
- when we check whether the device FS type (xfs) matches the expected one
  (from the StorageClass, now `ext4`), we get a mismatch and think the
  device is not formatted
- we then call `format`; here the safeguard detects that the device is
  already formatted, so we bail out and avoid data loss, but we have put
  the Volume V in a Failed state when everything was fine: that's not
  good…

So yeah, we really should rely on the UUID instead of the FS type.

Note that this only holds for `is_formatted`: the safeguard in `format`
still needs to check whether we already have something on the device
before formatting, to avoid data loss.

Refs: #1474
Signed-off-by: Sylvain Laperche <[email protected]>
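
In code, the check described above boils down to comparing UUIDs rather than FS types; a tiny hypothetical sketch (the device UUID would come from something like the libblkid wrapper in the next commit, and the UUIDs below are placeholders):

    def is_formatted(expected_uuid, device_uuid):
        """The device counts as formatted by us iff its filesystem UUID matches
        the one recorded for the Volume, regardless of the fsType currently
        declared by the (possibly recreated) StorageClass."""
        return (
            device_uuid is not None
            and device_uuid.lower() == expected_uuid.lower()
        )

    # An xfs device we formatted earlier still reports as formatted even if
    # the StorageClass now says ext4.
    assert is_formatted("0e9f1b0a-0000-4000-8000-000000000000",
                        "0E9F1B0A-0000-4000-8000-000000000000")
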
This commit adds a new Python module to call directly into the blkid
library (using ctypes) instead of relying on the `blkid` command.

Why do we do that?
==================

Because we want to be as sure as possible that when `blkid` returns no
information it means "there is no info" and not "maybe there is some
info but I couldn't get it" (because if we mix the two cases, we may
actually format something because we thought "oh, there is nothing,
go ahead" whereas the reality is "there was data but I didn't see it
for some reason…").

Relying on `return code 2 == no data` seems kinda safe, but it's not
100% clear from the man page. Now, we could check the source code (and I
did), but that would only be valid for the version we looked at.

So let's call directly into the library, making sure that any failing API
call results in an exception; thus, if we end up with "no data", that
truly means "everything went well, but there was nothing there".

Refs: #1474
Signed-off-by: Sylvain Laperche <[email protected]>
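
A minimal ctypes sketch of that approach, assuming libblkid's documented blkid_new_probe_from_filename / blkid_do_safeprobe / blkid_probe_lookup_value entry points (the module added by this PR is more complete; this only shows the "error => exception, None => genuinely nothing there" contract):

    import ctypes
    import ctypes.util

    _lib = ctypes.CDLL(ctypes.util.find_library("blkid") or "libblkid.so.1")
    _lib.blkid_new_probe_from_filename.restype = ctypes.c_void_p
    _lib.blkid_new_probe_from_filename.argtypes = [ctypes.c_char_p]
    _lib.blkid_do_safeprobe.argtypes = [ctypes.c_void_p]
    _lib.blkid_probe_lookup_value.argtypes = [
        ctypes.c_void_p,
        ctypes.c_char_p,
        ctypes.POINTER(ctypes.c_char_p),
        ctypes.POINTER(ctypes.c_size_t),
    ]
    _lib.blkid_free_probe.argtypes = [ctypes.c_void_p]
    _lib.blkid_free_probe.restype = None

    def probe_uuid(device):
        """Return the filesystem UUID of `device`, or None if nothing is there.

        Any probing error raises, so a None return really means "probed fine,
        no signature found" and never "couldn't look".
        """
        probe = _lib.blkid_new_probe_from_filename(device.encode())
        if not probe:
            raise OSError("cannot open {} for probing".format(device))
        try:
            ret = _lib.blkid_do_safeprobe(probe)
            if ret == 1:          # documented as "nothing was detected"
                return None
            if ret != 0:          # ambivalent result or error
                raise OSError("blkid_do_safeprobe failed on {}".format(device))
            value = ctypes.c_char_p()
            size = ctypes.c_size_t()
            rc = _lib.blkid_probe_lookup_value(
                probe, b"UUID", ctypes.byref(value), ctypes.byref(size)
            )
            if rc != 0:
                return None       # something is there, but it carries no UUID
            return value.value.decode()
        finally:
            _lib.blkid_free_probe(probe)
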
Call the real Salt state to prepare the volume (`metalk8s.volumes`)
instead of calling `test.rand_sleep`.

Refs: #1474
Signed-off-by: Sylvain Laperche <[email protected]>
Don't hardcode saltenv, instead compute it from the node's MetalK8s
version.
We use the node's version instead of the cluster one because that's also
the one used by the Salt minion.

Refs: #1474
Signed-off-by: Sylvain Laperche <[email protected]>
Until now, we used to rely on the value of `success` to know if a job
was completed successfully or not.
Unfortunately, that's not good enough: you can get `success: true` while
nothing was actually run…
See saltstack/salt#4002

This is a problem because if we think that the `PrepareVolume` step was
a success, we will then try to get the size of the device with
`disk.dump`, but if `PrepareVolume` didn't run then the device doesn't
exist and `disk.dump` will fail => we wrongly move the Volume into a
failed state.

And we easily encounter this issue by creating several volumes at
once / within a short time interval.

Try to spawn two state executions at the same time:

    salt-call state.sls metalk8s.volumes saltenv=metalk8s-2.4.0-dev pillar='{"volume": "foo"}'&
    salt-call state.sls metalk8s.volumes saltenv=metalk8s-2.4.0-dev pillar='{"volume": "bar"}'

You'll get:
    local:
        Data failed to compile:
    ----------
        The function "state.sls" is running as PID 32544 and was started at 2019, Aug 14 21:45:18.338129 with jid 20190814214518338129

And when it's done by the operator through the API, here is what we get
for the second job:

    "bootstrap": {
      "return": [
        "The function \"state.sls\" is running as PID 22874 and was started at 2019, Aug 14 20:47:46.622817 with jid 20190814204746622817"
      ],
      "retcode": 1,
      "success": true,
      "out": "highstate"
    }

`success` is true, but `retcode` is non-zero…

So, to be 100% sure that the spawned job succeeded we need to check the
return code.

Why not ignore `success` and rely only on `retcode`?
Because we want to tell the difference between:
- the job ran but failed (success == false): move the Volume into Failed
  state
- the job failed to run (success == true but retcode != 0), probably
  because one was already running: we can retry later

Note that it seems possible to run states in parallel[1], but our state
may install packages, and that seems to be a dangerous thing to do in
parallel, so let's avoid it…
(But maybe we should move the dependency installation out of this state;
that would probably make sense…)

[1]: https://docs.saltstack.com/en/latest/ref/states/parallel.html

Refs: #1474
Signed-off-by: Sylvain Laperche <[email protected]>
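
The operator-side check itself lives in Go (storage-operator/pkg/salt/client.go); purely to illustrate the decision table above, here is the same logic rendered in Python:

    def classify_job_result(job_return):
        """Decide what to do with a Salt job return such as the JSON above."""
        success = job_return.get("success", False)
        retcode = job_return.get("retcode", 1)
        if success and retcode == 0:
            return "succeeded"      # the state actually ran and passed
        if success and retcode != 0:
            return "retry-later"    # e.g. another state.sls run was already in progress
        return "failed"             # the state ran and failed: Volume -> Failed

    # Using the sample payload quoted in this commit message:
    sample = {"retcode": 1, "success": True, "out": "highstate"}
    assert classify_job_result(sample) == "retry-later"
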
Because I mainly test the operator outside of the cluster (and thus
running as admin:admin) I forgot to update those permissions when I
moved from `test.ping` to real Salt states.

We also need `@jobs` to poll the state of async jobs.

We really need some automated testing here ^^"

Refs: #1474
Signed-off-by: Sylvain Laperche <[email protected]>
@slaperche-scality slaperche-scality force-pushed the feature/1474-prepare-volume-with-Salt-formulars branch from 42410e6 to 77ef94a Compare August 19, 2019 11:52
In retrospect, using the complete state `metalk8s.volume.prepared`,
while it works, isn't a great idea.

Using a dedicated state that only does the "provisioning" has several
benefits:
- it's faster (we don't need to check for dependencies, check if disks
  are formatted, …)
- it's safer (we're sure that we aren't going to trigger costly formatting
  or things like that)
- it can be run in parallel (thus, it won't block the execution of other
  states while it's running).

Refs: #1474
Signed-off-by: Sylvain Laperche <[email protected]>
@slaperche-scality slaperche-scality force-pushed the feature/1474-prepare-volume-with-Salt-formulars branch from 77ef94a to 3855fe1 Compare August 19, 2019 14:04
@NicolasT
Contributor

I think this can be +1'd?

@bert-e
Contributor

bert-e commented Aug 20, 2019

Not Author

I'm afraid I cannot do that, @slaperche-scality:

Only the author of the pull request @Ebaneck can use this command.

Please delete the corresponding comment so I can move on.

The following options are set: wait

@Ebaneck
Contributor Author

Ebaneck commented Aug 20, 2019

/approve

@scality scality deleted a comment from Ebaneck Aug 20, 2019
@bert-e
Contributor

bert-e commented Aug 20, 2019

In the queue

The changeset has received all authorizations and has been added to the
relevant queue(s). The queue(s) will be merged in the target development
branch(es) as soon as builds have passed.

The changeset will be merged in:

  • ✔️ development/2.4

The following branches will NOT be impacted:

  • development/1.0
  • development/1.1
  • development/1.2
  • development/1.3
  • development/2.0
  • development/2.1
  • development/2.2
  • development/2.3

There is no action required on your side. You will be notified here once
the changeset has been merged. In the unlikely event that the changeset
fails permanently on the queue, a member of the admin team will
contact you to help resolve the matter.

IMPORTANT

Please do not attempt to modify this pull request.

  • Any commit you add on the source branch will trigger a new cycle after the
    current queue is merged.
  • Any commit you add on one of the integration branches will be lost.

If you need this pull request to be removed from the queue, please contact a
member of the admin team now.

The following options are set: approve

@bert-e
Contributor

bert-e commented Aug 20, 2019

I have successfully merged the changeset of this pull request
into targeted development branches:

  • ✔️ development/2.4

The following branches have NOT changed:

  • development/1.0
  • development/1.1
  • development/1.2
  • development/1.3
  • development/2.0
  • development/2.1
  • development/2.2
  • development/2.3

Please check the status of the associated issue None.

Goodbye ebaneck.

@bert-e bert-e merged commit 3855fe1 into development/2.4 Aug 20, 2019
@bert-e bert-e deleted the feature/1474-prepare-volume-with-Salt-formulars branch August 20, 2019 12:47