Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow vic-machine configure to set appropriate roles for ops user #7777

Merged
merged 2 commits into from
Apr 25, 2018

Conversation

jzt
Copy link
Contributor

@jzt jzt commented Apr 18, 2018

This change fixes a bug in vic-machine configure that was preventing a VCH installed without an ops user enabled to be reconfigured to do so. With this change, you can now run vic-machine configure against an existing VCH to configure the ops user credentials and permissions.

Fixes #7725, #7796

@codecov-io
Copy link

codecov-io commented Apr 18, 2018

Codecov Report

Merging #7777 into master will increase coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #7777      +/-   ##
==========================================
+ Coverage   25.85%   25.86%   +0.01%     
==========================================
  Files          35       35              
  Lines        5124     5122       -2     
==========================================
  Hits         1325     1325              
+ Misses       3692     3690       -2     
  Partials      107      107
Impacted Files Coverage Δ
cmd/vic-machine/common/ops_credentials.go 63.79% <100%> (ø) ⬆️
cmd/vicadmin/vicadm.go 2.82% <0%> (ø) ⬆️
cmd/vic-machine/create/create.go 42.51% <0%> (+0.07%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b6355f0...8325d30. Read the comment docs.

@@ -327,7 +328,7 @@ func (c *Configure) Run(clic *cli.Context) (err error) {

validator, err := validate.NewValidator(op, c.Data)
if err != nil {
op.Errorf("Configuring cannot continue - failed to create validator: %s", err)
op.Errorf("Configure cannot continue - failed to create validator: %s", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

err = d.update(conf, settings, isConfigureOp)

// If successful try to grant permissions to the ops-user
// try to grant permissions to the ops-user
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: Grant permissions to the ops-user before initializing the appliance

@@ -28,6 +28,8 @@ const (
// create time is stored in nanoseconds (previously seconds) in the portlayer.
ContainerCreateTimestampVersion

VMFolderSupportVersion
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: s/VMFolderSupportVersion/VCHFolderSupportVersion perhaps?

A short comment to describe this field for posterity would be nice to have.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I wondered whether it was a catch-all (VM) or exclusive to the VCH. I'll clarify.

@jzt jzt force-pushed the ops/7725 branch 4 times, most recently from 1964072 to 3e2b28e Compare April 19, 2018 19:53
@@ -120,6 +126,8 @@ func (d *Dispatcher) Configure(vch *vm.VirtualMachine, conf *config.VirtualConta
}
}

err = d.update(conf, settings, isConfigureOp)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the reason that this happens before is that there's no way to rollback opsuser.GrantOpsUserPerms if this fails.

Copy link
Member

@zjs zjs Apr 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To summarize this issue:

Today, configure (including upgrade) is organized in the following way:

  1. A new configuration for the VCH is constructed based on the old configuration
    and requested changes.
  2. The new configuration is validated.
  3. The updated version of any ISO files being changed are uploaded.
  4. Resource settings are updated, and a defer is registered to undo the change
    if subsequent steps fail.
  5. A snapshot is taken.
  6. The VCH is updated:
    a. If running, it is powered off.
    b. If volume stores are being added, those are created.
    c. The VCH itself is reconfigured.
    d. The VCH is powered on.
    e. We wait for the VCH to start and the service to be ready.
  7. If the update is successful, we update the operations user.
  8. If either the update or the operations user update is unsuccessful, we undo:
    a. We rollback to the snapshot we took.
    b. We delete uploaded images.
    c. We delete the snapshot we took.
    d. The defer undoes resource settings updates.

This process has a key characteristic: if a failure occurs, we cleanly return to
the initial state.

However, this process also has a key limitation: the process will fail at (6)(d)
or (6)(e) if the operations user requires additional privileges for the basic
operation of the VCH following the configuration change or upgrade.

A simple solution to this might be to reverse the order of steps 6 and 7: update
the operations user before updating the VCH. This will address the limitation,
but will weaken the guarantee around cleanly returning to the initial state when
a failure occurs; it's not clear how to undo the changes to the operations user.

The effects of this are twofold:

  1. Following the rollback of a change where additional privileges were granted
    to the operations user, those additional privileges are left as "cruft".
  2. Following the rollback of a change where privileges were revoked from the
    operations user, the VCH will not start.

Options to improve this might include:

  1. Looking into ways to undo changes to the operations user, perhaps following a
    similar pattern to the resource settings. This requires reading the old state
    from the system before making changes, including both the role-privilege map
    and resource-role associations.
  2. Moving the operations logic in between (6)(c) and (6)(d) instead of between
    (5) and (6). This reduces the cases where we will leave cruft behind.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not extremely familiar with this code. But I believe that we must determine what permissions the ops-user has before we begin to assemble the perms that are going to be needed in order to function. If we store that we could defere a configure back to those original permissions. @zjs why not make the ops-user changes before creating the volume-stores as well? They are created by vic-machine, but if we fail to assign the ops-user that will leave 1 less set of things to clean up.

In the event of a failure what is the downside to returning the supplied ops-user to it's original permissions set?

Copy link
Member

@zjs zjs Apr 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the event of a failure what is the downside to returning the supplied ops-user to it's original permissions set?

Returning the operations user to the original permissions set is the desired behavior.

But I believe that we must determine what permissions the ops-user has before we begin to assemble the perms that are going to be needed in order to function. If we store that we could defere a configure back to those original permissions.

This is a possible approach, but actually extracting the "before" state is non-trivial. We need to track the changes being made to the privileges for each role as well as the roles being applied to each resource. Because of the way the vSphere APIs work, it's somewhat complicated to read all of that information. Because we use share the same set of roles for multiple operations users, safely rolling back in the presence of concurrent upgrade operations may not be straight-forward.

why not make the ops-user changes before creating the volume-stores as well? They are created by vic-machine, but if we fail to assign the ops-user that will leave 1 less set of things to clean up.

Currently, the operations user changes is the hardest thing to roll back (now tracked by #7814), so we do those as late in the process as possible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added #7814 (comment) to describe an alternate pattern, which is likely to be less work than tracking the old privileges/roles.

Copy link
Member

@zjs zjs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think #7777 (review) needs to be resolved before merging this.

@jzt
Copy link
Contributor Author

jzt commented Apr 23, 2018

I've moved the logic that grants the ops user permissions into the update function itself, right before the poweron step. This should narrow (possibly close) the window for errors between granting the ops user permissions and powering on the VM, reducing the likelihood of leaving the crufty additional privileges lying around after a rollback. It does not fully eliminate this possibility, however, so a new issue has been created here that @zjs will introduce in his changes for 1.5.

@jzt jzt force-pushed the ops/7725 branch 3 times, most recently from b39ca1e to 3420762 Compare April 23, 2018 20:42
@@ -204,6 +204,7 @@ func (c *Configure) copyChangedConf(o *config.VirtualContainerHostConfigSpec, n
if c.OpsCredentials.IsSet {
o.Username = n.Username
o.Token = n.Token
o.GrantPermsLevel = n.GrantPermsLevel
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether this is the behavior we want. (I'm also not sure that it isn't.)

By including o.GrantPermsLevel = n.GrantPermsLevel in this check, if an administrator uses vic-machine configure to change an operations user's password which previously had a GrantPermsLevel of AddPerms without explicitly including --ops-grant-perms in that command we'll actually downgrade GrantPermsLevel to "". I think that would be surprising behavior: change the password (e.g., due to expiration) and no longer have permissions be automatically managed.

It may be better to have a check here along the lines of clic.IsSet("ops-grant-perms") so that we only adjust GrantPermsLevel if --ops-grant-perms or --ops-grant-perms=false is included on the command line. (To do this, you'd just need to pass clic *cli.Context into the copyChangedConf method.) In that case, we only change the properties a user has asked us to change.

A downside of both the current implementation and this suggestion (and therefore an upside of the proposed change) is that if you change to a completely new operations user we won't clear the GrantPermsLevel. This could also be surprising: this is a case where the user might expect us to change something even though they haven't asked us to (because they may see --ops-grant-perms as being associated with the user, not the VCH). Unfortunately, the way configure is designed really doesn't give us a lot of options here (and leads to similar issues for other settings).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. There was no clean way to get the name of the option out of the cli flags, nor did there seem to be much reason to declare a constant for it all on its own, but if you find it irksome, I can pull it out as a constant anyway.

Copy link
Member

@zjs zjs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just minor comments on test code.

Log Govc output: ${output}
Should Be Equal As Integers ${rc} 1
Should Contain ${output} Permission to perform this operation was denied
Attempt To Create Resource Pool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be Attempt To Disable DRS, can't it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be, but it was basically a merge of your additions and @anchal-agrawal's additions, so I left both Attempt To Create Resource Pool and Attempt To Disable DRS in there for variety. Still trying to get granted ops-user perms work after upgrade to pass, so I'll clean it up as I go forward.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, disabling DRS requires less privilege than creating a resource pool. Usually disabling DRS makes a good sanity check that we're not granting the operations user privileges they don't need.

In the specific case that we're using --affinity-vm-group, we currently have to grant the operations user the privilege that lets them disable DRS (as that's the same privilege that lets them manage affinity constructs), so we check creating a resource pool instead.


Run Privileged Commands

Cleanup VIC Appliance On Test Server
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing newline at end of file.

Copy link
Contributor

@mdubya66 mdubya66 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved for cherry pick into 1.4

@mdubya66 mdubya66 merged commit f243850 into vmware:master Apr 25, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

vic-machine configure --ops-grant-perms doesn't create roles or assign permissions
8 participants