Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cap IOPS when calculating from iopsPerGB #809

Merged
merged 4 commits into from
Apr 13, 2021

Conversation

jsafrane
Copy link
Contributor

@jsafrane jsafrane commented Mar 23, 2021

Is this a bug fix or adding new feature?
bugfix

What is this PR about? / Why do we need it?
Consider AWS IOPS limits when calculating IOPS from iopsPerGB to always provision a supported volume. Users should be create a StorageClass "fast" that provides max. IOPS that AWS can support for any volume size they provide, without errors.

Without the PR, users must calculate iopsPerGB * PVC size and choose the right PVC size where resulting IOPS does not violate AWS IOPS limits. For example, with iopsPerGB = 10 and PVC size 5 GB, they get error Volume iops of 50 is too low; minimum is 100.

Please check the current AWS limits carefully. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html is not exactly clear what's max IOPS per GB for io2 volumes - on one place it's 500:1 and on another 1000:1, 500 seems to be the right ratio.

What testing is done?
Wrote few new unit tests, checking only IOPS in CreateVolumeInput.

  • Tested with 4GB io1 volume with 1, 50 and 500 IOPS per GB, got volume 100, 100 and 200 IOPS.
  • Tested with 4GB io2 volume with 1, 50, 500 and 1000 IOPS per GB, got volume 100, 200, 2000 and 2000 IOPS.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 23, 2021
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 23, 2021
@coveralls
Copy link

coveralls commented Mar 23, 2021

Pull Request Test Coverage Report for Build 1771

  • 51 of 53 (96.23%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.2%) to 81.972%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/driver/controller.go 20 22 90.91%
Totals Coverage Status
Change from base Build 1768: 0.2%
Covered Lines: 1787
Relevant Lines: 2180

💛 - Coveralls

@jsafrane
Copy link
Contributor Author

/retest

@jsafrane
Copy link
Contributor Author

/home/prow/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/tests/e2e/dynamic_provisioning.go:535
Expected
    <string>: topology.kubernetes.io/zone
to equal
    <string>: topology.ebs.csi.aws.com/zone
/home/prow/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/tests/e2e/testsuites/testsuites.go:369

Did I break it?

@ayberk
Copy link
Contributor

ayberk commented Mar 23, 2021

That looks related to the recent change I made. I'll take a look today.

Copy link
Contributor

@ayberk ayberk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @AndyXiangLi for verifying EBS limits.

func capIOPS(requestedCapacityGiB int64, requstedIOPSPerGB, minTotalIOPS, maxTotalIOPS, maxIOPSPerGB int64) int64 {
iops := requestedCapacityGiB * requstedIOPSPerGB

if iops < minTotalIOPS {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a big fan of this change. This might end up in a surprise bill for the customer, so if IOPS is smaller than minimum we should just return and explicit error instead of increasing it.

Copy link
Contributor

@wongma7 wongma7 Mar 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider the alternative as described in PR description is forcing the user to inspect the storageclass then edit their PVC to be larger which also incurs a larger bill. Using example in PR description, two scenarios:

  1. We accept this PR and always increase iops to the floor: 5GiB at 100iops is 5*.125+100*.065 = $7.125/month.
  2. We force user to set the iops to the floor: 10GiB at 100 iops is 10*.125+100*.065 = $7.75/month (+ the cost of their time to visit aws docs and pull out their calculator).

This is just one example **, I don't really want to put the effort into drawing graphs. I think as long as we document this behavior it will be fine. Admins will appreciate they do not have to estimate their users' behavior (if they are on a tight budget they can just set iopspergb to 1 and call it a day) and users do not have to pull out their calculators.

So atm I am inclined to accept the PR.

edit: cost per gb is .065 not .0625

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's my reasoning. If a user expects their volume to cost, say, $5 with the values they're setting, it becomes a problem at scale. The difference on paper is only $2, but what if you have 1000 of such volumes across 10 regions? Suddenly your bill is 2 x 10 x 1000 more than you expect. Am I missing something?

We can include the limits in the error message (or even a direct link to the docs) to make it a little bit easier on the user. I understand it's still not a great user experience, but I think it's preferable to having a surprise bill.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be safe, my idea is we don't make decision for customers, customer can adjust the iopsPerGB or volume size per their requirement as long as we deliver the message clearly (Like Ayberk mentioned, attach the doc link in the error message, etc).

Copy link
Contributor

@wongma7 wongma7 Mar 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user is mistaken in their $5 calculation if it's predicated on their volume having an impossible GB+IOPS configuration. The choice is not between $7 and $5, it's between $7 (increase both size and iops), $6 (increase only iops), and not running the Pod w/ EBS at all. So really we are debating whether we should let the user know about their mistake or fix it on their behalf. I prefer the latter as long as we document it, let me spell it out.

(Keep in mind on a multi-tenant cluster the person creating PVCs doesn't necessarily have permissions to create StorageClasses. Also the above numbers are imaginary but you get my point)

Say the driver provisioning for StorageClass A with iopspergb x is presented with an impossible GB+IOPS configuration. To remedy the situation, something/somebody must:

  1. Increase SIZE and IOPS of volume such that its IOPS is greater than the minimum.
  2. Increase IOPS of volume such that its IOPS is greater than the minimum.
  3. Decide that fixing the volume to be a valid configuration would be too expensive, so do nothing, and use EFS instead or just not run their workload.

Note option 1) is ALWAYS more expensive than option 2).

If the driver throws an error and refuses to create the volume (like today), option 2 is unavailable. Because for StorageClass A there is no way for them to control iops except by increasing size of volume. So if they want to run their stateful workload they are forced to go with the more expensive option 1

If the driver fixes the volume for the user (if we accept this PR as-is), option 3 is unavailable. I don't think this option 3 would be that popular. I think people would prefer their workload to run w/ EBS than not.

IF this is so contentious, we can make it a flag that defaults to current behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am open to remove increasing the IOPS, however, but it must come with a very clear error message. The current one is very confusing. With 4 GiB volume and 1 IOPS per GB, I get:

could not create volume in EC2: InvalidParameterValue: Volume iops of 4 is too low; minimum is 100.

But there is no iops setting in the storage class. Only iopsPerGB.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

talked to @msau42 on slack

What about this:

  • Let's cap max IOPS to get within the supported range. I don't see any controversy here.
  • For CSI storage classes, do not increase iops to the supported range and return error (= as it is today, only with a nicer error message).
  • However, increase the iops for storage classes migrated from in-tree, to preserve existing behavior.

How to distinguish these two in CreateVolume?

  • CSI translation lib can add alpha.csi.storage.k8s.io/migrated parameter to CSI StorageClass when it's migrated from in-tree. Note: this translation is in-memory, in the external-provisioner. It's not stored in API server.
  • The driver sees alpha.csi.storage.k8s.io/migrated parameter in CreateVolume.
  • When users create a CSI StorageClass with alpha.csi.storage.k8s.io/migrated parameter, it's their problem. We will not document the parameter and we reserve the right to remove it in the future. That's why the alpha prefix, even though it will be there probably forever :-(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When users create a CSI StorageClass with alpha.csi.storage.k8s.io/migrated parameter, it's their problem. We will not document the parameter and we reserve the right to remove it in the future. That's why the alpha prefix, even though it will be there probably forever :-(

This also has a nice side-effect of enabling customers to bypass the error by adding this. Feels dirty, but I like it :P

I am all for this approach. It's the best of both worlds: we avoid the surprise bill issue and we don't change the behavior for transitioning volumes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not make it an explicit parameter then like ignoreIopsTooLow (terrible name but ya)? If you introduce this hack then someone will rely on it and complain, doesn't amtter if it says alpha. migration would silently add the parameter, but for new StorageClasses admin will have the choice whether to err on side of ensuring all PVCs get a PV even if overprovisioned v.s. ensuring all PVCs either get an exactly-provisoined PV or no PV at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added parameter allowAutoIOPSPerGBIncrease (naming is hard!) + unit tests + documentation, PTAL.

pkg/cloud/cloud.go Show resolved Hide resolved
Copy link
Contributor

@wongma7 wongma7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with this as long as we edit README too

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jsafrane, wongma7

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ayberk
Copy link
Contributor

ayberk commented Mar 23, 2021

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 23, 2021
@ayberk
Copy link
Contributor

ayberk commented Mar 23, 2021

/retest

@jsafrane
Copy link
Contributor Author

BTW, my motivations for this PR are:

  • Be in sync with in-tree.
  • Better usability.
    • Admins can prepare "default" storage class (gp2 / gp3?) and something "fast" (fast for every volume that users create, regardless of the size).
    • Users should not do any size+iops calculations at all. It's not users who actually create PVCs these days, they use StatefulSets, helm charts or other software.

(I don't like iops parameter in gp3 volumes because of this - users MUST do the calculations there. The error messages when users get out of the supported bounds are not exactly clear. And a storage class with 10 000 IOPS is fast for some volumes and super slow for others.)

@jsafrane
Copy link
Contributor Author

Updated README.

io1MaxTotalIOPS = 64000
io1MaxIOPSPerGB = 50
io2MinTotalIOPS = 100
io2MaxTotalIOPS = 256000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be 64k, similar to io1. EBS team told us 256k is for block express, which is currently opt-in, so for most customers the limit is still 64k.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lowered to 64k.

@jsafrane
Copy link
Contributor Author

/test pull-aws-ebs-csi-driver-unit
connection reset during vendor check

@@ -154,6 +155,8 @@ func (d *controllerService) CreateVolume(ctx context.Context, req *csi.CreateVol
if err != nil {
return nil, status.Errorf(codes.InvalidArgument, "Could not parse invalid iopsPerGB: %v", err)
}
case AllowAutoIOPSPerGBIncreaseKey:
allowIOPSPerGBIncrease = value == "true"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this case insensitive? Since the keys are, it'd be more consistent.

createType = diskOptions.VolumeType
iops = capacityGiB * int64(diskOptions.IOPSPerGB)
iops, err = capIOPS(diskOptions.VolumeType, capacityGiB, int64(diskOptions.IOPSPerGB), io2MinTotalIOPS, io2MaxTotalIOPS, io2MaxIOPSPerGB, diskOptions.AllowIOPSPerGBIncrease)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine for now, but it might make more sense to create a struct for volume types with each encapsulating their limits, eventually. cc @AndyXiangLi something to keep in mind if we add more types with similar logic.

@ayberk
Copy link
Contributor

ayberk commented Mar 29, 2021

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 29, 2021
@jsafrane
Copy link
Contributor Author

Not sure if I missed any comment,IMO it's ready for review. @AndyXiangLi @wongma7 PTAL.

@AndyXiangLi
Copy link
Contributor

This looks good to me. Thank you!
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 13, 2021
@k8s-ci-robot k8s-ci-robot merged commit 9f4ce44 into kubernetes-sigs:master Apr 13, 2021
@jsafrane
Copy link
Contributor Author

With this merged, I should update CSI translation lib to add allowAutoIOPSPerGBIncrease to translated StorageClasses. However, it will pin the library (and the external-provisioner) to the new version of the driver. CSI migration will fail with an older CSI driver version, because it does not know allowAutoIOPSPerGBIncrease. Is this allowed? 🤔

@wongma7
Copy link
Contributor

wongma7 commented Apr 13, 2021

IMO it's fine, we only need to support the combo of external-provisioner+driver we release+test in our helm chart. I guess someone out there may try using new external-provisioner and old driver and run into this issue but we shouldnt have to support it : /

@jsafrane
Copy link
Contributor Author

Filed kubernetes/kubernetes#101082 to update the translation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants