-
Notifications
You must be signed in to change notification settings - Fork 114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement a rebind to default driver as a w/a #233
Implement a rebind to default driver as a w/a #233
Conversation
Thanks for your PR,
To skip the vendors CIs use one of:
|
/hold Running some validations |
pkg/utils/utils.go
Outdated
return err | ||
} | ||
|
||
glog.Errorf("setVfAdminMac(): workaround implement for VF %s", vfAddr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a glog.Info ?
What about moving this log to the beginning of the Unbind call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will like to have it at least as a warning so it will popup in the logs
pkg/utils/utils.go
Outdated
glog.Errorf("setVfAdminMac(): VF link is not ready for device %+v %q", vfAddr, err) | ||
return err | ||
// TODO: remove workaround after intel fix driver issue | ||
if err := Unbind(vfAddr); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the VF doesn't exist, can we still use its pciAddr to unbind and rebind the driver? I think we did that before, just want to confirm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right :)
is this related to : https://bugzilla.redhat.com/show_bug.cgi?id=1875338 ? i see there was some WA which was introduced and reverted around rebinding VF driver also commit message should explain the issue but avoid IMO console bash output |
pkg/utils/utils.go
Outdated
@@ -565,8 +565,23 @@ func setVfAdminMac(vfAddr string, pfLink netlink.Link) error { | |||
} | |||
vfLink, err := vfIsReady(vfAddr) | |||
if err != nil { | |||
glog.Errorf("setVfAdminMac(): VF link is not ready for device %+v %q", vfAddr, err) | |||
return err | |||
// TODO: remove workaround after intel fix driver issue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this bit be moved to a separate function ?
with comment explaining the issue and link to bz maybe ?
"intel fix driver issue" is a bit generic to me :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done I also update the commit message and remove all the console outputs
ec2cb3c
to
ded2720
Compare
Thanks for your PR,
To skip the vendors CIs use one of:
|
/hold cancel |
ded2720
to
e2d31ad
Compare
Thanks for your PR,
To skip the vendors CIs use one of:
|
@adrianchiris @zshi-redhat can you please give it another look I test it locally and it pass |
pkg/utils/utils.go
Outdated
@@ -291,12 +291,16 @@ func configSriovDevice(iface *sriovnetworkv1.Interface, ifaceStatus *sriovnetwor | |||
if err = setVfGuid(addr, pfLink); err != nil { | |||
return err | |||
} | |||
} else if driver != "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name choice for this is bad IMO, wish it was : dpdkDriver
but not related to this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this needed ? to avoid triggering the WA ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will use this instead.
This is needed if there was a problem with one of the vfs and some of them are in vfio we will try to see check if VfIsReady and that will always fail because the VF is already attached to a user-space driver
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can change the name is a small change :)
pkg/utils/utils.go
Outdated
@@ -565,8 +569,19 @@ func setVfAdminMac(vfAddr string, pfLink netlink.Link) error { | |||
} | |||
vfLink, err := vfIsReady(vfAddr) | |||
if err != nil { | |||
glog.Errorf("setVfAdminMac(): VF link is not ready for device %+v %q", vfAddr, err) | |||
return err | |||
err = RebindVfToDefaultDriver(vfAddr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking on this more, this is a surprising place to find a WA.
here we basically just do ip link set <pf> vf <idx> mac <macaddr>
the reason i assume its here is because vfIsReady
was put here in commit : eb47703
I wonder if a better approach is to roll the WA logic into VfIsReady
and call it from configSriovDevices
essentially after we create the VFs, we wait for them to be ready then move on with our business.
as the issue only happens on Intel, you dont need to worry about a rebind later on that might hit the issue.
which happens only with rdma.(also as the WA is applied today it will not solve it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
e2d31ad
to
2d55873
Compare
Thanks for your PR,
To skip the vendors CIs use one of:
|
2d55873
to
1c8c9c6
Compare
Thanks for your PR,
To skip the vendors CIs use one of:
|
// only set GUID and MAC for VF with default driver | ||
// for userspace drivers like vfio we configure the vf mac using the kernel nic mac address | ||
// before we switch to the userspace driver | ||
if yes, d := hasDriver(addr); yes && !sriovnetworkv1.StringInArray(d, DpdkDrivers) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/cc @adrianchiris @mskrocki @pliurh @zshi-redhat
I add this validation instead of if iface.NumVfs != ifaceStatus.NumVfs
that was proposed under #245.
I think is a better solution because we are inside a for loop going over all the vfs. if there is an issue with one of the vfs(for example the intel driver get stuck) we exist the configSriovDevice
function with an error. Then in the next reconcile the iface.NumVfs will be equal to the ifaceStatus.NumVfs because the vfs got already created here (https://github.com/k8snetworkplumbingwg/sriov-network-operator/pull/233/files#diff-81ddbadfb415ccbb9c7af84f11668d1aa5e53c34025bf86d4702f16b4e42f045R246) but we didn't really finish allocating the GUID or the mac address to all the vfs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree.
fix #244 |
/lgtm |
Since this is a workaround, I propose that we remove it when all related BZs have been fixed. With that in mind, the sole commit in this PR does more than just introduce RebindVfToDefaultDriver. I suggest to split the commit into at least two commits where one will just introduce the workaround and the other applies the changes we want to keep even after the issues have been resolved in the kernel. |
…ver only Signed-off-by: Sebastian Sch <[email protected]>
Signed-off-by: Sebastian Sch <[email protected]>
1c8c9c6
to
ce121d6
Compare
Thanks for your PR,
To skip the vendors CIs use one of:
|
sounds good can you take another look? |
/lgtm |
@@ -541,7 +569,7 @@ func vfIsReady(pciAddr string) (netlink.Link, error) { | |||
glog.Infof("vfIsReady(): VF device %s", pciAddr) | |||
var err error | |||
var vfLink netlink.Link | |||
err = wait.PollImmediate(time.Second, 5*time.Second, func() (bool, error) { | |||
err = wait.PollImmediate(time.Second, 10*time.Second, func() (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need to increase the timeout?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the timeout was too short and was generating spurious errors.
This commit add a w/a to an issue observed on intel nics where not all the vfs are created.
Test:
We are working with kernel developers to find and fix the issue.
This w/a will give us more time without impacting the sriov operator
Signed-off-by: Sebastian Sch [email protected]