Cirrus: Workaround F32 BFQ Kernel bug #8188

Conversation

cevich
Member

@cevich cevich commented Oct 29, 2020

Fixes #8068

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 29, 2020
@mheon
Member

mheon commented Oct 29, 2020 via email

@cevich
Member Author

cevich commented Oct 29, 2020

I've rebased my instrumented/testing PR (#8169) on this one to help confirm that the problem occurs only on F32 and that the previous workaround functions.

@cevich cevich requested review from baude and mheon October 29, 2020 18:32
@cevich
Member Author

cevich commented Oct 29, 2020

Update: Given the "lively" discussion in https://bugzilla.redhat.com/show_bug.cgi?id=1851783, I'm considering whether we ought to simply use the deadline scheduler globally, especially since (judging from other bugs) this is the second time BFQ has broken for people.

@cevich
Member Author

cevich commented Oct 29, 2020

Update: I'm investigating additional cases of agent-stopped-responding in F31. We may need to include the fix there as well, until we can move up to F33 (#8074).

@rhatdan
Member

rhatdan commented Oct 30, 2020

LGTM
@cevich ready to get this in. I think you can work on other changes in a different PR; we need to get this in now.

@cevich
Member Author

cevich commented Oct 30, 2020

> I think you can work on other changes in a different PR

No, not true. I'm still seeing remarkably similar failures on F31. I'm testing application of the 'deadline' scheduler workaround to all VMs, which also helps "even the playing field". Should be finished shortly...

@cevich cevich force-pushed the workaround_agent_stopped_responding branch from 5fa734c to 0ebee0c Compare October 30, 2020 13:16
@cevich
Member Author

cevich commented Oct 30, 2020

Update: Rebased, with the deadline workaround now applied to all platforms. Testing in parallel with #8169 (instrumented).
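
For readers unfamiliar with the kind of workaround being discussed: on Linux, a block device's I/O scheduler can be inspected and changed through `/sys/block/<device>/queue/scheduler`, where the active scheduler is shown in square brackets. The Python sketch below is an illustration only and is not the change made in this PR. The function names and the `sysfs_root` parameter are hypothetical, and writing to the real sysfs file requires root.

```python
from pathlib import Path


def parse_schedulers(text):
    """Parse sysfs scheduler content such as 'mq-deadline kyber [bfq] none'.

    Returns (available, active); the active scheduler is the bracketed one.
    """
    available = []
    active = None
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return available, active


def pick_deadline(available):
    """Prefer the multi-queue name, fall back to the legacy name."""
    for name in ("mq-deadline", "deadline"):
        if name in available:
            return name
    return None


def set_deadline(device, sysfs_root="/sys/block"):
    """Switch `device` to a deadline scheduler if the kernel offers one.

    Returns the scheduler in effect afterwards (hypothetical helper).
    """
    path = Path(sysfs_root) / device / "queue" / "scheduler"
    available, active = parse_schedulers(path.read_text())
    choice = pick_deadline(available)
    if choice is None or choice == active:
        return active  # nothing to change
    path.write_text(choice)
    return choice
```

Assuming a device named `sda`, `set_deadline("sda")` run as root would attempt the switch. The `sysfs_root` parameter exists only so the sketch can be exercised against a temporary directory.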

@cevich
Member Author

cevich commented Oct 30, 2020

Important Observation: I'm noticing a significant increase in runtime for all the "remote" tests, especially on Ubuntu. Unfortunately, I'm also seeing (in other PRs) what appear to be general Google Cloud networking hiccups/slowdowns. There's no way I can separate these two (or more) effects on test runtime.

If the performance problem persists past the networking slowdown, we should consider following Google's recommendations for increasing storage performance.

@cevich cevich changed the title from "WIP: Cirrus: Workaround F32 BFQ Kernel bug" to "Cirrus: Workaround F32 BFQ Kernel bug" Oct 30, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 30, 2020
@cevich
Member Author

cevich commented Oct 30, 2020

@mheon @rhatdan okay, I feel comfortable with merging this now.

@mheon
Member

mheon commented Oct 30, 2020

/approve
LGTM

@openshift-ci-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cevich, mheon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 30, 2020
@mheon
Member

mheon commented Oct 30, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 30, 2020
@openshift-merge-robot openshift-merge-robot merged commit f794a4f into containers:master Oct 30, 2020
@cevich cevich deleted the workaround_agent_stopped_responding branch June 30, 2021 18:06
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 22, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 22, 2023
Development

Successfully merging this pull request may close these issues.

Cirrus CI: agent stopped responding
5 participants