
Fix regression for pool creation timeout retry #887

Merged: 6 commits from pool-timeout into develop on Nov 26, 2024

Conversation

@tiagolobocastro (Contributor) commented Nov 20, 2024

test: use tmp in project workspace

Use a tmp folder inside the project workspace, which makes it much easier
to clean up artifacts such as LVM volumes: we can simply purge the folder.

Signed-off-by: Tiago Castro <[email protected]>

test(pool): create on very large or very slow disks

Uses LVM Lvols as backend devices for the pool.
We suspend these before pool creation, allowing us to simulate slow
pool creation.
This test ensures that the pool creation completes on its own, and also
that a client can complete it by calling create again.

Signed-off-by: Tiago Castro <[email protected]>
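In the same spirit as the LVM-suspend trick above, the scenario can be modelled in a self-contained sketch: a backend whose create blocks while a "suspended" flag is set, so the client's first call times out while the creation keeps running in the background. All names here (`FakeBackend`, `create_pool`, `resume`) are illustrative, not the actual test harness.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;
use std::time::Duration;

/// Stand-in for a pool backend on a suspended LVM volume: while
/// `suspended` is true the create cannot finish, mimicking a very
/// large or very slow disk.
struct FakeBackend {
    suspended: Arc<AtomicBool>,
    pool_created: Arc<AtomicBool>,
}

impl FakeBackend {
    fn new(suspended: bool) -> Self {
        FakeBackend {
            suspended: Arc::new(AtomicBool::new(suspended)),
            pool_created: Arc::new(AtomicBool::new(false)),
        }
    }

    /// Kick off pool creation on a background thread and wait for it
    /// up to `client_timeout`, like a gRPC client with a deadline.
    /// The creation itself keeps running after the deadline expires.
    fn create_pool(&self, client_timeout: Duration) -> Result<(), &'static str> {
        let (tx, rx) = mpsc::channel();
        let suspended = Arc::clone(&self.suspended);
        let created = Arc::clone(&self.pool_created);
        thread::spawn(move || {
            // Blocked until the "disk" is resumed.
            while suspended.load(Ordering::SeqCst) {
                thread::sleep(Duration::from_millis(1));
            }
            created.store(true, Ordering::SeqCst);
            let _ = tx.send(()); // receiver may already be gone; that's fine
        });
        rx.recv_timeout(client_timeout).map_err(|_| "deadline exceeded")
    }

    /// Analogous to `dmsetup resume` on the suspended volume.
    fn resume(&self) {
        self.suspended.store(false, Ordering::SeqCst);
    }
}
```

A test against this model can then assert both behaviours from the commit message: the creation completes by itself once the device is resumed, and a second create call also succeeds.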

fix: allow pool creation to complete asynchronously

When the initial create gRPC call times out, the data-plane may still be
creating the pool in the background, which can happen for very large pools.
Rather than assume failure, we allow the creation to complete in the
background, up to a large (arbitrary) time limit. If the pool creation
completes within that window, we retry the creation flow.
We don't simply use very large timeouts because the gRPC operations are
currently sequential, mostly for historical reasons. Now that the
data-plane allows concurrent calls, the control-plane should do the same.
TODO: allow concurrent node operations

Signed-off-by: Tiago Castro <[email protected]>

fix: check for correct not found error code

A previous fix ended up not working correctly because it was merged
incorrectly, somehow!

Signed-off-by: Tiago Castro <[email protected]>
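The commit message doesn't spell out the original bug, but the general pattern here is to classify "not found" from the structured gRPC status code rather than anything more brittle. A toy model, with a `Code` enum standing in for `tonic::Code` (names and variants illustrative):

```rust
/// Illustrative subset of gRPC status codes; the real project would
/// use the status type from its gRPC library.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Code {
    NotFound,
    Internal,
}

#[derive(Debug)]
struct Status {
    code: Code,
    message: String,
}

/// Decide "the pool does not exist" from the structured status code,
/// never from the free-form (and easily drifting) message text.
fn pool_not_found(status: &Status) -> bool {
    status.code == Code::NotFound
}
```

Matching on the code keeps the check stable even when the message text changes between data-plane versions.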

chore: update terraform node prep

Pull the Release key from a recent k8s version since the old keys are no
longer valid.
This will have to be updated from time to time.


Signed-off-by: Tiago Castro <[email protected]>
@tiagolobocastro (Contributor, Author)

Resolves openebs/mayastor#1772

@tiagolobocastro force-pushed the pool-timeout branch 3 times, most recently from 9607994 to 405f633, on November 25, 2024 12:59
@tiagolobocastro (Contributor, Author)

bors try

bors-openebs-mayastor bot pushed a commit that referenced this pull request Nov 25, 2024
@bors-openebs-mayastor

try

Build failed:

Not sure why this is starting to fail now... it's failing even on an
unchanged release branch!?

Signed-off-by: Tiago Castro <[email protected]>
@tiagolobocastro (Contributor, Author)

bors try

bors-openebs-mayastor bot pushed a commit that referenced this pull request Nov 25, 2024
@bors-openebs-mayastor

try

Build succeeded:

@tiagolobocastro (Contributor, Author)

bors merge

@bors-openebs-mayastor

Build succeeded:

@bors-openebs-mayastor bors-openebs-mayastor bot merged commit 61ef768 into develop Nov 26, 2024
4 checks passed
@bors-openebs-mayastor bors-openebs-mayastor bot deleted the pool-timeout branch November 26, 2024 09:53
tiagolobocastro added a commit that referenced this pull request Nov 26, 2024
887: Fix regression for pool creation timeout retry r=tiagolobocastro a=tiagolobocastro

Co-authored-by: Tiago Castro <[email protected]>
bors-openebs-mayastor bot pushed a commit that referenced this pull request Nov 26, 2024
890: Backport fixes to release/2.7 r=tiagolobocastro a=tiagolobocastro

    chore(bors): merge pull request #887
    

---

    fix(resize): atomically check for the required size
    
    Ensures races don't lead to volume resize failures.
    
    Signed-off-by: Tiago Castro <[email protected]>

---

    test(bdd/thin): fix racy thin prov test
    
    Add retry waiting for condition to be met.
    
    Signed-off-by: Tiago Castro <[email protected]>

---

    feat(topology): remove the internal labels while displaying
    
    Signed-off-by: sinhaashish <[email protected]>

---

    fix(fsfreeze): improved error message when volume is not staged
    
    Signed-off-by: Abhinandan Purkait <[email protected]>

---

    fix(deployer): increasing the max number of allowed connection attempts to the io-engine
    
    Signed-off-by: sinhaashish <[email protected]>

---

    fix(topology): hasTopologyKey overwrites affinityTopologyLabels
    
    Signed-off-by: sinhaashish <[email protected]>


Co-authored-by: sinhaashish <[email protected]>
Co-authored-by: Abhinandan Purkait <[email protected]>
Co-authored-by: Tiago Castro <[email protected]>
Co-authored-by: mayastor-bors <[email protected]>
3 participants