[SAT-29018] Fix/corrupted RA blocks content streaming #6064

pedro-psb · 2024-11-22T21:26:02Z

depends on #6026
closes #5725

pedro-psb · 2024-11-27T17:02:58Z

pulp_file/pytest_plugin.py

+        )
+        return server, remote
+
+    yield _generate_server_and_remote


Moved from test_acs to facilitate creating an ACS + server.
I need an ACS here because its associated RA takes priority in content streaming.

pedro-psb · 2024-11-27T17:07:53Z

pulpcore/tests/functional/api/using_plugin/test_content_delivery.py

-    with pytest.raises(ClientPayloadError, match="Response payload is not completed"):
-        download_file(get_url)
-
-    # Assert again with curl just to be sure.


Reason:

This is testing the close-connection on digest validation.
With the implementation in this PR, the second request (from curl) gets a 404 instead, because the only (and corrupted) RA is temporarily ignored.

Here I wanted to test different clients, but testing only curl is enough.
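For readers unfamiliar with why a digest failure ends in a closed connection: the checksum is only known once the last chunk has been hashed, by which point the earlier bytes have already been sent. A minimal sketch of that idea follows, assuming SHA-256; the names `stream_with_validation` and `DigestMismatch` are hypothetical and are not taken from pulpcore.

```python
import hashlib
from typing import Iterable, Iterator


class DigestMismatch(Exception):
    """Raised when the streamed bytes do not hash to the expected digest."""


def stream_with_validation(chunks: Iterable[bytes], expected_sha256: str) -> Iterator[bytes]:
    # Hash each chunk while forwarding it; the verdict is only known at the end.
    hasher = hashlib.sha256()
    for chunk in chunks:
        hasher.update(chunk)
        yield chunk
    if hasher.hexdigest() != expected_sha256:
        # The bytes already reached the client, so aborting is the only signal left.
        raise DigestMismatch("streamed content does not match the expected digest")
```

In this sketch the consumer receives every chunk before the exception surfaces, which is why a client sees an incomplete or aborted response rather than an HTTP error status.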

If we added another call to curl right away, we would expect a different error, because the "only" RA is corrupted. right?
Would it be easy to add that to this test?

pedro-psb · 2024-11-27T17:08:45Z

pulpcore/tests/functional/utils.py

+            random.seed(seed)
+            contents = random.randbytes(size)
+        else:
+            contents = os.urandom(size)


Reason:

Facilitates creating two different files with the same content to exercise corruption scenarios.
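As a rough illustration of the seeded path in the hunk above: the same seed reproduces the same bytes, while omitting the seed falls back to `os.urandom`. The function name `generate_contents` is hypothetical, and this sketch uses a private `random.Random` instance rather than the global `random.seed` call shown in the diff (requires Python 3.9+ for `randbytes`).

```python
import os
import random
from typing import Optional


def generate_contents(size: int, seed: Optional[int] = None) -> bytes:
    if seed is not None:
        # A private Random instance avoids reseeding the global generator.
        return random.Random(seed).randbytes(size)
    # No seed requested: use OS-provided randomness, as in the diff.
    return os.urandom(size)
```

Calling it twice with the same `seed` yields identical content, and a different seed yields different content of the same size.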

pedro-psb · 2024-11-27T17:25:19Z

pulpcore/content/handler.py

+            content_artifact.remoteartifact_set.select_related("remote")
+            .order_by_acs()
+            .exclude(failed_at__gte=timezone.now() - timedelta(minutes=5))
+        )


I thought it would be nice to start simple, but maybe this threshold could depend on the content size? Or be configurable via settings?

Configurable is probably better

@mdellweg In the first approach this was a constant, but the "proper" value is subjective.
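To make the effect of the `exclude(failed_at__gte=...)` clause concrete, here is a plain-Python sketch of the same filter with the threshold passed in as a parameter. The `Source` dataclass and `eligible_sources` are hypothetical stand-ins, not pulpcore models, and the sketch assumes that entries with no recorded failure are kept.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class Source:
    name: str
    failed_at: Optional[datetime] = None  # None means no recorded failure


def eligible_sources(sources: List[Source], now: datetime, cooldown: timedelta) -> List[Source]:
    # Drop only the sources whose last failure falls inside the cooldown window.
    threshold = now - cooldown
    return [s for s in sources if s.failed_at is None or s.failed_at < threshold]
```

With a 5-minute cooldown, a source that failed one minute ago is skipped, while one that failed ten minutes ago becomes a candidate again.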

dralley · 2024-11-28T06:24:49Z

pulpcore/content/handler.py

+                "- Forcing the connection to close.\n"
+                "- Marking this Remote Server to be ignored for the next 5 minutes.\n\n"
+                "If the Remote is permanently corrupted, we advice the admin to "
+                "manually prune affected RemoteArtifacts/Remotes from Pulp."


There's not really a way to delete RemoteArtifacts directly without messing with the database, and I'm not sure we want to suggest in any way that the user should go digging in there. Deleting remotes is probably better, and we should probably suggest resyncing the repo as it might have been fixed at the source.

Typo "checksum" on line 1153

There's not really a way to delete RemoteArtifacts directly without messing with the database, and I'm not sure we want to suggest in any way that the user should go digging in there.

Fair, I was thinking of the last-resort case, but that is probably not good general advice.

Yeah, I mean, it CAN be done. Occasionally, if someone comes to us asking how to do it, we'll help them out, but it's not a road we want to suggest going down.

mdellweg · 2024-12-04T12:50:35Z

pulpcore/app/settings.py

+# The time a RemoteArtifact will be ignored after failure.
+# In on-demand, if a fetching content from a remote failed due to corrupt data,
+# the corresponding RemoteArtifact will be ignored for that time (seconds).
+FAILED_REMOTE_ARTIFACT_PROTECTION_TIME = 5 * 60  # 5 minutes


This sounds more like a cooldown to me, not a protection time.

Have we considered making this a constant before adding a new setting?

Cooldown sounds good.
And yes, it was a constant first (marked you above)

Yes, let's not ping pong it.

mdellweg · 2024-12-04T12:57:30Z

pulpcore/tests/functional/api/using_plugin/test_content_delivery.py

-    with pytest.raises(ClientPayloadError, match="Response payload is not completed"):
-        download_file(get_url)
-
-    # Assert again with curl just to be sure.


If we added another call to curl right away, we would expect a different error, because the "only" RA is corrupted. right?
Would it be easy to add that to this test?

pedro-psb · 2024-12-04T13:09:23Z

If we added another call to curl right away, we would expect a different error, because the "only" RA is corrupted. right?

Yes, client will get a 404 because the RA will be ignored for the second attempt within the cooldown interval.

Would it be easy to add that to this test?

Yes, I can do that.
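The sequence described here (first request aborted, second request a 404 within the cooldown) can be sketched with a toy model. `FakeContentApp` is a hypothetical stand-in with a single remote source; it is not the real handler or the real test, and the status strings are illustrative only.

```python
import hashlib
from typing import Optional

COOLDOWN_SECONDS = 5 * 60  # mirrors the 5-minute default discussed in this PR


class FakeContentApp:
    """Toy stand-in for the content app with exactly one remote source."""

    def __init__(self, remote_bytes: bytes, expected_sha256: str) -> None:
        self.remote_bytes = remote_bytes
        self.expected_sha256 = expected_sha256
        self.failed_at: Optional[float] = None

    def get(self, now: float) -> str:
        # Skip the only source while it is inside the cooldown window.
        if self.failed_at is not None and now - self.failed_at < COOLDOWN_SECONDS:
            return "404"
        if hashlib.sha256(self.remote_bytes).hexdigest() != self.expected_sha256:
            self.failed_at = now  # remember the failure, then abort the stream
            return "connection closed"
        return "200"
```

Under these assumptions, a request at `now=0.0` ends in a closed connection, a request one second later gets a 404, and a request after the cooldown has elapsed tries the source again.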

mdellweg · 2024-12-04T16:46:12Z

pulpcore/app/settings.py

@@ -298,7 +298,7 @@
 # The time a RemoteArtifact will be ignored after failure.
 # In on-demand, if a fetching content from a remote failed due to corrupt data,
 # the corresponding RemoteArtifact will be ignored for that time (seconds).
-FAILED_REMOTE_ARTIFACT_PROTECTION_TIME = 5 * 60  # 5 minutes
+FAILED_REMOTE_ARTIFACT_COOLDOWN_TIME = 5 * 60  # 5 minutes


Trying to find a snappier name here: What about REMOTE_ARTIFACT_SKIP_COOLDOWN?

Also I think we have a bit of a containment issue here. RemoteArtifact is a pure implementation detail. Users and administrators will never see them. So referencing them here may be confusing. (This is me probably overthinking this.)

I agree RemoteArtifact is an implementation detail to some degree, but there is also the concept of a remote source for an artifact, which is something meaningful to users. They know there are possibly multiple sources for the same content (alternate content sources, for example), and that's what we need to express here. So yes, maybe RemoteArtifact doesn't express that very well for the uninitiated.

Personally I don't think REMOTE_ARTIFACT_SKIP_COOLDOWN is much better; it reads very vaguely outside the context of this problem. But that's also true for the current one, so I'm OK with it.

What about:

REMOTE_CONTENT_FETCH_FAILURE_COOLDOWN

FAILED_REMOTE_CONTENT_FETCH_COOLDOWN

REMOTE_CONTENT_FETCH_COOLDOWN

ON_DEMAND_CONTENT_FETCH_COOLDOWN

ON_DEMAND_REMOTE_COOLDOWN

IMHO the "content/artifact fetch" here improves the scoping.
These should generally read as "cooldown for when content fetching from a remote fails".

REMOTE_CONTENT_FETCH_FAILURE_COOLDOWN is still a mouthful, but probably the best of all the suggestions we came up with. (Being a setting means we cannot easily change it later.)

mdellweg

All my concerns have been addressed. Time for squashing?

On a request for on-demand content in the content app, a corrupted Remote that contains the wrong binary (for that content) prevented other Remotes from being attempted on future requests. Now the last failed Remotes are temporarily ignored and others may be picked. Closes pulp#5725

ggainey

Looks good - tests look good to exercise the new codepaths, great discussion.

github-actions bot added multi-commit no-changelog no-issue labels Nov 22, 2024

pedro-psb changed the title ~~Fix/corrupted RA blocks content streaming~~ [SAT-17783] Fix/corrupted RA blocks content streaming Nov 22, 2024

pedro-psb force-pushed the fix/corrupted-ra-blocks-content-streaming branch 4 times, most recently from 58375ba to d5f13af Compare November 27, 2024 15:38

pedro-psb commented Nov 27, 2024

pedro-psb marked this pull request as ready for review November 27, 2024 17:25

pedro-psb requested review from dralley and mdellweg November 27, 2024 17:26

pedro-psb changed the title ~~[SAT-17783] Fix/corrupted RA blocks content streaming~~ [SAT-29018] Fix/corrupted RA blocks content streaming Nov 27, 2024

dralley reviewed Nov 28, 2024

pedro-psb force-pushed the fix/corrupted-ra-blocks-content-streaming branch from 1b376db to 03d1a11 Compare November 28, 2024 18:38

ggainey self-requested a review December 3, 2024 14:34

mdellweg reviewed Dec 4, 2024

pedro-psb force-pushed the fix/corrupted-ra-blocks-content-streaming branch from dabb51e to 4b4b195 Compare December 4, 2024 14:53

mdellweg reviewed Dec 4, 2024

mdellweg reviewed Dec 6, 2024

pedro-psb force-pushed the fix/corrupted-ra-blocks-content-streaming branch from 215397e to 0dff938 Compare December 6, 2024 11:57

github-actions bot removed multi-commit no-changelog labels Dec 6, 2024

ggainey approved these changes Dec 6, 2024

dralley approved these changes Dec 9, 2024

mdellweg approved these changes Dec 9, 2024

mdellweg merged commit 0a5ac4a into pulp:main Dec 9, 2024
12 checks passed

pedro-psb deleted the fix/corrupted-ra-blocks-content-streaming branch December 9, 2024 12:04
