[SAT-29018] Fix/corrupted RA blocks content streaming #6064

Merged

pedro-psb
Member

depends on #6026
closes #5725

@pedro-psb pedro-psb changed the title Fix/corrupted RA blocks content streaming [SAT-17783] Fix/corrupted RA blocks content streaming Nov 22, 2024
@pedro-psb pedro-psb force-pushed the fix/corrupted-ra-blocks-content-streaming branch 4 times, most recently from 58375ba to d5f13af Compare November 27, 2024 15:38
)
return server, remote

yield _generate_server_and_remote
Member Author

Moved from test_acs to facilitate creating an ACS + server.
I need an ACS here because its associated RA has priority in content streaming.

with pytest.raises(ClientPayloadError, match="Response payload is not completed"):
    download_file(get_url)

# Assert again with curl just to be sure.
Member Author

Reason:

This is testing the close-connection on digest validation.
The implementation in this PR causes the second request (from curl) to get a 404 instead, because the only (and corrupted) RA is temporarily ignored.

Here I wanted to test different clients, but testing only curl is enough.

Member

If we added another call to curl right away, we would expect a different error, because the "only" RA is corrupted, right?
Would it be easy to add that to this test?

    random.seed(seed)
    contents = random.randbytes(size)
else:
    contents = os.urandom(size)
Member Author

@pedro-psb pedro-psb Nov 27, 2024

Reason:

Facilitate creating two different files with same content to exercise corruption scenarios.
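
A minimal sketch of the idea, assuming a hypothetical helper named `gen_bytes` (the real fixture's name and signature are not shown in this excerpt). Seeding makes the bytes reproducible, so two files can share the same content, while the unseeded branch returns unrelated data. Unlike the hunk above, the sketch uses a private `random.Random` instance rather than reseeding the global generator.

```python
import os
import random


def gen_bytes(size, seed=None):
    """Return `size` bytes; deterministic when a seed is given (hypothetical helper)."""
    if seed is not None:
        # A private Random instance avoids mutating the global random state.
        return random.Random(seed).randbytes(size)
    # No seed: fresh, non-reproducible bytes from the OS.
    return os.urandom(size)


same_a = gen_bytes(64, seed=42)
same_b = gen_bytes(64, seed=42)
different = gen_bytes(64, seed=43)
```

Here `same_a` and `same_b` are byte-for-byte identical, which is what a corruption test needs when one server must serve the "right" file and another a different one under the same path.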

    content_artifact.remoteartifact_set.select_related("remote")
    .order_by_acs()
    .exclude(failed_at__gte=timezone.now() - timedelta(minutes=5))
)
Member Author

I thought it would be nice to start simple, but maybe this threshold could depend on the content size? Or be configurable via settings?

Contributor

Configurable is probably better

Member Author

@mdellweg In the first approach this was a constant, but the "proper" value is subjective.
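
For readers following the threshold discussion, here is a rough pure-Python sketch of what the `.exclude(failed_at__gte=...)` filter is meant to do, with the cooldown passed in as a parameter instead of a hard-coded five minutes. `Candidate` and `eligible` are hypothetical stand-ins, not Pulp models or APIs, and the sketch assumes that rows which never failed (`failed_at` unset) stay eligible.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Candidate:
    """Stand-in for a RemoteArtifact row (hypothetical, not the Pulp model)."""

    url: str
    failed_at: Optional[datetime] = None


def eligible(candidates, cooldown_seconds, now=None):
    """Drop candidates whose last failure is more recent than the cooldown."""
    now = now or datetime.now(timezone.utc)
    threshold = now - timedelta(seconds=cooldown_seconds)
    # Keep candidates that never failed, or that failed before the threshold.
    return [c for c in candidates if c.failed_at is None or c.failed_at < threshold]


now = datetime(2024, 11, 27, 12, 0, tzinfo=timezone.utc)
candidates = [
    Candidate("https://a.example/pkg", failed_at=now - timedelta(minutes=1)),
    Candidate("https://b.example/pkg", failed_at=now - timedelta(minutes=10)),
    Candidate("https://c.example/pkg"),
]
urls = [c.url for c in eligible(candidates, cooldown_seconds=5 * 60, now=now)]
```

With a five-minute cooldown only the source that failed one minute ago is skipped. Raising the cooldown to fifteen minutes would also skip the second one, which is why the "proper" value depends on the deployment.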

@pedro-psb pedro-psb marked this pull request as ready for review November 27, 2024 17:25
@pedro-psb pedro-psb changed the title [SAT-17783] Fix/corrupted RA blocks content streaming [SAT-29018] Fix/corrupted RA blocks content streaming Nov 27, 2024
"- Forcing the connection to close.\n"
"- Marking this Remote Server to be ignored for the next 5 minutes.\n\n"
"If the Remote is permanently corrupted, we advise the admin to "
"manually prune affected RemoteArtifacts/Remotes from Pulp."
Contributor

There's not really a way to delete RemoteArtifacts directly without messing with the database, and I'm not sure we want to suggest in any way that the user should go digging in there. Deleting remotes is probably better, and we should probably suggest resyncing the repo as it might have been fixed at the source.

Typo "checksum" on line 1153

Member Author

There's not really a way to delete RemoteArtifacts directly without messing with the database, and I'm not sure we want to suggest in any way that the user should go digging in there.

Fair, I was thinking of it as a last resort, but that is probably not good general advice.

Contributor

Yeah, I mean, it CAN be done, occasionally if someone comes to us asking how to do it we'll help them out, but it's not a road we want to suggest going down

@pedro-psb pedro-psb force-pushed the fix/corrupted-ra-blocks-content-streaming branch from 1b376db to 03d1a11 Compare November 28, 2024 18:38
@ggainey ggainey self-requested a review December 3, 2024 14:34
Comment on lines 298 to 301
# The time a RemoteArtifact will be ignored after failure.
# In on-demand, if a fetching content from a remote failed due to corrupt data,
# the corresponding RemoteArtifact will be ignored for that time (seconds).
FAILED_REMOTE_ARTIFACT_PROTECTION_TIME = 5 * 60 # 5 minutes
Member

This sounds more like a cooldown to me, not a protection time.

Have we considered making this a constant before adding a new setting?

Member Author

Cooldown sounds good.
And yes, it was a constant first (I tagged you above).

Member

Yes, let's not ping pong it.

@pedro-psb
Member Author

If we added another call to curl right away, we would expect a different error, because the "only" RA is corrupted, right?

Yes, the client will get a 404 because the RA will be ignored on the second attempt within the cooldown interval.

Would it be easy to add that to this test?

Yes, I can do that.
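
The sequence discussed here (first request aborted on digest mismatch, an immediate retry answered with a 404, and the source tried again once the cooldown has passed) can be sketched with an in-memory stand-in. `Source`, `fetch`, `DigestMismatch` and `NotFound` are hypothetical names for illustration only, not Pulp's handler code. The retry after the cooldown is an assumption based on the RA being "temporarily" ignored.

```python
import hashlib
import time


class DigestMismatch(Exception):
    """Payload hash is wrong; a streaming server would close the connection."""


class NotFound(Exception):
    """Stands in for an HTTP 404 response."""


class Source:
    """Hypothetical stand-in for one remote source of an artifact."""

    def __init__(self, payload):
        self.payload = payload
        self.failed_at = None


def fetch(sources, expected_sha256, cooldown, clock=time.monotonic):
    now = clock()
    # Skip sources that failed less than `cooldown` seconds ago.
    usable = [s for s in sources if s.failed_at is None or now - s.failed_at >= cooldown]
    if not usable:
        raise NotFound("no usable source within the cooldown interval")
    source = usable[0]
    # When streaming, bytes are already on the wire before the digest is known,
    # so a mismatch can only be signalled by aborting the connection.
    if hashlib.sha256(source.payload).hexdigest() != expected_sha256:
        source.failed_at = now
        raise DigestMismatch("payload does not match the expected digest")
    return source.payload


expected = hashlib.sha256(b"package").hexdigest()
sources = [Source(b"corrupted")]  # the only source serves the wrong bytes
fake_time = [0.0]
outcomes = []
for t in (0.0, 1.0, 400.0):
    fake_time[0] = t
    try:
        fetch(sources, expected, cooldown=300, clock=lambda: fake_time[0])
    except (DigestMismatch, NotFound) as exc:
        outcomes.append(type(exc).__name__)
```

The injected `clock` keeps the sketch deterministic. A real test would more likely issue a second request right after the first and assert on the 404.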

@pedro-psb pedro-psb force-pushed the fix/corrupted-ra-blocks-content-streaming branch from dabb51e to 4b4b195 Compare December 4, 2024 14:53
@@ -298,7 +298,7 @@
 # The time a RemoteArtifact will be ignored after failure.
 # In on-demand, if a fetching content from a remote failed due to corrupt data,
 # the corresponding RemoteArtifact will be ignored for that time (seconds).
-FAILED_REMOTE_ARTIFACT_PROTECTION_TIME = 5 * 60 # 5 minutes
+FAILED_REMOTE_ARTIFACT_COOLDOWN_TIME = 5 * 60 # 5 minutes
Member

Trying to find a snappier name here: What about REMOTE_ARTIFACT_SKIP_COOLDOWN?

Also I think we have a bit of a containment issue here. RemoteArtifact is a pure implementation detail. Users and administrators will never see them. So referencing them here may be confusing. (This is me probably overthinking this.)

Member Author

I agree RemoteArtifact is an implementation detail to some degree, but there is also the concept of a remote source for an artifact, which is meaningful to users. They know there may be multiple sources for the same content (alternate content sources, for example), and that's what we need to express here. So yes, maybe RemoteArtifact doesn't express that very well for the uninitiated.

Personally I don't think REMOTE_ARTIFACT_SKIP_COOLDOWN is much better; it reads very vaguely outside the context of this problem. But that's also true for the current name, so I'll be OK with that.

Member Author

What about:

  1. REMOTE_CONTENT_FETCH_FAILURE_COOLDOWN
  2. FAILED_REMOTE_CONTENT_FETCH_COOLDOWN
  3. REMOTE_CONTENT_FETCH_COOLDOWN
  4. ON_DEMAND_CONTENT_FETCH_COOLDOWN
  5. ON_DEMAND_REMOTE_COOLDOWN

IMHO the "content/artifact fetch" part improves the scoping.
Those should generally read as "cooldown for when content fetching from a remote fails".

Member

REMOTE_CONTENT_FETCH_FAILURE_COOLDOWN is still a mouthful, but probably the best of all the suggestions we came up with. (Being a setting means we cannot easily change it later.)

Member

@mdellweg mdellweg left a comment

All my concerns have been addressed. Time for squashing?

On a request for on-demand content in the content app, a corrupted Remote
that contains the wrong binary (for that content) prevented other Remotes
from being attempted on future requests.

Now the last failed Remotes are temporarily ignored and others may be picked.

Closes pulp#5725
@pedro-psb pedro-psb force-pushed the fix/corrupted-ra-blocks-content-streaming branch from 215397e to 0dff938 Compare December 6, 2024 11:57
Contributor

@ggainey ggainey left a comment

Looks good - tests look good to exercise the new codepaths, great discussion.

@mdellweg mdellweg merged commit 0a5ac4a into pulp:main Dec 9, 2024
12 checks passed
@pedro-psb pedro-psb deleted the fix/corrupted-ra-blocks-content-streaming branch December 9, 2024 12:04
Successfully merging this pull request may close these issues.

Wrong package download after on-demand sync