Add sha256 uniqueness constraint to CollectionVersion #1119

Closed
@gerrod3 wants to merge 1 commit from the sha256-digest branch

Conversation

@gerrod3 (Contributor) commented Jul 16, 2022:

Have to do lots of testing for this one; the migration is not pretty.

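# In a data migration, models are loaded via apps.get_model() so that the
# historical (migration-time) model definitions are used, not the live classes.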
ContentArtifact = apps.get_model('core', 'ContentArtifact')
RemoteArtifact = apps.get_model('core', 'RemoteArtifact')
Artifact = apps.get_model('core', 'Artifact')
collection_bulk = {}
Member commented:

Getting it to work is the first priority, of course. But in the end, we want this migration to work in batches.
On a really large installation, one by one is way too slow, but all in one will blow up the RAM.
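A batched version of the data migration could look roughly like this; a minimal sketch, assuming a populate_sha256 migration function, an empty-string default on the new field, and a hypothetical compute_sha256_for() helper, none of which are this PR's actual code:

def populate_sha256(apps, schema_editor):
    CollectionVersion = apps.get_model("ansible", "CollectionVersion")
    batch_size = 1000
    batch = []
    # iterator() streams rows instead of loading the whole table into RAM.
    for cv in CollectionVersion.objects.filter(sha256="").iterator(chunk_size=batch_size):
        cv.sha256 = compute_sha256_for(cv)  # hypothetical digest lookup
        batch.append(cv)
        if len(batch) >= batch_size:
            CollectionVersion.objects.bulk_update(batch, ["sha256"])
            batch.clear()
    if batch:
        CollectionVersion.objects.bulk_update(batch, ["sha256"])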

@gerrod3 (author) replied:

After asking whether anyone uses GitRemotes (which can create on-demand Collections), it seems their use is not widespread, so I simplified the migration to only migrate collections that have artifacts in the system. If on-demand collections are present, an exception is raised listing the collections that failed.

@gerrod3 force-pushed the sha256-digest branch 8 times, most recently from f90ee9e to ceda235 on July 22, 2022 17:57
@gerrod3 marked this pull request as ready for review July 26, 2022 22:04
@gerrod3 requested a review from a team as a code owner July 26, 2022 22:04
    # If there are on-demand collections then the next migration will fail, so error here with
    # a helpful message on how to fix it. No work will be performed by this migration on a second run.
    if len(collections_on_demand) > 0:
        raise Exception(
Member commented:

I don't think it is good practice to raise an exception in the middle of the migration because a) it has already performed some changes to the DB and b) rollback might not be easy in the middle of a failed upgrade.

Is it possible to fetch the sha256 from the RemoteArtifact of the on_demand collection content?

@gerrod3 (author) replied:

> I don't think it is good practice to raise an exception in the middle of the migration because a) it has already performed some changes to the DB and b) rollback might not be easy in the middle of a failed upgrade.

Well, I tried to design the migration so it can be re-run without redoing the work it has already done, and reverting this migration should be as easy as simply removing the new sha256 column.

> Is it possible to fetch the sha256 from the RemoteArtifact of the on_demand collection content?

Yes, if the sha256 was set on the RemoteArtifact, but it isn't required, and the Git sync code does not set the sha256 when doing an on_demand sync.
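For illustration, that fallback inside the data migration could look roughly like this; the traversal from RemoteArtifact through ContentArtifact follows pulpcore's schema, but treat the exact lookups as an assumption:

# Hypothetical fallback: use a RemoteArtifact's recorded sha256 when no
# downloaded Artifact exists; an on-demand Git sync may provide neither.
ra = RemoteArtifact.objects.filter(
    content_artifact__content__pk=cv.pk, sha256__isnull=False
).first()
if ra is not None:
    cv.sha256 = ra.sha256
else:
    collections_on_demand.append(cv)  # no checksum available anywhere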

Member commented:

I see.
My suggestion would be to first (before making any modifications to the collections' data) identify whether there are any collections for which it is not possible to retrieve a sha256 (whether from the artifact or the remote_artifact), then find which repos they are in, and only then raise an error saying which repos need to be synced with the immediate policy. Displaying a list of collection.pk values is not very friendly or informative.
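A sketch of that suggested pre-check; the model lookups (including core's RepositoryContent) approximate pulpcore's schema and are illustrative, not code from this PR:

# Find collection versions whose sha256 cannot be recovered from either a
# downloaded Artifact or a RemoteArtifact, then report the affected repos.
no_digest = CollectionVersion.objects.exclude(
    pk__in=ContentArtifact.objects.filter(artifact__isnull=False).values("content")
).exclude(
    pk__in=RemoteArtifact.objects.filter(sha256__isnull=False).values(
        "content_artifact__content"
    )
)
if no_digest.exists():
    repos = (
        RepositoryContent.objects.filter(content__in=no_digest, version_removed=None)
        .values_list("repository__name", flat=True)
        .distinct()
    )
    raise RuntimeError(
        "Cannot determine the sha256 for some collections; sync these "
        f"repositories with policy=immediate first: {sorted(repos)}"
    )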

Member commented:

If we don't want anything to change in case we are unable to perform the whole migration, I would simply rely on the atomic transaction a migration is usually run in. Adding the extra check upfront introduces unnecessary complexity.

@@ -169,7 +172,7 @@ def relative_path(self):

     class Meta:
         default_related_name = "%(app_label)s_%(model_name)s"
-        unique_together = ("namespace", "name", "version")
+        unique_together = ("sha256",)
Member commented:

So the artifact contains the information about namespace, name, and version.
Do you see any conceivable way (maybe through on-demand sync) that Pulp could be fooled about that information? Is this a problem we intend to solve, or do we in general trust upstream sources?

@gerrod3 (author) replied:

Fooled about which information: the name, namespace, version, or the sha256? The name, namespace, and version are up to the user, so they could be whatever the user wants. If you are talking about them differing from the downloaded artifact, then yes, it is possible if an upstream messes with its database to change the metadata served or change which artifact is pointed to. I think in general we trust the upstream source (and since the upstream will either be Galaxy or Pulp, I think this is a safe assumption to make). Maybe this is something we can consider in #1106.

Contributor commented:

Is this the sha256 of the tar.gz itself, or the sha256 of the MANIFEST.json?

Since ansible-galaxy collection build always generates a new MANIFEST.json and FILES.json, we set the mtime to time.time() for those files in the tarball. So if you simply run ansible-galaxy collection build twice, the two tar.gz files will have different sha256 digests.

Would this cause problems for you if someone tried to upload a rebuilt artifact?

As it is, the sha256 of the tarball isn't necessarily equivalent to (namespace, name, version).

@gerrod3 (author) replied:

This new sha256 field is for the tarball's sha256 (the underlying artifact of the CollectionVersion).

Uploading a rebuilt artifact wouldn't break Pulp; it would just cause confusion for the user. With this change, Pulp would gladly accept both the original and the rebuilt collection, since they would have different digests. Both could be stored in Pulp, but the repository you have them in would only hold the latest uploaded one. Doing a global content list for (namespace, name, version) could now return multiple results, so the user would have to know the upload date or the sha256 to select the right one.

However, having the new field be the sha256 of the MANIFEST.json might be a better idea. In that scenario, would the sha256 of the MANIFEST.json of the second built artifact be the same as the first's?

Contributor replied:

The MANIFEST.json will have the same sha, unless of course something else was changed:

$ shasum -a 256 sivel-toiletwater-0.0.10.tar.gz*
878311263fcd21f040022a33187d682a63a798e16a79d9bdf91d65e8af30b1a6  sivel-toiletwater-0.0.10.tar.gz.1
5fb2e8026e5dae65e1c40c19fa168883c524f3f99c90ec67a58ebe6ea920b51f  sivel-toiletwater-0.0.10.tar.gz.2
$ tar -Oxzvf sivel-toiletwater-0.0.10.tar.gz.1 MANIFEST.json | shasum -a 256
x MANIFEST.json
9a0812ebbd0107139c0e3e4684d585393cfce6a8d9fff4876e6cabd3e244ade7  -
$ tar -Oxzvf sivel-toiletwater-0.0.10.tar.gz.2 MANIFEST.json | shasum -a 256
x MANIFEST.json
9a0812ebbd0107139c0e3e4684d585393cfce6a8d9fff4876e6cabd3e244ade7  -

@gerrod3 (author) replied:

Thank you for testing this. Am I safe to assume that the MANIFEST.json is the source of truth for whether two collections are the same? If so, I'll change the new field to be the MANIFEST.json's sha256 to remove the issue of uploading duplicate build artifacts.

Contributor replied:

I guess that is up for interpretation, but to clarify: the MANIFEST.json holds a sha256 for FILES.json, which holds a sha256 for every file in the artifact. So the sha256 of a MANIFEST.json, which also holds namespace, name, and version, should accurately represent whether two collections are different.
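To make that chain concrete, here is a quick way to check it against a built artifact; a sketch, where the tarball name reuses the example above and the file_manifest_file key follows the Galaxy collection manifest format:

import hashlib
import json
import tarfile

# MANIFEST.json records the sha256 of FILES.json, which in turn records a
# sha256 for every other file in the collection artifact.
with tarfile.open("sivel-toiletwater-0.0.10.tar.gz") as tar:
    manifest = json.load(tar.extractfile("MANIFEST.json"))
    files_json = tar.extractfile("FILES.json").read()

expected = manifest["file_manifest_file"]["chksum_sha256"]
actual = hashlib.sha256(files_json).hexdigest()
assert actual == expected, "FILES.json does not match MANIFEST.json"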

Member commented:

My question is whether the FILES.json is always ordered in the same way, thereby giving the same checksum if the same files are present.

Contributor replied:

The order of FILES.json is determined by filesystem order, effectively that of os.listdir.

        raise ValidationError(_("Artifact already exists"))
    cv = CollectionVersion.objects.filter(sha256=sha256).first()
    if cv:
        raise ValidationError(_("Collection Version already exists"))
Member commented:

Another story for another day, but I think we need this serializer to parse the artifact in deferred_validate.

@gerrod3 (author) replied:

I think you are right; we should write plugin documentation for uploading content to Pulp, covering when and where fields should be processed and validated, because I can never remember it off the top of my head.

@gerrod3 linked an issue Jul 28, 2022 that may be closed by this pull request
@newswangerd (Contributor) commented:
I don't think this is a good idea at the moment. We don't really need this for repo management. This would be useful if we ever get asked to do multi-tenancy, but so far we haven't had to do that. It might actually make repo management harder. Imagine looking up foo.bar:1.0.0 and getting back 5 different collections with the same namespace, name, and version. How would I begin to determine which one to use? It wouldn't be hard to accidentally end up with a bunch of identical collections lying around, since ansible-galaxy doesn't reliably produce collections with the same SHA.

If we're going to do this, I would like to see a really solid reason for it. Otherwise, we're just creating a big headache for ourselves.

@gerrod3 force-pushed the sha256-digest branch 2 times, most recently from 3af07e4 to bf91f00 on July 29, 2022 22:41
@gerrod3 marked this pull request as draft August 2, 2022 19:05
@newswangerd removed the request for review from a team October 4, 2022 16:22
The stale bot commented Jan 2, 2023:

This pull request has been marked 'stale' due to lack of recent activity. If there is no further activity, the PR will be closed in another 30 days. Thank you for your contribution!

The stale bot added the stale label Jan 2, 2023
@gerrod3 closed this Jan 3, 2023
@gerrod3 deleted the sha256-digest branch January 3, 2023 15:26
Successfully merging this pull request may close these issues.

Content should have a digest in the (system-)uniqueness constraints