Add sha256 uniqueness constraint to CollectionVersion #1119
Conversation
ContentArtifact = apps.get_model('core', 'ContentArtifact')
RemoteArtifact = apps.get_model('core', 'RemoteArtifact')
Artifact = apps.get_model('core', 'Artifact')
collection_bulk = {}
Getting it to work is the first priority, of course. But in the end, we want this migration to work in batches.
On a really large installation, one by one is way too slow, but all in one will blow up the RAM.
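A minimal sketch of the batching idea (illustrative only: it assumes the empty sha256 column was added by a prior schema migration, and the batch size is arbitrary):

# Illustrative only: stream rows and write them back in fixed-size batches so
# neither a row-by-row loop nor one giant bulk operation is needed.
BATCH_SIZE = 1000

def populate_sha256(apps, schema_editor):
    CollectionVersion = apps.get_model("ansible", "CollectionVersion")
    ContentArtifact = apps.get_model("core", "ContentArtifact")

    batch = []
    # iterator() keeps memory flat; chunk_size controls DB round-trips.
    for cv in CollectionVersion.objects.filter(sha256="").iterator(chunk_size=BATCH_SIZE):
        ca = (
            ContentArtifact.objects.filter(content_id=cv.pk, artifact__isnull=False)
            .select_related("artifact")
            .first()
        )
        if ca is not None:
            cv.sha256 = ca.artifact.sha256
            batch.append(cv)
        if len(batch) >= BATCH_SIZE:
            CollectionVersion.objects.bulk_update(batch, ["sha256"])
            batch = []
    if batch:
        CollectionVersion.objects.bulk_update(batch, ["sha256"])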
After asking whether anyone uses GitRemotes (which can create on-demand Collections), it seems their use is not widespread, so I simplified the migration to only migrate collections that have artifacts in the system. If on-demand collections are present, an exception is raised listing the collections that failed.
Force-pushed from f90ee9e to ceda235.
# If there are on-demand collections then the next migration will fail, so error here with
# helpful message on how to fix. No work will be performed by this migration on a second-run.
if len(collections_on_demand) > 0:
    raise Exception(
I don't think it is good practice to raise an exception in the middle of the migration because a) it has already performed some changes to the db, and b) rollback might not be easy in the middle of a failed upgrade.
Is it possible to fetch the sha256 from the RemoteArtifact of the on-demand collection content?
I don't think it is good practice to raise an exception in the middle of the migration because a) it has already performed some changes to the db, and b) rollback might not be easy in the middle of a failed upgrade.
Well, I tried to design the migration to be re-runnable without redoing the work it has already done, and reverting this migration should be as easy as simply removing the new sha256 column.
Is it possible to fetch the sha256 from the RemoteArtifact of the on-demand collection content?
Yes, if sha256 was set on the RemoteArtifact, but it isn't required, and the Git sync code does not set the sha256 when doing an on_demand sync.
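For reference, a hedged sketch of that fallback (assuming the historical models fetched via apps.get_model above; since RemoteArtifact.sha256 is optional, this can still come up empty):

# Illustrative helper: prefer the downloaded Artifact's digest, fall back to
# any RemoteArtifact that recorded one. Returns None for on-demand content
# whose remote (e.g. a Git remote) never stored a sha256.
def find_sha256(content_artifact, RemoteArtifact):
    if content_artifact.artifact is not None:
        return content_artifact.artifact.sha256
    remote_artifact = (
        RemoteArtifact.objects.filter(content_artifact=content_artifact)
        .exclude(sha256=None)
        .exclude(sha256="")
        .first()
    )
    return remote_artifact.sha256 if remote_artifact else None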
I see.
My suggestion would be to first (before doing any modifications to the data on the collections) identify whether there are any collections for which it is not possible to retrieve a sha256 (whether from the artifact or remote_artifact), then find which repos they are in, and lastly raise an error saying which repos need to be synced with the immediate policy. Displaying a list of collection.pk values is not very friendly or informative.
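Something along those lines could look like this (a sketch only; it assumes pulpcore's core.RepositoryContent through-model and that an on-demand unit is one whose ContentArtifact has no Artifact):

# Illustrative pre-flight check: find collections with no downloaded artifact,
# map them to repository names, and fail before any data is modified.
def check_for_on_demand(apps, schema_editor):
    CollectionVersion = apps.get_model("ansible", "CollectionVersion")
    ContentArtifact = apps.get_model("core", "ContentArtifact")
    RepositoryContent = apps.get_model("core", "RepositoryContent")

    cv_pks = CollectionVersion.objects.values_list("pk", flat=True)
    on_demand_pks = ContentArtifact.objects.filter(
        content_id__in=cv_pks, artifact__isnull=True
    ).values_list("content_id", flat=True)
    if not on_demand_pks.exists():
        return

    repo_names = sorted(
        RepositoryContent.objects.filter(content_id__in=on_demand_pks)
        .values_list("repository__name", flat=True)
        .distinct()
    )
    raise RuntimeError(
        "Cannot compute sha256 for on-demand collections. Sync these "
        f"repositories with policy=immediate first: {repo_names}"
    )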
If we don't want anything to change in case we are unable to perform the whole migration, I would simply rely on the atomic transaction a migration usually runs in. Adding the extra check upfront introduces unnecessary complexity.
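For context, Django wraps each migration in a transaction on databases with transactional DDL (such as PostgreSQL), so an exception anywhere rolls the whole migration back. A sketch, reusing the populate_sha256 sketch from earlier (the dependency is illustrative):

from django.db import migrations

class Migration(migrations.Migration):
    # atomic=True is the default: if populate_sha256 raises, every change made
    # by this migration is rolled back, leaving the database untouched.
    atomic = True

    dependencies = [("ansible", "0001_initial")]  # illustrative dependency

    operations = [
        migrations.RunPython(populate_sha256, migrations.RunPython.noop),
    ]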
@@ -169,7 +172,7 @@ def relative_path(self):

     class Meta:
         default_related_name = "%(app_label)s_%(model_name)s"
-        unique_together = ("namespace", "name", "version")
+        unique_together = ("sha256",)
So the artifact contains the information about namespace, name and version.
Do you see any conceivable way (maybe through on-demand sync) that pulp could be fooled about that information? Is this a problem we intend to solve, or do we in general trust upstream sources?
Which information being fooled? The name, namespace, version, or the sha256? The name, namespace, and version are up to the user, so they could be whatever the user wants. If you are talking about them differing from the downloaded artifact, then yes, it is possible if an upstream messes with its database to change the metadata served or change which artifact is pointed to. I think in general we trust the upstream source (and since upstream will either be Galaxy or Pulp, I think this is a safe assumption to make). Maybe this is something we can consider in #1106.
Is this the sha256 of the tar.gz itself, or the sha256 of the MANIFEST.json?
Since ansible-galaxy collection build always generates a new MANIFEST.json and FILES.json, we set the mtime to time.time() for those files in the tarball. So if you simply run ansible-galaxy collection build 2 times, the tar.gz will have a different sha256.
Would this cause problems for you if someone tried to upload a rebuilt artifact?
As it is, the sha256 of the tarball isn't necessarily equivalent to (namespace, name, version).
This new sha256 field is for the tarball sha256 (the underlying artifact of the CollectionVersion).
Uploading a rebuilt artifact wouldn't break Pulp; it would just cause confusion for the user. With this change Pulp would gladly accept both the original and rebuilt collection since they would have different digests. Both could be stored in Pulp, but the repository you have them in would only hold the latest uploaded one. Doing a global content list for (namespace, name, version) could now return multiple results, so the user would have to know the upload date or sha256 to select the right one.
However, having the new field be the sha256 of the MANIFEST.json might be a better idea. In that scenario, would the sha256 of the MANIFEST.json of the second built artifact be the same as the first?
The MANIFEST.json will have the same sha, unless of course something else was changed:
$ shasum -a 256 sivel-toiletwater-0.0.10.tar.gz*
878311263fcd21f040022a33187d682a63a798e16a79d9bdf91d65e8af30b1a6 sivel-toiletwater-0.0.10.tar.gz.1
5fb2e8026e5dae65e1c40c19fa168883c524f3f99c90ec67a58ebe6ea920b51f sivel-toiletwater-0.0.10.tar.gz.2
$ tar -Oxzvf sivel-toiletwater-0.0.10.tar.gz.1 MANIFEST.json | shasum -a 256
x MANIFEST.json
9a0812ebbd0107139c0e3e4684d585393cfce6a8d9fff4876e6cabd3e244ade7 -
$ tar -Oxzvf sivel-toiletwater-0.0.10.tar.gz.2 MANIFEST.json | shasum -a 256
x MANIFEST.json
9a0812ebbd0107139c0e3e4684d585393cfce6a8d9fff4876e6cabd3e244ade7 -
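The same check in Python, for anyone reproducing it without shelling out (file names copied from the example above):

import hashlib
import tarfile

def manifest_sha256(path):
    # Hash MANIFEST.json straight out of the tarball, without unpacking to disk.
    with tarfile.open(path, "r:gz") as tar:
        return hashlib.sha256(tar.extractfile("MANIFEST.json").read()).hexdigest()

# Both rebuilt tarballs should report the same digest:
print(manifest_sha256("sivel-toiletwater-0.0.10.tar.gz.1"))
print(manifest_sha256("sivel-toiletwater-0.0.10.tar.gz.2"))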
Thank you for testing this. Am I safe to assume that the MANIFEST.json is the source of truth for two collections being the same? If so, I'll change the new field to be the MANIFEST.json's sha256 to remove the issue of uploading duplicate build artifacts.
I guess that is up for interpretation, but to clarify, the MANIFEST.json holds a sha256 for FILES.json, which holds a sha256 for every file in the artifact. So the sha256 of a MANIFEST.json, which also holds namespace, name, and version, should accurately represent whether 2 collections are different.
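To make that chain concrete, a sketch that walks it (key names follow the collection manifest format as I understand it; treat them as assumptions):

import hashlib
import json
import tarfile

def verify_checksum_chain(path):
    # MANIFEST.json pins FILES.json, and FILES.json pins every other file.
    with tarfile.open(path, "r:gz") as tar:
        manifest = json.load(tar.extractfile("MANIFEST.json"))
        files_raw = tar.extractfile("FILES.json").read()
        # Step 1: MANIFEST.json -> FILES.json
        expected = manifest["file_manifest_file"]["chksum_sha256"]
        assert hashlib.sha256(files_raw).hexdigest() == expected
        # Step 2: FILES.json -> each regular file in the artifact
        for entry in json.loads(files_raw)["files"]:
            if entry["ftype"] != "file":
                continue
            blob = tar.extractfile(entry["name"]).read()
            assert hashlib.sha256(blob).hexdigest() == entry["chksum_sha256"]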
My question is whether the FILES.json is always ordered in the same way, thereby giving the same checksum if the same files are present.
The order of FILES.json is determined by filesystem order, effectively that of os.listdir.
ValidationError(_("Artifact already exists")) | ||
cv = CollectionVersion.objects.filter(sha256=sha256).first() | ||
if cv: | ||
raise ValidationError(_("Collection Version already exists")) |
Another story for another day, but I think we need this serializer to parse the artifact in deferred_validate.
I think you are right. We should write plugin documentation for uploading content to Pulp, covering when and where fields should be processed and validated, because I can never remember it off the top of my head.
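A rough sketch of what parsing in deferred_validate might look like, assuming pulpcore's SingleArtifactContentUploadSerializer hook (the field handling here is illustrative, not the plugin's actual code):

import json
import tarfile

from pulpcore.plugin.serializers import SingleArtifactContentUploadSerializer

class CollectionVersionUploadSerializer(SingleArtifactContentUploadSerializer):
    def deferred_validate(self, data):
        # Runs inside the task once the artifact exists, so metadata can be
        # read from the tarball instead of trusted from the client.
        data = super().deferred_validate(data)
        artifact = data["artifact"]
        with artifact.file.open("rb") as fp:
            with tarfile.open(fileobj=fp, mode="r:gz") as tar:
                info = json.load(tar.extractfile("MANIFEST.json"))["collection_info"]
        data["namespace"] = info["namespace"]
        data["name"] = info["name"]
        data["version"] = info["version"]
        return data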
I don't think this is a good idea at the moment. We don't really need this for repo management. This would be useful if we ever get asked to do multi-tenancy, but so far we haven't had to do that. It might actually make repo management harder to do. Imagine looking up […]. If we're going to do this, I would like to see a really solid reason for it. Otherwise, we're just creating a big headache for ourselves.
Force-pushed from 3af07e4 to bf91f00.
This pull request has been marked 'stale' due to lack of recent activity. If there is no further activity, the PR will be closed in another 30 days. Thank you for your contribution!
Have to do lots of testing for this one; the migration is not pretty.