Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for overriding the default hash format (new --hash-format argument) #3709

Merged
merged 21 commits into from
Mar 7, 2021

Conversation

grossag
Copy link
Contributor

@grossag grossag commented Jun 22, 2020

This change adds support for a new --hash-format parameter that can be used to override the default hash format used by SCons. The default remains MD5, but this allows consumers to opt into SHA1, SHA256, or any other hash algorithm offered by their implementation of hashlib.

Contributor Checklist:

  • I have created a new test or updated the unit tests to cover the new/changed functionality.
  • I have updated CHANGES.txt (and read the README.rst)
  • I have updated the appropriate documentation

@bdbaddog
Copy link
Contributor

Adding to 4.1 release plans. Subject to change.

grossag added 5 commits August 4, 2020 09:07
This change adds support for a new --hash-format parameter that can be used to
override the default hash format used by SCons. The default remains MD5, but
this allows consumers to opt into SHA1, SHA256, or any other hash algorithm
offered by their implementation of hashlib.
Fixes one sider issue and a code error that broke some tests
1. Remove unnecessary writes of f1.in and f2.in. The former is already in the dir fixture and the latter is unused.
2. Untabify SConstruct file.
Rebasing off of master put my CHANGES.txt addition in an old release.
This commit moves it up to the correct section.
@grossag grossag force-pushed the topic/grossag/newhashes branch from cf61dc0 to a1bd89e Compare August 4, 2020 13:12
@grossag
Copy link
Contributor Author

grossag commented Aug 4, 2020

Yesterday's md5_chunksize PR conflicted with this one and the start of the branch was a bit old, so I have rebased this PR off of master.

@mwichmann
Copy link
Collaborator

mwichmann commented Aug 11, 2020

I stumbled across a trap in this transition, a hidden assumption that a signature is always 32 bytes. Just adding here so it's recorded somewhere. Look for:

SCons/Node/FS.py:                if len(sig) == 32:

A quick grep suggests there's only the one of these.

@grossag
Copy link
Contributor Author

grossag commented Sep 3, 2020

I stumbled across a trap in this transition, a hidden assumption that a signature is always 32 bytes. Just adding here so it's recorded somewhere. Look for:

SCons/Node/FS.py:                if len(sig) == 32:

A quick grep suggests there's only the one of these.

This function is in convert_old_entry, which is called here:

        if isinstance(sconsign_entry, FileBuildInfo):
            # This is a .sconsign file from before the Big Signature
            # Refactoring; convert it as best we can.
            sconsign_entry = self.convert_old_entry(sconsign_entry)

I could change the check from "== 32" to ">= 32" but I don't believe that this code would operate on anything other than MD5 hashes because it would only be called on very old .sconsign files.

@mwichmann
Copy link
Collaborator

Hmmm, I wonder if we can ditch that logic entirely. Many of these "old" things are 10+ years old.

@mwichmann
Copy link
Collaborator

So if this is going to be pursued, it's time to flush out if this approach is okay with the maintainer, or if a hash-format-detecting approach is needed (possibly with a one-time conversion tool to "upgrade" old sconsigns)?

@bdbaddog
Copy link
Contributor

I think I've suggested this before, if the hash isn't md5, then the sconsign file name should have the hash type in it.
That way it will be clear.

Unless we want to go down the rathole of trying to figure out which hash was used in an existing sconsign file, or create a new format which has the hash type used somewhere in the header of the sconsign file..

We can add logic to check for each filename when opening if there's already one, otherwise default if not speicfied?

@grossag
Copy link
Contributor Author

grossag commented Nov 11, 2020

I'm having trouble understanding how such a check would happen. Can you help me understand what the benefit of this would be? Would the downside be that such a check have the side effect that would unnecessarily cause rebuilds when switching hash formats if you are using a timestamp decider?

@grossag
Copy link
Contributor Author

grossag commented Nov 11, 2020

Also, two more questions:

  1. Should I keep the current (no hash included) name for MD5 and just add the hash name for other formats?
  2. Which part of SCons owns the process of figuring out the sconsign.dblite name and which part of SCons calls into it? I ask because I see sconsign.py and dblite.py but I can't figure out who calls into them. I see Environment.SConsignFile but I don't see any non-test code calling into it.

@mwichmann
Copy link
Collaborator

mwichmann commented Nov 11, 2020

Can I just respond to question 2 above by saying "it's horrid"?

I had had thoughts of changing things to use multihash (https://pymultihash.readthedocs.io/en/latest/), but it might be this is a case of YAGNI, and just having a .sconsign.dblite and .sconsign_sha256.dblite is all that's practically needed (that's "pseudo-code", not a concrete proposal for what the naming should be).

@grossag
Copy link
Contributor Author

grossag commented Nov 12, 2020

Ok I think I figured out how to do this. I'll push an update in the next hour if I can validate that I haven't broken SConsign-related tests.

This was requested in the code review. The sconsign database file name is still
.sconsign.dblite if the hash format is not overridden, but if it is, the name
will be something like .sconsign_sha256.dblite.
@grossag
Copy link
Contributor Author

grossag commented Nov 13, 2020

PR is updated to change the sconsign db file name if the hash format is overridden. Functional test now validates that the db name is correct if the hash format is overridden either by --hash-format or by SCons.Util.set_hash_format().

CHANGES.txt Outdated Show resolved Hide resolved
SCons/Environment.py Outdated Show resolved Hide resolved
SCons/Environment.xml Outdated Show resolved Hide resolved
SCons/Util.py Outdated Show resolved Hide resolved
@bdbaddog
Copy link
Contributor

Looks good to me.
A couple questions (included in inline comments with code but want to summarize here as well)

  1. Deprecation cycle. Likely we need to do this, and also make sure we note the new/old alternatives in the docs
  2. What should scons do if no hash type is specified but say there's a .sconsign_sha256.dblite present. Should it detect it and auto set the hashtype?

@mwichmann
Copy link
Collaborator

mwichmann commented Dec 14, 2020

2. What should scons do if no hash type is specified but say there's a .sconsign_sha256.dblite present. Should it detect it and auto set the hashtype?

This is the same question I have with other work in progress, if multiple signature DBs are detected but no directive to help pick one is detected, how should SCons behave? Default to using .dblite always, or pick a newer one if it exists, since it's presumably "intentional" to have created it? I can see arguments both ways.

@bdbaddog bdbaddog changed the title Add support for overriding the default hash format Add support for overriding the default hash format (new --hash-format argument) Dec 14, 2020
@bdbaddog
Copy link
Contributor

I think it should walk the possibilities and take the first in order.
md5, sha1, sha256
If there's no default filenamed.

Only thing left is to pick a hash format based on the sconsign database name.
1. Fix failure finding UserError.
2. Fix bad string formatting.
3. Add test case covering passing an invalid hash format.
4. Remove blake2b, as I haven't tested it. We can add it some day if people want it.
@grossag
Copy link
Contributor Author

grossag commented Dec 15, 2020

Looks good to me.
A couple questions (included in inline comments with code but want to summarize here as well)

  1. Deprecation cycle. Likely we need to do this, and also make sure we note the new/old alternatives in the docs
  2. What should scons do if no hash type is specified but say there's a .sconsign_sha256.dblite present. Should it detect it and auto set the hashtype?

I did some diff updates and these two comments are the big ones left.

For #1, I did add a deprecation warning and updated the documentation but I am not sure if there are other changes I need to make.

For #2, I am concerned that SConsign.Get_Database() is called too late. As far as I can tell, the SConsign database is lazy-loaded so for example, in a test, the first call to self.taskmaster.next_task() is what causes the database to be loaded (Node.get_stored_info() calls Dir.sconsign() which then loads the database if it isn't already loaded). As a result, I would prefer to not do this one.

@mwichmann
Copy link
Collaborator

Seems like there's a flake8 warning from CI: SCons/Util.py:1526 do not use bare 'except' (E722)

I'd say we can discuss whether an actual deprecation warning needs to be added to the code as a separate item, but let's see what Bill thinks...

Now that we have a much more limited selection of supported algorithms,
I don't need the code to actually call the hash function in set_hash_format.
This also means that I don't need a try/except and can instead use getattr().
@bdbaddog
Copy link
Contributor

IMHO this is great work.

One question to resolve, and feel free to say push this to later, is about using the list of hashes from hashlib.algorithms_guaranteed or hashlib.algorithms_available?

@grossag @mwichmann - thoughts?
Is supporting the others not in the list from this PR important? or really in to the maybe someone someday might want them?
If we say punt, I'm good with merging this as is.

CHANGES.txt Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants