
Investigate what might be causing partial matches from single-file Etherscan contracts #936

Closed
kuzdogan opened this issue Feb 23, 2023 · 19 comments

Comments

@kuzdogan
Member

kuzdogan commented Feb 23, 2023

Looking into verifying contracts from Etherscan, I notice we can almost always verify perfectly if the contracts were verified with standard-json input, which is great.

This leaves me wondering whether we can find a pattern in single-file or multi-part contracts that would enable perfect matches, because, in contrast, we almost always get partial matches with those.

My initial naive guess was the prepended "Submitted to Etherscan at..." part in the code, but apparently that's only added in the UI and not in the API.

A way to find this out might be:

  • Deploy a contract to a testnet, saving its metadata beforehand.
  • Verify it as a single-file contract on Etherscan.
  • Check if it's verified. If not, compare the metadata generated by Sourcify during verification with the saved metadata of the deployed contract.

Example single file: https://etherscan.io/address/0x3446dd70b2d52a6bf4a5a192d9b0a161295ab7f9#code https://etherscan.io/address/0x4691937a7508860f876c9c0a2a617e7d9e945d4b#code


@sealer3
Contributor

sealer3 commented Mar 30, 2023

I found the solution. Etherscan was replacing my Unix newlines (\n) with Windows newlines (\r\n). I was able to get the correct source keccak256 hash from the Etherscan output by removing the carriage returns (\r).

Also, Etherscan strips trailing newlines (confirmed) and possibly other whitespace at the end of the contract. Newlines and other whitespace might be stripped from the beginning of the contract string too.
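
A minimal sketch of that normalization, assuming ethers for keccak256 (the helper name and the exact whitespace handling are illustrative):

import { keccak256, toUtf8Bytes } from "ethers";

// Undo the transformations Etherscan appears to apply: convert Windows line
// endings back to Unix ones, drop stripped trailing whitespace, restore a
// single trailing newline, then hash the result the same way the metadata's
// sources[path].keccak256 is computed.
function normalizedSourceHash(etherscanSource: string): string {
  const normalized =
    etherscanSource.replace(/\r\n/g, "\n").replace(/\s+$/, "") + "\n";
  return keccak256(toUtf8Bytes(normalized));
}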

@marcocastignoli
Member

marcocastignoli commented Mar 30, 2023

@kuzdogan if I remember correctly, Sourcify already has a function that tests the keccak256 while varying whitespace and trailing newlines?

EDIT: found it

function generateVariations(pathContent: PathContent): PathContent[] {
  const variations: string[] = [];
  const original = pathContent.content;
  for (const contentVariator of CONTENT_VARIATORS) {
    const variatedContent = contentVariator(original);
    for (const endingVariator of ENDING_VARIATORS) {
      const variation = endingVariator(variatedContent);
      variations.push(variation);
    }
  }
  // (rest of the function omitted in this quote; per the signature it returns
  // the variations as PathContent objects)
}
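
For context, the variators referenced above are just small string transforms along these lines (an abbreviated, illustrative sketch, not the exact Sourcify definitions):

// Hypothetical shapes of the variator arrays used in generateVariations.
const CONTENT_VARIATORS = [
  (content: string) => content,                           // original
  (content: string) => content.replace(/\r\n/g, "\n"),    // Windows -> Unix line endings
  (content: string) => content.replace(/\r?\n/g, "\r\n"), // Unix -> Windows line endings
];

const ENDING_VARIATORS = [
  (content: string) => content,                  // original
  (content: string) => content.trimEnd(),        // strip trailing whitespace
  (content: string) => content.trimEnd() + "\n", // normalize to a single POSIX ending
  (content: string) => content + "\n",           // append an extra POSIX ending
  (content: string) => content + "\r\n",         // append an extra Windows ending
];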

@sealer3
Contributor

sealer3 commented Mar 30, 2023

@marcocastignoli I did get a perfect match manually from the Etherscan API output by replacing the newlines and regenerating the metadata from solc. However, Sourcify was not able to get a verified match on its own when importing from Etherscan, so I think this is still a bug; we should investigate the variators. (I also wrote my own application that tries variations in a similar way, and it got a perfect match on my own contract by using Unix line endings plus a newline at the end.)

I have since learned more about the metadata hash. There are even more issues. Etherscan doesn't know the filename OR if the metadata entered by the user is even correct for single-file uploads.

Examples:

  • You can change the number of runs slightly -> metadata hash changes (still compiles to the same bytecode)
  • Change the filename or don't know the correct filename (currently inferred from the contract name for single file uploads) -> metadata hash changes, still same bytecode
  • Have multiple extra newlines at beginning or end -> Etherscan removes them and the file hash changes. It's much harder to predict data, in this case whitespace, when it disappears altogether. Current variators won't find the match if your contract ends with an extra newline.
  • On older contracts without embedded solc version, even change the version slightly and it still compiles to the same bytecode
  • Many contracts were "flattened" only for verification because it was difficult or not possible to upload multiple files to Etherscan. People literally wrote tools to do this (you might know it was common practice) because of Etherscan's UI issues. These can never pass a full match.

These are all reasons why an Etherscan single file upload might result in a partial match. I was able to verify a smart contract on Etherscan with a different filename (e.g. MyContract3.sol) than the contract name (e.g. MyContract) and entered 100 runs instead of the 200 I used to compile it. So Sourcify guesses the wrong filename (guess: MyContract.sol) and has the wrong runs number (100) and yet still finds the partial match, but the metadata hash will definitely be wrong.

(Probably everyone already knows this... but it's fun to put the thoughts together myself :)
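
To make the "same bytecode, different metadata hash" point concrete: the match can succeed while the metadata differs because the deployed bytecode carries the CBOR auxdata (which contains the metadata hash) only at its very end. A rough sketch of separating the two parts, assuming the usual two-byte length suffix (names are illustrative):

// Separate the execution bytecode from the CBOR auxdata that holds the
// metadata hash. The last two bytes of the deployed bytecode encode the
// auxdata length in bytes (not counting those two bytes themselves).
function splitAuxdata(deployedBytecode: string): {
  executionBytecode: string;
  auxdata: string;
} {
  const hex = deployedBytecode.replace(/^0x/, "");
  const auxdataLength = parseInt(hex.slice(-4), 16);
  const boundary = hex.length - 4 - auxdataLength * 2;
  return {
    executionBytecode: hex.slice(0, boundary),
    auxdata: hex.slice(boundary), // CBOR payload + 2-byte length suffix
  };
}

// Two compilations that differ only in runs or filename can share the same
// executionBytecode while their auxdata differs: exactly the partial-match case.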

@marcocastignoli
Member

Sorry, I forgot to mention that that function is not currently called inside the Etherscan verification process; I'm trying to implement that right now.

@marcocastignoli
Member

@sealer3

(Probably everyone already knows this... but it's fun to put the thoughts together myself :)

Forgot to reply to the message itself: thanks a lot for your research. Actually, very few people know about this; I didn't, for example.

I also wrote my own application that tries variations in a similar way

Is this application open source?

@sealer3
Contributor

sealer3 commented Mar 30, 2023

Is this application open source?

I wrote it in about 20 minutes just for this issue. It is just a simple Python script that looks at permutations of transformations to the source code: change line endings to unix/Windows, prepend/append newline. It turns out the way I wrote it is just like Sourcify's variators. My version opens a subprocess and launches solc for each check, so it's not as clean as running solcjs directly. Instead of looking at the keccak256 hash of the source file, I was checking for a perfect bytecode match of the entire smart contract binary.

@marcocastignoli
Member

@kuzdogan integrating this into Sourcify will take more time than I expected. I'll close this issue and create a new one explaining how we can implement it.

@kuzdogan
Member Author

@marcocastignoli Why? Isn't it simply putting the Etherscan verification through file variations?

@marcocastignoli
Member

Potentially yes, but in order to optimize it we need some refactoring.

@kuzdogan
Member Author

Alright, let's open a new issue without closing this one and link the intro text to it. I'll edit it if you can't.

@marcocastignoli
Member

@sealer3 do you have an example of a contract that gets perfectly verified in this case?

did get a perfect match manually from Etherscan API output by replacing the newlines and regenerating the metadata from solc.

@sealer3
Contributor

sealer3 commented Mar 30, 2023

@marcocastignoli Yes. Here is one: https://sepolia.etherscan.io/address/0x6F1D75a53a8805DcA5347aE5F3eDE035CAE3CBC1

You can try for yourself by removing the carriage returns and adding a newline at the end before compiling. I submitted the correct metadata to Etherscan and named the file according to the contract name, so a Sourcify that tries the variations should be able to perfectly match it.

marcocastignoli added a commit that referenced this issue Apr 3, 2023
experiment to create file variations before running verification, searching for perfect matches
@marcocastignoli
Member

Yes. Here is one: https://sepolia.etherscan.io/address/0x6F1D75a53a8805DcA5347aE5F3eDE035CAE3CBC1

Ok, I pushed this commit (5f79dc2) to a new branch with a proof of concept of this feature, and it successfully verifies the contract mentioned above. It's not optimized because it runs the compilation multiple times.

In order to find all the files that are part of the same variation, I had to group the generated variations per combination, so generateVariations now returns an identifier for the variation together with the path and the content.

@kuzdogan in this proof of concept I'm verifying twice: first while searching for the perfect match with the variations, and then, once the right variation (giving a perfect match) is found, by passing it through the standard Sourcify process for session verification (checkContractsInSession, verifyContractsInSession).

@kuzdogan
Member Author

kuzdogan commented Apr 3, 2023

Thinking about this further, I think this should be something not limited to Etherscan verification.

Currently, we take the metadata.json the user provides when verifying as the source of truth. That is, we generate the variations of the source files and try to match them against the keccak256 hashes in the provided metadata.json. If the user happens to provide us a modified metadata, we don't get a full match.

However, it could be that we can generate a metadata file whose hash matches the one in the bytecode. Here we were able to spot one common variation causing full matches to be lost. I know one other possible case, for example: Truffle adding a project: prefix to the source file paths. If we encounter other common variation patterns in the wild, we can add them to our pipeline.

I think this should be its own module that not only tries matching the hashes of the source files (as we do now), but also generates variations of the metadata files to match the CBOR metadata hash.
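
A sketch of what one such metadata-level variation could look like, using the Truffle project: prefix case mentioned above (the helper and its exact behavior are hypothetical):

// Hypothetical path variator for the Truffle case: strip the "project:/"
// prefix from every source path, so the regenerated metadata can be rehashed
// and compared against the hash embedded in the bytecode.
function stripProjectPrefix(
  metadataSources: Record<string, unknown>
): Record<string, unknown> {
  const fixed: Record<string, unknown> = {};
  for (const [path, source] of Object.entries(metadataSources)) {
    fixed[path.replace(/^project:\//, "")] = source;
  }
  return fixed;
}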

@kuzdogan
Member Author

kuzdogan commented Apr 3, 2023

Summarizing and tidying up what I'm thinking:

Currently how we verify is as follows:

  1. User gives us a metadata
  2. User gives us source files
  3. We generate 3x7=21 “variations” of each source file:
    1. Variations in the content of the file: replace Win style line endings with POSIX style and vice versa (2 + original = 3)
    2. Variations in the endings of the file: trim ending whitespaces, add Win ending, add POSIX ending, add additional Win and POSIX endings (6 + original = 7)
  4. Save them in a map. The keccak256 of the resulting variation is the key, the content is the value
  5. Check the keccak256 hashes listed under sources in the metadata
  6. Find them in the byHash map.
  7. Verify if complete. Mark as incomplete if they couldn’t be found

This lets us easily find matching sources for a given metadata, but we don’t do variations on the metadata itself. Hence, in some cases, we can only get partial matches.
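
A condensed sketch of steps 3-6 (names and shapes are illustrative, not the exact lib-sourcify internals; assumes ethers for keccak256):

import { keccak256, toUtf8Bytes } from "ethers";

type PathContent = { path: string; content: string };
type MetadataSources = Record<string, { keccak256: string }>;

// Hash every variation of every provided source, then look up the keccak256
// values declared in the metadata in that byHash map.
function findSourcesByMetadataHashes(
  providedSources: PathContent[],
  metadataSources: MetadataSources,
  generateVariations: (source: PathContent) => PathContent[]
): Map<string, PathContent | undefined> {
  const byHash = new Map<string, PathContent>();
  for (const source of providedSources) {
    for (const variation of generateVariations(source)) {
      byHash.set(keccak256(toUtf8Bytes(variation.content)), variation);
    }
  }
  const matched = new Map<string, PathContent | undefined>();
  for (const [path, { keccak256: expected }] of Object.entries(metadataSources)) {
    // An undefined entry means that source couldn't be found; at best a partial match.
    matched.set(path, byHash.get(expected));
  }
  return matched;
}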

  • ✅ First, try to verify with the given metadata file. If we get a full match all good
  • ⚠️ If partial match, do metadata variations (see next section).
  • ❌ If no match, break, because the variations cover only syntactic differences; if the "execution" bytecode is different, there's nothing we can do.

⚠️ Partial match

For the partial matches, we now try to find the original metadata file.

We have a target metadata hash in the onchain bytecode:

(screenshot: the metadata hash embedded at the end of the onchain bytecode)

Here our target metadata hash is 0x1220dceca8706b29e917dacf25fceef95acac8d90d765ac926663ce4096195952b61, or QmdD3hpMj6mEFVy9DP4QqjHaoeYbhKsYvApX1YZNfjTVWp in base58 string encoding.

This is the CIDv0 format of IPFS: 0x1220 (the multihash prefix for a 32-byte sha2-256 digest) + the sha256 hash

It's not the sha256 of the whole file, though: IPFS has its own way of breaking files down into chunks and creating a hash for each file. We don't have to do that ourselves and can just use the respective js-ipfs package.

Ok so our target is QmdD3hpMj6mEFVy9DP4QqjHaoeYbhKsYvApX1YZNfjTVWp
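
A rough sketch of both directions, assuming the bs58 and ipfs-only-hash packages (the implementation Sourcify actually ended up with differs; see the commits referenced below):

import bs58 from "bs58";
import Hash from "ipfs-only-hash";

// The CBOR auxdata stores the metadata hash as raw multihash bytes:
// 0x12 (sha2-256) + 0x20 (32-byte length) + digest. Base58-encoding those
// 34 bytes gives the familiar Qm... CIDv0 string.
function multihashToCidV0(multihashBytes: Uint8Array): string {
  return bs58.encode(multihashBytes);
}

// Compute the CIDv0 a candidate metadata JSON string would get on IPFS.
// This is not a plain sha256 of the file; the library applies IPFS's own
// chunking/encoding before hashing, so we don't have to.
async function metadataCid(metadataJson: string): Promise<string> {
  return Hash.of(metadataJson);
}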

For each source file, generate the above 3x7=21 variations

Let’s call the 3 content variations A, B, C

Let’s call the 7 ending variations 1, 2, …, 7

for content of [A,B,C]
	for ending of [1,2,...,6,7]
		// replace all source[path].keccak256 fields in the metadata
		const newMetadata = replaceSourceHashes(variations[content][ending], originalMetadata)
		// take metadataHash
		const metadataHashVariation = getIpfsHash(newMetadata)
		if metadataHashVariation === targetMetadataHash
			return metadataHashVariation

Potentially we can add other variations, like the Truffle case project:/

Say we spotted the project:/ prefix; then we generate the variations [Y, Z]

function tryToFindOriginalMetadata(deployedBytecode)
	const [_, targetMetadataHash] = splitCBOR(deployedBytecode)
	for truffle of [Y,Z]
		for content of [A,B,C]
			for ending of [1,2,...,6,7]
				// replace all source[path].keccak256 fields in the metadata
				const newMetadata = replaceSourceHashes(variations[truffle][content][ending], originalMetadata)
				// take metadataHash
				const metadataHashVariation = getIpfsHash(newMetadata)
				if metadataHashVariation === targetMetadataHash
					checkedContract.replaceMetadataAndSourceFiles(newMetadata, variations[truffle][content][ending])
					return true;

If we have found the metadata, recompile and verify again, because theoretically someone could place a metadata hash into the bytecode manually.

This would be somewhere in the verification, after we find a partial match:

if (match.status === "partial")
	if (checkedContract.tryToFindOriginalMetadata(deployedBytecode))
		// Must compile again with the new metadata.
		recompiled = checkedContract.recompile()
		matchWithDeployedBytecode(match, recompiled.deployedBytecode, deployedBytecode)
	

marcocastignoli added a commit that referenced this issue Apr 4, 2023
…bytecode WIP

*  add tryToFindOriginalMetadata to CheckedContract: tries to reconstruct the original metadata by iterating over all file variations
* call tryToFindOriginalMetadata after `matchWithDeployedBytecode` if match is not perfect
marcocastignoli added a commit that referenced this issue Apr 5, 2023
marcocastignoli added a commit that referenced this issue Apr 5, 2023
marcocastignoli added a commit that referenced this issue Apr 6, 2023
…hash by updating the source section of the metadata and generating the metadata CID
marcocastignoli added a commit that referenced this issue Apr 11, 2023
* I translated the ipfsHash.cpp file from the solidity repo
* now that everything is synchronous, I simplified the code that replaces the ipfs and bzz hashes in the urls array in metadata.sources
* I added some comments to clarify the storeByHash grouping by variation
marcocastignoli added a commit that referenced this issue Apr 12, 2023
@marcocastignoli
Member

Closing this, implemented with #976

@kuzdogan
Member Author

While the PR looks good for the case provided here, I wasn't able to find any other examples for a contract that would normally be a "partial" match but gets a "perfect" match thanks to the variations. I am comparing verifying on staging (before the PR is merged) and running the PR locally. We are still getting lots of partial matches from "single file" contract verifications.

We should still look for ways to increase our chances of verifying perfectly. Either we need to try more contracts, or we should further investigate why we can't generate a "perfect match" variation. Reopening.

@kuzdogan kuzdogan reopened this Apr 20, 2023
@sealer3
Contributor

sealer3 commented Apr 20, 2023

Possible paths of research:

  • Investigate the behavior of contract flatteners and how they generate metadata
    • Do users of these flatten before or after generating the bytecode they actually deploy?
  • Investigate the relationship between contracts and filenames
    • Today we are assuming the contract filename is exactly "[target contract name].sol". What types of variations exist in real life? Lowercase/uppercase? Which directory?
    • Determine where the code is being compiled and what the default metadata looks like. For example, if the code is put into Remix, do users usually compile contracts from the contracts directory?
  • Reach out to contract deployers and ask what they did to compile the bytecode (Testnets might be a good place to do this cheaply. Send a message to every single recent Etherscan verified single-file contract deployer on Goerli/Sepolia with a question.)
  • Incentives-based discoveries via contest
    • Dump a bunch of partial match contracts in a folder and hold a contest rewarding a small prize to whoever finds the first perfect match for each of the contracts. Also require them to show how they found the solution for each one.
    • Or, give a prize to whoever writes a program that can find perfect metadata from single-file contracts in a more generic way

I imagine it will be impossible to match many of the contracts because the metadata (including filenames) is simply not accessible. The real filename might have nothing to do with anything inside the contract, which makes it almost impossible to guess. For instance, the contract could be generic while the filename is specific to the contract's application: a generic contract whose Solidity name is MyERC20Token, which uses arguments to set its parameters and whose source code actually produces the correct hash, might just have been renamed DogeElonMuskShibaGPTToken.sol.

I wonder if in the future there will be discoveries of perfect matches like the way people run their computers to discover other things, like new large primes. Each one is a puzzle.

@marcocastignoli
Member

  • While there might be a possibility to unflatten files, reconstructing the filename could prove to be a challenging task, as highlighted by @sealer3. I know that some flatteners include the file name as a comment, but not all of them do.
  • Implementing a brute-force mechanism to discover contract paths might not be the most efficient solution. Although it is possible to explore default filesystems of popular frameworks such as Hardhat, Truffle, and Remix, it is important to consider that developers can create custom subfolders and use arbitrary naming conventions. Consequently, the chances of success could be quite low, and the time and resources invested in this method might not yield satisfactory results.
  • One idea worth exploring is searching for the contract name or sections of code on platforms like GitHub and attempting to compile the top results.

That said, I think this will be a long process, and it makes sense to implement each solution when it is mature enough. If the features implemented in #976 work, then I would merge it without closing this issue.

kuzdogan added a commit that referenced this issue Apr 20, 2023
* #936 file variation while importing from etherscan
experiment to create file variations before running verification, searching for perfect matches

* #936 generate the metadata with the hash that matches the one in the bytecode WIP
*  add tryToFindOriginalMetadata to CheckedContract: tries to reconstruct the original metadata by iterating over all file variations
* call tryToFindOriginalMetadata after `matchWithDeployedBytecode` if match is not perfect

* restore etherscan standard verification

* #936 tests working

* update package-json lib-sourcify

* #936 fix linting issues

* wait 1 second between every etherscan request

* #936 update non-session verify from Etherscan controller

* #936 better function naming `getMetadataFromCompiler`

* #936 instead of recompiling each variation, recalculate the metadata hash by updating the source section of the metadata and generating the metadata CID

* #936 fix error on non-existence of the path in sources

* #936 add comments to group by variations to make it more clear

* #936 replace the ipfs calculate hash
* I translated the ipfsHash.cpp file from the solidity repo
* now that everything is synchronous, I simplified the code that replaces the ipfs and bzz hashes in the urls array in metadata.sources
* I added some comments to clarify the storeByHash grouping by variation

* fix default ipfs gateway

* #936 add test for tryToFindOriginalMetadata

* #936 add swarmBzzr0Hash and test metadata with both ipfs and swarm

* Add comments

* #936 refactor tryToFindOriginalMetadata adding types
* test in verification.spec.ts

* #976 add license type in MetadataSources

* #976 change test name for wrong end of line

* Update packages/lib-sourcify/test/verification.spec.ts

Co-authored-by: Kaan Uzdoğan <[email protected]>

* #976 try to find original metadata for each verification method
* refactor `verifyDeployed` so that it uses the new function `tryToFindOriginalMetadataAndMatch`

---------

Co-authored-by: Marco Castignoli <[email protected]>
marcocastignoli added a commit that referenced this issue Apr 21, 2023
* instead of verifying if an address is valid, tries to checksum it
kuzdogan pushed a commit that referenced this issue Apr 21, 2023
* instead of verifying if an address is valid, tries to checksum it
@kuzdogan kuzdogan removed the research label Apr 19, 2024
@kuzdogan kuzdogan closed this as not planned Jan 3, 2025