Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactored solc --link to use LinkerObject #12756

Closed
wants to merge 67 commits into from

Conversation

bmoore01
Copy link

@bmoore01 bmoore01 commented Mar 8, 2022

I have refactored solc/CommandLineInterface.cpp Fixes #10308

Copy link
Member

@cameel cameel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! I'll review it in detail later but here's some early feedback. For now just a few superficial things that stood out.

Also, test are failing. Please take a look at the failed CI runs. I think you must have removed some imports that are actually used.

@cameel
Copy link
Member

cameel commented Mar 9, 2022

Two more things:

  • Please add Fixes #10308 to your description so that github will automatically close the issue when we merge this.
  • I'd recommend not submitting PRs from the develop branch of your fork. Keeping it up to date with develop in the upstream repo is always a bit awkward. In my experience it's much more convenient to use a feature branch.

@bmoore01
Copy link
Author

bmoore01 commented Mar 9, 2022

@cameel Thanks for the review, I have made another commit addressing your initial issues so hopefully this fixes any issues!

@@ -931,45 +932,61 @@ void CommandLineInterface::link()
// be just the cropped or '_'-padded library name, but this changed to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this comment any more, at least not at this point in the code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it was still relevant but looking closer at the PR now, it removes support for old-style placeholders and without it it's not necessary.

Are we fine with that though? Modern compiler versions do not produce them so I'd be fine with that but it's technically a breaking change.

continue;

size_t bytecodeIndex;
// Bytecode can be on line 0 or 3, depending on if file header is included
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really want to account for the output formatting of the non-file-based commandline interface?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally yes, because this is required to allow the tests to pass although, I'm away that it's not the most elegant solution. I also considered using a regex to find the index instead of hard coding it but, this seemed simpler.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd do one of these here:

  1. Assume a rigid file structure and validate it more strictly. For example when you skip files shorter than 3 lines, verify that these lines do not contain the bytecode (i.e. a string consisting only of hex chars without whitespace and possibly placeholders between them)
  2. Detect the bytecode and ignore everything else. This will be more robust against things like someone just taking the compiler output (with multiple bytecode pieces in it), putting it in a file and running the linker on it.

The current version is neither of these.

Given how the linker has been working so far (basically accepting anything, regardless of it's bytecode or not), I think (2) would be better.

continue;

size_t bytecodeIndex;
// Bytecode can be on line 0 or 3, depending on if file header is included
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd do one of these here:

  1. Assume a rigid file structure and validate it more strictly. For example when you skip files shorter than 3 lines, verify that these lines do not contain the bytecode (i.e. a string consisting only of hex chars without whitespace and possibly placeholders between them)
  2. Detect the bytecode and ignore everything else. This will be more robust against things like someone just taking the compiler output (with multiple bytecode pieces in it), putting it in a file and running the linker on it.

The current version is neither of these.

Given how the linker has been working so far (basically accepting anything, regardless of it's bytecode or not), I think (2) would be better.

Comment on lines 951 to 968
if (
end - it < placeholderSize ||
*(it + 1) != '_' ||
*(it + placeholderSize - 2) != '_' ||
*(it + placeholderSize - 1) != '_'
)
solThrow(
CommandLineExecutionError,
"Error in binary object file " + src.first + " at position " + to_string(it - src.second.begin()) + "\n" +
'"' + string(it, it + min(placeholderSize, static_cast<int>(end - it))) + "\" is not a valid link reference."
);

string foundPlaceholder(it, it + placeholderSize);
if (librariesReplacements.count(foundPlaceholder))
string const& name = library.first;
// Bytecode we need is on the third line there's a blank line at the start
bytecode = sourceLines.at(bytecodeIndex);
size_t libraryStartOffset = bytecode.find("__$");

for (size_t i = 0; i < placeholderSize; i++)
bytecode[libraryStartOffset + i] = '0';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code used to throw on invalid placeholders. Now it just looks for __$ and assumes that anything it founds must be a valid placeholder. But it could be something else ($__ missing) or it could be truncated and you'll go past the end of the bytecode array.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also reminds me that we have no tests covering this code at all. This is because we can't use our usual command-line tests for this - linker is a bit unusual compared to other input modes in that it modifies files in place and never prints them to stdout.

Recently test/solc/CommandLineInterface.cpp was added though so we could test it there. It would be something similar to cli_input test but with different options and would need to create input files and then read output from created ones (we have helpers for most of this, see other tests in the file).

Or we could refactor link() into a static method (taking libraries and sourceCodes as arguments) and then test that method. This would make the test much simpler.

In any case, we need at least three tests: input with placeholders, input with no placeholders, input with broken placeholders.

}

evmasm::LinkerObject linkerObject= {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
evmasm::LinkerObject linkerObject= {
evmasm::LinkerObject linkerObject = {

Comment on lines 980 to 981
map<u256, std::pair<std::string, std::vector<size_t>>>(),
std::map<std::string, evmasm::LinkerObject::FunctionDebugData>()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should be able to use empty initializer lists here and the compiler should infer the types:

Suggested change
map<u256, std::pair<std::string, std::vector<size_t>>>(),
std::map<std::string, evmasm::LinkerObject::FunctionDebugData>()
{},
{}

linkerObject.link(librariesReplacements);
sourceLines.at(bytecodeIndex) = toHex(linkerObject.bytecode);

std::string resolvedBytecode = boost::algorithm::join(sourceLines, "\n");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::string resolvedBytecode = boost::algorithm::join(sourceLines, "\n");
string resolvedBytecode = boost::algorithm::join(sourceLines, "\n");

Comment on lines 969 to 970
else
serr() << "Reference \"" << foundPlaceholder << "\" in file \"" << src.first << "\" still unresolved." << endl;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this being removed?

@@ -931,45 +932,61 @@ void CommandLineInterface::link()
// be just the cropped or '_'-padded library name, but this changed to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it was still relevant but looking closer at the PR now, it removes support for old-style placeholders and without it it's not necessary.

Are we fine with that though? Modern compiler versions do not produce them so I'd be fine with that but it's technically a breaking change.

@bmoore01
Copy link
Author

I added the recommended changes and I've added in some unit tests to cover this functionality too.

Copy link
Member

@cameel cameel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's more feedback. Sorry for a ton of comments but most of these are just small style corrections, improvement suggestions and error message tweaks. Here's a summary of the more important ones:

  • More test cases needed.
  • This removes support for old-style placeholders.
  • This changes how newlines are handled (they're now always rewritten to \n).
  • Maybe we should allow leading/trailing whitespace on bytecode lines as the old implementation did?

@@ -653,7 +654,8 @@ void CommandLineInterface::processInput()
assemble(m_options.assembly.inputLanguage, m_options.assembly.targetMachine);
break;
case InputMode::Linker:
link();
solAssert(m_options.input.mode == InputMode::Linker, "");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have recently made the description optional in asserts:

Suggested change
solAssert(m_options.input.mode == InputMode::Linker, "");
solAssert(m_options.input.mode == InputMode::Linker);

{
solAssert(m_options.input.mode == InputMode::Linker, "");
return isxdigit(c) || c == '_' || c == '$';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just FYI, we have recently added wrappers for functions like isdigit() to ensure we always use the C locale (#12740). I see from the docs that isxdigit() is not affected by locale though so I guess using it directly is fine.

Comment on lines 1029 to 1034
evmasm::LinkerObject linkerObject = {
fromHex(bytecode),
linkReferences,
{},
{}
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
evmasm::LinkerObject linkerObject = {
fromHex(bytecode),
linkReferences,
{},
{}
};
evmasm::LinkerObject linkerObject = {fromHex(bytecode), linkReferences, {}, {}};

Comment on lines 991 to 994
solThrow(
CommandLineExecutionError,
"Error in binary object file " + src.first + " unbounded placeholder without start point __$ at " + to_string(libraryEndOffset) + "\n"
);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
solThrow(
CommandLineExecutionError,
"Error in binary object file " + src.first + " unbounded placeholder without start point __$ at " + to_string(libraryEndOffset) + "\n"
);
solThrow(
CommandLineExecutionError,
"Error in binary object file " + src.first + " unbounded placeholder without start point __$ at " + to_string(libraryEndOffset) + "\n"
);

Comment on lines 997 to 1007
if (startMarkerFound && endMarkerFound)
{
if ((libraryEndOffset - libraryStartOffset) != placeholderSize - 3)
{
solThrow(
CommandLineExecutionError,
"Error in binary object file " + src.first + " truncated placeholder found at " + to_string(libraryStartOffset) +
" placeholder addresses should be " + to_string(placeholderSize) + " characters long (including address markers __$ and $__)\n"
);
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need braces if there's just one instruction inside if or for. The code is more concise without them:

Suggested change
if (startMarkerFound && endMarkerFound)
{
if ((libraryEndOffset - libraryStartOffset) != placeholderSize - 3)
{
solThrow(
CommandLineExecutionError,
"Error in binary object file " + src.first + " truncated placeholder found at " + to_string(libraryStartOffset) +
" placeholder addresses should be " + to_string(placeholderSize) + " characters long (including address markers __$ and $__)\n"
);
}
}
if (startMarkerFound && endMarkerFound)
if ((libraryEndOffset - libraryStartOffset) != placeholderSize - 3)
solThrow(
CommandLineExecutionError,
"Error in binary object file " + src.first + " truncated placeholder found at " + to_string(libraryStartOffset) +
" placeholder addresses should be " + to_string(placeholderSize) + " characters long (including address markers __$ and $__)\n"
);

There are also a few other places like this in the PR.


CommandLineInterface::link(fileReader, libraries, hasError, cerr);
BOOST_TEST(!hasError);
BOOST_TEST(fileReader.sourceUnits().find("bytecodeInput.bin")->second.find(hashString) != string::npos);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just compare the result directly? Less chance of false positives by mistake and it's not that long anyway:

Suggested change
BOOST_TEST(fileReader.sourceUnits().find("bytecodeInput.bin")->second.find(hashString) != string::npos);
map<string, string> expectedSources = {{
"bytecodeInput.bin",
"565b6040516100579190610158565b60405180910390f35b600073" + hashString + "63b3de648b8360405"
}};
BOOST_TEST(fileReader.sourceUnits() == expectedSources);

Comment on lines 326 to 328
fileReader.setSourceUnits({
{"bytecodeInput.bin", "565b6040516100579190610158565b60405180910390f35b600073$__b3157e226f2c4dddae135e6cab4ed4e74763b3de648b8360405"}
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fileReader.setSourceUnits({
{"bytecodeInput.bin", "565b6040516100579190610158565b60405180910390f35b600073$__b3157e226f2c4dddae135e6cab4ed4e74763b3de648b8360405"}
});
fileReader.setSourceUnits({
{"bytecodeInput.bin", "565b6040516100579190610158565b60405180910390f35b600073$__b3157e226f2c4dddae135e6cab4ed4e74763b3de648b8360405"}
});


bool hasError = false;

string expectedMessage ="Error in binary object file bytecodeInput.bin unbounded placeholder without start point __$ at 54\n";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
string expectedMessage ="Error in binary object file bytecodeInput.bin unbounded placeholder without start point __$ at 54\n";
string expectedMessage = "Error in binary object file bytecodeInput.bin unbounded placeholder without start point __$ at 54\n";

{
frontend::FileReader fileReader({},{},{});
fileReader.setSourceUnits({
{"bytecodeInput.bin", "565b6040516100579190610158565b60405180910390f35b600073__$b3157e226f2c4dddae135e6cab4ed4e74763b3de648b8360405"}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a few more test cases? I think that the code might not work correctly with some of them:

  • Old-style placeholder:
    565b604051610057919__lib.sol:L_____________________________48b8360405
    
  • Two starts, single end:
    565b6040516100579190__$b36__$b3157e226f2c4dddae135e6cab4ed4e747$__63b3de648b8360405
    
  • Placeholder in a placeholder:
    565b6040516100579190__$b36__$b3157e226f2c4dddae135e6cab4ed4e747$__63b3de$__648b8360405
    
  • Double-sided marker:
    565b604051610057919__$__157e226f2c4dddae135e6cab4ed4e7__$__48b8360405
    
    I think this was valid in the old implementation and would be treated as an old-style placeholder.
  • Marker without $:
    565b604051610057919___b3157e226f2c4dddae135e6cab4ed4e747___48b8360405
    
    Also technically a valid old-style placeholder.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be supporting the placeholder in a placeholder or double-sided marker cases?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The double-sized one could be a valid old-style placeholder so yes, unless we drop support for old-style placeholders. The __ on the inside basically counts as part of the library identifier in that case.
But I'm mentioning that case mostly because it's a corner case that looks like it could easily confuse regex based detection.

As for nested ones, if the outer __ pair has the correct distance between markers, then again, it could be a valid old-style placeholder. But if it's the inner one that has the right length then the outer one should be treated as unmatched markers and result in an error.

Comment on lines 921 to 923
static bool isBytecodeChar(char c)
{
solAssert(m_options.input.mode == InputMode::Linker, "");
return isxdigit(c) || c == '_' || c == '$';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the detection will unfortunately need to be a bit more complex to handle old-style placeholders since they could contain arbitrary chars. So it would be something like this on the outside of __$ and $__ and anything in between.

Maybe let's try this:

  • The line has to contain __.
  • All chars before the first and after the last __ must be valid hex chars.
  • There can be anything between the first __ and last __.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless we can make our lives easier and actually drop support for these old-style placeholders.
@chriseth Would that be acceptable without a breaking release?

@cameel
Copy link
Member

cameel commented Mar 15, 2022

By the way, please don't get discouraged by the comments, I think this is overall going in the right direction :) Especially I like that we now will have this covered with some tests!

@bmoore01
Copy link
Author

@cameel No probblem, i understand, i'll get these changes done, always worth doing right!

@leonardoalt
Copy link
Member

Seems to be waiting for review @cameel

@bmoore01
Copy link
Author

bmoore01 commented Apr 5, 2022

@leonardoalt yeah, I'm happy with it so just waiting for review now.

@bmoore01
Copy link
Author

@cameel Are you happy for me to merge this one now? It's been open for a while just wanted to see if you wanted any more changes.

@cameel
Copy link
Member

cameel commented Apr 24, 2022

@bmoo0 Hello, sorry for not going back to this one sooner but there's just been too much happening. We still have a huge heap of PRs to get through and on top of that right now most of the team (including me) is still at Devconnect.

So for now, just a few short comments. I discussed this with @chriseth earlier and the problem is really that this refactor complicates the mechanism more than we expected it would. There were two major conclusions from that discussion:

  • Currently there are very few restrictions on how the .bin object would look like. Now with this PR we have to detect the bytecode so I had to push for move validations and this unfortunately makes things complicated. We also have to support old format for placeholders.

    One idea to make things simpler is basically to make it a breaking change scheduled for 0.9.0. This way we could drop all the legacy stuff and inconvenient restrictions. To do that you would have to change the base branch of the PR to a branch called breaking.

  • A very valuable part of the PR as it is currently are the tests. If you extracted them into a separate PR we'd gladly accept them - or at least the part that would still work on top of develop branch.

What do you think? If you agree with these changes I could try to go over the PR again and give some hints on what specifically should be carried over and what would not be necessary when we make the change breaking.

@bmoore01
Copy link
Author

bmoore01 commented May 1, 2022

Sounds good to me. I can actually break it out into two PRs if that's better. One containing the tests branched from the develop branch and the other adding the breaking change which can stem from the breaking branch, if that makes things simpler.

@bmoore01 bmoore01 changed the base branch from develop to breaking May 2, 2022 18:00
@cameel
Copy link
Member

cameel commented May 2, 2022

Great! Tests should go into develop since they do not have to wait for a breaking release.

@bmoore01 bmoore01 changed the base branch from breaking to develop May 8, 2022 19:43
@bmoore01 bmoore01 changed the base branch from develop to breaking May 8, 2022 19:43
@Marenz
Copy link
Contributor

Marenz commented Jul 11, 2022

It seems the PR needs a rebase to resolve the current conflicts.
I see the plan was to split out the tests and make this one breaking.. any updates on that? Can we help or should we take over parts possible?

@@ -918,12 +917,64 @@ void CommandLineInterface::serveLSP()
solThrow(CommandLineExecutionError, "LSP terminated abnormally.");
}

static bool isBytecodeChar(char c)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think keeping this helper may still be nice.

{
return isxdigit(c) || c == '_' || c == '$';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to this PR, but:

isxdigit(std::locale) | checks if a character is classified as a hexadecimal digit by a locale(function template)

Perhaps we'd be better off with just a manual check and could make it constexpr that case.

@leonardoalt
Copy link
Member

Since this PR seems stale is going to be split up anyway, I'm closing for now. The code is still available in the branch.

@bmoore01
Copy link
Author

Hey sorry I was away for awhile count' work on this. I think the tests can still be merged if you think they're useful I just accidentally pulled too many changes into my branch and need to look up how to only merge my new tests.

@Marenz Marenz added the takeover Can be taken over by someone other than label giver label Aug 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
external contribution ⭐ takeover Can be taken over by someone other than label giver
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CLI] Rewrite solc --link to use LinkerObject