Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update the llvm commit to 061e0189a3dab6b1831a80d489ff1b15ad93aafb #1599

Merged
merged 22 commits into from
Aug 19, 2022

Conversation

sstamenova
Copy link
Collaborator

@sstamenova sstamenova commented Aug 11, 2022

There are multiple impactful changes from mlir:

Of these, the last two are the most impactful and I'm still working through figuring out the correct SymbolTableCollection usage, but I wanted to get eyes on this earlier rather than later because of the OpaqueElementsAttr change.

OpaqueElementsAttr which is used both in KrnlToLLVM and the NNPA accelerator was removed and a new similar, but not exactly the same, attribute was added - DenseResourceElementsAttr. torch-mlir removed there usage of OpaqueElementsAttr in favor of SparseElementsAttr, but this does not seem appropriate here (llvm/torch-mlir@bb47c16) and the new DenseResourceElementsAttr seems like the right choice. I think DenseElementsAttr could probably also work. I'd love to get some feedback from @tungld , @AlexandreEichenberger and anyone else since it is not a simple replacement.

…ource NOT including the accelerators

Signed-off-by: Stella Stamenova <[email protected]>
Signed-off-by: Stella Stamenova <[email protected]>
Signed-off-by: Stella Stamenova <[email protected]>
Signed-off-by: Stella Stamenova <[email protected]>
Signed-off-by: Stella Stamenova <[email protected]>
Signed-off-by: Stella Stamenova <[email protected]>
Signed-off-by: Stella Stamenova <[email protected]>
@sstamenova
Copy link
Collaborator Author

Note this will update onnx-mlir to the commit chosen here: llvm/torch-mlir#1178

@AlexandreEichenberger
Copy link
Collaborator

@tungld sorry to add to your plate, you did some good work with the constants. Do you mind looking into @sstamenova's question? Tx

@sstamenova thanks so much for lifting the LLVM to newer version, this is much much appreciated.

@AlexandreEichenberger
Copy link
Collaborator

@sstamenova sometimes the CI just timeouts on downloading resnet50. So you can restart the CIs to see if amd/ppc results are a fluke. Both are in a zone that make download slower.

@sstamenova
Copy link
Collaborator Author

@sstamenova sometimes the CI just timeouts on downloading resnet50. So you can restart the CIs to see if amd/ppc results are a fluke. Both are in a zone that make download slower.

@AlexandreEichenberger : I'm still trying to figure out the correct way to get symbol lookup to work after the mlir change, so there is one lit test failing. I started the PR to give people a change to review the removal of the opaque attribute while I track that last failure down.

Signed-off-by: Stella Stamenova <[email protected]>
@tungld
Copy link
Collaborator

tungld commented Aug 12, 2022

@sstamenova DenseResourceElementsAttr seems to be the right choice to me too. We have used OpaqueElementsAttr to store encoded data (a raw array of chars) for NNPA, so to make sure that DenseResourceElementsAttr works well, I need to check several models on NNPA to see if it really works or not, and it may take time.

Given that DenseResourceElementsAttr was added recently (From August 1), could we postpone the DenseResourceElementsAttr part in this update? However I still want to have your code to test DenseResourceElementsAttr, so we can go with a new PR that does not have DenseResourceElementsAttr.

As a side note, when using DenseResourceElementsAttr, I see some additional bits, e.g. 00001000 and 00000400 in your LIT tests, e.g.
0x00003E00400... becomes 0x0000100000003E00400..., 0x3F0000003F8000003F... becomes 0x000004003F0000003F8000003F.... Do you know why it is? Did you generate them using a LE or BE machine?

@sstamenova
Copy link
Collaborator Author

@tungld OpaqueElementsAttr was removed from mlir and they added DenseResourceElementsAttr at the same time to replace some of the functionality. Unfortunately, any update that picks up a new version of mlir has to replace OpaqueElementsAttr since it's gone, so we don't really have the option to delay it.

The first 4 bytes of the resource string is supposed to be the alignment. I tried to guess based on the tests what the intended alignment was, but since I didn't know what the source of the data was, I am guessing I didn't get them all correctly and would like to know what they are supposed to be.

@tungld
Copy link
Collaborator

tungld commented Aug 12, 2022

The first 4 bytes of the resource string is supposed to be the alignment. I tried to guess based on the tests what the intended alignment was, but since I didn't know what the source of the data was, I am guessing I didn't get them all correctly and would like to know what they are supposed to be.

If so, we need a way to specify alignment since NNPA needs 4K-alignment (Sometimes 8K-alignment). In the current code, all constants will be represented by KrnlGlobal with a separated alignment attribute. LLVM Global variables will be created from KrnlGlobal by setting alignment (https://github.com/onnx/onnx-mlir/blob/main/src/Conversion/KrnlToLLVM/KrnlGlobal.cpp#L75). Now if DenseResourceElementsAttr specifies alignment in its resource (or raw data in case of NNPA), I am not sure whether we need to change the lowering code for KrnlGlobal or not.

@sstamenova
Copy link
Collaborator Author

The first 4 bytes of the resource string is supposed to be the alignment. I tried to guess based on the tests what the intended alignment was, but since I didn't know what the source of the data was, I am guessing I didn't get them all correctly and would like to know what they are supposed to be.

If so, we need a way to specify alignment since NNPA needs 4K-alignment (Sometimes 8K-alignment). In the current code, all constants will be represented by KrnlGlobal with a separated alignment attribute. LLVM Global variables will be created from KrnlGlobal by setting alignment (https://github.com/onnx/onnx-mlir/blob/main/src/Conversion/KrnlToLLVM/KrnlGlobal.cpp#L75). Now if DenseResourceElementsAttr specifies alignment in its resource (or raw data in case of NNPA), I am not sure whether we need to change the lowering code for KrnlGlobal or not.

@tungld I can update the tests, but I would like to know if the change works with the models that you mentioned. I think the way it's currently setup, alignment can continue to be specified as it was before in addition to specifying it in the data.

Signed-off-by: Stella Stamenova <[email protected]>
@tungld
Copy link
Collaborator

tungld commented Aug 12, 2022

@tungld I can update the tests, but I would like to know if the change works with the models that you mentioned. I think the way it's currently setup, alignment can continue to be specified as it was before in addition to specifying it in the data.

@sstamenova yes, I will check this patch with some models. I just returned from a long vacation and it will take time for me to give you the result.

Signed-off-by: Stella Stamenova <[email protected]>
@sstamenova
Copy link
Collaborator Author

@tungld I looked at it again, and I think what will happen right now is that the explicitly specified alignment attribute on the constant will continue to work as it is (and it is still specified), while the alignment in the hex string will be used for the reading/writing of the hex string itself. I noticed that I had used a different value in the tests and the code (alignof(char) in the code was required because the data is a char* and the type checking for the resource would assert otherwise). I've updated the tests now to match that and I expect everything to work, but I don't have anything to test with aside from the lit tests and they're passing.

@tungld
Copy link
Collaborator

tungld commented Aug 15, 2022

@tungld I looked at it again, and I think what will happen right now is that the explicitly specified alignment attribute on the constant will continue to work as it is (and it is still specified), while the alignment in the hex string will be used for the reading/writing of the hex string itself.

@sstamenova Yes, I agree. The alignment in the hex string is used for buffers allocated during compilation only (See https://github.com/llvm/llvm-project/blob/main/mlir/include/mlir/IR/AsmState.h#L83), while our specified alignment attribute is used for allocating buffers during runtime (of the binary generated code). So they are quite independent.

@tungld
Copy link
Collaborator

tungld commented Aug 15, 2022

@jenkins-droid test this please

Signed-off-by: Stella Stamenova <[email protected]>
@sstamenova
Copy link
Collaborator Author

@jenkins-droid test this please

@sstamenova
Copy link
Collaborator Author

@tungld : Were you able to run some tests and verify the accelerator change?

@tungld
Copy link
Collaborator

tungld commented Aug 16, 2022

@sstamenova unfortunately, I got wrong results on NNPA when using this patch. After digging into this patch a bit, I know the reason. Let me explain here.

A key difference between OpaqueElementsAttr and DenseResourceElementsAttr is that, given a buffer, OpaqueElementsAttr's constructor will create a new buffer to store data then we can free the given buffer, while DenseResourceElementsAttr's constructor directly uses the given buffer. This difference is the key point so that DenseResourceElementsAttr can manage memory more efficiently for large buffers.

So, the code to free the given buffer is still there in this patch, so when using DenseResourceElementsAttr, the buffer would get random values.

I confirm that just removing free will result in the correct result.

…t the blob will now have control of its memory and will clean up appropriately.

Signed-off-by: Stella Stamenova <[email protected]>
@sstamenova
Copy link
Collaborator Author

@tungld : I made the couple of updates. I'd like to check this in sooner rather than later since the longer we wait, the more likely there are conflicting changes committed and I want to avoid that since this is not a trivial change. Can you have a look?

@sstamenova
Copy link
Collaborator Author

@tungld @AlexandreEichenberger Since we have fairly high confidence now that this will work on the NNPA machine, I think it makes sense to commit this sooner rather than later and if there are issues specifically on the NNPA machine, then someone who has access to run the tests can quickly iterate to fix the issues.

I understand that we don't want any breaks in any part of the project if it can be avoided, but gating a change to main that people are waiting on (and even creating alternate PRs for #1617) seems detrimental to the community. I think until the automated tests can run directly on the NNPA machine, we need to assume that there will be breaks sometimes that will have to be investigated separately - this is what the llvm community does for targets that are not tested on the public build bots and why we've added bots for targets and platforms (e.g. Windows) we care about when we can. Especially with a change like this one that moves llvm forward, there is a high likelihood that things get committed in the mean time that need to be updated to the new llvm commit making the process even more complicated, so there is value in iterating quickly as much as possible.

Copy link
Collaborator

@tungld tungld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the update!

@tungld tungld merged commit 993b86b into main Aug 19, 2022
@jenkins-droid
Copy link
Collaborator

Jenkins Linux s390x Build #7026 [push] Update the llvm commit t... started at 00:36

@jenkins-droid
Copy link
Collaborator

Jenkins Linux amd64 Build #7011 [push] Update the llvm commit t... started at 23:36

@jenkins-droid
Copy link
Collaborator

Jenkins Linux ppc64le Build #6119 [push] Update the llvm commit t... started at 00:37

@jenkins-droid
Copy link
Collaborator

Jenkins Linux amd64 Build #7011 [push] Update the llvm commit t... passed after 1 hr 55 min

@jenkins-droid
Copy link
Collaborator

Jenkins Linux s390x Build #7026 [push] Update the llvm commit t... passed after 1 hr 59 min

@jenkins-droid
Copy link
Collaborator

Jenkins Linux ppc64le Build #6119 [push] Update the llvm commit t... passed after 2 hr 9 min

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants