Performance of looping over Tpetra CrsMatrix rows #118
Hi Andrey! Thanks for doing the comparison! btw is the matrix fill complete at this point? (It must be locally indexed, since you're getting local indices.)

My main concern is that Tpetra is nearly 3x slower than Epetra for this very simple operation. My guess is that Tpetra spends its extra time in Teuchos::ArrayView construction and manipulation. Both Epetra and Tpetra do error checking by throwing exceptions, and both have a similar division of labor between the graph and the matrix (e.g., the matrix gets the row's indices by calling a method on its graph). Besides the use of ArrayView, the main difference is that Epetra tends to branch on the matrix's current state ("is storage optimized?"), while Tpetra abstracts that away behind CrsGraph::getRowInfo (returns a struct with the row's current number of entries, amount of space, and offset if applicable). It could be that filling the struct wastes time vs. just branching, but I'm sure the ArrayView manipulation plays a big part. It's possible to strip all that away and just return to extracting and passing around raw pointers (except for the underlying Kokkos Views that actually hold the data).

Epetra could be optimized too. For example, 4-argument Epetra_CrsMatrix::ExtractMyRowView asks for the number of entries in the row twice (once in Epetra_Graph::ExtractMyRowView, and once in 3-argument Epetra_CrsMatrix::ExtractMyRowView). On the other hand, the Epetra and Tpetra interfaces do error checking and look up things for you, and you'll always have to pay something for that.

A while back, @nmhamster and I debated a bit about this. He argued that the main interface we give to users should always perform reasonably well. Otherwise, we shouldn't give it to users. Do we want to spend more effort making Tpetra's interface perform well, or do we just want to skip it?
What I mean specifically is whether we expect users to access the matrix's entries through Tpetra methods, or whether they should always use the Kokkos interface. What I worry about with the latter is that users will start asking for features that will make the latter more like the former, which will reduce performance, leading again to requests for a stripped-down interface. I think what @nmhamster was trying to say is that if we give users a higher-level interface, we should commit to making it reasonably fast.
I would suggest running these tests through VTune to see where the differences are. At the conclusion of the milestone we had the performance between Epetra and Tpetra very similar for this operation. But that was pre-Kokkos and pre-KokkosRefactor. -Eric
Yes, it is fill complete.
At this point I think it would be a misguided effort to work on the older Tpetra interface. As Kokkos is our future, I think Trilinos would be in good shape if we just start skipping directly to the Kokkos objects. We just need to let people know that this could greatly benefit them for certain computational workflows.
I see the point. I think there is some uncertainty to it, as Tpetra's API supports both MPI and local operations. I think the local API that can be done in terms of Kokkos should be done by it, and the API that requires accessing the communicator should be done by Tpetra. But I don't have as good a picture of Tpetra as you do, so take it with a grain of salt. As for MueLu, my plan is to work on moving to the Kokkos Refactor branch asap, as even with a single thread we get benefits. The reason I even started thinking about the issue is that I tried to compare the original and Kokkos Refactor variants of MueLu, and noticed the difference in a factory that just detects Dirichlet rows in a matrix.
If I find time, I'll do the VTune run. You are right about the milestone, as I believe performance of, say, Gauss-Seidel was the same for Epetra/Tpetra.
If "we" means Trilinos developers, I think that's acceptable. The main issue is that the features many users want in the Tpetra interface are exactly the features that are missing or weakly implemented in the Kokkos interface. I'm thinking particularly of insert* (doesn't exist in Kokkos) and sumInto*. We can think of Tpetra's current interface as a vehicle for understanding how to improve fill performance for any interface. I'm not committed to it, other than for backwards compatibility. But it exists, and we can study it and learn how to improve it. Furthermore, enriching the Kokkos interface will help improve Tpetra fill performance, by stripping away complexity and letting Kokkos take charge of local data structures. (Tpetra::CrsMatrix semantics are a mess. I don't think we have ever untangled all the possible states.)
Gauss-Seidel is a different case; it does not use the Tpetra interface for accessing matrix entries, but talks directly to the low-level data structures. I was never happy about putting a public Gauss-Seidel method in Tpetra, but it worked.
I think it is a good separation of concerns to let Kokkos do what it does well. It also reduces direct user interaction with Tpetra and reduces direct Tpetra-Kokkos interactions.
I agree. It will take me 2-3 months of completely uninterrupted time to unravel Tpetra's semantics so I don't break backwards compatibility while pushing functionality into Kokkos, but I still agree.
Could you comment further on these points?
If Kokkos is used directly, then Tpetra is no longer in the middle between the user and Kokkos, so users make fewer Tpetra calls and Tpetra makes fewer Kokkos calls. Mike
@maherou, Mike, for many years we (SNL) have advocated strongly for abstraction in the software stack to maintain flexibility. I think this has been to our advantage. If we directly blend Kokkos and Tpetra, are we losing that ability into the future? In our discussions with users during the codesign sessions, most times this has come up we have seen users express a preference for Tpetra interfaces over raw data structures. One argument has been that not everyone buys into Kokkos, especially outside of Sandia, for a number of reasons; by having clear interfaces that are abstracted out, they can continue to use their alternatives inside their applications. There may be some relatively small performance overheads (small as long as we get most of the inlining correct) to this approach, but longer term the abstraction may provide greater flexibility.
Just a note of caution: not a lot of Tpetra functionality needs to be pushed to the Kokkos level. Fixing this performance regression could be an immediate need. In the long term, I would argue the Petra abstractions are too monolithic for the regime of massive node-level parallelism (e.g., FillComplete(), all-or-nothing atomics, poor support for asynchronous operations, tasking, etc.). This needs not a refactor but a complete rethinking of the design.
One comment here about expected slowdowns of Tpetra vs. direct usage of Kokkos. Generally the Tpetra interface is intended to handle a distributed-memory object. I believe that means it will always have a much higher overhead, since in general it requires more error checks and other safeguards (like a richer state set) than a purely local object. Those things cost. The question in my mind is whether Tpetra objects should have "call through" functions to the underlying local Kokkos objects which omit some of the safeguards, or whether it's better to say: if you are doing a purely local operation, get the local objects to perform them.
It seems a good API approach to design for calls through into local objects without requiring explicit access to the Kokkos objects. Good abstraction has in many ways given us the ability to really use products like Kokkos. Bypassing it to avoid performance slowdowns seems to me the wrong long-term path without more analysis of why the code is slower, and of how we can minimize the overhead or redesign the API to maintain the abstraction. This may not be achievable in the end, but I would think the aggressive use of inlining we see in many compilers should mean we get down to raw calls quite quickly where the API design allows for that (we certainly see this in many Kokkos compiles already).
Tpetra and Xpetra allow access to the underlying local Kokkos objects (e.g., the Kokkos::CrsMatrix through the getLocalMatrix routine in Tpetra/Xpetra). Andrey is using them for his experiments. In my opinion, from a user's point of view, accessing the matrix/vector data through the local Kokkos objects is not very complicated. It's different if we're talking about matrix construction: one can locally fill the Kokkos::CrsMatrix objects (using parallel_for, etc.) and then provide them in the constructor for generating the corresponding Tpetra/Xpetra::CrsMatrix object. Doing this right is extremely complicated, as you need expert knowledge about how Tpetra organizes the global and local indices. For example, the user has to take care of which entries have to be communicated through MPI and put them in the right order after the local matrix entries. This is the exact opposite of what the user wants to do (especially when used to insertLocal and sumInto-like routines!). Long story short: I think one has to distinguish between the use cases.
btw I really appreciate that we're having a good discussion about this. Sometimes I feel like we don't have enough of these sorts of discussions. Also, there's nothing personal about this for me. When I first started working on Tpetra, I prioritized correctness and backwards compatibility, taking lessons from Mike's TUG and other talks on the subject. This meant retaining some perhaps unfortunate design decisions, like forcing Tpetra's users to deal with the Teuchos Memory Management classes when filling a matrix. Epetra hides MPI for good reasons; Tpetra should have done the same with the Teuchos stuff, where possible.
We don't actually have to resolve this now. If we enrich the Kokkos interface with exactly what it needs for local operations, then it will be easy for Tpetra to invoke that interface. Stripping unnecessary complexity out of Tpetra's innards will make Tpetra's interface faster too. It will also help us figure out whether Tpetra has necessary complexity that will always make filling into Tpetra a lot slower than filling into Kokkos.

Remember that some of Tpetra's internal complexity comes from the Petra Object Model. For example, the POM requires that multiple matrices be able to share the same graph. Both Epetra and Tpetra reflect this in their implementations.

Christian and I spent a lot of effort preserving semantics when we did the Kokkos refactor of Tpetra. Think about how much trouble it was for apps and other Trilinos packages when we changed MultiVector from copy to view semantics to make it act more like Kokkos. Now imagine we change a bunch of other things too. (Remember the idiomatic way to test whether an Epetra matrix is empty on a process -- ask if it is neither locally nor globally indexed? This is inextricably tied to lazy allocation, which we won't be able to do if we want thread-parallel fill. Yet, I'm almost sure somebody will complain if we break this invariant in Tpetra. I broke a lot of tests when I tried getting rid of lazy allocation; see Issue #119.) I don't mean to say that we should keep Tpetra's interface as it is. What I mean to do is advocate caution.
@trilinos/tpetra Start addressing Issue #118 by removing the extra row Map lookup from CrsGraph's get{Global,Local}Row{Copy,View} methods. getRowInfo(FromGlobalIndex) already checks the row Map; there's no need to check it again.
@trilinos/tpetra Continue addressing Issue #118 by removing the extra row Map lookup from CrsMatrix's get{Global,Local}Row{Copy,View} methods, using the assumption explained in my previous commit message.

Build/Test Cases Summary
Enabled Packages: TpetraCore, Ifpack2, Amesos2, Zoltan2, MueLu, Stokhos
Disabled Packages: FEI, STK, PyTrilinos, NOX, Teko, Piro
0) MPI_DEBUG => Test case MPI_DEBUG was not run! => Does not affect push readiness! (-1.00 min)
1) SERIAL_RELEASE => Test case SERIAL_RELEASE was not run! => Does not affect push readiness! (-1.00 min)
2) MPI_DEBUG_COMPLEX => passed: passed=274,notpassed=0 (64.07 min)
3) SERIAL_RELEASE => passed: passed=240,notpassed=0 (128.90 min)
Other local commits for this build/test group: 7c24dde, 626bbdc
@trilinos/tpetra I was working on Issue #118 and Issue #41, and made some changes to Tpetra::CrsMatrix's getGlobalRowCopy and getLocalRowCopy methods. Tpetra's and Ifpack2's tests passed fine, but other downstream packages' tests did not. It looked like the tests were failing for a CrsMatrix with StaticProfile, never yet made fill complete, whose rows have extra space. This commit adds a unit test which exercises this case.
@trilinos/tpetra To reduce overhead of get*Row*, I added private methods to CrsGraph that get row views as raw pointers. The idea is to avoid the overhead of constructing a Teuchos::ArrayView, as well as its thread safety issues in a debug build (Teuchos_ENABLE_DEBUG=ON makes Teuchos::ArrayView not thread safe, because it retains a reference to the ArrayRCP from whence it came).
@trilinos/tpetra In the implementation of CrsMatrix::get{Global,Local}RowCopy, do not get row views of the current row's indices or values using Teuchos::ArrayView. Instead, use the new nonpublic raw-pointer row view methods that CrsGraph and CrsMatrix now provide. This has the following benefits:
1. Thread safety: Teuchos::ArrayView is not thread safe in a debug build (when Teuchos_ENABLE_DEBUG=ON)
2. Performance (Issue #118): Creating and passing around Teuchos::ArrayView incurs overhead
In a debug build, the new nonpublic raw-pointer row view methods do extra error checking. This should get back some of the bounds-checking features of using Kokkos::View or Teuchos::ArrayView in a debug build.
@trilinos/tpetra This commit fixes Issue #132. @aprokop wrote a benchmark in MueLu that compares performance of getting local row views with Xpetra::CrsMatrix, Epetra_CrsMatrix, Tpetra::CrsMatrix, and KokkosSparse::CrsMatrix. See Issue #118 for discussion. The problem is that the benchmark lives in MueLu, and requires enabling experimental build options which are off by default (see Issue #128). This motivated me to port the benchmark to Tpetra. This commit adds the ported benchmark to Tpetra as an example. The main point of the benchmark is to compare Tpetra vs. Epetra vs. "raw" Kokkos performance, so I left Xpetra out.

Build/Test Cases Summary
Enabled Packages: TpetraCore
Disabled Packages: FEI, STK, PyTrilinos, NOX, Teko, Piro
0) MPI_DEBUG => Test case MPI_DEBUG was not run! => Does not affect push readiness! (-1.00 min)
1) SERIAL_RELEASE => Test case SERIAL_RELEASE was not run! => Does not affect push readiness! (-1.00 min)
2) MPI_DEBUG_COMPLEX => passed: passed=86,notpassed=0 (3.94 min)
3) SERIAL_RELEASE => passed: passed=66,notpassed=0 (3.61 min)
I think the main issue is virtual method call overhead.

Epetra ExtractMyRowView (virtual): 0.3318 (1)

When I replace the implementation of
@trilinos/tpetra See #118 for discussion. Build/Test Cases Summary Enabled Packages: TpetraCore Disabled Packages: FEI,STK,PyTrilinos,NOX,Teko,Piro 0) MPI_DEBUG => Test case MPI_DEBUG was not run! => Does not affect push readiness! (-1.00 min) 1) SERIAL_RELEASE => Test case SERIAL_RELEASE was not run! => Does not affect push readiness! (-1.00 min) 2) MPI_DEBUG_COMPLEX => passed: passed=86,notpassed=0 (5.43 min) 3) SERIAL_RELEASE => passed: passed=66,notpassed=0 (3.02 min) Other local commits for this build/test group: 194974a
I'm moving this to the backlog. We've noted some possible improvements in Tpetra, but please note that a big part of the performance issue for methods that don't do much work is virtual method call overhead.
Tpetra::Details::findRelOffset is a new function (a free function, not a method) that finds the relative offset of an index in a row of a sparse graph or matrix, given an array of column indices in that row. The array may be either a rank-1 Kokkos::View or a raw 1-D array. The function lives in tpetra/core/src/Tpetra_Util.hpp. This fixes #205. I mean for the new function to replace or implement Tpetra::CrsGraph::find{Local,Global}Index. The function optimizes for the case where the input array to search is sorted, by using binary search in that case. This should make its performance comparable to that of Tpetra::CrsGraph::findLocalIndex or Epetra_CrsGraph::FindMyIndexLoc. I also added a test for this function. The test builds and passes. Note that it carefully exercises both the sorted and unsorted cases. The next step, not yet implemented, is to optimize for the case of short rows, by using linear search in that case. This should actually improve on both Epetra and Tpetra, and possibly address #118.
I'm closing this issue, because we need more specific use cases to drive it. KokkosSparse::CrsMatrix already has a low-overhead row access interface.
@trilinos/tpetra @jhux2 @mhoemmen @crtrott
Let me first admit that I am very likely doing something wrong.
I wrote a simple driver (located at muelu/test/perf_test_kokkos) which essentially finds the number of nonzeros in a CrsMatrix by looping through the rows and adding up the row lengths. It considers three scenarios:
The results were somewhat unexpected for me. I was running with a single MPI rank with OpenMP OMP_NUM_THREADS=1, and disabled HWLOC (so that Kokkos respects this). Here are some results:
For Tpetra
For Epetra
So it seems to me that using the local Kokkos matrix is absolutely the way to go, as it is ~30 times faster than going through Tpetra, and ~6 times faster than going through Epetra.
I would like to know if anybody has done any performance studies like this, or what could be the reason. If I am doing something completely wrong, I would also like to know that.