Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tpetra::CrsMatrix::getLocalDiagCopy: Kokkos-parallelize diagonal extraction #41

Closed
mhoemmen opened this issue Dec 3, 2015 · 20 comments
Closed

Comments

@mhoemmen
Copy link
Contributor

mhoemmen commented Dec 3, 2015

@trilinos/tpetra Performance tests with Nalu discovered that Ifpack2::Relaxation Jacobi setup had a sequential section in CrsMatrix::getLocalDiagCopy (the two-argument version, though it's easy enough to optimize the one-argument version too). Nalu developers experimented with putting a parallel_for loop around diagonal extraction in getLocalDiagCopy, and it worked great.

@mhoemmen
Copy link
Contributor Author

mhoemmen commented Dec 3, 2015

It's OK to make this a host execution space loop; then we can use a lambda.

crtrott added a commit that referenced this issue Dec 3, 2015
@trilinos/tpetra

This threads the function for host execution space. Was identified as
serial bottleneck for a Nalu run.
@crtrott
Copy link
Member

crtrott commented Dec 3, 2015

I made it parallel on the host (i.e. with GPUs it still copies it still syncs the data to the host and runs the loop there).

@mhoemmen
Copy link
Contributor Author

mhoemmen commented Dec 3, 2015

I think that's OK for now. Making it GPU-parallel would require either making Tpetra's interface device-friendly, or talking directly to the KokkosSparse::CrsMatrix. Thanks Christian! Is that the two-argument version?

@crtrott
Copy link
Member

crtrott commented Dec 3, 2015

Yes this is the two argument version. That is why I left the ticket open.

@crtrott crtrott closed this as completed Dec 3, 2015
@crtrott crtrott reopened this Dec 3, 2015
@etphipp
Copy link
Contributor

etphipp commented Dec 14, 2015

Is this code thread-safe? I am getting random seg faults in the RCPNodeHandle destructor from inside the lambda when I use the OpenMP execution space with more than one thread, so I don't believe it is.

Internally it appears to be using RCP's in a few places:

  • getLocalRowView() internally calls getRowMap(), which returns an RCP.
  • In a debug build ArrayView internally uses ArrayRCP.

@crtrott
Copy link
Member

crtrott commented Dec 14, 2015

Ah the joy of not having sufficient threaded unit testing. Let me look into it.

Christian

@crtrott
Copy link
Member

crtrott commented Dec 14, 2015

I am working on that right now. It probably explains a couple of occasional non-determinstic crashes in Nalu I have seen.

crtrott added a commit that referenced this issue Dec 14, 2015
@trilinos/tpetra

This fixes (hopefully) the issue reported by Eric in issue #41.
It seemed to have fixed non-deterministic failures in Nalu.
@crtrott
Copy link
Member

crtrott commented Dec 14, 2015

Pushed a fix.

@mhoemmen
Copy link
Contributor Author

Argh, sorry folks. I'm in the middle of debugging some possible issues with Tpetra::BlockCrsMatrix::getLocalDiagOffsets.

mfh

On 12/14/15, 9:45 AM, "etphipp" <[email protected]mailto:[email protected]> wrote:

Is this code thread-safe? I am getting random seg faults in the RCPNodeHandle destructor from inside the lambda when I use the OpenMP execution space with more than one thread, so I don't believe it is.

Internally it appears to be using RCP's in a few places:

  • getLocalRowView() internally calls getRowMap(), which returns an RCP.
  • In a debug build ArrayView internally uses ArrayRCP.

Reply to this email directly or view it on GitHubhttps://github.com//issues/41#issuecomment-164490574.

@crtrott
Copy link
Member

crtrott commented Dec 15, 2015

As I said I think I fixed it already, just waiting for Eric to confirm.

@etphipp
Copy link
Contributor

etphipp commented Dec 16, 2015

Yes, it appears to work now. Thanks.

@mhoemmen
Copy link
Contributor Author

mhoemmen commented Feb 3, 2016

One-argument getLocalDiagCopy is not yet thread parallel. It uses CrsMatrix::getView, a nonpublic method which returns a Teuchos::ArrayView of the entries in a row. Teuchos::ArrayView is not thread safe in a debug build (Teuchos_ENABLE_DEBUG=ON). Thus, we have to fix this before we can make the method thread parallel.

mhoemmen pushed a commit that referenced this issue Feb 3, 2016
@trilinos/tpetra

I was working on Issue #118 and Issue #41, and made some changes to
Tpetra::CrsMatrix's getGlobalRowCopy and getLocalRowCopy methods.
Tpetra's and Ifpack2's tests passed fine, but other downstream packages'
tests did not.  It looked like the tests were failing for a CrsMatrix
with StaticProfile, never yet made fill complete, whose rows have extra
space.  This commit adds a unit test which exercises this case.
mhoemmen pushed a commit that referenced this issue Feb 3, 2016
@trilinos/tpetra Partially fix Issue #41 for the one-argument version of
getLocalDiagCopy, by Kokkos-parallelizing using the host execution
space.

We have to use the host execution space for now, because some of the
methods we need to call can't yet be marked as CUDA device functions.

This fix required replacing calls to CrsMatrix::getView (returns
Teuchos::ArrayView, and therefore not thread safe in a debug build) with
the new CrsMatrix::getViewRawConst (returns raw pointers, should always
be thread safe).

Build/Test Cases Summary
Enabled Packages: TpetraCore, Ifpack2, Amesos2, Zoltan2, MueLu, Stokhos
Disabled Packages: FEI,STK,PyTrilinos,NOX,Teko,Piro
0) MPI_DEBUG => Test case MPI_DEBUG was not run! => Does not affect push readiness! (-1.00 min)
1) SERIAL_RELEASE => Test case SERIAL_RELEASE was not run! => Does not affect push readiness! (-1.00 min)
2) MPI_DEBUG_COMPLEX => passed: passed=275,notpassed=0 (55.42 min)
3) SERIAL_RELEASE => passed: passed=241,notpassed=0 (81.68 min)
Other local commits for this build/test group: 3cd960a, 9510811, 5ea4be8
@mhoemmen mhoemmen added the stage: in progress Work on the issue has started label Feb 15, 2016
@mhoemmen
Copy link
Contributor Author

Current status: both 1-argument and 2-argument getLocalDiagCopy, and getLocalDiagOffsets, are both thread-parallel, using the host execution space. Next step is execution on the CUDA device if appropriate to the current storage status.

@mhoemmen mhoemmen assigned mhoemmen and unassigned mhoemmen Mar 6, 2016
@mhoemmen mhoemmen added this to the Tpetra BCRS Kokkos-ization milestone Mar 6, 2016
@mhoemmen mhoemmen removed the stage: in progress Work on the issue has started label Mar 6, 2016
@mhoemmen mhoemmen changed the title Tpetra::CrsMatrix::getLocalDiagCopy: Thread-parallelize diagonal extraction Tpetra::CrsMatrix::getLocalDiagCopy: Kokkos-parallelize diagonal extraction Mar 6, 2016
@mhoemmen
Copy link
Contributor Author

mhoemmen commented Mar 6, 2016

I just renamed this from "Thread-parallelize" to "Kokkos-parallelize." The code already uses the host execution space (e.g., OpenMP) for parallelization; we just need to turn the existing lambda into a functor and take a few other measures to make this work with CUDA as well.

@crtrott crtrott added the stage: in progress Work on the issue has started label Mar 8, 2016
@jwillenbring jwillenbring removed the ATDM label Mar 9, 2016
@mhoemmen
Copy link
Contributor Author

See discussion on the following pull request: crtrott#3

@mhoemmen
Copy link
Contributor Author

@amklinv -- please refer to my most recent comments on your pull request: crtrott#3

In particular, #205 only blocks Kokkos-ization of one-argument getLocalDiagCopy. Ifpack2 only cares deeply about two-argument getLocalDiagCopy, which is a lot easier. You'll still need two code paths, depending on whether the matrix is fillComplete. If it's not, just use the existing code path, with its host execution space parallelization. If it is fillComplete, write a device functor that gets the KokkosSparse::CrsMatrix out and works on that. Remember that offsets are relative to each row, not absolute over the whole matrix. KokkosSparse::CrsMatrix has a clever row view thing that you can pretend is an unmanaged Kokkos::View (implements operator()).

@mhoemmen
Copy link
Contributor Author

@amklinv -- please refer to my latest patches for #212. This should serve as a model for full Kokkos-ization of two-argument (and indeed one-argument) getLocalDiagCopy.

@mhoemmen
Copy link
Contributor Author

Note that I did push those patches to fix #212.

@mhoemmen
Copy link
Contributor Author

mhoemmen commented May 11, 2016

Note that Tpetra::Experimental::BlockCrsMatrix::getLocalDiagCopy (the 2-argument Kokkos::View version that extracts the block diagonal) currently runs in the host execution space. This is partly because it uses a lambda (which requires a new CUDA version with a special flag) and partly because the lambda in question calls BlockCrsMatrix methods that aren't marked as device functions. Just marking the methods as device functions won't work, because that won't correctly capture *this (that issue will get fixed in the C++17 standard, I think). Thus, the right fix for now is to write a functor that takes all the necessary state from BlockCrsMatrix.

I would like to wait to work on this until we've finished evaluating @crtrott 's patch to improve BlockCrsMatrix apply() performance and make the interface more Kokkos-y. Otherwise, the merge could be too hard.

@mhoemmen mhoemmen assigned mhoemmen and amklinv and unassigned amklinv and mhoemmen May 18, 2016
@mhoemmen mhoemmen modified the milestones: Tpetra-threading, Tpetra BCRS Kokkos-ization May 18, 2016
@mhoemmen
Copy link
Contributor Author

Fixed by a commit on 12 May 2016. Not sure why this got so mixed up with BCRS.

@mhoemmen mhoemmen removed the stage: in progress Work on the issue has started label May 18, 2016
CamelliaDPG added a commit to CamelliaDPG/Trilinos that referenced this issue Apr 6, 2022
# This is the 1st commit message:

This PR extends previous work in "Structured Integration", a term that generalizes sum factorization algorithms for finite element assembly. The main thing that was missing for full support of the exact sequence was pullbacks in FunctionSpaceTools for spaces other than H(grad). This PR adds those, as well as implementations of some sample formulations -- corresponding to the natural norms in each function space -- in both unit tests and performance tests. There are some changes to the new data structures as well as to the new integration kernels that were required for this full support.

StructuredIntegrationTests has gotten an upgrade; for the assembly comparison tests (of which there are now five distinct formulations assembled: Poisson plus the natural norms for each space), we now use test templates. This simplifies declarations, makes it easier to see what cases are covered, and eliminates a fair amount of code redundancy.

Modified handling of tensor bases, especially with regard to CellTopology.

Initial (draft) implementation of arbitrary-dimensional H^1, L^2 bases on the hypercube.

Note: the basis equivalence test added here fails at the moment; an exception is thrown in TensorBasis::getValues() (the five-argument version that subclasses are nominally supposed to implement).

Made TensorBasis default implementation support OPERATOR_VALUE and OPERATOR_GRAD.  (TODO: delete implementations in subclasses that handle these for the five-argument getValues().)

Revised quadrature in BasisEquivalenceTests to deal correctly with extrusions.

Right now, the new basis equivalence test fails with an exception in getCubatureImpl(), where there's an assertion that there are either two or three direct cubatures.  (And the copy logic only handles those two cases.)  So we also need (probably) to revise this to support arbitrary dimensions.

Working to rewrite BasisEquivalenceTests using BasisValues instead of DynRankView; right now, things don't compile…

Dropping implementation of getValues() in tensor-product DerivedBasis_HGRAD and _HVOL, since now TensorBasis handles these.

But right now we fail during TensorBasis::allocateBasisValues() for OPERATOR_Dn.  Wrote a TODO there regarding what needs to change to fix this.

Implemented a default getSimpleOperatorDecomposition() within TensorBasis.  Extended getDkCardinality() computation within Basis to support spaceDim up to 7.

Revisions, including a TODO discussing one current test failure.  (There are almost certainly more.)

Commented out some debugging output.

Switching from fixed-length Kokkos::Array to dynamic Kokkos::vector for vector components, to support OP_Dn use cases.

Implemented TensorBasis::getOperatorDecomposition() built-in support for OP_D1 through OP_D10.

More work on TensorBasis; now all MonolithicExecutable tests pass.  Some other test failures:

	150 - Intrepid2_unit-test_Projection_Serial_Test_InterpolationProjection_HEX_MPI_1 (Failed)
	158 - Intrepid2_unit-test_Projection_Serial_Test_Convergence_HEX_MPI_1 (Failed)

(This is locally on my machine, a Serial Release build. Have not tried OpenMP or CUDA.)

Fixed the logic for setting shards topology and tags for H(curl) and H(div) bases on the hexahedron.

Locally (in serial, CPU), all tests now pass.

Added Intrepid2::CellTopology, with support for arbitrarily many tensorial extrusions.

Revised doxygen to reflect the fact that tags are supported by raw TensorBasis now.

Added support for tags to raw TensorBasis, in terms of component tags.

Getting BasisEquivalenceTests into better shape for testing hypercube bases, including tests up to spaceDim = 7, and fixed an issue in which the cellTopo used for checking tags was incorrect (used baseTopo instead of the tensorial extrusion).

Draft implementation of SerendipityBasis.  *Not* even a complete draft: refers to BasisValues::setOrdinalFilter(), which is not yet declared, much less implemented.  But because most methods here aren't referenced by code in a .cpp file, the changes do not interfere with a successful build.

Added BasisValues::setOrdinalFilter() method, and support within operator() and extent_int() to honor this.

For Serendipity, eliminated the pointType argument, which is ignored for all of our hierarchical basis families anyway.

Fixed definition of HostBasis type.

Added a first test against Serendipity bases: a check that the actual cardinality matches the formula for the cardinality.  This test passes.

Fixed a missing return in allocateBasisValues().

Added accessor for ordinal map.

Fixed a couple issues that affect BasisValues when you have a non-trivial ordinal filter.

Fixed a test failure that resulted from comparing on two different cell topologies as though they were the same.

Added tests against Serendipity basis.  These pass.

Generalized getDkEnumeration() and getDkEnumerationInverse() to arbitrary dimensions.  Made getTensorDkEnumeration() support up to total dimension of 7.

Added support for POINTTYPE_DEFAULT for nodal H(div) Hex and Quad bases.

Made triangle H(grad) basis fail cleanly when the order exceeds MaxOrder; was a buffer overrun.

Made various bases fail cleanly when the order exceeds MaxOrder; in most cases, this could cause a buffer overrun.

Added a check that points container has the right number of tensor components.

Fixed a couple places where we incorrectly determined the space dim for a basis because we neglected the number of tensorial extrusions, as returned by getNumTensorialExtrusions().

Fixed an error in which a Kokkos::Array was potentially used uninitialized.

Added getDomainDimension() method.

Made TensorBasis::allocateBasisValues() a bit more flexible with regard to inputPoints argument (allows more tensorial components in points than there are basis components, so long as the basis lines up; this case can arise when you have a hypercube topology for one of the componentsand a non-tensor basis defined on it; e.g. HGRAD_HEX_Cn_FEM, extruded some number of times).

Added tests against getDkEnumeration and getDkEnumerationInverse, and fixed the issues in getDkEnumerationInverse that these revealed.

Fixed an error in the treatment of tensor points for the first component basis of a TensorBasis (as when a 3-component point input is provided to HGRAD_QUAD x HGRAD_LINE).

Added a TensorPoints constructor that copies selected tensor components from another TensorPoints container.

Tweaked doxygen for the new TensorPoints constructor.

Removed extraneous template parameter from the subset constuctor for TensorPoints.

Fixed an error in the TensorData-combining constructor.

Added some sanity checks on TensorPoints components.

Continued refactoring StructuredIntegrationPerformance in preparation for H(div) and H(curl) test cases, as well as adding a choice of basis family.

Also, some ExecutionSpace --> DeviceType replacements.

TransformedVectorData class --> TransformedBasisValues (still need to rename the files).

Added preliminary support for getHGRADtransformVALUE().

Added preliminary support for H^1 -- that is, (grad,grad) + (value,value) -- assembly in StructuredIntegrationPerformance.  Right now, this fails with an exception during the structured assembly of (value,value), due to the transforms being the identity and therefore unset, but integrate() assumes that they are valid Data objects.

# This is the commit message trilinos#2:

Tweaked comment to cover the scalar basis value case.

# This is the commit message trilinos#3:

Made composedTransform construction cognizant of the invalid-transform-means-identity convention.

# This is the commit message trilinos#4:

Fixed H1StandardAssembly to make it tolerant of the case when a workset size does not evenly divide the number of cells.

# This is the commit message trilinos#5:

Removed console output indicating that H1StructuredAssembly was not yet fully implemented -- it is now, though we haven't verified the values at all.

# This is the commit message trilinos#6:

Added storage of the assembled matrix results, such that they can be compared across algorithms to verify that the various algorithms agree.  (Have not yet implemented that comparison, but have added a TODO where it should go.)

# This is the commit message trilinos#7:

Added comparisons of assembled stiffness matrices to verify correctness.  For Poisson, these pass.  For the new H^1 formulation, they do not pass yet.

Fixed domainExtents when generating CellGeometry (was 0, leading to NaNs in Jacobians).

# This is the commit message trilinos#8:

Some fixes for integrate() to support (value, value) integral (for H^1 values, at least).  I've run a couple H^1 test cases here, and these pass.

# This is the commit message trilinos#9:

Commented out my debugging test cases in favor of the standard (CI) test cases.

# This is the commit message trilinos#10:

Initial effort at H(div) performance tests.  At present, fail with an exception during the structured integration.  TODO in place for a fix.

# This is the commit message trilinos#11:

A bunch of fixes toward getting the H(div) formulation working.  From the test I'm running right now (single-cell), it looks like everything is working except the Uniform case, and in that case, we have an incorrect jacobianDividedByJacobianDet (the second and third diagonal entries are missing).  The likely culprit is the in-place product computation; maybe it doesn't handle the block-plus diagonal well?

This includes some console output that shows the wrong values for jacobianDividedByJacobianDet.

# This is the commit message trilinos#12:

Added getWritableEntryWithPassThroughOption() and getEntryWithPassThroughOption(), allowing in-place combinations involving block-diagonal matrices.

Fixed a bug in getDataExtent() in which the for loop iterated over the whole activeDims_ container, rather than the first numActiveDims_ entries.  If the data there happened to match the dimension d in the argument, this could have led to incorrect behavior.

Fixed a bug in Data(std::vector<DimensionInfo> dimInfoVector) constructor, in which blockPlusDiagonal dimensions would not be processed correctly.

# This is the commit message trilinos#13:

Added unit test InPlaceProduct_Matrix, which motivated the changes in the previous commit.

# This is the commit message trilinos#14:

Got H(div) structured assembly working.

# This is the commit message trilinos#15:

Got rid of the "hypercube" modifier in the various assembly routine names: so far, we only test hypercubes, but the assembly is agnostic to that.  The fact that it's a hypercube is in the geometry object which is passed in.

# This is the commit message trilinos#16:

Added definitions, implementations of getHCURLtransformVALUE and getHCURLtransformCURL.

# This is the commit message trilinos#17:

Fixed file creation dates listed in headers (not exact; just don't want to claim that these were there last summer)…

# This is the commit message trilinos#18:

Added H(curl) formulations, and filled out support for this in StructuredIntegrationPerformance.

These already pass!  The H(curl) transformations look a lot like the H(div) ones...

# This is the commit message trilinos#19:

Made getDkEnumeration(), getDkTensorIndex() KOKKOS_INLINE_FUNCTIONs (had been merely inline).

# This is the commit message trilinos#20:

Eliminating use of Kokkos::vector to store elements in VectorData's vectorComponents_ member; evidently, Kokkos::vector's operator[] requires UVM to execute on device under CUDA; it becomes merely an inline function when KOKKOS_ENABLE_CUDA_UVM is unset.

# This is the commit message trilinos#21:

Removing now-incorrect/unnecessary construction of VectorArray objects (now that VectorArray is a fixed-length Kokkos::Array).

# This is the commit message trilinos#22:

Eliminated a couple of unused variables.

# This is the commit message trilinos#23:

Changed dimToComponent_, dimToComponentDim_, numDimsForComponent_ from Kokkos::vector to Kokkos::Array.

# This is the commit message trilinos#24:

Eliminated a few more incorrect/unnecessary VectorArray() constructor calls.

# This is the commit message trilinos#25:

Dropped reference (&) from a few Data object definitions; looks like these are causing problems with lambda capture on CUDA.

# This is the commit message trilinos#26:

Dropped some const qualifiers, as well as references, for the affine path in IntegrationTools.

# This is the commit message trilinos#27:

Added support for outputting timings data to file.

# This is the commit message trilinos#28:

StructuredIntegrationPerformance: Reworking the way that meshes get sized, so that the stiffness matrices fit within a 2 GB budget.

# This is the commit message trilinos#29:

Reworking calibration to use the "core integration" throughput as the thing to maximize.  Also increased the max workset size to match the new max cell count (32768).

# This is the commit message trilinos#30:

Fixed a bug in which the various structured assembly methods would undercount flop estimates (we were clobbering one integration's flops with another's).

# This is the commit message trilinos#31:

Trying something with Calibration mode, in which we skip over workset sizes that were bigger than the optimal choice for prior polyOrders.

# This is the commit message trilinos#32:

Implemented L^2 formulation, including getHVOLtransformVALUE() implementation in FunctionSpaceTools.

Loosened tolerances.  Added some calibration results (these are not yet complete).

# This is the commit message trilinos#33:

Added missing fences, which meant that "Standard" timers were undercounting (dramatically for H(vol)).

# This is the commit message trilinos#34:

Added calibrations for all formulations/platforms except Serial.  Running Serial now...

# This is the commit message trilinos#35:

Finished calibrations (BestSerial, BestOpenMP_16, BestCuda).

# This is the commit message trilinos#36:

Moved assembly headers to a common location; started refactoring StructuredIntegrationTests to use this.  (So far, it only uses Poisson formulation in testing.)

# This is the commit message trilinos#37:

An attempt to fix a failure in HexahedronHierarchicalDGVersusHierarchicalCG_HGRAD -- by expanding the max compenents in VectorData.  At the moment, this is leading to segfaults without much clarity in lldb as to what's causing it.  Pushing so that I can build on other machines with other debuggers.

# This is the commit message trilinos#38:

Attempting to resolve some "missing return statement" warnings from nvcc.

# This is the commit message trilinos#39:

Fixing unused variable warning.

# This is the commit message trilinos#40:

Turning off tests for OPERATOR_D3 and above for derived basis classes.

# This is the commit message trilinos#41:

Set MaxVectorComponents = 7 = MaxTensorComponents.  There's a weird issue exhibited by an Apple clang (debug) build of MonolithicExecutable: a seg fault during HexahedronHierarchicalDGVersusHierarchicalCG_HGRAD, with little or no information from the debugger on what went wrong, which evidently happens whenever MaxVectorComponents != MaxTensorComponents.

# This is the commit message trilinos#42:

Eliminating certain OPERATOR_Dn tests for compatibility with VectorData limitations.

# This is the commit message trilinos#43:

Fixed a couple issues with CUDA build.

# This is the commit message trilinos#44:

Got rid of some extraneous "using" lines in Standard Assembly headers.

# This is the commit message trilinos#45:

Tweaked doxygen.

# This is the commit message trilinos#46:

Reverting some changes that are specific to the Serendipity branch -- want to exclude those from this branch, and do a separate PR for Serendipity.

# This is the commit message trilinos#47:

Deleted Serendipity basis.

# This is the commit message trilinos#48:

More serendipity reversion.

# This is the commit message trilinos#49:

Some tweaks to exception tests, code formatting.

# This is the commit message trilinos#50:

More Serendipity reversion (possibly done, but I haven't tried building yet!).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants