Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tpetra: Add optional 'atomic' argument to sumIntoLocalValue(s) #31

Closed
mhoemmen opened this issue Nov 25, 2015 · 7 comments
Closed

Tpetra: Add optional 'atomic' argument to sumIntoLocalValue(s) #31

mhoemmen opened this issue Nov 25, 2015 · 7 comments

Comments

@mhoemmen
Copy link
Contributor

@trilinos/tpetra

For Tpetra::MultiVector, Tpetra::Vector, and Tpetra::CrsMatrix, add an optional bool atomic argument to sumIntoLocalValue resp. sumIntoLocalValues. If true, use Kokkos::atomic_add to do the +=; else, use +=. The default value of atomic should depend on the class' execution space: it should be false if and only if the execution space is Kokkos::Serial (so that existing MPI-only users don't have to pay for atomic updates).

@srajama1
Copy link
Contributor

Why would some one want to pass false with say OpenMP execution space. Won't that give you just wrong answer ?

@crtrott
Copy link
Member

crtrott commented Nov 25, 2015

You might do coloring in the algorithm which calls sumInto. In that case you don't want to pay for atomics, if you avoid conflicts through other means.

@srajama1
Copy link
Contributor

Can you Call sumInto on a sub problem with Tpetra ? The coloring has to be called by Tpetra to minimize the atomics, right ?

May be you are thinking of a case where the whole matrix has no conflicts. If that is the case, sure, go ahead. It is a very special case but reasonable. Expecting the user could coloring is not ideal. Tpetra has to color the graph and do the right thing.

That is the primary use case for having coloring in the kernels.

@crtrott
Copy link
Member

crtrott commented Nov 25, 2015

Yeah but sumInto is something you call inside the parallel kernel. It is not a global I work on everything function. And currently the parallelism in the assembly is not inside of Tpetra but in the application. So the application has to do the coloring. We might write another library for that, but I don't think tpetra can do it (at least I can't see the way to do that right now).

@srajama1
Copy link
Contributor

Got it. I was thinking of a special case too. Given a static graph to Tpetra and a static partitioning of rows to threads, Tpetra will know exactly which rowids need atomics, right ?

@mhoemmen
Copy link
Contributor Author

The discussion the three of us had Wednesday 25 Nov 2015 clarified the need for Tpetra sumInto to support the option of atomic updates. In particular, Tpetra is too low a level of abstraction to know how applications want to do fill. Example: for finite elements, knowing whether or not atomics are necessary for a particular row or matrix entry requires knowing the discretization order and the shape of elements (and thus needs a mapping from matrix entries back to element topology). One can always build higher-level interfaces using the lower-level interface. Furthermore, this interface will help applications gradually port to use Kokkos. (They can start by turning on atomics everywhere, get correct answers perhaps with some performance loss, then explore using coloring or some other method to turn off atomics where they are not needed.)

@mhoemmen
Copy link
Contributor Author

My commits on Thu 26 Nov 2015, committed tonight (Sat 28 Nov 2015), fix this issue.

CamelliaDPG added a commit to CamelliaDPG/Trilinos that referenced this issue Apr 6, 2022
# This is the 1st commit message:

This PR extends previous work in "Structured Integration", a term that generalizes sum factorization algorithms for finite element assembly. The main thing that was missing for full support of the exact sequence was pullbacks in FunctionSpaceTools for spaces other than H(grad). This PR adds those, as well as implementations of some sample formulations -- corresponding to the natural norms in each function space -- in both unit tests and performance tests. There are some changes to the new data structures as well as to the new integration kernels that were required for this full support.

StructuredIntegrationTests has gotten an upgrade; for the assembly comparison tests (of which there are now five distinct formulations assembled: Poisson plus the natural norms for each space), we now use test templates. This simplifies declarations, makes it easier to see what cases are covered, and eliminates a fair amount of code redundancy.

Modified handling of tensor bases, especially with regard to CellTopology.

Initial (draft) implementation of arbitrary-dimensional H^1, L^2 bases on the hypercube.

Note: the basis equivalence test added here fails at the moment; an exception is thrown in TensorBasis::getValues() (the five-argument version that subclasses are nominally supposed to implement).

Made TensorBasis default implementation support OPERATOR_VALUE and OPERATOR_GRAD.  (TODO: delete implementations in subclasses that handle these for the five-argument getValues().)

Revised quadrature in BasisEquivalenceTests to deal correctly with extrusions.

Right now, the new basis equivalence test fails with an exception in getCubatureImpl(), where there's an assertion that there are either two or three direct cubatures.  (And the copy logic only handles those two cases.)  So we also need (probably) to revise this to support arbitrary dimensions.

Working to rewrite BasisEquivalenceTests using BasisValues instead of DynRankView; right now, things don't compile…

Dropping implementation of getValues() in tensor-product DerivedBasis_HGRAD and _HVOL, since now TensorBasis handles these.

But right now we fail during TensorBasis::allocateBasisValues() for OPERATOR_Dn.  Wrote a TODO there regarding what needs to change to fix this.

Implemented a default getSimpleOperatorDecomposition() within TensorBasis.  Extended getDkCardinality() computation within Basis to support spaceDim up to 7.

Revisions, including a TODO discussing one current test failure.  (There are almost certainly more.)

Commented out some debugging output.

Switching from fixed-length Kokkos::Array to dynamic Kokkos::vector for vector components, to support OP_Dn use cases.

Implemented TensorBasis::getOperatorDecomposition() built-in support for OP_D1 through OP_D10.

More work on TensorBasis; now all MonolithicExecutable tests pass.  Some other test failures:

	150 - Intrepid2_unit-test_Projection_Serial_Test_InterpolationProjection_HEX_MPI_1 (Failed)
	158 - Intrepid2_unit-test_Projection_Serial_Test_Convergence_HEX_MPI_1 (Failed)

(This is locally on my machine, a Serial Release build. Have not tried OpenMP or CUDA.)

Fixed the logic for setting shards topology and tags for H(curl) and H(div) bases on the hexahedron.

Locally (in serial, CPU), all tests now pass.

Added Intrepid2::CellTopology, with support for arbitrarily many tensorial extrusions.

Revised doxygen to reflect the fact that tags are supported by raw TensorBasis now.

Added support for tags to raw TensorBasis, in terms of component tags.

Getting BasisEquivalenceTests into better shape for testing hypercube bases, including tests up to spaceDim = 7, and fixed an issue in which the cellTopo used for checking tags was incorrect (used baseTopo instead of the tensorial extrusion).

Draft implementation of SerendipityBasis.  *Not* even a complete draft: refers to BasisValues::setOrdinalFilter(), which is not yet declared, much less implemented.  But because most methods here aren't referenced by code in a .cpp file, the changes do not interfere with a successful build.

Added BasisValues::setOrdinalFilter() method, and support within operator() and extent_int() to honor this.

For Serendipity, eliminated the pointType argument, which is ignored for all of our hierarchical basis families anyway.

Fixed definition of HostBasis type.

Added a first test against Serendipity bases: a check that the actual cardinality matches the formula for the cardinality.  This test passes.

Fixed a missing return in allocateBasisValues().

Added accessor for ordinal map.

Fixed a couple issues that affect BasisValues when you have a non-trivial ordinal filter.

Fixed a test failure that resulted from comparing on two different cell topologies as though they were the same.

Added tests against Serendipity basis.  These pass.

Generalized getDkEnumeration() and getDkEnumerationInverse() to arbitrary dimensions.  Made getTensorDkEnumeration() support up to total dimension of 7.

Added support for POINTTYPE_DEFAULT for nodal H(div) Hex and Quad bases.

Made triangle H(grad) basis fail cleanly when the order exceeds MaxOrder; was a buffer overrun.

Made various bases fail cleanly when the order exceeds MaxOrder; in most cases, this could cause a buffer overrun.

Added a check that points container has the right number of tensor components.

Fixed a couple places where we incorrectly determined the space dim for a basis because we neglected the number of tensorial extrusions, as returned by getNumTensorialExtrusions().

Fixed an error in which a Kokkos::Array was potentially used uninitialized.

Added getDomainDimension() method.

Made TensorBasis::allocateBasisValues() a bit more flexible with regard to inputPoints argument (allows more tensorial components in points than there are basis components, so long as the basis lines up; this case can arise when you have a hypercube topology for one of the componentsand a non-tensor basis defined on it; e.g. HGRAD_HEX_Cn_FEM, extruded some number of times).

Added tests against getDkEnumeration and getDkEnumerationInverse, and fixed the issues in getDkEnumerationInverse that these revealed.

Fixed an error in the treatment of tensor points for the first component basis of a TensorBasis (as when a 3-component point input is provided to HGRAD_QUAD x HGRAD_LINE).

Added a TensorPoints constructor that copies selected tensor components from another TensorPoints container.

Tweaked doxygen for the new TensorPoints constructor.

Removed extraneous template parameter from the subset constuctor for TensorPoints.

Fixed an error in the TensorData-combining constructor.

Added some sanity checks on TensorPoints components.

Continued refactoring StructuredIntegrationPerformance in preparation for H(div) and H(curl) test cases, as well as adding a choice of basis family.

Also, some ExecutionSpace --> DeviceType replacements.

TransformedVectorData class --> TransformedBasisValues (still need to rename the files).

Added preliminary support for getHGRADtransformVALUE().

Added preliminary support for H^1 -- that is, (grad,grad) + (value,value) -- assembly in StructuredIntegrationPerformance.  Right now, this fails with an exception during the structured assembly of (value,value), due to the transforms being the identity and therefore unset, but integrate() assumes that they are valid Data objects.

# This is the commit message trilinos#2:

Tweaked comment to cover the scalar basis value case.

# This is the commit message trilinos#3:

Made composedTransform construction cognizant of the invalid-transform-means-identity convention.

# This is the commit message trilinos#4:

Fixed H1StandardAssembly to make it tolerant of the case when a workset size does not evenly divide the number of cells.

# This is the commit message trilinos#5:

Removed console output indicating that H1StructuredAssembly was not yet fully implemented -- it is now, though we haven't verified the values at all.

# This is the commit message trilinos#6:

Added storage of the assembled matrix results, such that they can be compared across algorithms to verify that the various algorithms agree.  (Have not yet implemented that comparison, but have added a TODO where it should go.)

# This is the commit message trilinos#7:

Added comparisons of assembled stiffness matrices to verify correctness.  For Poisson, these pass.  For the new H^1 formulation, they do not pass yet.

Fixed domainExtents when generating CellGeometry (was 0, leading to NaNs in Jacobians).

# This is the commit message trilinos#8:

Some fixes for integrate() to support (value, value) integral (for H^1 values, at least).  I've run a couple H^1 test cases here, and these pass.

# This is the commit message trilinos#9:

Commented out my debugging test cases in favor of the standard (CI) test cases.

# This is the commit message trilinos#10:

Initial effort at H(div) performance tests.  At present, fail with an exception during the structured integration.  TODO in place for a fix.

# This is the commit message trilinos#11:

A bunch of fixes toward getting the H(div) formulation working.  From the test I'm running right now (single-cell), it looks like everything is working except the Uniform case, and in that case, we have an incorrect jacobianDividedByJacobianDet (the second and third diagonal entries are missing).  The likely culprit is the in-place product computation; maybe it doesn't handle the block-plus diagonal well?

This includes some console output that shows the wrong values for jacobianDividedByJacobianDet.

# This is the commit message trilinos#12:

Added getWritableEntryWithPassThroughOption() and getEntryWithPassThroughOption(), allowing in-place combinations involving block-diagonal matrices.

Fixed a bug in getDataExtent() in which the for loop iterated over the whole activeDims_ container, rather than the first numActiveDims_ entries.  If the data there happened to match the dimension d in the argument, this could have led to incorrect behavior.

Fixed a bug in Data(std::vector<DimensionInfo> dimInfoVector) constructor, in which blockPlusDiagonal dimensions would not be processed correctly.

# This is the commit message trilinos#13:

Added unit test InPlaceProduct_Matrix, which motivated the changes in the previous commit.

# This is the commit message trilinos#14:

Got H(div) structured assembly working.

# This is the commit message trilinos#15:

Got rid of the "hypercube" modifier in the various assembly routine names: so far, we only test hypercubes, but the assembly is agnostic to that.  The fact that it's a hypercube is in the geometry object which is passed in.

# This is the commit message trilinos#16:

Added definitions, implementations of getHCURLtransformVALUE and getHCURLtransformCURL.

# This is the commit message trilinos#17:

Fixed file creation dates listed in headers (not exact; just don't want to claim that these were there last summer)…

# This is the commit message trilinos#18:

Added H(curl) formulations, and filled out support for this in StructuredIntegrationPerformance.

These already pass!  The H(curl) transformations look a lot like the H(div) ones...

# This is the commit message trilinos#19:

Made getDkEnumeration(), getDkTensorIndex() KOKKOS_INLINE_FUNCTIONs (had been merely inline).

# This is the commit message trilinos#20:

Eliminating use of Kokkos::vector to store elements in VectorData's vectorComponents_ member; evidently, Kokkos::vector's operator[] requires UVM to execute on device under CUDA; it becomes merely an inline function when KOKKOS_ENABLE_CUDA_UVM is unset.

# This is the commit message trilinos#21:

Removing now-incorrect/unnecessary construction of VectorArray objects (now that VectorArray is a fixed-length Kokkos::Array).

# This is the commit message trilinos#22:

Eliminated a couple of unused variables.

# This is the commit message trilinos#23:

Changed dimToComponent_, dimToComponentDim_, numDimsForComponent_ from Kokkos::vector to Kokkos::Array.

# This is the commit message trilinos#24:

Eliminated a few more incorrect/unnecessary VectorArray() constructor calls.

# This is the commit message trilinos#25:

Dropped reference (&) from a few Data object definitions; looks like these are causing problems with lambda capture on CUDA.

# This is the commit message trilinos#26:

Dropped some const qualifiers, as well as references, for the affine path in IntegrationTools.

# This is the commit message trilinos#27:

Added support for outputting timings data to file.

# This is the commit message trilinos#28:

StructuredIntegrationPerformance: Reworking the way that meshes get sized, so that the stiffness matrices fit within a 2 GB budget.

# This is the commit message trilinos#29:

Reworking calibration to use the "core integration" throughput as the thing to maximize.  Also increased the max workset size to match the new max cell count (32768).

# This is the commit message trilinos#30:

Fixed a bug in which the various structured assembly methods would undercount flop estimates (we were clobbering one integration's flops with another's).

# This is the commit message trilinos#31:

Trying something with Calibration mode, in which we skip over workset sizes that were bigger than the optimal choice for prior polyOrders.

# This is the commit message trilinos#32:

Implemented L^2 formulation, including getHVOLtransformVALUE() implementation in FunctionSpaceTools.

Loosened tolerances.  Added some calibration results (these are not yet complete).

# This is the commit message trilinos#33:

Added missing fences, which meant that "Standard" timers were undercounting (dramatically for H(vol)).

# This is the commit message trilinos#34:

Added calibrations for all formulations/platforms except Serial.  Running Serial now...

# This is the commit message trilinos#35:

Finished calibrations (BestSerial, BestOpenMP_16, BestCuda).

# This is the commit message trilinos#36:

Moved assembly headers to a common location; started refactoring StructuredIntegrationTests to use this.  (So far, it only uses Poisson formulation in testing.)

# This is the commit message trilinos#37:

An attempt to fix a failure in HexahedronHierarchicalDGVersusHierarchicalCG_HGRAD -- by expanding the max compenents in VectorData.  At the moment, this is leading to segfaults without much clarity in lldb as to what's causing it.  Pushing so that I can build on other machines with other debuggers.

# This is the commit message trilinos#38:

Attempting to resolve some "missing return statement" warnings from nvcc.

# This is the commit message trilinos#39:

Fixing unused variable warning.

# This is the commit message trilinos#40:

Turning off tests for OPERATOR_D3 and above for derived basis classes.

# This is the commit message trilinos#41:

Set MaxVectorComponents = 7 = MaxTensorComponents.  There's a weird issue exhibited by an Apple clang (debug) build of MonolithicExecutable: a seg fault during HexahedronHierarchicalDGVersusHierarchicalCG_HGRAD, with little or no information from the debugger on what went wrong, which evidently happens whenever MaxVectorComponents != MaxTensorComponents.

# This is the commit message trilinos#42:

Eliminating certain OPERATOR_Dn tests for compatibility with VectorData limitations.

# This is the commit message trilinos#43:

Fixed a couple issues with CUDA build.

# This is the commit message trilinos#44:

Got rid of some extraneous "using" lines in Standard Assembly headers.

# This is the commit message trilinos#45:

Tweaked doxygen.

# This is the commit message trilinos#46:

Reverting some changes that are specific to the Serendipity branch -- want to exclude those from this branch, and do a separate PR for Serendipity.

# This is the commit message trilinos#47:

Deleted Serendipity basis.

# This is the commit message trilinos#48:

More serendipity reversion.

# This is the commit message trilinos#49:

Some tweaks to exception tests, code formatting.

# This is the commit message trilinos#50:

More Serendipity reversion (possibly done, but I haven't tried building yet!).
trilinos-autotester pushed a commit that referenced this issue Feb 23, 2024
pyrol: Allow to import unsupported classes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants