Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Add tests and rudimentary protections for default-constructed PortableCollections #44844

Closed
wants to merge 4 commits into from

Conversation

makortel
Copy link
Contributor

@makortel makortel commented Apr 24, 2024

PR description:

The HCAL CUDA->Alpaka porting work by @kakwok demonstrated bad behavior with default-constructed PortableCollection, so I added some tests to demonstrate those (at one point the HCAL work seemed to point to some strange behavior also for zero-sized PortableCollection, but I was not able to replicate that, and the strange behavior also didn't repeat later with the HCAL Alpaka code).

A default-constructed PortableCollection is in a state where it has no buffer. The PortableCollection interface does not provide a way to check the state, and therefore a caller has no way to know if PortableCollection::buffer() leads to defined or undefined behavior (because std::optional<T>::operator*() leads to undefined behavior if the optional does not contain value). This PR takes one attempt to add a function PortableCollection::isValid() that allows checking the validity, and adds assert()s to all accessors (plus size() to be able to access the SoA size without the assert(). I'm not sure if this the behavior we really want, but at least it is a starting point for a discussion.

The tests for 0-size SoA Layout and PortableCollection are for ensuring valid behavior when the SoA has also scalars in addition to columns. (and now I'm wondering what should be the behavior of a 0-size PortableCollection for a SoA that has only columns?)

The PortableObject and PortableMultiCollection should also be treated consistently, but I wanted to get feedback first on the direction we want to go first.

PR validation:

Unit tests pass

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

Possibly to be backported to 14_0_X

…eCollections

Also add tests for zero-sized PortableCollections
@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 24, 2024

cms-bot internal usage

View& view() {
assert(isValid());
return view_;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively the View accessors could be left unchecked, and require the users to explicitly check isValid() or/and the column/scalar accessors of the View to be non-nullptr when it is not clearly guaranteed that the PortableCollection is non-default constructed.

(we could also throw an exception instead of assert(), but maybe the use of default-constructed PortableCollection could be more of a logic error rather than something that would depend e.g. on the data?)

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44844/40093

  • This PR adds an extra 24KB to repository

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @makortel for master.

It involves the following packages:

  • DataFormats/Portable (heterogeneous)
  • DataFormats/SoATemplate (heterogeneous)

@fwyzard, @cmsbuild, @makortel can you please review it and eventually sign? Thanks.
@missirol, @rovere this is something you requested to watch as well.
@antoniovilela, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

//REQUIRE(coll->num() == 42);

// CopyToDevice<PortableHostCollection<T>> is not defined
#ifndef ALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLED
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These #ifdefs could be removed with #43969

@makortel
Copy link
Contributor Author

enable gpu

@makortel
Copy link
Contributor Author

@cmsbuild, please test

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ef15bb/39082/summary.html
COMMIT: c30702c
CMSSW: CMSSW_14_1_X_2024-04-24-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/44844/39082/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

There are some workflows for which there are errors in the baseline:
24834.78 step 2
25088.203 step 3
The results for the comparisons for these workflows could be incomplete
This means most likely that the IB is having errors in the relvals.The error does NOT come from this pull request

Summary:

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 3
  • DQMHistoTests: Total histograms compared: 39740
  • DQMHistoTests: Total failures: 23
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 39717
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 2 files compared)
  • Checked 8 log files, 10 edm output root files, 3 DQM output files
  • TriggerResults: no differences found

@fwyzard
Copy link
Contributor

fwyzard commented Apr 25, 2024

Are there any use case for a default-constructed PortableCollection, other than the ROOT dictionary ?

If that is the only use case, could we make the default constructor private, and somehow declare the ROOT dictionary stuff as a friend ?

@fwyzard
Copy link
Contributor

fwyzard commented Apr 25, 2024

what should be the behavior of a 0-size PortableCollection for a SoA that has only columns?

I think that a 0-size PortableCollection should be a well-defined object, with a device and a 0-size buffer associated to it.

The reason that in general we cannot do this for a default constructed PortableCollection is that we don't know what device to use.
Though, for the host case, we do know since there is only one possibility (so far), and we could do the same ?

@makortel
Copy link
Contributor Author

Are there any use case for a default-constructed PortableCollection, other than the ROOT dictionary ?

It's a good question, and I'd be tempted to answer "no". On the other hand, this issue came up with code along

// in class definition
device::EDPutToken<PortableCollection<...>> putToken_;

...

// in produce() function
if (inputVector.empty()) {
  iEvent.emplace(putToken_); // leads to default-constructed PortableCollection
}

. It's not necessarily a good use case, but it is easy to do.

If that is the only use case, could we make the default constructor private, and somehow declare the ROOT dictionary stuff as a friend ?

I don't know how easy it would be to figure out the components in ROOT that need the default constructor (some of them might even be in anonymous namespaces). In addition, there are some code paths where edm::Wrapper<T> calls the default constructor of T. In addition, since there are classes that either inherit from PortableCollection, or use it via composition, at least the default constructors of those classes must be able to call the default constructor of PortableCollection. Having the Portable{Host,Device}Collection to declare all of those as friends doesn't sound very scalable.

With @Dr15Jones we were not able to come up with hacks that would allow hiding the default constructor from user code (or even reporting with static analyzer) would not explode to our face in some way.

(in the long term it would be great to be able to move edm::Wrapper<T> to use std::optional<T> some day, but that won't be easy either)

@fwyzard
Copy link
Contributor

fwyzard commented Apr 25, 2024

I see.

Can we prevent at least default-constructed PortableDeviceCollections ?

@makortel
Copy link
Contributor Author

what should be the behavior of a 0-size PortableCollection for a SoA that has only columns?

I think that a 0-size PortableCollection should be a well-defined object, with a device and a 0-size buffer associated to it.

I agree 0-size PortableCollection should be defined at least to the extent of having a device, and the recommended way of asking the size returns 0.

What would be the meaning of the 0-size buffer? Whatever happens to be returned by the underlying allocator of the backend? E.g. for malloc() that would be implementation-defined (cppreference), and for cudaMalloc() apparently returns NULL (NVIDIA forum).

Does Alpaka itself make any assumptions on whether the underlying allocator returns a nullptr or not?

Would we want a buffer object, but containing a nullptr? Or do we care?

The reason that in general we cannot do this for a default constructed PortableCollection is that we don't know what device to use. Though, for the host case, we do know since there is only one possibility (so far), and we could do the same ?

Right, I thought the host case too. Device itself is known, but the we would not be able to make the "queue association" (i.e. cached allocation). I'm not sure if that would really matter though for this corner case.

@makortel
Copy link
Contributor Author

Can we prevent at least default-constructed PortableDeviceCollections ?

The additional indirection (or wrapping) via edm::DeviceProduct<T> would at least technically make it straightforward. I need to think a bit and make some tests. It would be an asymmetry between PortableHostCollection and PortableDeviceCollection, but it probably would prevent the use of the default constructor in a portable code (i.e. even if such code would compile for the Serial backend, it would not compile for GPU backends).

layout_(std::move(other.layout_)),
view_(std::move(other.view_))
{
other.buffer_.reset();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and I just learned something new 🤷🏻

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and I just learned something new 🤷🏻

Just curious, what was that?

Comment on lines +70 to +75
{
auto tmp = std::move(other.view_);
other.view_ = View();
view_ = View();
view_ = std::move(tmp);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should (to be verified) safe to do just

Suggested change
{
auto tmp = std::move(other.view_);
other.view_ = View();
view_ = View();
view_ = std::move(tmp);
}
{
view_ = other.view_; // self-assignment is safe
other.view_ = View();
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

however, what about the layout_ ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should (to be verified) safe to do just

This is because View contains only pointers and fundamental types? Right, then I'd also expect the self-assignment to be safe.

however, what about the layout_ ?

"Whoops", I guess, thanks for catching. Then I need to think a test that would demonstrate it failing.

{
auto tmp = std::move(other.buffer_);
other.buffer_.reset();
buffer_.reset();
Copy link
Contributor

@fwyzard fwyzard Sep 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this reset() necessary ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this reset() necessary ?

I'd think (now) it would not be necessary, and I was just following the pattern in an example https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Rc-move-self (that uses raw pointers where the delete in the assigned-to-object's member is necessary). I'd expect the assignment on the following line

buffer_ = std::move(tmp);

to work equally well in both cases of self-assignment and assignment from a different object. Any existing value in buffer_ (be it a valid Buffer or a moved-from Buffer) should get destructed first before initializing the content of buffer_ from tmp.

Comment on lines +36 to +37
//coll->num() = 42;
//REQUIRE(coll->num() == 42);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//coll->num() = 42;
//REQUIRE(coll->num() == 42);
//coll_h->num() = 42;
//REQUIRE(coll_H->num() == 42);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching

@makortel
Copy link
Contributor Author

In e4d2b53 I explored what it would entail to make PortableHostCollection's moved-from state well defined (isValid() == false, and metadata().size() returns 0). That made me wonder how much we really care of that state.

Mhm, what is the use case of being able to check the metadata().size() for an invalid portable collection ?

Thinking now (can't remember anymore what I was thinking back then), the only case coming to mind would be catching a programming errors along

class Foo : public SynchronizingEDProducer {
  std::optional<PortableCollection<FooLayout>> coll_;

  void acquire(...) {
    if (not coll_) {
      coll_ = PortableCollection<FooLayout>(...);
    }
    // access coll_
  }

  void produce(...) {
    iEvent.emplace(token_, std::move(*coll_));
  }
};

or where the aforementioned coll_ has been moved from (possibly in some complicated hard-to-catch way) before iEvent.emplace(). Hopefully most (if not all) these cases would eventually result a crash.

But one could also argue that such use-after-move cases should be caught by code review or by static analysis tools rather than defensive coding. I'm presently leaning towards this direction.

@makortel
Copy link
Contributor Author

Would the interface still make sense if isValid() was private ?

If we go with

  • PortableDeviceCollection can not be default-constructed
  • PortableHostCollection can be default-constructed, in which case we can do a 0-element allocation (whose result being implementation defined for column-only SoA)
    • Behavior of element access being undefined is acceptable
  • All use-after-move's of PortableCollections are similar programming errors as use-after-moves of anything else, and we don't need code defensively for that case

I think the isValid() and separate size() methods could be removed.

I'm not fond of the possibility that invalid collections or objects are stored, and that users have to check explicitly for their validity.

I would prefer that the collections are either valid (even if potentially with 0 size) or not produced at all (i.e. iEvent.emplace(putToken_) is not called).

I fully agree.

However, I haven't given too much thought to the possible consequences, and I could be convinced that there may be valid uses cases for storing an invalid collection.

I can only think of a programming errors like use-after-move resulting an invalid PortableCollection object.

@makortel
Copy link
Contributor Author

@fwyzard Do you know how alpaka::memcpy() behaves for 0-sized buffers? (I'm wondering if the CopyTo{Host,Device}::copyAsync() should skip the memcpy() for 0-sized buffers)

@fwyzard
Copy link
Contributor

fwyzard commented Sep 12, 2024

Do you know how alpaka::memcpy() behaves for 0-sized buffers? (I'm wondering if the CopyTo{Host,Device}::copyAsync() should skip the memcpy() for 0-sized buffers)

alpaka::memcpy will not do anything for a 0-sized buffer or 0-sized copy.

@fwyzard
Copy link
Contributor

fwyzard commented Sep 12, 2024

Looking at the example at #44844 (comment) (took me a while to figure out the error) how about adding a check in Event::emplace(token, collection) that checks if the buffer is valid ?

@makortel
Copy link
Contributor Author

Looking at the example at #44844 (comment) (took me a while to figure out the error) how about adding a check in Event::emplace(token, collection) that checks if the buffer is valid ?

We can't generally tell if an object being put in the event is "valid". But we could add a hook there (e.g. member function or standalone function) that the framework would call at the time of put() that would tell if the object would be valid. The hook function would be called only if it is defined (that's what we do already with post_insert()). In absence of the hook function the objects would be considered as valid by default.

I don't think we could make such a check with emplace() though. In emplace() the parameters after the PutToken are forwarded to the constructor (in the case of the example above invoking the move constructor). (well ok, maybe it would be possible to overload the emplace() for the case where the object itself is being passed on)

With put() the example above would be expressed as

class Foo : public SynchronizingEDProducer {
  std::unique_ptr<PortableCollection<FooLayout>> coll_;

  void acquire(...) {
    if (not coll_) {
      coll_ = std::make_unique<PortableCollection<FooLayout>>(...);
    }
    // access coll_
  }

  void produce(...) {
    iEvent.put(token_, std::move(coll_));
  }
};

and, given the move behavior of unique_ptr, that would work correctly, and to me the validity check doesn't look that beneficial.

@fwyzard
Copy link
Contributor

fwyzard commented Sep 12, 2024

I don't think we could make such a check with emplace() though. In emplace() the parameters after the PutToken are forwarded to the constructor (in the case of the example above invoking the move constructor). (well ok, maybe it would be possible to overload the emplace() for the case where the object itself is being passed on)

With emplace(), would it make sense to call isValid() on the object after it has been constructed ?

@makortel
Copy link
Contributor Author

I don't think we could make such a check with emplace() though. In emplace() the parameters after the PutToken are forwarded to the constructor (in the case of the example above invoking the move constructor). (well ok, maybe it would be possible to overload the emplace() for the case where the object itself is being passed on)

With emplace(), would it make sense to call isValid() on the object after it has been constructed ?

Good point, at least that would be technically possible (has to be done in edm::Event, but that would have been probably best anyway). Actually, the whole validity check (of class invariants) could be implemented already today with post_insert() method, which the framework would call from both put() and emplace().

@makortel
Copy link
Contributor Author

I think we need to make a decision which way to proceed:

  1. Make PortableHostCollection default constructor to behave the same as the other constructor called with 0 elements
    • The buffer_ member would not have to be std::optional anymore
    • No need for buffer existence checks
  2. Keep the present behavior of PortableHostCollection default constructor, i.e. such object does not have the buffer.
    • Need to add checks on buffer existence to (at least) buffer() and zeroInitialise() (throwing an exception if buffer_ does not have a value)
    • Need to implement post_insert() to ensure buffer_ is set

In both cases the moved-from state could be implemented as

  1. The default-constructed state
    • This is what the Core Guidelines recommend, unless there is "an exceptionally good reason not to" link
  2. The same as the behavior of alpaka::Buf or std::optional

I think it would be cleanest to pick option 1 for both default constructor and move-from behavior. I could buy the "performance argument" to prefer option 2 more easily for the moved-from state (where one would clearly avoid a memory allocation) than for the default constructor (when one trades the memory allocation with conditional code). I'd expect us to encounter (much) more moved-from states than default construction.

@makortel
Copy link
Contributor Author

Notes from discussion with @fwyzard

@cmsbuild
Copy link
Contributor

Milestone for this pull request has been moved to CMSSW_15_0_X. Please open a backport if it should also go in to CMSSW_14_2_X.

@cmsbuild cmsbuild modified the milestones: CMSSW_14_2_X, CMSSW_15_0_X Nov 22, 2024
@makortel
Copy link
Contributor Author

makortel commented Dec 3, 2024

  • We could test if it would be possible to allow not defining default constructor by having Wrapper<T>'s default constructor to call e.g. special tag-argument constructor if such exists. The tag type could be along edm::ConstructorForIO.

I had tested this approach, and after understanding better from Philippe how exactly ROOT does the construction of the Wrapper<T> (also in case where a sub-branch is accessed in bare ROOT), this seems like a viable way forward.

On this path, we could avoid the "IO constructor" to allocate the buffer (just to be thrown away in the ROOTReadStreamer() function), at the "cost" of keeping the buffer_ as std::optional. But in contrast to the present state of this PR, I'd be comfortable with removing the validity checks of the optional, because it would be without a value only in a very controlled code path. @fwyzard Would you agree?

@fwyzard
Copy link
Contributor

fwyzard commented Dec 5, 2024

Does #46877 go in the direction you had in mind ?

@makortel
Copy link
Contributor Author

makortel commented Dec 6, 2024

Does #46877 go in the direction you had in mind ?

Yes (thanks!). Given it, I think we can close this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants