Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with plugin template of alpaka_serial_sync::HCALRecHitSoAProducer #86

Closed
missirol opened this issue Mar 4, 2024 · 25 comments
Closed

Comments

@missirol
Copy link
Contributor

missirol commented Mar 4, 2024

I encountered an issue related to the plugin alpaka_serial_sync::HCALRecHitSoAProducer while trying to test the integration of the "Alpaka customisations" in ConfDB.

The issue can be seen comparing the modules hltHbheRecHitSoA and hltHbheRecHitSoACPUSerial in

/users/missirol/test/dev/CMSSW_14_0_0/tmp/240303_firstAlpakaMenu/Test02/HLT/V1
# release template: CMSSW_14_0_1_HLT2
hltHbheRecHitSoA = cms.EDProducer( "HCALRecHitSoAProducer@alpaka",
    src = cms.InputTag( "hltHbhereco" ),
    synchronise = cms.untracked.bool( False ),
    alpaka = cms.untracked.PSet(  backend = cms.untracked.string( "" ) )
)

hltHbheRecHitSoACPUSerial = cms.EDProducer( "alpaka_serial_sync::HCALRecHitSoAProducer",
    synchronise = cms.untracked.bool( False ),
    src = cms.untracked.InputTag( "hltHbhereco" )
)

The issue is that hltHbheRecHitSoACPUSerial.src is an untracked parameter, while it should be tracked.

I see a similar problem for the alpaka_serial_sync:: version of other plugins fixed in cms-sw/cmssw#44291. On the other hand, their auto-generated cfi files look okay, and in ConfDB I do not see this problem for some of the other alpaka_serial_sync::* plugins (so it does not affect every alpaka_serial_sync::* plugin, in general).

Is this somehow related to the parsing of the latest CMSSW updates into ConfDB ?


More context.

  • I updated the Alpaka customisations in the branch below implementing technical changes to aid HLT-menu integration (e.g. removal of ConditionalTasks where possible).
    https://github.com/missirol/cmssw/commits/devel_hltCustomAlpaka

  • I applied this customisation to the latest HLT combined table, parsed into ConfDB (using V03-05-00), created subtables, and found the runtime error in [1] when testing the GRun subtable.

  • While checking the current Alpaka customisation with Andrea, he explained that the "CPU only" modules in the HLT menu will need to use the backend-specific plugin type (e.g. alpaka_serial_sync::HCALRecHitSoAProducer) instead of *@alpaka plus backend selection via the alpaka PSet parameter. As far as I understood, this (alpaka_serial_sync::) is necessary in order to make the CPU-only modules run even in the case where one explicitly requires the presence of a GPU device (for example, via accelerators = [ 'gpu-*']). This was missed in the discussion in Add support for Alpaka modules in HLT configs of CMSSW_14_0_0_pre1 or higher #85.

[1]

----- Begin Fatal Exception 04-Mar-2024 20:30:20 CET-----------------------
An exception of category 'Configuration' occurred while
   [0] Constructing the EventProcessor
   [1] Validating configuration of module: class=alpaka_serial_sync::HCALRecHitSoAProducer label='hltHbheRecHitSoACPUSerial'
Exception Message:
In the configuration, parameter "src" is defined as an untracked InputTag.
It should be tracked.
----- End Fatal Exception -------------------------------------------------
@Martin-Grunewald
Copy link
Contributor

Martin-Grunewald commented Mar 5, 2024

Hmm, my first comment: do we actually care? We had decided not to use the format alpaka_serial_sync::HCALRecHitSoAProducer due to the :: (the unlikely string hack was removed), but said we'll use the format HCALRecHitSoAProducer@alpaka. Why is this agreement (also Andrea agreed to) now reverted? I read your reasons above, but the :: would require the unlikely string hack as otherwise one can not even add such modules to the path in the GUI. Can we make changes elsewhere to avoid it still?

Separate is the question why ConfDb parsing seems to mess up the trackiness of some parameter. I'll need to look at the details later but can only assume there is more than one cfi file for this specific plugin and the other has this parameter different and is parsed first (?)

@Martin-Grunewald
Copy link
Contributor

This is one of the plugins fixed up by cms-sw/cmssw#44290 it seems, BUT when making template CMSSW_14_0_1_HLT2 I only fixed the module templates for the non :: versions. The others are still the old ones in the template, where the src is missing and thus the GUI can only insert an untracked one if you force it.

@fwyzard
Copy link
Contributor

fwyzard commented Mar 5, 2024

Hi @Martin-Grunewald

We had decided not to use the format alpaka_serial_sync::HCALRecHitSoAProducer due to the :: (the unlikely string hack was removed), but said we'll use the format HCALRecHitSoAProducer@alpaka. Why is this agreement (also Andrea agreed to) now reverted?

unfortunately we care, because after discussing this last December we have run into a scenario where using only the @alpaka format does not work.

To run the GPU vs CPU comparisons offline (like in the PdmV campaigns for release validation) one should configure the jobs to require running on a GPU - otherwise we may end up running on non-GPU nodes and run the same job twice on a CPU.

The way to configure this in CMSSW is to set process.options.accelerator = [ 'gpu-*' ].
This makes the jobs fail if a GPU is not available.

However, this also make all @alpaka modules to accept only the cuda_async (for gpu-nvidia) and rocm_async (for gpu-amd) back-ends (or both, for gpu-*).
And an @alpaka module configured with module.alpaka.backend = 'serial_sync' will fail.

So there are few options:

  1. use the alpaka_serial_sync:: approach, and hack (again) ConfDB to support namespaces in the C++ module types
  2. use the module.alpaka.backend = 'serial_sync' approach, and modify the framework to let these modules run on a CPU even if one sets process.options.accelerator = [ 'gpu-*' ]
  3. use the module.alpaka.backend = 'serial_sync' approach, and do not set process.options.accelerator = [ 'gpu-*' ] in the validation campaigns
  4. something else we haven't thought of yet

I'm not really very fond of 3., because we open the possibility of false negatives in the validation.

Matti told me he would like to make 2. work, but I'm not 100% sure I like the idea (of forcing a backend choice, and then forcing modules to ignore that choice). Either way, I don't think this would be ready any time soon, and the backport to 14.0.x could be complicated (I just don't know).

I don't know what 4. could be, by definition :-)

So - at the moment - the option that is left seems to be only 1.

@Martin-Grunewald
Copy link
Contributor

Martin-Grunewald commented Mar 5, 2024

I assume we need only a few modules of the :: version? Can we make a typedef and instantiate them under a name w/out ::? If we need many more, could we convert :: to something else, an allowed character like _ or so, avoiding hacking the unlikely string?
If all this becomes too much overhead or inflexible, then we need to put back the unlikely-string hack and hope that elsewhere we do not run into problems with :: in the class names (possibly implicitly assumed in the ConfDb or GUI code).

@Martin-Grunewald
Copy link
Contributor

Can one extend the module.alpaka.backend approach with a feature taking care of this special case of running?

@fwyzard
Copy link
Contributor

fwyzard commented Mar 5, 2024

I assume we need only a few modules of the :: version?

So far it looks like a dozen or so - a copy of all or almost all the alpaka modules in the menu.

Can we make a typedef and instantiate them under a name w/out ::?

Technically it should be possible, though if we do that, I would prefer not to declare both alpaka_cuda_async::Module and ModuleCudaAsync as plugins.
But I don't know if and how much of the @alpaka logic would need to be changed.

That said, there is nothing inherently wrong in using a namespace specifier in the class name, either in C++ or in the python-based language we use - it's only a limitation in ConfDB :-/

If all this becomes too much overhead or inflexible, then we need to put back the unlikely-string hack and hope that elsewhere we do not run into problems with :: in the class names (possibly implicitly assumed in the ConfDb or GUI code).

I admit I'm not familiar with the unlikely-string hack.
Why is it a problem to have colons in a module's type name ?

@fwyzard
Copy link
Contributor

fwyzard commented Mar 5, 2024

Can one extend the module.alpaka.backend approach with a feature taking care of this special case of running?

Wouldn't that be option 2. above ?

@Martin-Grunewald
Copy link
Contributor

Hmm, with a value different than 'serial_sync' (ie, not overloading what is meant by 'serial_sync')?
But OK, it may just be the same, I do not know enough about the underlying alpaka framework.

@Martin-Grunewald
Copy link
Contributor

Martin-Grunewald commented Mar 5, 2024

The unlikely-string hack is needed because otherwise, if you use the GUI to add a module to a path where that module contains a ::, the GUI crashes. It uses : internally as a delimiter. The unlikely-string hack works around this, because it seems once the module is in the config (through uploading a py file), it seems fine? (@missirol) Though I did not test changing any parameters subsequently in the GUI of such a module.

@missirol had initially put in the hack, but I did not like it and then it was reverted - see here the code location:
29ddea5

@fwyzard
Copy link
Contributor

fwyzard commented Mar 5, 2024

Would it be less of a hack if we change the delimiter everywhere to something that can never be part of a c++ name (_e.g. /) ?
Or do we store strings that embed the current delimiter : in the database ?

@missirol
Copy link
Contributor Author

missirol commented Mar 5, 2024

it seems once the module is in the config (through uploading a py file), it seems fine? (@missirol) Though I did not test changing any parameters subsequently in the GUI of such a module.

The parsing of modules of type alpaka*::* from python to ConfDB works out of the box in V03-05-00. Changing parameters of such modules via the GUI also works okay, as far as I have tested, see

/users/missirol/test/dev/CMSSW_14_0_0/tmp/240303_firstAlpakaMenu/Test02/HLT/V{2,3}

The hack only concerns the "Add Module > Open Module Inserter" portion of the GUI (for what I know).

@Martin-Grunewald
Copy link
Contributor

Would it be less of a hack if we change the delimiter everywhere to something that can never be part of a c++ name (_e.g. /) ? Or do we store strings that embed the current delimiter : in the database ?

I'd prefer that but OTOH that would require diving down inside the belly of the belly of the beast and check and find and change everywhere where it is needed - ie, time we do not have (?).

OK, I am resigned to go with the unlikely-string hack, at least it was already tested!

In addition, it would be great to get a 14_0_X IB with the fillDescription-fix PRs #44288 and #44291 so that I can re-parse.

@missirol
Copy link
Contributor Author

missirol commented Mar 5, 2024

Would it be less of a hack if we change the delimiter everywhere to something that can never be part of a c++ name (_e.g. /) ?

I looked into this, but I could not figure this out. I think this delimiter is used in various places, so some care is needed to change it properly.

As Martin just commented, to make progress on the Alpaka migration this week, we just need a new template. The hack is only for the GUI and it is not strictly needed this week.

@fwyzard
Copy link
Contributor

fwyzard commented Mar 5, 2024

OK, I am resigned to go with the unlikely-string hack, at least it was already tested!

By the way, I think you can use a simpler unlikelyStr = "##" for now, since # cannot be part of a C++ type, as far as I know.

@Martin-Grunewald
Copy link
Contributor

Please check and try ConfDb template CMSSW_14_0_X_2024-03-05-1100, it is based on that IB which has all fillDescriptions-fixes found so far in.

@missirol
Copy link
Contributor Author

missirol commented Mar 5, 2024

Thanks @Martin-Grunewald , will try.

@missirol
Copy link
Contributor Author

missirol commented Mar 5, 2024

Good news: with CMSSW_14_0_X_2024-03-05-1100, I see that the modules in question (e.g. the instance alpaka_serial_sync::HCALRecHitSoAProducer) are now okay. This solves the issue discussed in the description.

/users/missirol/test/dev/CMSSW_14_0_0/tmp/240303_firstAlpakaMenu/Test03/HLT

Bad news: I could not fully test this at runtime, because the menus based on CMSSW_14_0_X_2024-03-05-1100 cannot be properly converted to python if they contain Alpaka modules, because this check from #85, based on the name of the release template, fails.

# ERROR -- ConfDB determined that this HLT configuration uses Alpaka modules.
#          This is supported only for HLT configurations based on the release template of CMSSW_14_0_0_pre1 or higher.

This goes back to the fact that GUI considers that "IB names" (14_0_X-*) are always "lower than" any given pre-release or full release (more info here).

The short-term fix is to have the same release, or higher, parsed under a different name.. (e.g. CMSSW_14_0_1_HLT3).

@missirol
Copy link
Contributor Author

missirol commented Mar 6, 2024

#!/bin/bash

hltGetConfiguration v3-test/run3:/users/missirol/test/dev/CMSSW_14_0_0/tmp/240303_firstAlpakaMenu/Test03/GRun \
   --globaltag auto:phase1_2024_realistic \
   --mc \
   --no-prescale \
   --max-events 100 \
   --input /store/mc/Run3Winter24Digi/TT_TuneCP5_13p6TeV_powheg-pythia8/GEN-SIM-RAW/133X_mcRun3_2024_realistic_v8-v2/80000/dc984f7f-2e54-48c4-8950-5daa848b6db9.root \
   --eras Run3 --l1-emulator uGT --l1 L1Menu_Collisions2024_v1_0_0_xml \
   > hlt1.py

#   --customise HLTrigger/Configuration/customizeHLTforAlpaka.customizeHLTforAlpaka \

cat <<@EOF >> hlt1.py
for out_mod_label, out_mod in process.outputModules_().items():
  try: out_mod.SelectEvents.SelectEvents = ['HLTriggerFirstPath']
  except: pass
@EOF

edmConfigDump hlt1.py > hlt1_dump.py

cmsRun hlt1.py &> hlt1.log

A bit of progress: the test above works without runtime errors (checked both on CPU and GPU). To do this, I implemented a workaround for the release-template name and i deployed it to the confdbv3-test instance temporarily.

To do it properly, we still need a template with the "better" name, unless there are better suggestions, of course.

At least this shows that there are no other issues after that (at least testing on 100 events).

@missirol
Copy link
Contributor Author

missirol commented Mar 6, 2024

We can start looking at [1] (target) vs [2] (baseline) to figure out the minimal set of changes to integrate the Alpaka updates, but this is going beyond the scope of this issue. At some point, we should move this part of the discussion to a JIRA ticket (even just one for documentation purposes, and we can do a clean integration ticket after we iron out the details.. whichever way you see fit).

[1]

/users/missirol/test/dev/CMSSW_14_0_0/tmp/240303_firstAlpakaMenu/Test03/HLT

[2]

/users/missirol/test/dev/CMSSW_14_0_0/tmp/240303_firstAlpakaMenu/Test03/Ref/HLT

@Martin-Grunewald
Copy link
Contributor

I have renamed template CMSSW_14_0_X_2024-03-05-1100 to CMSSW_14_0_1_HLT3 in the standard (hltdev) DB.

@missirol
Copy link
Contributor Author

missirol commented Mar 6, 2024

Thanks, Martin. With the new name, the test in #86 (comment) works without specifying v3-test/run3: (as expected). I reverted the changes in the confdbv3-test instance (the latter is now identical again to V03-05-00).

Obviously, I had no idea one could rename a release template. Just out of curiosity, how is it done ?

@Martin-Grunewald
Copy link
Contributor

Martin-Grunewald commented Mar 6, 2024

Using sqlplus (available in CMSSW developer areas) to mess with the Oracle Db directly:

# which sqlplus
# /cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_14_0_X_2024-03-05-1100/external/el8_amd64_gcc12/bin/sqlplus

like this:

  sqlplus CMS_HLT_V3_W@CMS_ORCOFF_PROD           (and enter password)
  => then at SQL prompt
       SELECT * FROM u_softreleases SoftwareReleases;
       UPDATE u_softreleases SoftwareReleases SET RELEASETAG='<NEW NAME>' WHERE ID=<number of tag to be changed>;

ie, WHERE ID= is the ID returned by the SELECT command for the RELEASETAG to be changed, in this case:

UPDATE u_softreleases SoftwareReleases SET RELEASETAG='CMSSW_14_0_1_HLT3' WHERE ID=105;

@missirol
Copy link
Contributor Author

missirol commented Mar 6, 2024

Thanks, magic..

I think this issue can be closed. I will open a PR to introduce the hack for the "Output Module Inserter". How do we proceed with discussing the Alpaka integration ?

@Martin-Grunewald
Copy link
Contributor

Martin-Grunewald commented Mar 6, 2024

JIRA ticket or STORM mattermost channel?

@missirol
Copy link
Contributor Author

missirol commented Mar 6, 2024

I can open a JIRA ticket to document what i checked so far, so Andrea can also review it. Later we can see if we use this ticket for the integration, or we do the integration in a quick follow up ticket.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants