Merge LH5 files from multiple threads #190

gipert · 2024-12-09T13:28:13Z

Is it possible to handle this in pure C++ in a reasonable way? If not, we should provide some Python remage wrapper that performs the concatenation.

ManuelHu · 2024-12-10T09:08:18Z

There are "hyperslabs" available in the C++ API to read and write partial datasets, but their usage is very verbose compared to the equivalent Python slices. You also need to take care of all the low-level things (chunking, ...) yourself.

I would really not like to maintain such code (I found an implementation of dataset merging, certainly much more sophisticated and feature-rich, with >>2k LOC).
That is much more than just the "name and attribute juggling" that we are currently doing...

Yurivanderburg · 2024-12-12T10:09:04Z

Not sure if this helps, but I found that loading the lh5 files as awkward arrays and using ak.concatenate works very nicely. I can share some code if you want.

tdixon97 · 2024-12-12T10:19:43Z

This is exactly the approach we suggest people use.
However, combining the lh5 files directly in C++ or with an extra Python script is a bit more tricky, since you do not want to just concatenate but also sort by g4_evtid (the files each contain a random set of g4_evtids). This also has to be done in a memory-efficient way (not reading the full data into memory), and it should be fast.
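A minimal sketch of the "sort while merging without loading everything" idea, under the assumption (not stated in this thread) that each per-thread file is already sorted by g4_evtid on its own. In that case a lazy k-way merge is enough. The row dictionaries and the `edep` field are hypothetical stand-ins for rows that would be read chunk by chunk from the LH5 files.

```python
import heapq
from operator import itemgetter


def merge_sorted_by_evtid(*row_streams):
    """Lazily merge row iterators that are each sorted by g4_evtid.

    Only one pending row per input stream is held in memory at a time.
    """
    return heapq.merge(*row_streams, key=itemgetter("g4_evtid"))


# Hypothetical per-thread row streams (in practice: chunked reads from disk).
thread0 = iter([{"g4_evtid": 0, "edep": 1.2}, {"g4_evtid": 3, "edep": 0.4}])
thread1 = iter([{"g4_evtid": 1, "edep": 2.0}, {"g4_evtid": 2, "edep": 0.7}])

merged_ids = [row["g4_evtid"] for row in merge_sorted_by_evtid(thread0, thread1)]
```

If the per-thread files turn out not to be sorted individually, this does not apply as-is and an external sort (sort chunks, then merge) would be needed instead.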

gipert · 2024-12-12T12:09:53Z

We also want to do some merging in the simulation production workflow to avoid cluttering the filesystem with a huge amount of files.

gipert · 2025-01-06T11:17:05Z

Now technically possible with #210.

ManuelHu · 2025-01-06T15:39:40Z

Do we want to try to guess the output file names ourselves in python code, or do we want to have some way to communicate data (i.e. file names) from the remage-cpp process to the python wrapper?

gipert · 2025-01-06T16:00:00Z

Uhm sounds complicated. Maybe we just make sure that all the file names are known before running remage-cpp.

ManuelHu · 2025-01-06T16:07:30Z

Geant4 internally mangles the file names in MT mode, so there is no way to set all of them. We can only set 1 output file name, and get n files with a thread number (_t{0..n} for n threads) inserted into the filename... The "original"/passed-in output file will not exist afterwards.

the easiest (infrastructure-wise) option would be to replicate the output file renaming scheme in python... or we could just glob on ${ORIG_BASENAME}_t*.${ORIG_EXT}

there is also another problem: the number of threads also can be (indirectly) influenced via various env variables, which would influence the number of output files... :-( So we cannot just use the -n argument to get the output file count if we want to handle the weird edge cases
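A rough sketch of the glob option, assuming the per-thread files are named `<basename>_t<number>.<ext>` as described above. The helper name `find_thread_files` is made up for illustration. Globbing avoids having to know the thread count in advance, and the regular expression filters out unrelated files that merely start with `_t`.

```python
import re
from pathlib import Path


def find_thread_files(output_file):
    """Return the per-thread files for output_file, ordered by thread number."""
    out = Path(output_file)
    # Accept only names of the form <stem>_t<digits><suffix>.
    pattern = re.compile(rf"{re.escape(out.stem)}_t(\d+){re.escape(out.suffix)}")
    matches = []
    for candidate in out.parent.glob(f"{out.stem}_t*{out.suffix}"):
        m = pattern.fullmatch(candidate.name)
        if m:
            matches.append((int(m.group(1)), candidate))
    # Sort numerically so that _t10 comes after _t2.
    return [path for _, path in sorted(matches)]
```

For example, `find_thread_files("sim/output.lh5")` would look for `sim/output_t0.lh5`, `sim/output_t1.lh5`, and so on, however many threads were actually used.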

gipert · 2025-01-06T16:38:01Z

> the easiest (infrastructure-wise) option would be to replicate the output file renaming scheme in python...

this is the way...

> there is also another problem: the number of threads also can be (indirectly) influenced via various env variables

then we'll have to understand how this works and implement protections in the python wrapper...

ManuelHu · 2025-01-11T11:58:23Z

another complication: the output file name can be set in the macro (and not in the command line args). In this case we need remage to report the details back to the python wrapper anyway (unless we want to re-implement macro parsing in python ... 🫣)
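One conceivable way to have the child process report file names back, sketched with a dedicated pipe so that the message does not mix with normal log output. This is purely illustrative: the JSON message format and the idea of passing a file descriptor number as an argument are assumptions, not an existing remage interface, and a small Python child stands in for remage-cpp. It relies on `pass_fds`, so it is POSIX-only.

```python
import json
import os
import subprocess
import sys

# Stand-in for remage-cpp: writes one JSON message to the file descriptor
# whose number is given as its first argument (hypothetical protocol).
CHILD_CODE = """
import json, os, sys
fd = int(sys.argv[1])
with os.fdopen(fd, "w") as ipc:
    json.dump({"output_files": ["out_t0.lh5", "out_t1.lh5"]}, ipc)
"""


def run_and_collect_outputs():
    """Run the child and return the output file names it reports."""
    read_fd, write_fd = os.pipe()
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE, str(write_fd)],
        pass_fds=(write_fd,),  # keep the write end open in the child
    )
    os.close(write_fd)  # the parent keeps only the read end
    with os.fdopen(read_fd) as ipc:
        message = json.load(ipc)  # reads until the child closes its end
    proc.wait()
    return message["output_files"]
```

With something like this, the wrapper would not need to guess names or parse macros, at the cost of adding an IPC channel on the C++ side.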

tdixon97 · 2025-01-11T12:20:41Z

If remage now has a python wrapper we could potentially move away from geant4 macro files? Which are horrible.

gipert · 2025-01-11T12:52:58Z

I am not sure we need to support naming output files with macro commands. Maybe we can just remove the functionality.

Toby, that sounds complicated. You still need a way to pass the configuration options to remage-cpp.

Manuel, should we think about moving all the code you wrote to make LH5 files to the Python wrapper at some point?

ManuelHu · 2025-01-12T11:10:12Z

> should we think about moving all the code you wrote to make LH5 files to the Python wrapper at some point?

maybe, there were problems with h5py and the geant4-written files, the last time I tried.

Don't get me wrong, it would be very nice to get rid of this C++ code.

But if we do so, the same problem ("how to propagate parameters/data") would also exist for input files. There we do not even have command line switches (and in the current architecture we cannot inject data from the command line into generators, because at that point it is not really clear what generator the user will actually choose...) We might even have multiple different input files for different processes/generators/vertex generators...

gipert added the "discussion" (Further information is requested) and "output" (Output Schemes) labels on Dec 9, 2024
