Merge LH5 files from multiple threads #190

Open
gipert opened this issue Dec 9, 2024 · 13 comments
Labels
discussion Further information is requested output Output Schemes

Comments

@gipert
Member

gipert commented Dec 9, 2024

Is it possible to handle this in pure C++ in a reasonable way? If not, we should provide some Python remage wrapper that performs the concatenation.

@gipert gipert added discussion Further information is requested output Output Schemes labels Dec 9, 2024
@ManuelHu
Collaborator

There are "hyperslabs" available in the C++ API for reading and writing partial datasets, but using them is very verbose compared to the equivalent Python slices. You also need to take care of all the low-level details (chunking, ...) yourself.

I would really not like to maintain such code (I found an implementation of dataset merging, certainly much more sophisticated and feature-rich, with >>2k LOC).
That is much more than just the "name and attribute juggling" that we are currently doing...

@Yurivanderburg

Not sure if this helps, but I found that loading the lh5 files as awkward arrays and using ak.concatenate works very nicely. I can share some code if you want.

@tdixon97
Collaborator

This is exactly the approach we suggest people use.
However, combining the lh5 files directly in C++ or with an extra Python script is a bit trickier, since you do not want to just concatenate but also sort by g4_evtid (the files each contain a random set of g4_evtids). This also has to be done in a memory-efficient way (not reading the full data into memory), and it should be fast.
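
For illustration, here is a minimal sketch of a sorted, memory-bounded merge using only the Python standard library. It assumes that each per-thread file is already ordered by g4_evtid internally (not confirmed in this thread), and it stands in for LH5 reading with plain lists of rows. The row layout and the `merge_by_evtid` name are hypothetical.

```python
import heapq
from typing import Iterable, Iterator

Row = dict  # hypothetical row layout: {"g4_evtid": int, ...}


def merge_by_evtid(streams: Iterable[Iterable[Row]]) -> Iterator[Row]:
    """Lazily merge several streams into one ordered by g4_evtid.

    Assumes every input stream is already sorted by g4_evtid. Only one
    pending row per stream is held in memory at any time.
    """
    return heapq.merge(*streams, key=lambda row: row["g4_evtid"])


# Stand-ins for per-thread files; in practice each would be a generator
# that reads a table from disk chunk by chunk.
thread_0 = [{"g4_evtid": 0, "edep": 1.2}, {"g4_evtid": 3, "edep": 0.4}]
thread_1 = [{"g4_evtid": 1, "edep": 2.0}, {"g4_evtid": 2, "edep": 0.7}]

merged_ids = [row["g4_evtid"] for row in merge_by_evtid([thread_0, thread_1])]
print(merged_ids)
```

If the per-thread files turn out not to be ordered, a k-way merge like this is not enough and each input would need its own sort first.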

@gipert
Member Author

gipert commented Dec 12, 2024

We also want to do some merging in the simulation production workflow to avoid cluttering the filesystem with a huge amount of files.

@gipert
Member Author

gipert commented Jan 6, 2025

Now technically possible with #210.

@ManuelHu
Collaborator

ManuelHu commented Jan 6, 2025

Do we want to try to guess the output file names ourselves in python code, or do we want to have some way to communicate data (i.e. file names) from the remage-cpp process to the python wrapper?

@gipert
Member Author

gipert commented Jan 6, 2025

Uhm sounds complicated. Maybe we just make sure that all the file names are known before running remage-cpp.

@ManuelHu
Collaborator

ManuelHu commented Jan 6, 2025

Geant4 internally mangles the file names in MT mode, so there is no way to set all of them. We can only set one output file name, and we get n files with a thread number (_t{0..n} for n threads) inserted into the filename... The "original" (passed-in) output file will not exist afterwards.

The easiest (infrastructure-wise) option would be to replicate the output file renaming scheme in Python... or we could just glob on ${ORIG_BASENAME}_t*.${ORIG_EXT}


There is also another problem: the number of threads can also be (indirectly) influenced via various env variables, which would influence the number of output files... :-( So we cannot just use the -n argument to get the output file count if we want to handle the weird edge cases.
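
The glob option could look roughly like the sketch below. It assumes the `_t<N>` suffix is inserted directly before the file extension, as described above. `find_thread_files` is a hypothetical helper name, not an existing remage function.

```python
import glob
import re
from pathlib import Path


def find_thread_files(output_file: str) -> list[Path]:
    """Return the per-thread files produced for `output_file`.

    Assumes the naming scheme <stem>_t<N><ext>, located next to the
    requested output path.
    """
    requested = Path(output_file)
    # Globbing is loose (it would also match "out_tmp.lh5"), so filter
    # the candidates with a strict regex on the thread index.
    strict = re.compile(
        re.escape(requested.stem) + r"_t(\d+)" + re.escape(requested.suffix)
    )
    # Escape the stem and suffix so characters like "[" are taken literally.
    loose = f"{glob.escape(requested.stem)}_t*{glob.escape(requested.suffix)}"

    matches = []
    for candidate in requested.parent.glob(loose):
        m = strict.fullmatch(candidate.name)
        if m:
            matches.append((int(m.group(1)), candidate))
    # Sort numerically so that _t10 comes after _t9.
    return [path for _, path in sorted(matches)]
```

Counting the returned files rather than trusting -n would sidestep the thread-count question, at the price of picking up stale files left over from an earlier run with more threads.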

@gipert
Member Author

gipert commented Jan 6, 2025

> The easiest (infrastructure-wise) option would be to replicate the output file renaming scheme in Python...

this is the way...

> There is also another problem: the number of threads can also be (indirectly) influenced via various env variables

then we'll have to understand how this works and implement protections in the python wrapper...

@ManuelHu
Collaborator

Another complication: the output file name can be set in the macro (and not in the command line args). In this case we need remage to report the details back to the Python wrapper anyway (unless we want to re-implement macro parsing in Python ... 🫣)

@tdixon97
Collaborator

If remage now has a Python wrapper, could we potentially move away from Geant4 macro files? They are horrible.

@gipert
Member Author

gipert commented Jan 11, 2025

I am not sure we need to support naming output files with macro commands. Maybe we can just remove the functionality.

Toby, that sounds complicated. You still need a way to pass the configuration options to remage-cpp.

Manuel, should we think about moving all the code you wrote to make LH5 files to the Python wrapper at some point?

@ManuelHu
Collaborator

> should we think about moving all the code you wrote to make LH5 files to the Python wrapper at some point?

Maybe. There were problems with h5py and the Geant4-written files the last time I tried.

Don't get me wrong, it would be very nice to get rid of this C++ code.

But if we do so, the same problem ("how to propagate parameters/data") would also exist for input files. There we do not even have command line switches (and in the current architecture we cannot inject data from the command line into generators, because at that point it is not really clear what generator the user will actually choose...) We might even have multiple different input files for different processes/generators/vertex generators...
