Merge LH5 files from multiple threads #190
Comments
there are "hyperslabs" for reading and writing partial datasets available in the C++ API, but using them is very verbose compared to the equivalent Python slices. You also need to take care of all the low-level details (chunking, ...) yourself. I would really not like to maintain such code (I found an implementation of dataset merging, certainly much more sophisticated and feature-rich, with >>2k LOC).
Not sure if this helps, but I found that loading the LH5 files as awkward arrays and using ak.concatenate works very nicely. I can share some code if you want.
This is exactly the approach we suggest people use.
We also want to do some merging in the simulation production workflow to avoid cluttering the filesystem with a huge number of files.
Now technically possible with #210.
Do we want to try to guess the output file names ourselves in Python code, or do we want to have some way to communicate data (i.e. file names) from the
Uhm, sounds complicated. Maybe we just make sure that all the file names are known before running
Geant4 internally mangles the file names in MT mode, so there is no way to set all of them. We can only set 1 output file name, and get n files with a thread-number ( the easiest (infrastructure-wise) option would be to replicate the output file renaming scheme in Python... or we could just glob on there is also another problem: the number of threads can also be (indirectly) influenced via various env variables, which would influence the number of output files... :-( So we cannot just use the
this is the way...
then we'll have to understand how this works and implement protections in the Python wrapper...
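For illustration, the glob option with a basic protection could be sketched as below. This is only a sketch under assumptions: the `_t<N>` tag before the file extension is a placeholder for whatever naming scheme Geant4 actually applies in MT mode (the thread does not spell it out), and `find_thread_files` is a hypothetical helper, not part of remage.

```python
import glob
import re
from pathlib import Path

# ASSUMPTION: worker files are named "<stem>_t<N><suffix>". This tag is a
# placeholder; adjust it to the scheme Geant4 really uses in MT mode.
THREAD_TAG = r"_t(\d+)"


def find_thread_files(output_file):
    """Return the per-thread files next to `output_file`, sorted by thread number."""
    out = Path(output_file)
    pattern = re.compile(re.escape(out.stem) + THREAD_TAG + re.escape(out.suffix))
    matches = []
    # Glob loosely first, then filter strictly with the regex.
    for candidate in out.parent.glob(glob.escape(out.stem) + "*" + glob.escape(out.suffix)):
        m = pattern.fullmatch(candidate.name)
        if m:
            matches.append((int(m.group(1)), candidate))
    # Protection: fail loudly instead of silently merging nothing.
    if not matches:
        raise FileNotFoundError(f"no per-thread files found for {out}")
    return [path for _, path in sorted(matches, key=lambda item: item[0])]
```

Globbing does not depend on knowing the number of threads in advance, which sidesteps the env variable problem, but it can pick up stale files from an earlier run in the same directory.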
another complication: the output file name can be set in the macro (and not in the command line args). In this case we need remage to report the details back to the Python wrapper anyway (unless we want to re-implement macro parsing in Python ... 🫣)
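One possible shape for such reporting, as a rough sketch only: the child process prints a marked line with the file names it wrote, and the wrapper parses it. The `OUTPUT_FILES:` marker, the JSON payload and `run_and_collect_outputs` are all made up for this example; nothing here reflects how remage actually communicates, and a small Python child stands in for the real executable.

```python
import json
import subprocess
import sys

# ASSUMPTION: the child prints one line starting with this marker, followed by
# a JSON list of the files it wrote. This protocol is invented for the sketch.
MARKER = "OUTPUT_FILES:"


def run_and_collect_outputs(cmd):
    """Run `cmd` and return the list of output files it reported on stdout."""
    proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
    for line in proc.stdout.splitlines():
        if line.startswith(MARKER):
            return json.loads(line[len(MARKER):])
    raise RuntimeError("child process did not report its output files")


# Stand-in for the simulation executable: a tiny Python child reporting two files.
child = [
    sys.executable,
    "-c",
    'import json; print("OUTPUT_FILES:" + json.dumps(["out_t0.lh5", "out_t1.lh5"]))',
]
files = run_and_collect_outputs(child)
```

With something like this, the wrapper would not need to know whether the name came from the command line or from a macro, nor how many threads were used.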
If remage now has a Python wrapper, could we potentially move away from Geant4 macro files? Which are horrible.
I am not sure we need to support naming output files with macro commands. Maybe we can just remove the functionality. Toby, that sounds complicated. You still need a way to pass the configuration options to remage-cpp. Manuel, should we think about moving all the code you wrote to make LH5 files to the Python wrapper at some point?
maybe, there were problems with h5py and the Geant4-written files the last time I tried. Don't get me wrong, it would be very nice to get rid of this C++ code. But if we do so, the same problem ("how to propagate parameters/data") would also exist for input files. There we do not even have command line switches (and in the current architecture we cannot inject data from the command line into generators, because at that point it is not really clear what generator the user will actually choose...) We might even have multiple different input files for different processes/generators/vertex generators...
Is it possible to handle this in pure C++ in a reasonable way? If not, we should provide some Python remage wrapper that performs the concatenation.