
Event-based splitting of jobs. #7

Closed
IzaakWN opened this issue Feb 5, 2021 · 2 comments
Labels
enhancement New feature or request

Comments

@IzaakWN
Collaborator

IzaakWN commented Feb 5, 2021

Right now jobs are split by number of files. However, the number of entries varies wildly between nanoAOD files. If the submission routine in pico.py allowed for event-based splitting of jobs, it would be possible to make jobs and output files more uniform in length and size, and to fine-tune batch submission parameters, such as the maximum run time, more easily. With event-based splitting, several smaller files can be combined into one job, or a single large file can be split over several jobs.

It would not be too hard to implement, I think.

The post-processor already allows one to define a start event index and a maximum number of events, so "all" one needs to do is add these as options to the job argument list.
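
For example, the event range could be passed from the job argument list to the post-processor roughly like this (just a sketch: the --firstevt/--maxevts options are placeholders for whatever pico.py would pass to the job, but maxEntries and firstEntry are existing PostProcessor keyword arguments):

# Sketch: pass an event range to the nanoAOD-tools post-processor.
# The --firstevt/--maxevts options are placeholders; maxEntries and firstEntry
# are existing PostProcessor keyword arguments.
from argparse import ArgumentParser
from PhysicsTools.NanoAODTools.postprocessing.framework.postprocessor import PostProcessor

parser = ArgumentParser()
parser.add_argument('-i', '--infiles', nargs='+', default=[ ])
parser.add_argument('--firstevt', type=int, default=0,  help="index of the first event to process")
parser.add_argument('--maxevts',  type=int, default=-1, help="maximum number of events to process")
args = parser.parse_args()

maxevts   = args.maxevts if args.maxevts>0 else None # None = process all remaining events
processor = PostProcessor('.', args.infiles, postfix="_skim",
                          firstEntry=args.firstevt, maxEntries=maxevts)
processor.run()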

But first one needs to split the files into chunks that may or may not overlap. Right now the chunks are made here:

fchunks = chunkify(infiles,nfilesperjob_) # file chunks
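For event-based splitting, the chunking would instead need to know the number of entries per file. Roughly, something like this could replace (or complement) chunkify (the helper name, arguments and return format are only illustrative; the chunks follow the "fname:start:end" format proposed below):

# Rough sketch of event-based chunking (name and return format are illustrative).
# Each chunk is a list of "fname:start:end" strings: small files are grouped
# until maxevtsperjob is reached, large files are split over several jobs.
def chunkify_by_evts(infiles, nevtsperfile, maxevtsperjob):
  chunks  = [ ]  # list of chunks, one per job
  current = [ ]  # chunk currently being filled
  nevts   = 0    # number of events in the current chunk
  for fname in infiles:
    ntot  = nevtsperfile[fname]  # total number of events in this file
    first = 0
    while ntot-first>maxevtsperjob:  # split a large file over several jobs
      chunks.append(["%s:%d:%d"%(fname,first,first+maxevtsperjob)])
      first += maxevtsperjob
    nleft = ntot-first  # remaining events of this file
    if nevts+nleft>maxevtsperjob and current:  # close the current chunk
      chunks.append(current)
      current, nevts = [ ], 0
    current.append("%s:%d:%d"%(fname,first,ntot))
    nevts += nleft
  if current:
    chunks.append(current)
  return chunks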

Currently, the chunks are saved as a dictionary in the JSON job config file for bookkeeping during resubmission, e.g.

"chunkdict": {
  "0": [ "nano_1.root",  "nano_2.root" ]
  "1": [ "nano_2.root",  "nano_3.root" ]
  ...
}

The trickiest part is to save this information in the config format for bookkeeping in the resubmission and status routines; this is where a lot of bugs might creep in if the information is not stored and retrieved correctly. The simplest and most compact approach would be to append the event range to the usual filename in the chunk dictionary of the JSON config file,

"chunkdict": {
  "0": [ "nano_1.root:0:1000" ]
  "1": [ "nano_1.root:1000:2000" ]
  ...
}

and parse it in checkchunks.
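
The parsing in checkchunks could then look something like this (only a sketch; the helper is hypothetical, and plain filenames without an event range should still be accepted):

# Sketch of parsing the proposed "fname:start:end" format (hypothetical helper).
# Plain filenames without an event range are still accepted for backward compatibility.
def unpack_chunk(fname):
  parts = fname.rsplit(':',2)
  if len(parts)==3 and parts[1].isdigit() and parts[2].isdigit():
    return parts[0], int(parts[1]), int(parts[2])  # (filename, first event, last event)
  return fname, 0, -1  # no event range: process the whole file

# Examples:
#   unpack_chunk("nano_1.root:1000:2000") -> ("nano_1.root", 1000, 2000)
#   unpack_chunk("nano_1.root")           -> ("nano_1.root", 0, -1)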

It should be possible. I plan to implement it in the near future.

@IzaakWN IzaakWN added the enhancement New feature or request label Feb 5, 2021
@IzaakWN
Collaborator Author

IzaakWN commented Feb 9, 2021

Implemented as per commit 4e46663.

Tested with some samples, and so far the submission, resubmission and status checks seem to work as expected. For skimming, however, there seems to be a bug in nanoAOD-tools, see issue cms-nanoAOD/nanoAOD-tools#269. Until that is fixed, one should probably set self.firstEntry to 0 in this line: https://github.com/cms-nanoAOD/nanoAOD-tools/blob/25a793ec55b30fe7107af263c4523f20ff1a5fbd/python/postprocessing/framework/output.py#L175-L176

@IzaakWN
Collaborator Author

IzaakWN commented Nov 23, 2022

Fix cms-nanoAOD/nanoAOD-tools#276 was merged.

@IzaakWN IzaakWN closed this as completed Nov 23, 2022
pmastrap pushed a commit to pmastrap/TauFW that referenced this issue Nov 10, 2023
IzaakWN added a commit that referenced this issue Jun 14, 2024