Skip to content

Repository for running coffea analysis workflows

License

Notifications You must be signed in to change notification settings

ValVau/CoffeaRunner

ย 
ย 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

BTVNanoCommissioning

Linting btag ttbar ctag ttbar ctag DY+jets Workflow ctag W+c Workflow BTA Workflow QCD Workflow Code style: black

Repository for Commissioning studies in the BTV POG based on (custom) nanoAOD samples Detailed documentation in btv-wiki

Requirements

Setup

โ— suggested to install under bash environment :heavy_exclamation_mark: :heavy_exclamation_mark: not fully supported in EL9 machines yet, recommended to run in EL7 or EL8

# only first time, including submodules
git clone --recursive [email protected]:cms-btv-pog/BTVNanoCommissioning.git 

# activate enviroment once you have coffea framework 
conda activate btv_coffea

Coffea installation with Micromamba

For installing Micromamba, see [here]

wget -L micro.mamba.pm/install.sh
# Run and follow instructions on screen
bash install.sh

NOTE: always make sure that conda, python, and pip point to local micromamba installation (which conda etc.).

You can simply create the environment through the existing test_env.yml under your micromamba environment using micromamba, and activate it

micromamba env create -f test_env.yml 
micromamba activate btv_coffea

Once the environment is set up, compile the python package:

pip install -e .
pip install -e .[dev] # for developer

Other installation options for coffea

See https://coffeateam.github.io/coffea/installation.html

Quick launch of all tasks

Now you can use various shell scripts to directly launch the runner scripts with predefined scaleouts. You can modify and customize the scripts inside the scripts/submit directory according to your needs. Each script takes arguments from arguments.txt directory, that has 4 inputs i.e. - Campaign name, year, executor and luminosity. To launch any workflow, for example W+c

./ctag_wc.sh arguments.txt

Additional scripts are provided to make a directory structure that creates directories locally and copies them in the remote BTV eos area https://btvweb.web.cern.ch/Commissioning/dataMC/. Finally plots can be directly monitored in the webpage.

Structure

Each workflow can be a separate "processor" file, creating the mapping from NanoAOD to the histograms we need. Workflow processors can be passed to the runner.py script along with the fileset these should run over. Multiple executors can be chosen (for now iterative - one by one, uproot/futures - multiprocessing and dask-slurm).

To test a small set of files to see whether the workflows run smoothly, run:

python runner.py --workflow ttsemilep_sf --json metadata/test_bta_run3.json --campaign Summer23 --year 2023

More options for runner.py

more options

--wf {validation,ttcom,ttdilep_sf,ttsemilep_sf,emctag_ttdilep_sf,ctag_ttdilep_sf,ectag_ttdilep_sf,ctag_ttsemilep_sf,ectag_ttsemilep_sf,ctag_Wc_sf,ectag_Wc_sf,ctag_DY_sf,ectag_DY_sf,BTA,BTA_addPFMuons,BTA_addAllTracks,BTA_ttbar}, --workflow {validation,ttcom,ttdilep_sf,ttsemilep_sf,emctag_ttdilep_sf,ctag_ttdilep_sf,ectag_ttdilep_sf,ctag_ttsemilep_sf,ectag_ttsemilep_sf,ctag_Wc_sf,ectag_Wc_sf,ctag_DY_sf,ectag_DY_sf,BTA,BTA_addPFMuons,BTA_addAllTracks,BTA_ttbar}
                        Which processor to run
  -o OUTPUT, --output OUTPUT
                        Output histogram filename (default: hists.coffea)
  --samples SAMPLEJSON, --json SAMPLEJSON
                        JSON file containing dataset and file locations
                        (default: dummy_samples.json)
  --year YEAR           Year
  --campaign CAMPAIGN   Dataset campaign, change the corresponding correction
                        files{ "Rereco17_94X","Winter22Run3","Summer23","Summer23BPix","Summer22","Summer22EE","2018_UL","2017_UL","2016preVFP_UL","2016postVFP_UL","prompt_dataMC"}
  --isSyst              Run with systematics, all, weight_only(no JERC uncertainties included),JERC_split, None(not extract)
  --isArray             Output root files
  --noHist              Not save histogram coffea files
  --overwrite           Overwrite existing files
   --executor {iterative,futures,parsl/slurm,parsl/condor,parsl/condor/naf_lite,dask/condor,dask/condor/brux,dask/slurm,dask/lpc,dask/lxplus,dask/casa}
                        The type of executor to use (default: futures). Other options can be implemented. For
                        example see https://parsl.readthedocs.io/en/stable/userguide/configuring.html-
                        `parsl/slurm` - tested at DESY/Maxwell- `parsl/condor` - tested at DESY, RWTH-
                        `parsl/condor/naf_lite` - tested at DESY- `dask/condor/brux` - tested at BRUX (Brown U)-
                        `dask/slurm` - tested at DESY/Maxwell- `dask/condor` - tested at DESY, RWTH- `dask/lpc` -
                        custom lpc/condor setup (due to write access restrictions)- `dask/lxplus` - custom
                        lxplus/condor setup (due to port restrictions)
  -j WORKERS, --workers WORKERS
                        Number of workers (cores/threads) to use for multi- worker executors (e.g. futures or condor) (default:
                        3)
  -s SCALEOUT, --scaleout SCALEOUT
                        Number of nodes to scale out to if using slurm/condor.
                        Total number of concurrent threads is ``workers x
                        scaleout`` (default: 6)
  --memory MEMORY       Memory used in jobs (in GB) ``(default: 4GB)
  --disk DISK           Disk used in jobs  ``(default: 4GB)
  --voms VOMS           Path to voms proxy, made accessible to worker nodes.
                        By default a copy will be made to $HOME.
  --chunk N             Number of events per process chunk
  --retries N           Number of retries for coffea processor
  --fsize FSIZE         (Specific for dask/lxplus file splitting, default: 50) Numbers of files processed per
                        dask-worker
  --index INDEX         (Specific for dask/lxplus file splitting, default: 0,0) Format:
                        $dict_index_start,$file_index_start,$dict_index_stop,$file_index_stop. Stop indices are
                        optional. $dict_index refers to the index, splitted $dict_index and $file_index with ','
                        $dict_index refers to the sample dictionary of the samples json file. $file_index refers to the N-th batch of files per dask-worker, with its size being defined by the option --index. The job will start (stop) submission from (with) the corresponding indices.
  --validate            Do not process, just check all files are accessible
  --skipbadfiles        Skip bad files.
  --only ONLY           Only process specific dataset or file
  --limit N             Limit to the first N files of each dataset in sample
                        JSON
  --max N               Max number of chunks to run in total

Roadmap for running the tool (for commissioning tasks)

  1. Is the .json file ready? If not, create it following the instructions in the Make the json files section. Please use the correct naming scheme

  2. Add the lumiMask, correction files (SFs, pileup weight), and JER, JEC files under the dict entry in utils/AK4_parameters.py. See details in Correction files configurations

  3. If the JERC file jec_compiled.pkl.gz is missing in the data/JME/${campaign} directory, create it through Create compiled JERC file

  4. If selections and output histogram/arrays need to be changed, modify the dedicated workflows

  5. Run the workflow with dedicated input and campaign name. Example commands for Run 3 can be found here. For first usage, the JERC file needs to be recompiled first, see Create compiled JERC file. You can also specify --isArray to store the skimmed root files

  6. Fetch the failed files to obtain events that have been processed and events that have to be resubmitted using scripts/dump_processed.py. Check the luminosity of the processed dataset used for the plotting script and re-run failed jobs if needed (details in get procssed info)

  7. Once you obtain the .coffea file(s), you can make plots using the plotting scripts, if the xsection for your sample is missing, please add it to src/BTVNanoCommissioning/helpers/xsection.py

Check out notes for developer for more info!

Commands for different phase space

After a small test, you can run the full campaign for a dedicated phase space, separately for data and for MC.

 python runner.py --workflow $WF --json metadata/$JSON  --campaign $CAMPAIGN --year $YEAR (--executor ${scaleout_site}) 

b-SFs

details

  • Dileptonic ttbar phase space : check performance for btag SFs, emu channel
python runner.py --workflow ttdiilep_sf --json metadata/data_Summer23_2023_em_BTV_Run3_2023_Comm_MINIAODv4_NanoV12.json --campaign Summer23 --year 2023 (--executor ${scaleout_site})
  • Semileptonic ttbar phase space : check performance for btag SFs, muon channel
python runner.py --workflow ttsemilep_sf --json metadata/data_Summer23_2023_mu_BTV_Run3_2023_Comm_MINIAODv4_NanoV12.json --campaign Summer23 --year 2023 (--executor ${scaleout_site})

c-SFs

details

  • Dileptonic ttbar phase space : check performance for charm SFs, bjets enriched SFs, muon channel
python runner.py --workflow ctag_ttdilep_sf --json metadata/data_Summer23_2023_mu_BTV_Run3_2023_Comm_MINIAODv4_NanoV12.json  --campaign Summer23 --year 2023(--executor ${scaleout_site})
  • Semileptonic ttbar phase space : check performance for charm SFs, bjets enriched SFs, muon channel
python runner.py --workflow ctag_ttsemilep_sf --json metadata/data_Summer23_2023_mu_BTV_Run3_2023_Comm_MINIAODv4_NanoV12.json  --campaign Summer23 --year 2023(--executor ${scaleout_site})
  • W+c phase space : check performance for charm SFs, cjets enriched SFs, muon channel
python runner.py --workflow ctag_Wc_sf --json metadata/data_Summer23_2023_mu_BTV_Run3_2023_Comm_MINIAODv4_NanoV12.json  --campaign Summer23 --year 2023(--executor ${scaleout_site})
  • DY phase space : check performance for charm SFs, light jets enriched SFs, muon channel
python runner.py --workflow ctag_DY_sf --json metadata/data_Summer23_2023_mu_BTV_Run3_2023_Comm_MINIAODv4_NanoV12.json  --campaign Summer23 --year 2023(--executor ${scaleout_site})

Validation - check different campaigns

details

Only basic jet selections(PUID, ID, pT, $\eta$) applied. Put the json files with different campaigns, plot ROC & efficiency

python runner.py --workflow valid --json metadata/$json file

BTA - BTagAnalyzer Ntuple producer

Based on Congqiao's development to produce BTA ntuples based on PFNano.

โ— Only the newest version BTV_Run3_2023_Comm_MINIAODv4 ntuples work. Example files are given in this json. Optimize the chunksize(--chunk) in terms of the memory usage. This depends on sample, if the sample has huge jet collection/b-c hardons. The more info you store, the more memory you need. I would suggest to test with iterative to estimate the size.

details

Run with the nominal BTA workflow to include the basic event variables, jet observables, and GEN-level quarks, hadrons, leptons, and V0 variables.

python runner.py --wf BTA --json metadata/test_bta_run3.json --campaign Summer22EE --isJERC

Run with the BTA_addPFMuons workflow to additionally include the PFMuon and TrkInc collection, used by the b-tag SF derivation with the QCD(ฮผ) methods.

python runner.py --wf BTA_addPFMuons --json metadata/test_bta_run3.json --campaign Summer22EE --isJERC

Run with the BTA_addAllTracks workflow to additionally include the Tracks collection, used by the JP variable calibration.

python runner.py --wf BTA_addAllTracks --json metadata/test_bta_run3.json --campaign Summer22EE --isJERC

Scale-out

Scale out can be notoriously tricky between different sites. Coffea's integration of slurm and dask makes this quite a bit easier and for some sites the ``native'' implementation is sufficient, e.g Condor@DESY. However, some sites have certain restrictions for various reasons, in particular Condor @CERN and @FNAL. The scaleout scheme is named as follows: $cluster_schedule_system/scheduler/site. The existing sites are documented in sites configuration while standalone condor submission is possible and strongly suggested when working on lxplus.

Memory usage is also useful to adapt to cluster. Check the memory by calling memory_usage_psutil() from helpers.func.memory_usage_psutil to optimize job size. Example with ectag_Wc_sf summarized below.

details

Type Array+Hist Hist only Array Only
DoubleMuon (BTA,BTV_Comm_v2) 1243MB 848MB 1249MB
DoubleMuon (PFCands, BTV_Comm_v1) 1650MB 1274MB 1632MB
DoubleMuon (Nano_v11) 1183MB 630MB 1180MB
WJets_inc (BTA,BTV_Comm_v2) 1243MB 848MB 1249MB
WJets_inc (PFCands, BTV_Comm_v1) 1650MB 1274MB 1632MB
WJets_inc (Nano_v11) 1183MB 630MB 1180MB

Sites configuration with dask/parsl schedular

details

Condor@FNAL (CMSLPC)

Follow setup instructions at https://github.com/CoffeaTeam/lpcjobqueue. After starting the singularity container run with

python runner.py --wf ttcom --executor dask/lpc

Condor@CERN (lxplus)

Only one port is available per node, so its possible one has to try different nodes until hitting one with 8786 being open. Other than that, no additional configurations should be necessary.

python runner.py --wf ttcom --executor dask/lxplus

jobs automatically split to 50 files per jobs to avoid job failure due to crowded cluster on lxplus with the naming scheme hist_$workflow_$json_$dictindex_$fileindex.coffea. The .coffea files can be then combined at plotting level

โ— The optimal scaleout options on lxplus are -s 50 --chunk 50000

To deal with unstable condor cluster and dask worker on lxplus, you can resubmit failure jobs via --index $dictindex,$fileindex option. $dictindex refers to the index in the .json dict, $fileindex refers to the index of the file list split to 50 files per dask-worker. The total number of files of each dict can be computed by math.ceil(len($filelist)/50) The job will start from the corresponding indices.

Coffea-casa (Nebraska AF)

Coffea-casa is a JupyterHub based analysis-facility hosted at Nebraska. For more information and setup instuctions see https://coffea-casa.readthedocs.io/en/latest/cc_user.html

After setting up and checking out this repository (either via the online terminal or git widget utility run with

python runner.py --wf ttcom --executor dask/casa

Authentication is handled automatically via login auth token instead of a proxy. File paths need to replace xrootd redirector with "xcache", runner.py does this automatically.

Condor@DESY

python runner.py --wf ttcom --executor dask/condor(parsl/condor)

Maxwell@DESY

python runner.py --wf ttcom --executor parsl/slurm

Standalone condor jobs@lxplus/cmsconnect

โ— โ— โ— Strongly suggest to use this in lxplus โ— โ— โ— You have the option to run the framework through "standalone condor jobs", bypassing the native coffea-supported job submission system. Within each job you submit, a standalone script will execute the following on the worker node:

  • Set up a minimal required Python environment.
  • Retrieve the BTVNanoCommissioning repository, either from a git link or transferred locally.
  • Launch the python runner.py ... command to execute the coffea framework in the iterative executor mode.

This utility is currently adapted for the lxplus and cmsconnect condor systems. To generate jobs for launching, replace python runner.py with python condor/submitter.py, append the existing arguments, and add the following arguments in addition:

  • --jobName: Specify the desired condor job name. A dedicated folder will be generated, including all submission-related files.
  • --outputXrootdDir: Indicate the XRootD directory's path (starting with root://) where the produced .coffea (and .root) files from each worker node will be transferred to.
  • --condorFileSize: Define the number of files to process per condor job (default is 50). The input file list will be divided based on this count.
  • --remoteRepo (optional, but recommended): Specify the path and branch of the remote repository to download the BTVNanoCommissioning repository. If not specified, the local directory will be packed and transferred as the condor input, potentially leading to higher loads for condor transfers. Use the format e.g. --remoteRepo 'https://github.com/cms-btv-pog/BTVNanoCommissioning.git -b master'.

After executing the command, a new folder will be created, preparing the submission. Follow the on-screen instructions and utilize condor_submit ... to submit the jdl file. The output will be transferred to the designated XRootD destination.

The script provided by Pablo to resubmit failure jobs in script/missingFiles.py from the original job folder.

Frequent issues for standalone condor jobs submission

  1. CMS Connect provides a condor interface where one can submit jobs to all resources available in the CMS Global Pool. See WorkBookCMSConnect Twiki for the instructions if you use it for the first time.
  2. The submitted jobs are of the kind which requires a proper setup of the X509 proxy, to use the XRootD service to access and store data. In the generated .jdl file, you may see a line configured for this purpose use_x509userproxy = true. If you have not submitted jobs of this kind on lxplus condor, we recommend you to add a line
    export X509_USER_PROXY=$HOME/x509up_u`id -u`
    to .bashrc and run it so the proxy file will be stored in your AFS folder instead of in your /tmp/USERNAME folder. For submission on cmsconnect, no specific action is required.

Make the dataset json files

Use fetch.py in folder scripts/ to obtain your samples json files. You can create $input_list ,which can be a list of datasets taken from CMS DAS or names of dataset(need to specify campaigns explicity), and create the json contains dataset_name:[filelist]. One can specify the local path in that input list for samples not published in CMS DAS. $output_json_name$ is the name of your output samples json file.

The --whitelist_sites, --blacklist_sites are considered for fetch dataset if multiple sites are available

## File publish in DAS, input MC file name list, specified --from_dataset and add campaign info, if more than one campaign found, would ask for specify explicity
python scripts/fetch.py -i $MC_FILE_LIST -o ${output_json_name} --from_dataset --campaign Run3Summer23BPixNanoAODv12
## File publish in DAS, input DAS path
python fetch.py --input ${input_DAS_list} --output ${output_json_name} (--xrd {prefix_forsite})

## Not publish case, specify site by --xrd prefix
python fetch.py --input ${input_list} --output ${output_json_name} --xrd {prefix_forsite}
# where the input list should contains
$DATASET_NAME $PATH_TO_FILE

The output_json_name must contain the BTV name tag (e.g. BTV_Run3_2022_Comm_v1).

You might need to rename the json key name with following name scheme:

For the data sample please use the naming scheme,

$dataset_$Run
#i.e.
SingleMuon_Run2022C-PromptReco-v1

For MC, please be consistent with the dataset name in CMS DAS, as it cannot be mapped to the cross section otherwise.

$dataset
#i.e.
WW_TuneCP5_13p6TeV-pythia8

Get processed information

Get the run & luminosity information for the processed events from the coffea output files. When you use --skipbadfiles, the submission will ignore files not accesible(or time out) by xrootd. This script helps you to dump the processed luminosity into a json file which can be calculated by brilcalc tool and provide a list of failed lumi sections by comparing the original json input to the one from the .coffea files.

# all is default, dump lumi and failed files, if run -t lumi only case. no json file need to be specified
python scripts/dump_processed.py -c $COFFEA_FILES -n $OUTPUT_NAME (-j $ORIGINAL_JSON -t [all,lumi,failed])

Correction files configurations

โ— If the correction files are not supported yet by jsonpog-integration, you can still try with custom input data.

Options with custom input data

All the lumiMask, correction files (SFs, pileup weight), and JEC, JER files are under BTVNanoCommissioning/src/data/ following the substructure ${type}/${campaign}/${files}(except lumiMasks and Prescales)

Type File type Comments
lumiMasks .json Masked good lumi-section used for physics analysis
Prescales .txt HLT paths for prescaled triggers
PU .pkl.gz or .histo.root Pileup reweight files, matched MC to data
LSF .histo.root Lepton ID/Iso/Reco/Trigger SFs
BTV .csv or .root b-tagger, c-tagger SFs
JME .txt JER, JEC files

Create a dict entry under correction_config with dedicated campaigns in BTVNanoCommissioning/src/utils/AK4_parameters.py.

Take `Rereco17_94X` as an example.

# specify campaign
"Rereco17_94X": 
{
      ##Load files with dedicated type:filename
        "lumiMask": "Cert_314472-325175_13TeV_17SeptEarlyReReco2018ABC_PromptEraD_Collisions18_JSON.txt",
        "PU": "94XPUwei_corrections.pkl.gz",
        "JME": "jec_compiled.pkl.gz",
      ## Btag SFs- create dict specifying SFs for DeepCSV b-tag(DeepCSVB),  DeepJet b-tag(DeepJetB),DeepCSV c-tag(DeepCSVC),  DeepJet c-tag(DeepJetC),
        "BTV": {
            ### b-tag 
            "DeepCSVB": "DeepCSV_94XSF_V5_B_F.csv",
            "DeepJetB": "DeepFlavour_94XSF_V4_B_F.csv",
            ### c-tag
            "DeepCSVC": "DeepCSV_ctagSF_MiniAOD94X_2017_pTincl_v3_2_interp.root",
            "DeepJetC": "DeepJet_ctagSF_MiniAOD94X_2017_pTincl_v3_2_interp.root",
        },
        ## lepton SF - create dict specifying SFs for electron/muon ID/ISO/RECO SFs
        "LSF": {
            ### Following the scheme "${SF_name} ${histo_name_in_root_file}": "${file}"
            "ele_Trig TrigSF": "Ele32_L1DoubleEG_TrigSF_vhcc.histo.root",
            "ele_ID EGamma_SF2D": "ElectronIDSF_94X_MVA80WP.histo.root",
            "ele_Rereco EGamma_SF2D": "ElectronRecoSF_94X.histo.root",
            "mu_ID NUM_TightID_DEN_genTracks_pt_abseta": "RunBCDEF_MuIDSF.histo.root",
            "mu_ID_low NUM_TightID_DEN_genTracks_pt_abseta": "RunBCDEF_MuIDSF_lowpT.histo.root",
            "mu_Iso NUM_TightRelIso_DEN_TightIDandIPCut_pt_abseta": "RunBCDEF_MuISOSF.histo.root",
        },
    },

Use central maintained jsonpog-integration

The official correction files collected in jsonpog-integration is updated by POG except lumiMask and JME still updated by maintainer. No longer to request input files in the correction_config.

See the example with `2017_UL`.

  "2017_UL": {
        # Same with custom config
        "lumiMask": "Cert_294927-306462_13TeV_UL2017_Collisions17_MuonJSON.txt",
        "JME": "jec_compiled.pkl.gz",
        # no config need to be specify for PU weights
        "PU": None,
        # Btag SFs - specify $TAGGER : $TYPE-> find [$TAGGER_$TYPE] in json file
        "BTV": {"deepCSV": "shape", "deepJet": "shape"},
        "roccor": None,
         # JMAR, IDs from JME- Following the scheme: "${SF_name}": "${WP}"
        "JMAR": {"PUJetID_eff": "L"},
        "LSF": {
        # Electron SF - Following the scheme: "${SF_name} ${SF_map} ${year}": "${WP}"
        # https://github.com/cms-egamma/cms-egamma-docs/blob/master/docs/EgammaSFJSON.md
            "ele_ID 2017 UL-Electron-ID-SF": "wp90iso",
            "ele_Reco 2017 UL-Electron-ID-SF": "RecoAbove20",

        # Muon SF - Following the scheme: "${SF_name} ${year}": "${WP}"
        # WPs : ['NUM_GlobalMuons_DEN_genTracks', 'NUM_HighPtID_DEN_TrackerMuons', 'NUM_HighPtID_DEN_genTracks', 'NUM_IsoMu27_DEN_CutBasedIdTight_and_PFIsoTight', 'NUM_LooseID_DEN_TrackerMuons', 'NUM_LooseID_DEN_genTracks', 'NUM_LooseRelIso_DEN_LooseID', 'NUM_LooseRelIso_DEN_MediumID', 'NUM_LooseRelIso_DEN_MediumPromptID', 'NUM_LooseRelIso_DEN_TightIDandIPCut', 'NUM_LooseRelTkIso_DEN_HighPtIDandIPCut', 'NUM_LooseRelTkIso_DEN_TrkHighPtIDandIPCut', 'NUM_MediumID_DEN_TrackerMuons', 'NUM_MediumID_DEN_genTracks', 'NUM_MediumPromptID_DEN_TrackerMuons', 'NUM_MediumPromptID_DEN_genTracks', 'NUM_Mu50_or_OldMu100_or_TkMu100_DEN_CutBasedIdGlobalHighPt_and_TkIsoLoose', 'NUM_SoftID_DEN_TrackerMuons', 'NUM_SoftID_DEN_genTracks', 'NUM_TightID_DEN_TrackerMuons', 'NUM_TightID_DEN_genTracks', 'NUM_TightRelIso_DEN_MediumID', 'NUM_TightRelIso_DEN_MediumPromptID', 'NUM_TightRelIso_DEN_TightIDandIPCut', 'NUM_TightRelTkIso_DEN_HighPtIDandIPCut', 'NUM_TightRelTkIso_DEN_TrkHighPtIDandIPCut', 'NUM_TrackerMuons_DEN_genTracks', 'NUM_TrkHighPtID_DEN_TrackerMuons', 'NUM_TrkHighPtID_DEN_genTracks']

            "mu_Reco 2017_UL": "NUM_TrackerMuons_DEN_genTracks",
            "mu_HLT 2017_UL": "NUM_IsoMu27_DEN_CutBasedIdTight_and_PFIsoTight",
            "mu_ID 2017_UL": "NUM_TightID_DEN_TrackerMuons",
            "mu_Iso 2017_UL": "NUM_TightRelIso_DEN_TightIDandIPCut",
        },
    },

Get Prescale weights

! this only works in lxplus

Generate prescale weights using brilcalc

python scripts/dump_prescale.py --HLT $HLT --lumi $LUMIMASK
# HLT : put prescaled triggers
# lumi: golden lumi json

Create compiled JERC file(pkl.gz)

โ— In case existing correction file doesn't work for you due to the incompatibility of cloudpickle in different python versions. Please recompile the file to get new pickle file.

Under compile_jec.py you need to create dedicated jet factory files with different campaigns. Following the name scheme with mc for MC and data${run} for data.

Compile correction pickle files for a specific JEC campaign by changing the dict of jet_factory, and define the MC campaign and the output file name by passing it as arguments to the python script:

python -m BTVNanoCommissioning.utils.compile_jec ${campaign} jec_compiled
e.g. python -m BTVNanoCommissioning.utils.compile_jec Summer23 jec_compiled

Prompt data/MC checks and validation

Prompt data/MC checks (prompt_dataMC campaign, WIP)

To quickly check the data/MC quickly, run part data/MC files, no SFs/JEC are applied, only the lumimasks.

  1. Get the file list from DAS, use the scripts/fetch.py scripts to obtain the jsons
  2. Replace the lumimask name in prompt_dataMC in AK4_parameters.py , you can do sed -i 's/$LUMIMASK_DATAMC/xxx.json/g
  3. Run through the dataset to obtained the coffea files
  4. Dump the lumi information via dump_processed.py, then use brilcalc to get the dedicated luminosity info
  5. Obtained data MC plots

Validation workflow

Plotting code

data/MC comparisons

:exclamation_mark: If using wildcard for input, do not forget the quoatation marks! (see 2nd example below)

You can specify -v all to plot all the variables in the coffea file, or use wildcard options (e.g. -v "*DeepJet*" for the input variables containing DeepJet)

๐Ÿ†• non-uniform rebinning is possible, specify the bins with list of edges --autorebin 50,80,81,82,83,100.5

python scripts/plotdataMC.py -i a.coffea,b.coffea --lumi 41500 -p ttdilep_sf -v z_mass,z_pt  
python scripts/plotdataMC.py -i "test*.coffea" --lumi 41500 -p ttdilep_sf -v z_mass,z_pt # with wildcard option need ""
more arguments


options:
  -h, --help            show this help message and exit
  --lumi LUMI           luminosity in /pb
  --com COM             sqrt(s) in TeV
  -p {ttdilep_sf,ttsemilep_sf,ctag_Wc_sf,ctag_DY_sf,ctag_ttsemilep_sf,ctag_ttdilep_sf}, --phase {dilep_sf,ttsemilep_sf,ctag_Wc_sf,ctag_DY_sf,ctag_ttsemilep_sf,ctag_ttdilep_sf}
                        which phase space
  --log LOG             log on y axis
  --norm NORM           Use for reshape SF, scale to same yield as no SFs case
  -v VARIABLE, --variable VARIABLE
                        variables to plot, splitted by ,. Wildcard option * available as well. Specifying `all` will run through all variables.
  --SF                  make w/, w/o SF comparisons
  --ext EXT             prefix name
  -i INPUT, --input INPUT
                        input coffea files (str), splitted different files with ','. Wildcard option * available as well.
   --autorebin AUTOREBIN
                        Rebin the plotting variables, input `int` or `list`. int: merge N bins. list of number: rebin edges(non-uniform bin is possible)
   --xlabel XLABEL      rename the label for x-axis
   --ylabel YLABEL      rename the label for y-axis
   --splitOSSS SPLITOSSS 
                        Only for W+c phase space, split opposite sign(1) and same sign events(-1), if not specified, the combined OS-SS phase space is used
   --xrange XRANGE      custom x-range, --xrange xmin,xmax
   --flow FLOW 
                        str, optional {None, 'show', 'sum'} Whether plot the under/overflow bin. If 'show', add additional under/overflow bin. If 'sum', add the under/overflow bin content to first/last bin.
   --split {flavor,sample,sample_flav}
                        Decomposition of MC samples. Default is split to jet flavor(udsg, pu, c, b), possible to split by group of MC
                        samples. Combination of jetflavor+ sample split is also possible 

data/data, MC/MC comparisons

You can specify -v all to plot all the variables in the coffea file, or use wildcard options (e.g. -v "*DeepJet*" for the input variables containing DeepJet) :exclamation_mark: If using wildcard for input, do not forget the quoatation marks! (see 2nd example below)

# with merge map, compare ttbar with data
python scripts/comparison.py -i "*.coffea" --mergemap '{"ttbar": ["TTto2L2Nu_TuneCP5_13p6TeV_powheg-pythia8","TTto4Q_TuneCP5_13p6TeV_powheg-pythia8","TTtoLNu2Q_TuneCP5_13p6TeV_powheg-pythia8],"data":["MuonRun2022C-27Jun2023-v1","MuonRun2022D-27Jun2023-v1"]}' -r ttbar -c data -v mu_pt  -p ttdilep_sf
# if no  mergemap, take the key name directly
python scripts/comparison.py -i datac.coffea,datad.coffea -r MuonRun2022C-27Jun2023-v1 -c MuonRun2022D-27Jun2023-v1 -v mu_pt  -p ttdilep_sf
more arguments

options:
 -h, --help            show this help message and exit
 -p {dilep_sf,ttsemilep_sf,ctag_Wc_sf,ctag_DY_sf,ctag_ttsemilep_sf,ctag_ttdilep_sf}, --phase {dilep_sf,ttsemilep_sf,ctag_Wc_sf,ctag_DY_sf,ctag_ttsemilep_sf,ctag_ttdilep_sf}
                       which phase space
 -i INPUT, --input INPUT
                       input coffea files (str), splitted different files with ','. Wildcard option * available as well.
 -r REF, --ref REF     referance dataset
 -c COMPARED, --compared COMPARED
                       compared datasets, splitted by ,
 --sepflav SEPFLAV     seperate flavour(b/c/light)
 --log                 log on y axis
 -v VARIABLE, --variable VARIABLE
                       variables to plot, splitted by ,. Wildcard option * available as well. Specifying `all` will run through all variables.
 --ext EXT             prefix name
 --com COM             sqrt(s) in TeV
 --mergemap MERGEMAP
                       Group list of sample(keys in coffea) as reference/compare set as dictionary format. Keys would be the new lables of the group
  --autorebin AUTOREBIN
                       Rebin the plotting variables, input `int` or `list`. int: merge N bins. list of number: rebin edges(non-uniform bin is possible)
  --xlabel XLABEL      rename the label for x-axis
  --ylabel YLABEL      rename the label for y-axis
  --norm               compare shape, normalized yield to reference
  --xrange XRANGE       custom x-range, --xrange xmin,xmax
  --flow FLOW 
                       str, optional {None, 'show', 'sum'} Whether plot the under/overflow bin. If 'show', add additional under/overflow bin. If 'sum', add the under/overflow bin content to first/last bin.

ROCs & efficiency plots

Extract the ROCs for different tagger and efficiencies from validation workflow

python scripts/validation_plot.py -i  $INPUT_COFFEA -v $VERSION

Correlation plots study

You can perform a study of linear correlations of b-tagging input variables. Additionally, soft muon variables may be added into the study by requesting --SMu argument. If you wan to limit the outputs only to DeepFlavB, PNetB and RobustParTAK4B, you can use the --limit_outputs option. If you want to use only the set of variables used for tagger training, not just all the input variables, then use the option --limit_inputs. To limit number of files read, make use of option --max_files. In case your study requires splitting samples by flavour, use --flavour_split. --split_region_b performs a sample splitting based on the DeepFlavB >/< 0.5. For Data/MC comparison purpose pay attention - change ranking factors (xs/sumw) in L420!

python correlation_plots.py $input_folder [--max_files $nmax_files --SMu --limit_inputs --limit_outputs --specify_MC --flavour_split --split_region_b]

2D plots (Correlation study-related)

To further investigate the correlations, one can create the 2D plots of the variables used in this study. Inputs and optional arguments are the same as for the correlation plots study.

python 2Dhistogramms.py $input_folder [--max_files $nmax_files --SMu --limit_inputs --limit_outputs --specify_MC --flavour_split --split_region_b]

Store histograms from coffea file

Use scripts/make_template.py to dump 1D/2D histogram from .coffea to TH1D/TH2D with hist. MC histograms can be reweighted to according to luminosity value given via --lumi. You can also merge several files

python scripts/make_template.py -i "testfile/*.coffea" --lumi 7650 -o test.root -v mujet_pt -a '{"flav":0,"osss":"sum"}'
more arguments

  -i INPUT, --input INPUT
                        Input coffea file(s)
  -v VARIABLE, --variable VARIABLE
                        Variables to store (histogram name)
  -a AXIS, --axis AXIS  dict, put the slicing of histogram, specify 'sum' option as string
  --lumi LUMI           Luminosity in /pb
  -o OUTPUT, --output OUTPUT
                        output root file name
  --mergemap MERGEMAP   Specify mergemap as dict, '{merge1:[dataset1,dataset2]...}' Also works with the json file with dict

mergemap example

{
    "WJets": ["WJetsToLNu_TuneCP5_13p6TeV-madgraphMLM-pythia8"],
    "VV": [ "WW_TuneCP5_13p6TeV-pythia8", "WZ_TuneCP5_13p6TeV-pythia8", "ZZ_TuneCP5_13p6TeV-pythia8"],
    "TT": [ "TTTo2J1L1Nu_CP5_13p6TeV_powheg-pythia8", "TTTo2L2Nu_CP5_13p6TeV_powheg-pythia8"],
    "ST":[ "TBbarQ_t-channel_4FS_CP5_13p6TeV_powheg-madspin-pythia8", "TbarWplus_DR_AtLeastOneLepton_CP5_13p6TeV_powheg-pythia8", "TbarBQ_t-channel_4FS_CP5_13p6TeV_powheg-madspin-pythia8", "TWminus_DR_AtLeastOneLepton_CP5_13p6TeV_powheg-pythia8"],
"data":[ "Muon_Run2022C-PromptReco-v1", "SingleMuon_Run2022C-PromptReco-v1", "Muon_Run2022D-PromptReco-v1", "Muon_Run2022D-PromptReco-v2"]
}

Notes for developers

The BTV tutorial for coffea part is under notebooks and the template to construct new workflow is src/BTVNanoCommissioning/workflows/example.py Here are some tips provided for developers working on their forked version of this repository. Also some useful git commands can be found here

Setup CI pipeline for fork branch

Since the CI pipelines involve reading files via xrootd and access gitlab.cern.ch, you need to save some secrets in your forked directory.

Yout can find the secret configuration in the direcotry : Settings>>Secrets>>Actions, and create the following secrets:

  • GIT_CERN_SSH_PRIVATE:
    1. Create a ssh key pair with ssh-keygen -t rsa -b 4096 (do not overwrite with your local one), add the public key to your CERN gitlab account
    2. Copy the private key to the entry
  • GRID_PASSWORD: Add your grid password to the entry.
  • GRID_USERCERT & GRID_USERKEY: Encrypt your grid user certification base64 -i ~/.globus/userkey.pem | awk NF=NF RS= OFS= and base64 -i ~/.globus/usercert.pem | awk NF=NF RS= OFS= and copy the output to the entry.

Special commit head messages could run different commands in actions (add the flag in front of your commit) The default configureation is doing

python runner.py --workflow emctag_ttdilep_sf --json metadata/test_bta_run3.json --limit 1 --executor iterative --campaign Summer23 --isArray --isSyst all
  • [skip ci]: not running ci at all in the commit message
  • ci:skip array : remove --isArray option
  • ci:skip syst : remove --isSyst all option
  • ci:JERC_split : change systematic option to split JERC uncertainty sources --isSyst JERC_split
  • ci:weight_only : change systematic option to weight only variations --isSyst weight_only

Running jupyter remotely

  1. On your local machine, edit .ssh/config:
Host lxplus*
  HostName lxplus7.cern.ch
  User <your-user-name>
  ForwardX11 yes
  ForwardAgent yes
  ForwardX11Trusted yes
Host *_f
  LocalForward localhost:8800 localhost:8800
  ExitOnForwardFailure yes
  1. Connect to remote with ssh lxplus_f
  2. Start a jupyter notebook:
jupyter notebook --ip=127.0.0.1 --port 8800 --no-browser
  1. URL for notebook will be printed, copy and open in local browser

About

Repository for running coffea analysis workflows

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Jupyter Notebook 72.5%
  • Python 26.7%
  • Other 0.8%