# ---
# jupyter:
#   jupytext:
#     formats: ipynb,py:percent
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#       jupytext_version: 1.14.1
#   kernelspec:
#     display_name: Python 3 (ipykernel)
#     language: python
#     name: python3
# ---
# %% [markdown]
# # CMS Open Data $t\bar{t}$: from data delivery to statistical inference
#
# We are using [2015 CMS Open Data](https://cms.cern/news/first-cms-open-data-lhc-run-2-released) in this demonstration to showcase an analysis pipeline.
# It features data delivery and processing, histogram construction and visualization, as well as statistical inference.
#
# This notebook was developed in the context of the [IRIS-HEP AGC tools 2022 workshop](https://indico.cern.ch/e/agc-tools-2).
# This work was supported by the U.S. National Science Foundation (NSF) Cooperative Agreement OAC-1836650 (IRIS-HEP).
#
# This is a **technical demonstration**.
# We are including the relevant workflow aspects that physicists need in their work, but we are not focusing on making every piece of the demonstration physically meaningful.
# This concerns in particular systematic uncertainties: we capture the workflow, but the actual implementations are more complex in practice.
# If you are interested in the physics side of analyzing top pair production, check out the latest results from [ATLAS](https://twiki.cern.ch/twiki/bin/view/AtlasPublic/TopPublicResults) and [CMS](https://cms-results.web.cern.ch/cms-results/public-results/preliminary-results/)!
# If you would like to see more technical demonstrations, also check out an [ATLAS Open Data example](https://indico.cern.ch/event/1076231/contributions/4560405/) demonstrated previously.
#
# This notebook implements most of the analysis pipeline shown in the following picture, using the tools also mentioned there:
# ![ecosystem visualization](utils/ecosystem.png)
# %% [markdown]
# ### Data pipelines
#
# There are two possible pipelines: one with `ServiceX` enabled, and one using only `coffea` for processing.
# ![processing pipelines](utils/processing_pipelines.png)
# %% [markdown]
# ### Imports: setting up our environment
# %% tags=[]
import asyncio
import logging
import os
import time
import awkward as ak
import cabinetry
from coffea import processor
from coffea.nanoevents import NanoAODSchema
from func_adl import ObjectStream
from func_adl_servicex import ServiceXSourceUpROOT
import hist
import json
import yaml
import matplotlib.pyplot as plt
import numpy as np
import uproot
import utils # contains code for bookkeeping and cosmetics, as well as some boilerplate
logging.getLogger("cabinetry").setLevel(logging.INFO)
# %% [markdown]
# ### Configuration: number of files and data delivery path
#
# The number of files per sample set here determines the size of the dataset we are processing.
# There are 9 samples being used here, all part of the 2015 CMS Open Data release.
#
# These samples were originally published in miniAOD format, but for the purposes of this demonstration were pre-converted into nanoAOD format. More details about the inputs can be found [here](https://github.com/iris-hep/analysis-grand-challenge/tree/main/datasets/cms-open-data-2015).
#
# The table below summarizes the amount of data processed depending on the `N_FILES_MAX_PER_SAMPLE` setting.
#
# | setting | number of files | total size | number of events |
# | --- | --- | --- | --- |
# | `1` | 9 | 22.9 GB | 10455719 |
# | `2` | 18 | 42.8 GB | 19497435 |
# | `5` | 43 | 105 GB | 47996231 |
# | `10` | 79 | 200 GB | 90546458 |
# | `20` | 140 | 359 GB | 163123242 |
# | `50` | 255 | 631 GB | 297247463 |
# | `100` | 395 | 960 GB | 470397795 |
# | `200` | 595 | 1.40 TB | 705273291 |
# | `-1` | 787 | 1.78 TB | 940160174 |
#
# The input files are all in the 1–3 GB range.
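#
# As a quick back-of-the-envelope cross-check of the numbers above (not part of the analysis itself):
# %% tags=[]
# average file size for the full dataset, from the table above
print(f"average file size (full dataset): {1.78e3 / 787:.2f} GB")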
# %% tags=[]
### GLOBAL CONFIGURATION
# input files per process, set to e.g. 10 (smaller number = faster)
N_FILES_MAX_PER_SAMPLE = 5
# enable Dask
USE_DASK = True
# enable ServiceX
USE_SERVICEX = False
### LOAD OTHER CONFIGURATION VARIABLES
with open("config.yaml") as config_file:
    config = yaml.safe_load(config_file)
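# %% [markdown]
# For orientation, the configuration keys accessed later in this notebook imply a layout of `config.yaml` roughly like
# the sketch below. The values are placeholders only; refer to `config.yaml` in the repository for the actual settings.
# %% tags=[]
# hypothetical sketch of the expected configuration structure (placeholder values only)
_example_config_layout = {
    "global": {
        "AF": "<analysis facility used for the Dask client>",
        "SERVICEX_IGNORE_CACHE": False,
    },
    "benchmarking": {
        "AF_NAME": "<facility name used for bookkeeping>",
        "NUM_CORES": 4,
        "CORES_PER_WORKER": 1,
        "CHUNKSIZE": 100_000,
        "DISABLE_PROCESSING": False,
        "IO_FILE_PERCENT": 4,
        "IO_BRANCHES": {4: ["<branches read in IO-only mode>"]},
        "SYSTEMATICS": "all",
    },
}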
# %% [markdown]
# ### Defining our `coffea` Processor
#
# The processor includes a lot of the physics analysis details:
# - event filtering and the calculation of observables,
# - event weighting,
# - calculating systematic uncertainties at the event and object level,
# - filling all the information into histograms that get aggregated and ultimately returned to us by `coffea`.
# %% tags=[]
# functions creating systematic variations
def flat_variation(ones):
    # 2.5% weight variations
    return (1.0 + np.array([0.025, -0.025], dtype=np.float32)) * ones[:, None]


def btag_weight_variation(i_jet, jet_pt):
    # weight variation depending on i-th jet pT (7.5% as default value, multiplied by i-th jet pT / 50 GeV)
    return 1 + np.array([0.075, -0.075]) * (ak.singletons(jet_pt[:, i_jet]) / 50).to_numpy()


def jet_pt_resolution(pt):
    # normal distribution with 5% variations, shape matches jets
    counts = ak.num(pt)
    pt_flat = ak.flatten(pt)
    resolution_variation = np.random.normal(np.ones_like(pt_flat), 0.05)
    return ak.unflatten(resolution_variation, counts)


class TtbarAnalysis(processor.ProcessorABC):
    def __init__(self, disable_processing, io_branches):
        num_bins = 25
        bin_low = 50
        bin_high = 550
        name = "observable"
        label = "observable [GeV]"
        self.hist = (
            hist.Hist.new.Reg(num_bins, bin_low, bin_high, name=name, label=label)
            .StrCat(["4j1b", "4j2b"], name="region", label="Region")
            .StrCat([], name="process", label="Process", growth=True)
            .StrCat([], name="variation", label="Systematic variation", growth=True)
            .Weight()
        )
        self.disable_processing = disable_processing
        self.io_branches = io_branches

    def only_do_IO(self, events):
        for branch in self.io_branches:
            if "_" in branch:
                split = branch.split("_")
                object_type = split[0]
                property_name = '_'.join(split[1:])
                ak.materialized(events[object_type][property_name])
            else:
                ak.materialized(events[branch])
        return {"hist": {}}

    def process(self, events):
        if self.disable_processing:
            # IO testing with no subsequent processing
            return self.only_do_IO(events)

        histogram = self.hist.copy()

        process = events.metadata["process"]  # "ttbar" etc.
        variation = events.metadata["variation"]  # "nominal" etc.

        # normalization for MC
        x_sec = events.metadata["xsec"]
        nevts_total = events.metadata["nevts"]
        lumi = 3378  # /pb
        if process != "data":
            xsec_weight = x_sec * lumi / nevts_total
        else:
            xsec_weight = 1

        #### systematics
        # example of a simple flat weight variation, using the coffea nanoevents systematics feature
        if process == "wjets":
            events.add_systematic("scale_var", "UpDownSystematic", "weight", flat_variation)

        # jet energy scale / resolution systematics
        # need to adjust schema to instead use coffea add_systematic feature, especially for ServiceX
        # cannot attach pT variations to events.jet, so attach to events directly
        # and subsequently scale pT by these scale factors
        events["pt_nominal"] = 1.0
        events["pt_scale_up"] = 1.03
        events["pt_res_up"] = jet_pt_resolution(events.Jet.pt)

        pt_variations = ["pt_nominal", "pt_scale_up", "pt_res_up"] if variation == "nominal" else ["pt_nominal"]
        for pt_var in pt_variations:

            ### event selection
            # very very loosely based on https://arxiv.org/abs/2006.13076
            selected_electrons = events.Electron[(events.Electron.pt > 25)]  # require pt > 25 GeV for electrons
            selected_muons = events.Muon[(events.Muon.pt > 25)]  # require pt > 25 GeV for muons
            jet_filter = (events.Jet.pt * events[pt_var]) > 25  # pT > 25 GeV for jets (scaled by systematic variations)
            selected_jets = events.Jet[jet_filter]

            # single lepton requirement
            event_filters = ((ak.count(selected_electrons.pt, axis=1) + ak.count(selected_muons.pt, axis=1)) == 1)
            # at least four jets
            pt_var_modifier = events[pt_var] if "res" not in pt_var else events[pt_var][jet_filter]
            event_filters = event_filters & (ak.count(selected_jets.pt * pt_var_modifier, axis=1) >= 4)
            # at least one b-tagged jet ("tag" means score above threshold)
            B_TAG_THRESHOLD = 0.5
            event_filters = event_filters & (ak.sum(selected_jets.btagCSVV2 >= B_TAG_THRESHOLD, axis=1) >= 1)

            # apply event filters
            selected_events = events[event_filters]
            selected_electrons = selected_electrons[event_filters]
            selected_muons = selected_muons[event_filters]
            selected_jets = selected_jets[event_filters]

            for region in ["4j1b", "4j2b"]:
                # further filtering: 4j1b CR with single b-tag, 4j2b SR with two or more tags
                if region == "4j1b":
                    region_filter = ak.sum(selected_jets.btagCSVV2 >= B_TAG_THRESHOLD, axis=1) == 1
                    selected_jets_region = selected_jets[region_filter]
                    # use HT (scalar sum of jet pT) as observable
                    pt_var_modifier = (
                        events[pt_var][event_filters][region_filter]
                        if "res" not in pt_var
                        else events[pt_var][jet_filter][event_filters][region_filter]
                    )
                    observable = ak.sum(selected_jets_region.pt * pt_var_modifier, axis=-1)

                elif region == "4j2b":
                    region_filter = ak.sum(selected_jets.btagCSVV2 > B_TAG_THRESHOLD, axis=1) >= 2
                    selected_jets_region = selected_jets[region_filter]
                    pt_var_modifier = (
                        events[pt_var][event_filters][region_filter]
                        if "res" not in pt_var
                        else events[pt_var][jet_filter][event_filters][region_filter]
                    )
                    # overwrite jets object with modified pT (handled by correctionlib in later versions)
                    selected_jets_region = ak.zip(
                        {"pt": selected_jets_region.pt * pt_var_modifier,
                         "eta": selected_jets_region.eta,
                         "phi": selected_jets_region.phi,
                         "mass": selected_jets_region.mass,
                         "btagCSVV2": selected_jets_region.btagCSVV2},
                        with_name="PtEtaPhiMLorentzVector"
                    )

                    # reconstruct hadronic top as bjj system with largest pT
                    trijet = ak.combinations(selected_jets_region, 3, fields=["j1", "j2", "j3"])  # trijet candidates
                    trijet["p4"] = trijet.j1 + trijet.j2 + trijet.j3  # calculate four-momentum of tri-jet system
                    trijet["max_btag"] = np.maximum(trijet.j1.btagCSVV2, np.maximum(trijet.j2.btagCSVV2, trijet.j3.btagCSVV2))
                    trijet = trijet[trijet.max_btag > B_TAG_THRESHOLD]  # at least one b-tag in trijet candidates
                    # pick trijet candidate with largest pT and calculate mass of system
                    trijet_mass = trijet["p4"][ak.argmax(trijet.p4.pt, axis=1, keepdims=True)].mass
                    observable = ak.flatten(trijet_mass)

                ### histogram filling
                if pt_var == "pt_nominal":
                    # nominal pT, but including 2-point systematics
                    histogram.fill(
                        observable=observable, region=region, process=process,
                        variation=variation, weight=xsec_weight
                    )

                    if variation == "nominal":
                        # also fill weight-based variations for all nominal samples
                        for weight_name in events.systematics.fields:
                            for direction in ["up", "down"]:
                                # extract the weight variations and apply all event & region filters
                                weight_variation = events.systematics[weight_name][direction][
                                    f"weight_{weight_name}"][event_filters][region_filter]
                                # fill histograms
                                histogram.fill(
                                    observable=observable, region=region, process=process,
                                    variation=f"{weight_name}_{direction}", weight=xsec_weight*weight_variation
                                )

                        # calculate additional systematics: b-tagging variations
                        for i_var, weight_name in enumerate([f"btag_var_{i}" for i in range(4)]):
                            for i_dir, direction in enumerate(["up", "down"]):
                                # create systematic variations that depend on object properties (here: jet pT)
                                if len(observable):
                                    weight_variation = btag_weight_variation(i_var, selected_jets_region.pt)[:, i_dir]
                                else:
                                    weight_variation = 1  # no events selected
                                histogram.fill(
                                    observable=observable, region=region, process=process,
                                    variation=f"{weight_name}_{direction}", weight=xsec_weight*weight_variation
                                )

                elif variation == "nominal":
                    # pT variations for nominal samples
                    histogram.fill(
                        observable=observable, region=region, process=process,
                        variation=pt_var, weight=xsec_weight
                    )

        output = {"nevents": {events.metadata["dataset"]: len(events)}, "hist": histogram}
        return output

    def postprocess(self, accumulator):
        return accumulator
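# %% [markdown]
# As a standalone illustration (a sketch, not used by the analysis itself) of the histogram booked in
# `TtbarAnalysis.__init__`: the `process` and `variation` axes use `growth=True`, so new categories are created
# automatically the first time they are filled.
# %% tags=[]
sketch_hist = (
    hist.Hist.new.Reg(25, 50, 550, name="observable", label="observable [GeV]")
    .StrCat(["4j1b", "4j2b"], name="region", label="Region")
    .StrCat([], name="process", label="Process", growth=True)
    .StrCat([], name="variation", label="Systematic variation", growth=True)
    .Weight()
)
# filling with a previously unseen process / variation adds the corresponding categories
sketch_hist.fill(observable=[130.0, 250.0], region="4j2b", process="ttbar", variation="nominal", weight=1.0)
print("processes after filling:", list(sketch_hist.axes["process"]))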
# %% [markdown]
# ### "Fileset" construction and metadata
#
# Here, we gather all the required information about the files we want to process: paths to the files and associated metadata.
# %% tags=[]
fileset = utils.construct_fileset(N_FILES_MAX_PER_SAMPLE, use_xcache=False, af_name=config["benchmarking"]["AF_NAME"]) # local files on /data for ssl-dev
print(f"processes in fileset: {list(fileset.keys())}")
print(f"\nexample of information in fileset:\n{{\n 'files': [{fileset['ttbar__nominal']['files'][0]}, ...],")
print(f" 'metadata': {fileset['ttbar__nominal']['metadata']}\n}}")
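# %% [markdown]
# As an optional sanity check, we can verify that the metadata fields read in `TtbarAnalysis.process` via
# `events.metadata` are provided for every sample in the fileset:
# %% tags=[]
required_metadata = ["process", "variation", "xsec", "nevts"]  # fields accessed in the processor above
for sample, sample_info in fileset.items():
    missing = [key for key in required_metadata if key not in sample_info["metadata"]]
    assert not missing, f"{sample} is missing metadata fields: {missing}"
print("all samples provide the metadata fields needed by the processor")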
# %% [markdown]
# ### ServiceX-specific functionality: query setup
#
# Define the `func_adl` query used to extract the relevant columns and to filter events.
# %% tags=[]
def get_query(source: ObjectStream) -> ObjectStream:
    """Query for event / column selection: >=4j >=1b, ==1 lep with pT>25 GeV, return relevant columns
    """
    return source.Where(lambda e: e.Electron_pt.Where(lambda pt: pt > 25).Count()
                        + e.Muon_pt.Where(lambda pt: pt > 25).Count() == 1)\
        .Where(lambda f: f.Jet_pt.Where(lambda pt: pt > 25).Count() >= 4)\
        .Where(lambda g: {"pt": g.Jet_pt,
                          "btagCSVV2": g.Jet_btagCSVV2}.Zip().Where(lambda jet:
                                                                    jet.btagCSVV2 >= 0.5
                                                                    and jet.pt > 25).Count() >= 1)\
        .Select(lambda h: {"Electron_pt": h.Electron_pt,
                           "Muon_pt": h.Muon_pt,
                           "Jet_mass": h.Jet_mass,
                           "Jet_pt": h.Jet_pt,
                           "Jet_eta": h.Jet_eta,
                           "Jet_phi": h.Jet_phi,
                           "Jet_btagCSVV2": h.Jet_btagCSVV2,
                           })
# %% [markdown]
# ### Caching the queried datasets with `ServiceX`
#
# Using the query created with `func_adl`, we use `ServiceX` to read the CMS Open Data files and build cached files that contain only the event information requested by the query.
# %% tags=[]
if USE_SERVICEX:
    # dummy dataset on which to generate the query
    dummy_ds = ServiceXSourceUpROOT("cernopendata://dummy", "Events", backend_name="uproot")

    # tell low-level infrastructure not to contact ServiceX yet, only to
    # return the qastle string it would have sent
    dummy_ds.return_qastle = True

    # create the query
    query = get_query(dummy_ds).value()

    # now we query the files using a wrapper around ServiceXDataset to transform all processes at once
    t0 = time.time()
    ds = utils.ServiceXDatasetGroup(fileset, backend_name="uproot", ignore_cache=config["global"]["SERVICEX_IGNORE_CACHE"])
    files_per_process = ds.get_data_rootfiles_uri(query, as_signed_url=True, title="CMS ttbar")

    print(f"ServiceX data delivery took {time.time() - t0:.2f} seconds")

    # update fileset to point to ServiceX-transformed files
    for process in fileset.keys():
        fileset[process]["files"] = [f.url for f in files_per_process[process]]
# %% [markdown]
# ### Execute the data delivery pipeline
#
# What happens here depends on the flag `USE_SERVICEX`. If it is set to true, the processor runs over the data previously gathered by `ServiceX` and the output histograms are aggregated.
#
# When `USE_SERVICEX` is false, the original input files also need to be processed during this step.
# %% tags=[]
NanoAODSchema.warn_missing_crossrefs = False # silences warnings about branches we will not use here
if USE_DASK:
    executor = processor.DaskExecutor(client=utils.get_client(af=config["global"]["AF"]))
else:
    executor = processor.FuturesExecutor(workers=config["benchmarking"]["NUM_CORES"])

run = processor.Runner(executor=executor, schema=NanoAODSchema, savemetrics=True, metadata_cache={}, chunksize=config["benchmarking"]["CHUNKSIZE"])

if USE_SERVICEX:
    treename = "servicex"
else:
    treename = "Events"
filemeta = run.preprocess(fileset, treename=treename) # pre-processing
t0 = time.monotonic()
all_histograms, metrics = run(fileset, treename, processor_instance=TtbarAnalysis(config["benchmarking"]["DISABLE_PROCESSING"], config["benchmarking"]["IO_BRANCHES"][config["benchmarking"]["IO_FILE_PERCENT"]])) # processing
exec_time = time.monotonic() - t0
all_histograms = all_histograms["hist"]
print(f"\nexecution took {exec_time:.2f} seconds")
# %% tags=[]
# track metrics
dataset_source = "/data" if fileset["ttbar__nominal"]["files"][0].startswith("/data") else "https://xrootd-local.unl.edu:1094" # TODO: xcache support
metrics.update({
    "walltime": exec_time,
    "num_workers": config["benchmarking"]["NUM_CORES"],
    "af": config["benchmarking"]["AF_NAME"],
    "dataset_source": dataset_source,
    "use_dask": USE_DASK,
    "use_servicex": USE_SERVICEX,
    "systematics": config["benchmarking"]["SYSTEMATICS"],
    "n_files_max_per_sample": N_FILES_MAX_PER_SAMPLE,
    "cores_per_worker": config["benchmarking"]["CORES_PER_WORKER"],
    "chunksize": config["benchmarking"]["CHUNKSIZE"],
    "disable_processing": config["benchmarking"]["DISABLE_PROCESSING"],
    "io_file_percent": config["benchmarking"]["IO_FILE_PERCENT"]
})
# save metrics to disk
if not os.path.exists("metrics"):
    os.makedirs("metrics")
timestamp = time.strftime('%Y%m%d-%H%M%S')
af_name = metrics["af"]
metric_file_name = f"metrics/{af_name}-{timestamp}.json"
with open(metric_file_name, "w") as f:
    f.write(json.dumps(metrics))
print(f"metrics saved as {metric_file_name}")
#print(f"event rate per worker (full execution time divided by NUM_CORES={NUM_CORES}): {metrics['entries'] / NUM_CORES / exec_time / 1_000:.2f} kHz")
print(f"event rate per worker (pure processtime): {metrics['entries'] / metrics['processtime'] / 1_000:.2f} kHz")
print(f"amount of data read: {metrics['bytesread']/1000**2:.2f} MB")  # likely buggy: https://github.com/CoffeaTeam/coffea/issues/717
# %% [markdown]
# ### Inspecting the produced histograms
#
# Let's have a look at the data we obtained.
# We built histograms in two phase space regions, for multiple physics processes and systematic variations.
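#
# The categorical axes of the histogram record which processes and systematic variations were filled:
# %% tags=[]
print("processes: ", list(all_histograms.axes["process"]))
print("variations:", list(all_histograms.axes["variation"]))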
# %% tags=[]
utils.set_style()
all_histograms[120j::hist.rebin(2), "4j1b", :, "nominal"].stack("process")[::-1].plot(stack=True, histtype="fill", linewidth=1, edgecolor="grey")
plt.legend(frameon=False)
plt.title(">= 4 jets, 1 b-tag")
plt.xlabel("HT [GeV]");
# %% tags=[]
all_histograms[:, "4j2b", :, "nominal"].stack("process")[::-1].plot(stack=True, histtype="fill", linewidth=1,edgecolor="grey")
plt.legend(frameon=False)
plt.title(">= 4 jets, >= 2 b-tags")
plt.xlabel("$m_{bjj}$ [GeV]");
# %% [markdown]
# Our top reconstruction approach ($bjj$ system with largest $p_T$) has worked!
#
# Let's also have a look at some systematic variations:
# - b-tagging, which we implemented as jet-kinematic dependent event weights,
# - jet energy variations, which vary jet kinematics, resulting in acceptance effects and observable changes.
#
# We are making use of [UHI](https://uhi.readthedocs.io/) slicing here to re-bin.
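#
# As a brief standalone illustration of this slicing (purely for orientation): `120j` selects by axis *value*
# (here 120 GeV) rather than by bin index, and `hist.rebin(2)` merges pairs of neighboring bins.
# %% tags=[]
nominal_4j1b = all_histograms[:, "4j1b", "ttbar", "nominal"]
print("original bin edges:", nominal_4j1b.axes[0].edges)
print("after [120j::hist.rebin(2)]:", nominal_4j1b[120j::hist.rebin(2)].axes[0].edges)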
# %% tags=[]
# b-tagging variations
all_histograms[120j::hist.rebin(2), "4j1b", "ttbar", "nominal"].plot(label="nominal", linewidth=2)
all_histograms[120j::hist.rebin(2), "4j1b", "ttbar", "btag_var_0_up"].plot(label="NP 1", linewidth=2)
all_histograms[120j::hist.rebin(2), "4j1b", "ttbar", "btag_var_1_up"].plot(label="NP 2", linewidth=2)
all_histograms[120j::hist.rebin(2), "4j1b", "ttbar", "btag_var_2_up"].plot(label="NP 3", linewidth=2)
all_histograms[120j::hist.rebin(2), "4j1b", "ttbar", "btag_var_3_up"].plot(label="NP 4", linewidth=2)
plt.legend(frameon=False)
plt.xlabel("HT [GeV]")
plt.title("b-tagging variations");
# %% tags=[]
# jet energy scale variations
all_histograms[:, "4j2b", "ttbar", "nominal"].plot(label="nominal", linewidth=2)
all_histograms[:, "4j2b", "ttbar", "pt_scale_up"].plot(label="scale up", linewidth=2)
all_histograms[:, "4j2b", "ttbar", "pt_res_up"].plot(label="resolution up", linewidth=2)
plt.legend(frameon=False)
plt.xlabel("$m_{bjj}$ [GeV]")
plt.title("Jet energy variations");
# %% [markdown]
# ### Save histograms to disk
#
# We'll save everything to disk for subsequent usage.
# This also builds pseudo-data by combining events from the various simulation setups we have processed.
# %% tags=[]
utils.save_histograms(all_histograms, fileset, "histograms.root")
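# %% [markdown]
# If you want to peek into the file that was just written, a quick check with `uproot` (already imported above) could
# look like the following; the exact histogram names depend on `utils.save_histograms`.
# %% tags=[]
with uproot.open("histograms.root") as f:
    print(f.keys()[:10])  # first few histograms stored in the file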
# %% [markdown]
# ### Statistical inference
#
# We are going to perform a re-binning for the statistical inference.
# This is planned to be conveniently provided via `cabinetry` (see [cabinetry#412](https://github.com/scikit-hep/cabinetry/issues/412)), but in the meantime we can achieve this via [template building overrides](https://cabinetry.readthedocs.io/en/latest/advanced.html#overrides-for-template-building).
# The implementation is provided in a function in `utils/`.
#
# A statistical model has been defined in `cabinetry_config.yml`, ready to be used with our output.
# We will use `cabinetry` to combine all histograms into a `pyhf` workspace and fit the resulting statistical model to the pseudodata we built.
# %% tags=[]
config = cabinetry.configuration.load("cabinetry_config.yml")
# rebinning: lower edge 110 GeV, merge bins 2->1
rebinning_router = utils.get_cabinetry_rebinning_router(config, rebinning=slice(110j, None, hist.rebin(2)))
cabinetry.templates.build(config, router=rebinning_router)
cabinetry.templates.postprocess(config) # optional post-processing (e.g. smoothing)
ws = cabinetry.workspace.build(config)
cabinetry.workspace.save(ws, "workspace.json")
# %% [markdown]
# We can inspect the workspace with `pyhf`, or use `pyhf` to perform inference.
# %% tags=[]
# !pyhf inspect workspace.json | head -n 20
# %% [markdown]
# Let's try out what we built: the next cell will perform a maximum likelihood fit of our statistical model to the pseudodata we built.
# %% tags=[]
model, data = cabinetry.model_utils.model_and_data(ws)
fit_results = cabinetry.fit.fit(model, data)
cabinetry.visualize.pulls(
    fit_results, exclude="ttbar_norm", close_figure=True, save_figure=False
)
# %% [markdown]
# For this pseudodata, what is the resulting ttbar cross-section divided by the Standard Model prediction?
# %% tags=[]
poi_index = model.config.poi_index
print(f"\nfit result for ttbar_norm: {fit_results.bestfit[poi_index]:.3f} +/- {fit_results.uncertainty[poi_index]:.3f}")
# %% [markdown]
# Let's also visualize the model before and after the fit, in both the regions we are using.
# The binning here corresponds to the binning used for the fit.
# %% tags=[]
model_prediction = cabinetry.model_utils.prediction(model)
figs = cabinetry.visualize.data_mc(model_prediction, data, close_figure=True, config=config)
figs[0]["figure"]
# %% tags=[]
figs[1]["figure"]
# %% [markdown]
# The post-fit model, shown below, agrees with the data very well.
# %% tags=[]
model_prediction_postfit = cabinetry.model_utils.prediction(model, fit_results=fit_results)
figs = cabinetry.visualize.data_mc(model_prediction_postfit, data, close_figure=True, config=config)
figs[0]["figure"]
# %% tags=[]
figs[1]["figure"]
# %% [markdown]
# ### What is next?
#
# Our next goals for this pipeline demonstration are:
# - making this analysis even **more feature-complete**,
# - **addressing performance bottlenecks** revealed by this demonstrator,
# - **collaborating** with you!
#
# Please do not hesitate to get in touch if you would like to join the effort, or are interested in re-implementing (pieces of) the pipeline with different tools!
#
# Sign up for our mailing list via the [Google group](https://groups.google.com/a/iris-hep.org/g/analysis-grand-challenge).