Make validate_histos.py from the IRIS-HEP repo pass on RDF's output histograms #28

eguiraud · 2023-07-19T16:29:17Z

No description provided.

eguiraud · 2023-07-19T16:31:44Z

@andriiknu let's use this issue to track the state of the agreement between the histograms produced by the RDF implementation and the reference implementation. At the moment we are implementing v1 so I guess the references are the ones here: https://github.com/iris-hep/analysis-grand-challenge/tree/v1.1.0/analyses/cms-open-data-ttbar/reference (tag v1.1.0 of the reference implementation).

Can you summarize what the current state is?

eguiraud · 2023-07-19T16:33:43Z

I guess the important underlying question is: do you think the current agreement (and the current state of the implementation) is good enough that we can tag a v1?

eguiraud · 2023-07-19T19:17:49Z

I just checked for the current version of main (6c7e045).

It turns out that the only mismatches are caused by:

bin migrations, which the latest version of the validate_histograms.py script from the reference implementation handles automatically
a problem with the reference implementation: it artificially adds 1e-6 to all bin contents. This will be fixed there, see the comment on "Port histogram re-binning to main" (iris-hep/analysis-grand-challenge#166)
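
To illustrate the second point, here is a minimal sketch of why a uniform 1e-6 offset breaks an exact bin-by-bin comparison but not a tolerance-based one. The function name `bins_match`, the tolerance value and the bin contents are made up for the example; this is not how validate_histograms.py is implemented, and it does not cover bin migrations.

```python
import numpy as np


def bins_match(ours, reference, atol=2e-6):
    """Compare bin contents with an absolute tolerance (hypothetical helper)."""
    # rtol=0 so that only the absolute difference per bin matters
    return bool(np.allclose(ours, reference, rtol=0.0, atol=atol))


# Made-up bin contents; the "reference" carries the 1e-6 offset in every bin
ours = np.array([10.0, 25.0, 7.0, 0.0])
reference = ours + 1e-6

exact_match = bool(np.array_equal(ours, reference))  # fails because of the offset
tolerant_match = bins_match(ours, reference)          # passes within the tolerance
print(exact_match, tolerant_match)
```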

The IRIS-HEP implementation should tag a v1.2 version soon that solves these issues. Then our outputs should pass the validation script, as far as I can tell. I'll leave this open until then.

One question remaining (for @andriiknu and @alexander-held): the RDF implementation does not produce histograms 4j1b_pseudodata and 4j2b_pseudodata: should it?

alexander-held · 2023-07-19T19:36:59Z

> One question remaining (for @andriiknu and @alexander-held): the RDF implementation does not produce histograms 4j1b_pseudodata and 4j2b_pseudodata: should it?

I would say we do not explicitly require those histograms to be written out in the analysis task. They are needed for the statistical inference stage, and https://agc.readthedocs.io/en/latest/taskbackground.html#statistical-model describes how to obtain them, but they are built from other histograms (which by themselves are required). In that sense I would argue that it is left up to the implementation to decide whether to write out dedicated histograms (and then use those for the dataset to fit the statistical model to) or to build that dataset on-the-fly from the required histograms at a later stage.

andriiknu · 2023-07-19T20:12:17Z

@eguiraud, @alexander-held
The Jet_btag inconsistency in the selection cuts can be summarized as follows:

The 4j1b region is selected with a non-strict condition (btag >= 0.5) in the selection cut (main implementation, v1.1.0 implementation)
The 4j2b region and the trijet selection use strict conditions (btag > 0.5 and trijet_btag > 0.5) as selection cuts
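
A small sketch of where the two conditions differ, assuming made-up b-tag values: the number of b-tagged jets only changes for jets whose discriminant is exactly 0.5. The helper `count_btagged` is hypothetical and not taken from either implementation.

```python
import numpy as np

BTAG_THRESHOLD = 0.5


def count_btagged(jet_btag, strict):
    """Count jets passing the b-tag cut (hypothetical helper for illustration)."""
    jet_btag = np.asarray(jet_btag)
    if strict:
        passed = jet_btag > BTAG_THRESHOLD   # strict: btag > 0.5
    else:
        passed = jet_btag >= BTAG_THRESHOLD  # non-strict: btag >= 0.5
    return int(np.count_nonzero(passed))


# Made-up event with one jet sitting exactly on the threshold
event_btag = [0.9, 0.5, 0.2, 0.1]

n_strict = count_btagged(event_btag, strict=True)
n_non_strict = count_btagged(event_btag, strict=False)
print(n_strict, n_non_strict)
```

With mixed conditions, an event like this one is counted with a different number of b-tagged jets depending on which cut is applied to it.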

eguiraud · 2023-07-19T20:18:42Z

Thank you @andriiknu , the reference implementation will soon release a v1.2 tag which will use > everywhere. This is tracked in iris-hep/analysis-grand-challenge#174 now

EDIT: and we will follow suit right after

eguiraud · 2023-07-19T20:24:25Z

@alexander-held then should validate_histograms.py ignore 4j1b_pseudodata and 4j2b_pseudodata? Currently it complains if they are missing.

alexander-held · 2023-07-19T20:26:31Z

Good point. It does no harm to have them in the reference files, but we should just skip that comparison if the histograms were not produced by a given implementation.
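
Roughly, the behaviour could look like the sketch below. It assumes histograms are plain dictionaries mapping names to lists of bin contents; the function `compare_histograms`, the `OPTIONAL_HISTOGRAMS` tuple and the `4j1b_ttbar` name are hypothetical and do not reflect the actual validate_histograms.py code.

```python
# Histograms that an implementation may legitimately not write out
OPTIONAL_HISTOGRAMS = ("4j1b_pseudodata", "4j2b_pseudodata")


def compare_histograms(produced, reference, optional=OPTIONAL_HISTOGRAMS):
    """Return (errors, skipped) for a produced-vs-reference comparison."""
    errors, skipped = [], []
    for name, ref_bins in reference.items():
        if name not in produced:
            if name in optional:
                skipped.append(name)  # not produced: skip instead of failing
            else:
                errors.append(f"missing histogram: {name}")
            continue
        if list(produced[name]) != list(ref_bins):
            errors.append(f"bin contents differ: {name}")
    return errors, skipped


# Made-up inputs: the produced file has no pseudodata histogram
reference = {"4j1b_ttbar": [1.0, 2.0], "4j1b_pseudodata": [3.0, 4.0]}
produced = {"4j1b_ttbar": [1.0, 2.0]}

errors, skipped = compare_histograms(produced, reference)
print(errors, skipped)
```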

eguiraud · 2023-07-19T20:38:08Z

Cool, I'll open a PR in a few minutes

EDIT: iris-hep/analysis-grand-challenge#177

eguiraud · 2023-07-19T21:05:12Z

Alright, I think the little mismatches we currently have are all understood.

I think we'll just wait for the reference implementation to tag a v1.2, adapt to that (e.g. making the btag cuts more uniform) and we'll tag a v1.0 of the RDF implementation!

If v1.2 also removes the 1e-6 offsetting of the histogram bin contents mentioned in the comment at iris-hep/analysis-grand-challenge#166, we should have perfect agreement between our output histograms. Otherwise that's the one remaining snag and we'll sync up in v2.

eguiraud · 2023-07-20T15:48:29Z

This is fixed: current main (6557e89) passes validation against v1.2 of the reference implementation.

eguiraud assigned andriiknu Jul 19, 2023

eguiraud assigned eguiraud and unassigned andriiknu Jul 20, 2023

eguiraud closed this as completed Jul 20, 2023
