Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make validate_histos.py from the IRIS-HEP repo pass on RDF's output histograms #28

Closed
eguiraud opened this issue Jul 19, 2023 · 11 comments
Assignees

Comments

@eguiraud
Copy link
Member

No description provided.

@eguiraud
Copy link
Member Author

@andriiknu let's use this issue to track the state of the agreement between the histograms produced by the RDF implementation and the reference implementation. At the moment we are implementing v1 so I guess the references are the ones here: https://github.com/iris-hep/analysis-grand-challenge/tree/v1.1.0/analyses/cms-open-data-ttbar/reference (tag v1.1.0 of the reference implementation).

Can you summarize what the current state is?

@eguiraud
Copy link
Member Author

I guess the important underlying question is: do you think the current agreement (and the current state of the implementation) is good enough that we can tag a v1?

@eguiraud
Copy link
Member Author

I just checked for the current version of main (6c7e045).

It turns out that the only mismatches are caused by:

The IRIS-HEP implementation should tag a v1.2 version soon that solves these issues. Then our outputs should pass the validation script, as far as I can tell. I'll leave this open until then.

One question remaining (for @andriiknu and @alexander-held): the RDF implementation does not produce histograms 4j1b_pseudodata and 4j2b_pseudodata: should it?

@alexander-held
Copy link

One question remaining (for @andriiknu and @alexander-held): the RDF implementation does not produce histograms 4j1b_pseudodata and 4j2b_pseudodata: should it?

I would say we do not explicitly require those histograms to be written out in the analysis task. They are needed for the statistical inference stage, and https://agc.readthedocs.io/en/latest/taskbackground.html#statistical-model describes how to obtain them, but they are built from other histograms (which by themselves are required). In that sense I would argue that it is left up to the implementation to decide whether to write out dedicated histograms (and then use those for the dataset to fit the statistical model to) or to build that dataset on-the-fly from the required histograms at a later stage.

@andriiknu
Copy link
Contributor

andriiknu commented Jul 19, 2023

@eguiraud , @alexander-held
Jet_btag inconsistency in selection cuts can be summarized via the following:

  1. 4j1b region is produced by NOT strict condition (btag>=0.5) in the selection cut (main implementation, v1.1.0 implementation)
  2. 4j2b region and trijet use strict conditions ( btag>0.5 and trijet_btag>0.5) as a selection cut

@eguiraud
Copy link
Member Author

eguiraud commented Jul 19, 2023

Thank you @andriiknu , the reference implementation will soon release a v1.2 tag which will use > everywhere. This is tracked in iris-hep/analysis-grand-challenge#174 now

EDIT:
and we will follow suit right after

@eguiraud
Copy link
Member Author

eguiraud commented Jul 19, 2023

@alexander-held then should validate_histograms.py ignore 4j1b_pseudodata and 4j2b_pseudodata? currently it complains if they are missing

@alexander-held
Copy link

Good point, it doesn't do harm to have them in the reference files but we should just skip that comparison if the histograms were not produced by a given implementation.

@eguiraud
Copy link
Member Author

eguiraud commented Jul 19, 2023

Cool, I'll open a PR in a few minutes

EDIT: iris-hep/analysis-grand-challenge#177

@eguiraud
Copy link
Member Author

Alright, I think the little mismatches we currently have are all understood.

I think we'll just wait for the reference implementation to tag a v1.2, adapt to that (e.g. making the btag cuts more uniform) and we'll tag a v1.0 of the RDF implementation!

If v1.2 will also remove the 1e-6 offsetting of the histogram bin contents mentioned at iris-hep/analysis-grand-challenge#166 (comment) we should have perfect agreement between our output histograms. Otherwise that's the one remaining snag and we'll sync up in v2.

@eguiraud eguiraud assigned eguiraud and unassigned andriiknu Jul 20, 2023
@eguiraud
Copy link
Member Author

This is fixed: current main (6557e89) passes validation against v1.2 of the reference implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants