Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Reader proposal for the commondata proposal #1526

Closed
wants to merge 10 commits into from

Conversation

scarlehoff
Copy link
Member

In the root of the repository there is a jupyter notebook that should be able to work with this branch of the code.

I made two small corrections to the proposal yaml/json files (I commented about them in the other PR). The only actual change is the variants of course.

What I did:

Added a KinematicsSpec and UncertaintiesSpec. These two are loaded inside the commondata file when the user asks explicitly for kinematics or uncertainties since they might be changed by the metadata.

At the moment CommonDataSpec takes as input the datafile (as before) but in the light of the discussion today maybe the loader should only read up the metadata (and the information of dataset_input) and everything else done afterwards.

dataset_input gains a new key: variant

The CommonData coming from libNNPDF has ben removed. The new format I think might please @Zaharid since now not only nothing gets loaded until the user explicitly ask for it but if the user only needs, say, the data (no uncertainties), the kinematics or the uncertainties, that's the only thing that will get loaded. Instead, in the current version, CommonData had to load everything.

Note: the "new loader" already works by setting the environment path BUILDMASTER_PATH to wherever the commondata folders are, since at the moment they are not installed.

@Zaharid
Copy link
Contributor

Zaharid commented Feb 17, 2022

At the moment CommonDataSpec takes as input the datafile (as before) but in the light of the discussion today maybe the loader should only read up the metadata (and the information of dataset_input) and everything else done afterwards.

I think that would be good.

@Zaharid
Copy link
Contributor

Zaharid commented Feb 17, 2022

One benefit of the python implementation of commondata was that there was an index for everything and you could be sure that point n was the nth point in the order it was implemented regardless of cuts and the like.

I wonder if we could have something like that here. Conceptually the index is the kinematics file and it is "required" in order to load either uncertainties or central values, at least insofar we want to make sense of what the data is.

As it is now, there is not much stopping these things from being mismatched.

@scarlehoff
Copy link
Member Author

One benefit of the python implementation of commondata was that there was an index for everything and you could be sure that point n was the nth point in the order it was implemented regardless of cuts and the like.

What do you mean by this? (do you mean C++ and not python?)

The data is the raw data but we can load the data as a indexed dataframe instead of an array (so if later on it is cut the index is "kept"). Is this what you mean?

I wonder if we could have something like that here.

Yes, we can. To first approximation everything is possible here (and easy to shift around). That first commit is a minimal thing able to read up, but your imagination is the limit.

I wonder whether we should open some collab with the jupyter notebook...

@scarlehoff
Copy link
Member Author

Ah, I understand what you meant now (sorry). I was confused because the python CommonData class is a different stage from the libNNPDF CommonData (which is the one substituted here).
I'll update this PR using that class and move kinematics and uncertainties also to coredata.py.

@scarlehoff
Copy link
Member Author

I've updated everything taking into account some of the discussion in this thread. It's quite different from the first version.

As @Zaharid suggested, now everything held is a DataFrame and the only non-derived quantities are indeed the original DataFrames for Kinematics, Uncertainties and Data. This facilitates thing as one doesn't need to keep track of the number of points but rather look explicitly at how many are left after the cuts (for instance).

The uncertainties are not fully working since the proposal is going to change a bit but the base is already there. Please have a look at the jupyter notebook, the output at the end of the day is the same coredata.CommonData of current validphys, the only difference is that now Kinematics and Uncertainties live in their own instances of fancy dataframes.

The loading and parsing I am currently doing in CommonDataSpec. It is very self-contained so after we have converged it can be copied into the appropiate functions inside commondataparser.py.

@scarlehoff scarlehoff force-pushed the NEW_COMMONDATA_READER branch from 147b54f to f04f7da Compare March 17, 2022 16:23
@scarlehoff
Copy link
Member Author

Reader updated with the latest changes by @enocera
It's much cleaner now with the yaml file (while the result are equivalent to the previous ones)

@scarlehoff
Copy link
Member Author

There are two parts in this PR:

  1. The reading of the data, that's all in core.py and it looks quite complicated, probably because I don't know well enough how to create pandas objects. I'm not very happy with this part.
  2. The validphys python object for uncertainties, kinematics and commondata. That's all in coredata.py. I'm much happier with this part. They are all dataframes and always treated as dataframes. Some of them were already there in the past, some substitute the C++ version.

In all honesty, I think (1) could be moved to the module that generates the commondata, so that we have a module (call it buildmaster) that generates the data and knows how to read the data. We only define the dataframe the data has to be coerced into since this we know won't change (the data will have a central value, it will have kinematic variables, the uncertainties can be systematic or statistical, the treatment can be additive or multiplicative, etc).
@alecandido this goes in the direction of what we were talking about yesterday, it only depends on pandas so one could even probably convince experimentalists that using the format is reasonable)

(2) instead will always stay in vp, they are objects that take dataframe and then manipulate them in a way that makes our life easier (extract additive uncertainties, treat multiplicative errors, extract the central value, etc).


return CommonDataMetadata(name, int(nsys_str), int(ndata_str),
get_kinlabel_key(process_type_str))
def __init__(self, name, variant, metadata_file=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't look like None is a valid default for metadata_file if you are trying to open it next.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that part is broken. I forgot I had broken it though. Still, related to your other point. I agree, the CommonData is (should be) fully defined by its metadata file and variant, no need for the name.


@property
def ndata(self):
return self.metadata.ndata
"""Return the number of datapoints"""
return len(self.data)
Copy link
Contributor

@Zaharid Zaharid Mar 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not assume we have (or load) data for this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then we need to add back the nunber of datapoints to the metadata (I think it was removed in some iteration)

self._metadata = peek_commondata_metadata(self.datafile)
return self._metadata
"""Return the number of systematic uncertainties"""
return self.metadata.get("proctype")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we need to discuss what to do with process types, as it is a bit ambiguous. On one hand there is "the string that identifies what kinematics mean" (which is roughly what we have in the current commondata files). We may or may not need that with more extensive metadata. Then there are random groupings that people pull out of their hat (e.g. the process type for thcovmat), which we do need to store somewhere.

cc @enocera

@scarlehoff scarlehoff closed this Apr 5, 2022
@scarlehoff scarlehoff deleted the NEW_COMMONDATA_READER branch February 22, 2023 12:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants