[WIP] Reader proposal for the commondata proposal #1526

scarlehoff · 2022-02-17T16:40:40Z

In the root of the repository there is a jupyter notebook that should be able to work with this branch of the code.

I made two small corrections to the proposal yaml/json files (I commented about them in the other PR). The only actual change is the variants of course.

What I did:

Added a KinematicsSpec and UncertaintiesSpec. These two are loaded inside the commondata file when the user asks explicitly for kinematics or uncertainties since they might be changed by the metadata.

At the moment CommonDataSpec takes as input the datafile (as before) but in the light of the discussion today maybe the loader should only read up the metadata (and the information of dataset_input) and everything else done afterwards.

dataset_input gains a new key: variant

The CommonData coming from libNNPDF has ben removed. The new format I think might please @Zaharid since now not only nothing gets loaded until the user explicitly ask for it but if the user only needs, say, the data (no uncertainties), the kinematics or the uncertainties, that's the only thing that will get loaded. Instead, in the current version, CommonData had to load everything.

Note: the "new loader" already works by setting the environment path BUILDMASTER_PATH to wherever the commondata folders are, since at the moment they are not installed.

validphys2/src/validphys/core.py

Zaharid · 2022-02-17T17:06:46Z

At the moment CommonDataSpec takes as input the datafile (as before) but in the light of the discussion today maybe the loader should only read up the metadata (and the information of dataset_input) and everything else done afterwards.

I think that would be good.

Zaharid · 2022-02-17T17:43:08Z

One benefit of the python implementation of commondata was that there was an index for everything and you could be sure that point n was the nth point in the order it was implemented regardless of cuts and the like.

I wonder if we could have something like that here. Conceptually the index is the kinematics file and it is "required" in order to load either uncertainties or central values, at least insofar we want to make sense of what the data is.

As it is now, there is not much stopping these things from being mismatched.

scarlehoff · 2022-02-17T18:29:15Z

One benefit of the python implementation of commondata was that there was an index for everything and you could be sure that point n was the nth point in the order it was implemented regardless of cuts and the like.

What do you mean by this? (do you mean C++ and not python?)

The data is the raw data but we can load the data as a indexed dataframe instead of an array (so if later on it is cut the index is "kept"). Is this what you mean?

I wonder if we could have something like that here.

Yes, we can. To first approximation everything is possible here (and easy to shift around). That first commit is a minimal thing able to read up, but your imagination is the limit.

I wonder whether we should open some collab with the jupyter notebook...

scarlehoff · 2022-02-18T08:47:03Z

Ah, I understand what you meant now (sorry). I was confused because the python CommonData class is a different stage from the libNNPDF CommonData (which is the one substituted here).
I'll update this PR using that class and move kinematics and uncertainties also to coredata.py.

scarlehoff · 2022-02-21T14:26:01Z

I've updated everything taking into account some of the discussion in this thread. It's quite different from the first version.

As @Zaharid suggested, now everything held is a DataFrame and the only non-derived quantities are indeed the original DataFrames for Kinematics, Uncertainties and Data. This facilitates thing as one doesn't need to keep track of the number of points but rather look explicitly at how many are left after the cuts (for instance).

The uncertainties are not fully working since the proposal is going to change a bit but the base is already there. Please have a look at the jupyter notebook, the output at the end of the day is the same coredata.CommonData of current validphys, the only difference is that now Kinematics and Uncertainties live in their own instances of fancy dataframes.

The loading and parsing I am currently doing in CommonDataSpec. It is very self-contained so after we have converged it can be copied into the appropiate functions inside commondataparser.py.

scarlehoff · 2022-03-17T16:57:04Z

Reader updated with the latest changes by @enocera
It's much cleaner now with the yaml file (while the result are equivalent to the previous ones)

scarlehoff · 2022-03-29T09:28:41Z

There are two parts in this PR:

The reading of the data, that's all in core.py and it looks quite complicated, probably because I don't know well enough how to create pandas objects. I'm not very happy with this part.
The validphys python object for uncertainties, kinematics and commondata. That's all in coredata.py. I'm much happier with this part. They are all dataframes and always treated as dataframes. Some of them were already there in the past, some substitute the C++ version.

In all honesty, I think (1) could be moved to the module that generates the commondata, so that we have a module (call it buildmaster) that generates the data and knows how to read the data. We only define the dataframe the data has to be coerced into since this we know won't change (the data will have a central value, it will have kinematic variables, the uncertainties can be systematic or statistical, the treatment can be additive or multiplicative, etc).
@alecandido this goes in the direction of what we were talking about yesterday, it only depends on pandas so one could even probably convince experimentalists that using the format is reasonable)

(2) instead will always stay in vp, they are objects that take dataframe and then manipulate them in a way that makes our life easier (extract additive uncertainties, treat multiplicative errors, extract the central value, etc).

Zaharid · 2022-03-29T17:54:42Z

validphys2/src/validphys/core.py


-    return CommonDataMetadata(name, int(nsys_str), int(ndata_str),
-                              get_kinlabel_key(process_type_str))
+    def __init__(self, name, variant, metadata_file=None):


Doesn't look like None is a valid default for metadata_file if you are trying to open it next.

Yeah, that part is broken. I forgot I had broken it though. Still, related to your other point. I agree, the CommonData is (should be) fully defined by its metadata file and variant, no need for the name.

validphys2/src/validphys/core.py

Zaharid · 2022-03-29T18:05:10Z

validphys2/src/validphys/core.py


    @property
    def ndata(self):
-        return self.metadata.ndata
+        """Return the number of datapoints"""
+        return len(self.data)


We should not assume we have (or load) data for this.

then we need to add back the nunber of datapoints to the metadata (I think it was removed in some iteration)

validphys2/src/validphys/core.py

Zaharid · 2022-03-29T18:09:13Z

validphys2/src/validphys/core.py

-            self._metadata = peek_commondata_metadata(self.datafile)
-        return self._metadata
+        """Return the number of systematic uncertainties"""
+        return self.metadata.get("proctype")


I guess we need to discuss what to do with process types, as it is a bit ambiguous. On one hand there is "the string that identifies what kinematics mean" (which is roughly what we have in the current commondata files). We may or may not need that with more extensive metadata. Then there are random groupings that people pull out of their hat (e.g. the process type for thcovmat), which we do need to store somewhere.

cc @enocera

validphys2/src/validphys/core.py

validphys2/src/validphys/coredata.py

…ecks, cached_property

scarlehoff assigned Zaharid and enocera Feb 17, 2022

Zaharid reviewed Feb 17, 2022

View reviewed changes

validphys2/src/validphys/core.py Outdated Show resolved Hide resolved

scarlehoff added 3 commits March 17, 2022 13:59

initial commit

cf61a30

revised proposal

a090a55

update reader

f04f7da

scarlehoff force-pushed the NEW_COMMONDATA_READER branch from 147b54f to f04f7da Compare March 17, 2022 16:23

scarlehoff added 2 commits March 28, 2022 21:41

beautify uncertainties dataframe

23ec2e0

ensure the return types are always pandas dataframes

7ceadf8