-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Reader proposal for the commondata proposal #1526
Conversation
I think that would be good. |
One benefit of the python implementation of commondata was that there was an index for everything and you could be sure that point n was the nth point in the order it was implemented regardless of cuts and the like. I wonder if we could have something like that here. Conceptually the index is the kinematics file and it is "required" in order to load either uncertainties or central values, at least insofar we want to make sense of what the data is. As it is now, there is not much stopping these things from being mismatched. |
What do you mean by this? (do you mean C++ and not python?) The data is the raw data but we can load the data as a indexed dataframe instead of an array (so if later on it is cut the index is "kept"). Is this what you mean?
Yes, we can. To first approximation everything is possible here (and easy to shift around). That first commit is a minimal thing able to read up, but your imagination is the limit. I wonder whether we should open some collab with the jupyter notebook... |
Ah, I understand what you meant now (sorry). I was confused because the python |
I've updated everything taking into account some of the discussion in this thread. It's quite different from the first version. As @Zaharid suggested, now everything held is a The uncertainties are not fully working since the proposal is going to change a bit but the base is already there. Please have a look at the jupyter notebook, the output at the end of the day is the same The loading and parsing I am currently doing in |
147b54f
to
f04f7da
Compare
Reader updated with the latest changes by @enocera |
There are two parts in this PR:
In all honesty, I think (1) could be moved to the module that generates the commondata, so that we have a module (call it buildmaster) that generates the data and knows how to read the data. We only define the dataframe the data has to be coerced into since this we know won't change (the data will have a central value, it will have kinematic variables, the uncertainties can be systematic or statistical, the treatment can be additive or multiplicative, etc). (2) instead will always stay in vp, they are objects that take dataframe and then manipulate them in a way that makes our life easier (extract additive uncertainties, treat multiplicative errors, extract the central value, etc). |
validphys2/src/validphys/core.py
Outdated
|
||
return CommonDataMetadata(name, int(nsys_str), int(ndata_str), | ||
get_kinlabel_key(process_type_str)) | ||
def __init__(self, name, variant, metadata_file=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't look like None
is a valid default for metadata_file
if you are trying to open it next.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that part is broken. I forgot I had broken it though. Still, related to your other point. I agree, the CommonData
is (should be) fully defined by its metadata file and variant, no need for the name.
|
||
@property | ||
def ndata(self): | ||
return self.metadata.ndata | ||
"""Return the number of datapoints""" | ||
return len(self.data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not assume we have (or load) data
for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then we need to add back the nunber of datapoints to the metadata (I think it was removed in some iteration)
self._metadata = peek_commondata_metadata(self.datafile) | ||
return self._metadata | ||
"""Return the number of systematic uncertainties""" | ||
return self.metadata.get("proctype") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we need to discuss what to do with process types, as it is a bit ambiguous. On one hand there is "the string that identifies what kinematics mean" (which is roughly what we have in the current commondata files). We may or may not need that with more extensive metadata. Then there are random groupings that people pull out of their hat (e.g. the process type for thcovmat), which we do need to store somewhere.
cc @enocera
…ecks, cached_property
In the root of the repository there is a jupyter notebook that should be able to work with this branch of the code.
I made two small corrections to the proposal yaml/json files (I commented about them in the other PR). The only actual change is the variants of course.
What I did:
Added a
KinematicsSpec
andUncertaintiesSpec
. These two are loaded inside the commondata file when the user asks explicitly for kinematics or uncertainties since they might be changed by the metadata.At the moment
CommonDataSpec
takes as input the datafile (as before) but in the light of the discussion today maybe the loader should only read up the metadata (and the information ofdataset_input
) and everything else done afterwards.dataset_input
gains a new key:variant
The
CommonData
coming fromlibNNPDF
has ben removed. The new format I think might please @Zaharid since now not only nothing gets loaded until the user explicitly ask for it but if the user only needs, say, the data (no uncertainties), the kinematics or the uncertainties, that's the only thing that will get loaded. Instead, in the current version,CommonData
had to load everything.Note: the "new loader" already works by setting the environment path
BUILDMASTER_PATH
to wherever the commondata folders are, since at the moment they are not installed.