Uniquely identifying derivation pathways/provenance for featurization #191

JosephMontoya-TRI · 2019-02-07T01:52:37Z

I have a keen interest in making a featurizer that uses propnet-derived features, but I'm not sure how to create an identifier for every Quantity that captures both its symbol and its evaluation pathway (which I'd want to keep separate to maximize my feature set). I think a provenance could probably be meaningfully hashed, but I'm not sure how to do it off the top of my head.

clegaspi · 2019-02-07T02:02:36Z

As of December or so, every propnet quantity is assigned a unique ID when it is created (a random UUID). It was intended as a bookkeeping mechanism so that we wouldn't have to save the values of quantities in provenance trees, and could instead refer to the quantity object by ID.

These IDs may be sufficiently unique for your featurizer, although they alone do not hold information about provenance.

With the new PR, the hash value of a quantity will take provenance into account, although matching hashes do not guarantee equality because the value itself is not hashed.

montoyjh · 2019-02-07T02:52:18Z

Right, I could certainly distinguish among the quantities generated for a single material using that. What I'm saying is that I want to be able to identify distinct quantities that were derived in exactly the same way for a set of multiple materials, so I can use them as features corresponding to a dataset.

For example, I might get 50 Vickers hardness values per material with the standard MP dataset. If I want to use these as features, I'd like to be able to put them into columns that correspond to "identical" features, which in my mind means features that share the same derivation path.
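To make the column idea concrete, here is a minimal sketch in plain Python. It assumes each derived quantity already comes with some string key that identifies its derivation path; the `to_feature_table` helper, the record layout, and the example keys and material IDs are all hypothetical and not part of propnet.

```python
from collections import defaultdict


def to_feature_table(records):
    """Pivot (material_id, pathway_key, value) records into feature columns.

    Hypothetical helper: one column per distinct pathway key, one row per
    material, None where a material lacks a quantity for that pathway.
    """
    table = defaultdict(dict)
    for material_id, pathway_key, value in records:
        table[material_id][pathway_key] = value

    # Sort the keys so the column order is the same on every run.
    columns = sorted({key for row in table.values() for key in row})
    rows = {
        material_id: [row.get(column) for column in columns]
        for material_id, row in table.items()
    }
    return columns, rows


# Made-up example data: two derivation paths for the same symbol.
records = [
    ("mat-a", "hardness<-path-1", 1.0),
    ("mat-a", "hardness<-path-2", 2.0),
    ("mat-b", "hardness<-path-1", 3.0),
]
columns, rows = to_feature_table(records)
```

Here both materials share the `hardness<-path-1` column, and `mat-b` gets a missing value for the path it could not be derived through.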

clegaspi · 2019-02-07T03:49:00Z

Oh, I see what you're getting at. Hmm, yeah it's not immediately obvious to me how to do that either. I imagine you'd have to hash the whole model tree in some deterministic way.
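A rough sketch of what hashing the tree deterministically could look like, using a Merkle-style recursive hash. The `ProvenanceNode` class and `pathway_hash` function are hypothetical stand-ins, not propnet's actual provenance objects, and the sketch assumes that the order of a model's inputs does not matter (it sorts the child hashes). If a model took two inputs of the same symbol in different roles, that assumption would need revisiting.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ProvenanceNode:
    """Hypothetical stand-in for one quantity in a provenance tree."""
    symbol: str                              # name of the symbol produced
    model: str = ""                          # model that produced it; "" for raw inputs
    inputs: Tuple["ProvenanceNode", ...] = ()


def pathway_hash(node: ProvenanceNode) -> str:
    """Hash the symbol and derivation pathway, ignoring values entirely."""
    # Sorting child hashes makes the result independent of input order (assumption).
    child_hashes = sorted(pathway_hash(child) for child in node.inputs)
    # JSON encoding keeps field boundaries unambiguous.
    payload = json.dumps([node.symbol, node.model, child_hashes])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Made-up symbols and model names, for illustration only.
bulk = ProvenanceNode("bulk_modulus")
shear = ProvenanceNode("shear_modulus")
h1 = ProvenanceNode("hardness", "model_x", (bulk, shear))
h2 = ProvenanceNode("hardness", "model_x", (shear, bulk))
h3 = ProvenanceNode("hardness", "model_y", (bulk, shear))
```

Because no value or material-specific ID enters the payload, two quantities derived the same way for different materials get the same hash, while a different model anywhere in the tree changes it.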

JosephMontoya-TRI · 2019-02-07T03:53:28Z

Yeah, that's what I was thinking too. It might be an interesting idea to do that for other reasons as well. For example, graph evaluation might be much cheaper if you could "cache" the action of the graph for datasets that are isomorphic, which I think might be easier than redoing the logic of graph evaluation every time.

clegaspi · 2019-02-07T18:01:00Z

@dmrdjenovich Do you have any thoughts about this? Since you were just working with tree traversal.
