Uniquely identifying derivation pathways/provenance for featurization #191

Open
JosephMontoya-TRI opened this issue Feb 7, 2019 · 5 comments

Comments

@JosephMontoya-TRI

I have a keen interest in making a featurizer that uses propnet-derived features, but I'm not sure how to create an identifier for every Quantity that captures both its symbol and its evaluation pathway (I'd want to keep distinct pathways separate to maximize my feature set). I think a provenance could probably be meaningfully hashed, but I'm not sure how to do it off the top of my head.

@clegaspi
Contributor

clegaspi commented Feb 7, 2019

As of December or so, every propnet quantity is assigned a unique ID when it is created (a random UUID). It was intended as a bookkeeping mechanism, so that provenance trees could refer to a quantity object by ID instead of storing its value.

These IDs may be sufficiently unique for your featurizer, although they alone do not hold information about provenance.

With the new PR, the hash value of a quantity will take provenance into account, although two quantities with the same hash are not guaranteed to be equal, because the value itself is not hashed.
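To illustrate the distinction between a random ID, a provenance-aware hash, and equality, here is a toy sketch. `ToyQuantity` and its fields are hypothetical and are not propnet's actual classes or attributes; the equality rule shown (symbol, provenance, and value must all match) is an assumption made for the example.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(eq=False)
class ToyQuantity:
    """Hypothetical stand-in for a quantity; NOT propnet's real class."""
    symbol: str
    value: float
    provenance: tuple = ()  # assumed: a hashable description of the derivation
    internal_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def __hash__(self):
        # The value and the random ID are deliberately left out of the hash.
        return hash((self.symbol, self.provenance))

    def __eq__(self, other):
        if not isinstance(other, ToyQuantity):
            return NotImplemented
        # Equality additionally compares the value (an assumption for this toy).
        return (self.symbol, self.provenance, self.value) == (
            other.symbol, other.provenance, other.value)


q1 = ToyQuantity("hardness", 12.1, ("model_a",))
q2 = ToyQuantity("hardness", 30.7, ("model_a",))
print(hash(q1) == hash(q2), q1 == q2, q1.internal_id == q2.internal_id)
```

In this toy, the two quantities share a hash because they share a symbol and provenance, yet they are unequal and carry different random IDs.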

@montoyjh
Collaborator

montoyjh commented Feb 7, 2019

Right, I could certainly distinguish among the quantities generated for a single material using that. What I'm saying is that I want to be able to identify distinct quantities that were derived in exactly the same way for a set of multiple materials, so I can use them as features corresponding to a dataset.

For example, I might get 50 Vickers hardness values per material with the standard MP dataset. If I want to use these as features, I'd like to be able to put them into columns that correspond to "identical" features, which in my mind means values that share the same derivation path.
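As a rough sketch of the column layout described here: assuming each derived value already carries some string key that is identical for identically derived quantities, the values can be pivoted into one column per key. The record format, the key strings, and `build_feature_table` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical records: (material_id, pathway_key, value). The pathway_key is
# assumed to be identical whenever two quantities were derived the same way.
records = [
    ("mat-1", "hardness:path-A", 12.1),
    ("mat-1", "hardness:path-B", 10.4),
    ("mat-2", "hardness:path-A", 30.7),
]


def build_feature_table(records):
    """Return (sorted column keys, {material_id: row}); missing features are None."""
    by_material = defaultdict(dict)
    for material_id, key, value in records:
        by_material[material_id][key] = value
    columns = sorted({key for _, key, _ in records})
    table = {
        material_id: [features.get(column) for column in columns]
        for material_id, features in by_material.items()
    }
    return columns, table


columns, table = build_feature_table(records)
print(columns)
print(table)
```

A material that lacks a given pathway simply gets `None` in that column, which leaves the choice of imputation or dropping to the downstream model.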

@clegaspi
Contributor

clegaspi commented Feb 7, 2019

Oh, I see what you're getting at. Hmm, yeah it's not immediately obvious to me how to do that either. I imagine you'd have to hash the whole model tree in some deterministic way.
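A minimal sketch of what hashing the tree deterministically could look like, assuming provenance can be reduced to a symbol name, the name of the model that produced it, and the provenance of its inputs. `ProvNode` and `pathway_key` are hypothetical names and do not reflect propnet's provenance classes. Values and random IDs are left out on purpose, so the key depends only on how the quantity was derived.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ProvNode:
    """Hypothetical provenance node; NOT propnet's real data structure."""
    symbol: str                            # e.g. "hardness"
    model: str = ""                        # producing model; "" for raw inputs
    inputs: Tuple["ProvNode", ...] = ()    # provenance of each input quantity


def pathway_key(node: ProvNode) -> str:
    """Hash symbol + model + input pathways, ignoring values and IDs."""
    # Sort the child digests so the key does not depend on input ordering.
    child_keys = sorted(pathway_key(child) for child in node.inputs)
    # JSON-encode the parts to avoid ambiguity from separator characters.
    payload = json.dumps([node.symbol, node.model, child_keys])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


bulk, shear = ProvNode("bulk_modulus"), ProvNode("shear_modulus")
via_a = ProvNode("hardness", "model_a", (bulk, shear))
via_b = ProvNode("hardness", "model_b", (bulk, shear))
print(pathway_key(via_a) == pathway_key(via_b))
```

Sorting the child digests treats a model's inputs as an unordered set; if input order carries meaning for a model, the sort would need to be dropped.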

@JosephMontoya-TRI
Author

JosephMontoya-TRI commented Feb 7, 2019

Yeah, that's what I was thinking too. It might be an interesting idea to do that for other reasons as well. For example, graph evaluation might be really facile if you could "cache" the action of the graph for datasets that are isomorphic, which I think might be easier than doing the logic of graph evaluation every time.

@clegaspi
Contributor

clegaspi commented Feb 7, 2019

@dmrdjenovich Do you have any thoughts about this, since you were just working with tree traversal?
