General policy for curating shared information for subdivided prefixes #1222
Comments
See the addition to the curation guidelines in #1217.
Thanks @cthoyt! In addition to thinking about this problem (which is more general than just publications) for new prefixes, I think there are important implications for existing prefixes, since this problem is pretty pervasive. I was curious to see empirically what the situation looks like, so I wrote a bit of code to check, using the exported registry so that the effects of external mappings are accounted for:

    from collections import defaultdict

    import matplotlib.pyplot as plt
    import requests

    res = requests.get('https://raw.githubusercontent.com/biopragmatics/bioregistry/'
                       'refs/heads/main/exports/registry/registry.json')
    registry = res.json()

    # Organizing prefixes by root: 'a.b.c' has root 'a.b', and 'a' is its own root
    prefixes_by_root = defaultdict(list)
    for prefix, data in registry.items():
        prefix_parts = prefix.split('.')
        prefix_root = prefix_parts[0] if len(prefix_parts) == 1 else '.'.join(prefix_parts[:-1])
        prefixes_by_root[prefix_root].append(prefix)

    # Quantifying families with shared root
    families = {root: prefixes for root, prefixes in prefixes_by_root.items()
                if len(prefixes) > 1}
    nprefixes = len(registry)
    nfamilies = len(families)
    nprefixes_in_families = sum(len(prefixes) for prefixes in families.values())
    print(f'Total prefixes: {nprefixes}, out of which {nprefixes_in_families} '
          f'prefixes are in a total of {nfamilies} families.')
    # A family is "rooted" if the root itself is a registered prefix
    nrooted = sum(1 for root, prefixes in families.items() if root in prefixes)
    print(f'Out of {nfamilies} families, {nrooted} have a root, '
          f'the remaining {nfamilies - nrooted} are rootless.')

    # Analyzing publications
    pub_prevalences = []
    key_priority = ['pubmed', 'doi', 'url']


    def get_pub_key(pub):
        # Identify a publication by the first available key, in priority order
        for key in key_priority:
            if key in pub:
                return pub[key]


    for root, prefixes in families.items():
        prefixes_by_pub = defaultdict(list)
        for prefix in prefixes:
            for publication in registry[prefix].get('publications', []):
                prefixes_by_pub[get_pub_key(publication)].append(prefix)
        # Fraction of the family's prefixes that list each publication
        for pub, prefixes_for_pub in prefixes_by_pub.items():
            pub_prevalence = len(prefixes_for_pub) / len(prefixes)
            pub_prevalences.append(pub_prevalence)

    _ = plt.boxplot(pub_prevalences)

which produces

[Figure: box plot of publication prevalence within prefix families]
This means that a really large proportion (~28%) of prefixes are in a "family" of this type. Though these choices are often well justified based on the primary focus of a database, there is no clear pattern in terms of which families have a "root". I wonder if it would be worth trying to standardize some of this automatically (with post-hoc quality control/curation) for publications, and potentially for other metadata with a similar status.
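As a rough illustration of what the post-hoc quality control step could look like, the sketch below lists, for one family, the publications that appear on some members but not on all of them. The toy registry, the PubMed identifiers, and the function names are placeholders made up for this example, not real Bioregistry data or API.

```python
from collections import defaultdict

# Hypothetical toy registry; the identifiers are placeholders, not real PubMed IDs.
toy_registry = {
    "pubchem.compound": {"publications": [{"pubmed": "111"}, {"pubmed": "222"}]},
    "pubchem.substance": {"publications": [{"pubmed": "111"}]},
    "pubchem.bioassay": {"publications": [{"pubmed": "111"}]},
}


def publication_key(publication):
    """Return the first available identifier, in pubmed > doi > url order."""
    for key in ("pubmed", "doi", "url"):
        if key in publication:
            return publication[key]
    return None


def partially_shared_publications(registry, family):
    """Map each publication key to the family members that lack it.

    Publications already present on every member are left out, so the
    result only lists candidates for manual review.
    """
    holders = defaultdict(set)
    for prefix in family:
        for publication in registry[prefix].get("publications", []):
            key = publication_key(publication)
            if key is not None:
                holders[key].add(prefix)
    members = set(family)
    return {
        key: sorted(members - prefixes)
        for key, prefixes in holders.items()
        if prefixes != members
    }


report = partially_shared_publications(toy_registry, list(toy_registry))
print(report)
```

A curator could then decide, per entry in the report, whether the publication is really about the whole database or only about one subspace.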
More generally, I've considered how to share information between related resources (not just publications, but also homepage, contact person, etc.). I'm open to suggestions, but having a curation script that copies publications is a quick and dirty solution.
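A minimal sketch of such a copying script might look like the following. It works on a plain dictionary shaped like the exported registry and gives every member of a family the union of the family's publications. The function names and the toy data are hypothetical, and the result would still need review, since a copied publication may only be about one subspace.

```python
import copy


def publication_key(publication):
    """Return the first available identifier, in pubmed > doi > url order."""
    for key in ("pubmed", "doi", "url"):
        if key in publication:
            return publication[key]
    return None


def propagate_publications(registry, family):
    """Return a copy of the registry in which every prefix in the family
    lists the union of the family's publications. The input is not modified.
    """
    updated = copy.deepcopy(registry)
    # Collect one record per publication key across the whole family.
    union = {}
    for prefix in family:
        for publication in registry[prefix].get("publications", []):
            key = publication_key(publication)
            if key is not None:
                union.setdefault(key, publication)
    # Append whatever each member is missing.
    for prefix in family:
        existing = updated[prefix].setdefault("publications", [])
        present = {publication_key(p) for p in existing}
        for key, publication in union.items():
            if key not in present:
                existing.append(copy.deepcopy(publication))
    return updated


# Hypothetical toy data; identifiers are placeholders.
toy_registry = {
    "pubchem.compound": {"publications": [{"pubmed": "111"}, {"doi": "10.0000/example"}]},
    "pubchem.substance": {"publications": [{"pubmed": "111"}]},
    "pubchem.bioassay": {},
}
propagated = propagate_publications(toy_registry, list(toy_registry))
for prefix, record in propagated.items():
    print(prefix, [publication_key(p) for p in record["publications"]])
```

Returning a modified copy rather than editing in place keeps the original records available for a diff during curation.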
We have run into this issue in a couple of different settings now so we might want to discuss a general approach to the problem of metadata curated across semantic spaces for a given database.
Example: `pubchem`.

When a resource like `pubchem` is subdivided into multiple semantic spaces including `pubchem.substance` or `pubchem.compound`, these prefixes each contain some combination of information that is subspace-specific (`uri_format`, `example`, `pattern`) and other information that is almost always shared (`publications`, `homepage`, `contact`, etc.).

There is also the question of mappings: some mappings are subspace-specific (e.g., `pubchem.compound`'s mapping to `miriam`'s `pubchem.compound` or to `n2t`'s `pubchem.compound`) while other mappings are shared (e.g., `pubchem.compound`'s mapping to `fairsharing`'s `FAIRsharing.qt3w7z`).

Currently, the data that is generic for the entire database is not propagated in a predictable way into semantic space-specific prefixes. The fact that some data is surfaced via prioritized mappings makes the situation more complicated. For example, we have publications listed under `pubchem.compound` that are specifically about PubChem's BioAssay subset, for which there is a dedicated prefix at `pubchem.bioassay`, but that record again only refers to the same 1 publication that `pubchem.substance` does.

So the main question of this issue is: should this type of data be standardized based on its shared vs subspace-specific status? If so, what should be the general policy for this?
This is relevant for e.g., #1214 and #1204 and many existing records.
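To make the shared vs subspace-specific distinction concrete, here is a small sketch that labels metadata fields within one family by comparing their values across members. It is only a heuristic under stated assumptions: the toy records, the `example.org` URI formats, and the contact name are invented for illustration, and identical values do not prove that a field is conceptually shared.

```python
def classify_fields(registry, family, fields):
    """Label each field by comparing its values across the family.

    'shared'            -> at least two members define it and all agree
    'subspace-specific' -> at least two members define it and values differ
    'undetermined'      -> fewer than two members define it
    """
    labels = {}
    for field in fields:
        values = [registry[prefix][field] for prefix in family if field in registry[prefix]]
        if len(values) < 2:
            labels[field] = "undetermined"
        elif all(value == values[0] for value in values):
            labels[field] = "shared"
        else:
            labels[field] = "subspace-specific"
    return labels


# Hypothetical toy records; values are placeholders, not real registry content.
toy_registry = {
    "pubchem.compound": {
        "homepage": "https://pubchem.ncbi.nlm.nih.gov",
        "example": "100101",
        "uri_format": "https://example.org/compound/$1",
    },
    "pubchem.substance": {
        "homepage": "https://pubchem.ncbi.nlm.nih.gov",
        "example": "100",
        "uri_format": "https://example.org/substance/$1",
        "contact": {"name": "Hypothetical Curator"},
    },
}
labels = classify_fields(
    toy_registry,
    ["pubchem.compound", "pubchem.substance"],
    ["homepage", "example", "uri_format", "contact"],
)
print(labels)
```

Fields labeled `undetermined` are the interesting ones for a policy decision: they are present on only some members, so they are either subspace-specific or shared data that was never propagated.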