General policy for curating shared information for subdivided prefixes #1222

Open
bgyori opened this issue Oct 22, 2024 · 3 comments
bgyori commented Oct 22, 2024

We have run into this issue in a couple of different settings now so we might want to discuss a general approach to the problem of metadata curated across semantic spaces for a given database.

Example: pubchem.

  • When a resource like pubchem is subdivided into multiple semantic spaces, such as pubchem.substance and pubchem.compound, the record for each prefix combines information that is

    • specific to the prefix (e.g., uri_format, example, pattern) with information that is almost always
    • generic for the entire database (e.g., publications, homepage, contact).
  • There is also the question of mappings:

    • some mappings are semantic space-specific (e.g., pubchem.compound's mapping to miriam's pubchem.compound or to n2t's pubchem.compound) while other mappings are
    • generic for the entire database (e.g., pubchem.compound's mapping to fairsharing's FAIRsharing.qt3w7z).
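
To make the distinction concrete, here is a minimal sketch that splits a record into prefix-specific and database-wide fields. The two field sets contain only the examples named above, the `split_record` helper is hypothetical, and the sample record holds placeholder values rather than real registry data.

```python
# Field names taken from the examples above; the sets are illustrative,
# not an exhaustive or agreed-upon classification.
PREFIX_SPECIFIC_FIELDS = {"uri_format", "example", "pattern"}
DATABASE_WIDE_FIELDS = {"publications", "homepage", "contact"}


def split_record(record):
    """Split a record into (prefix-specific, database-wide, unclassified) dicts."""
    specific, shared, other = {}, {}, {}
    for key, value in record.items():
        if key in PREFIX_SPECIFIC_FIELDS:
            specific[key] = value
        elif key in DATABASE_WIDE_FIELDS:
            shared[key] = value
        else:
            # Fields such as mappings need a case-by-case decision.
            other[key] = value
    return specific, shared, other


# Placeholder record, not copied from the registry.
toy_record = {
    "example": "123",
    "pattern": "^\\d+$",
    "homepage": "https://example.org",
    "mappings": {"miriam": "toy.compound"},
}
specific, shared, other = split_record(toy_record)
```

Mappings deliberately land in the unclassified bucket here, since, as noted above, some are semantic space-specific and others apply to the whole database.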

Currently, the data that is generic for the entire database is not propagated in a predictable way into semantic space-specific prefixes. The fact that some data is surfaced via prioritized mappings makes the situation more complicated. For example, we have

  • pubchem.compound listed with 8 publications in total
  • pubchem.substance with only 1 of those 8 publications
  • To further complicate things, 3 of the 8 papers attached to pubchem.compound are specifically about PubChem's BioAssay subset, which has a dedicated prefix (pubchem.bioassay), yet that record refers only to the same 1 publication that pubchem.substance does.

So the main question of this issue is: should this type of data be standardized based on its shared vs subspace-specific status? If so, what should be the general policy for this?

This is relevant for, e.g., #1214 and #1204, as well as many existing records.

cthoyt commented Oct 24, 2024

See the addition to the curation guidelines made in #1217.

bgyori commented Oct 24, 2024

Thanks @cthoyt! Beyond handling this problem for new prefixes (and it is more general than just publications), I think there are important implications for existing prefixes, since the problem is pretty pervasive. I was curious to see empirically what the situation looks like, so I wrote a bit of code to check, using the exported registry so that the effects of external mappings are accounted for:

from collections import defaultdict

import matplotlib.pyplot as plt
import requests

res = requests.get('https://raw.githubusercontent.com/biopragmatics/bioregistry/'
                   'refs/heads/main/exports/registry/registry.json')
res.raise_for_status()  # fail early if the download did not succeed
registry = res.json()

# Organizing prefixes by root
prefixes_by_root = defaultdict(list)
for prefix, data in registry.items():
    prefix_parts = prefix.split('.')
    prefix_root = prefix_parts[0] if len(prefix_parts) == 1 else '.'.join(prefix_parts[:-1])
    prefixes_by_root[prefix_root].append(prefix)

# Quantifying families with shared root
families = {root: prefixes for root, prefixes in prefixes_by_root.items()
            if len(prefixes) > 1}
nprefixes = len(registry)
nfamilies = len(families)
nprefixes_in_families = sum(len(prefixes) for prefixes in families.values())
print(f'Total prefixes: {nprefixes}, out of which {nprefixes_in_families} '
      f'prefixes are in a total of {nfamilies} families.')

# Count families whose root is itself a registered prefix
nrooted = sum(root in prefixes for root, prefixes in families.items())
print(f'Out of {nfamilies} families, {nrooted} have a root, '
      f'the remaining {nfamilies - nrooted} are rootless.')

# Analyzing publications

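# For each publication in a family, compute the fraction of the family's
# prefixes that list it
pub_prevalences = []

key_priority = ['pubmed', 'doi', 'url']


def get_pub_key(pub):
    """Return the first available identifier for a publication, or None."""
    for key in key_priority:
        if key in pub:
            return pub[key]
    return None


for root, prefixes in families.items():
    prefixes_by_pub = defaultdict(set)
    for prefix in prefixes:
        for publication in registry[prefix].get('publications', []):
            pub_key = get_pub_key(publication)
            # Skip publications without any identifier so they are not
            # lumped together under a single None key
            if pub_key is not None:
                prefixes_by_pub[pub_key].add(prefix)
    for pub, prefixes_for_pub in prefixes_by_pub.items():
        pub_prevalence = len(prefixes_for_pub) / len(prefixes)
        pub_prevalences.append(pub_prevalence)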

_ = plt.boxplot(pub_prevalences)

which produces

Total prefixes: 1802, out of which 505 prefixes are in a total of 163 families.
Out of 163 families, 67 have a root, the remaining 96 are rootless.

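[Figure: box plot of pub_prevalences, i.e., for each publication, the fraction of prefixes in its family that list it]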
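
A numeric summary can complement the box plot. The following is a minimal sketch using Python's `statistics` module; the `summarize` helper and the toy values are illustrative only, and in practice the `pub_prevalences` list from the script above would be passed in.

```python
import statistics


def summarize(values):
    """Return min, quartiles, and max for a list of at least two fractions."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


# Toy values for illustration only; pass pub_prevalences in practice.
summary = summarize([0.25, 0.5, 0.5, 1.0])
print(summary)
```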

This means that a large proportion (~28%) of prefixes are in a "family" of this type. Though these choices are often well justified based on the primary focus of a database, there is no clear pattern in terms of which families have a "root" prefix (e.g., we have pfam and pfam.clan in the pfam family but we don't have a pubchem root). Per the box plot above, publications are curated in a pretty ad hoc way for existing prefixes in a family. It's likely that a small number of these publications are prefix-specific but most of them should be shared across the family.

I wonder if it would be worth trying to standardize some of this automatically (with post-hoc quality control/curation) for publications, and potentially for other metadata with a similar status.
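
One possible shape for such a semi-automatic step is sketched below: it takes the union of publications across a family and reports, per prefix, which ones are missing, leaving the final decision to a curator. The function name `suggest_missing_publications`, the matching on `pubmed` only, and the toy data are assumptions for illustration.

```python
def suggest_missing_publications(registry, family):
    """Return {prefix: sorted pubmed IDs listed elsewhere in the family}."""
    listed = {
        prefix: {
            pub["pubmed"]
            for pub in registry[prefix].get("publications", [])
            if "pubmed" in pub
        }
        for prefix in family
    }
    family_pubs = set().union(*listed.values())
    # Only report prefixes that are actually missing something.
    return {
        prefix: sorted(family_pubs - pubs)
        for prefix, pubs in listed.items()
        if family_pubs - pubs
    }


# Invented toy records; identifiers are placeholders.
toy_registry = {
    "db.a": {"publications": [{"pubmed": "1"}, {"pubmed": "2"}]},
    "db.b": {"publications": [{"pubmed": "1"}]},
    "db.c": {},
}
suggestions = suggest_missing_publications(toy_registry, ["db.a", "db.b", "db.c"])
for prefix, missing in suggestions.items():
    print(f"{prefix}: consider adding {missing}")
```

Producing a report rather than editing records directly keeps prefix-specific publications (such as the BioAssay papers mentioned earlier) from being copied everywhere without review.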

cthoyt commented Oct 24, 2024

See also the part_of field, described at https://biopragmatics.github.io/bioregistry/datamodel/#part-of where the instances you described plus additional ones have been curated. You can get a quick overview on https://bioregistry.io/highlights/relations

More generally, I've considered how to share information between related resources (not just publications, but also homepage, contact person, etc.). I'm open to suggestions, but a curation script that copies publications would be a quick and dirty solution.
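
As a rough sketch of what such a script could look like when driven by part_of rather than by dotted prefix names: the code below assumes part_of holds a single parent prefix as a string and that publications are plain dictionaries. Both helper names and the toy data are hypothetical.

```python
from collections import defaultdict


def families_by_part_of(registry):
    """Group prefixes under the parent named in their part_of field."""
    children = defaultdict(list)
    for prefix, record in registry.items():
        parent = record.get("part_of")
        if parent:
            children[parent].append(prefix)
    return dict(children)


def copy_parent_publications(registry):
    """Return {child: publications of the parent that the child lacks}."""
    additions = {}
    for parent, kids in families_by_part_of(registry).items():
        parent_pubs = registry.get(parent, {}).get("publications", [])
        for child in kids:
            have = registry[child].get("publications", [])
            missing = [pub for pub in parent_pubs if pub not in have]
            if missing:
                additions[child] = missing
    return additions


# Invented toy records; identifiers are placeholders.
toy_registry = {
    "db": {"publications": [{"pubmed": "1"}]},
    "db.a": {"part_of": "db"},
    "db.b": {"part_of": "db", "publications": [{"pubmed": "2"}]},
}
additions = copy_parent_publications(toy_registry)
```

The function returns proposed additions instead of mutating the registry, so the output can be reviewed before anything is written back.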
