General policy for curating shared information for subdivided prefixes #1222

bgyori · 2024-10-22T16:09:14Z

We have run into this issue in a couple of different settings now so we might want to discuss a general approach to the problem of metadata curated across semantic spaces for a given database.

Example: pubchem.

When a resource like pubchem is subdivided into multiple semantic spaces, such as pubchem.substance and pubchem.compound, each of these prefixes contains a mix of information that is
- specific to the prefix (e.g., uri_format, example, pattern), and
- almost always generic for the entire database (e.g., publications, homepage, contact).

There is also the question of mappings:
- some mappings are semantic space-specific (e.g., pubchem.compound's mapping to miriam's pubchem.compound or to n2t's pubchem.compound), while
- other mappings are generic for the entire database (e.g., pubchem.compound's mapping to fairsharing's FAIRsharing.qt3w7z).
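To make the distinction concrete, here is a minimal sketch of splitting a record's fields into the two groups. The grouping only restates the examples listed above; the names `SPECIFIC_FIELDS`, `SHARED_FIELDS`, and `split_record`, as well as the toy record, are hypothetical and not part of Bioregistry.

```python
# Field names come from the examples above; the grouping itself is an assumption.
SPECIFIC_FIELDS = {"uri_format", "example", "pattern"}
SHARED_FIELDS = {"publications", "homepage", "contact"}


def split_record(record):
    """Split a prefix record into (specific, shared, unclassified) dicts."""
    specific = {k: v for k, v in record.items() if k in SPECIFIC_FIELDS}
    shared = {k: v for k, v in record.items() if k in SHARED_FIELDS}
    other = {k: v for k, v in record.items()
             if k not in SPECIFIC_FIELDS | SHARED_FIELDS}
    return specific, shared, other


# Toy record with placeholder values, not real registry content
toy_record = {
    "example": "100101",
    "pattern": "^\\d+$",
    "homepage": "https://example.org",
    "publications": [{"pubmed": "0000000"}],
    "name": "Toy Compound",
}
specific, shared, other = split_record(toy_record)
print(sorted(specific), sorted(shared), sorted(other))
```

Fields that fall into neither group (here `name`) are returned separately, since whether they are shared or prefix-specific is exactly the kind of decision a policy would need to settle.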

Currently, the data that is generic for the entire database is not propagated in a predictable way into semantic space-specific prefixes. The fact that some data is surfaced via prioritized mappings makes the situation more complicated. For example, we have

- pubchem.compound listed with 8 publications in total
- pubchem.substance listed with only 1 of those 8 publications

To further complicate things, 3 of the 8 papers attached to pubchem.compound are specifically about PubChem's BioAssay subset, for which there is a dedicated prefix at pubchem.bioassay, but that record again only refers to the same 1 publication that pubchem.substance does.
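A discrepancy like this can be spotted by comparing the publication sets of two sibling prefixes. The sketch below uses a toy registry with placeholder prefixes and identifiers (not the real pubchem records) that mimics the 8-versus-1 situation; the helper name `pub_keys` is hypothetical, and the assumption that a publication is identified by its first available `pubmed`, `doi`, or `url` value is a simplification.

```python
def pub_keys(record, key_priority=("pubmed", "doi", "url")):
    """Return one (key type, value) identifier per publication in a record."""
    keys = set()
    for pub in record.get("publications", []):
        for key in key_priority:
            if key in pub:
                keys.add((key, pub[key]))
                break
    return keys


# Toy data: placeholder identifiers, not real PubMed IDs
toy_registry = {
    "toydb.compound": {"publications": [{"pubmed": f"PMID-{i}"} for i in range(8)]},
    "toydb.substance": {"publications": [{"pubmed": "PMID-0"}]},
}

compound_pubs = pub_keys(toy_registry["toydb.compound"])
substance_pubs = pub_keys(toy_registry["toydb.substance"])
missing_from_substance = compound_pubs - substance_pubs
print(f"{len(missing_from_substance)} publications are listed only on toydb.compound")
```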

So the main question of this issue is: should this type of data be standardized based on its shared vs subspace-specific status? If so, what should be the general policy for this?

This is relevant for, e.g., #1214 and #1204, as well as many existing records.


cthoyt · 2024-10-24T07:20:59Z

See the addition to the curation guidelines in #1217.

bgyori · 2024-10-24T15:17:43Z

Thanks @cthoyt! In addition to thinking about this problem (which is more general than just publications) for new prefixes, I think there are important implications for existing prefixes, since this problem is pretty pervasive. I was curious to see empirically what the situation looks like, so I wrote a bit of code to check, using the exported registry so that the effects of external mappings are accounted for:

from collections import defaultdict

import matplotlib.pyplot as plt
import requests

res = requests.get('https://raw.githubusercontent.com/biopragmatics/bioregistry/'
                   'refs/heads/main/exports/registry/registry.json')
res.raise_for_status()  # fail early if the export could not be downloaded
registry = res.json()

# Organizing prefixes by root
prefixes_by_root = defaultdict(list)
for prefix, data in registry.items():
    prefix_parts = prefix.split('.')
    prefix_root = prefix_parts[0] if len(prefix_parts) == 1 else '.'.join(prefix_parts[:-1])
    prefixes_by_root[prefix_root].append(prefix)

# Quantifying families with shared root
families = {root: prefixes for root, prefixes in prefixes_by_root.items()
            if len(prefixes) > 1}
nprefixes = len(registry)
nfamilies = len(families)
nprefixes_in_families = sum(len(prefixes) for prefixes in families.values())
print(f'Total prefixes: {nprefixes}, out of which {nprefixes_in_families} '
      f'prefixes are in a total of {nfamilies} families.')

# A family is "rooted" if the root itself is also a registered prefix
nrooted = sum(1 for root, prefixes in families.items() if root in prefixes)
print(f'Out of {nfamilies} families, {nrooted} have a root, '
      f'the remaining {nfamilies - nrooted} are rootless.')

# Analyzing publications

pub_prevalences = []

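key_priority = ['pubmed', 'doi', 'url']


def get_pub_key(pub):
    """Return the first available identifier for a publication."""
    for key in key_priority:
        if key in pub:
            return pub[key]
    # Publications with none of these keys are all grouped under None
    return None
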

for root, prefixes in families.items():
    prefixes_by_pub = defaultdict(list)
    for prefix in prefixes:
        for publication in registry[prefix].get('publications', []):
            prefixes_by_pub[get_pub_key(publication)].append(prefix)
    for pub, prefixes_for_pub in prefixes_by_pub.items():
        pub_prevalence = len(prefixes_for_pub) / len(prefixes)
        pub_prevalences.append(pub_prevalence)

_ = plt.boxplot(pub_prevalences)
plt.show()  # needed when running as a script rather than in a notebook

which produces

Total prefixes: 1802, out of which 505 prefixes are in a total of 163 families.
Out of 163 families, 67 have a root, the remaining 96 are rootless.

This means that a really large proportion (~28%) of prefixes are in a "family" of this type. Though these choices are often well justified based on the primary focus of a database, there is no clear pattern in terms of which families have a "root" prefix (e.g., we have pfam and pfam.clan in the pfam family, but we don't have a pubchem root). Per the box plot above, publications are curated in a pretty ad-hoc way for existing prefixes in a family. It's likely that a small number of these publications are prefix-specific, but most of them should be shared across the family.

I wonder if it would be worth trying to standardize some of this automatically (with post-hoc quality control/curation) for publications - and potentially other metadata in a similar status.

cthoyt · 2024-10-24T15:40:35Z

See also the part_of field, described at https://biopragmatics.github.io/bioregistry/datamodel/#part-of where the instances you described plus additional ones have been curated. You can get a quick overview on https://bioregistry.io/highlights/relations

More generally, I've considered how to share information between related resources (not just publications, but also homepage, contact person, etc.). I'm open to suggestions, but having a curation script that copies publications is a quick and dirty solution.
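As a rough illustration of what such a script could look like, here is a minimal sketch that pools publications across all prefixes sharing the same part_of parent. It is not an existing Bioregistry script: the helper names `pub_key` and `propagate_publications` are hypothetical, the toy registry uses placeholder prefixes and identifiers, and it assumes each record stores its parent prefix under a `part_of` key.

```python
from collections import defaultdict


def pub_key(pub):
    """Identify a publication by its first available identifier (assumption)."""
    for key in ("pubmed", "doi", "url"):
        if key in pub:
            return (key, pub[key])
    return None


def propagate_publications(registry):
    """Return {prefix: pooled publications} for prefixes grouped via part_of."""
    families = defaultdict(list)
    for prefix, record in registry.items():
        parent = record.get("part_of")
        if parent is not None:
            families[parent].append(prefix)

    merged = {}
    for parent, members in families.items():
        group = list(members)
        if parent in registry:  # include the root prefix if it is registered
            group.append(parent)
        pooled = {}
        for prefix in group:
            for pub in registry[prefix].get("publications", []):
                key = pub_key(pub)
                if key is not None:
                    pooled.setdefault(key, pub)  # keep the first copy seen
        for prefix in group:
            merged[prefix] = list(pooled.values())
    return merged


# Toy data with placeholder identifiers, not real registry content
toy_registry = {
    "db": {"publications": [{"pubmed": "1"}]},
    "db.a": {"part_of": "db", "publications": [{"pubmed": "1"}, {"doi": "10.0000/x"}]},
    "db.b": {"part_of": "db"},
    "other": {"publications": [{"pubmed": "9"}]},
}
merged = propagate_publications(toy_registry)
print({prefix: len(pubs) for prefix, pubs in merged.items()})
```

The function returns the pooled lists rather than modifying the registry in place, so the result can be reviewed before anything is written back. That matters here because, as noted earlier in the thread, some publications are specific to a single prefix and should not be copied to the whole family.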

bgyori mentioned this issue Oct 25, 2024

Add part_of for discovered families of prefixes #1232

Open

bgyori mentioned this issue Nov 2, 2024

Extend iedb prefix to multiple semantic subspaces #1238

Merged
