During the hackathon on Mapping the Impact of Research Software in Science, we worked on harmonizing the data models of three software citation datasets: SoMeSci, Softcite, and RRID. The goal is to lay the groundwork for a gold dataset that can support the automated extraction of software citations from scientific literature.
Feature | SoMeSci | Softcite | RRID |
---|---|---|---|
Description | A 5 Star Open Data Gold Standard Knowledge Graph of software mentions in scientific articles. | A gold dataset of software mentions in biomedical and economic research publications. | A portal for obtaining and exploring Research Resource Identifiers (RRIDs) for referencing research resources. |
Data Model Overview | Contains annotations with relation labels for additional information such as version, developer, URL, or citations. Distinguishes between different types like applications, plugins, or programming environments, and different types of mentions like usage or creation. | Contains metadata of annotated research publications and software mentions identified in these publications. Further annotated with details about the software including software version, publisher, and access URL. | Uses RRIDs - persistent and unique identifiers for referencing a research resource. Identifiers are prefixed with "RRID:" followed by a tag indicating the source authority. |
Number of Software Mentions | 3,756 software mentions in 1,367 PubMed Central articles. | 5,134 unique software mentions in 4,971 research publications (v2.0, 2023). | 78,140 software mentions. |
Domain | Life sciences | Life sciences and social sciences. | Biomedical literature and other domains referencing the generation or use of research resources. |
Usage | Provides training samples for Named Entity Recognition, Relation Extraction, Entity Disambiguation, and Entity Linking. | Designed for supervised learning based scholarly text mining, software entity recognition in text, and investigating how software has been used for research. | Promotes research resource identification, discovery, and reuse. Facilitates citation of resources in biomedical literature and other places that reference their generation or use. |
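The three data models in the table overlap on a small core: a software name plus optional version, developer or publisher, and URL. The following is a minimal sketch of what a harmonized record could look like, not the schema adopted by the project. The class `SoftwareMention`, its field names, and the helper `parse_rrid` are hypothetical. The parser assumes only the syntax described above (an "RRID:" prefix followed by a source authority tag), plus the assumption that an underscore separates the tag from the local identifier. The identifier in the usage example is made up to illustrate the syntax.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed syntax: "RRID:" + authority tag + "_" + local identifier.
RRID_PATTERN = re.compile(r"^RRID:\s*([A-Za-z]+)_([A-Za-z0-9_:-]+)$")


@dataclass
class SoftwareMention:
    """Hypothetical harmonized record; field names are illustrative only."""

    source: str                      # "SoMeSci", "Softcite", or "RRID"
    name: str                        # software name as mentioned in the text
    version: Optional[str] = None
    developer: Optional[str] = None  # "publisher" in Softcite terms
    url: Optional[str] = None
    rrid: Optional[str] = None       # only available for RRID-sourced records


def parse_rrid(identifier: str) -> Tuple[str, str]:
    """Split an RRID into (authority tag, local identifier)."""
    match = RRID_PATTERN.match(identifier.strip())
    if match is None:
        raise ValueError(f"Not a recognizable RRID: {identifier!r}")
    return match.group(1), match.group(2)


# Usage with a made-up identifier:
authority, local_id = parse_rrid("RRID:SCR_012345")
mention = SoftwareMention(source="RRID", name="ExampleTool", rrid="RRID:SCR_012345")
```

Keeping every field except `source` and `name` optional reflects that the three datasets do not annotate the same details for every mention.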
- openaccess_rrid_links.txt: Links to open access publications with software mentions, collected from the RRID repository.
- Registry of RRID mentions. DOI: 10.5281/zenodo.10048228
- openaccess_selection.py: Script to select open access publications (and links to the RRID annotations) from the Registry. It uses the PubMed API to detect open access publications.
The collection of links will be used in the future to extract the sentences that contain the software mentions.
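The selection step performed by openaccess_selection.py can be sketched roughly as follows. This is not the script itself: the function `select_open_access`, the column names `pmid` and `annotation_url`, and the `is_open_access` callback are assumptions made for illustration. The callback stands in for the PubMed API lookup, whose exact request is not reproduced here.

```python
from typing import Callable, Dict, Iterable, List


def select_open_access(
    registry_rows: Iterable[Dict[str, str]],
    is_open_access: Callable[[str], bool],
) -> List[str]:
    """Return annotation links for registry rows whose publication is open access.

    `registry_rows` are rows of the registry with hypothetical keys
    "pmid" and "annotation_url". `is_open_access` is a placeholder for
    the API lookup and is called at most once per PMID.
    """
    cache: Dict[str, bool] = {}
    links: List[str] = []
    for row in registry_rows:
        pmid = row["pmid"]
        if pmid not in cache:
            # One publication can carry several mentions, so cache the lookup.
            cache[pmid] = is_open_access(pmid)
        if cache[pmid]:
            links.append(row["annotation_url"])
    return links


# Usage with made-up rows and a stand-in lookup:
rows = [
    {"pmid": "1", "annotation_url": "https://example.org/a"},
    {"pmid": "2", "annotation_url": "https://example.org/b"},
    {"pmid": "1", "annotation_url": "https://example.org/c"},
]
open_links = select_open_access(rows, lambda pmid: pmid == "1")
```

Passing the lookup in as a function keeps the filtering logic testable without network access and makes it easy to swap in a different open access check.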
This repository was developed as part of the Mapping the Impact of Research Software in Science hackathon hosted by the Chan Zuckerberg Initiative (CZI). By participating in this hackathon, owners of this repository acknowledge the following:
- The code for this project is hosted by the project contributors in a repository created from a template generated by CZI. The purpose of this template is to help ensure that repositories adhere to the hackathon’s project naming conventions and licensing recommendations. CZI does not claim any ownership or intellectual property on the outputs of the hackathon. This repository allows the contributing teams to maintain ownership of code after the project, and indicates that the code produced is not a CZI product, and CZI does not assume responsibility for assuring the legality, usability, safety, or security of the code produced.
- This project is published under an MIT license.
Contributions to this project are subject to CZI’s Contributor Covenant code of conduct. By participating, contributors are expected to uphold this code of conduct.
If you believe you have found a security issue, please disclose it responsibly by contacting the repository owner via the ‘security’ tab above.
Anita Bandrowski, Esteban Gonzalez, Tom Honeyman, James Howison, Anne L'Hôte, Arcangelo Massari, David Schindler